EP2215632B1 - Method, device and computer program code means for voice conversion

Info

Publication number: EP2215632B1
Application number: EP08804436A
Authority: EP (European Patent Office)
Prior art keywords: glottal, parameters, vocal tract, converted, lsf
Legal status: Not-in-force
Other languages: German (de), French (fr)
Other versions: EP2215632A1 (en)
Inventor: María Arantzazu DEL POZO ECHEZARRETA
Current Assignee: Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech
Original Assignee: Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech
Application filed by Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316: Speech enhancement by changing the amplitude
    • G10L21/0364: Speech enhancement by changing the amplitude for improving intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, characterised by the process used
    • G10L21/013: Adapting to target pitch
    • G10L2021/0135: Voice conversion or morphing
    • G10L21/04: Time compression or expansion
    • G10L21/057: Time compression or expansion for improving intelligibility
    • G10L2021/0575: Aids for the handicapped in speaking

Definitions

  • The term “approximately” and terms of its family should be understood as indicating values or forms very near to those which accompany the aforementioned term. That is to say, a deviation within reasonable limits from an exact value or form should be accepted, because the person skilled in the art will understand that such a deviation from the values or forms indicated is inevitable due to measurement inaccuracies, etc. The same applies to the term “nearly”.
  • Pitch period means a segment of a speech waveform which comprises one period of the fundamental frequency.
  • Frame means a segment of a speech waveform, which corresponds to a pitch period in voiced parts and to a fixed amount of time in unvoiced parts. In a preferred embodiment of the present invention, which should not be interpreted as a limitation, a frame corresponds to 10 ms in unvoiced parts.
  • Source data refers to a collection of speech waveforms uttered by a source speaker.
  • Target data refers to a collection of speech waveforms uttered by a target speaker.
  • Parallel source and target data refers to a collection of speech waveforms uttered both by the source and the target speakers.
  • Figure 2 shows a schematic diagram of the Joint Estimation Analysis Synthesis (JEAS) model. It is based on a general Source-Filter representation. It employs white Gaussian and amplitude-modulated white Gaussian noise to model the Turbulence and Aspiration Noise components respectively, a digital differentiator for Lip Radiation and an all-pole filter to represent the Vocal Tract. In addition, the Liljencrants-Fant (LF) model is adopted to better capture the characteristics of the derivative glottal wave. Then, in order to estimate the different model component parameterisations from the speech wave, a joint voice source and vocal tract parameter estimation technique based on Convex Optimization is applied.
  • The present method adopts the well-known LF model, which is a four-parameter time-domain model of one cycle of the derivative glottal waveform.
  • Typical LF pulses corresponding to glottal and derivative glottal waves are shown in Figure 5 .
  • $g(n) = \begin{cases} E_0\, e^{\alpha n} \sin(\omega_g n), & 0 \le n < T_e \\ -\frac{E_e}{\varepsilon T_a} \left( e^{-\varepsilon (n - T_e)} - e^{-\varepsilon (T_c - T_e)} \right), & T_e \le n \le T_c \end{cases}$
  • The model consists of two segments: the first one characterises the derivative glottal waveform from the instant of glottal opening to the instant of main excitation T_e, where the amplitude reaches the maximum negative value -E_e.
  • E_0 is a scaling factor used to ensure that the signal has a zero mean.
  • E_e is closely related to the strength of the source excitation and is the main determinant of the intensity of the speech signal. Its variation affects the overall harmonic amplitudes, except the very lowest components, which are more determined by the shape of the pulse.
  • The second segment models the closing or return phase from the main excitation T_e to the instant of full closure T_c using an exponential function.
  • The duration of the return phase is thus determined by T_c - T_e.
  • The main parameter characterising this segment is T_a, which represents the "effective duration" of the return phase. This is defined by the duration from T_e to the point where a tangent fitted at the start of the return phase crosses zero.
  • T_0 corresponds to the fundamental period.
  • T_c is made to coincide with the opening of the following pulse. This fact might suggest that the model does not account for the closed phase of the glottal waveform. However, for reasonably small values of T_a, the exponential function fits closely to the zero line, providing a closed phase without the need for additional control parameters.
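  • As an illustration of the two-segment formulation above, the following Python sketch synthesises one LF derivative glottal pulse from the T-parameters. The implicit LF constants (the return-phase constant ε and the opening-phase growth rate α) are not stated explicitly in the text, so they are found here numerically from the standard LF conditions (amplitude -E_e at T_e and zero net pulse area); the parameter values and root brackets are illustrative assumptions, not values from the patent.

```python
import numpy as np
from scipy.optimize import brentq

def lf_pulse(Ee, Tp, Te, Ta, Tc, fs=16000):
    """One cycle of the LF derivative glottal waveform from T-parameters (seconds)."""
    wg = np.pi / Tp  # angular frequency of the opening branch (glottal formant)

    # Return-phase constant: eps * Ta = 1 - exp(-eps * (Tc - Te))
    eps = brentq(lambda e: e * Ta - 1.0 + np.exp(-e * (Tc - Te)), 1e-9, 10.0 / Ta)

    t1 = np.arange(0.0, Te, 1.0 / fs)   # open phase
    t2 = np.arange(Te, Tc, 1.0 / fs)    # return phase

    def segments(alpha):
        E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))  # forces g(Te) = -Ee
        g1 = E0 * np.exp(alpha * t1) * np.sin(wg * t1)
        g2 = -(Ee / (eps * Ta)) * (np.exp(-eps * (t2 - Te)) - np.exp(-eps * (Tc - Te)))
        return g1, g2

    # Growth rate chosen so the pulse integrates to (approximately) zero,
    # i.e. E0 acts as the zero-mean scaling factor; the bracket is heuristic.
    alpha = brentq(lambda a: np.sum(np.concatenate(segments(a))) / fs, -2000.0, 2000.0)
    return np.concatenate(segments(alpha))

# Hypothetical T-parameters for a modal voice with T0 = 8 ms:
pulse = lf_pulse(Ee=1.0, Tp=0.0033, Te=0.0045, Ta=0.0003, Tc=0.008)
```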
  • The T-parameters are often converted into R-parameters (R_g, R_k, R_a), which are normalised with respect to T_0 and correlate with the most salient glottal phenomena, i.e. the glottal pulse width and the skewness and abruptness of closure:
  • $R_g = T_0 / (2 T_p)$;
  • $R_k = (T_e - T_p) / T_p$;
  • $R_a = T_a / T_0$
  • R_g is a normalised version of the glottal formant frequency F_g, which is defined as the inverse of twice the duration of the opening phase T_p.
  • R_k is the LF parameter which captures glottal asymmetry. It is defined as the ratio between the durations of the closing and opening branches of the glottal pulse, and the larger its value, the more symmetrical the pulse is.
  • The open quotient OQ, i.e. the fraction of the pitch period in which the glottis is open, is positively correlated with R_k and negatively correlated with R_g.
  • The R_a parameter corresponds to the effective "return time" T_a normalised by the fundamental period and captures differences relating to the spectral tilt.
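  • The conversion between T-parameters and R-parameters used during training and synthesis follows directly from the formulas above; a minimal sketch:

```python
def t_to_r(T0, Tp, Te, Ta):
    """LF T-parameters (seconds) to normalised R-parameters."""
    Rg = T0 / (2.0 * Tp)   # normalised glottal formant frequency
    Rk = (Te - Tp) / Tp    # pulse asymmetry (closing/opening branch ratio)
    Ra = Ta / T0           # normalised effective return time (spectral tilt)
    return Rg, Rk, Ra

def r_to_t(T0, Rg, Rk, Ra):
    """Inverse mapping, applied to interpolated R-parameters before synthesis."""
    Tp = T0 / (2.0 * Rg)
    Te = Tp * (1.0 + Rk)
    Ta = Ra * T0
    return Tp, Te, Ta
```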
  • The aim of Source-Filter deconvolution is to obtain estimates of the glottal source and vocal tract filter components from the speech wave.
  • Traditionally, Inverse Filtering (IF) has been the most commonly employed deconvolution method. It is based on calculating a vocal tract filter transfer function, whose inverse is used to obtain a glottal waveform estimate which can then be parameterised.
  • A different approach involves modelling both the glottal source and the vocal tract filter, and developing techniques to jointly estimate the source and tract model parameters from the speech wave.
  • Joint Estimation methods are fully automatic. This is an important condition that a mathematical model aimed at analysis, synthesis and modification of the speech signal should meet. Due to the characteristics of the mathematical voice source and vocal tract descriptions, such an approach is a complex nonlinear problem. For this reason, LP has been more widely deployed as a simpler method to obtain a direct and efficient source-filter parameterisation of the speech signal. Its poor modelling of the voice source has not limited its application in speech coding, where the speech spectrum must be represented efficiently with a small number of parameters; however, it has prevented its use in speech synthesis and transformation applications. Advances in voice conversion and Hidden Markov Model (HMM) speech synthesis in the last few years have emphasized the importance of refined vocoding and, thus, the problem of automatic joint estimation of voice source and vocal tract filter parameters has gained renewed interest.
  • The method employed to obtain the JEAS voice source and vocal tract model parameters from the speech wave follows the second deconvolution approach and is based on the joint estimation of the vocal tract filter and the glottal waveform proposed by Lu and Smith (Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999).
  • Voiced and unvoiced speech segments are processed differently due to their different source characteristics. While the voice source in voiced speech is represented by a combination of the LF and aspiration noise models, white Gaussian noise is used to excite the vocal tract filter in unvoiced frames (see Figure 2). Their different modelling requires a preprocessing step where the voiced and unvoiced speech sections are determined and the glottal closure instants (GCI) of the voiced segments are estimated. Then, the voice source and vocal tract parameters are obtained through joint source-filter estimation and LF re-parameterisation in voiced sections (V), and through standard autocorrelation LP and Gaussian noise energy matching in unvoiced portions (U).
  • An algorithm such as the well-known Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) is used for GCI estimation. It employs the group-delay function in combination with a phase-slope projection method to determine GCI candidates, plus N-best dynamic programming to select the most likely candidates according to a cost function which takes waveform similarity, pitch deviation, normalised energy and deviation from the ideal phase-slope into account.
  • The voicing decision is made based on energy, zero-crossing and GCI information, as in the sketch below. Voiced segments are then processed pitch-synchronously, while unvoiced frames are periodically extracted; in a particular embodiment, they are extracted every 10 ms.
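  • A toy version of such a voicing decision is sketched below; the thresholds are illustrative assumptions, not values from the patent.

```python
import numpy as np

def is_voiced(frame, has_gci, energy_thr=1e-4, zcr_thr=0.25):
    """Voiced/unvoiced decision from frame energy, zero-crossing rate and
    whether DYPSA found glottal closure instants inside the frame."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    return bool(has_gci and energy > energy_thr and zcr < zcr_thr)
```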
  • The method employed by the invention to obtain the JEAS voice source and vocal tract model parameters involves using a voice source model simple enough to allow the source-filter deconvolution to be formulated as a Convex Optimization problem. Then, the derivative glottal waveform obtained by inverse filtering (IF) with the estimated filter coefficients is re-parameterised by LF model fitting.
  • The success of the technique lies in providing a derivative glottal waveform constraint when estimating the vocal tract filter. Because of this, the resulting IF derivative glottal waveform is closer to the true glottal excitation and its fitting to an LF model is less error prone.
  • The joint estimation algorithm models the voice source using the well-known Rosenberg-Klatt (RK) model, which consists of a basic voicing waveform describing the shape of the derivative glottal wave and a low-pass filter, $1/(1 - \mu z^{-1})$ with $\mu > 0$, as shown in Figure 6.
  • OQ denotes the open quotient, i.e. the fraction of the pitch period in which the glottis is open.
  • Source-filter deconvolution via convex optimization is accomplished by minimising the squared error between the modelled and the true derivative glottal waveforms.
  • The derived quadratic program can be solved using a number of existing iterative numerical algorithms.
  • In a particular embodiment, the quadratic programming function of the MATLAB Optimization Toolbox has been employed.
  • The result of the minimization problem is the simultaneous estimation of the RK model parameters a and b and the all-pole filter coefficients α_k.
  • Figure 7 shows a joint estimation example for one pitch period.
  • The described joint estimation process assumes that the closed and open phases are defined, while in practice the parameter n_c which delimits the end of the closed phase and the beginning of the open phase is unknown. Its optimal value is found by uniformly sampling the possible n_c values (empirically shown to vary from 0% to 60% of the pitch period T_0), solving the quadratic problem at each sampled n_c value and choosing the estimate resulting in minimum error.
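  • The following Python sketch conveys the structure of the joint estimation. Because the RK derivative glottal wave is linear in its parameters (here taken as g'(n) = 2an - 3bn² over the open phase, an assumption consistent with the Klatt formulation), source and filter can be fitted simultaneously; an unconstrained least-squares solve stands in for the full quadratic program, and the grid search over n_c mirrors the procedure described above.

```python
import numpy as np

def joint_estimate(frame, p=18, fs=16000):
    """Jointly fit all-pole filter coefficients and RK source parameters (a, b)
    to one pre-emphasised pitch period (assumed to span closure to closure).

    The prediction error s(n) + sum_k alpha_k * s(n-k) is constrained to match
    the RK derivative glottal wave: 0 during the closed phase, 2*a*t - 3*b*t^2
    during the open phase. Everything is linear in (alpha, a, b).
    """
    N = len(frame)
    # Delay matrix: column k-1 holds s(n - k)
    S = np.column_stack([np.concatenate([np.zeros(k), frame[:N - k]])
                         for k in range(1, p + 1)])
    best = None
    for nc in range(0, int(0.6 * N), max(1, N // 50)):  # open-phase onset grid
        t = (np.arange(N) - nc) / fs
        open_phase = np.arange(N) >= nc
        rk = np.column_stack([2 * t * open_phase, -3 * t**2 * open_phase])
        A = np.hstack([-S, rk])     # model: s(n) = -sum alpha_k s(n-k) + g'(n)
        x = np.linalg.lstsq(A, frame, rcond=None)[0]
        err = np.sum((A @ x - frame) ** 2)
        if best is None or err < best[0]:
            best = (err, x, nc)
    err, x, nc = best
    return x[:p], x[p], x[p + 1], nc   # alphas, a, b, closed/open boundary
```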
  • The basic RK voicing waveform of equation (7) does not explicitly model the return phase of the derivative glottal waveform and changes abruptly at the glottal closure instants. For this reason, a low-pass filter is added to the basic model, with the purpose of reducing the abruptness of glottal closure.
  • The filter coefficient μ is responsible for controlling the tilt of the source spectrum.
  • In the original formulation, the spectral tilt filter is separated from the source model and incorporated into the vocal tract model by adding an extra pole to the all-pole filter, as shown in Figure 9.
  • As a consequence, the vocal tract filter coefficients estimated using this formulation also encode the spectral slope information of the voice source.
  • Moreover, the derivative glottal waveforms obtained using this approach fail to adequately capture the variations in the return phase of the glottal source.
  • Instead, the present invention uses adaptive pre-emphasis to estimate and remove the spectral tilt filter contribution from the speech wave before convex optimization.
  • Order-one LP analysis and inverse filtering are applied to estimate and remove the spectral slope from the speech frames under analysis, as sketched below.
  • The effect of adaptive pre-emphasis is illustrated in Figure 10: a) speech spectrum and estimated spectral envelope, b) IF derivative glottal wave and fitted LF waveform, c) IF derivative glottal wave spectrum and fitted LF wave spectrum.
  • The vocal tract filter envelope estimates obtained this way do not encode source spectral tilt characteristics, which are reflected in the closing phase of the resulting derivative glottal waveforms instead. This improves the fitting of the return phase of the LF model and, thus, of the high frequencies of the glottal source.
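  • A minimal sketch of this adaptive pre-emphasis step (the order-1 predictor coefficient is the normalised lag-1 autocorrelation):

```python
import numpy as np
from scipy.signal import lfilter

def adaptive_preemphasis(frame):
    """Estimate the spectral tilt with order-1 LP and remove it by inverse
    filtering with (1 - mu * z^-1), flattening the source spectral slope."""
    r0 = np.dot(frame, frame)
    r1 = np.dot(frame[1:], frame[:-1])
    mu = r1 / r0 if r0 > 0 else 0.0
    return lfilter([1.0, -mu], [1.0], frame), mu
```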
  • The LF model is capable of describing the glottal derivative waveform more accurately than the RK model.
  • However, its more complex nonlinear formulation fails to fulfil the convexity condition, which prevents its use in the joint voice source and vocal tract filter parameter estimation algorithm.
  • Therefore, the RK model is employed during source-filter deconvolution, and the LF model is then used to re-parameterise the derivative glottal wave obtained by inverse filtering the speech waveform with the jointly estimated filter coefficients.
  • LF model fitting is carried out in two steps. First, initial estimates of the LF T-parameters (T_p, T_e, T_a, T_c) and the glottal excitation strength E_e are obtained from the time-domain IF voice source waveform by conventional direct estimation methods. Then, their values are refined using a conventional constrained nonlinear optimization technique.
  • Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
  • Wavelet denoising is used to extract the glottal aspiration noise from the IF derivative glottal wave estimate.
  • The wavelet denoising technique used is Wavelet Packet Analysis, which has been found to obtain more reliable aspiration noise estimates than other techniques employed to identify and separate the periodic and aperiodic components of quasi-periodic signals, such as frequency transform analysis or periodic prediction.
  • Wavelet Packet Analysis is preferably performed at level 4 with the 7th-order Daubechies wavelet, using soft thresholding and the Stein Unbiased Risk Estimate threshold evaluation criterion.
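  • A sketch of this extraction step using PyWavelets is shown below; for brevity a robust universal threshold replaces the Stein Unbiased Risk Estimate criterion preferred in the text, so the thresholding rule is an assumption.

```python
import numpy as np
import pywt  # PyWavelets

def aspiration_noise_estimate(dgw):
    """Split an IF derivative glottal wave into a denoised quasi-periodic part
    and an aspiration noise estimate (level-4 wavelet packets, Daubechies-7,
    soft thresholding)."""
    wp = pywt.WaveletPacket(dgw, wavelet='db7', mode='symmetric', maxlevel=4)
    for node in wp.get_level(4, order='natural'):
        sigma = np.median(np.abs(node.data)) / 0.6745   # robust noise scale
        thr = sigma * np.sqrt(2.0 * np.log(max(len(node.data), 2)))
        node.data = pywt.threshold(node.data, thr, mode='soft')
    denoised = wp.reconstruct(update=False)[:len(dgw)]
    return denoised, dgw - denoised                     # (periodic, noise)
```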
  • Figure 12 shows a typical denoising result: a) original and denoised IF derivative glottal wave, b) noise estimate.
  • The aspiration noise estimate obtained for a particular pitch period during JEAS analysis is parameterised as follows. First, zero-mean unit-variance Gaussian noise is modulated with the already fitted LF waveform for that pitch period. Then, its energy is adjusted to match the energy (ANE) of the aspiration noise estimate. Because a spectral shaping filter has informally been found not to make a perceptual difference, it is not included in the parameterisation.
  • Figure 15 depicts a diagram of the employed aspiration noise modelling approach.
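  • A minimal sketch of this parameterisation, assuming the modulation uses the magnitude of the fitted LF pulse (an implementation choice not fixed by the text):

```python
import numpy as np

def model_aspiration_noise(lf_wave, ane, rng=np.random.default_rng(0)):
    """Modulate zero-mean unit-variance Gaussian noise with the fitted LF
    waveform of the pitch period, then scale it so its energy equals ANE."""
    modulated = rng.standard_normal(len(lf_wave)) * np.abs(lf_wave)
    e = np.dot(modulated, modulated)
    return modulated * np.sqrt(ane / e) if e > 0 else modulated
```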
  • At synthesis time, the source parameters (R_g, R_k, R_a, ANE) are used to generate smoothed LF derivative glottal waveforms lf(n) and amplitude-modulated aspiration noise estimates an(n), which form the filter excitation e(n) for resynthesis.
  • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
  • Both pitch and time-scale transformations are based on a parameter trajectory interpolation approach, where the first task involves calculating the number of frames in a particular segment required to achieve the desired modifications. Once the modified number of frames has been calculated, frame size contours, excitation and vocal tract parameter trajectories are resampled at the modified number of frames using, for example, cubic spline interpolation. Because JEAS modelling is pitch-synchronous, the frame sizes correspond to the pitch periods in voiced segments, while they are fixed in unvoiced segments. Due to their better interpolation characteristics, LSF coefficients and R-parameters are employed during pitch and time-scale transformations to represent the vocal tract and glottal source respectively, in addition to the aspiration (ANE) and Gaussian (GNE) noise energies.
  • Pitch can be altered by simply multiplying the fundamental period contour by a scaling factor.
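  • Both operations reduce to a few lines; the sketch below resamples any per-frame parameter track (R-parameters, noise energies or LSF vectors, one row per frame) to a modified number of frames, and scales the fundamental period contour:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_trajectory(values, n_frames_out):
    """Cubic-spline resampling of a per-frame parameter trajectory; `values`
    may be 1-D (scalar track) or 2-D (one parameter vector per frame)."""
    values = np.asarray(values, dtype=float)
    x = np.linspace(0.0, 1.0, len(values))
    return CubicSpline(x, values)(np.linspace(0.0, 1.0, n_frames_out))

def scale_pitch(t0_contour, factor):
    """Scale the fundamental period contour; e.g. factor 0.5 doubles F0."""
    return np.asarray(t0_contour, dtype=float) * factor
```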
  • A key advantage of JEAS modelling, i.e. of the parameterisation of the voice source, is that the source characteristics can also be transformed to match the target. This avoids the need for conventional residual prediction methods.
  • Furthermore, because the JEAS parameterisation does not involve a magnitude and phase division of the spectrum, the artifacts due to converted magnitude and phase mismatches are not produced and, thus, the use of additional techniques, such as phase prediction, is not required.
  • The jointly estimated JEAS all-pole vocal tract filter coefficients {α_1 ... α_p} are converted to Bark-scaled LSF parameters for the transformation of the JEAS spectral envelopes.
  • First, the linear frequency response of the jointly estimated vocal tract filter is calculated. This is resampled according to the Bark scale using, for example, the well-known cubic spline interpolation technique.
  • The warped all-pole filter coefficients are then computed by applying, for example, the conventional Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum. Then, the filter coefficients are transformed into LSFs for conversion.
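  • The chain from jointly estimated filter coefficients to Bark-scaled LSFs can be sketched as follows. The Bark warping function (6·arcsinh(f/600)), the FFT sizes and the use of scipy's Toeplitz solver in place of an explicit Levinson-Durbin recursion are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def bark(f):
    return 6.0 * np.arcsinh(f / 600.0)   # one common Bark approximation

def poly2lsf(a):
    """Line spectral frequencies (radians) of a = [1, a1, ..., ap]."""
    az = np.concatenate([a, [0.0]])
    P, Q = az + az[::-1], az - az[::-1]   # symmetric / antisymmetric polynomials
    ang = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])  # drop trivial roots

def bark_warped_lsf(alphas, p_out=30, fs=16000, nfft=512):
    """Resample the vocal tract response on the Bark scale, re-fit an all-pole
    model to the warped power spectrum and convert it to LSFs."""
    f = np.linspace(0.0, fs / 2.0, nfft)
    _, H = freqz([1.0], np.concatenate([[1.0], alphas]), worN=f, fs=fs)
    b_uniform = np.linspace(0.0, bark(fs / 2.0), nfft)   # uniform Bark grid
    f_warp = 600.0 * np.sinh(b_uniform / 6.0)            # back to Hz for interp
    power = np.interp(f_warp, f, np.abs(H) ** 2)
    r = np.fft.irfft(power)[:p_out + 1]                  # autocorrelation sequence
    a = solve_toeplitz((r[:p_out], r[:p_out]), -r[1:p_out + 1])
    return poly2lsf(np.concatenate([[1.0], a]))
```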
  • A continuous probabilistic linear transformation function is employed to convert the LSF spectral envelopes.
  • Gaussian Mixture Models are used to describe the source and target glottal feature vector spaces, classify them into M classes and train class-specific linear transformations.
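  • The transformation itself, applicable to vocal tract LSF vectors and glottal vectors alike, is a posterior-weighted sum of class-specific affine maps. A minimal sketch, assuming the per-class matrices A_m and offsets b_m have already been estimated by least squares from aligned source/target pairs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cplt_convert(x, gmm: GaussianMixture, A, b):
    """Continuous probabilistic linear transformation of one feature vector:
    F(x) = sum_m P(m|x) * (A[m] @ x + b[m]), with P(m|x) from a GMM fitted on
    the source feature space. Shapes: A is (M, d, d); b is (M, d)."""
    post = gmm.predict_proba(x.reshape(1, -1))[0]   # class posteriors P(m|x)
    return sum(post[m] * (A[m] @ x + b[m]) for m in range(len(post)))
```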
  • The new LSF parameters are transformed to all-pole filter coefficients and resampled back to the linear scale before synthesis. Because the use of linear transformations broadens the formants of the converted speech, a perceptual post-filter is applied to narrow the formant bandwidths, deepen the spectral valleys and sharpen the formant peaks.
  • Figure 18 illustrates the JEAS vs. conventional pitch-synchronous harmonic model (PSHM) spectral envelopes, where it can be seen that the PSHM envelopes capture the spectral tilt, whereas in JEAS it is encoded by the glottal waveforms instead.
  • While both methods manage to represent the most important formants, small differences exist in their amplitudes, frequencies and/or bandwidths.
  • The glottal waveform morphing approach adopted within JEAS voice conversion employs Continuous Probabilistic Linear Transformations to map glottal LF parameters of different modal male and female speakers, which are the most commonly used speaker types in voice conversion applications.
  • Continuous probabilistic linear transformations have been chosen for being the most robust and efficient approach found to convert spectral envelopes.
  • The limitations of codebook-based conversion methods for envelope transformations, i.e. the discontinuities caused by the use of a discrete number of codebook entries, can also be extrapolated to the modification of glottal waveforms.
  • Thus, the use of continuous probabilistic modelling and transformations is expected to achieve better glottal conversions too.
  • The feature vectors employed to convert the glottal source characteristics are derived from the JEAS model parameters linked to the voice source of every pitch period, i.e. the glottal excitation strength E_e and T-parameters (T_p, T_e, T_a, T_c) obtained from the LF fitting procedure, and the energy (ANE) of the aspiration noise estimate used to adjust that of the modelled pitch-synchronous amplitude-modulated Gaussian noise.
  • Figure 19 shows the linear transformation of LF glottal waveforms: a) source, target and converted derivative glottal LF waves; b) source, target and converted trajectories of the glottal feature vector parameters (E_e, R_g, R_k, R_a, ANE). As can be seen, the described glottal conversion approach is capable of bringing the source feature vector parameter contours closer to the target, which, as a consequence, also produces converted glottal waveforms more similar to the target.
  • The speech data was recorded using a 'mimicking' approach, which resulted in a natural time-alignment between the identical sentences produced by the different speakers and factored out the prosodic cues of speaker identity to some extent.
  • Glottal closure instants derived from laryngograph signals are also provided for each sentence, and have been used for both PSHM and JEAS pitch synchronous analysis.
  • Four different voice conversion experiments have been investigated: male-to-male (MM), male-to-female (MF), female-to-male (FM) and female-to-female (FF) transformations.
  • The first 120 sentences are used for training and the remaining 30 for testing each speaker-pair conversion.
  • LSF spectral vectors of order 30 have been employed throughout the conversion experiments, to train 8 linear spectral envelope transforms between each source and target speaker pair using the parallel VOICES training data. This number has been chosen for being capable of achieving small spectral distortion ratios while still generalising to the test data. Aligned source-target vector pairs were obtained by applying forced alignment to mark sub-phone boundaries and using Dynamic Time Warping to further constrain their time alignment. For residual and phase prediction, target GMMs and codebooks of 40 classes and entries, respectively, have been built. Finally, glottal waveform conversions have also been carried out using 8 linear transforms per speaker pair. Objective and subjective evaluations have been used to compare the performance of the two methods.
  • The spectral envelope conversion performance can be measured with the distortion ratio $R_{LSF} = \frac{\sum_{t=1}^{L} \lVert lsf_{conv}(t) - lsf_{tgt}(t) \rVert}{\sum_{t=1}^{L} \lVert lsf_{src}(t) - lsf_{tgt}(t) \rVert} \times 100$, where lsf_src(t), lsf_tgt(t) and lsf_conv(t) are the source, target and converted LSF vectors respectively, and the summation is computed over the time-aligned test data, L being the total number of test vectors after time alignment. Note that a 100% distortion ratio corresponds to the distortion between the source and the target.
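  • Computing the ratio from time-aligned vector sequences is straightforward; in this sketch each array holds one LSF vector per row:

```python
import numpy as np

def r_lsf(lsf_src, lsf_tgt, lsf_conv):
    """LSF distortion ratio in percent over time-aligned test frames.
    100% reproduces the raw source-to-target distortion; lower is better."""
    num = np.linalg.norm(lsf_conv - lsf_tgt, axis=1).sum()
    den = np.linalg.norm(lsf_src - lsf_tgt, axis=1).sum()
    return 100.0 * num / den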
  • Similar objective distortion measures can also be used to evaluate the conversion of the voice source characteristics, i.e. Residual Prediction and Glottal Waveform Conversion in the PSHM and JEAS implementations respectively.
  • Residual Prediction reintroduces the target spectral details not captured by spectral envelope conversion, bringing as a result the converted speech spectra closer to the target.
  • Glottal Waveform Conversion maps time-domain representations of the glottal waveforms, which in the frequency domain results in better matching glottal formants and spectral tilts of the converted spectra. Whilst the methods differ, their spectral effect is similar, i.e. they aim to reduce the differences between the converted and the target speech spectra.
  • A distortion ratio R_LSD, similar to R_LSF, can be used to compare the converted-to-target log spectral distances (LSD) with and without voice source conversion. In this case, a 100% ratio corresponds to the distortion between the spectral-envelope-converted spectra without voice source transformation and the target spectra.
  • Figure 21 illustrates R LSD ratios computed for Residual Prediction and Glottal Waveform Conversion on the test set. Results show that both voice source conversion techniques manage to reduce the distortions between the converted and target speech spectra. Residual Prediction performs slightly better, mainly because the algorithm is designed to predict residuals which minimise the log spectral distance represented in R LSD . In contrast, glottal waveform conversion is trained to minimise the glottal parameter conversion error over the training data and not the log spectral distance. Nevertheless, both methods are successful in bringing the converted spectra close to the target.
  • the first part was an ABX test in which subjects were presented with PSHM-converted (A), JEAS-converted (B) and target (X) utterances and were asked to choose the speech sample A or B they found sounded more like the target X in terms of speaker identity.
  • Spectral envelopes and voice source characteristics were transformed with the methods described above for each system, i.e. spectral envelope conversion, residual and phase prediction were used for PSHM transformations and spectral envelope and glottal waveform conversion for JEAS transformations.
  • The prosody of the target was employed to synthesise the converted sentences, in order to normalise the pitch, duration and energy differences between source and target speakers for the perceptual comparison.
  • Figure 22 shows the results of the ABX test.
  • The JEAS-converted samples are preferred over the PSHM-converted ones overall, but the preference difference varies depending on the type of conversion, being, for example, almost the same for FM transformations.
  • The 'NO STRONG PREFERENCE' (NSP) option has been selected almost as often as the JEAS-converted utterances in general, which reveals that subjects found it very difficult to distinguish between conversion systems in terms of speaker identity.
  • the second listening test aimed at determining which system produces speech with a higher quality. Subjects were presented with PSHM and JEAS converted speech utterance pairs and asked to choose the one they thought had a better speech quality. Results are illustrated in Figure 23 . There is a clear preference for the sentences converted using the JEAS method, chosen 75.7% of the time on average, which stems from the clearly distinguishable quality difference between the PSHM and JEAS transformed samples. Utterances obtained after PSHM conversion have a 'noisy' quality caused by phase discontinuities which still exist despite Phase Prediction. Comparatively, JEAS converted sentences sound much smoother. This quality difference is also thought to have slightly biased the preference for JEAS conversion in the ABX test.
  • The method and device of voice conversion of the present invention are applicable to frameworks requiring voice quality transformations.
  • In particular, their use to repair the deviant voice source characteristics of tracheoesophageal speech can be mentioned.


Abstract

A method of converting a source speaker's speech signal into a converted speech signal, which comprises a stage of training using a given database of parallel source and target data. For each pitch period, a glottal waveform and a vocal tract filter are modelled to obtain a set of parameters comprising an excitation strength, parameters modelling a glottal waveform, and all-pole vocal tract filter coefficients. A glottal vector to be converted and a vocal tract vector to be converted are defined, an estimate of a glottal aspiration noise is obtained and a vocal tract transformation function is estimated. The stage of modelling comprises modelling said aspiration noise estimate by modulating Gaussian noise with the said modelled glottal waveform and adjusting its energy to match that of the said aspiration noise estimate. The method further comprises a stage of conversion and a stage of synthesis.

Description

    FIELD OF THE INVENTION
  • The present invention relates to methods and systems for voice conversion.
  • STATE OF THE ART
  • Voice Conversion aims at transforming a source speaker's speech to sound like that of a different target speaker. Text-to-speech synthesisers, dialogue systems and speech repair are among the numerous applications which can greatly benefit from the development of voice conversion technology.
  • The most widely used speech signal representations are the Source-Filter Model and the Sinusoidal Model. The Source-Filter representation (G. Fant, Acoustic Theory of Speech Production, ISBN 9027916004) is based on a simple production model composed of a glottal source waveform exciting a time-varying filter loaded at its output by the radiation of the lips. The main challenge in Source-Filter modelling is the estimation of the glottal waveform and vocal tract filter parameters from the speech signal.
  • Among the existing glottal waveform parameterisations, the Liljencrants-Fant (LF) model (The LF-model revisited. Transformations and frequency domain analysis, STL-QPSR, vol. 36, number 2-3, 1995, pages 119-156) has become the model of choice for research on the glottal source. It has been shown to be capable of modelling a wide range of naturally occurring phonations and the effects of its parameter variations are well understood. It exploits the linearity and time-invariance properties of the Source-Filter representation and assumes the commutation of the vocal tract and lip radiation filters to combine the modelling of the source excitation and lip radiation in the parameterisation of the derivative of the glottal waveform.
  • Linear Prediction (LP) is one popular technique used to obtain a combined parameterisation of the glottal source, vocal tract and lip radiation components in a unique all-pole filter H(z). Such a filter is then excited, as shown in Figure 1, by a sequence of impulses spaced at the fundamental period T_0 during voiced speech and by white Gaussian noise during unvoiced speech. If the speech signal were truly the response of an all-pole filter, the LP error or residual would be a train of impulses spaced at the voiced excitation instants and the impulse/noise voice source modelling would be accurate. In practice, however, the LP residual looks more like a white noise signal with larger values around the instants of excitation. While exciting the LP filter with the LP residual results in speech that is indistinguishable from the original, using an impulse train as the voiced excitation produces speech with a very buzzy quality. The strength of LP lies in its ability to automatically estimate a set of filter coefficients which compactly represent the envelope of the speech spectrum, making it popular in applications where the spectral characteristics of the speech wave need to be captured with a small number of parameters. Its main drawback, on the other hand, stems from the over-simplified modelling of the glottal waveform, which prevents its use in systems requiring high-quality speech outputs.
  • As an alternative to LP, H. Lu et al. have proposed a convex optimization method to automatically estimate the vocal tract filter and glottal waveform jointly (Joint estimation of vocal tract filter and glottal source waveform via convex optimization, Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999). The better modelling of the glottal source employed by this approach results in speech which has better quality than that of LP. In addition, the parameterisation of the glottal waveform allows its parametric modification, which can be exploited in voice conversion applications.
  • Sinusoidal Models assume the speech waveform to be composed of the sum of a small number of sinusoids with time-varying amplitudes, frequencies and phases. Such modelling was mainly developed by McAulay and Quatieri (Speech Analysis/Synthesis Based on a Sinusoidal Representation, IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 744-754, 1986) in the mid-1980s and has been shown to be capable of producing high quality speech even after pitch and time-scale transformations. However, because of the high number of sinusoidal amplitudes, frequencies and phases involved, sinusoidal modelling is less flexible than the source-filter representation for modifying spectral features.
  • In order to obtain high-quality converted speech, state-of-the-art voice conversion (VC) implementations mainly employ variations and extensions of the original sinusoidal model. In addition, they generally adopt a source-filter formulation based on LP to carry out spectral transformations.
  • Spectral envelopes are generally encoded in line spectral frequencies (LSF) for voice conversion, since LSFs have been shown to possess very good linear interpolation characteristics and to relate well to formant location and bandwidth. Because the frequency resolution of the human ear is greater at low frequencies than at high frequencies, spectral envelopes are often warped to a non-linear scale, e.g. the Bark scale, taking the non-uniform sensitivity of the human ear into account. Usually, only spectral envelopes of voiced speech segments are transformed, since unvoiced sounds contain little vocal tract information and their spectral envelopes present high variations. Among the existing different spectral envelope conversion techniques, continuous probabilistic linear transformations have been found to be the most robust and efficient approach. These can be obtained through least square error minimisation of parallel source and target training databases or using more general maximum likelihood transformation frameworks (Ye, H. and Young, S. Quality-enhanced Voice Morphing using Maximum Likelihood Transformations, IEEE Audio Speech and Language Processing, vol. 14, no. 4, pp. 1301-1312, 2006). One problem all spectral envelope conversion methods share is the broadening of the spectral peaks, expansion of the formant bandwidths and over-smoothing caused by the averaging effect of the parameter interpolations. This phenomenon makes the converted speech sound slightly muffled. In order to solve this issue, post-filtering is often applied as a post-processing stage to narrow formant bandwidths and suppress the noise in the spectral valleys as in, for example, Ye, H. and Young, S., Quality-enhanced Voice Morphing using Maximum Likelihood Transformations, IEEE Audio Speech and Language Processing, vol. 14, no. 4, pp. 1301-1312, 2006.
  • As for LP residual conversion, sinusoidal VC systems have developed residual prediction and selection methods (D. Suendermann, A. Bonafonte, H. Ney, and H. Hoege, A study on residual prediction techniques for voice conversion, in Proc. ICASSP, 2005, pp. 13-16) based on the correlation between spectral envelope and LP residuals. These methods reintroduce the target spectral detail lost after envelope conversion. Because residuals contain the errors introduced by the LP parameterisation, residual prediction techniques have been found to improve conversion performance. However, LP residuals do not constitute an accurate model of the voice source and residual prediction alone is not capable of modifying the quality of the voice source. This prevents their use in applications requiring voice quality modifications such as, for example, speech repair.
  • The patent application WO 2008/018653 A1 discloses a further voice conversion technique using the Liljencrants-Fant parameters of the glottal wave.
  • SUMMARY OF THE INVENTION
  • It is therefore an object of the present invention to provide a method of voice conversion based on a source-filter model which uses a representation of the glottal source more accurate than LP residuals. This allows the use of continuous probabilistic linear transformations for the conversion of the voice source.
  • In particular, it is an object of the present invention to provide a method of converting a source speaker's speech signal into a converted voice signal, which comprises a stage of training, a stage of conversion and a stage of synthesis.
  • The stage of training comprises, given a training database of parallel source and target data, for each pitch period of said training database: modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter and a set of T-parameters modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients; converting said T-parameters into R-parameters; converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale; defining a glottal vector to be converted; defining a vocal tract vector to be converted, said vocal tract vector comprising said line spectral frequencies in Bark scale; applying wavelet denoising to obtain an estimate of a glottal aspiration noise.
  • The stage of training also comprises, from the set of vocal tract vectors obtained for each pitch period of the said training database, estimating a vocal tract continuous probabilistic linear transformation function using the least square error criterion.
  • The previous stage of modelling further comprises the steps of modelling said aspiration noise estimate by modulating zero mean unit variance Gaussian noise with the said modelled glottal waveform and adjusting its energy to match that of the said aspiration noise estimate. Besides, the glottal vector to be converted comprises said excitation strength parameter, said R-parameters and said energy of the aspiration noise estimate.
  • In the stage of conversion, a given test speech waveform is modelled and transformed into a set of converted parameters.
  • In the stage of synthesis, a converted speech waveform is synthesised from the said set of converted parameters.
  • Preferably, the stage of training further comprises: from the set of glottal vectors obtained for each pitch period of the said training database, estimating a glottal waveform continuous probabilistic linear transformation function using the least square error criterion.
  • The step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, preferably comprises the steps of: modelling the glottal waveform using the Rosenberg-Klatt model; using convex optimization to obtain a set of Rosenberg-Klatt glottal waveform parameters and the all-pole vocal tract filter coefficients, wherein said step of using convex optimization comprises a step of adaptive pre-emphasis for estimating and removing a spectral tilt filter contribution from the speech waveform before convex optimization. Besides, that step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, further comprises the steps of: obtaining a derivative glottal waveform by inverse filtering said pitch period using said all-pole vocal tract filter coefficients; fitting said set of LF parameters to the said inverse filtered derivative glottal waveform by direct estimation and constrained non-linear optimization.
  • The stage of conversion preferably comprises, for each pitch period of said test speech waveform: obtaining a glottal vector to be converted, said glottal vector comprising an excitation strength parameter, a set of R-parameters and the energy of the said aspiration noise estimate; obtaining a vocal tract vector to be converted, said vocal tract vector comprising a set of line spectral frequencies in Bark scale; applying said vocal tract continuous probabilistic linear transformation function estimated during the training stage to obtain a converted vocal tract parameter vector; transforming said glottal vector using said glottal waveform continuous probabilistic linear transformation function estimated during the training stage, thus obtaining a converted glottal vector comprising a set of converted parameters.
  • In particular, those stages of obtaining a glottal vector to be converted and a vocal tract vector to be converted further comprise the steps of: modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter and a set of T-parameters modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients; converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale; converting said T-parameters into R-parameters; defining a glottal vector to be converted; and defining a vocal tract vector to be converted.
  • Preferably, the stage of conversion further comprises a step of post-filtering said converted vocal tract parameter vector.
  • The stage of synthesis, in which said converted speech waveform is synthesised from the said set of converted parameters, preferably comprises the steps of: interpolating the trajectories of said converted parameters of each pitch period, thus obtaining a set of interpolated parameters comprising interpolated R-parameters, interpolated energy and interpolated vocal tract vector; converting said interpolated vocal tract vector into an all-pole filter coefficient vector; converting said interpolated R-parameters into interpolated T-parameters; for each frame of said test speech waveform, generating an excitation signal.
  • Preferably, the stage of generating an excitation signal comprises, for each of said frames: if said frame is voiced: from said interpolated T-parameters and said excitation strength parameter, generating an interpolated glottal waveform; from said interpolated aspiration noise energy parameter, generating interpolated aspiration noise; generating said voiced excitation signal by adding said interpolated glottal waveform and said interpolated aspiration noise. If said frame is unvoiced: generating said unvoiced excitation signal from a Gaussian noise source.
  • Besides, the stage of synthesis further comprises: generating a synthetic contribution of each frame by filtering said excitation signal with said interpolated all-pole filter coefficient vector; multiplying said synthetic contribution by a Hamming window, overlapping and adding, in order to generate the converted speech signal.
  • The present invention also provides a method applicable to voice quality transformations, such as tracheoesophageal speech repair, which comprises at least some of the above-mentioned method steps.
  • It is another object of the present invention to provide a device comprising means for carrying out the above-mentioned method.
  • Finally, it is a further object of the present invention to provide a computer program code means adapted to perform the steps of the method previously mentioned when said program is run on a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a micro-processor, a micro-controller, or any other form of programmable hardware.
  • The advantages of the proposed invention will become apparent in the description that follows.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • To complete the description and in order to provide for a better understanding of the invention, a set of drawings is provided. Said drawings form an integral part of the description and illustrate a preferred embodiment of the invention, which should not be interpreted as restricting the scope of the invention, but rather as an example of how the invention can be embodied. The drawings comprise the following figures:
    • Figure 1 shows a conventional schematic diagram of the LP model.
    • Figure 2 shows a schematic diagram of the joint estimation analysis synthesis (JEAS) model according to an embodiment of the present invention.
    • Figure 3 shows a schematic diagram modelling the glottal wave.
    • Figure 4 shows a schematic diagram modelling the derivative glottal wave.
    • Figure 5 shows typical LF pulses corresponding to glottal and derivative glottal waves.
    • Figure 6 shows a conventional model of the voice source.
    • Figure 7 shows a joint estimation example: a) speech period, b) speech spectrum and jointly estimated spectral envelope, c) inverse filtered residual and jointly estimated RK wave.
    • Figure 8 shows a RK derivative glottal wave.
    • Figure 9 shows a schematic diagram of a conventional modelling of the spectral tilt.
    • Figure 10 shows the effects of adaptive pre-emphasis.
    • Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
    • Figure 12 shows a typical denoising result.
    • Figure 13 shows the standard aspiration noise model parameters.
    • Figure 14 shows the Gaussian noise modulation by an LF waveform.
    • Figure 15 shows a schematic diagram of an aspiration noise modelling approach according to an embodiment of the present invention.
    • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
    • Figure 17 illustrates resampling of the frame size contour.
    • Figure 18 shows the JEAS vs. PSHM spectral envelopes.
    • Figure 19 shows the continuous probabilistic linear transformation of LF glottal waveforms.
    • Figure 20 shows the RLSF distortion ratios of the converted PSHM and JEAS spectral envelopes.
    • Figure 21 shows the R LSD distortion ratios of Residual Predicted (RP) and Glottal Waveform Converted (GWC) spectra.
    • Figure 22 shows the results of the ABX test.
    • Figure 23 shows the results of the quality comparison test.
    DETAILED DESCRIPTION OF THE INVENTION
    Definitions
  • In the context of the present invention, the term "approximately" and terms of its family (such as "approximate", "approximation", etc.) should be understood as indicating values or forms very near to those which accompany the aforementioned term. That is to say, a deviation within reasonable limits from an exact value or form should be accepted, because the expert in the technique will understand that such a deviation from the values or forms indicated is inevitable due to measurement inaccuracies, etc. The same applies to the term "nearly".
  • In the context of the present invention, the following terms are defined as follows:
  • The expression "pitch period" means a segment of a speech waveform which comprises a period of the fundamental frequency.
  • The term "frame" means a segment of a speech waveform, which corresponds to a pitch period in voiced parts and to a fixed amount of time in unvoiced parts. In a preferred embodiment of the present invention, which should not be interpreted as a limitation to the present invention, a frame corresponds to 10 ms in unvoiced parts.
  • The expression "source data" refers to a collection of speech waveforms uttered by a source speaker, while the expression "target data" refers to a collection of speech waveforms uttered by a target speaker. Besides, the expression "parallel source and target data" refers to a collection of speech waveforms uttered both by the source and the target speakers.
  • In this text, the term "comprises" and its derivations (such as "comprising", etc.) should not be understood in an excluding sense, that is, these terms should not be interpreted as excluding the possibility that what is described and defined may include further elements, steps, etc.
  • 1. Joint Estimation Analysis Synthesis
  • A method of speech modelling for the analysis, modification and synthesis of speech is described next. The model is called Joint Estimation Analysis Synthesis (JEAS). Its biggest advantage is the automatic and simultaneous parameterisation of the vocal tract and the voice source, which allows the manipulation not only of spectral envelopes, but of glottal characteristics as well. In addition, it also supports high-quality pitch and time-scale modifications. Next, the employed voice source model and source-filter deconvolution technique are described, together with the way analysis, synthesis and prosodic transformations are implemented.
  • 1.1 Speech modelling
  • Figure 2 shows a schematic diagram of the JEAS model. It is based on a general Source-Filter representation. It employs white Gaussian and amplitude-modulated white Gaussian noise to model the Turbulence and Aspiration Noise components respectively, a digital differentiator for Lip Radiation and an all-pole filter to represent the Vocal Tract. Besides, the Liljencrants-Fant (LF) model is adopted to better capture the characteristics of the derivative glottal wave. Then, in order to estimate the different model component parameterisations from the speech wave, a joint voice source and vocal tract parameter estimation technique based on Convex Optimization is applied.
  • Next, the modelling of the voice source is explained:
  • Numerous parametric models of the glottal source have been proposed in the literature. Despite their differences, they all share many common features and can be described by a small set of parameters. In most cases, they exploit the linearity and time-invariance properties of the Source-Filter representation and assume the commutation of the vocal tract and lip radiation filters to combine the modelling of the source excitation and lip radiation in the parameterisation of the derivative of the glottal waveform as shown in Figure 4.
  • The present method adopts the well-known LF model, which is a four-parameter time-domain model of one cycle of the derivative glottal waveform. Typical LF pulses corresponding to glottal and derivative glottal waves are shown in Figure 5. Mathematically, it can be described as:

    $$g(n) = \begin{cases} E_0\, e^{\alpha n} \sin(\omega_g n), & 0 \le n < T_e \\[4pt] -\dfrac{E_e}{\varepsilon T_a}\left[e^{-\varepsilon (n - T_e)} - e^{-\varepsilon (T_c - T_e)}\right], & T_e \le n < T_c \end{cases} \quad (1)$$
  • The model consists of two segments: the first characterises the derivative glottal waveform from the instant of glottal opening to the instant of main excitation Te, where the amplitude reaches its maximum negative value -Ee. As shown in equation (1), this segment is a sinusoidal function which grows exponentially in amplitude, $F_g = \frac{\omega_g}{2\pi}$ being the frequency of the sine function and α determining the rate of the amplitude increase. E0 is a scaling factor used to ensure that the signal has a zero mean. The timing parameter Tp is related to the sinusoidal frequency through $T_p = \frac{1}{2 F_g}$ and denotes the instant of maximum glottal flow. Ee is closely related to the strength of the source excitation and is the main determinant of the intensity of the speech signal. Its variation affects the overall harmonic amplitudes, except the very lowest components, which are determined more by the shape of the pulse.
  • The second segment models the closing or return phase from the main excitation Te to the instant of full closure Tc using an exponential function. The duration of the return phase is thus determined by Tc - Te. The main parameter characterising this segment is Ta, which represents the "effective duration" of the return phase, defined as the duration from Te to the point where a tangent fitted at the start of the return phase crosses zero. ε⁻¹ is the time constant of the exponential function and can be determined iteratively from Ta, Te and Tc through

    $$\varepsilon = \frac{1}{T_a}\left(1 - e^{-\varepsilon (T_c - T_e)}\right).$$

    T0 corresponds to the fundamental period. Generally, Tc is made to coincide with the opening of the following pulse. This might suggest that the model does not account for the closed phase of the glottal waveform; however, for reasonably small values of Ta, the exponential function fits closely to the zero line, providing a closed phase without the need for additional control parameters.
  • Along with the excitation strength Ee, the LF pulse can be uniquely determined by the T-parameters (Tp, Te, Ta, Tc). These parameters can be easily identified from the estimated derivative glottal wave. Therefore, they are generally obtained first, and the synthesis parameters (E0; α; ωg; ε), from which the LF waveform can be computed directly, are then derived taking the following constraints into account:

    $$\int_0^{T_0} g(t)\, dt = 0$$
    $$\omega_g = \frac{\pi}{T_p}$$
    $$\varepsilon T_a = 1 - e^{-\varepsilon (T_c - T_e)}$$
    $$E_0 = \frac{-E_e}{e^{\alpha T_e} \sin(\omega_g T_e)}$$
  • Another important set of LF parameters are the R-parameters (Rg, Rk, Ra), which are normalised with respect to T0 and correlate with the most salient glottal phenomena, i.e. the glottal pulse width and the skewness and abruptness of closure:

    $$R_g = \frac{T_0}{2 T_p}; \qquad R_k = \frac{T_e - T_p}{T_p}; \qquad R_a = \frac{T_a}{T_0}$$

  • Rg is a normalised version of the glottal formant frequency Fg, which is defined as the inverse of twice the duration of the opening phase Tp. Rk is the LF parameter which captures glottal asymmetry. It is defined as the ratio between the durations of the closing and opening branches of the glottal pulse, and the larger its value, the more symmetrical the pulse is. The relationship between Rg, Rk and the Open Quotient OQ is OQ = (1 + Rk) / (2 Rg). Thus, OQ is positively correlated with Rk and negatively correlated with Rg. The Ra parameter corresponds to the effective "return time" Ta normalised by the fundamental period and captures differences relating to the spectral tilt.
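  • As an illustration of the preceding definitions, the following minimal numpy sketch generates one LF derivative glottal pulse from its T-parameters and converts them to R-parameters. The parameter values are illustrative, and the grid search over α is a simple stand-in for an exact solution of the zero-mean constraint:

```python
import numpy as np

def lf_pulse(Ee, Tp, Te, Ta, Tc, fs):
    """One LF derivative glottal pulse from its T-parameters (in seconds)."""
    wg = np.pi / Tp                                 # omega_g = pi / Tp
    eps = 1.0 / Ta                                  # fixed-point iteration for
    for _ in range(100):                            # eps*Ta = 1 - exp(-eps*(Tc-Te))
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta
    n1 = np.arange(0.0, Te, 1.0 / fs)               # open phase
    n2 = np.arange(Te, Tc, 1.0 / fs)                # return phase
    ret = -(Ee / (eps * Ta)) * (np.exp(-eps * (n2 - Te)) - np.exp(-eps * (Tc - Te)))

    def pulse(alpha):
        E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
        return np.concatenate([E0 * np.exp(alpha * n1) * np.sin(wg * n1), ret])

    # choose alpha so that the pulse integrates (approximately) to zero
    alphas = np.linspace(-2000.0, 6000.0, 2001)
    alpha = alphas[np.argmin([abs(pulse(a).sum()) for a in alphas])]
    return pulse(alpha)

def t_to_r(Tp, Te, Ta, T0):
    """T-parameters to normalised R-parameters (Rg, Rk, Ra)."""
    return T0 / (2 * Tp), (Te - Tp) / Tp, Ta / T0

g = lf_pulse(Ee=1.0, Tp=0.004, Te=0.006, Ta=0.0004, Tc=0.008, fs=16000)
```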
  • Next, the method adopted for glottal source and vocal tract filter deconvolution is explained:
  • The aim of Source-Filter deconvolution is to obtain estimates of the glottal source and vocal tract filter components from the speech wave. Two main deconvolution approaches exist. Before parametric models of the glottal waveform were developed, Inverse Filtering (IF) was the most commonly employed deconvolution method. It is based on calculating a vocal tract filter transfer function, whose inverse is used to obtain a glottal waveform estimate which can then be parameterised.
  • A different approach involves modelling both glottal source and vocal tract filter, and developing techniques to jointly estimate the source and tract model parameters from the speech wave. Joint Estimation methods are fully automatic. This is an important condition that a mathematical model aimed at analysis, synthesis and modification of the speech signal should meet. Due to the characteristics of the mathematical voice source and vocal tract descriptions, such an approach is a complex nonlinear problem. For this reason, LP has been more widely deployed, as a simpler method to obtain a direct and efficient source-filter parameterisation of the speech signal. Its poor modelling of the voice source has not limited its application in speech coding, where the aim is to represent the speech spectrum efficiently with a small number of parameters; however, it has prevented its use in speech synthesis and transformation applications. Advances in voice conversion and Hidden Markov Model (HMM) speech synthesis in the last few years have emphasized the importance of refined vocoding and thus, the problem of automatic joint estimation of voice source and vocal tract filter parameters has gained renewed interest.
  • The method employed to obtain the JEAS voice source and vocal tract model parameters from the speech wave follows the second deconvolution approach and is based on the joint estimation of the vocal tract filter and the glottal waveform proposed by Lu and Smith (Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999).
  • 1.2 Analysis within the Joint Estimation Analysis Synthesis model
  • During analysis, voiced and unvoiced speech segments are processed differently due to their diverse source characteristics. While the voice source in voiced speech is represented by a combination of the LF and aspiration noise models, white Gaussian noise is used to excite the vocal tract filter in unvoiced frames (see Figure 2). Their different modelling requires a preprocessing step where the voiced and unvoiced speech sections are determined and the glottal closure instants (GCI) of the voiced segments are estimated. Then, the voice source and vocal tract parameters are obtained through joint source-filter estimation and LF re-parameterisation in voiced sections (V) and through standard autocorrelation LP and Gaussian noise energy matching in unvoiced portions (U).
  • An algorithm, such as the well-known Dynamic Programming Projected Phase-Slope Algorithm (DYPSA), is used for GCI estimation. It employs the group-delay function in combination with a phase-slope projection method to determine GCI candidates, plus N-best dynamic programming to select the most likely candidates according to a cost function which takes waveform similarity, pitch deviation, normalised energy and deviation from the ideal phase-slope into account.
  • The voicing decision is made based on energy, zero-crossing and GCI information. Voiced segments are then processed pitch-synchronously, while unvoiced frames are periodically extracted. In a particular embodiment, they are extracted every 10 ms.
  • The method employed by the invention to obtain the JEAS voice source and vocal tract model parameters involves using a voice source model simple enough to allow the source filter deconvolution to be formulated as a Convex Optimization problem. Then, the derivative glottal waveform obtained by inverse filtering (IF) with the estimated filter coefficients is reparameterised by LF model fitting.
  • The success of the technique lies in providing a derivative glottal waveform constraint when estimating the vocal tract filter. Because of this, the resulting IF derivative glottal waveform is closer to the true glottal excitation and its fitting to an LF model is less error prone.
  • The joint estimation algorithm models the voice source using the well-known Rosenberg-Klatt (RK) model, which consists of a basic voicing waveform describing the shape of the derivative glottal wave and a low-pass filter $\frac{1}{1 - \mu z^{-1}}$, with μ > 0, as shown in Figure 6. The RK derivative of the glottal waveform is given by

    $$\hat{g}(n) = \begin{cases} 0, & 1 \le n < n_c \\ 2a(n - n_c) - 3b(n - n_c)^2, & n_c \le n < T_0 \end{cases} \quad (7)$$

    where T0 corresponds to the pitch period and nc represents the duration of the closed phase, which can also be expressed as

    $$n_c = T_0 - OQ \cdot T_0, \quad (8)$$

    OQ being the open quotient, i.e. the fraction of the pitch period in which the glottis is open. In addition, the parameters a and b need to be always positive and hold the following relationship in order to maintain an appropriate waveshape:

    $$a = b \cdot OQ \cdot T_0 \quad (9)$$
  • Source-filter deconvolution via convex optimization is accomplished by minimising the squared error between the modelled and the true derivative glottal waveforms. The modelled derivative glottal waveform ĝ(n) corresponds to that of equation (7), while the true derivative glottal wave g(n) is obtained through inverse filtering as

    $$g(n) = s(n) - \sum_{k=1}^{p} \alpha_k s(n - k), \quad (10)$$

    where s(n) is the speech wave and αk are the coefficients of the vocal tract all-pole filter.

  • The error between the modelled and the true derivative glottal waves e(n) can be calculated by subtracting equations (7) and (10):

    $$e(n) = \hat{g}(n) - g(n) = \begin{cases} -s(n) + \sum_{k=1}^{p} \alpha_k s(n - k), & 1 \le n < n_c \\ 2a(n - n_c) - 3b(n - n_c)^2 - s(n) + \sum_{k=1}^{p} \alpha_k s(n - k), & n_c \le n < T_0 \end{cases} \quad (11)$$

  • Rearranging the previous expression and rewriting it in matrix form we have

    $$E = \begin{bmatrix} e(1) \\ \vdots \\ e(n_c) \\ e(n_c + 1) \\ \vdots \\ e(T_0) \end{bmatrix} = \begin{bmatrix} s(0) & \cdots & s(1 - p) & 0 & 0 \\ \vdots & & \vdots & \vdots & \vdots \\ s(n_c - 1) & \cdots & s(n_c - p) & 0 & 0 \\ s(n_c) & \cdots & s(n_c + 1 - p) & 2 \cdot 1 & -3 \cdot 1^2 \\ \vdots & & \vdots & \vdots & \vdots \\ s(T_0 - 1) & \cdots & s(T_0 - p) & 2(T_0 - n_c) & -3(T_0 - n_c)^2 \end{bmatrix} X - \begin{bmatrix} s(1) \\ \vdots \\ s(T_0) \end{bmatrix} = FX - S,$$

    where X = [α1 ... αp a b]ᵀ is the parameter vector to estimate so that the sum of the squares of the equation error E is minimised, i.e.

    $$\min_X \|E\|^2 = \min_X \sum_{n=1}^{T_0} E(n)^2 = \min_X \|FX - S\|^2 \quad (12)$$
  • H. Lu et al. demonstrated (Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999) that the simplicity of the RK glottal model guarantees this optimization to be convex, i.e. to only have one minimum, which corresponds to the optimal solution, and thus to be efficiently solvable via Quadratic Programming. A quadratic problem is defined as follows:

    $$\min_X q(X) = \frac{1}{2} X^T H X + g^T X \quad \text{subject to:} \quad AX \le b, \quad A_{eq} X = b_{eq}$$

  • Equation (12) can be solved using quadratic programming if expanded to have this form, i.e.

    $$\min_X \|FX - S\|^2 = (FX - S)^T (FX - S) = X^T F^T F X - 2 S^T F X + S^T S,$$

    by defining

    $$H = 2 F^T F, \qquad g^T = -2 S^T F$$

    and ignoring the term $S^T S$, which is always positive, for the purposes of minimisation. In addition, equation (9) imposes the following equality and inequality constraints:

    $$a > 0, \qquad b > 0, \qquad a = b \cdot OQ \cdot T_0.$$
  • The derived quadratic program can be solved using a number of existing iterative numerical algorithms. In the developed implementation, the quadratic programming function of the MATLAB Optimization Toolbox has been employed. The result of the minimization problem is the simultaneous estimation of the RK model parameters a and b and the all-pole filter coefficients α k . Figure 7 shows a joint estimation example for one pitch period.
  • The described joint estimation process assumes that the closed and open-phases are defined, while in practice the parameter which delimits the end of the closed-phase and the beginning of the open-phase nc is unknown. Its optimal value is found by uniformly sampling the possible nc values (empirically shown to vary from 0% to 60% of the pitch period To ), solving the quadratic problem at each sampled nc value and choosing the estimate resulting in minimum error.
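  • The following Python sketch illustrates this joint estimation for one pitch period, recast as a bound-constrained least squares problem: the equality constraint a = b·OQ·T0 is folded in by substitution and the strict inequalities are relaxed to b ≥ 0. A general-purpose QP solver, such as the MATLAB Optimization Toolbox function mentioned above, would handle the constraints directly; the helper name and data layout here are illustrative:

```python
import numpy as np
from scipy.optimize import lsq_linear

def joint_estimate(s_ext, T0, p, OQ_grid=np.linspace(0.4, 1.0, 13)):
    """Joint RK source / all-pole filter estimation for one pitch period.
    s_ext holds p history samples followed by the period samples, so that
    s(n) = s_ext[p + n - 1] for n = 1..T0. The equality constraint
    a = b*OQ*T0 is substituted into the b column, leaving one bounded
    unknown b >= 0 per candidate closed-phase length nc."""
    best = None
    for OQ in OQ_grid:
        nc = int(round(T0 - OQ * T0))                  # closed-phase length
        F = np.zeros((T0, p + 1))
        S = s_ext[p:p + T0].astype(float)              # s(1)..s(T0)
        for n in range(1, T0 + 1):
            F[n - 1, :p] = s_ext[p + n - 1 - np.arange(1, p + 1)]
            if n >= nc:
                m = n - nc                             # open-phase sample index
                # 2a(n-nc) - 3b(n-nc)^2 with a = b*OQ*T0 substituted:
                F[n - 1, p] = 2 * OQ * T0 * m - 3 * m ** 2
        lb = np.r_[np.full(p, -np.inf), 0.0]           # alphas free, b >= 0
        res = lsq_linear(F, S, bounds=(lb, np.full(p + 1, np.inf)))
        if best is None or res.cost < best[0]:
            best = (res.cost, nc, OQ, res.x)
    _, nc, OQ, x = best
    alphas, b = x[:p], x[p]
    return alphas, b * OQ * T0, b, nc                  # alphas, a, b, nc
```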
  • As can be seen in Figure 8, the basic RK voicing waveform of equation (7) does not explicitly model the return phase of the derivative glottal waveform and changes abruptly at the glottal closure instants. For this reason, a low-pass filter is added to the basic model, with the purpose of reducing the abruptness of glottal closure. In the frequency domain, the filter coefficient µ is responsible for controlling the tilt of the source spectrum.
  • In order to allow the formulation of the convex optimization problem, the spectral tilt filter is separated from the source model and incorporated into the vocal tract model by adding an extra pole to the all-pole filter, as shown in Figure 9. This implies that the vocal tract filter coefficients estimated using this formulation also encode the spectral slope information of the voice source. As a result, the derivative glottal waveforms obtained using this approach fail to adequately capture the variations in the return phase of the glottal source.
  • The present invention uses adaptive pre-emphasis to estimate and remove the spectral tilt filter contribution from the speech wave before convex optimization. Order-one LP analysis and inverse filtering are applied to estimate and remove the spectral slope from the speech frames under analysis. The effect of adaptive pre-emphasis is illustrated in Figure 10: a) speech spectrum and estimated spectral envelope, b) IF derivative glottal wave and fitted LF waveform, c) IF derivative glottal wave spectrum and fitted LF wave spectrum. The vocal tract filter envelope estimates obtained this way do not encode source spectral tilt characteristics, which are instead reflected in the closing phase of the resulting derivative glottal waveforms. This improves the fitting of the return phase of the LF model and thus of the high frequencies of the glottal source.
  • The LF model is capable of more accurately describing the glottal derivative waveform than the RK model. However, its more complex nonlinear formulation fails to fulfil the convexity condition and prevents its use in the joint voice source and vocal tract filter parameter estimation algorithm. Instead, the RK model is employed during source-filter deconvolution and the LF model is then used to re-parameterise the derivative glottal wave obtained by inverse filtering the speech waveform with the jointly estimated filter coefficients.
  • LF model fitting is carried out in two steps. First, initial estimates of the LF T-parameters (Tp, Te, Ta, T c ) and the glottal excitation strength Ee are obtained from the time-domain IF voice source waveform by conventional direct estimation methods. Then, their values are refined using the conventional constrained nonlinear optimization technique. The overall procedure is as follows.
  • The glottal excitation strength Ee and its time index Te are located first by finding the minimum of the IF derivative glottal waveform. Then, Tp and Tc are determined as the first zero-crossings before and after Te respectively. Ta is initially estimated as Ta = 2(Tc - Te)/3. Tp and Ta are further refined using constrained nonlinear minimisation. Because the initial Ee, Te and Tc estimates are quite reliable, their values are kept unchanged during optimization. Ta is confined to vary between 0 and Tc - Te, and Tp to vary within ±20% of its initial estimate. The return and open phases are optimized separately and sequentially. In both cases, the minimisation function is the sum of the squared error between the IF derivative glottal wave and the fitted estimate for the particular phase. Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
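  • A minimal sketch of this two-step fitting is given below, assuming one period of the IF derivative glottal wave is available. The refinement is shown for Ta only; the ±20% refinement of Tp would follow the same pattern, and the helper name is illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_lf(dg, fs):
    """Direct T-parameter estimation from one period of the IF derivative
    glottal wave dg, followed by a bounded refinement of Ta."""
    te_i = int(np.argmin(dg))                   # main excitation: minimum
    Ee, Te = -dg[te_i], te_i / fs
    zc = np.where(np.diff(np.signbit(dg)))[0]   # zero-crossing indices
    tp_i = zc[zc < te_i][-1]                    # last crossing before Te
    tc_i = zc[zc > te_i][0]                     # first crossing after Te
    Tp, Tc = tp_i / fs, tc_i / fs
    Ta0 = 2.0 * (Tc - Te) / 3.0                 # direct estimate (refined below)

    def return_phase_error(Ta):
        eps = 1.0 / Ta                          # eps*Ta = 1 - exp(-eps*(Tc-Te))
        for _ in range(100):
            eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta
        t = np.arange(te_i, tc_i) / fs
        model = -(Ee / (eps * Ta)) * (np.exp(-eps * (t - Te))
                                      - np.exp(-eps * (Tc - Te)))
        return np.sum((dg[te_i:tc_i] - model) ** 2)

    res = minimize_scalar(return_phase_error, bounds=(1e-5, Tc - Te),
                          method='bounded')
    return Ee, Tp, Te, res.x, Tc                # refined Ta replaces Ta0
```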
  • Because the LF parameterisation does not model glottal aspiration noise, the stochastic component present in the IF derivative glottal waveform is not captured during LF fitting. However, perceptually, the lack of aspiration noise results in an unnatural speech quality and thus, a methodology for its extraction and parameterisation has been developed within the JEAS framework.
  • Wavelet denoising is used to extract the glottal aspiration noise from the IF derivative glottal wave estimate. In a preferred embodiment, the wavelet denoising technique used is Wavelet Packet Analysis, which has been found to obtain more reliable aspiration noise estimates compared to other techniques employed to identify and separate the periodic and aperiodic components of quasi-periodic signals, such as frequency transform analysis or periodic prediction. Wavelet Packet Analysis is preferably performed at level 4 with the 7th order Daubechies wavelet, using soft-thresholding and the Stein Unbiased Risk Estimate threshold evaluation criterion. Figure 12 shows a typical denoising result: a) original and denoised IF derivative glottal wave, b) noise estimate.
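  • A sketch of this extraction step using the PyWavelets package is given below; note that the universal (VisuShrink) threshold is used here as a simple stand-in for the SURE criterion named above:

```python
import numpy as np
import pywt

def extract_aspiration_noise(dg):
    """Level-4 wavelet packet denoising of the IF derivative glottal wave dg
    (assumed long enough for a depth-4 decomposition) with a db7 wavelet and
    soft thresholding. Returns (noise estimate, denoised wave)."""
    sigma = np.median(np.abs(pywt.dwt(dg, 'db7')[1])) / 0.6745  # noise scale
    thr = sigma * np.sqrt(2.0 * np.log(len(dg)))                # universal threshold
    wp = pywt.WaveletPacket(data=dg, wavelet='db7', mode='symmetric', maxlevel=4)
    for node in wp.get_level(4, 'natural'):
        node.data = pywt.threshold(node.data, thr, mode='soft')
    denoised = wp.reconstruct(update=False)[:len(dg)]
    return dg - denoised, denoised
```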
  • Once an estimate of the aspiration noise has been extracted, it needs to be parameterised. Studies of aspiration noise have shown that it is synchronous with the glottal wave and likely to present noise bursts at glottal closure, and often also at glottal opening. Most models neglect the burst at glottal opening and approximate the noise as pitch-synchronous amplitude-modulated Gaussian noise, with higher energy around the glottal closure instants. The amplitude of the noise burst is usually modulated using Rectangular, Hanning or Hamming windows. A spectral shaping filter is sometimes included to account for the average spectral density of the aspiration noise and the high-pass filtering introduced by the commutation of the vocal tract and radiation filters. However, various models also neglect the spectral shaping filter, since it has been found not to be perceptually important. These pitch-synchronous amplitude-modulated Gaussian noise approaches require the determination of the following parameters, illustrated in Figure 13, from the aspiration noise component:
    • Noise Floor (Nf): the noise floor of the aspiration noise;
    • Noise Pulse Amplitude (NPa ): the amplitude modulation index of the noise pulse
    • Noise Pulse Position (NPp ): the position of the center of the noise pulse window in the glottal period
    • Noise Pulse Width (NPw): the width of the noise pulse window
  • Unfortunately, automatic calculation of the above parameters from the estimated aspiration noise components is troublesome in many cases. In order to avoid these errors, a different approach is followed in the present invention and, in particular, in the JEAS implementation. While the aspiration component is still approximated as pitch synchronous amplitude modulated Gaussian noise, an alternative function which does not require the estimation of Nf, NPa, NPp and NPw is employed to modulate its amplitude: the LF waveform. In fact, the shape of the LF waveform follows the most salient amplitude modulation characteristics of glottal aspiration noise, i.e. the magnitude of its amplitude increases during the open phase and is maximum at glottal closure. If stationary Gaussian noise is modulated with an LF waveform, the resulting signal will present the two likely aspiration noise bursts around glottal opening and glottal closure, as shown in Figure 14: a) Gaussian noise source, b) LF waveform, c) LF modulated Gaussian noise. According to informal listening tests, this approach is comparable to the previously described window-based modelling techniques.
  • Thus, in the present invention, the aspiration noise estimate obtained for a particular pitch period during JEAS analysis is parameterised as follows. First, zero mean unit variance Gaussian noise is modulated with the already fitted LF waveform for that pitch period. Then, its energy is adjusted to match the energy ANE of the aspiration noise estimate. Because using a spectral shaping filter has informally been found not to make a perceptual difference, it is not included in the parameterisation. Figure 15 depicts a diagram of the employed aspiration noise modelling approach.
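  • In code, this parameterisation reduces to a couple of lines; the sketch below assumes the fitted LF waveform lf and the measured energy ane for the current pitch period are available:

```python
import numpy as np

def model_aspiration_noise(lf, ane):
    """Modulate zero-mean, unit-variance Gaussian noise with the fitted LF
    waveform lf and scale it so that its energy matches ane."""
    an = np.random.randn(len(lf)) * lf      # LF-modulated Gaussian noise
    return an * np.sqrt(ane / np.sum(an ** 2))
```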
  • 1.3 Synthesis within the Joint Estimation Analysis Synthesis model
  • Synthesis is done by following the JEAS model of Figure 2 and applying the parameters estimated during analysis. In theory, each frame k of the speech waveform, which corresponds to a pitch period in voiced segments and to a fixed segment (in a particular example, a fixed segment of 10 ms) in unvoiced parts, can be generated by filtering the estimated voiced or unvoiced excitation signal e(n) with the vocal tract filter vt for that particular frame:

    $$s_k(n) = e_k(n) * vt_k = e_k(n) + \sum_{i=1}^{p} \alpha_i s_k(n - i), \quad n = 1 \ldots N_k,$$

    where p is the filter order and Nk is the number of samples in the frame.
  • The excitation signal is constructed either by adding the fitted LF and aspiration noise estimates, lf(n) and an(n), or by simply generating a Gaussian noise source, gn(n), in voiced (V) and unvoiced (U) segments respectively:

    $$e_k(n) = \begin{cases} lf_k(n) + an_k(n), & k \in V \\ gn_k(n), & k \in U \end{cases}$$
  • In practice, since the described JEAS analysis is done independently for each frame, the continuity of the estimated parameters between adjacent frames is not guaranteed, particularly within voiced segments. As a result, perceptual artifacts are produced when the parameters change too abruptly from frame to frame. To reduce this problem, the voiced glottal source and vocal tract parameter trajectories are smoothed before resynthesis.
  • Regarding the vocal tract, the jointly estimated filter coefficients (α1 ... αp) are first converted to Line Spectral Frequencies (LSF) due to their better interpolation properties. Then, each set of LSF coefficients LSFᵖ = (lsf1 ... lsfp) is averaged with those of the previous and following frames to obtain a smoother vocal tract filter estimate for synthesis:

    $$\overline{LSF}_k^p = \sum_{i=k-1}^{k+1} LSF_i^p \, / \, 3$$
  • As for the glottal source, a similar approach is followed. First, the fitted LF T-parameters (Tp, Te, Ta, Tc) are converted to R-parameters (Rg, Rk, Ra), which are more suitable for interpolation since they are normalised with respect to the fundamental period. Again, in order to smooth their trajectories, each R-parameter set is averaged with those of the previous and next frames. Aspiration noise energy ANE trajectories are also smoothed in the same way:

    $$\bar{R}_k = \sum_{i=k-1}^{k+1} R_i \, / \, 3$$
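  • This three-frame averaging amounts to a short moving-average filter over the parameter tracks, e.g.:

```python
import numpy as np

def smooth_tracks(params):
    """Three-frame moving average of a (frames x dims) parameter track;
    the end frames keep their original values."""
    sm = np.array(params, dtype=float)
    sm[1:-1] = (sm[:-2] + sm[1:-1] + sm[2:]) / 3.0
    return sm
```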
  • Once the source parameters (Rg , Rk , Ra , ANE) have been averaged, they are used to recompute smoothed LF derivative glottal waveforms lf (n) and amplitude modulated aspiration noise estimates an (n) to be used as the filter excitation e(n) for resynthesis.
  • In order to synthesise the speech wave, the overlap-add scheme of equation (21) is employed:

    $$\tilde{s}(n) = \sum_{k=1}^{K} w_k(n - k N_k^{sc}) \, sc_k(n - k N_k^{sc}), \quad (21)$$

    where K is the total number of frames and wk is a Hamming window such that

    $$w_k(n) = \begin{cases} 0.54 - 0.46 \cos\left(\frac{2 \pi n}{N_k^{sc}}\right), & 0 \le n \le N_k^{sc} \\ 0, & \text{otherwise} \end{cases}$$

    and sck is a synthetic contribution of length $N_k^{sc} = N_{k-1} + N_k$ generated by

    $$sc_k(n) = e_k(n - N_{k-1}) + \sum_{i=1}^{p} \alpha_i \, sc_k(n - i), \quad n = 1 \ldots N_k^{sc},$$

    so that a k-th synthesis frame of Nk samples is obtained as

    $$\tilde{s}(n + k N_k) = w_{k-1}(n + N_k) \, sc_{k-1}(n + N_k) + w_k(n) \, sc_k(n)$$
  • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
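  • A simplified sketch of this scheme is given below, assuming per-frame excitation segments of contribution length and per-frame filter coefficients are available (both names are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

def ola_synthesis(excitations, alphas, frame_sizes):
    """Overlap-add resynthesis sketch. Each synthetic contribution spans the
    previous and the current frame (length N[k-1] + N[k]), is generated by
    all-pole filtering of the excitation, Hamming-windowed and added at the
    previous frame boundary; the first frame is covered by contribution k=1."""
    N = list(frame_sizes)
    starts = np.concatenate([[0], np.cumsum(N[:-1])]).astype(int)
    out = np.zeros(int(np.sum(N)))
    for k in range(1, len(N)):
        n_sc = N[k - 1] + N[k]
        a = np.concatenate([[1.0], -np.asarray(alphas[k])])  # 1 - sum(alpha_i z^-i)
        sc = lfilter([1.0], a, excitations[k][:n_sc])        # all-pole filtering
        w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n_sc) / n_sc)
        pos = starts[k - 1]                                  # overlap previous frame
        out[pos:pos + n_sc] += w * sc
    return out
```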
  • 1.4 Pitch and Time-Scale Modification
  • Due to the explicit and independent modelling of the fundamental period and the interpolation capabilities of the employed vocal tract and glottal source parameterisations, pitch and timescale modifications are easily implemented within the JEAS framework.
  • Both pitch and time-scale transformations are based on a parameter trajectory interpolation approach, where the first task involves calculating the number of frames in a particular segment required to achieve the desired modifications. Once the modified number of frames has been calculated, frame size contours, excitation and vocal tract parameter trajectories are resampled at the modified number of frames using, for example, cubic spline interpolation. Because JEAS modelling is pitch-synchronous, the frame sizes correspond with the pitch periods in voiced segments while they are fixed in unvoiced segments. Due to their better interpolation characteristics, LSF coefficients and R-parameters are employed during pitch and time-scale transformations to represent the vocal tract and glottal source respectively, in addition to aspiration ANE and Gaussian GNE noise energies.
  • Time-scale modification is carried out by increasing or decreasing the number of frames per segment and interpolating the parameter tracks accordingly. For example, in order to increase the duration of a voiced segment of f frames by 25%, the modified number of frames is calculated as mf = f + 0.25f. Then, the f-point pitch period contour is resampled at the new set of uniformly spaced mf points as shown in Figure 17. This way, the contour of the fundamental period, i.e. the intonation, is preserved while its variation is slowed down. The same resampling needs to be applied to each of the LSF coefficient, R-parameter and ANE tracks, to synthesise time-modified speech. Unvoiced segments can also be time-scaled using the described procedure. In this case, the excitation parameter trajectories to resample are the energies of the Gaussian noise source GNE.
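  • A hedged sketch of this trajectory resampling (using cubic-spline interpolation, as suggested above) might look as follows, applied to any (frames x dimensions) parameter track:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def time_scale(tracks, factor):
    """Resample a (frames x dims) parameter track at a modified number of
    frames, e.g. factor=1.25 lengthens the segment by 25% while preserving
    the contour shape."""
    f = tracks.shape[0]
    mf = int(round(f * factor))
    cs = CubicSpline(np.linspace(0.0, 1.0, f), tracks, axis=0)
    return cs(np.linspace(0.0, 1.0, mf))
```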
  • Pitch can be altered by simply multiplying the fundamental period contour by a scaling factor.
  • For example, if a given pitch period contour of f frames T = {T1, T2, ..., Tf} is multiplied by 0.5, speech synthesised with the modified contour T' = 0.5T = {T'1, T'2, ..., T'f} would be perceived to have twice the original fundamental frequency. However, its duration would also be perceived to be half the original. Scaling the fundamental periods involves modifying the frame sizes and, as a consequence, the segment durations. For this reason, the number of frames in a segment also needs to be modified when scaling pitch if its duration is to be maintained. The modified number of frames mf at the scaled fundamental periods whose duration approximates the original can be calculated as

    $$mf = f + \frac{(\bar{T} - \bar{T}')\, f}{\bar{T}'},$$

    where T̄ is the original mean fundamental period and T̄' is the scaled mean fundamental period. Once mf has been calculated, the scaled pitch period contour T', LSF coefficients, R-parameters and ANE trajectories must be resampled at the new number of frames before resynthesising the pitch-modified speech wave.
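  • As a worked example of this relation (with illustrative values):

```python
# Halving the mean fundamental period of a 100-frame voiced segment
# while keeping its duration approximately constant:
f = 100
T_mean = 0.008                                  # original mean period: 8 ms
T_mean_scaled = 0.5 * T_mean                    # scaled mean period: 4 ms
mf = round(f + (T_mean - T_mean_scaled) * f / T_mean_scaled)
# mf = 200 frames: 200 * 4 ms = 0.8 s, matching the original 100 * 8 ms
```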
  • 2. Voice Conversion
  • In this section, the use of the JEAS glottal source parameterisation and continuous probabilistic linear transformations is explored for voice source conversion, and the performance of the resulting JEAS Voice Conversion framework is compared against that of a conventional Sinusoidal VC system (H. Ye and S. Young, High Quality Voice Morphing, in Proc. ICASSP, 2004), referred to as PSHM. The first section details the speech model and feature transformation techniques employed in the JEAS VC implementation. Objective measurement of its spectral envelope and voice source conversion performance and subjective evaluation of the recognizability and quality of the converted output are presented next.
  • 2.1 JEAS Voice Conversion
  • The spectral envelope and glottal waveform transformation methods employed within JEAS voice conversion are described next. While spectral envelope conversion is done in a way similar to the well-known sinusoidal voice conversion implementation, the main advantage of JEAS Modelling, i.e. the parameterisation of the voice source, allows the source characteristics to be also transformed to match the target. As well as offering the potential for improved fidelity in the target identity, this also avoids the need for conventional residual prediction methods. In addition, because the JEAS parameterisation does not involve a magnitude and phase division of the spectrum, the artifacts due to converted magnitude and phase mismatches are not produced and, thus, the use of additional techniques, such as phase prediction, is not required.
  • 2.1.1 Spectral Envelope Conversion
  • The jointly estimated JEAS all-pole vocal tract filter coefficients {α1 ... αp} are converted to Bark scaled LSF parameters for the transformation of the JEAS spectral envelopes. First, the linear frequency response of the jointly estimated vocal tract filter is calculated. This is resampled according to the Bark scale using, for example, the well-known cubic spline interpolation technique. The warped all-pole filter coefficients are then computed by applying, for example, the conventional Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum. Then, the filter coefficients are transformed into LSF for conversion.
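  • A sketch of this warping step is shown below. The Bark mapping bark = 6·arcsinh(f/600) is one of several conventions in the literature and is an assumption here, and the Toeplitz solve plays the role of the Levinson-Durbin recursion:

```python
import numpy as np
from scipy.signal import freqz
from scipy.interpolate import CubicSpline
from scipy.linalg import solve_toeplitz

def bark_warped_lpc(alphas, p, fs, n_freq=512):
    """Resample the vocal tract filter response on a Bark-spaced grid and
    re-derive all-pole coefficients from the autocorrelation of the warped
    power spectrum."""
    w, h = freqz([1.0], np.concatenate([[1.0], -alphas]), worN=n_freq, fs=fs)
    bark = 6.0 * np.arcsinh(w / 600.0)                  # assumed Bark formula
    grid = np.linspace(bark[0], bark[-1], n_freq)       # uniform in Bark
    warped = CubicSpline(bark, np.abs(h))(grid)
    r = np.fft.irfft(warped ** 2)[:p + 1]               # autocorrelation sequence
    return solve_toeplitz(r[:p], r[1:p + 1])            # warped all-pole coeffs
```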
  • A continuous probabilistic linear transformation function is employed to convert the LSF spectral envelopes. Gaussian Mixture Models (GMMs) are used to describe the source and target feature vector spaces, classify them into M classes and train class-specific linear transformations. A weighted sum of the linear transformations is then employed to convert each feature vector x:

    $$F(x) = \sum_{m=1}^{M} \lambda_m(x) \, W_m \bar{x},$$

    where $\bar{x}$ is the extended feature vector $\bar{x} = [x^T \; 1]^T$ and $\lambda_m(x)$ is the interpolation weight of transformation matrix Wm, its value given by the probability of vector x belonging to class Cm:

    $$\lambda_m(x) = P(C_m \mid x) = \frac{\alpha_m \, \mathcal{N}(x; \mu_m, \Sigma_m)}{\sum_{i=1}^{M} \alpha_i \, \mathcal{N}(x; \mu_i, \Sigma_i)},$$

    αm, µm and Σm being the weights, means and covariances of the GMM components respectively and N() representing the normal distribution. The transformation matrices Wm are estimated using parallel source and target training data and a least square error criterion.
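  • The conversion function itself is compact; a sketch using a trained scikit-learn GMM and pre-estimated transformation matrices W (the least-squares training of W is omitted):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def convert_vector(x, gmm, W):
    """Weighted-transform conversion: posterior-weighted sum of per-class
    linear transforms applied to the extended vector [x; 1]. gmm is a
    GaussianMixture fitted on source vectors; W has shape (M, d, d + 1)."""
    lam = gmm.predict_proba(x[None, :])[0]   # lambda_m(x) = P(C_m | x)
    x_ext = np.append(x, 1.0)                # extended feature vector
    return sum(l * (Wm @ x_ext) for l, Wm in zip(lam, W))
```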
  • After conversion, the new LSF parameters are transformed to all-pole filter coefficients and resampled back to the linear scale before synthesis. Because the use of linear transformations broadens the formants of the converted speech, a perceptual post-filter is applied to narrow the formant bandwidths, deepen the spectral valleys and sharpen the formant peaks.
  • Figure 18 illustrates the JEAS vs. conventional PSHM spectral envelopes, where it can be seen that the PSHM envelopes capture the spectral tilt, whereas in JEAS it is encoded by the glottal waveforms instead. In addition, whilst both methods manage to represent the most important formants, small differences exist in their amplitudes, frequencies and/or bandwidths.
  • 2.1.2 Glottal Waveform Conversion
  • Previous work on glottal waveform conversion has demonstrated that the quantization of glottal parameters is possible and capable of capturing voice source quality differences. For example, Childers et al. (Glottal source modelling for voice conversion. Speech Communication, 16:127-138, 1995) built 32-entry codebooks of polynomial voice source parameters from sentences produced with different voice qualities and managed to achieve conversions between modal, vocal fry, breathy, rough, falsetto, whisper and hoarse phonations. However, experiments involving transformations between more similar phonations, i.e. different modal speakers, or alternative conversion methods have not been explored yet. The use of LF glottal parameterisations has not been investigated either.
  • The glottal waveform morphing approach adopted within JEAS voice conversion employs Continuous Probabilistic Linear Transformations to map glottal LF parameters of different modal male and female speakers, which are the most commonly used speaker types in voice conversion applications.
  • Continuous probabilistic linear transformations have been chosen for being the most robust and efficient approach found to convert spectral envelopes. The limitations of the codebook-based conversion methods for envelope transformations, i.e. the discontinuities caused by the use of a discrete number of codebook entries, can also be extrapolated to the modification of glottal waveforms. Thus, the use of continuous probabilistic modelling and transformations is expected to achieve better glottal conversions too.
  • The feature vectors employed to convert the glottal source characteristics are derived from the JEAS model parameters linked to the voice source of every pitch period, i.e. the glottal excitation strength E e and T-parameters (Tp, Te, Ta, Tc) obtained from the LF fitting procedure and the energy (ANE) of the aspiration noise estimate used to adjust that of the modelled pitch-synchronous amplitude modulated Gaussian noise. In order to normalise the To dependent T-parameters for conversion, they are transformed into R-parameters (Rg , Rk, Ra), resulting in the five-dimensional feature vector (E e, R g, Rk , Ra , ANE) for glottal waveform conversion.
  • As it is shown in Figure 19, the described glottal conversion approach is capable of bringing the source feature vector parameter contours closer to the target which, as a consequence, also produces converted glottal waveforms more similar to the target. In particular, Figure 19 shows the linear transformation of LF Glottal Waveforms: a) source, target and converted derivative glottal LF waves; b) source, target and converted trajectories of the glottal feature vector parameters (Ee, Rg, Rk, Ra, ANE).
  • 2.2 An experiment: Comparison between a conventional sinusoidal voice conversion method and the JEAS Voice Conversion method
  • Next, an experiment is described in which the performance of a conventional sinusoidal voice conversion method (H. Ye and S. Young, High Quality Voice Morphing, in Proc. ICASSP, 2004), referred to as PSHM, is compared to the performance of the JEAS method. Both methods have been evaluated in a conversion task based on the VOICES database (A. Kain, High Resolution Voice Transformation, PhD thesis, Oregon Health and Science University, 2001). Specifically designed for voice conversion purposes, the corpus is composed of 3 instances of 50 phonetically rich sentences spoken by 10 speakers (5 male, 5 female), i.e. a total of 150 utterances per speaker. The speech data was recorded using a 'mimicking' approach, which resulted in a natural time-alignment between the identical sentences produced by the different speakers and factored out the prosodic cues of speaker identity to some extent. Glottal closure instants derived from laryngograph signals are also provided for each sentence, and have been used for both PSHM and JEAS pitch-synchronous analysis. Four different voice conversion experiments have been investigated: male-to-male (MM), male-to-female (MF), female-to-male (FM) and female-to-female (FF) transformations. The first 120 sentences are used for training and the remaining 30 for testing each speaker pair conversion.
  • LSF spectral vectors of order 30 have been employed throughout the conversion experiments, to train 8 linear spectral envelope transforms between each source and target speaker pair using the parallel VOICES training data. This number has been chosen for being capable of achieving small spectral distortion ratios while still generalising to the test data. Aligned source-target vector pairs were obtained by applying forced alignment to mark sub-phone boundaries and using Dynamic Time Warping to further constrain their time alignment. For residual and phase prediction, target GMMs of 40 classes and codebooks of 40 entries have been built. Finally, glottal waveform conversions have also been carried out using 8 linear transforms per speaker pair. Objective and subjective evaluations have been used to compare the performance of the two methods.
  • 2.2.1 Objective Evaluation 2.2.1.1 Spectral Envelope Conversion
  • Because the linear spectral envelope transformations are actually applied to LSF vectors, their conversion performance can be easily evaluated by comparing source, target and converted LSF vector distances. If the distance between two LSF vectors lsf1 and lsf2 is defined as

    $$D_{LSF}(lsf_1, lsf_2) = \|lsf_1 - lsf_2\| = \sqrt{(lsf_1 - lsf_2)^T (lsf_1 - lsf_2)},$$

    the following distortion ratio R_LSF can be used as an objective measure of how close the source vectors have been converted into the target:

    $$R_{LSF} = \frac{\sum_{t=1}^{L} D_{LSF}(lsf_{conv}(t), lsf_{tgt}(t))}{\sum_{t=1}^{L} D_{LSF}(lsf_{src}(t), lsf_{tgt}(t))} \cdot 100,$$

    where lsf_src(t), lsf_tgt(t) and lsf_conv(t) are the source, target and converted LSF vectors respectively and the summation is computed over the time-aligned test data, L being the total number of test vectors after time alignment. Note that a 100% distortion ratio corresponds to the distortion between the source and the target.
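  • This ratio is straightforward to compute over time-aligned LSF arrays, e.g.:

```python
import numpy as np

def r_lsf(lsf_src, lsf_tgt, lsf_conv):
    """Distortion ratio over time-aligned (L x p) LSF arrays; 100% means the
    converted vectors are as far from the target as the source vectors."""
    d = lambda a, b: np.linalg.norm(a - b, axis=1)   # per-frame D_LSF
    return 100.0 * d(lsf_conv, lsf_tgt).sum() / d(lsf_src, lsf_tgt).sum()
```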
  • RLSF ratios have been computed for the PSHM and JEAS spectral envelope conversions on the VOICES test set. Figure 20 shows the obtained results. Although the differences are small, JEAS has been found to perform slightly better than PSHM with LSF distortion ratios 3% smaller in all conversion tasks overall. This might be due to the fact that JEAS spectral envelopes do not encode spectral tilt information, which reduces the LSF variations caused by tilt differences resulting in more accurate mappings.
  • 2.2.1.2 Voice Source Conversion
  • Similar objective distortion measures can also be used to evaluate the conversion of the voice source characteristics, i.e. Residual Prediction and Glottal Waveform Conversion in the PSHM and JEAS implementations respectively.
  • Residual Prediction reintroduces the target spectral details not captured by spectral envelope conversion, bringing as a result the converted speech spectra closer to the target. Glottal Waveform Conversion, on the other hand, maps time-domain representations of the glottal waveforms which in the frequency domain result in better matching glottal formants and spectral tilts of the converted spectra. Whilst the methods differ, their spectral effect is similar, i.e, they aim to reduce the differences between the converted and the target speech spectra.
  • One way to evaluate whether the voice source conversion methods achieve the desired effect is to measure the log spectral distances (LSD) between the converted and target spectra before and after voice source conversion. The RMS log spectral distance between two spectra is defined as

    $$D_{LSD}(S_1, S_2) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(10 \log_{10} amp_k^1 - 10 \log_{10} amp_k^2\right)^2},$$

    where {amp_k} are the harmonic amplitudes resampled from spectrum S at K points on the Bark frequency scale (K has been set to 100 points in this work). Then, a distortion ratio R_LSD similar to R_LSF can be used to compare the converted-to-target log spectral distances with and without voice source conversion:

    $$R_{LSD} = \frac{\sum_{t=1}^{L} D_{LSD}(S_{conv}(t), S_{tgt}(t))}{\sum_{t=1}^{L} D_{LSD}(S_{orig}(t), S_{tgt}(t))} \cdot 100,$$

    where Sconv(t) and Sorig(t) are the converted spectra with and without voice source conversion respectively and Stgt(t) is the target spectrum. Thus, a 100% ratio corresponds to the distortion between spectral envelope converted spectra without voice source transformation and the target spectra.
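  • The distance itself reduces to a few lines, given two sets of harmonic amplitudes resampled at K Bark-scale points:

```python
import numpy as np

def d_lsd(amp1, amp2):
    """RMS log spectral distance between two sets of harmonic amplitudes
    resampled at K points on the Bark scale."""
    diff = 10.0 * np.log10(amp1) - 10.0 * np.log10(amp2)
    return np.sqrt(np.mean(diff ** 2))
```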
  • Figure 21 illustrates R LSD ratios computed for Residual Prediction and Glottal Waveform Conversion on the test set. Results show that both voice source conversion techniques manage to reduce the distortions between the converted and target speech spectra. Residual Prediction performs slightly better, mainly because the algorithm is designed to predict residuals which minimise the log spectral distance represented in R LSD. In contrast, glottal waveform conversion is trained to minimise the glottal parameter conversion error over the training data and not the log spectral distance. Nevertheless, both methods are successful in bringing the converted spectra close to the target.
  • 2.2.2 Subjective Evaluation
  • In order to compare the PSHM and JEAS voice conversion systems perceptually, a listening test was carried out to check their performance in terms of recognizability and quality. 12 subjects took part in the perceptual study, which consisted of two parts.
  • The first part was an ABX test in which subjects were presented with PSHM-converted (A), JEAS-converted (B) and target (X) utterances and were asked to choose the speech sample A or B they found sounded more like the target X in terms of speaker identity. Spectral envelopes and voice source characteristics were transformed with the methods described above for each system, i.e. spectral envelope conversion, residual and phase prediction were used for PSHM transformations and spectral envelope and glottal waveform conversion for JEAS transformations. In addition, the prosody of the target was employed to synthesise the converted sentences in order to normalise the pitch, duration and energy differences between source and target speakers for the perceptual comparison. 10 utterances of each conversion type (MM, MF, FM, FF) were presented. The order of the samples in terms of conversion type and conversion system was randomised. Informal listening of the utterances transformed using the PSHM and JEAS conversion systems revealed that it was often very difficult to convincingly choose between systems in terms of speaker identity. For this reason, subjects were also allowed to select a 'NO STRONG PREFERENCE' option when they found it difficult to choose or did not have a strong preference towards one of the presented A or B speech samples.
  • Figure 22 shows the results of the ABX test. In all conversion types, the JEAS-converted samples are preferred over the PSHM-converted ones overall, but the preference difference varies depending on the type of conversion, being for example almost the same for FM transformations. However, the 'NO STRONG PREFERENCE' (NSP) option has been selected almost as often as the JEAS-converted utterances in general, which reveals that subjects found it really difficult to distinguish between conversion systems in terms of speaker identity. Because the most important speaker identifying cues, i.e. spectral envelopes, are transformed using the same method in the two conversion implementations, it is expected that both systems should perform equally in terms of speaker recognizability. In addition, the obtained results show that the Residual Prediction and Glottal Waveform Conversion techniques are also comparable in terms of perceptual speaker identity transformation.
  • The second listening test aimed at determining which system produces speech with a higher quality. Subjects were presented with PSHM and JEAS converted speech utterance pairs and asked to choose the one they thought had a better speech quality. Results are illustrated in Figure 23. There is a clear preference for the sentences converted using the JEAS method, chosen 75.7% of the time on average, which stems from the clearly distinguishable quality difference between the PSHM and JEAS transformed samples. Utterances obtained after PSHM conversion have a 'noisy' quality caused by phase discontinuities which still exist despite Phase Prediction. Comparatively, JEAS converted sentences sound much smoother. This quality difference is also thought to have slightly biased the preference for JEAS conversion in the ABX test.
  • Among other applications, the voice conversion method and device of the present invention are applicable to frameworks requiring voice quality transformations. As one such application, their use to repair the deviant voice source characteristics of tracheoesophageal speech can be mentioned.
  • The invention is obviously not limited to the specific embodiments described herein, but also encompasses any variations that may be considered by any person skilled in the art (for example, as regards the choice of components, configuration, etc.), within the general scope of the invention as defined in the appended claims.

Claims (13)

  1. A method of converting a source speaker's speech signal into a converted voice signal, which comprises the steps of:
    - a stage of training, in which:
    - given a training database of parallel source and target data, for each pitch period of said training database:
    - modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of Liljencrants-Fant LF parameters, said set of LF parameters comprising an excitation strength parameter Ee and a set of T-parameters Tp, Te, Ta, Tc modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients α1 ... αp;
    - converting said T-parameters Tp, Te, Ta, Tc into R-parameters Rg, Rk, Ra;
    - converting said all-pole vocal tract filter coefficients α1... αp into line spectral frequencies in Bark scale lsf1... lsfp;
    - defining a glottal vector G to be converted;
    - defining a vocal tract vector LSF to be converted, said vocal tract vector LSF comprising said line spectral frequencies in Bark scale lsf1... lsfp;
    - applying wavelet denoising to obtain an estimate of a glottal aspiration noise;
    - from the set of vocal tract vectors LSF obtained for each pitch period of the said training database, estimating a vocal tract continuous probabilistic linear transformation function using the least square error criterion;
    the method being characterised in that said stage of modelling further comprises the steps of:
    - modelling said aspiration noise estimate by modulating zero mean unit variance Gaussian noise with the said modelled glottal waveform and adjusting its energy ANE to match that of the said aspiration noise estimate;
    said glottal vector G to be converted comprising said excitation strength parameter Ee , said R-parameters Rg, Rk , Ra and said energy ANE of the aspiration noise estimate,
    the method further comprising:
    - a stage of conversion, in which a given test speech waveform is modelled and transformed into a set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF';
    - a stage of synthesis, in which a converted speech waveform is synthesised from the said set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF'.
  2. The method according to claim 1, wherein said stage of training further comprises:
    - from the set of glottal vectors G obtained for each pitch period of the said training database, estimating a glottal waveform continuous probabilistic linear transformation function using the least square error criterion.
  3. The method according to either claim 1 or 2, wherein said step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, comprises the steps of:
    - modelling the glottal waveform using the Rosenberg-Klatt model;
    - using convex optimization to obtain a set of Rosenberg-Klatt glottal waveform parameters and the all-pole vocal tract filter coefficients α1 ... αp, wherein said step of using convex optimization comprises a step of adaptive pre-emphasis for estimating and removing a spectral tilt filter contribution from the speech waveform before convex optimization.
  4. The method according to claim 3, wherein said step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, further comprises the steps of:
    - obtaining a derivative glottal waveform by inverse filtering said pitch period using said all-pole vocal tract filter coefficients α1 ... αp;
    - fitting said set of LF parameters to the said inverse filtered derivative glottal waveform by direct estimation and constrained non-linear optimization.
  5. The method according to any preceding claim, wherein said stage of conversion comprises, for each pitch period of said test speech waveform:
    - obtaining a glottal vector G to be converted, said glottal vector comprising an excitation strength parameter Ee , a set of R-parameters Rg , Rk , Ra and the energy ANE of the said aspiration noise estimate;
    - obtaining a vocal tract vector LSF to be converted, said vocal tract vector LSF comprising a set of line spectral frequencies in Bark scale lsf1 ... lsfp;
    - applying said vocal tract continuous probabilistic linear transformation function estimated during the training stage to obtain a converted vocal tract parameter vector LSF';
    - transforming said glottal vector G using said glottal waveform continuous probabilistic linear transformation function estimated during the training stage, thus obtaining a converted glottal vector G' comprising a set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF'.
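  [Editor's illustration, not part of the claim: with the single-class transform of the earlier sketch, the conversion stage reduces to an affine map of the source vector; the probabilistic version would blend several such maps by posterior weight. Hypothetical helper:]

```python
import numpy as np

def convert_vector(v, w, b):
    """Affine conversion, e.g. LSF' = LSF @ W + b or G' = G @ W + b."""
    return np.asarray(v) @ w + b
```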
  6. The method according to claim 5, wherein said stages of obtaining a glottal vector G to be converted and a vocal tract vector LSF to be converted further comprise the steps of:
    - modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter Ee and a set of T-parameters Tp, Te, Ta, Tc modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients α1 ... αp;
    - converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale lsf1 ... lsfp;
    - converting said T-parameters into R-parameters Rg, Rk, Ra;
    - defining a glottal vector G to be converted;
    - defining a vocal tract vector LSF to be converted.
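  [Editor's illustration, not part of the claim: the claim does not state the T-to-R conversion formulas; the sketch below assumes the standard LF-model relations due to Fant, Rg = T0/(2·Tp), Rk = (Te − Tp)/Tp, Ra = Ta/T0, with T0 taken as Tc:]

```python
def t_to_r(tp, te, ta, tc):
    """Convert LF T-parameters to R-parameters (standard textbook relations)."""
    t0 = tc                          # Tc marks the end of the cycle
    rg = t0 / (2.0 * tp)             # normalised glottal frequency
    rk = (te - tp) / tp              # pulse skewness
    ra = ta / t0                     # normalised return-phase duration
    return rg, rk, ra
```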
  7. The method according to either claim 5 or 6, wherein said stage of conversion further comprises a step of post-filtering said converted vocal tract parameter vector LSF'.
  8. The method according to any preceding claim, wherein said stage of synthesis, in which said converted speech waveform is synthesised from the said set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF', comprises the steps of:
    - interpolating the trajectories of said converted parameters Rg', Rk', Ra', ANE', LSF' of each pitch period, thus obtaining a set of interpolated parameters Rg", Rk", Ra", ANE", LSF" comprising interpolated R-parameters Rg", Rk", Ra", interpolated energy ANE" and interpolated vocal tract vector LSF";
    - converting said interpolated vocal tract vector LSF" into all-pole filter coefficient vector A";
    - converting said interpolated R-parameters Rg", Rk", Ra" into interpolated T-parameters Tp", Ta", Te", Tc";
    - for each frame of said test speech waveform, generating an excitation signal ek(n), wherein k denotes the k-th frame.
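  [Editor's illustration, not part of the claim: the claim does not fix the interpolation scheme; the sketch below assumes simple linear interpolation of each per-pitch-period parameter track onto a regular frame grid, yielding the smoothed trajectories Rg", Rk", Ra", ANE", LSF":]

```python
import numpy as np

def interpolate_trajectory(period_times, values, frame_times):
    """values: (n_periods,) scalar track or (n_periods, dim) vector track."""
    values = np.asarray(values, dtype=float)
    if values.ndim == 1:
        return np.interp(frame_times, period_times, values)
    return np.stack([np.interp(frame_times, period_times, values[:, d])
                     for d in range(values.shape[1])], axis=1)
```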
  9. The method according to claim 8, wherein said stage of generating an excitation signal comprises, for each of said frames:
    - if said frame is voiced:
    - from said interpolated T-parameters Tp", Ta", Te", Tc" and said excitation strength parameter Ee, generating an interpolated glottal waveform lfk(n);
    - from said interpolated energy parameter ANE", generating interpolated aspiration noise ank(n);
    - generating said voiced excitation signal ek(n) by adding said interpolated glottal waveform lfk(n) and said interpolated aspiration noise ank(n);
    - if said frame is unvoiced:
    - generating said unvoiced excitation signal ek(n) from a Gaussian noise source gnk(n).
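  [Editor's illustration, not part of the claim: the per-frame excitation generation of claim 9, sketched with assumed names; voiced frames add the interpolated LF pulse and aspiration noise, unvoiced frames draw from a Gaussian noise source:]

```python
import numpy as np

def frame_excitation(voiced, lf_k, an_k, frame_len, rng=None):
    """Voiced: e_k(n) = lf_k(n) + an_k(n); unvoiced: e_k(n) = gn_k(n)."""
    rng = rng or np.random.default_rng()
    if voiced:
        return lf_k + an_k                    # LF pulse plus aspiration noise
    return rng.standard_normal(frame_len)     # Gaussian noise source gn_k(n)
```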
  10. The method according to either claim 8 or 9, wherein said stage of synthesis further comprises:
    - generating a synthetic contribution of each frame by filtering said excitation signal ek(n) with said interpolated all-pole filter coefficient vector A";
    - multiplying said synthetic contribution by a Hamming window, overlapping and adding, in order to generate the converted speech signal.
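  [Editor's illustration, not part of the claim: a minimal overlap-add sketch of claim 10. Each frame's excitation is shaped by the all-pole filter 1/A"(z), multiplied by a Hamming window and overlap-added; the 50% hop and the per-frame filter-state handling are assumptions:]

```python
import numpy as np
from scipy.signal import lfilter

def synthesise(excitations, a_coeffs, frame_len):
    """excitations: list of frame excitations e_k(n); a_coeffs: list of all-pole
    coefficient vectors A" (each with leading 1.0)."""
    hop = frame_len // 2
    out = np.zeros(hop * (len(excitations) - 1) + frame_len)
    win = np.hamming(frame_len)
    for k, (e_k, a_k) in enumerate(zip(excitations, a_coeffs)):
        contrib = lfilter([1.0], a_k, e_k)                   # synthetic contribution
        out[k * hop: k * hop + frame_len] += win * contrib   # window + overlap-add
    return out
```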
  11. A method applicable to voice quality transformations, such as tracheoesophageal speech repair, which comprises the method steps of any preceding claim.
  12. A device comprising means adapted to carry out the steps of the method of any preceding claim.
  13. A computer program code means adapted to perform the steps of the method according to any of claims 1-11, when said program is run on a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a microprocessor, a microcontroller, or any other form of programmable hardware.
EP08804436A 2008-09-19 2008-09-19 Method, device and computer program code means for voice conversion Not-in-force EP2215632B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2008/062502 WO2010031437A1 (en) 2008-09-19 2008-09-19 Method and system of voice conversion

Publications (2)

Publication Number Publication Date
EP2215632A1 EP2215632A1 (en) 2010-08-11
EP2215632B1 2011-03-16

Family

ID=40277465

Family Applications (1)

Application Number Title Priority Date Filing Date
EP08804436A Not-in-force EP2215632B1 (en) 2008-09-19 2008-09-19 Method, device and computer program code means for voice conversion

Country Status (5)

Country Link
EP (1) EP2215632B1 (en)
AT (1) ATE502380T1 (en)
DE (1) DE602008005641D1 (en)
ES (1) ES2364005T3 (en)
WO (1) WO2010031437A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9607610B2 (en) 2014-07-03 2017-03-28 Google Inc. Devices and methods for noise modulation in a universal vocoder synthesizer
US11100940B2 (en) 2019-12-20 2021-08-24 Soundhound, Inc. Training a voice morphing apparatus
US11600284B2 (en) 2020-01-11 2023-03-07 Soundhound, Inc. Voice morphing apparatus having adjustable parameters

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901598A (en) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
ES2364401B2 * 2011-06-27 2011-12-23 Universidad Politécnica de Madrid METHOD AND SYSTEM FOR ESTIMATING PHYSIOLOGICAL PARAMETERS OF PHONATION.
RU2510954C2 (en) * 2012-05-18 2014-04-10 Александр Юрьевич Бредихин Method of re-sounding audio materials and apparatus for realising said method
EP3857541B1 (en) 2018-09-30 2023-07-19 Microsoft Technology Licensing, LLC Speech waveform generation
WO2020174356A1 (en) * 2019-02-25 2020-09-03 Technologies Of Voice Interface Ltd Speech interpretation device and system
CN113780107B (en) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100809368B1 (en) * 2006-08-09 2008-03-05 한국과학기술원 Voice Color Conversion System using Glottal waveform

Also Published As

Publication number Publication date
ES2364005T3 (en) 2011-08-22
DE602008005641D1 (en) 2011-04-28
WO2010031437A1 (en) 2010-03-25
ATE502380T1 (en) 2011-04-15
EP2215632A1 (en) 2010-08-11

Similar Documents

Publication Publication Date Title
EP2215632B1 (en) Method, device and computer program code means for voice conversion
McCree et al. A mixed excitation LPC vocoder model for low bit rate speech coding
Erro et al. Voice conversion based on weighted frequency warping
EP2881947B1 (en) Spectral envelope and group delay inference system and voice signal synthesis system for voice analysis/synthesis
Drugman et al. Glottal source processing: From analysis to applications
US9031834B2 (en) Speech enhancement techniques on the power spectrum
Arslan Speaker transformation algorithm using segmental codebooks (STASC)
JP4294724B2 (en) Speech separation device, speech synthesis device, and voice quality conversion device
Degottex et al. Phase minimization for glottal model estimation
US20180174571A1 (en) Speech processing device, speech processing method, and computer program product
Akande et al. Estimation of the vocal tract transfer function with application to glottal wave analysis
US20050131680A1 (en) Speech synthesis using complex spectral modeling
Cabral et al. Towards an improved modeling of the glottal source in statistical parametric speech synthesis
Cabral et al. Glottal spectral separation for speech synthesis
Cabral et al. Glottal spectral separation for parametric speech synthesis
Al-Radhi et al. Time-Domain Envelope Modulating the Noise Component of Excitation in a Continuous Residual-Based Vocoder for Statistical Parametric Speech Synthesis.
Ohtsuka et al. TRANSLATED PAPER
Del Pozo et al. The linear transformation of LF glottal waveforms for voice conversion.
Ferreira et al. A holistic glottal phase related feature
Kawahara et al. Beyond bandlimited sampling of speech spectral envelope imposed by the harmonic structure of voiced sounds.
Jelinek et al. Frequency-domain spectral envelope estimation for low rate coding of speech
Lenarczyk Parametric speech coding framework for voice conversion based on mixed excitation model
Del Pozo. Voice source and duration modelling for voice conversion and speech repair
Shi et al. A variational EM method for pole-zero modeling of speech with mixed block sparse and Gaussian excitation
Agiomyrgiannakis et al. Towards flexible speech coding for speech synthesis: an LF+ modulated noise vocoder.

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) EPC to a published international application that has entered the European phase (ORIGINAL CODE: 0009012)
17P Request for examination filed (effective date: 20091014)
AK Designated contracting states (kind code of ref document: A1; designated states: AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR)
AX Request for extension of the European patent (extension states: AL BA MK RS)
GRAP Despatch of communication of intention to grant a patent (ORIGINAL CODE: EPIDOSNIGR1)
RTI1 Title (correction): METHOD, DEVICE AND COMPUTER PROGRAM CODE MEANS FOR VOICE CONVERSION
GRAS Grant fee paid (ORIGINAL CODE: EPIDOSNIGR3)
GRAA (Expected) grant (ORIGINAL CODE: 0009210)
AK Designated contracting states (kind code of ref document: B1; designated states: AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR)
REG Reference to a national code: GB, legal event code FG4D
REG Reference to a national code: CH, legal event code EP
REG Reference to a national code: IE, legal event code FG4D
REF Corresponds to ref document 602008005641 (country: DE; date of ref document: 20110428; kind code: P)
REG Reference to a national code: DE, legal event code R096, ref document 602008005641 (effective date: 20110428)
REG Reference to a national code: PT, legal event code SC4A, availability of national translation (effective date: 20110606)
REG Reference to a national code: NL, legal event code VDEP (effective date: 20110316)
REG Reference to a national code: ES, legal event code FG2A, ref document ES 2364005, kind code T3 (effective date: 20110822)
LTIE LT: invalidation of European patent or patent extension (effective date: 20110316)
PG25 Lapsed in contracting states [announced via postgrant information from national office to the EPO] because of failure to submit a translation of the description or to pay the fee within the prescribed time limit: SE, LT, HR, LV, FI, CY, SI, AT, BE, EE, CZ, SK, RO, NL, PL, DK, MT, TR, HU, GR (effective date: 20110316); BG, NO (effective date: 20110616); IS (effective date: 20110716)
PLBE No opposition filed within time limit (ORIGINAL CODE: 0009261)
STAA Information on the status of an EP patent application or granted EP patent (STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT)
26N No opposition filed (effective date: 20111219)
REG Reference to a national code: DE, legal event code R097, ref document 602008005641 (effective date: 20111219)
REG Reference to a national code: IE, legal event code MM4A
REG Reference to a national code: CH, legal event code PL
REG Reference to a national code: ES, legal event code PC2A, owner: FUNDACION CENTRO DE TECNOLOGIAS DE INTERACCION VIS (effective date: 20130604)
PG25 Lapsed in contracting states [announced via postgrant information from national office to the EPO] because of non-payment of due fees: MC (effective date: 20110930); IE, LU (effective date: 20110919); CH, LI (effective date: 20120930)
PGFP Annual fee paid to national office, 7th year of fee payment: GB (payment date: 20140929), PT (20140320), IT (20140922), DE (20140929)
REG Reference to a national code: PT, legal event code MM4A, lapse due to non-payment of fees (effective date: 20160321)
REG Reference to a national code: DE, legal event code R119, ref document 602008005641
GBPC GB: European patent ceased through non-payment of renewal fee (effective date: 20150919)
PG25 Lapsed in contracting states [announced via postgrant information from national office to the EPO] because of non-payment of due fees: IT, GB (effective date: 20150919); PT (effective date: 20160321); DE (effective date: 20160401)
REG Reference to a national code: FR, legal event code PLFP (9th and 10th years of fee payment)
PGFP Annual fee paid to national office, 10th year of fee payment: FR (payment date: 20170926), ES (payment date: 20171010)
PG25 Lapsed in contracting states [announced via postgrant information from national office to the EPO] because of non-payment of due fees: FR (effective date: 20180930); ES (effective date: 20180920)
REG Reference to a national code: ES, legal event code FD2A (effective date: 20191104)