EP2215632B1 - Method, device and computer program code means for voice conversion

Info

Publication number: EP2215632B1
Application number: EP08804436A
Authority: EP (European Patent Office)
Prior art keywords: glottal, parameters, vocal tract, converted, lsf
Legal status: Not-in-force
Other languages: German (de), French (fr)
Other versions: EP2215632A1 (en)
Inventor: María Arantzazu DEL POZO ECHEZARRETA
Current Assignee: Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech
Original Assignee: Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech
Application filed by Fundacion Centro de Tecnologias de Interaccion Visual y Comunicaciones Vicomtech

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316: Speech enhancement by changing the amplitude
    • G10L21/0364: Speech enhancement by changing the amplitude for improving intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, characterised by the process used
    • G10L21/013: Adapting to target pitch
    • G10L2021/0135: Voice conversion or morphing
    • G10L21/04: Time compression or expansion
    • G10L21/057: Time compression or expansion for improving intelligibility
    • G10L2021/0575: Aids for the handicapped in speaking

Definitions

  • The term “approximately” and terms of its family should be understood as indicating values or forms very near to those which accompany the aforementioned term. That is to say, a deviation within reasonable limits from an exact value or form should be accepted, because the person skilled in the art will understand that such a deviation from the values or forms indicated is inevitable due to measurement inaccuracies, etc. The same applies to the term “nearly”.
  • Pitch period means a segment of a speech waveform which comprises one period of the fundamental frequency.
  • Frame means a segment of a speech waveform, which corresponds to a pitch period in voiced parts and to a fixed amount of time in unvoiced parts. In a preferred embodiment of the present invention, which should not be interpreted as a limitation, a frame corresponds to 10 ms in unvoiced parts.
  • Source data refers to a collection of speech waveforms uttered by a source speaker.
  • Target data refers to a collection of speech waveforms uttered by a target speaker.
  • Parallel source and target data refers to a collection of speech waveforms uttered both by the source and the target speakers.
  • Figure 2 shows a schematic diagram of the Joint Estimation Analysis Synthesis (JEAS) model. It is based on a general Source-Filter representation. It employs white Gaussian and amplitude-modulated white Gaussian noise to model the Turbulence and Aspiration Noise components respectively, a digital differentiator for Lip Radiation and an all-pole filter to represent the Vocal Tract. In addition, the Liljencrants-Fant (LF) model is adopted to better capture the characteristics of the derivative glottal wave. Then, in order to estimate the different model component parameterisations from the speech wave, a joint voice source and vocal tract parameter estimation technique based on Convex Optimization is applied.
  • The present method adopts the well-known LF model, which is a four-parameter time-domain model of one cycle of the derivative glottal waveform.
  • Typical LF pulses corresponding to glottal and derivative glottal waves are shown in Figure 5 .
  • $g(n) = \begin{cases} E_0\, e^{\alpha n} \sin(\omega_g n), & 0 \le n < T_e \\ -\frac{E_e}{\varepsilon T_a} \left( e^{-\varepsilon (n - T_e)} - e^{-\varepsilon (T_c - T_e)} \right), & T_e \le n \le T_c \end{cases}$
  • The model consists of two segments: the first one characterises the derivative glottal waveform from the instant of glottal opening to the instant of main excitation T_e, where the amplitude reaches the maximum negative value -E_e.
  • E_0 is a scaling factor used to ensure that the signal has a zero mean.
  • E_e is closely related to the strength of the source excitation and is the main determinant of the intensity of the speech signal. Its variation affects the overall harmonic amplitudes, except the very lowest components, which are more determined by the shape of the pulse.
  • The second segment models the closing or return phase from the main excitation T_e to the instant of full closure T_c using an exponential function.
  • The duration of the return phase is thus determined by T_c - T_e.
  • The main parameter characterising this segment is T_a, which represents the "effective duration" of the return phase. This is defined by the duration from T_e to the point where a tangent fitted at the start of the return phase crosses zero.
  • T_0 corresponds to the fundamental period.
  • T_c is made to coincide with the opening of the following pulse. This fact might suggest that the model does not account for the closed phase of the glottal waveform. However, for reasonably small values of T_a, the exponential function fits closely to the zero line, providing a closed phase without the need for additional control parameters.
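  • As an illustration of the two-segment formulation above, the following Python sketch synthesises one LF derivative glottal pulse from the T-parameters. The implicit LF constants (the return-phase constant ε and the opening-phase growth rate α) are not stated explicitly in the text, so they are found here numerically from the standard LF conditions (amplitude -E_e at T_e and zero net pulse area); the parameter values and root brackets are illustrative assumptions, not values from the patent.

```python
import numpy as np
from scipy.optimize import brentq

def lf_pulse(Ee, Tp, Te, Ta, Tc, fs=16000):
    """One cycle of the LF derivative glottal waveform from T-parameters (seconds)."""
    wg = np.pi / Tp  # angular frequency of the opening branch (glottal formant)

    # Return-phase constant: eps * Ta = 1 - exp(-eps * (Tc - Te))
    eps = brentq(lambda e: e * Ta - 1.0 + np.exp(-e * (Tc - Te)), 1e-9, 10.0 / Ta)

    t1 = np.arange(0.0, Te, 1.0 / fs)   # open phase
    t2 = np.arange(Te, Tc, 1.0 / fs)    # return phase

    def segments(alpha):
        E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))  # forces g(Te) = -Ee
        g1 = E0 * np.exp(alpha * t1) * np.sin(wg * t1)
        g2 = -(Ee / (eps * Ta)) * (np.exp(-eps * (t2 - Te)) - np.exp(-eps * (Tc - Te)))
        return g1, g2

    # Growth rate chosen so the pulse integrates to (approximately) zero,
    # i.e. E0 acts as the zero-mean scaling factor; the bracket is heuristic.
    alpha = brentq(lambda a: np.sum(np.concatenate(segments(a))) / fs, -2000.0, 2000.0)
    return np.concatenate(segments(alpha))

# Hypothetical T-parameters for a modal voice with T0 = 8 ms:
pulse = lf_pulse(Ee=1.0, Tp=0.0033, Te=0.0045, Ta=0.0003, Tc=0.008)
```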
  • The T-parameters are often converted into R-parameters (R_g, R_k, R_a), which are normalised with respect to T_0 and correlate with the most salient glottal phenomena, i.e. the glottal pulse width and the skewness and abruptness of closure:
  • $R_g = T_0 / (2 T_p)$;
  • $R_k = (T_e - T_p) / T_p$;
  • $R_a = T_a / T_0$
  • R_g is a normalised version of the glottal formant frequency F_g, which is defined as the inverse of twice the duration of the opening phase T_p.
  • R_k is the LF parameter which captures glottal asymmetry. It is defined as the ratio between the durations of the closing and opening branches of the glottal pulse, and the larger its value, the more symmetrical the pulse is.
  • The open quotient OQ, i.e. the fraction of the pitch period in which the glottis is open, is positively correlated with R_k and negatively correlated with R_g.
  • The R_a parameter corresponds to the effective "return time" T_a normalised by the fundamental period and captures differences relating to the spectral tilt.
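  • The conversion between T-parameters and R-parameters used during training and synthesis follows directly from the formulas above; a minimal sketch:

```python
def t_to_r(T0, Tp, Te, Ta):
    """LF T-parameters (seconds) to normalised R-parameters."""
    Rg = T0 / (2.0 * Tp)   # normalised glottal formant frequency
    Rk = (Te - Tp) / Tp    # pulse asymmetry (closing/opening branch ratio)
    Ra = Ta / T0           # normalised effective return time (spectral tilt)
    return Rg, Rk, Ra

def r_to_t(T0, Rg, Rk, Ra):
    """Inverse mapping, applied to interpolated R-parameters before synthesis."""
    Tp = T0 / (2.0 * Rg)
    Te = Tp * (1.0 + Rk)
    Ta = Ra * T0
    return Tp, Te, Ta
```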
  • The aim of Source-Filter deconvolution is to obtain estimates of the glottal source and vocal tract filter components from the speech wave.
  • Traditionally, Inverse Filtering (IF) has been the most commonly employed deconvolution method. It is based on calculating a vocal tract filter transfer function, whose inverse is used to obtain a glottal waveform estimate which can then be parameterised.
  • A different approach involves modelling both the glottal source and the vocal tract filter, and developing techniques to jointly estimate the source and tract model parameters from the speech wave.
  • Joint Estimation methods are fully automatic. This is an important condition that a mathematical model aimed at analysis, synthesis and modification of the speech signal should meet. Due to the characteristics of the mathematical voice source and vocal tract descriptions, such an approach is a complex nonlinear problem. For this reason, LP has been more widely deployed as a simpler method to obtain a direct and efficient source-filter parameterisation of the speech signal. Its poor modelling of the voice source has not limited its application in speech coding, where the speech spectrum must be represented efficiently with a small number of parameters; however, it has prevented its use in speech synthesis and transformation applications. Advances in voice conversion and Hidden Markov Model (HMM) speech synthesis in the last few years have emphasized the importance of refined vocoding and, thus, the problem of automatic joint estimation of voice source and vocal tract filter parameters has gained renewed interest.
  • The method employed to obtain the JEAS voice source and vocal tract model parameters from the speech wave follows the second deconvolution approach and is based on the joint estimation of the vocal tract filter and the glottal waveform proposed by Lu and Smith (Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999).
  • Voiced and unvoiced speech segments are processed differently due to their different source characteristics. While the voice source in voiced speech is represented by a combination of the LF and aspiration noise models, white Gaussian noise is used to excite the vocal tract filter in unvoiced frames (see Figure 2). Their different modelling requires a preprocessing step where the voiced and unvoiced speech sections are determined and the glottal closure instants (GCI) of the voiced segments are estimated. Then, the voice source and vocal tract parameters are obtained through joint source-filter estimation and LF re-parameterisation in voiced sections (V), and through standard autocorrelation LP and Gaussian noise energy matching in unvoiced portions (U).
  • An algorithm such as the well-known Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) is used for GCI estimation. It employs the group-delay function in combination with a phase-slope projection method to determine GCI candidates, plus N-best dynamic programming to select the most likely candidates according to a cost function which takes waveform similarity, pitch deviation, normalised energy and deviation from the ideal phase-slope into account.
  • The voicing decision is made based on energy, zero-crossing and GCI information, as in the sketch below. Voiced segments are then processed pitch-synchronously, while unvoiced frames are periodically extracted; in a particular embodiment, they are extracted every 10 ms.
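  • A toy version of such a voicing decision is sketched below; the thresholds are illustrative assumptions, not values from the patent.

```python
import numpy as np

def is_voiced(frame, has_gci, energy_thr=1e-4, zcr_thr=0.25):
    """Voiced/unvoiced decision from frame energy, zero-crossing rate and
    whether DYPSA found glottal closure instants inside the frame."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    return bool(has_gci and energy > energy_thr and zcr < zcr_thr)
```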
  • The method employed by the invention to obtain the JEAS voice source and vocal tract model parameters involves using a voice source model simple enough to allow the source-filter deconvolution to be formulated as a Convex Optimization problem. Then, the derivative glottal waveform obtained by inverse filtering (IF) with the estimated filter coefficients is re-parameterised by LF model fitting.
  • The success of the technique lies in providing a derivative glottal waveform constraint when estimating the vocal tract filter. Because of this, the resulting IF derivative glottal waveform is closer to the true glottal excitation and its fitting to an LF model is less error prone.
  • The joint estimation algorithm models the voice source using the well-known Rosenberg-Klatt (RK) model, which consists of a basic voicing waveform describing the shape of the derivative glottal wave and a low-pass filter, $1/(1 - \mu z^{-1})$ with $\mu > 0$, as shown in Figure 6.
  • OQ denotes the open quotient, i.e. the fraction of the pitch period in which the glottis is open.
  • Source-filter deconvolution via convex optimization is accomplished by minimising the squared error between the modelled and the true derivative glottal waveforms.
  • The derived quadratic program can be solved using a number of existing iterative numerical algorithms.
  • In a particular embodiment, the quadratic programming function of the MATLAB Optimization Toolbox has been employed.
  • The result of the minimization problem is the simultaneous estimation of the RK model parameters a and b and the all-pole filter coefficients α_k.
  • Figure 7 shows a joint estimation example for one pitch period.
  • The described joint estimation process assumes that the closed and open phases are defined, while in practice the parameter n_c which delimits the end of the closed phase and the beginning of the open phase is unknown. Its optimal value is found by uniformly sampling the possible n_c values (empirically shown to vary from 0% to 60% of the pitch period T_0), solving the quadratic problem at each sampled n_c value and choosing the estimate resulting in minimum error.
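  • The following Python sketch conveys the structure of the joint estimation. Because the RK derivative glottal wave is linear in its parameters (here taken as g'(n) = 2an - 3bn² over the open phase, an assumption consistent with the Klatt formulation), source and filter can be fitted simultaneously; an unconstrained least-squares solve stands in for the full quadratic program, and the grid search over n_c mirrors the procedure described above.

```python
import numpy as np

def joint_estimate(frame, p=18, fs=16000):
    """Jointly fit all-pole filter coefficients and RK source parameters (a, b)
    to one pre-emphasised pitch period (assumed to span closure to closure).

    The prediction error s(n) + sum_k alpha_k * s(n-k) is constrained to match
    the RK derivative glottal wave: 0 during the closed phase, 2*a*t - 3*b*t^2
    during the open phase. Everything is linear in (alpha, a, b).
    """
    N = len(frame)
    # Delay matrix: column k-1 holds s(n - k)
    S = np.column_stack([np.concatenate([np.zeros(k), frame[:N - k]])
                         for k in range(1, p + 1)])
    best = None
    for nc in range(0, int(0.6 * N), max(1, N // 50)):  # open-phase onset grid
        t = (np.arange(N) - nc) / fs
        open_phase = np.arange(N) >= nc
        rk = np.column_stack([2 * t * open_phase, -3 * t**2 * open_phase])
        A = np.hstack([-S, rk])     # model: s(n) = -sum alpha_k s(n-k) + g'(n)
        x = np.linalg.lstsq(A, frame, rcond=None)[0]
        err = np.sum((A @ x - frame) ** 2)
        if best is None or err < best[0]:
            best = (err, x, nc)
    err, x, nc = best
    return x[:p], x[p], x[p + 1], nc   # alphas, a, b, closed/open boundary
```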
  • The basic RK voicing waveform of equation (7) does not explicitly model the return phase of the derivative glottal waveform and changes abruptly at the glottal closure instants. For this reason, a low-pass filter is added to the basic model, with the purpose of reducing the abruptness of glottal closure.
  • The filter coefficient μ is responsible for controlling the tilt of the source spectrum.
  • In the original formulation, the spectral tilt filter is separated from the source model and incorporated into the vocal tract model by adding an extra pole to the all-pole filter, as shown in Figure 9.
  • As a consequence, the vocal tract filter coefficients estimated using this formulation also encode the spectral slope information of the voice source.
  • Moreover, the derivative glottal waveforms obtained using this approach fail to adequately capture the variations in the return phase of the glottal source.
  • Instead, the present invention uses adaptive pre-emphasis to estimate and remove the spectral tilt filter contribution from the speech wave before convex optimization.
  • Order-one LP analysis and inverse filtering are applied to estimate and remove the spectral slope from the speech frames under analysis, as sketched below.
  • The effect of adaptive pre-emphasis is illustrated in Figure 10: a) speech spectrum and estimated spectral envelope, b) IF derivative glottal wave and fitted LF waveform, c) IF derivative glottal wave spectrum and fitted LF wave spectrum.
  • The vocal tract filter envelope estimates obtained this way do not encode source spectral tilt characteristics, which are reflected in the closing phase of the resulting derivative glottal waveforms instead. This improves the fitting of the return phase of the LF model and, thus, of the high frequencies of the glottal source.
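  • A minimal sketch of this adaptive pre-emphasis step (the order-1 predictor coefficient is the normalised lag-1 autocorrelation):

```python
import numpy as np
from scipy.signal import lfilter

def adaptive_preemphasis(frame):
    """Estimate the spectral tilt with order-1 LP and remove it by inverse
    filtering with (1 - mu * z^-1), flattening the source spectral slope."""
    r0 = np.dot(frame, frame)
    r1 = np.dot(frame[1:], frame[:-1])
    mu = r1 / r0 if r0 > 0 else 0.0
    return lfilter([1.0, -mu], [1.0], frame), mu
```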
  • The LF model is capable of describing the glottal derivative waveform more accurately than the RK model.
  • However, its more complex nonlinear formulation fails to fulfil the convexity condition, which prevents its use in the joint voice source and vocal tract filter parameter estimation algorithm.
  • Therefore, the RK model is employed during source-filter deconvolution, and the LF model is then used to re-parameterise the derivative glottal wave obtained by inverse filtering the speech waveform with the jointly estimated filter coefficients.
  • LF model fitting is carried out in two steps. First, initial estimates of the LF T-parameters (T_p, T_e, T_a, T_c) and the glottal excitation strength E_e are obtained from the time-domain IF voice source waveform by conventional direct estimation methods. Then, their values are refined using a conventional constrained nonlinear optimization technique.
  • Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
  • Wavelet denoising is used to extract the glottal aspiration noise from the IF derivative glottal wave estimate.
  • The wavelet denoising technique used is Wavelet Packet Analysis, which has been found to obtain more reliable aspiration noise estimates than other techniques employed to identify and separate the periodic and aperiodic components of quasi-periodic signals, such as frequency transform analysis or periodic prediction.
  • Wavelet Packet Analysis is preferably performed at level 4 with the 7th-order Daubechies wavelet, using soft thresholding and the Stein Unbiased Risk Estimate threshold evaluation criterion.
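  • A sketch of this extraction step using PyWavelets is shown below; for brevity a robust universal threshold replaces the Stein Unbiased Risk Estimate criterion preferred in the text, so the thresholding rule is an assumption.

```python
import numpy as np
import pywt  # PyWavelets

def aspiration_noise_estimate(dgw):
    """Split an IF derivative glottal wave into a denoised quasi-periodic part
    and an aspiration noise estimate (level-4 wavelet packets, Daubechies-7,
    soft thresholding)."""
    wp = pywt.WaveletPacket(dgw, wavelet='db7', mode='symmetric', maxlevel=4)
    for node in wp.get_level(4, order='natural'):
        sigma = np.median(np.abs(node.data)) / 0.6745   # robust noise scale
        thr = sigma * np.sqrt(2.0 * np.log(max(len(node.data), 2)))
        node.data = pywt.threshold(node.data, thr, mode='soft')
    denoised = wp.reconstruct(update=False)[:len(dgw)]
    return denoised, dgw - denoised                     # (periodic, noise)
```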
  • Figure 12 shows a typical denoising result: a) original and denoised IF derivative glottal wave, b) noise estimate.
  • The aspiration noise estimate obtained for a particular pitch period during JEAS analysis is parameterised as follows. First, zero-mean unit-variance Gaussian noise is modulated with the already fitted LF waveform for that pitch period. Then, its energy is adjusted to match the energy (ANE) of the aspiration noise estimate. Because a spectral shaping filter has informally been found not to make a perceptual difference, it is not included in the parameterisation.
  • Figure 15 depicts a diagram of the employed aspiration noise modelling approach.
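  • A minimal sketch of this parameterisation, assuming the modulation uses the magnitude of the fitted LF pulse (an implementation choice not fixed by the text):

```python
import numpy as np

def model_aspiration_noise(lf_wave, ane, rng=np.random.default_rng(0)):
    """Modulate zero-mean unit-variance Gaussian noise with the fitted LF
    waveform of the pitch period, then scale it so its energy equals ANE."""
    modulated = rng.standard_normal(len(lf_wave)) * np.abs(lf_wave)
    e = np.dot(modulated, modulated)
    return modulated * np.sqrt(ane / e) if e > 0 else modulated
```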
  • At synthesis time, the source parameters (R_g, R_k, R_a, ANE) are used to generate smoothed LF derivative glottal waveforms lf(n) and amplitude-modulated aspiration noise estimates an(n), which form the filter excitation e(n) for resynthesis.
  • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
  • Both pitch and time-scale transformations are based on a parameter trajectory interpolation approach, where the first task involves calculating the number of frames in a particular segment required to achieve the desired modifications. Once the modified number of frames has been calculated, frame size contours, excitation and vocal tract parameter trajectories are resampled at the modified number of frames using, for example, cubic spline interpolation. Because JEAS modelling is pitch-synchronous, the frame sizes correspond to the pitch periods in voiced segments, while they are fixed in unvoiced segments. Due to their better interpolation characteristics, LSF coefficients and R-parameters are employed during pitch and time-scale transformations to represent the vocal tract and glottal source respectively, in addition to the aspiration (ANE) and Gaussian (GNE) noise energies.
  • Pitch can be altered by simply multiplying the fundamental period contour by a scaling factor.
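  • Both operations reduce to a few lines; the sketch below resamples any per-frame parameter track (R-parameters, noise energies or LSF vectors, one row per frame) to a modified number of frames, and scales the fundamental period contour:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_trajectory(values, n_frames_out):
    """Cubic-spline resampling of a per-frame parameter trajectory; `values`
    may be 1-D (scalar track) or 2-D (one parameter vector per frame)."""
    values = np.asarray(values, dtype=float)
    x = np.linspace(0.0, 1.0, len(values))
    return CubicSpline(x, values)(np.linspace(0.0, 1.0, n_frames_out))

def scale_pitch(t0_contour, factor):
    """Scale the fundamental period contour; e.g. factor 0.5 doubles F0."""
    return np.asarray(t0_contour, dtype=float) * factor
```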
  • A key advantage of JEAS modelling, i.e. of the parameterisation of the voice source, is that the source characteristics can also be transformed to match the target. This avoids the need for conventional residual prediction methods.
  • Furthermore, because the JEAS parameterisation does not involve a magnitude and phase division of the spectrum, the artifacts due to converted magnitude and phase mismatches are not produced and, thus, the use of additional techniques, such as phase prediction, is not required.
  • The jointly estimated JEAS all-pole vocal tract filter coefficients {α_1 ... α_p} are converted to Bark-scaled LSF parameters for the transformation of the JEAS spectral envelopes.
  • First, the linear frequency response of the jointly estimated vocal tract filter is calculated. This is resampled according to the Bark scale using, for example, the well-known cubic spline interpolation technique.
  • The warped all-pole filter coefficients are then computed by applying, for example, the conventional Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum. Then, the filter coefficients are transformed into LSFs for conversion.
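  • The chain from jointly estimated filter coefficients to Bark-scaled LSFs can be sketched as follows. The Bark warping function (6·arcsinh(f/600)), the FFT sizes and the use of scipy's Toeplitz solver in place of an explicit Levinson-Durbin recursion are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def bark(f):
    return 6.0 * np.arcsinh(f / 600.0)   # one common Bark approximation

def poly2lsf(a):
    """Line spectral frequencies (radians) of a = [1, a1, ..., ap]."""
    az = np.concatenate([a, [0.0]])
    P, Q = az + az[::-1], az - az[::-1]   # symmetric / antisymmetric polynomials
    ang = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])  # drop trivial roots

def bark_warped_lsf(alphas, p_out=30, fs=16000, nfft=512):
    """Resample the vocal tract response on the Bark scale, re-fit an all-pole
    model to the warped power spectrum and convert it to LSFs."""
    f = np.linspace(0.0, fs / 2.0, nfft)
    _, H = freqz([1.0], np.concatenate([[1.0], alphas]), worN=f, fs=fs)
    b_uniform = np.linspace(0.0, bark(fs / 2.0), nfft)   # uniform Bark grid
    f_warp = 600.0 * np.sinh(b_uniform / 6.0)            # back to Hz for interp
    power = np.interp(f_warp, f, np.abs(H) ** 2)
    r = np.fft.irfft(power)[:p_out + 1]                  # autocorrelation sequence
    a = solve_toeplitz((r[:p_out], r[:p_out]), -r[1:p_out + 1])
    return poly2lsf(np.concatenate([[1.0], a]))
```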
  • A continuous probabilistic linear transformation function is employed to convert the LSF spectral envelopes.
  • Gaussian Mixture Models are used to describe the source and target glottal feature vector spaces, classify them into M classes and train class-specific linear transformations.
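  • The transformation itself, applicable to vocal tract LSF vectors and glottal vectors alike, is a posterior-weighted sum of class-specific affine maps. A minimal sketch, assuming the per-class matrices A_m and offsets b_m have already been estimated by least squares from aligned source/target pairs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cplt_convert(x, gmm: GaussianMixture, A, b):
    """Continuous probabilistic linear transformation of one feature vector:
    F(x) = sum_m P(m|x) * (A[m] @ x + b[m]), with P(m|x) from a GMM fitted on
    the source feature space. Shapes: A is (M, d, d); b is (M, d)."""
    post = gmm.predict_proba(x.reshape(1, -1))[0]   # class posteriors P(m|x)
    return sum(post[m] * (A[m] @ x + b[m]) for m in range(len(post)))
```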
  • The new LSF parameters are transformed to all-pole filter coefficients and resampled back to the linear scale before synthesis. Because the use of linear transformations broadens the formants of the converted speech, a perceptual post-filter is applied to narrow the formant bandwidths, deepen the spectral valleys and sharpen the formant peaks.
  • Figure 18 illustrates the JEAS vs. conventional pitch-synchronous harmonic model (PSHM) spectral envelopes, where it can be seen that the PSHM envelopes capture the spectral tilt, whereas in JEAS it is encoded by the glottal waveforms instead.
  • While both methods manage to represent the most important formants, small differences exist in their amplitudes, frequencies and/or bandwidths.
  • The glottal waveform morphing approach adopted within JEAS voice conversion employs Continuous Probabilistic Linear Transformations to map glottal LF parameters of different modal male and female speakers, which are the most commonly used speaker types in voice conversion applications.
  • Continuous probabilistic linear transformations have been chosen for being the most robust and efficient approach found to convert spectral envelopes.
  • The limitations of codebook-based conversion methods for envelope transformations, i.e. the discontinuities caused by the use of a discrete number of codebook entries, can also be extrapolated to the modification of glottal waveforms.
  • Thus, the use of continuous probabilistic modelling and transformations is expected to achieve better glottal conversions too.
  • The feature vectors employed to convert the glottal source characteristics are derived from the JEAS model parameters linked to the voice source of every pitch period, i.e. the glottal excitation strength E_e and T-parameters (T_p, T_e, T_a, T_c) obtained from the LF fitting procedure, and the energy (ANE) of the aspiration noise estimate used to adjust that of the modelled pitch-synchronous amplitude-modulated Gaussian noise.
  • Figure 19 shows the linear transformation of LF glottal waveforms: a) source, target and converted derivative glottal LF waves; b) source, target and converted trajectories of the glottal feature vector parameters (E_e, R_g, R_k, R_a, ANE). As can be seen, the described glottal conversion approach is capable of bringing the source feature vector parameter contours closer to the target, which, as a consequence, also produces converted glottal waveforms more similar to the target.
  • The speech data was recorded using a 'mimicking' approach, which resulted in a natural time-alignment between the identical sentences produced by the different speakers and factored out the prosodic cues of speaker identity to some extent.
  • Glottal closure instants derived from laryngograph signals are also provided for each sentence, and have been used for both PSHM and JEAS pitch synchronous analysis.
  • Four different voice conversion experiments have been investigated: male-to-male (MM), male-to-female (MF), female-to-male (FM) and female-to-female (FF) transformations.
  • The first 120 sentences are used for training and the remaining 30 for testing each speaker-pair conversion.
  • LSF spectral vectors of order 30 have been employed throughout the conversion experiments, to train 8 linear spectral envelope transforms between each source and target speaker pair using the parallel VOICES training data. This number has been chosen for being capable of achieving small spectral distortion ratios while still generalising to the test data. Aligned source-target vector pairs were obtained by applying forced alignment to mark sub-phone boundaries and using Dynamic Time Warping to further constrain their time alignment. For residual and phase prediction, target GMMs and codebooks of 40 classes and entries, respectively, have been built. Finally, glottal waveform conversions have also been carried out using 8 linear transforms per speaker pair. Objective and subjective evaluations have been used to compare the performance of the two methods.
  • The spectral envelope conversion performance can be measured with the distortion ratio $R_{LSF} = \frac{\sum_{t=1}^{L} \lVert lsf_{conv}(t) - lsf_{tgt}(t) \rVert}{\sum_{t=1}^{L} \lVert lsf_{src}(t) - lsf_{tgt}(t) \rVert} \times 100$, where lsf_src(t), lsf_tgt(t) and lsf_conv(t) are the source, target and converted LSF vectors respectively, and the summation is computed over the time-aligned test data, L being the total number of test vectors after time alignment. Note that a 100% distortion ratio corresponds to the distortion between the source and the target.
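  • Computing the ratio from time-aligned vector sequences is straightforward; in this sketch each array holds one LSF vector per row:

```python
import numpy as np

def r_lsf(lsf_src, lsf_tgt, lsf_conv):
    """LSF distortion ratio in percent over time-aligned test frames.
    100% reproduces the raw source-to-target distortion; lower is better."""
    num = np.linalg.norm(lsf_conv - lsf_tgt, axis=1).sum()
    den = np.linalg.norm(lsf_src - lsf_tgt, axis=1).sum()
    return 100.0 * num / den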
  • Similar objective distortion measures can also be used to evaluate the conversion of the voice source characteristics, i.e. Residual Prediction and Glottal Waveform Conversion in the PSHM and JEAS implementations respectively.
  • Residual Prediction reintroduces the target spectral details not captured by spectral envelope conversion, bringing as a result the converted speech spectra closer to the target.
  • Glottal Waveform Conversion maps time-domain representations of the glottal waveforms, which in the frequency domain results in better matching glottal formants and spectral tilts of the converted spectra. Whilst the methods differ, their spectral effect is similar, i.e. they aim to reduce the differences between the converted and the target speech spectra.
  • A distortion ratio R_LSD, similar to R_LSF, can be used to compare the converted-to-target log spectral distances (LSD) with and without voice source conversion. In this case, a 100% ratio corresponds to the distortion between the spectral-envelope-converted spectra without voice source transformation and the target spectra.
  • Figure 21 illustrates R LSD ratios computed for Residual Prediction and Glottal Waveform Conversion on the test set. Results show that both voice source conversion techniques manage to reduce the distortions between the converted and target speech spectra. Residual Prediction performs slightly better, mainly because the algorithm is designed to predict residuals which minimise the log spectral distance represented in R LSD . In contrast, glottal waveform conversion is trained to minimise the glottal parameter conversion error over the training data and not the log spectral distance. Nevertheless, both methods are successful in bringing the converted spectra close to the target.
  • the first part was an ABX test in which subjects were presented with PSHM-converted (A), JEAS-converted (B) and target (X) utterances and were asked to choose the speech sample A or B they found sounded more like the target X in terms of speaker identity.
  • Spectral envelopes and voice source characteristics were transformed with the methods described above for each system, i.e. spectral envelope conversion, residual and phase prediction were used for PSHM transformations and spectral envelope and glottal waveform conversion for JEAS transformations.
  • The prosody of the target was employed to synthesise the converted sentences, in order to normalise the pitch, duration and energy differences between source and target speakers for the perceptual comparison.
  • Figure 22 shows the results of the ABX test.
  • The JEAS-converted samples are preferred over the PSHM-converted ones overall, but the preference difference varies depending on the type of conversion, being, for example, almost the same for FM transformations.
  • The 'NO STRONG PREFERENCE' (NSP) option has been selected almost as often as the JEAS-converted utterances in general, which reveals that subjects found it very difficult to distinguish between conversion systems in terms of speaker identity.
  • the second listening test aimed at determining which system produces speech with a higher quality. Subjects were presented with PSHM and JEAS converted speech utterance pairs and asked to choose the one they thought had a better speech quality. Results are illustrated in Figure 23 . There is a clear preference for the sentences converted using the JEAS method, chosen 75.7% of the time on average, which stems from the clearly distinguishable quality difference between the PSHM and JEAS transformed samples. Utterances obtained after PSHM conversion have a 'noisy' quality caused by phase discontinuities which still exist despite Phase Prediction. Comparatively, JEAS converted sentences sound much smoother. This quality difference is also thought to have slightly biased the preference for JEAS conversion in the ABX test.
  • The method and device of voice conversion of the present invention are applicable to frameworks requiring voice quality transformations.
  • In particular, their use to repair the deviant voice source characteristics of tracheoesophageal speech can be mentioned.


Abstract

A method of converting a source speaker's speech signal into a converted speech signal, which comprises a stage of training using a given database of parallel source and target data. For each pitch period, a glottal waveform and a vocal tract filter are modelled to obtain a set of parameters comprising an excitation strength, parameters modelling a glottal waveform, and all-pole vocal tract filter coefficients. A glottal vector to be converted and a vocal tract vector to be converted are defined, an estimate of a glottal aspiration noise is obtained and a vocal tract transformation function is estimated. The stage of modelling comprises modelling said aspiration noise estimate by modulating Gaussian noise with the said modelled glottal waveform and adjusting its energy to match that of the said aspiration noise estimate. The method further comprises a stage of conversion and a stage of synthesis.

Description

    FIELD OF THE INVENTION
  • The present invention relates to methods and systems for voice conversion.
  • STATE OF THE ART
  • Voice Conversion aims at transforming a source speaker's speech to sound like that of a different target speaker. Text-to-speech synthesisers, dialogue systems and speech repair are among the numerous applications which can greatly benefit from the development of voice conversion technology.
  • The most widely used speech signal representations are the Source-Filter Model and the Sinusoidal Model. The Source-Filter representation (G. Fant, Acoustic Theory of Speech Production, ISBN 9027916004) is based on a simple production model composed of a glottal source waveform exciting a time-varying filter loaded at its output by the radiation of the lips. The main challenge in Source-Filter modelling is the estimation of the glottal waveform and vocal tract filter parameters from the speech signal.
  • Among the existing glottal waveform parameterisations, the Liljencrants-Fant (LF) model (The LF-model revisited. Transformations and frequency domain analysis, STL-QPSR, vol. 36, number 2-3, 1995, pages 119-156) has become the model of choice for research on the glottal source. It has been shown to be capable of modelling a wide range of naturally occurring phonations and the effects of its parameter variations are well understood. It exploits the linearity and time-invariance properties of the Source-Filter representation and assumes the commutation of the vocal tract and lip radiation filters to combine the modelling of the source excitation and lip radiation in the parameterisation of the derivative of the glottal waveform.
  • Linear Prediction (LP) is one popular technique used to obtain a combined parameterisation of the glottal source, vocal tract and lip radiation components in a unique all-pole filter H(z). Such a filter is then excited, as shown in Figure 1, by a sequence of impulses spaced at the fundamental period T_0 during voiced speech and by white Gaussian noise during unvoiced speech. If the speech signal were truly the response of an all-pole filter, the LP error or residual would be a train of impulses spaced at the voiced excitation instants and the impulse/noise voice source modelling would be accurate. In practice, however, the LP residual looks more like a white noise signal with larger values around the instants of excitation. While exciting the LP filter with the LP residual results in speech that is indistinguishable from the original, using an impulse train as the voiced excitation produces speech with a very buzzy quality. The strength of LP lies in its ability to automatically estimate a set of filter coefficients which compactly represent the envelope of the speech spectrum, making it popular in applications where the spectral characteristics of the speech wave need to be captured with a small number of parameters. Its main drawback, on the other hand, stems from the over-simplified modelling of the glottal waveform, which prevents its use in systems requiring high-quality speech outputs.
  • As an alternative to LP, H. Lu et al. have proposed a convex optimization method to automatically estimate the vocal tract filter and glottal waveform jointly (Joint estimation of vocal tract filter and glottal source waveform via convex optimization, Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999). The better modelling of the glottal source employed by this approach results in speech which has better quality than that of LP. In addition, the parameterisation of the glottal waveform allows its parametric modification, which can be exploited in voice conversion applications.
  • Sinusoidal Models assume the speech waveform to be composed of the sum of a small number of sinusoids with time-varying amplitudes, frequencies and phases. Such modelling was mainly developed by McAulay and Quatieri (Speech Analysis/Synthesis Based on a Sinusoidal Representation, IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 744-754, 1986) in the mid-1980s and has been shown to be capable of producing high quality speech even after pitch and time-scale transformations. However, because of the high number of sinusoidal amplitudes, frequencies and phases involved, sinusoidal modelling is less flexible than the source-filter representation for modifying spectral features.
  • In order to obtain high-quality converted speech, state-of-the-art voice conversion (VC) implementations mainly employ variations and extensions of the original sinusoidal model. In addition, they generally adopt a source-filter formulation based on LP to carry out spectral transformations.
  • Spectral envelopes are generally encoded in line spectral frequencies (LSF) for voice conversion, since LSFs have been shown to possess very good linear interpolation characteristics and to relate well to formant location and bandwidth. Because the frequency resolution of the human ear is greater at low frequencies than at high frequencies, spectral envelopes are often warped to a non-linear scale, e.g. the Bark scale, taking the non-uniform sensitivity of the human ear into account. Usually, only spectral envelopes of voiced speech segments are transformed, since unvoiced sounds contain little vocal tract information and their spectral envelopes present high variations. Among the existing different spectral envelope conversion techniques, continuous probabilistic linear transformations have been found to be the most robust and efficient approach. These can be obtained through least square error minimisation of parallel source and target training databases or using more general maximum likelihood transformation frameworks (Ye, H. and Young, S. Quality-enhanced Voice Morphing using Maximum Likelihood Transformations, IEEE Audio Speech and Language Processing, vol. 14, no. 4, pp. 1301-1312, 2006). One problem all spectral envelope conversion methods share is the broadening of the spectral peaks, expansion of the formant bandwidths and over-smoothing caused by the averaging effect of the parameter interpolations. This phenomenon makes the converted speech sound slightly muffled. In order to solve this issue, post-filtering is often applied as a post-processing stage to narrow formant bandwidths and suppress the noise in the spectral valleys as in, for example, Ye, H. and Young, S., Quality-enhanced Voice Morphing using Maximum Likelihood Transformations, IEEE Audio Speech and Language Processing, vol. 14, no. 4, pp. 1301-1312, 2006.
  • As for LP residual conversion, sinusoidal VC systems have developed residual prediction and selection methods (D. Suendermann, A. Bonafonte, H. Ney, and H. Hoege, A study on residual prediction techniques for voice conversion, in Proc. ICASSP, 2005, pp. 13-16) based on the correlation between spectral envelope and LP residuals. These methods reintroduce the target spectral detail lost after envelope conversion. Because residuals contain the errors introduced by the LP parameterisation, residual prediction techniques have been found to improve conversion performance. However, LP residuals do not constitute an accurate model of the voice source and residual prediction alone is not capable of modifying the quality of the voice source. This prevents their use in applications requiring voice quality modifications such as, for example, speech repair.
  • The patent application WO 2008/018653 A1 discloses a further voice conversion technique using the Liljencrants-Fant parameters of the glottal wave.
  • SUMMARY OF THE INVENTION
  • It is therefore an object of the present invention to provide a method of voice conversion based on a source-filter model which uses a representation of the glottal source more accurate than LP residuals. This allows the use of continuous probabilistic linear transformations for the conversion of the voice source.
  • In particular, it is an object of the present invention to provide a method of converting a source speaker's speech signal into a converted voice signal, which comprises a stage of training, a stage of conversion and a stage of synthesis.
  • The stage of training comprises, given a training database of parallel source and target data, for each pitch period of said training database: modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter and a set of T-parameters modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients; converting said T-parameters into R-parameters; converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale; defining a glottal vector to be converted; defining a vocal tract vector to be converted, said vocal tract vector comprising said line spectral frequencies in Bark scale; applying wavelet denoising to obtain an estimate of a glottal aspiration noise.
  • The stage of training also comprises, from the set of vocal tract vectors obtained for each pitch period of the said training database, estimating a vocal tract continuous probabilistic linear transformation function using the least square error criterion.
  • The previous stage of modelling further comprises the steps of modelling said aspiration noise estimate by modulating zero mean unit variance Gaussian noise with the said modelled glottal waveform and adjusting its energy to match that of the said aspiration noise estimate. Besides, the glottal vector to be converted comprises said excitation strength parameter, said R-parameters and said energy of the aspiration noise estimate.
  • In the stage of conversion, a given test speech waveform is modelled and transformed into a set of converted parameters.
  • In the stage of synthesis, a converted speech waveform is synthesised from the said set of converted parameters.
  • Preferably, the stage of training further comprises: from the set of glottal vectors obtained for each pitch period of the said training database, estimating a glottal waveform continuous probabilistic linear transformation function using the least square error criterion.
  • The step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, preferably comprises the steps of: modelling the glottal waveform using the Rosenberg-Klatt model; using convex optimization to obtain a set of Rosenberg-Klatt glottal waveform parameters and the all-pole vocal tract filter coefficients, wherein said step of using convex optimization comprises a step of adaptive pre-emphasis for estimating and removing a spectral tilt filter contribution from the speech waveform before convex optimization. Besides, that step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, further comprises the steps of: obtaining a derivative glottal waveform by inverse filtering said pitch period using said all-pole vocal tract filter coefficients; fitting said set of LF parameters to the said inverse filtered derivative glottal waveform by direct estimation and constrained non-linear optimization.
  • The stage of conversion preferably comprises, for each pitch period of said test speech waveform: obtaining a glottal vector to be converted, said glottal vector comprising an excitation strength parameter, a set of R-parameters and the energy of the said aspiration noise estimate; obtaining a vocal tract vector to be converted, said vocal tract vector comprising a set of line spectral frequencies in Bark scale; applying said vocal tract continuous probabilistic linear transformation function estimated during the training stage to obtain a converted vocal tract parameter vector; transforming said glottal vector using said glottal waveform continuous probabilistic linear transformation function estimated during the training stage, thus obtaining a converted glottal vector comprising a set of converted parameters.
  • In particular, those stages of obtaining a glottal vector to be converted and a vocal tract vector to be converted further comprise the steps of: modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter and a set of T-parameters modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients; converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale; converting said T-parameters into R-parameters; defining a glottal vector to be converted; and defining a vocal tract vector to be converted.
  • Preferably, the stage of conversion further comprises a step of post-filtering said converted vocal tract parameter vector.
  • The stage of synthesis, in which said converted speech waveform is synthesised from the said set of converted parameters, preferably comprises the steps of: interpolating the trajectories of said converted parameters of each pitch period, thus obtaining a set of interpolated parameters comprising interpolated R-parameters, interpolated energy and interpolated vocal tract vector; converting said interpolated vocal tract vector into an all-pole filter coefficient vector; converting said interpolated R-parameters into interpolated T-parameters; for each frame of said test speech waveform, generating an excitation signal.
  • Preferably, the stage of generating an excitation signal comprises, for each of said frames: if said frame is voiced: from said interpolated T-parameters and said excitation strength parameter, generating an interpolated glottal waveform; from said interpolated aspiration noise energy parameter, generating interpolated aspiration noise; generating said voiced excitation signal by adding said interpolated glottal waveform and said interpolated aspiration noise. If said frame is unvoiced: generating said unvoiced excitation signal from a Gaussian noise source.
  • Besides, the stage of synthesis further comprises: generating a synthetic contribution of each frame by filtering said excitation signal with said interpolated all-pole filter coefficient vector; multiplying said synthetic contribution by a Hamming window, overlapping and adding, in order to generate the converted speech signal.
  • The present invention also provides a method applicable to voice quality transformations, such as tracheoesophageal speech repair, which comprises at least some of the above-mentioned method steps.
  • It is another object of the present invention to provide a device comprising means for carrying out the above-mentioned method.
  • Finally, it is a further object of the present invention to provide a computer program code means adapted to perform the steps of the method previously mentioned when said program is run on a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a micro-processor, a micro-controller, or any other form of programmable hardware.
  • The advantages of the proposed invention will become apparent in the description that follows.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • To complete the description and in order to provide for a better understanding of the invention, a set of drawings is provided. Said drawings form an integral part of the description and illustrate a preferred embodiment of the invention, which should not be interpreted as restricting the scope of the invention, but rather as an example of how the invention can be embodied. The drawings comprise the following figures:
    • Figure 1 shows a conventional schematic diagram of the LP model.
    • Figure 2 shows a schematic diagram of the joint estimation analysis synthesis (JEAS) model according to an embodiment of the present invention.
    • Figure 3 shows a schematic diagram modelling the glottal wave.
    • Figure 4 shows a schematic diagram modelling the derivative glottal wave.
    • Figure 5 shows typical LF pulses corresponding to glottal and derivative glottal waves.
    • Figure 6 shows a conventional model of the voice source.
    • Figure 7 shows a joint estimation example: a) speech period, b) speech spectrum and jointly estimated spectral envelope, c) inverse filtered residual and jointly estimated RK wave.
    • Figure 8 shows a RK derivative glottal wave.
    • Figure 9 shows a schematic diagram of a conventional modelling of the spectral tilt.
    • Figure 10 shows the effects of adaptive pre-emphasis.
    • Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
    • Figure 12 shows a typical denoising result.
    • Figure 13 shows the standard aspiration noise model parameters.
    • Figure 14 shows the Gaussian noise modulation by an LF waveform.
    • Figure 15 shows a schematic diagram of an aspiration noise modelling approach according to an embodiment of the present invention.
    • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
    • Figure 17 illustrates resampling of the frame size contour.
    • Figure 18 shows the JEAS vs. PSHM spectral envelopes.
    • Figure 19 shows the continuous probabilistic linear transformation of LF glottal waveforms.
    • Figure 20 shows the RLSF distortion ratios of the converted PSHM and JEAS spectral envelopes.
    • Figure 21 shows the R LSD distortion ratios of Residual Predicted (RP) and Glottal Waveform Converted (GWC) spectra.
    • Figure 22 shows the results of the ABX test.
    • Figure 23 shows the results of the quality comparison test.
    DETAILED DESCRIPTION OF THE INVENTION
    Definitions
  • In the context of the present invention, the term "approximately" and terms of its family (such as "approximate", "approximation", etc.) should be understood as indicating values or forms very near to those which accompany the aforementioned term. That is to say, a deviation within reasonable limits from an exact value or form should be accepted, because the expert in the technique will understand that such a deviation from the values or forms indicated is inevitable due to measurement inaccuracies, etc. The same applies to the term "nearly".
  • In the context of the present invention, the following terms are defined as follows:
  • The expression "pitch period" means a segment of a speech waveform which comprises a period of the fundamental frequency.
  • The term "frame" means a segment of a speech waveform, which corresponds to a pitch period in voiced parts and to a fixed amount of time in unvoiced parts. In a preferred embodiment of the present invention, which should not be interpreted as a limitation to the present invention, a frame corresponds to 10 ms in unvoiced parts.
  • The expression "source data" refers to a collection of speech waveforms uttered by a source speaker, while the expression "target data" refers to a collection of speech waveforms uttered by a target speaker. Besides, the expression "parallel source and target data" refers to a collection of speech waveforms uttered both by the source and the target speakers.
  • In this text, the term "comprises" and its derivations (such as "comprising", etc.) should not be understood in an excluding sense, that is, these terms should not be interpreted as excluding the possibility that what is described and defined may include further elements, steps, etc.
  • 1. Joint Estimation Analysis Synthesis
  • A method of speech modelling for the analysis, modification and synthesis of speech is described next. The model is called Joint Estimation Analysis Synthesis (JEAS). Its biggest advantage is the automatic and simultaneous parameterisation of the vocal tract and the voice source, which allows the manipulation not only of spectral envelopes, but of glottal characteristics as well. In addition, it also supports high-quality pitch and time-scale modifications. Next, the employed voice source model and source-filter deconvolution technique are described, together with the way analysis, synthesis and prosodic transformations are implemented.
  • 1.1 Speech modelling
  • Figure 2 shows a schematic diagram of the JEAS model. It is based on a general Source-Filter representation. It employs white Gaussian and amplitude-modulated white Gaussian noise to model the Turbulence and Aspiration Noise components respectively, a digital differentiator for Lip Radiation and an all-pole filter to represent the Vocal Tract. Besides, the Liljencrants-Fant (LF) model is adopted to better capture the characteristics of the derivative glottal wave. Then, in order to estimate the different model component parameterisations from the speech wave, a joint voice source and vocal tract parameter estimation technique based on Convex Optimization is applied.
  • Next, the modelling of the voice source is explained:
  • Numerous parametric models of the glottal source have been proposed in the literature. Despite their differences, they all share many common features and can be described by a small set of parameters. In most cases, they exploit the linearity and time-invariance properties of the Source-Filter representation and assume the commutation of the vocal tract and lip radiation filters to combine the modelling of the source excitation and lip radiation in the parameterisation of the derivative of the glottal waveform as shown in Figure 4.
  • The present method adopts the well-known LF model, which is a four-parameter time-domain model of one cycle of the derivative glottal waveform. Typical LF pulses corresponding to glottal and derivative glottal waves are shown in Figure 5. Mathematically, it can be described as:

    $$g(n) = \begin{cases} E_0\, e^{\alpha n} \sin(\omega_g n), & 0 \le n < T_e \\[4pt] -\dfrac{E_e}{\varepsilon T_a}\left[e^{-\varepsilon (n - T_e)} - e^{-\varepsilon (T_c - T_e)}\right], & T_e \le n < T_c \end{cases} \quad (1)$$
  • The model consists of two segments: the first characterises the derivative glottal waveform from the instant of glottal opening to the instant of main excitation Te, where the amplitude reaches its maximum negative value -Ee. As shown in equation (1), this segment is a sinusoidal function which grows exponentially in amplitude, $F_g = \frac{\omega_g}{2\pi}$ being the frequency of the sine function and α determining the rate of the amplitude increase. E0 is a scaling factor used to ensure that the signal has a zero mean. The timing parameter Tp is related to the sinusoidal frequency through $T_p = \frac{1}{2 F_g}$ and denotes the instant of maximum glottal flow. Ee is closely related to the strength of the source excitation and is the main determinant of the intensity of the speech signal. Its variation affects the overall harmonic amplitudes, except the very lowest components, which are determined more by the shape of the pulse.
  • The second segment models the closing or return phase from the main excitation Te to the instant of full closure Tc using an exponential function. The duration of the return phase is thus determined by Tc - Te. The main parameter characterising this segment is Ta, which represents the "effective duration" of the return phase, defined as the duration from Te to the point where a tangent fitted at the start of the return phase crosses zero. ε⁻¹ is the time constant of the exponential function and can be determined iteratively from Ta, Te and Tc through

    $$\varepsilon = \frac{1}{T_a}\left(1 - e^{-\varepsilon (T_c - T_e)}\right).$$

    T0 corresponds to the fundamental period. Generally, Tc is made to coincide with the opening of the following pulse. This might suggest that the model does not account for the closed phase of the glottal waveform; however, for reasonably small values of Ta, the exponential function fits closely to the zero line, providing a closed phase without the need for additional control parameters.
  • Along with the excitation strength Ee, the LF pulse can be uniquely determined by the T-parameters (Tp, Te, Ta, Tc). These parameters can be easily identified from the estimated derivative glottal wave. Therefore, they are generally obtained first, and the synthesis parameters (E0; α; ωg; ε), from which the LF waveform can be computed directly, are then derived taking the following constraints into account:

    $$\int_0^{T_0} g(t)\, dt = 0$$
    $$\omega_g = \frac{\pi}{T_p}$$
    $$\varepsilon T_a = 1 - e^{-\varepsilon (T_c - T_e)}$$
    $$E_0 = \frac{-E_e}{e^{\alpha T_e} \sin(\omega_g T_e)}$$
  • Another important set of LF parameters are the R-parameters (Rg, Rk, Ra), which are normalised with respect to T0 and correlate with the most salient glottal phenomena, i.e. the glottal pulse width and the skewness and abruptness of closure:

    $$R_g = \frac{T_0}{2 T_p}; \qquad R_k = \frac{T_e - T_p}{T_p}; \qquad R_a = \frac{T_a}{T_0}$$

  • Rg is a normalised version of the glottal formant frequency Fg, which is defined as the inverse of twice the duration of the opening phase Tp. Rk is the LF parameter which captures glottal asymmetry. It is defined as the ratio between the durations of the closing and opening branches of the glottal pulse, and the larger its value, the more symmetrical the pulse is. The relationship between Rg, Rk and the Open Quotient OQ is OQ = (1 + Rk) / (2 Rg). Thus, OQ is positively correlated with Rk and negatively correlated with Rg. The Ra parameter corresponds to the effective "return time" Ta normalised by the fundamental period and captures differences relating to the spectral tilt.
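  • As an illustration of the preceding definitions, the following minimal numpy sketch generates one LF derivative glottal pulse from its T-parameters and converts them to R-parameters. The parameter values are illustrative, and the grid search over α is a simple stand-in for an exact solution of the zero-mean constraint:

```python
import numpy as np

def lf_pulse(Ee, Tp, Te, Ta, Tc, fs):
    """One LF derivative glottal pulse from its T-parameters (in seconds)."""
    wg = np.pi / Tp                                 # omega_g = pi / Tp
    eps = 1.0 / Ta                                  # fixed-point iteration for
    for _ in range(100):                            # eps*Ta = 1 - exp(-eps*(Tc-Te))
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta
    n1 = np.arange(0.0, Te, 1.0 / fs)               # open phase
    n2 = np.arange(Te, Tc, 1.0 / fs)                # return phase
    ret = -(Ee / (eps * Ta)) * (np.exp(-eps * (n2 - Te)) - np.exp(-eps * (Tc - Te)))

    def pulse(alpha):
        E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
        return np.concatenate([E0 * np.exp(alpha * n1) * np.sin(wg * n1), ret])

    # choose alpha so that the pulse integrates (approximately) to zero
    alphas = np.linspace(-2000.0, 6000.0, 2001)
    alpha = alphas[np.argmin([abs(pulse(a).sum()) for a in alphas])]
    return pulse(alpha)

def t_to_r(Tp, Te, Ta, T0):
    """T-parameters to normalised R-parameters (Rg, Rk, Ra)."""
    return T0 / (2 * Tp), (Te - Tp) / Tp, Ta / T0

g = lf_pulse(Ee=1.0, Tp=0.004, Te=0.006, Ta=0.0004, Tc=0.008, fs=16000)
```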
  • Next, the method adopted for glottal source and vocal tract filter deconvolution is explained:
  • The aim of Source-Filter deconvolution is to obtain estimates of the glottal source and vocal tract filter components from the speech wave. Two main deconvolution approaches exist. Before parametric models of the glottal waveform were developed, Inverse Filtering (IF) was the most commonly employed deconvolution method. It is based on calculating a vocal tract filter transfer function, whose inverse is used to obtain a glottal waveform estimate which can then be parameterised.
  • A different approach involves modelling both glottal source and vocal tract filter, and developing techniques to jointly estimate the source and tract model parameters from the speech wave. Joint Estimation methods are fully automatic. This is an important condition that a mathematical model aimed at analysis, synthesis and modification of the speech signal should meet. Due to the characteristics of the mathematical voice source and vocal tract descriptions, such an approach is a complex nonlinear problem. For this reason, LP has been more widely deployed, as a simpler method to obtain a direct and efficient source-filter parameterisation of the speech signal. Its poor modelling of the voice source has not limited its application in speech coding, where the aim is to represent the speech spectrum efficiently with a small number of parameters; however, it has prevented its use in speech synthesis and transformation applications. Advances in voice conversion and Hidden Markov Model (HMM) speech synthesis in the last few years have emphasized the importance of refined vocoding and thus, the problem of automatic joint estimation of voice source and vocal tract filter parameters has gained renewed interest.
  • The method employed to obtain the JEAS voice source and vocal tract model parameters from the speech wave follows the second deconvolution approach and is based on the joint estimation of the vocal tract filter and the glottal waveform proposed by Lu and Smith (Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999).
  • 1.2 Analysis within the Joint Estimation Analysis Synthesis model
  • During analysis, voiced and unvoiced speech segments are processed differently due to their diverse source characteristics. While the voice source in voiced speech is represented by a combination of the LF and aspiration noise models, white Gaussian noise is used to excite the vocal tract filter in unvoiced frames (see Figure 2). Their different modelling requires a preprocessing step where the voiced and unvoiced speech sections are determined and the glottal closure instants (GCI) of the voiced segments are estimated. Then, the voice source and vocal tract parameters are obtained through joint source-filter estimation and LF re-parameterisation in voiced sections (V) and through standard autocorrelation LP and Gaussian noise energy matching in unvoiced portions (U).
  • An algorithm, such as the well-known Dynamic Programming Projected Phase-Slope Algorithm (DYPSA), is used for GCI estimation. It employs the group-delay function in combination with a phase-slope projection method to determine GCI candidates, plus N-best dynamic programming to select the most likely candidates according to a cost function which takes waveform similarity, pitch deviation, normalised energy and deviation from the ideal phase-slope into account.
  • The voicing decision is made based on energy, zero-crossing and GCI information. Voiced segments are then processed pitch-synchronously, while unvoiced frames are periodically extracted. In a particular embodiment, they are extracted every 10 ms.
  • The method employed by the invention to obtain the JEAS voice source and vocal tract model parameters involves using a voice source model simple enough to allow the source filter deconvolution to be formulated as a Convex Optimization problem. Then, the derivative glottal waveform obtained by inverse filtering (IF) with the estimated filter coefficients is reparameterised by LF model fitting.
  • The success of the technique lies in providing a derivative glottal waveform constraint when estimating the vocal tract filter. Because of this, the resulting IF derivative glottal waveform is closer to the true glottal excitation and its fitting to an LF model is less error prone.
  • The joint estimation algorithm models the voice source using the well-known Rosenberg-Klatt (RK) model, which consists of a basic voicing waveform describing the shape of the derivative glottal wave and a low-pass filter $\frac{1}{1 - \mu z^{-1}}$, with μ > 0, as shown in Figure 6. The RK derivative of the glottal waveform is given by

    $$\hat{g}(n) = \begin{cases} 0, & 1 \le n < n_c \\ 2a(n - n_c) - 3b(n - n_c)^2, & n_c \le n < T_0 \end{cases} \quad (7)$$

    where T0 corresponds to the pitch period and nc represents the duration of the closed phase, which can also be expressed as

    $$n_c = T_0 - OQ \cdot T_0, \quad (8)$$

    OQ being the open quotient, i.e. the fraction of the pitch period in which the glottis is open. In addition, the parameters a and b need to be always positive and hold the following relationship in order to maintain an appropriate waveshape:

    $$a = b \cdot OQ \cdot T_0 \quad (9)$$
  • Source-filter deconvolution via convex optimization is accomplished by minimising the squared error between the modelled and the true derivative glottal waveforms. The modelled derivative glottal waveform ĝ(n) corresponds to that of equation (7), while the true derivative glottal wave g(n) is obtained through inverse filtering as

    $$g(n) = s(n) - \sum_{k=1}^{p} \alpha_k s(n - k), \quad (10)$$

    where s(n) is the speech wave and αk are the coefficients of the vocal tract all-pole filter.

  • The error between the modelled and the true derivative glottal waves e(n) can be calculated by subtracting equations (7) and (10):

    $$e(n) = \hat{g}(n) - g(n) = \begin{cases} -s(n) + \sum_{k=1}^{p} \alpha_k s(n - k), & 1 \le n < n_c \\ 2a(n - n_c) - 3b(n - n_c)^2 - s(n) + \sum_{k=1}^{p} \alpha_k s(n - k), & n_c \le n < T_0 \end{cases} \quad (11)$$

  • Rearranging the previous expression and rewriting it in matrix form we have

    $$E = \begin{bmatrix} e(1) \\ \vdots \\ e(n_c) \\ e(n_c + 1) \\ \vdots \\ e(T_0) \end{bmatrix} = \begin{bmatrix} s(0) & \cdots & s(1 - p) & 0 & 0 \\ \vdots & & \vdots & \vdots & \vdots \\ s(n_c - 1) & \cdots & s(n_c - p) & 0 & 0 \\ s(n_c) & \cdots & s(n_c + 1 - p) & 2 \cdot 1 & -3 \cdot 1^2 \\ \vdots & & \vdots & \vdots & \vdots \\ s(T_0 - 1) & \cdots & s(T_0 - p) & 2(T_0 - n_c) & -3(T_0 - n_c)^2 \end{bmatrix} X - \begin{bmatrix} s(1) \\ \vdots \\ s(T_0) \end{bmatrix} = FX - S,$$

    where X = [α1 ... αp a b]ᵀ is the parameter vector to estimate so that the sum of the squares of the equation error E is minimised, i.e.

    $$\min_X \|E\|^2 = \min_X \sum_{n=1}^{T_0} E(n)^2 = \min_X \|FX - S\|^2 \quad (12)$$
  • H. Lu et al. demonstrated (Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999) that the simplicity of the RK glottal model guarantees this optimization to be convex, i.e. to only have one minimum, which corresponds to the optimal solution, and thus to be efficiently solvable via Quadratic Programming. A quadratic problem is defined as follows:

    $$\min_X q(X) = \frac{1}{2} X^T H X + g^T X \quad \text{subject to:} \quad AX \le b, \quad A_{eq} X = b_{eq}$$

  • Equation (12) can be solved using quadratic programming if expanded to have this form, i.e.

    $$\min_X \|FX - S\|^2 = (FX - S)^T (FX - S) = X^T F^T F X - 2 S^T F X + S^T S,$$

    by defining

    $$H = 2 F^T F, \qquad g^T = -2 S^T F$$

    and ignoring the term $S^T S$, which is always positive, for the purposes of minimisation. In addition, equation (9) imposes the following equality and inequality constraints:

    $$a > 0, \qquad b > 0, \qquad a = b \cdot OQ \cdot T_0.$$
  • The derived quadratic program can be solved using a number of existing iterative numerical algorithms. In the developed implementation, the quadratic programming function of the MATLAB Optimization Toolbox has been employed. The result of the minimization problem is the simultaneous estimation of the RK model parameters a and b and the all-pole filter coefficients α k . Figure 7 shows a joint estimation example for one pitch period.
  • The described joint estimation process assumes that the closed and open-phases are defined, while in practice the parameter which delimits the end of the closed-phase and the beginning of the open-phase nc is unknown. Its optimal value is found by uniformly sampling the possible nc values (empirically shown to vary from 0% to 60% of the pitch period To ), solving the quadratic problem at each sampled nc value and choosing the estimate resulting in minimum error.
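  • The following Python sketch illustrates this joint estimation for one pitch period, recast as a bound-constrained least squares problem: the equality constraint a = b·OQ·T0 is folded in by substitution and the strict inequalities are relaxed to b ≥ 0. A general-purpose QP solver, such as the MATLAB Optimization Toolbox function mentioned above, would handle the constraints directly; the helper name and data layout here are illustrative:

```python
import numpy as np
from scipy.optimize import lsq_linear

def joint_estimate(s_ext, T0, p, OQ_grid=np.linspace(0.4, 1.0, 13)):
    """Joint RK source / all-pole filter estimation for one pitch period.
    s_ext holds p history samples followed by the period samples, so that
    s(n) = s_ext[p + n - 1] for n = 1..T0. The equality constraint
    a = b*OQ*T0 is substituted into the b column, leaving one bounded
    unknown b >= 0 per candidate closed-phase length nc."""
    best = None
    for OQ in OQ_grid:
        nc = int(round(T0 - OQ * T0))                  # closed-phase length
        F = np.zeros((T0, p + 1))
        S = s_ext[p:p + T0].astype(float)              # s(1)..s(T0)
        for n in range(1, T0 + 1):
            F[n - 1, :p] = s_ext[p + n - 1 - np.arange(1, p + 1)]
            if n >= nc:
                m = n - nc                             # open-phase sample index
                # 2a(n-nc) - 3b(n-nc)^2 with a = b*OQ*T0 substituted:
                F[n - 1, p] = 2 * OQ * T0 * m - 3 * m ** 2
        lb = np.r_[np.full(p, -np.inf), 0.0]           # alphas free, b >= 0
        res = lsq_linear(F, S, bounds=(lb, np.full(p + 1, np.inf)))
        if best is None or res.cost < best[0]:
            best = (res.cost, nc, OQ, res.x)
    _, nc, OQ, x = best
    alphas, b = x[:p], x[p]
    return alphas, b * OQ * T0, b, nc                  # alphas, a, b, nc
```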
  • As can be seen in Figure 8, the basic RK voicing waveform of equation (7) does not explicitly model the return phase of the derivative glottal waveform and changes abruptly at the glottal closure instants. For this reason, a low-pass filter is added to the basic model, with the purpose of reducing the abruptness of glottal closure. In the frequency domain, the filter coefficient µ is responsible for controlling the tilt of the source spectrum.
  • In order to allow the formulation of the convex optimization problem, the spectral tilt filter is separated from the source model and incorporated into the vocal tract model by adding an extra pole to the all-pole filter, as shown in Figure 9. This implies that the vocal tract filter coefficients estimated using this formulation also encode the spectral slope information of the voice source. As a result, the derivative glottal waveforms obtained using this approach fail to adequately capture the variations in the return phase of the glottal source.
  • The present invention uses adaptive pre-emphasis to estimate and remove the spectral tilt filter contribution from the speech wave before convex optimization. Order-one LP analysis and inverse filtering are applied to estimate and remove the spectral slope from the speech frames under analysis. The effect of adaptive pre-emphasis is illustrated in Figure 10: a) speech spectrum and estimated spectral envelope, b) IF derivative glottal wave and fitted LF waveform, c) IF derivative glottal wave spectrum and fitted LF wave spectrum. The vocal tract filter envelope estimates obtained this way do not encode source spectral tilt characteristics, which are instead reflected in the closing phase of the resulting derivative glottal waveforms. This improves the fitting of the return phase of the LF model and thus of the high frequencies of the glottal source.
  • The LF model is capable of more accurately describing the glottal derivative waveform than the RK model. However, its more complex nonlinear formulation fails to fulfil the convexity condition and prevents its use in the joint voice source and vocal tract filter parameter estimation algorithm. Instead, the RK model is employed during source-filter deconvolution and the LF model is then used to re-parameterise the derivative glottal wave obtained by inverse filtering the speech waveform with the jointly estimated filter coefficients.
  • LF model fitting is carried out in two steps. First, initial estimates of the LF T-parameters (Tp, Te, Ta, T c ) and the glottal excitation strength Ee are obtained from the time-domain IF voice source waveform by conventional direct estimation methods. Then, their values are refined using the conventional constrained nonlinear optimization technique. The overall procedure is as follows.
  • The glottal excitation strength Ee and its time index Te are located first by finding the minimum of the IF derivative glottal waveform. Then, Tp and Tc are determined as the first zero-crossings before and after Te respectively. Ta is initially estimated as Ta = 2(Tc - Te)/3. Tp and Ta are further refined using constrained nonlinear minimisation. Because the initial Ee, Te and Tc estimates are quite reliable, their values are kept unchanged during optimization. Ta is confined to vary between 0 and Tc - Te, and Tp to vary within ±20% of its initial estimate. The return and open phases are optimized separately and sequentially. In both cases, the minimisation function is the sum of the squared error between the IF derivative glottal wave and the fitted estimate for the particular phase. Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
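  • A minimal sketch of this two-step fitting is given below, assuming one period of the IF derivative glottal wave is available. The refinement is shown for Ta only; the ±20% refinement of Tp would follow the same pattern, and the helper name is illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_lf(dg, fs):
    """Direct T-parameter estimation from one period of the IF derivative
    glottal wave dg, followed by a bounded refinement of Ta."""
    te_i = int(np.argmin(dg))                   # main excitation: minimum
    Ee, Te = -dg[te_i], te_i / fs
    zc = np.where(np.diff(np.signbit(dg)))[0]   # zero-crossing indices
    tp_i = zc[zc < te_i][-1]                    # last crossing before Te
    tc_i = zc[zc > te_i][0]                     # first crossing after Te
    Tp, Tc = tp_i / fs, tc_i / fs
    Ta0 = 2.0 * (Tc - Te) / 3.0                 # direct estimate (refined below)

    def return_phase_error(Ta):
        eps = 1.0 / Ta                          # eps*Ta = 1 - exp(-eps*(Tc-Te))
        for _ in range(100):
            eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta
        t = np.arange(te_i, tc_i) / fs
        model = -(Ee / (eps * Ta)) * (np.exp(-eps * (t - Te))
                                      - np.exp(-eps * (Tc - Te)))
        return np.sum((dg[te_i:tc_i] - model) ** 2)

    res = minimize_scalar(return_phase_error, bounds=(1e-5, Tc - Te),
                          method='bounded')
    return Ee, Tp, Te, res.x, Tc                # refined Ta replaces Ta0
```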
  • Because the LF parameterisation does not model glottal aspiration noise, the stochastic component present in the IF derivative glottal waveform is not captured during LF fitting. However, perceptually, the lack of aspiration noise results in an unnatural speech quality and thus, a methodology for its extraction and parameterisation has been developed within the JEAS framework.
  • Wavelet denoising is used to extract the glottal aspiration noise from the IF derivative glottal wave estimate. In a preferred embodiment, the wavelet denoising technique used is Wavelet Packet Analysis, which has been found to obtain more reliable aspiration noise estimates compared to other techniques employed to identify and separate the periodic and aperiodic components of quasi-periodic signals, such as frequency transform analysis or periodic prediction. Wavelet Packet Analysis is preferably performed at level 4 with the 7th order Daubechies wavelet, using soft-thresholding and the Stein Unbiased Risk Estimate threshold evaluation criterion. Figure 12 shows a typical denoising result: a) original and denoised IF derivative glottal wave, b) noise estimate.
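  • A sketch of this extraction step using the PyWavelets package is given below; note that the universal (VisuShrink) threshold is used here as a simple stand-in for the SURE criterion named above:

```python
import numpy as np
import pywt

def extract_aspiration_noise(dg):
    """Level-4 wavelet packet denoising of the IF derivative glottal wave dg
    (assumed long enough for a depth-4 decomposition) with a db7 wavelet and
    soft thresholding. Returns (noise estimate, denoised wave)."""
    sigma = np.median(np.abs(pywt.dwt(dg, 'db7')[1])) / 0.6745  # noise scale
    thr = sigma * np.sqrt(2.0 * np.log(len(dg)))                # universal threshold
    wp = pywt.WaveletPacket(data=dg, wavelet='db7', mode='symmetric', maxlevel=4)
    for node in wp.get_level(4, 'natural'):
        node.data = pywt.threshold(node.data, thr, mode='soft')
    denoised = wp.reconstruct(update=False)[:len(dg)]
    return dg - denoised, denoised
```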
  • Once an estimate of the aspiration noise has been extracted, it needs to be parameterised. Studies of aspiration noise have shown that it is synchronous with the glottal wave and likely to present noise bursts at glottal closure, and often also at glottal opening. Most models neglect the burst at glottal opening and approximate the noise as pitch-synchronous amplitude-modulated Gaussian noise, with higher energy around the glottal closure instants. The amplitude of the noise burst is usually modulated using Rectangular, Hanning or Hamming windows. A spectral shaping filter is sometimes included to account for the average spectral density of the aspiration noise and the high-pass filtering introduced by the commutation of the vocal tract and radiation filters. However, various models also neglect the spectral shaping filter, since it has been found not to be perceptually important. These pitch-synchronous amplitude-modulated Gaussian noise approaches require the determination of the following parameters, illustrated in Figure 13, from the aspiration noise component:
    • Noise Floor (Nf): the noise floor of the aspiration noise;
    • Noise Pulse Amplitude (NPa ): the amplitude modulation index of the noise pulse
    • Noise Pulse Position (NPp ): the position of the center of the noise pulse window in the glottal period
    • Noise Pulse Width (NPw): the width of the noise pulse window
  • Unfortunately, automatic calculation of the above parameters from the estimated aspiration noise components is troublesome in many cases. In order to avoid these errors, a different approach is followed in the present invention and, in particular, in the JEAS implementation. While the aspiration component is still approximated as pitch synchronous amplitude modulated Gaussian noise, an alternative function which does not require the estimation of Nf, NPa, NPp and NPw is employed to modulate its amplitude: the LF waveform. In fact, the shape of the LF waveform follows the most salient amplitude modulation characteristics of glottal aspiration noise, i.e. the magnitude of its amplitude increases during the open phase and is maximum at glottal closure. If stationary Gaussian noise is modulated with an LF waveform, the resulting signal will present the two likely aspiration noise bursts around glottal opening and glottal closure, as shown in Figure 14: a) Gaussian noise source, b) LF waveform, c) LF modulated Gaussian noise. According to informal listening tests, this approach is comparable to the previously described window-based modelling techniques.
  • Thus, in the present invention, the aspiration noise estimate obtained for a particular pitch period during JEAS analysis is parameterised as follows. First, zero mean unit variance Gaussian noise is modulated with the already fitted LF waveform for that pitch period. Then, its energy is adjusted to match the energy ANE of the aspiration noise estimate. Because using a spectral shaping filter has informally been found not to make a perceptual difference, it is not included in the parameterisation. Figure 15 depicts a diagram of the employed aspiration noise modelling approach.
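  • In code, this parameterisation reduces to a couple of lines; the sketch below assumes the fitted LF waveform lf and the measured energy ane for the current pitch period are available:

```python
import numpy as np

def model_aspiration_noise(lf, ane):
    """Modulate zero-mean, unit-variance Gaussian noise with the fitted LF
    waveform lf and scale it so that its energy matches ane."""
    an = np.random.randn(len(lf)) * lf      # LF-modulated Gaussian noise
    return an * np.sqrt(ane / np.sum(an ** 2))
```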
  • 1.3 Synthesis within the Joint Estimation Analysis Synthesis model
  • Synthesis is done by following the JEAS model of Figure 2 and applying the parameters estimated during analysis. In theory, each frame k of the speech waveform, which corresponds to a pitch period in voiced segments and to a fixed segment (in a particular example, a fixed segment of 10 ms) in unvoiced parts, can be generated by filtering the estimated voiced or unvoiced excitation signal e(n) with the vocal tract filter vt for that particular frame:

    $$s_k(n) = e_k(n) * vt_k = e_k(n) + \sum_{i=1}^{p} \alpha_i s_k(n - i), \quad n = 1 \ldots N_k,$$

    where p is the filter order and Nk is the number of samples in the frame.
  • The excitation signal is constructed either by adding the fitted LF and aspiration noise estimates, lf(n) and an(n), or by simply generating a Gaussian noise source, gn(n), in voiced (V) and unvoiced (U) segments respectively:

    $$e_k(n) = \begin{cases} lf_k(n) + an_k(n), & k \in V \\ gn_k(n), & k \in U \end{cases}$$
  • In practice, since the described JEAS analysis is done independently for each frame, the continuity of the estimated parameters between adjacent frames is not guaranteed, particularly within voiced segments. As a result, perceptual artifacts are produced when the parameters change too abruptly from frame to frame. To reduce this problem, the voiced glottal source and vocal tract parameter trajectories are smoothed before resynthesis.
  • Regarding the vocal tract, the jointly estimated filter coefficients (α1 ... αp) are first converted to Line Spectral Frequencies (LSF) due to their better interpolation properties. Then, each set of LSF coefficients LSFᵖ = (lsf1 ... lsfp) is averaged with those of the previous and following frames to obtain a smoother vocal tract filter estimate for synthesis:

    $$\overline{LSF}_k^p = \sum_{i=k-1}^{k+1} LSF_i^p \, / \, 3$$
  • As for the glottal source, a similar approach is followed. First, the fitted LF T-parameters (Tp, Te, Ta, Tc) are converted to R-parameters (Rg, Rk, Ra), which are more suitable for interpolation since they are normalised with respect to the fundamental period. Again, in order to smooth their trajectories, each R-parameter set is averaged with those of the previous and next frames. Aspiration noise energy ANE trajectories are also smoothed in the same way:

    $$\bar{R}_k = \sum_{i=k-1}^{k+1} R_i \, / \, 3$$
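  • This three-frame averaging amounts to a short moving-average filter over the parameter tracks, e.g.:

```python
import numpy as np

def smooth_tracks(params):
    """Three-frame moving average of a (frames x dims) parameter track;
    the end frames keep their original values."""
    sm = np.array(params, dtype=float)
    sm[1:-1] = (sm[:-2] + sm[1:-1] + sm[2:]) / 3.0
    return sm
```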
  • Once the source parameters (Rg , Rk , Ra , ANE) have been averaged, they are used to recompute smoothed LF derivative glottal waveforms lf (n) and amplitude modulated aspiration noise estimates an (n) to be used as the filter excitation e(n) for resynthesis.
  • In order to synthesise the speech wave, the overlap-add scheme of equation (21) is employed:

    $$\tilde{s}(n) = \sum_{k=1}^{K} w_k(n - k N_k^{sc}) \, sc_k(n - k N_k^{sc}), \quad (21)$$

    where K is the total number of frames and wk is a Hamming window such that

    $$w_k(n) = \begin{cases} 0.54 - 0.46 \cos\left(\frac{2 \pi n}{N_k^{sc}}\right), & 0 \le n \le N_k^{sc} \\ 0, & \text{otherwise} \end{cases}$$

    and sck is a synthetic contribution of length $N_k^{sc} = N_{k-1} + N_k$ generated by

    $$sc_k(n) = e_k(n - N_{k-1}) + \sum_{i=1}^{p} \alpha_i \, sc_k(n - i), \quad n = 1 \ldots N_k^{sc},$$

    so that a k-th synthesis frame of Nk samples is obtained as

    $$\tilde{s}(n + k N_k) = w_{k-1}(n + N_k) \, sc_{k-1}(n + N_k) + w_k(n) \, sc_k(n)$$
  • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
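  • A simplified sketch of this scheme is given below, assuming per-frame excitation segments of contribution length and per-frame filter coefficients are available (both names are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

def ola_synthesis(excitations, alphas, frame_sizes):
    """Overlap-add resynthesis sketch. Each synthetic contribution spans the
    previous and the current frame (length N[k-1] + N[k]), is generated by
    all-pole filtering of the excitation, Hamming-windowed and added at the
    previous frame boundary; the first frame is covered by contribution k=1."""
    N = list(frame_sizes)
    starts = np.concatenate([[0], np.cumsum(N[:-1])]).astype(int)
    out = np.zeros(int(np.sum(N)))
    for k in range(1, len(N)):
        n_sc = N[k - 1] + N[k]
        a = np.concatenate([[1.0], -np.asarray(alphas[k])])  # 1 - sum(alpha_i z^-i)
        sc = lfilter([1.0], a, excitations[k][:n_sc])        # all-pole filtering
        w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n_sc) / n_sc)
        pos = starts[k - 1]                                  # overlap previous frame
        out[pos:pos + n_sc] += w * sc
    return out
```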
  • 1.4 Pitch and Time-Scale Modification
  • Due to the explicit and independent modelling of the fundamental period and the interpolation capabilities of the employed vocal tract and glottal source parameterisations, pitch and timescale modifications are easily implemented within the JEAS framework.
  • Both pitch and time-scale transformations are based on a parameter trajectory interpolation approach, where the first task involves calculating the number of frames in a particular segment required to achieve the desired modifications. Once the modified number of frames has been calculated, frame size contours, excitation and vocal tract parameter trajectories are resampled at the modified number of frames using, for example, cubic spline interpolation. Because JEAS modelling is pitch-synchronous, the frame sizes correspond with the pitch periods in voiced segments while they are fixed in unvoiced segments. Due to their better interpolation characteristics, LSF coefficients and R-parameters are employed during pitch and time-scale transformations to represent the vocal tract and glottal source respectively, in addition to aspiration ANE and Gaussian GNE noise energies.
  • Time-scale modification is carried out by increasing or decreasing the number of frames per segment and interpolating the parameter tracks accordingly. For example, in order to increase the duration of a voiced segment of f frames by 25%, the modified number of frames is calculated as mf = f + 0.25f. Then, the f-point pitch period contour is resampled at the new set of uniformly spaced mf points as shown in Figure 17. This way, the contour of the fundamental period, i.e. the intonation, is preserved while its variation is slowed down. The same resampling needs to be applied to each of the LSF coefficient, R-parameter and ANE tracks, to synthesise time-modified speech. Unvoiced segments can also be time-scaled using the described procedure. In this case, the excitation parameter trajectories to resample are the energies of the Gaussian noise source GNE.
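  • A hedged sketch of this trajectory resampling (using cubic-spline interpolation, as suggested above) might look as follows, applied to any (frames x dimensions) parameter track:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def time_scale(tracks, factor):
    """Resample a (frames x dims) parameter track at a modified number of
    frames, e.g. factor=1.25 lengthens the segment by 25% while preserving
    the contour shape."""
    f = tracks.shape[0]
    mf = int(round(f * factor))
    cs = CubicSpline(np.linspace(0.0, 1.0, f), tracks, axis=0)
    return cs(np.linspace(0.0, 1.0, mf))
```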
  • Pitch can be altered by simply multiplying the fundamental period contour by a scaling factor.
  • For example, if a given pitch period contour of f frames T = {T1, T2, ..., Tf} is multiplied by 0.5, speech synthesised with the modified contour T' = 0.5T = {T'1, T'2, ..., T'f} would be perceived to have twice the original fundamental frequency. However, its duration would also be perceived to be half the original. Scaling the fundamental periods involves modifying the frame sizes and, as a consequence, the segment durations. For this reason, the number of frames in a segment also needs to be modified when scaling pitch if its duration is to be maintained. The modified number of frames mf at the scaled fundamental periods whose duration approximates the original can be calculated as

    $$mf = f + \frac{(\bar{T} - \bar{T}')\, f}{\bar{T}'},$$

    where T̄ is the original mean fundamental period and T̄' is the scaled mean fundamental period. Once mf has been calculated, the scaled pitch period contour T', LSF coefficients, R-parameters and ANE trajectories must be resampled at the new number of frames before resynthesising the pitch-modified speech wave.
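  • As a worked example of this relation (with illustrative values):

```python
# Halving the mean fundamental period of a 100-frame voiced segment
# while keeping its duration approximately constant:
f = 100
T_mean = 0.008                                  # original mean period: 8 ms
T_mean_scaled = 0.5 * T_mean                    # scaled mean period: 4 ms
mf = round(f + (T_mean - T_mean_scaled) * f / T_mean_scaled)
# mf = 200 frames: 200 * 4 ms = 0.8 s, matching the original 100 * 8 ms
```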
  • 2. Voice Conversion
  • In this section, the use of the JEAS glottal source parameterisation and continuous probabilistic linear transformations is explored for voice source conversion, and the performance of the resulting JEAS Voice Conversion framework is compared against that of a conventional Sinusoidal VC system (H. Ye and S. Young, High Quality Voice Morphing, in Proc. ICASSP, 2004), referred to as PSHM. The first section details the speech model and feature transformation techniques employed in the JEAS VC implementation. Objective measurement of its spectral envelope and voice source conversion performance and subjective evaluation of the recognizability and quality of the converted output are presented next.
  • 2.1 JEAS Voice Conversion
  • The spectral envelope and glottal waveform transformation methods employed within JEAS voice conversion are described next. While spectral envelope conversion is done in a way similar to the well-known sinusoidal voice conversion implementation, the main advantage of JEAS Modelling, i.e. the parameterisation of the voice source, allows the source characteristics to be also transformed to match the target. As well as offering the potential for improved fidelity in the target identity, this also avoids the need for conventional residual prediction methods. In addition, because the JEAS parameterisation does not involve a magnitude and phase division of the spectrum, the artifacts due to converted magnitude and phase mismatches are not produced and, thus, the use of additional techniques, such as phase prediction, is not required.
  • 2.1.1 Spectral Envelope Conversion
  • The jointly estimated JEAS all-pole vocal tract filter coefficients {α1 ... αp} are converted to Bark scaled LSF parameters for the transformation of the JEAS spectral envelopes. First, the linear frequency response of the jointly estimated vocal tract filter is calculated. This is resampled according to the Bark scale using, for example, the well-known cubic spline interpolation technique. The warped all-pole filter coefficients are then computed by applying, for example, the conventional Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum. Then, the filter coefficients are transformed into LSF for conversion.
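  • A sketch of this warping step is shown below. The Bark mapping bark = 6·arcsinh(f/600) is one of several conventions in the literature and is an assumption here, and the Toeplitz solve plays the role of the Levinson-Durbin recursion:

```python
import numpy as np
from scipy.signal import freqz
from scipy.interpolate import CubicSpline
from scipy.linalg import solve_toeplitz

def bark_warped_lpc(alphas, p, fs, n_freq=512):
    """Resample the vocal tract filter response on a Bark-spaced grid and
    re-derive all-pole coefficients from the autocorrelation of the warped
    power spectrum."""
    w, h = freqz([1.0], np.concatenate([[1.0], -alphas]), worN=n_freq, fs=fs)
    bark = 6.0 * np.arcsinh(w / 600.0)                  # assumed Bark formula
    grid = np.linspace(bark[0], bark[-1], n_freq)       # uniform in Bark
    warped = CubicSpline(bark, np.abs(h))(grid)
    r = np.fft.irfft(warped ** 2)[:p + 1]               # autocorrelation sequence
    return solve_toeplitz(r[:p], r[1:p + 1])            # warped all-pole coeffs
```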
  • A continuous probabilistic linear transformation function is employed to convert the LSF spectral envelopes. Gaussian Mixture Models (GMMs) are used to describe the source and target feature vector spaces, classify them into M classes and train class-specific linear transformations. A weighted sum of the linear transformations is then employed to convert each feature vector x:

    $$F(x) = \sum_{m=1}^{M} \lambda_m(x) \, W_m \bar{x},$$

    where $\bar{x}$ is the extended feature vector $\bar{x} = [x^T \; 1]^T$ and $\lambda_m(x)$ is the interpolation weight of transformation matrix Wm, its value given by the probability of vector x belonging to class Cm:

    $$\lambda_m(x) = P(C_m \mid x) = \frac{\alpha_m \, \mathcal{N}(x; \mu_m, \Sigma_m)}{\sum_{i=1}^{M} \alpha_i \, \mathcal{N}(x; \mu_i, \Sigma_i)},$$

    αm, µm and Σm being the weights, means and covariances of the GMM components respectively and N() representing the normal distribution. The transformation matrices Wm are estimated using parallel source and target training data and a least square error criterion.
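  • The conversion function itself is compact; a sketch using a trained scikit-learn GMM and pre-estimated transformation matrices W (the least-squares training of W is omitted):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def convert_vector(x, gmm, W):
    """Weighted-transform conversion: posterior-weighted sum of per-class
    linear transforms applied to the extended vector [x; 1]. gmm is a
    GaussianMixture fitted on source vectors; W has shape (M, d, d + 1)."""
    lam = gmm.predict_proba(x[None, :])[0]   # lambda_m(x) = P(C_m | x)
    x_ext = np.append(x, 1.0)                # extended feature vector
    return sum(l * (Wm @ x_ext) for l, Wm in zip(lam, W))
```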
  • After conversion, the new LSF parameters are transformed to all-pole filter coefficients and resampled back to the linear scale before synthesis. Because the use of linear transformations broadens the formants of the converted speech, a perceptual post-filter is applied to narrow the formant bandwidths, deepen the spectral valleys and sharpen the formant peaks.
  • Figure 18 illustrates the JEAS vs. conventional PSHM spectral envelopes, where it can be seen that the PSHM envelopes capture the spectral tilt, whereas in JEAS it is encoded by the glottal waveforms instead. In addition, whilst both methods manage to represent the most important formants, small differences exist in their amplitudes, frequencies and/or bandwidths.
  • 2.1.2 Glottal Waveform Conversion
  • Previous work on glottal waveform conversion has demonstrated that the quantization of glottal parameters is possible and capable of capturing voice source quality differences. For example, Childers et al. (Glottal source modelling for voice conversion. Speech Communication, 16:127-138, 1995) built 32-entry codebooks of polynomial voice source parameters from sentences produced with different voice qualities and managed to achieve conversions between modal, vocal fry, breathy, rough, falsetto, whisper and hoarse phonations. However, experiments involving transformations between more similar phonations, i.e. different modal speakers, or alternative conversion methods have not been explored yet. The use of LF glottal parameterisations has not been investigated either.
  • The glottal waveform morphing approach adopted within JEAS voice conversion employs Continuous Probabilistic Linear Transformations to map glottal LF parameters of different modal male and female speakers, which are the most commonly used speaker types in voice conversion applications.
  • Continuous probabilistic linear transformations have been chosen for being the most robust and efficient approach found to convert spectral envelopes. The limitations of the codebook-based conversion methods for envelope transformations, i.e. the discontinuities caused by the use of a discrete number of codebook entries, can also be extrapolated to the modification of glottal waveforms. Thus, the use of continuous probabilistic modelling and transformations is expected to achieve better glottal conversions too.
  • The feature vectors employed to convert the glottal source characteristics are derived from the JEAS model parameters linked to the voice source of every pitch period, i.e. the glottal excitation strength E e and T-parameters (Tp, Te, Ta, Tc) obtained from the LF fitting procedure and the energy (ANE) of the aspiration noise estimate used to adjust that of the modelled pitch-synchronous amplitude modulated Gaussian noise. In order to normalise the To dependent T-parameters for conversion, they are transformed into R-parameters (Rg , Rk, Ra), resulting in the five-dimensional feature vector (E e, R g, Rk , Ra , ANE) for glottal waveform conversion.
  • As it is shown in Figure 19, the described glottal conversion approach is capable of bringing the source feature vector parameter contours closer to the target which, as a consequence, also produces converted glottal waveforms more similar to the target. In particular, Figure 19 shows the linear transformation of LF Glottal Waveforms: a) source, target and converted derivative glottal LF waves; b) source, target and converted trajectories of the glottal feature vector parameters (Ee, Rg, Rk, Ra, ANE).
  • 2.2 An experiment: Comparison between a conventional sinusoidal voice conversion method and the JEAS Voice Conversion method
  • Next, an experiment is described in which the performance of a conventional sinusoidal voice conversion method (H. Ye and S. Young, High Quality Voice Morphing, in Proc. ICASSP, 2004), referred to as PSHM, is compared to the performance of the JEAS method. Both methods have been evaluated in a conversion task based on the VOICES database (A. Kain, High Resolution Voice Transformation, PhD thesis, Oregon Health and Science University, 2001). Specifically designed for voice conversion purposes, the corpus is composed of 3 instances of 50 phonetically rich sentences spoken by 10 speakers (5 male, 5 female), i.e. a total of 150 utterances per speaker. The speech data was recorded using a 'mimicking' approach, which resulted in a natural time-alignment between the identical sentences produced by the different speakers and factored out the prosodic cues of speaker identity to some extent. Glottal closure instants derived from laryngograph signals are also provided for each sentence, and have been used for both PSHM and JEAS pitch-synchronous analysis. Four different voice conversion experiments have been investigated: male-to-male (MM), male-to-female (MF), female-to-male (FM) and female-to-female (FF) transformations. The first 120 sentences are used for training and the remaining 30 for testing each speaker pair conversion.
  • LSF spectral vectors of order 30 have been employed throughout the conversion experiments, to train 8 linear spectral envelope transforms between each source and target speaker pair using the parallel VOICES training data. This number has been chosen for being capable of achieving small spectral distortion ratios while still generalising to the test data. Aligned source-target vector pairs were obtained by applying forced alignment to mark sub-phone boundaries and using Dynamic Time Warping to further constrain their time alignment. For residual and phase prediction, target GMMs of 40 classes and codebooks of 40 entries have been built. Finally, glottal waveform conversions have also been carried out using 8 linear transforms per speaker pair. Objective and subjective evaluations have been used to compare the performance of the two methods.
  • 2.2.1 Objective Evaluation 2.2.1.1 Spectral Envelope Conversion
  • Because the linear spectral envelope transformations are actually applied to LSF vectors, their conversion performance can be easily evaluated by comparing source, target and converted LSF vector distances. If the distance between two LSF vectors lsf1 and lsf2 is defined as

    $$D_{LSF}(lsf_1, lsf_2) = \|lsf_1 - lsf_2\| = \sqrt{(lsf_1 - lsf_2)^T (lsf_1 - lsf_2)},$$

    the following distortion ratio R_LSF can be used as an objective measure of how close the source vectors have been converted into the target:

    $$R_{LSF} = \frac{\sum_{t=1}^{L} D_{LSF}(lsf_{conv}(t), lsf_{tgt}(t))}{\sum_{t=1}^{L} D_{LSF}(lsf_{src}(t), lsf_{tgt}(t))} \cdot 100,$$

    where lsf_src(t), lsf_tgt(t) and lsf_conv(t) are the source, target and converted LSF vectors respectively and the summation is computed over the time-aligned test data, L being the total number of test vectors after time alignment. Note that a 100% distortion ratio corresponds to the distortion between the source and the target.
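  • This ratio is straightforward to compute over time-aligned LSF arrays, e.g.:

```python
import numpy as np

def r_lsf(lsf_src, lsf_tgt, lsf_conv):
    """Distortion ratio over time-aligned (L x p) LSF arrays; 100% means the
    converted vectors are as far from the target as the source vectors."""
    d = lambda a, b: np.linalg.norm(a - b, axis=1)   # per-frame D_LSF
    return 100.0 * d(lsf_conv, lsf_tgt).sum() / d(lsf_src, lsf_tgt).sum()
```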
  • RLSF ratios have been computed for the PSHM and JEAS spectral envelope conversions on the VOICES test set. Figure 20 shows the obtained results. Although the differences are small, JEAS has been found to perform slightly better than PSHM with LSF distortion ratios 3% smaller in all conversion tasks overall. This might be due to the fact that JEAS spectral envelopes do not encode spectral tilt information, which reduces the LSF variations caused by tilt differences resulting in more accurate mappings.
  • 2.2.1.2 Voice Source Conversion
  • Similar objective distortion measures can also be used to evaluate the conversion of the voice source characteristics, i.e. Residual Prediction and Glottal Waveform Conversion in the PSHM and JEAS implementations respectively.
  • Residual Prediction reintroduces the target spectral details not captured by spectral envelope conversion, bringing as a result the converted speech spectra closer to the target. Glottal Waveform Conversion, on the other hand, maps time-domain representations of the glottal waveforms which in the frequency domain result in better matching glottal formants and spectral tilts of the converted spectra. Whilst the methods differ, their spectral effect is similar, i.e, they aim to reduce the differences between the converted and the target speech spectra.
  • One way to evaluate whether the voice source conversion methods achieve the desired effect is to measure the log spectral distances (LSD) between the converted and target spectra before and after voice source conversion. The RMS log spectral distance between two spectra is defined as

    $$D_{LSD}(S_1, S_2) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(10 \log_{10} amp_k^1 - 10 \log_{10} amp_k^2\right)^2},$$

    where {amp_k} are the harmonic amplitudes resampled from spectrum S at K points on the Bark frequency scale (K has been set to 100 points in this work). Then, a distortion ratio R_LSD similar to R_LSF can be used to compare the converted-to-target log spectral distances with and without voice source conversion:

    $$R_{LSD} = \frac{\sum_{t=1}^{L} D_{LSD}(S_{conv}(t), S_{tgt}(t))}{\sum_{t=1}^{L} D_{LSD}(S_{orig}(t), S_{tgt}(t))} \cdot 100,$$

    where Sconv(t) and Sorig(t) are the converted spectra with and without voice source conversion respectively and Stgt(t) is the target spectrum. Thus, a 100% ratio corresponds to the distortion between spectral envelope converted spectra without voice source transformation and the target spectra.
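  • The distance itself reduces to a few lines, given two sets of harmonic amplitudes resampled at K Bark-scale points:

```python
import numpy as np

def d_lsd(amp1, amp2):
    """RMS log spectral distance between two sets of harmonic amplitudes
    resampled at K points on the Bark scale."""
    diff = 10.0 * np.log10(amp1) - 10.0 * np.log10(amp2)
    return np.sqrt(np.mean(diff ** 2))
```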
  • Figure 21 illustrates R LSD ratios computed for Residual Prediction and Glottal Waveform Conversion on the test set. Results show that both voice source conversion techniques manage to reduce the distortions between the converted and target speech spectra. Residual Prediction performs slightly better, mainly because the algorithm is designed to predict residuals which minimise the log spectral distance represented in R LSD. In contrast, glottal waveform conversion is trained to minimise the glottal parameter conversion error over the training data and not the log spectral distance. Nevertheless, both methods are successful in bringing the converted spectra close to the target.
  • 2.2.2 Subjective Evaluation
  • In order to compare the PSHM and JEAS voice conversion systems perceptually, a listening test was carried out to check their performance in terms of recognizability and quality. 12 subjects took part in the perceptual study, which consisted of two parts.
  • The first part was an ABX test in which subjects were presented with PSHM-converted (A), JEAS-converted (B) and target (X) utterances and were asked to choose the speech sample A or B they found sounded more like the target X in terms of speaker identity. Spectral envelopes and voice source characteristics were transformed with the methods described above for each system, i.e. spectral envelope conversion, residual and phase prediction were used for PSHM transformations and spectral envelope and glottal waveform conversion for JEAS transformations. In addition, the prosody of the target was employed to synthesise the converted sentences in order to normalise the pitch, duration and energy differences between source and target speakers for the perceptual comparison. 10 utterances of each conversion type (MM, MF, FM, FF) were presented. The order of the samples in terms of conversion type and conversion system was randomised. Informal listening of the utterances transformed using the PSHM and JEAS conversion systems revealed that it was often very difficult to convincingly choose between systems in terms of speaker identity. For this reason, subjects were also allowed to select a 'NO STRONG PREFERENCE' option when they found it difficult to choose or did not have a strong preference towards one of the presented A or B speech samples.
  • Figure 22 shows the results of the ABX test. In all conversion types, the JEAS-converted samples are preferred over the PSHM-converted ones overall, but the preference difference varies depending on the type of conversion, being for example almost the same for FM transformations. However, the 'NO STRONG PREFERENCE' (NSP) option has been selected almost as often as the JEAS-converted utterances in general, which reveals that subjects found it really difficult to distinguish between conversion systems in terms of speaker identity. Because the most important speaker identifying cues, i.e. spectral envelopes, are transformed using the same method in the two conversion implementations, it is expected that both systems should perform equally in terms of speaker recognizability. In addition, the obtained results show that the Residual Prediction and Glottal Waveform Conversion techniques are also comparable in terms of perceptual speaker identity transformation.
  • The second listening test aimed at determining which system produces speech with a higher quality. Subjects were presented with PSHM and JEAS converted speech utterance pairs and asked to choose the one they thought had a better speech quality. Results are illustrated in Figure 23. There is a clear preference for the sentences converted using the JEAS method, chosen 75.7% of the time on average, which stems from the clearly distinguishable quality difference between the PSHM and JEAS transformed samples. Utterances obtained after PSHM conversion have a 'noisy' quality caused by phase discontinuities which still exist despite Phase Prediction. Comparatively, JEAS converted sentences sound much smoother. This quality difference is also thought to have slightly biased the preference for JEAS conversion in the ABX test.
  • Among other applications, the voice conversion method and device of the present invention are applicable to frameworks requiring voice quality transformations. As one such application, their use to repair the deviant voice source characteristics of tracheoesophageal speech can be mentioned.
  • The invention is obviously not limited to the specific embodiments described herein, but also encompasses any variations that may be considered by any person skilled in the art (for example, as regards the choice of components, configuration, etc.), within the general scope of the invention as defined in the appended claims.

Claims (13)

  1. A method of converting a source speaker's speech signal into a converted voice signal, which comprises the steps of:
    - a stage of training, in which:
    - given a training database of parallel source and target data, for each pitch period of said training database:
    - modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of Liljencrants-Fant LF parameters, said set of LF parameters comprising an excitation strength parameter Ee and a set of T-parameters Tp, Te, Ta, Tc modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients α1 ... αp;
    - converting said T-parameters Tp, Te, Ta, Tc into R-parameters Rg, Rk, Ra;
    - converting said all-pole vocal tract filter coefficients α1... αp into line spectral frequencies in Bark scale lsf1... lsfp;
    - defining a glottal vector G to be converted;
    - defining a vocal tract vector LSF to be converted, said vocal tract vector LSF comprising said line spectral frequencies in Bark scale lsf1... lsfp;
    - applying wavelet denoising to obtain an estimate of a glottal aspiration noise;
    - from the set of vocal tract vectors LSF obtained for each pitch period of the said training database, estimating a vocal tract continuous probabilistic linear transformation function using the least square error criterion;
    the method being characterised in that said stage of modelling further comprises the steps of:
    - modelling said aspiration noise estimate by modulating zero mean unit variance Gaussian noise with the said modelled glottal waveform and adjusting its energy ANE to match that of the said aspiration noise estimate;
    said glottal vector G to be converted comprising said excitation strength parameter Ee , said R-parameters Rg, Rk , Ra and said energy ANE of the aspiration noise estimate,
    the method further comprising:
    - a stage of conversion, in which a given test speech waveform is modelled and transformed into a set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF';
    - a stage of synthesis, in which a converted speech waveform is synthesised from the said set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF'.
  2. The method according to claim 1, wherein said stage of training further comprises:
    - from the set of glottal vectors G obtained for each pitch period of the said training database, estimating a glottal waveform continuous probabilistic linear transformation function using the least square error criterion.
  3. The method according to either claim 1 or 2, wherein said step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, comprises the steps of:
    - modelling the glottal waveform using the Rosenberg-Klatt model;
    - using convex optimization to obtain a set of Rosenberg-Klatt glottal waveform parameters and the all-pole vocal tract filter coefficients α1 ... αp, wherein said step of using convex optimization comprises a step of adaptive pre-emphasis for estimating and removing a spectral tilt filter contribution from the speech waveform before convex optimization.
  4. The method according to claim 3, wherein said step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, further comprises the steps of:
    - obtaining a derivative glottal waveform by inverse filtering said pitch period using said all-pole vocal tract filter coefficients α1 ... αp;
    - fitting said set of LF parameters to the said inverse filtered derivative glottal waveform by direct estimation and constrained non-linear optimization.
  5. The method according to any preceding claim, wherein said stage of conversion comprises, for each pitch period of said test speech waveform:
    - obtaining a glottal vector G to be converted, said glottal vector comprising an excitation strength parameter Ee , a set of R-parameters Rg , Rk , Ra and the energy ANE of the said aspiration noise estimate;
    - obtaining a vocal tract vector LSF to be converted, said vocal tract vector LSF comprising a set of line spectral frequencies in Bark scale lsf1 ... lsfp;
    - applying said vocal tract continuous probabilistic linear transformation function estimated during the training stage to obtain a converted vocal tract parameter vector LSF';
    - transforming said glottal vector G using said glottal waveform continuous probabilistic linear transformation function estimated during the training stage, thus obtaining a converted glottal vector G' comprising a set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF'.
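  [Editor's illustration, not part of the claim: with the single-class transform of the earlier sketch, the conversion stage reduces to an affine map of the source vector; the probabilistic version would blend several such maps by posterior weight. Hypothetical helper:]

```python
import numpy as np

def convert_vector(v, w, b):
    """Affine conversion, e.g. LSF' = LSF @ W + b or G' = G @ W + b."""
    return np.asarray(v) @ w + b
```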
  6. The method according to claim 5, wherein said stages of obtaining a glottal vector G to be converted and a vocal tract vector LSF to be converted further comprise the steps of:
    - modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter Ee and a set of T-parameters Tp, Te, Ta, Tc modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients α1 ... αp;
    - converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale lsf1 ... lsfp;
    - converting said T-parameters into R-parameters Rg, Rk, Ra;
    - defining a glottal vector G to be converted;
    - defining a vocal tract vector LSF to be converted.
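  [Editor's illustration, not part of the claim: the claim does not state the T-to-R conversion formulas; the sketch below assumes the standard LF-model relations due to Fant, Rg = T0/(2·Tp), Rk = (Te − Tp)/Tp, Ra = Ta/T0, with T0 taken as Tc:]

```python
def t_to_r(tp, te, ta, tc):
    """Convert LF T-parameters to R-parameters (standard textbook relations)."""
    t0 = tc                          # Tc marks the end of the cycle
    rg = t0 / (2.0 * tp)             # normalised glottal frequency
    rk = (te - tp) / tp              # pulse skewness
    ra = ta / t0                     # normalised return-phase duration
    return rg, rk, ra
```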
  7. The method according to either claim 5 or 6, wherein said stage of conversion further comprises a step of post-filtering said converted vocal tract parameter vector LSF'.
  8. The method according to any preceding claim, wherein said stage of synthesis, in which said converted speech waveform is synthesised from the said set of converted parameters Ee', Rg', Rk', Ra', ANE', LSF', comprises the steps of:
    - interpolating the trajectories of said converted parameters Rg', Rk', Ra', ANE', LSF' of each pitch period, thus obtaining a set of interpolated parameters Rg", Rk", Ra", ANE", LSF" comprising interpolated R-parameters Rg", Rk", Ra", interpolated energy ANE" and interpolated vocal tract vector LSF";
    - converting said interpolated vocal tract vector LSF" into all-pole filter coefficient vector A";
    - converting said interpolated R-parameters Rg", Rk", Ra" into interpolated T-parameters Tp", Ta", Te", Tc";
    - for each frame of said test speech waveform, generating an excitation signal ek(n), wherein k denotes the k-th frame.
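  [Editor's illustration, not part of the claim: the claim does not fix the interpolation scheme; the sketch below assumes simple linear interpolation of each per-pitch-period parameter track onto a regular frame grid, yielding the smoothed trajectories Rg", Rk", Ra", ANE", LSF":]

```python
import numpy as np

def interpolate_trajectory(period_times, values, frame_times):
    """values: (n_periods,) scalar track or (n_periods, dim) vector track."""
    values = np.asarray(values, dtype=float)
    if values.ndim == 1:
        return np.interp(frame_times, period_times, values)
    return np.stack([np.interp(frame_times, period_times, values[:, d])
                     for d in range(values.shape[1])], axis=1)
```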
  9. The method according to claim 8, wherein said stage of generating an excitation signal comprises, for each of said frames:
    - if said frame is voiced:
    - from said interpolated T-parameters Tp", Ta", Te", Tc" and said excitation strength parameter Ee, generating an interpolated glottal waveform lfk(n);
    - from said interpolated energy parameter ANE", generating interpolated aspiration noise ank(n);
    - generating said voiced excitation signal ek(n) by adding said interpolated glottal waveform lfk(n) and said interpolated aspiration noise ank(n);
    - if said frame is unvoiced:
    - generating said unvoiced excitation signal ek(n) from a Gaussian noise source gnk(n).
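  [Editor's illustration, not part of the claim: the per-frame excitation generation of claim 9, sketched with assumed names; voiced frames add the interpolated LF pulse and aspiration noise, unvoiced frames draw from a Gaussian noise source:]

```python
import numpy as np

def frame_excitation(voiced, lf_k, an_k, frame_len, rng=None):
    """Voiced: e_k(n) = lf_k(n) + an_k(n); unvoiced: e_k(n) = gn_k(n)."""
    rng = rng or np.random.default_rng()
    if voiced:
        return lf_k + an_k                    # LF pulse plus aspiration noise
    return rng.standard_normal(frame_len)     # Gaussian noise source gn_k(n)
```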
  10. The method according to either claim 8 or 9, wherein said stage of synthesis further comprises:
    - generating a synthetic contribution of each frame by filtering said excitation signal ek(n) with said interpolated all-pole filter coefficient vector A";
    - multiplying said synthetic contribution by a Hamming window, overlapping and adding, in order to generate the converted speech signal.
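  [Editor's illustration, not part of the claim: a minimal overlap-add sketch of claim 10. Each frame's excitation is shaped by the all-pole filter 1/A"(z), multiplied by a Hamming window and overlap-added; the 50% hop and the per-frame filter-state handling are assumptions:]

```python
import numpy as np
from scipy.signal import lfilter

def synthesise(excitations, a_coeffs, frame_len):
    """excitations: list of frame excitations e_k(n); a_coeffs: list of all-pole
    coefficient vectors A" (each with leading 1.0)."""
    hop = frame_len // 2
    out = np.zeros(hop * (len(excitations) - 1) + frame_len)
    win = np.hamming(frame_len)
    for k, (e_k, a_k) in enumerate(zip(excitations, a_coeffs)):
        contrib = lfilter([1.0], a_k, e_k)                   # synthetic contribution
        out[k * hop: k * hop + frame_len] += win * contrib   # window + overlap-add
    return out
```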
  11. A method applicable to voice quality transformations, such as tracheoesophageal speech repair, which comprises the method steps of any preceding claim.
  12. A device comprising means adapted to carry out the steps of the method of any preceding claim.
  13. A computer program code means adapted to perform the steps of the method according to any of claims 1-11, when said program is run on a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a microprocessor, a microcontroller, or any other form of programmable hardware.
EP08804436A 2008-09-19 2008-09-19 Method, device and computer program code means for voice conversion Not-in-force EP2215632B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2008/062502 WO2010031437A1 (en) 2008-09-19 2008-09-19 Method and system of voice conversion

Publications (2)

Publication Number Publication Date
EP2215632A1 EP2215632A1 (en) 2010-08-11
EP2215632B1 2011-03-16

Family

ID=40277465

Family Applications (1)

Application Number Title Priority Date Filing Date
EP08804436A Not-in-force EP2215632B1 (en) 2008-09-19 2008-09-19 Method, device and computer program code means for voice conversion

Country Status (5)

Country Link
EP (1) EP2215632B1 (en)
AT (1) ATE502380T1 (en)
DE (1) DE602008005641D1 (en)
ES (1) ES2364005T3 (en)
WO (1) WO2010031437A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9607610B2 (en) 2014-07-03 2017-03-28 Google Inc. Devices and methods for noise modulation in a universal vocoder synthesizer
US11100940B2 (en) 2019-12-20 2021-08-24 Soundhound, Inc. Training a voice morphing apparatus
US11600284B2 (en) 2020-01-11 2023-03-07 Soundhound, Inc. Voice morphing apparatus having adjustable parameters

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901598A (en) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
ES2364401B2 * 2011-06-27 2011-12-23 Universidad Politécnica de Madrid METHOD AND SYSTEM FOR ESTIMATING PHYSIOLOGICAL PARAMETERS OF PHONATION.
RU2510954C2 (en) * 2012-05-18 2014-04-10 Александр Юрьевич Бредихин Method of re-sounding audio materials and apparatus for realising said method
EP3857541B1 (en) 2018-09-30 2023-07-19 Microsoft Technology Licensing, LLC Speech waveform generation
WO2020174356A1 (en) * 2019-02-25 2020-09-03 Technologies Of Voice Interface Ltd Speech interpretation device and system
CN113780107B (en) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100809368B1 (en) * 2006-08-09 2008-03-05 한국과학기술원 Voice Color Conversion System using Glottal waveform

Also Published As

Publication number Publication date
ES2364005T3 (en) 2011-08-22
DE602008005641D1 (en) 2011-04-28
WO2010031437A1 (en) 2010-03-25
ATE502380T1 (en) 2011-04-15
EP2215632A1 (en) 2010-08-11

Similar Documents

Publication Publication Date Title
EP2215632B1 (en) Method, device and computer program code means for voice conversion
McCree et al. A mixed excitation LPC vocoder model for low bit rate speech coding
Erro et al. Voice conversion based on weighted frequency warping
EP2881947B1 (en) Spectral envelope and group delay inference system and voice signal synthesis system for voice analysis/synthesis
Drugman et al. Glottal source processing: From analysis to applications
US9031834B2 (en) Speech enhancement techniques on the power spectrum
Arslan Speaker transformation algorithm using segmental codebooks (STASC)
JP4294724B2 (en) Speech separation device, speech synthesis device, and voice quality conversion device
Degottex et al. Phase minimization for glottal model estimation
US20180174571A1 (en) Speech processing device, speech processing method, and computer program product
Akande et al. Estimation of the vocal tract transfer function with application to glottal wave analysis
US20050131680A1 (en) Speech synthesis using complex spectral modeling
Cabral et al. Towards an improved modeling of the glottal source in statistical parametric speech synthesis
Cabral et al. Glottal spectral separation for speech synthesis
Cabral et al. Glottal spectral separation for parametric speech synthesis
Al-Radhi et al. Time-Domain Envelope Modulating the Noise Component of Excitation in a Continuous Residual-Based Vocoder for Statistical Parametric Speech Synthesis.
Ohtsuka et al. TRANSLATED PAPER
Del Pozo et al. The linear transformation of LF glottal waveforms for voice conversion.
Ferreira et al. A holistic glottal phase related feature
Kawahara et al. Beyond bandlimited sampling of speech spectral envelope imposed by the harmonic structure of voiced sounds.
Jelinek et al. Frequency-domain spectral envelope estimation for low rate coding of speech
Lenarczyk Parametric speech coding framework for voice conversion based on mixed excitation model
Del Pozo. Voice source and duration modelling for voice conversion and speech repair
Shi et al. A variational EM method for pole-zero modeling of speech with mixed block sparse and Gaussian excitation
Agiomyrgiannakis et al. Towards flexible speech coding for speech synthesis: an LF+ modulated noise vocoder.

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) EPC to a published international application that has entered the European phase (ORIGINAL CODE: 0009012)
17P Request for examination filed (effective date: 20091014)
AK Designated contracting states (kind code of ref document: A1; designated states: AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR)
AX Request for extension of the European patent (extension states: AL BA MK RS)
GRAP Despatch of communication of intention to grant a patent (ORIGINAL CODE: EPIDOSNIGR1)
RTI1 Title (correction): METHOD, DEVICE AND COMPUTER PROGRAM CODE MEANS FOR VOICE CONVERSION
GRAS Grant fee paid (ORIGINAL CODE: EPIDOSNIGR3)
GRAA (Expected) grant (ORIGINAL CODE: 0009210)
AK Designated contracting states (kind code of ref document: B1; designated states: AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR)
REG Reference to a national code: GB, legal event code FG4D
REG Reference to a national code: CH, legal event code EP
REG Reference to a national code: IE, legal event code FG4D
REF Corresponds to ref document 602008005641 (country: DE; date of ref document: 20110428; kind code: P)
REG Reference to a national code: DE, legal event code R096, ref document 602008005641 (effective date: 20110428)
REG Reference to a national code: PT, legal event code SC4A, availability of national translation (effective date: 20110606)
REG Reference to a national code: NL, legal event code VDEP (effective date: 20110316)
REG Reference to a national code: ES, legal event code FG2A, ref document ES 2364005, kind code T3 (effective date: 20110822)
LTIE LT: invalidation of European patent or patent extension (effective date: 20110316)
PG25 Lapsed in contracting states [announced via postgrant information from national office to the EPO] because of failure to submit a translation of the description or to pay the fee within the prescribed time limit: SE, LT, HR, LV, FI, CY, SI, AT, BE, EE, CZ, SK, RO, NL, PL, DK, MT, TR, HU, GR (effective date: 20110316); BG, NO (effective date: 20110616); IS (effective date: 20110716)
PLBE No opposition filed within time limit (ORIGINAL CODE: 0009261)
STAA Information on the status of an EP patent application or granted EP patent (STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT)
26N No opposition filed (effective date: 20111219)
REG Reference to a national code: DE, legal event code R097, ref document 602008005641 (effective date: 20111219)
REG Reference to a national code: IE, legal event code MM4A
REG Reference to a national code: CH, legal event code PL
REG Reference to a national code: ES, legal event code PC2A, owner: FUNDACION CENTRO DE TECNOLOGIAS DE INTERACCION VIS (effective date: 20130604)
PG25 Lapsed in contracting states [announced via postgrant information from national office to the EPO] because of non-payment of due fees: MC (effective date: 20110930); IE, LU (effective date: 20110919); CH, LI (effective date: 20120930)
PGFP Annual fee paid to national office, 7th year of fee payment: GB (payment date: 20140929), PT (20140320), IT (20140922), DE (20140929)
REG Reference to a national code: PT, legal event code MM4A, lapse due to non-payment of fees (effective date: 20160321)
REG Reference to a national code: DE, legal event code R119, ref document 602008005641
GBPC GB: European patent ceased through non-payment of renewal fee (effective date: 20150919)
PG25 Lapsed in contracting states [announced via postgrant information from national office to the EPO] because of non-payment of due fees: IT, GB (effective date: 20150919); PT (effective date: 20160321); DE (effective date: 20160401)
REG Reference to a national code: FR, legal event code PLFP (9th and 10th years of fee payment)
PGFP Annual fee paid to national office, 10th year of fee payment: FR (payment date: 20170926), ES (payment date: 20171010)
PG25 Lapsed in contracting states [announced via postgrant information from national office to the EPO] because of non-payment of due fees: FR (effective date: 20180930); ES (effective date: 20180920)
REG Reference to a national code: ES, legal event code FD2A (effective date: 20191104)