WO2010031437A1 - Method and system of voice conversion - Google Patents

Method and system of voice conversion Download PDF

Info

Publication number
WO2010031437A1
Authority
WO
WIPO (PCT)
Prior art keywords
glottal
parameters
converted
vocal tract
lsf
Prior art date
Application number
PCT/EP2008/062502
Other languages
English (en)
French (fr)
Inventor
María Arantzazu DEL POZO ECHEZARRETA
Original Assignee
Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech filed Critical Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech
Priority to PCT/EP2008/062502 priority Critical patent/WO2010031437A1/en
Priority to AT08804436T priority patent/ATE502380T1/de
Priority to DE602008005641T priority patent/DE602008005641D1/de
Priority to ES08804436T priority patent/ES2364005T3/es
Priority to EP08804436A priority patent/EP2215632B1/de
Publication of WO2010031437A1 publication Critical patent/WO2010031437A1/en

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
    • G10L21/0364: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude for improving intelligibility
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013: Adapting to target pitch
    • G10L2021/0135: Voice conversion or morphing
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04: Time compression or expansion
    • G10L21/057: Time compression or expansion for improving intelligibility
    • G10L2021/0575: Aids for the handicapped in speaking

Definitions

  • the present invention relates to methods and systems for voice conversion.
  • Voice Conversion aims at transforming a source speaker's speech to sound like that of a different target speaker. Text-to-speech synthesisers, dialogue systems and speech repair are among the numerous applications which can greatly benefit from the development of voice conversion technology.
  • the most widely used speech signal representations are the Source-Filter Model and the Sinusoidal Model.
  • the Source-Filter representation (G. Fant, Acoustic Theory of Speech Production, ISBN 9027916004) is based on a simple production model composed of a glottal source waveform exciting a time-varying filter loaded at its output by the radiation of the lips.
  • the main challenge in Source-Filter modelling is the estimation of the glottal waveform and vocal tract filter parameters from the speech signal.
  • Linear Prediction (LP) is one popular technique used to obtain a combined parameterisation of the glottal source, vocal tract and lip radiation components in a unique all-pole filter H(z).
  • This filter is then excited, as shown in Figure 1, by a sequence of impulses spaced at the fundamental period T_0 during voiced speech and by white Gaussian noise during unvoiced speech.
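As a rough illustration of this classical LP synthesis scheme, the following Python sketch (function and parameter names are illustrative, not taken from the patent) excites an all-pole filter 1/A(z) with either an impulse train or white Gaussian noise:

```python
import numpy as np
from scipy.signal import lfilter

def lp_synthesis(a, n_samples, voiced, T0=80):
    """Classical LP synthesis: excite H(z) = 1/A(z) with an impulse train
    (voiced, impulses T0 samples apart) or white Gaussian noise (unvoiced).
    'a' holds the prediction-error polynomial coefficients [1, a1, ..., ap]."""
    if voiced:
        e = np.zeros(n_samples)
        e[::T0] = 1.0                  # impulses spaced at the fundamental period
    else:
        e = np.random.randn(n_samples)
    return lfilter([1.0], a, e)        # all-pole filtering of the excitation
```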
  • Ideally, the LP error or residual would be a train of impulses spaced at the voiced excitation instants, and the impulse/noise voice source modelling would be accurate.
  • In practice, however, the LP residual looks more like a white noise signal with larger values around the instants of excitation.
  • H. Lu et al. have proposed a convex optimization method to automatically estimate the vocal tract filter and glottal waveform jointly.
  • Sinusoidal Models assume the speech waveform to be composed of the sum of a small number of sinusoids with time-varying amplitudes, frequencies and phases. Such modelling was mainly developed by McAulay and Quatieri (Speech Analysis/Synthesis Based on a Sinusoidal Representation, IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 744-754, 1986) in the mid-1980s and has been shown to be capable of producing high quality speech even after pitch and time-scale transformations. However, because of the high number of sinusoidal amplitudes, frequencies and phases involved, sinusoidal modelling is less flexible than the source-filter representation for modifying spectral features. In order to obtain high-quality converted speech, state-of-the-art voice conversion (VC) implementations mainly employ variations and extensions of the original sinusoidal model. In addition, they generally adopt a source-filter formulation based on LP to carry out spectral transformations.
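For reference, sinusoidal synthesis reduces to a sum of cosines; the sketch below holds amplitudes, frequencies and phases constant over the segment, whereas real implementations interpolate them frame to frame:

```python
import numpy as np

def sinusoidal_synthesis(amps, freqs_hz, phases, n_samples, fs=16000):
    """Sum of sinusoids with fixed amplitudes, frequencies and phases."""
    t = np.arange(n_samples) / fs
    return sum(A * np.cos(2.0 * np.pi * f * t + phi)
               for A, f, phi in zip(amps, freqs_hz, phases))
```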
  • Spectral envelopes are generally encoded as line spectral frequencies (LSF) for voice conversion, since LSFs have been shown to possess very good linear interpolation characteristics and to relate well to formant location and bandwidth. Because the frequency resolution of the human ear is greater at low frequencies than at high frequencies, spectral envelopes are often warped to a non-linear scale, e.g. the Bark scale, taking the non-uniform sensitivity of the human ear into account. Usually, only spectral envelopes of voiced speech segments are transformed, since unvoiced sounds contain little vocal tract information and their spectral envelopes present high variations. Among the different existing spectral envelope conversion techniques, continuous probabilistic linear transformations have been found to be the most robust and efficient approach. These can be obtained through least square error minimisation of parallel source and target training databases or using more general maximum likelihood transformation frameworks.
  • the stage of training comprises, given a training database of parallel source and target data, for each pitch period of said training database: modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter and a set of T-parameters modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients; converting said T-parameters into R-parameters; converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale; defining a glottal vector to be converted; defining a vocal tract vector to be converted, said vocal tract vector comprising said line spectral frequencies in Bark scale; applying wavelet denoising to obtain an estimate of a glottal aspiration noise.
  • the stage of training also comprises, from the set of vocal tract vectors obtained for each pitch period of the said training database, estimating a vocal tract continuous probabilistic linear transformation function using the least square error criterion.
  • the previous stage of modelling further comprises the steps of modelling said aspiration noise estimate by modulating zero mean unit variance Gaussian noise with the said modelled glottal waveform and adjusting its energy to match that of the said aspiration noise estimate.
  • the glottal vector to be converted comprises said excitation strength parameter, said R-parameters and said energy of the aspiration noise estimate.
  • a given test speech waveform is modelled and transformed into a set of converted parameters.
  • a converted speech waveform is synthesised from the said set of converted parameters.
  • the stage of training further comprises: from the set of glottal vectors obtained for each pitch period of the said training database, estimating a glottal waveform continuous probabilistic linear transformation function using the least square error criterion.
  • Lu and Smith's model preferably comprises the steps of: modelling the glottal waveform using the Rosenberg-Klatt model; using convex optimization to obtain a set of Rosenberg-Klatt glottal waveform parameters and the all-pole vocal tract filter coefficients, wherein said step of using convex optimization comprises a step of adaptive pre-emphasis for estimating and removing a spectral tilt filter contribution from the speech waveform before convex optimization.
  • step of modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model further comprises the steps of: obtaining a derivative glottal waveform by inverse filtering said pitch period using said all-pole vocal tract filter coefficients; fitting said set of LF parameters to the said inverse filtered derivative glottal waveform by direct estimation and constrained non-linear optimization.
  • the stage of conversion preferably comprises, for each pitch period of said test speech waveform: obtaining a glottal vector to be converted, said glottal vector comprising an excitation strength parameter, a set of R-parameters and the energy of the said aspiration noise estimate; obtaining a vocal tract vector to be converted, said vocal tract vector comprising a set of line spectral frequencies in Bark scale; applying said vocal tract continuous probabilistic linear transformation function estimated during the training stage to obtain a converted vocal tract parameter vector; transforming said glottal vector using said glottal waveform continuous probabilistic linear transformation function estimated during the training stage, thus obtaining a converted glottal vector comprising a set of converted parameters.
  • those stages of obtaining a glottal vector to be converted and a vocal tract vector to be converted further comprise the steps of: modelling each pitch period by means of a glottal waveform and a vocal tract filter according to Lu and Smith's model, to obtain a set of LF parameters, said set of LF parameters comprising an excitation strength parameter and a set of T-parameters modelling a glottal waveform, and a set of all-pole vocal tract filter coefficients; converting said all-pole vocal tract filter coefficients into line spectral frequencies in Bark scale; converting said T-parameters into R-parameters; defining a glottal vector to be converted; and defining a vocal tract vector to be converted.
  • the stage of conversion further comprises a step of post-filtering said converted vocal tract parameter vector.
  • the stage of synthesis in which said converted speech waveform is synthesised from the said set of converted parameters, preferably comprises the steps of: interpolating the trajectories of said converted parameters of each pitch period, thus obtaining a set of interpolated parameters comprising interpolated R-parameters, interpolated energy and interpolated vocal tract vector; converting said interpolated vocal tract vector into an all-pole filter coefficient vector; converting said interpolated R-parameters into interpolated T-parameters; for each frame of said test speech waveform, generating an excitation signal.
  • the stage of generating an excitation signal comprises, for each of said frames: if said frame is voiced: from said interpolated T-parameters and said excitation strength parameter, generating an interpolated glottal waveform; from said interpolated aspiration noise energy parameter, generating interpolated aspiration noise; generating said voiced excitation signal by adding said interpolated glottal waveform and said interpolated aspiration noise. And, if said frame is unvoiced: generating said unvoiced excitation signal from a Gaussian noise source.
  • the stage of synthesis further comprises: generating a synthetic contribution of each frame by filtering said excitation signal with said interpolated all-pole filter coefficient vector; multiplying said synthetic contribution by a Hamming window, overlapping and adding, in order to generate the converted speech signal.
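A minimal sketch of this windowed overlap-add step (frame positions, excitations and per-frame filters are assumed to have been produced by the preceding stages):

```python
import numpy as np
from scipy.signal import lfilter

def overlap_add_synthesis(excitations, filters, positions, out_len):
    """Filter each frame's excitation through its all-pole filter
    [1, a1, ..., ap], apply a Hamming window and overlap-add."""
    out = np.zeros(out_len)
    for e, a, pos in zip(excitations, filters, positions):
        frame = lfilter([1.0], a, e) * np.hamming(len(e))
        out[pos:pos + len(e)] += frame
    return out
```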
  • the present invention also provides a method applicable to voice quality transformations, such as tracheoesophageal speech repair, which comprises at least some of the above-mentioned method steps.
  • Figure 1 shows a conventional schematic diagram of the LP model.
  • Figure 2 shows a schematic diagram of the joint estimation analysis synthesis (JEAS) model according to an embodiment of the present invention.
  • Figure 3 shows a schematic diagram modelling the glottal wave.
  • Figure 4 shows a schematic diagram modelling the derivative glottal wave.
  • Figure 5 shows typical LF pulses corresponding to glottal and derivative glottal waves.
  • Figure 6 shows a conventional model of the voice source.
  • Figure 7 shows a joint estimation example: a) speech period, b) speech spectrum and jointly estimated spectral envelope, c) inverse filtered residual and jointly estimated RK wave.
  • Figure 8 shows an RK derivative glottal wave.
  • Figure 9 shows a schematic diagram of a conventional modelling of the spectral tilt.
  • Figure 10 shows the effects of adaptive pre-emphasis.
  • Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
  • Figure 12 shows a typical denoising result.
  • Figure 13 shows the standard aspiration noise model parameters.
  • Figure 14 shows the Gaussian noise modulation by an LF waveform.
  • Figure 15 shows a schematic diagram of an aspiration noise modelling approach according to an embodiment of the present invention.
  • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
  • Figure 17 illustrates resampling of the frame size contour.
  • Figure 18 shows the JEAS vs. PSHM spectral envelopes.
  • Figure 19 shows the continuous probabilistic linear transformation of LF glottal waveforms.
  • Figure 20 shows the R_LSF distortion ratios of the converted PSHM and JEAS spectral envelopes.
  • Figure 21 shows the R_LSD distortion ratios of Residual Predicted (RP) and Glottal Waveform Converted (GWC) spectra.
  • Figure 22 shows the results of the ABX test.
  • Figure 23 shows the results of the quality comparison test.
  • pitch period means a segment of a speech waveform which comprises a period of the fundamental frequency.
  • frame means a segment of a speech waveform, which corresponds to a pitch period in voiced parts and to a fixed amount of time in unvoiced parts. In a preferred embodiment of the present invention, which should not be interpreted as a limitation to the present invention, a frame corresponds to 10 ms in unvoiced parts.
  • source data refers to a collection of speech waveforms uttered by a source speaker.
  • target data refers to a collection of speech waveforms uttered by a target speaker.
  • parallel source and target data refers to a collection of speech waveforms uttered both by the source and the target speakers.
  • JEAS: joint estimation analysis synthesis
  • Figure 2 shows a schematic diagram of the JEAS model. It is based on a general Source-Filter representation. It employs white Gaussian and amplitude-modulated white Gaussian noise to model the Turbulence and Aspiration Noise components respectively, a digital differentiator for Lip Radiation and an all-pole filter to represent the Vocal Tract. In addition, the Liljencrants-Fant (LF) model is adopted to better capture the characteristics of the derivative glottal wave. Then, in order to estimate the different model component parameterisations from the speech wave, a joint voice source and vocal tract parameter estimation technique based on Convex Optimization is applied.
  • LF: Liljencrants-Fant
  • the present method adopts the well-known LF model, which is a four-parameter time-domain model of one cycle of the derivative glottal waveform.
  • Typical LF pulses corresponding to glottal and derivative glottal waves are shown in Figure 5. Mathematically, the derivative glottal wave can be described by the standard LF equations E(t) = E_0 e^(αt) sin(ω_g t) for 0 ≤ t ≤ T_e, and E(t) = -(E_e/(εT_a))[e^(-ε(t-T_e)) - e^(-ε(T_c-T_e))] for T_e < t ≤ T_c, with ω_g = π/T_p.
  • the model consists of two segments: the first one characterises the derivative glottal waveform from the instant of glottal opening to the instant of main excitation T_e, where the amplitude reaches the maximum negative value -E_e.
  • This segment is a sinusoidal function whose amplitude grows exponentially in time.
  • E_0 is a scaling factor used to ensure that the signal has a zero mean.
  • T_p is the instant of maximum glottal flow.
  • E_e is closely related to the strength of the source excitation and is the main determinant of the intensity of the speech signal. Its variation affects the overall harmonic amplitudes, except the very lowest components, which are more determined by the shape of the pulse.
  • the second segment models the closing or return phase from the main excitation T_e to the instant of full closure T_c using an exponential function.
  • the duration of the return phase is thus determined by T_c - T_e.
  • the main parameter characterising this segment is T_a, which represents the "effective duration" of the return phase. This is defined by the duration from T_e to the point where a tangent fitted at the start of the return phase crosses zero.
  • ε is the time-constant of the exponential function, and can be determined iteratively from T_a, T_e and T_c.
  • T_0 corresponds to the fundamental period.
  • T_c is made to coincide with the opening of the following pulse. This fact might suggest that the model does not account for the closed phase of the glottal waveform. However, for reasonably small values of T_a, the exponential function will fit closely to the zero line, providing a closed phase without the need for additional control parameters.
  • the LF pulse can be uniquely determined by the T-parameters (T_p, T_e, T_a, T_c). These parameters can be easily identified from the estimated derivative glottal wave. Therefore, they are generally obtained first, and the synthesis parameters (E_0, α, ω_g, ε), from which the LF waveform can be computed directly, are then derived taking the following constraints into account: the two segments must join continuously at T_e with amplitude -E_e, and the pulse must integrate to zero over the cycle so that there is no net gain of flow.
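The following sketch generates one LF derivative glottal pulse from the T-parameters using the standard LF equations and the two constraints above; the fixed-point iteration for ε and the root-bracketing interval for α are illustrative numerical choices, not taken from the patent:

```python
import numpy as np
from scipy.optimize import brentq

def lf_derivative_pulse(Tp, Te, Ta, Tc, Ee, fs=16000):
    """One cycle of the LF derivative glottal wave from the T-parameters
    (seconds) and the excitation strength Ee."""
    wg = np.pi / Tp                          # glottal formant angular frequency
    eps = 1.0 / Ta                           # solve eps*Ta = 1 - exp(-eps*(Tc-Te))
    for _ in range(50):                      # simple fixed-point iteration
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta

    t1 = np.arange(0.0, Te, 1.0 / fs)        # first segment: opening branch
    t2 = np.arange(Te, Tc, 1.0 / fs)         # second segment: return phase

    def pulse(alpha):
        # continuity at Te fixes E0 (sin(wg*Te) < 0 for Tp < Te < 2*Tp)
        E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
        seg1 = E0 * np.exp(alpha * t1) * np.sin(wg * t1)
        seg2 = -(Ee / (eps * Ta)) * (np.exp(-eps * (t2 - Te))
                                     - np.exp(-eps * (Tc - Te)))
        return np.concatenate([seg1, seg2])

    # alpha is set so the pulse integrates to zero over the cycle (zero net
    # flow gain); this bracket covers typical speech parameter ranges
    alpha = brentq(lambda a: np.trapz(pulse(a), dx=1.0 / fs), -1e4, 1e4)
    return pulse(alpha)

# e.g. a 125 Hz modal-voice pulse:
# lf_derivative_pulse(Tp=0.004, Te=0.005, Ta=0.0003, Tc=0.008, Ee=1.0)
```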
  • R-parameters: R_g, R_k, R_a
  • R_g is a normalised version of the glottal formant frequency F_g, which is defined as the inverse of twice the duration of the opening phase T_p.
  • R_k is the LF parameter which captures glottal asymmetry. It is defined as the ratio between the times of the opening and closing branches of the glottal pulse, and the larger its value, the more symmetrical the pulse is.
  • OQ is positively correlated with R_k and negatively correlated with R_g.
  • the R_a parameter corresponds to the effective "return time" T_a normalised by the fundamental period and captures the differences relating to the spectral tilt.
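In code, the T-to-R conversion is three normalisations. The formulas below are the standard definitions (an assumption on our part; the text describes them only verbally):

```python
def t_to_r_params(Tp, Te, Ta, T0):
    """LF T-parameters -> normalised R-parameters (standard definitions)."""
    Rg = T0 / (2.0 * Tp)   # normalised glottal formant frequency Fg/F0, Fg = 1/(2*Tp)
    Rk = (Te - Tp) / Tp    # asymmetry: closing-branch over opening-branch duration
    Ra = Ta / T0           # effective return time normalised by the period
    return Rg, Rk, Ra
```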
  • The aim of Source-Filter deconvolution is to obtain estimates of the glottal source and vocal tract filter components from the speech wave.
  • Inverse Filtering (IF) was the most commonly employed deconvolution method. It is based on calculating a vocal tract filter transfer function, whose inverse is used to obtain a glottal waveform estimate which can then be parameterised.
  • a different approach involves modelling both glottal source and vocal tract filter, and developing techniques to jointly estimate the source and tract model parameters from the speech wave.
  • Joint Estimation methods are fully automatic. This is an important condition that a mathematical model aimed at analysis, synthesis and modification of the speech signal should meet. Due to the characteristics of the mathematical voice source and vocal tract descriptions, such an approach is a complex nonlinear problem. For this reason, LP has been more widely deployed as a simpler method to obtain a direct and efficient source-filter parameterisation of the speech signal. Its poor modelling of the voice source has not limited its application in speech coding, where the speech spectrum must be represented efficiently with a small number of parameters. However, it has prevented its use in speech synthesis and transformation applications. Advances in voice conversion and Hidden Markov Model (HMM) speech synthesis in the last few years have emphasised the importance of refined vocoding and thus the problem of automatic joint estimation of voice source and vocal tract filter parameters has gained renewed interest.
  • HMM: Hidden Markov Model
  • the method employed to obtain the JEAS voice source and vocal tract model parameters from the speech wave follows the second deconvolution approach and is based on the joint estimation of the vocal tract filter and the glottal waveform proposed by Lu and Smith (Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999) .
  • voiced and unvoiced speech segments are processed differently due to their diverse source characteristics. While the voice source in voiced speech is represented by a combination of the LF and aspiration noise models, white Gaussian noise is used to excite the vocal tract filter in unvoiced frames (see Figure 2). Their different modelling requires a preprocessing step where the voiced and unvoiced speech sections are determined and the glottal closure instants (GCI) of the voiced segments are estimated. Then, the voice source and vocal tract parameters are obtained through joint source-filter estimation and LF re-parameterisation in voiced sections (V) and through standard autocorrelation LP and Gaussian noise energy matching in unvoiced portions (U).
  • GCI: glottal closure instants
  • An algorithm such as the well-known Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) is used for GCI estimation. It employs the group-delay function in combination with a phase-slope projection method to determine GCI candidates, plus N-best dynamic programming to select the most likely candidates according to a cost function which takes waveform similarity, pitch deviation, normalised energy and deviation from the ideal phase-slope into account.
  • DYPSA: Dynamic Programming Projected Phase-Slope Algorithm
  • the voicing decision is made based on energy, zero-crossing and GCI information. Voiced segments are then processed pitch-synchronously, while unvoiced frames are periodically extracted. In a particular embodiment, they are extracted every 10 ms.
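A toy version of such an energy/zero-crossing voicing decision might look as follows (the thresholds are illustrative, and the described method additionally takes GCI information into account):

```python
import numpy as np

def is_voiced(frame, energy_thresh=1e-3, zcr_thresh=0.25):
    """Voiced if the frame has enough energy and a low zero-crossing rate."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0   # crossings per sample
    return energy > energy_thresh and zcr < zcr_thresh
```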
  • the method employed by the invention to obtain the JEAS voice source and vocal tract model parameters involves using a voice source model simple enough to allow the source filter deconvolution to be formulated as a Convex Optimization problem. Then, the derivative glottal waveform obtained by inverse filtering (IF) with the estimated filter coefficients is reparameterised by LF model fitting.
  • the success of the technique lies in providing a derivative glottal waveform constraint when estimating the vocal tract filter. Because of this, the resulting IF derivative glottal waveform is closer to the true glottal excitation and its fitting to an LF model is less error-prone.
  • the joint estimation algorithm models the voice source using the well-known Rosenberg-Klatt (RK) model, which consists of a basic voicing waveform describing the shape of the derivative glottal wave and a low-pass filter modelling the spectral tilt.
  • RK Rosenberg-Klatt
  • T_0 corresponds to the pitch period.
  • n_c represents the duration of the closed phase, which can also be expressed as n_c = (1 - OQ)T_0, OQ being the open-quotient, i.e. the fraction of the pitch period in which the glottis is open.
  • OQ: the open-quotient
  • Source-filter deconvolution via convex optimization is accomplished by minimising the squared error between the modelled and the true derivative glottal waveforms.
  • the modelled derivative glottal waveform g(n) corresponds to that of equation (7), while the true derivative glottal wave is obtained through inverse filtering of the speech signal with the estimated all-pole filter, as in equation (10).
  • the error between the modelled and the true derivative glottal waves e(n) can be calculated by subtracting Equations (7) and (10).
  • minimisation is carried out subject to linear inequality constraints on the parameters, of the form Ax ≤ b.
  • Equation (12) can be solved using quadratic programming if expanded into standard quadratic-program form, i.e. minimising (1/2)x^T H x + f^T x subject to Ax ≤ b.
  • the derived quadratic program can be solved using a number of existing iterative numerical algorithms.
  • the quadratic programming function of the MATLAB Optimization Toolbox has been employed.
  • the result of the minimization problem is the simultaneous estimation of the RK model parameters a and b and the all-pole filter coefficients α_k.
  • Figure 7 shows a joint estimation example for one pitch period.
  • the described joint estimation process assumes that the closed and open phases are defined, while in practice the parameter which delimits the end of the closed phase and the beginning of the open phase, n_c, is unknown. Its optimal value is found by uniformly sampling the possible n_c values (empirically shown to vary from 0% to 60% of the pitch period T_0), solving the quadratic problem at each sampled n_c value and choosing the estimate resulting in minimum error.
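The sketch below conveys the joint-estimation idea, including the grid search over n_c, in plain unconstrained least squares; the actual formulation is a constrained quadratic program, and the RK open-phase basis g(n) = 2at - 3bt^2 used here is the standard KLGLOTT88 derivative glottal wave:

```python
import numpy as np

def joint_estimate(s, p=18, fs=16000):
    """Jointly fit all-pole coefficients and RK source parameters to one
    pitch period s, modelling s[n] = sum_k alpha_k s[n-k] + g[n] with
    g[n] = 2*a*t - 3*b*t**2 in the open phase and 0 in the closed phase."""
    N = len(s)
    best = None
    for nc in range(0, int(0.6 * N)):              # closed-phase length grid
        X = np.zeros((N - p, p + 2))
        for k in range(1, p + 1):                  # past-sample regressors
            X[:, k - 1] = s[p - k:N - k]
        n = np.arange(p, N)
        t = np.where(n >= nc, (n - nc) / fs, 0.0)  # time since glottal opening
        X[:, p] = 2.0 * t                          # basis for RK parameter a
        X[:, p + 1] = -3.0 * t ** 2                # basis for RK parameter b
        coef, *_ = np.linalg.lstsq(X, s[p:], rcond=None)
        err = np.sum((s[p:] - X @ coef) ** 2)
        if best is None or err < best[0]:
            best = (err, coef, nc)
    _, coef, nc = best
    return coef[:p], coef[p], coef[p + 1], nc      # alphas, a, b, n_c
```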
  • the basic RK voicing waveform of equation (7) does not explicitly model the return phase of the derivative glottal waveform and changes abruptly at the glottal closure instants. For this reason, a low-pass filter is added to the basic model, with the purpose of reducing the abruptness of glottal closure.
  • the coefficient of this low-pass filter is responsible for controlling the tilt of the source spectrum.
  • the spectral tilt filter is separated from the source model and incorporated to the vocal tract model by adding an extra pole to the all-pole filter as shown in Figure 9.
  • the vocal tract filter coefficients estimated using this formulation also encode the spectral slope information of the voice source.
  • the derivative glottal waveforms obtained using this approach fail to adequately capture the variations in the return phase of the glottal source.
  • the present invention uses adaptive pre-emphasis to estimate and remove the spectral tilt filter contribution from the speech wave before convex optimization.
  • Order-one LP analysis and IF are applied to estimate and remove the spectral slope from the speech frames under analysis.
  • the effect of adaptive pre-emphasis is illustrated in Figure 10: a) Speech spectrum and estimated spectral envelope, b) IF derivative glottal wave and fitted LF waveform, c) IF derivative glottal wave spectrum and fitted LF wave spectrum.
  • the vocal tract filter envelope estimates obtained this way do not encode source spectral tilt characteristics, which are reflected in the closing phase of the resulting derivative glottal waveforms instead. This improves the fitting of the return phase of the LF model and thus, of the high frequencies of the glottal source.
  • the LF model is capable of more accurately describing the glottal derivative waveform than the RK model.
  • its more complex nonlinear formulation fails to fulfil the convexity condition and prevents its use in the joint voice source and vocal tract filter parameter estimation algorithm.
  • the RK model is employed during source-filter deconvolution and the LF model is then used to re-parameterise the derivative glottal wave obtained by inverse filtering the speech waveform with the jointly estimated filter coefficients.
  • LF model fitting is carried out in two steps. First, initial estimates of the LF T-parameters (T_p, T_e, T_a, T_c) and the glottal excitation strength E_e are obtained from the time-domain IF voice source waveform by conventional direct estimation methods. Then, their values are refined using the conventional constrained nonlinear optimization technique. The overall procedure is as follows.
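A sketch of the refinement step, reusing the lf_derivative_pulse() helper above; the optimiser choice, the tying of T_c to the period and the validity guard are illustrative choices rather than the described procedure:

```python
import numpy as np
from scipy.optimize import minimize

def fit_lf(dg, T0, fs, x0):
    """Refine direct estimates x0 = (Tp, Te, Ta) by nonlinear optimisation of
    the squared error against the IF derivative glottal wave dg; Tc is tied
    to the period T0 here for simplicity and Ee is read off the waveform."""
    Ee = -dg.min()                                 # excitation strength

    def cost(x):
        Tp, Te, Ta = x
        if not (0.0 < Tp < Te < T0 and 0.0 < Ta):  # keep the LF ordering valid
            return 1e9
        try:
            pulse = lf_derivative_pulse(Tp, Te, Ta, T0, Ee, fs)
        except ValueError:                         # alpha root not bracketed
            return 1e9
        n = min(len(pulse), len(dg))
        return float(np.sum((pulse[:n] - dg[:n]) ** 2))

    res = minimize(cost, x0, method="Nelder-Mead")
    return res.x, Ee
```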
  • Figure 11 shows an example of LF fitting in normal, breathy and pressed phonations.
  • Wavelet denoising is used to extract the glottal aspiration noise from the IF derivative glottal wave estimate.
  • the wavelet denoising technique used is Wavelet Packet Analysis, which has been found to obtain more reliable aspiration noise estimates compared to other techniques employed to identify and separate the periodic and aperiodic components of quasi-periodic signals, such as frequency transform analysis or periodic prediction.
  • Wavelet Packet Analysis is preferably performed at level 4 with the 7th order Daubechies wavelet, using soft thresholding and the Stein Unbiased Risk Estimate threshold evaluation criterion.
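A sketch of this extraction step with the PyWavelets package; the universal threshold below is a simple stand-in for the SURE criterion named in the text:

```python
import numpy as np
import pywt

def aspiration_noise_estimate(dg):
    """Level-4 wavelet packet denoising (db7, soft thresholding) of the IF
    derivative glottal wave; the noise estimate is the removed component."""
    sigma = np.median(np.abs(pywt.wavedec(dg, 'db7', level=1)[-1])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(len(dg)))    # universal threshold
    wp = pywt.WaveletPacket(data=dg, wavelet='db7', maxlevel=4)
    new_wp = pywt.WaveletPacket(data=None, wavelet='db7', maxlevel=4)
    for node in wp.get_level(4, order='natural'):
        new_wp[node.path] = pywt.threshold(node.data, thresh, mode='soft')
    denoised = new_wp.reconstruct(update=False)[:len(dg)]
    return dg - denoised                               # aspiration noise estimate
```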
  • Figure 12 shows a typical denoising result: a) original and denoised IF derivative glottal wave, b) noise estimate.
  • N_f: the noise floor of the aspiration noise
  • NP_a: the amplitude modulation index of the noise pulse
  • NP_p: the position of the center of the noise pulse window in the glottal period
  • NP_w: the width of the noise pulse window
  • the aspiration component is still approximated as pitch-synchronous amplitude-modulated Gaussian noise
  • an alternative function which does not require the estimation of N_f, NP_a, NP_p and NP_w is employed to modulate its amplitude: the LF waveform.
  • the shape of the LF waveform follows the most salient amplitude modulation characteristics of glottal aspiration noise, i.e. the magnitude of its amplitude increases during the open phase and is maximum at glottal closure.
  • During JEAS analysis, the aspiration noise is parameterised as follows. First, zero mean unit variance Gaussian noise is modulated with the already fitted LF waveform for that pitch period. Then, its energy is adjusted to match that (ANE) of the aspiration noise estimate. Because using a spectral shaping filter has informally been found not to make a perceptual difference, it is not included in the parameterisation.
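In code, this parameterisation and its resynthesis reduce to an amplitude modulation followed by an energy match:

```python
import numpy as np

def synth_aspiration_noise(lf_pulse, ane):
    """Modulate unit-variance Gaussian noise with the fitted LF waveform of
    the pitch period and rescale it to the estimated energy ANE."""
    noise = np.random.randn(len(lf_pulse)) * lf_pulse   # amplitude modulation
    return noise * np.sqrt(ane / np.sum(noise ** 2))    # match energy to ANE
```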
  • Figure 15 depicts a diagram of the employed aspiration noise modelling approach.
  • each frame k of the speech waveform, which corresponds to a pitch period in voiced segments and to a fixed segment (in a particular example, a fixed segment of 10 ms) in unvoiced parts, can be generated by filtering the estimated voiced or unvoiced excitation signal e(n) with the vocal tract filter vt for that particular frame.
  • the excitation signal is constructed either by adding the fitted LF and aspiration noise estimates in voiced frames, or from Gaussian noise in unvoiced frames.
  • the jointly estimated filter coefficients (α_1 ... α_p) are first converted to Line Spectral Frequencies (LSF) due to their better interpolation properties. Then, each set of LSF coefficients (lsf_1 ... lsf_p) is averaged with those of the previous and following frames to obtain a smoother vocal tract filter estimate for synthesis.
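The LPC-to-LSF conversion itself can be sketched with the standard symmetric/antisymmetric polynomial construction (production code would typically call a library routine):

```python
import numpy as np

def poly2lsf(a):
    """LP polynomial A(z), coefficients [1, a1, ..., ap], to line spectral
    frequencies in radians: the roots of P(z) = A(z) + z^-(p+1) A(1/z) and
    Q(z) = A(z) - z^-(p+1) A(1/z) lie on and interleave around the unit circle."""
    a = np.asarray(a, dtype=float)
    P = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    Q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = []
    for poly in (P, Q):
        r = np.roots(poly)
        angles.extend(np.angle(r[r.imag >= 0.0]))      # upper half-plane roots
    # drop the trivial roots at z = +/-1 and sort the remaining p frequencies
    return np.sort([w for w in angles if 1e-6 < w < np.pi - 1e-6])
```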
  • LSF: Line Spectral Frequencies
  • the fitted LF T-parameters (T_p, T_e, T_a, T_c) are converted to R-parameters (R_g, R_k, R_a), which are more suitable for interpolation since they are normalised with respect to the fundamental period.
  • each R-parameter set is averaged with the ones of the previous and next frames.
  • Aspiration noise energy (ANE) trajectories are also smoothed the same way.
  • Figure 16 shows a schematic diagram of the employed overlap-add synthesis scheme.
  • Both pitch and time-scale transformations are based on a parameter trajectory interpolation approach, where the first task involves calculating the number of frames in a particular segment required to achieve the desired modifications. Once the modified number of frames has been calculated, frame size contours, excitation and vocal tract parameter trajectories are resampled at the modified number of frames using, for example, cubic spline interpolation. Because JEAS modelling is pitch-synchronous, the frame sizes correspond with the pitch periods in voiced segments while they are fixed in unvoiced segments. Due to their better interpolation characteristics, LSF coefficients and R-parameters are employed during pitch and time-scale transformations to represent the vocal tract and glottal source respectively, in addition to aspiration (ANE) and Gaussian (GNE) noise energies.
  • Pitch can be altered by simply multiplying the fundamental period contour by a scaling factor.
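Both modifications share a small resampling core; a sketch with cubic splines (as named in the text):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_trajectory(values, n_frames_new):
    """Resample a per-frame parameter trajectory (frame sizes, R-parameters,
    LSFs, noise energies, ...) to a modified number of frames."""
    x_old = np.linspace(0.0, 1.0, len(values))
    x_new = np.linspace(0.0, 1.0, n_frames_new)
    return CubicSpline(x_old, values)(x_new)

# time-scale by 1.3x: resample every trajectory to round(1.3 * n_frames) frames;
# pitch shift: additionally scale the fundamental period contour by 1/beta
```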
  • PSHM: the sinusoidal VC system used as a baseline for comparison
  • the first section details the speech model and feature transformation techniques employed in the JEAS VC implementation. Objective measurement of its spectral envelope and voice source conversion performance and subjective evaluation of the recognizability and quality of the converted output is presented next.
  • JEAS Voice Conversion: the spectral envelope and glottal waveform transformation methods employed within JEAS voice conversion are described next. While spectral envelope conversion is done in a way similar to the well-known sinusoidal voice conversion implementation, the main advantage of JEAS modelling, i.e. the parameterisation of the voice source, allows the source characteristics to also be transformed to match the target. As well as offering the potential for improved fidelity in the target identity, this also avoids the need for conventional residual prediction methods. In addition, because the JEAS parameterisation does not involve a magnitude and phase division of the spectrum, the artifacts due to converted magnitude and phase mismatches are not produced and, thus, the use of additional techniques, such as phase prediction, is not required.
  • the jointly estimated JEAS all-pole vocal tract filter coefficients (α_1 ... α_p) are converted to Bark-scaled LSF parameters for the transformation of the JEAS spectral envelopes.
  • the linear frequency response of the jointly estimated vocal tract filter is calculated. This is resampled according to the Bark scale using, for example, the well-known cubic spline interpolation technique.
  • the warped all-pole filter coefficients are then computed by applying, for example, the conventional Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum. Then, the filter coefficients are transformed into LSF for conversion.
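A sketch of this warping pipeline; Zwicker's Bark formula and linear interpolation are stand-ins for details the text leaves open, and the Levinson-Durbin step is realised here with a Toeplitz solve:

```python
import numpy as np
from scipy.signal import freqz
from scipy.linalg import solve_toeplitz

def bark_warp_allpole(a, p, fs=16000, nfft=512):
    """Re-fit a p-th order all-pole model to the Bark-warped magnitude
    response of the filter 1/A(z), a = [1, a1, ..., ap]."""
    w, h = freqz([1.0], a, worN=nfft, fs=fs)               # linear-frequency |H|
    bark = 13.0 * np.arctan(7.6e-4 * w) + 3.5 * np.arctan((w / 7500.0) ** 2)
    grid = np.linspace(bark[0], bark[-1], nfft)            # uniform in Bark
    mag = np.interp(grid, bark, np.abs(h))
    power = np.concatenate([mag, mag[-2:0:-1]]) ** 2       # even power spectrum
    r = np.fft.ifft(power).real[:p + 1]                    # autocorrelation
    coeffs = solve_toeplitz(r[:p], r[1:p + 1])             # Yule-Walker solve
    return np.concatenate([[1.0], -coeffs])                # warped A(z)
```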
  • a continuous probabilistic linear transformation function is employed to convert the LSF spectral envelopes.
  • GMMs: Gaussian Mixture Models
  • w_m, μ_m and Σ_m being the weights, means and variances of the GMM components respectively, and N() representing the Normal Distribution.
  • the transformation matrices W_m are estimated using parallel source and target training data and a least square error criterion.
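Applying the trained transform then amounts to posterior-weighted linear maps. In the sketch below the GMM is a scikit-learn model fitted on source vectors, and W (M x p x p) and b (M x p) are assumed to have been estimated beforehand by least squares on aligned source/target pairs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def convert_vector(x, gmm: GaussianMixture, W, b):
    """Continuous probabilistic linear transform F(x) = sum_m P(m|x)(W_m x + b_m)."""
    post = gmm.predict_proba(x[None, :])[0]        # class posteriors P(m|x)
    return sum(post[m] * (W[m] @ x + b[m]) for m in range(len(post)))
```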
  • the new LSF parameters are transformed to all-pole filter coefficients and resampled back to the linear scale before synthesis. Because the use of linear transformations broadens the formants of the converted speech, a perceptual post-filter is applied to narrow the formant bandwidths, deepen the spectral valleys and sharpen the formant peaks.
  • Figure 18 illustrates the JEAS vs. conventional PSHM spectral envelopes, where it can be seen that the PSHM envelopes capture the spectral tilt, while in JEAS it is encoded by the glottal waveforms instead.
  • While both methods manage to represent the most important formants, small differences exist in their amplitudes, frequencies and/or bandwidths.
  • the glottal waveform morphing approach adopted within JEAS voice conversion employs Continuous Probabilistic Linear Transformations to map glottal LF parameters of different modal male and female speakers, which are the most commonly used speaker types in voice conversion applications.
  • Continuous probabilistic linear transformations have been chosen for being the most robust and efficient approach found to convert spectral envelopes.
  • the limitations of the codebook-based conversion methods for envelope transformations, i.e. the discontinuities caused by the use of a discrete number of codebook entries, can also be extrapolated to the modification of glottal waveforms.
  • the use of continuous probabilistic modelling and transformations is expected to achieve better glottal conversions too.
  • the feature vectors employed to convert the glottal source characteristics are derived from the JEAS model parameters linked to the voice source of every pitch period, i.e. the glottal excitation strength E_e and T-parameters (T_p, T_e, T_a, T_c) obtained from the LF fitting procedure, and the energy (ANE) of the aspiration noise estimate used to adjust that of the modelled pitch-synchronous amplitude-modulated Gaussian noise.
  • the described glottal conversion approach is capable of bringing the source feature vector parameter contours closer to the target which, as a consequence, also produces converted glottal waveforms more similar to the target.
  • Figure 19 shows the continuous probabilistic linear transformation of LF glottal waveforms: a) source, target and converted derivative glottal LF waves; b) source, target and converted trajectories of the glottal feature vector parameters.
  • the speech data was recorded using a 'mimicking' approach, which resulted in a natural time-alignment between the identical sentences produced by the different speakers and factored out the prosodic cues of speaker identity to some extent.
  • Glottal closure instants derived from laryngograph signals are also provided for each sentence, and have been used for both PSHM and JEAS pitch synchronous analysis.
  • Four different voice conversion experiments have been investigated: male-to-male (MM), male-to-female (MF), female-to-male (FM) and female-to-female (FF).
  • LSF spectral vectors of order 30 have been employed throughout the conversion experiments, to train 8 linear spectral envelope transforms between each source and target speaker pair using the parallel VOICES training data. This number was chosen because it achieves small spectral distortion ratios while still generalising to the test data. Aligned source-target vector pairs were obtained by applying forced alignment to mark sub-phone boundaries and using Dynamic Time Warping to further constrain their time alignment. For residual and phase prediction, target GMMs with 40 classes and codebooks with 40 entries have been built. Finally, glottal waveform conversions have also been carried out using 8 linear transforms per speaker pair. Objective and subjective evaluations have been used to compare the performance of the two methods.
  • the following distortion ratio R_LSF can be used as an objective measure of how close the source vectors have been converted towards the target: R_LSF = 100 · Σ_t d(lsf_conv(t), lsf_tgt(t)) / Σ_t d(lsf_src(t), lsf_tgt(t)), where d(·,·) denotes the distance between two LSF vectors.
  • lsf_src(t), lsf_tgt(t) and lsf_conv(t) are the source, target and converted LSF vectors respectively, and the summation is computed over the time-aligned test data, L being the total number of test vectors after time alignment. Note that a 100% distortion ratio corresponds to the distortion between the source and the target.
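In code, with Euclidean distance assumed for d and the vectors stacked row-wise over the time-aligned test data:

```python
import numpy as np

def lsf_distortion_ratio(lsf_src, lsf_tgt, lsf_conv):
    """R_LSF in percent: converted-to-target distance over source-to-target
    distance; 100% means the conversion left the source distortion unchanged."""
    num = np.sum(np.linalg.norm(lsf_conv - lsf_tgt, axis=1))
    den = np.sum(np.linalg.norm(lsf_src - lsf_tgt, axis=1))
    return 100.0 * num / den
```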
  • Residual Prediction reintroduces the target spectral details not captured by spectral envelope conversion, bringing as a result the converted speech spectra closer to the target.
  • Glottal Waveform Conversion maps time-domain representations of the glottal waveforms, which in the frequency domain results in better matching glottal formants and spectral tilts of the converted spectra. Whilst the methods differ, their spectral effect is similar, i.e. they aim to reduce the differences between the converted and the target speech spectra.
  • Figure 21 illustrates R_LSD ratios computed for Residual Prediction and Glottal Waveform Conversion on the test set. Results show that both voice source conversion techniques manage to reduce the distortions between the converted and target speech spectra. Residual Prediction performs slightly better, mainly because the algorithm is designed to predict residuals which minimise the log spectral distance represented by R_LSD. In contrast, glottal waveform conversion is trained to minimise the glottal parameter conversion error over the training data and not the log spectral distance. Nevertheless, both methods are successful in bringing the converted spectra close to the target.
  • the first part was an ABX test in which subjects were presented with PSHM-converted (A), JEAS-converted (B) and target (X) utterances, and asked to judge whether A or B sounded closer to X.
  • the second listening test aimed at determining which system produces speech with a higher quality. Subjects were presented with PSHM and JEAS converted speech utterance pairs and asked to choose the one they thought had a better speech quality. Results are illustrated in Figure 23. There is a clear preference for the sentences converted using the JEAS method, chosen 75.7% of the time on average, which stems from the clearly distinguishable quality difference between the PSHM and JEAS transformed samples. Utterances obtained after PSHM conversion have a 'noisy' quality caused by phase discontinuities which still exist despite Phase Prediction. Comparatively, JEAS converted sentences sound much smoother. This quality difference is also thought to have slightly biased the preference for JEAS conversion in the ABX test.
  • the method and device of voice conversion of the present invention are applicable to frameworks requiring voice quality transformations.
  • In particular, their use to repair the deviant voice source characteristics of tracheoesophageal speech can be mentioned.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Auxiliary Devices For Music (AREA)
  • Numerical Control (AREA)
  • Circuit For Audible Band Transducer (AREA)
PCT/EP2008/062502 2008-09-19 2008-09-19 Method and system of voice conversion WO2010031437A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
PCT/EP2008/062502 WO2010031437A1 (en) 2008-09-19 2008-09-19 Method and system of voice conversion
AT08804436T ATE502380T1 (de) 2008-09-19 2008-09-19 Method, device and program code for voice conversion
DE602008005641T DE602008005641D1 (de) 2008-09-19 2008-09-19 Method, device and program code for voice conversion
ES08804436T ES2364005T3 (es) 2008-09-19 2008-09-19 Method, device and computer program code medium for voice conversion
EP08804436A EP2215632B1 (de) 2008-09-19 2008-09-19 Method, device and program code for voice conversion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2008/062502 WO2010031437A1 (en) 2008-09-19 2008-09-19 Method and system of voice conversion

Publications (1)

Publication Number Publication Date
WO2010031437A1 true WO2010031437A1 (en) 2010-03-25

Family

ID=40277465

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2008/062502 WO2010031437A1 (en) 2008-09-19 2008-09-19 Method and system of voice conversion

Country Status (5)

Country Link
EP (1) EP2215632B1 (de)
AT (1) ATE502380T1 (de)
DE (1) DE602008005641D1 (de)
ES (1) ES2364005T3 (de)
WO (1) WO2010031437A1 (de)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901598A (zh) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
ES2364401A1 (es) * 2011-06-27 2011-09-01 Universidad Politécnica de Madrid Method and system for the estimation of physiological parameters of phonation
RU2510954C2 (ru) * 2012-05-18 2014-04-10 Александр Юрьевич Бредихин Method for revoicing audio materials and device for its implementation
WO2020062217A1 (en) * 2018-09-30 2020-04-02 Microsoft Technology Licensing, Llc Speech waveform generation
WO2020174356A1 (en) * 2019-02-25 2020-09-03 Technologies Of Voice Interface Ltd Speech interpretation device and system
CN113780107A (zh) * 2021-08-24 2021-12-10 电信科学技术第五研究所有限公司 Radio signal detection method based on a deep learning dual-input network model

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9607610B2 (en) 2014-07-03 2017-03-28 Google Inc. Devices and methods for noise modulation in a universal vocoder synthesizer
EP3839947A1 (de) 2019-12-20 2021-06-23 SoundHound, Inc. Training a voice morphing apparatus
US11600284B2 (en) 2020-01-11 2023-03-07 Soundhound, Inc. Voice morphing apparatus having adjustable parameters

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008018653A1 (en) * 2006-08-09 2008-02-14 Korea Advanced Institute Of Science And Technology Voice color conversion system using glottal waveform

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008018653A1 (en) * 2006-08-09 2008-02-14 Korea Advanced Institute Of Science And Technology Voice color conversion system using glottal waveform

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHILDERS D G: "Glottal source modeling for voice conversion", SPEECH COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 16, no. 2, 1 February 1995 (1995-02-01), pages 127-138, XP004024955, ISSN: 0167-6393 *
HUI-LING LU ET AL: "Estimating glottal aspiration noise via wavelet thresholding and best basis thresholding", APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 2001 IEEE WORKSHOP, 21-24 OCT. 2001, PISCATAWAY, NJ, USA, IEEE, 21 October 2001 (2001-10-21), pages 11-14, XP010566862, ISBN: 978-0-7803-7126-2 *
HUI-LING LU ET AL: "Joint estimation of vocal tract filter and glottal source waveform via convex optimization", APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 1999 IEEE WORKSHOP, NEW PALTZ, NY, USA, 17-20 OCT. 1999, PISCATAWAY, NJ, USA, IEEE, US, 17 October 1999 (1999-10-17), pages 79-82, XP010365067, ISBN: 978-0-7803-5612-2 *
JUN SUN ET AL: "Modeling Glottal Source for High Quality Voice Conversion", INTELLIGENT CONTROL AND AUTOMATION, 2006. WCICA 2006. THE SIXTH WORLD CONGRESS, DALIAN, CHINA, 21-23 JUNE 2006, PISCATAWAY, NJ, USA, IEEE, vol. 2, 21 June 2006 (2006-06-21), pages 9459-9462, XP010946933, ISBN: 978-1-4244-0332-5 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901598A (zh) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
ES2364401A1 (es) * 2011-06-27 2011-09-01 Universidad Politécnica de Madrid Method and system for the estimation of physiological parameters of phonation
WO2013001109A1 (es) * 2011-06-27 2013-01-03 Universidad Politécnica de Madrid Method and system for the estimation of physiological parameters of phonation
RU2510954C2 (ру) * 2012-05-18 2014-04-10 Александр Юрьевич Бредихин Method for revoicing audio materials and device for its implementation
WO2020062217A1 (en) * 2018-09-30 2020-04-02 Microsoft Technology Licensing, Llc Speech waveform generation
US11869482B2 (en) 2018-09-30 2024-01-09 Microsoft Technology Licensing, Llc Speech waveform generation
WO2020174356A1 (en) * 2019-02-25 2020-09-03 Technologies Of Voice Interface Ltd Speech interpretation device and system
CN113780107A (zh) * 2021-08-24 2021-12-10 电信科学技术第五研究所有限公司 Radio signal detection method based on a deep learning dual-input network model
CN113780107B (zh) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 Radio signal detection method based on a deep learning dual-input network model

Also Published As

Publication number Publication date
EP2215632A1 (de) 2010-08-11
ATE502380T1 (de) 2011-04-15
ES2364005T3 (es) 2011-08-22
EP2215632B1 (de) 2011-03-16
DE602008005641D1 (de) 2011-04-28

Similar Documents

Publication Publication Date Title
EP2215632B1 (de) Method, device and program code for voice conversion
EP2881947B1 (de) Spektrale hüllkurve und gruppenverzögerungsinferenzsystem sowie sprachsignalsynthesesystem für sprachanalyse / synthese
JP4294724B2 (ja) 音声分離装置、音声合成装置および声質変換装置
Erro et al. Voice conversion based on weighted frequency warping
Degottex et al. Phase minimization for glottal model estimation
McCree et al. A mixed excitation LPC vocoder model for low bit rate speech coding
US9031834B2 (en) Speech enhancement techniques on the power spectrum
Akande et al. Estimation of the vocal tract transfer function with application to glottal wave analysis
EP1005021A2 (de) Verfahren und Vorrichtung für die Extraktion von Formant basierten Quellenfilterdaten unter Verwendung einer Kostenfunktion und invertierte Filterung für die Sprachkodierung und Synthese
Cabral et al. Glottal spectral separation for speech synthesis
Harrison Making accurate formant measurements: An empirical investigation of the influence of the measurement tool, analysis settings and speaker on formant measurements
Cabral et al. Glottal spectral separation for parametric speech synthesis
Pantazis et al. Analysis/synthesis of speech based on an adaptive quasi-harmonic plus noise model
Del Pozo et al. The linear transformation of LF glottal waveforms for voice conversion.
Ferreira et al. A holistic glotal phase related feature
Ahmadi et al. A new phase model for sinusoidal transform coding of speech
Tabet et al. Speech analysis and synthesis with a refined adaptive sinusoidal representation
Jelinek et al. Frequency-domain spectral envelope estimation for low rate coding of speech
Kawahara et al. Beyond bandlimited sampling of speech spectral envelope imposed by the harmonic structure of voiced sounds.
Nar et al. Verification of TD-PSOLA for Implementing Voice Modification
Shi et al. A variational EM method for pole-zero modeling of speech with mixed block sparse and Gaussian excitation
Lenarczyk Parametric speech coding framework for voice conversion based on mixed excitation model
Agiomyrgiannakis et al. Towards flexible speech coding for speech synthesis: an LF+ modulated noise vocoder.
Srivastava Fundamentals of linear prediction
Schwardt et al. Voice conversion based on static speaker characteristics

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2008804436

Country of ref document: EP

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08804436

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE