WO2006082287A1 - Procede d'estimation d'une fonction de conversion de voix (Method for estimating a voice conversion function) - Google Patents

Procede d'estimation d'une fonction de conversion de voix (Method for estimating a voice conversion function)

Info

Publication number
WO2006082287A1
WO2006082287A1 PCT/FR2005/003308 FR2005003308W
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speaker
recorded
estimating
conversion
Prior art date
Application number
PCT/FR2005/003308
Other languages
English (en)
French (fr)
Inventor
Olivier Rosec
Taoufik En-Najjary
Original Assignee
France Telecom
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by France Telecom filed Critical France Telecom
Priority to EP05850632A priority Critical patent/EP1846918B1/fr
Priority to DE602005012998T priority patent/DE602005012998D1/de
Publication of WO2006082287A1 publication Critical patent/WO2006082287A1/fr

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing

Definitions

  • The present invention relates to a method for estimating a voice conversion function between, on the one hand, the voice of a speaker defined from a voice message recorded by said speaker and, on the other hand, the voice of a reference speaker defined by a speech synthesis database. It also relates to a method for estimating a voice conversion function between, on the one hand, the voice of a source speaker defined from a first voice message recorded by said source speaker and, on the other hand, the voice of a target speaker defined from a second voice message recorded by said target speaker.
  • The invention finds an advantageous application whenever it is desired to have one speaker reproduce a voice message recorded by another speaker. It is thus possible, for example, to diversify the voices used in speech synthesis systems or, conversely, to anonymize messages recorded by different speakers. The method according to the invention could also be implemented to perform film dubbing.
  • Voice conversion consists in estimating a transformation function, or conversion function, which, applied to the voice of a first speaker defined from a recorded voice message, makes it possible to reproduce as faithfully as possible the voice of a second speaker.
  • Said second speaker may be a reference speaker whose voice is defined by a voice synthesis database, or a so-called "target" speaker whose voice is also defined from a recorded voice message, the first speaker then being called the "source" speaker.
  • The principle of voice conversion consists, in known manner, in a learning operation which aims to estimate a function relating the timbre of the first speaker's voice to that of the second speaker's voice.
  • Two parallel recordings of the two speakers, that is to say recordings containing the same voice message, are necessary.
  • An analysis is conducted on each of the recordings in order to extract parameters representative of the vocal timbre.
  • Many transformation methods based on this principle have been proposed, for example conversion by vector quantization (M. Abe, S. Nakamura, K. Shikano and H. Kuwabara, "Voice conversion through vector quantization", Proc. ICASSP, 1988).
  • A technical problem to be solved by the present invention is to propose a method for estimating a voice conversion function between, on the one hand, the voice of a speaker defined from a voice message recorded by said speaker and, on the other hand, the voice of a reference speaker defined by a voice synthesis database, which would provide converted speech of better quality than that provided by the known methods based on non-parallel corpora.
  • The method according to the invention makes it possible to obtain two parallel recordings of the same voice message, one recorded directly by the speaker, which constitutes in a way the basic message, and the other being a synthetic reproduction of this basic message.
  • The estimation of the sought conversion function is then carried out by a conventional learning operation performed on the two parallel recordings. The different stages of this processing are described in detail below.
  • Two applications of the method according to the invention can be envisaged, namely, on the one hand, an application to the conversion of voice messages recorded by a source speaker into corresponding messages reproduced by said reference speaker and, on the other hand, an application to the conversion of synthetic messages produced by a reference speaker into corresponding messages reproduced by a target speaker.
  • The first application makes it possible to anonymize voice messages recorded by different speakers, since they are all reproduced by the same reference speaker.
  • the second application aims, on the contrary, to diversify the voices used in speech synthesis.
  • the same principle of parallelization of messages via a reference speaker can be applied to the conversion of voice between two speakers according to a method of estimating a voice conversion function between, on the one hand, the voice of a source speaker defined from a first voice message recorded by said source speaker, and, on the other hand, the voice of a target speaker defined from a second voice message recorded by said target speaker, which according to the invention is remarkable in that said method comprises the steps of:
  • Said voice synthesis database is a database of a concatenative speech synthesis system.
  • Said voice synthesis database is a database of a corpus-based speech synthesis system.
  • The acoustic database is not restricted to a dictionary of mono-represented diphones, but contains these same elements recorded in different contexts (grammatical, syntactic, phonemic, phonological or prosodic). Each element thus manipulated, also called a "unit", is a segment of speech characterized by a set of symbolic descriptors relating to the context in which it was recorded.
  • The problem of synthesis then changes radically: it is no longer a matter of distorting the speech signal while degrading the quality of the timbre as little as possible, but rather of having a sufficiently rich database.
  • the selection of units can therefore be likened to a problem of minimizing a cost function composed of two types of metrics: a "target cost” which measures the adequacy of the units with the symbolic parameters resulting from the language processing modules of the system and a "concatenation cost” which accounts for the acoustic compatibility of two consecutive units.
  • The unit selection module generally operates in two steps: first a "pre-selection", which consists in selecting sets of candidate units for each target sequence, then a "final selection", which aims to determine the optimal sequence according to a certain predetermined cost function; a minimal sketch of this final selection is given below.
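  • As an illustration of this final selection step, the following Python sketch performs a Viterbi-style dynamic programming search over pre-selected candidate units; the functions target_cost and concat_cost are hypothetical placeholders standing in for the two metrics described above, not the cost functions of the patented system.

```python
# Minimal sketch of final unit selection by dynamic programming.
# candidates[t] holds the pre-selected units for target position t;
# target_cost(t, u) and concat_cost(v, u) are illustrative placeholders.

def select_units(candidates, target_cost, concat_cost):
    # best[t][j] = (cumulative cost of ending at unit j, best predecessor index)
    best = [[(target_cost(0, u), -1) for u in candidates[0]]]
    for t in range(1, len(candidates)):
        row = []
        for u in candidates[t]:
            cost, prev = min(
                (best[t - 1][j][0] + concat_cost(v, u), j)
                for j, v in enumerate(candidates[t - 1])
            )
            row.append((cost + target_cost(t, u), prev))
        best.append(row)
    # Backtrack from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = best[t][j][1]
    return list(reversed(path))
```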
  • The pre-selection methods are mostly variants of the method called "Context Oriented Clustering" introduced by Nakajima (S. Nakajima and H. Hamada, "Automatic generation of synthesis units based on context oriented clustering", Proc. ICASSP, 1988).
  • Fig. 1 is a block diagram showing the steps of a voice conversion method between a speaker and a reference speaker.
  • Fig. 2 is a block diagram showing the steps of a voice conversion method between a source speaker and a target speaker.
  • Fig. 3 is a diagram of a voice conversion system implementing the estimation method according to the invention.
  • FIG. 1 illustrates a method for estimating a voice conversion function between a speaker and a reference speaker.
  • The voice of said speaker is defined from a recorded voice message, while the voice of said reference speaker is defined from an acoustic database of a concatenative speech synthesis system, preferably a corpus-based one, although a mono-represented diphone synthesis system can also be used.
  • A synthetic recording parallel to the voice message recorded by the speaker is generated from said voice synthesis database.
  • A first block required for this generation is intended to extract, from the recording of the speaker concerned, symbolic information relating to the message contained in said recording.
  • A first type of processing envisaged consists in extracting, in text form, the message delivered in the voice recording. This can be obtained automatically by a speech recognition system, or manually by listening to and transcribing the voice messages.
  • The text thus recognized directly feeds the speech synthesis system 30, thereby generating the desired synthetic reference recording.
  • A prosodic annotation algorithm can be integrated into the method, or a manual annotation phase of the corpus can be considered, in order to take into account melodic markers deemed relevant.
  • The acoustic analysis is carried out, for example, by means of the Harmonic plus Noise Model (HNM), which assumes that a voiced segment (also called a frame) of the speech signal s(n) can be decomposed into a harmonic part h(n), representing the quasi-periodic component of the signal and consisting of a sum of L harmonic sinusoids of amplitudes A_l and phases φ_l, and a noise part b(n), representing the friction noise and the variation of the glottal excitation from one period to another, modeled by an LPC (Linear Prediction Coefficients) filter excited by Gaussian white noise (Y. Stylianou, "Harmonic plus Noise Models for speech, combined with statistical methods, for speech and speaker modification", PhD thesis, Ecole Nationale Superieure des Telecommunications, France, 1996).
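  • In standard HNM notation, this decomposition of a voiced frame can be written as below; this is a reconstruction from the description above, not a formula reproduced from the patent text.

```latex
s(n) = h(n) + b(n), \qquad
h(n) = \sum_{l=1}^{L} A_l \cos\bigl( l \, \omega_0 \, n + \phi_l \bigr)
```

  • Here \omega_0 denotes the fundamental pulsation corresponding to F0, and b(n) is the component obtained by filtering Gaussian white noise through the LPC filter.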
  • the first step of the HNM analysis is to make a decision as to whether the analyzed frame is voiced. This processing is performed in asynchronous mode using an analysis step set at 10 ms.
  • The fundamental frequency F0 and the maximum voicing frequency, that is to say the frequency beyond which the signal is considered to consist solely of noise, are first determined. Then, an analysis synchronized on F0 makes it possible to estimate the parameters of the harmonic part (the amplitudes and the phases) as well as the parameters of the noise.
  • The harmonic parameters are calculated by minimizing a weighted least squares criterion (see the work by Y. Stylianou cited above); its usual form is recalled below.
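  • As a hedged reconstruction consistent with the cited work (the patent's own equation is not reproduced in this text), the criterion can be written:

```latex
\varepsilon = \sum_{n} w^{2}(n) \, \bigl( s(n) - \hat{h}(n) \bigr)^{2}
```

  • where w(n) is an analysis window and \hat{h}(n) the harmonic model of the analyzed frame.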
  • the parts of the spectrum corresponding to noise are modeled using a simple linear prediction.
  • The frequency response of the AR model thus estimated is then sampled at a constant frequency step, which provides an estimate of the spectral envelope over the noisy regions.
  • The parameters modeling this spectral envelope are deduced using the regularized discrete cepstrum method (O. Cappe, E. Moulines, "Regularization techniques for discrete cepstrum estimation", IEEE Signal Processing Letters, vol. 3(4), pp. 100-102, April 1996).
  • the order of cepstral modeling was set at 20.
  • a Bark scale transformation is performed.
  • Other spectral parameterizations, such as Line Spectral Frequencies (LSF) or Log Area Ratios (LAR), may also be used to represent the spectral envelope.
  • The two parallel recordings are then aligned in time, for example by dynamic time warping (DTW).
  • the alignment path can be constrained so as to respect the segmentation marks.
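  • A minimal Python sketch of such an alignment is given below, assuming two sequences of acoustic vectors and a Euclidean local distance; the segmentation-mark constraints mentioned above are omitted for brevity.

```python
import numpy as np

def dtw_path(X, Y):
    """Align two sequences of acoustic vectors (one vector per row) by DTW."""
    X, Y = np.asarray(X), np.asarray(Y)
    n, m = len(X), len(Y)
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # local distances
    D = np.full((n + 1, m + 1), np.inf)  # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return list(reversed(path))
```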
  • a joint classification of the acoustic vectors of the two aligned recordings is performed.
  • N(z; μ, Σ) denotes the probability density of the normal law of mean μ and covariance matrix Σ, and the α_i are the coefficients of the mixture (α_i being the a priori probability that z is generated by the i-th Gaussian).
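  • With these notations, the density of the joint model takes the usual GMM form, written here from the definitions above (Q, the number of Gaussian components, is an assumed notation):

```latex
p(z) = \sum_{i=1}^{Q} \alpha_i \, \mathcal{N}(z; \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{Q} \alpha_i = 1
```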
  • The estimation of the model parameters is carried out by applying a classical iterative procedure, namely the Expectation-Maximization (EM) algorithm (A. P. Dempster, N. M. Laird, D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society B, 39, pp. 1-38, 1977).
  • the determination of the initial parameters of the GMM model is obtained using a standard vector quantization technique.
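  • As a hedged illustration, the scheme just described (vector quantization for initialization, then EM) can be reproduced with scikit-learn's GaussianMixture, whose k-means initialization plays the role of the vector quantization step; this is an analogy, not the patent's implementation, and the data and component count below are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# z: joint acoustic vectors [x | y] built from the two aligned recordings,
# one row per aligned frame pair (placeholder data for this sketch).
rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 40))

gmm = GaussianMixture(
    n_components=8,          # number of Gaussian components (illustrative)
    covariance_type="full",  # full covariance matrices, as in the joint model
    init_params="kmeans",    # vector-quantization-style initialization
    max_iter=100,            # EM iterations
).fit(z)
```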
  • The GMM model can then be used, by regression, to determine a conversion function between the speaker and the reference speaker.
  • For a conversion from a speaker x to a speaker y, it is written in the form recalled below.
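  • The patent's own relation is not reproduced in this text; as a reconstruction, the classical GMM regression function used in the voice conversion literature is:

```latex
F(x) = \sum_{i=1}^{Q} P(C_i \mid x)
\left[ \mu_i^{y} + \Sigma_i^{yx} \bigl( \Sigma_i^{xx} \bigr)^{-1} \bigl( x - \mu_i^{x} \bigr) \right]
```

  • where P(C_i | x) is the posterior probability that the source vector x was generated by the i-th component, and the means and covariances are the blocks of the joint model restricted to the source (x) and target (y) parameters.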
  • FIG. 2 illustrates a method for estimating a voice conversion function between a source speaker and a target speaker whose voices are respectively defined from voice messages recorded by each of the speakers, these recordings being non-parallel.
  • Synthetic reference recordings are generated from said recorded voice messages according to a procedure similar to that just described with reference to FIG. 1.
  • a voice conversion system incorporating the described estimation method is shown in FIG. 3.
  • The analysis step still relies on HNM modeling, but is this time conducted in a pitch-synchronous manner, because this allows pitch and spectral envelope modifications of better quality (see the work by Y. Stylianou cited above).
  • the extracted spectral parameters are then transformed using a conversion module 80 performing the conversion determined by the relation (6).
  • The modified parameters, as well as the residual information necessary for sound generation, are transmitted to an HNM synthesis module.
  • The harmonic component of the signal, defined by equation (2) and present for the voiced signal frames, is generated by summation of previously tabulated sinusoids whose amplitudes are calculated from the converted spectral parameters.
  • the stochastic portion is determined by inverse Fourier Transform (IFFT) on the spectrum calculated from the spectral parameters.
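  • A minimal Python sketch of this generation step is given below, assuming converted amplitudes and phases for the harmonic part and a sampled noise envelope for the stochastic part; all parameter names and frame sizes are illustrative.

```python
import numpy as np

def synthesize_frame(f0, amps, phases, noise_spectrum, fs=16000, frame_len=160):
    """Sketch of HNM frame synthesis: harmonic sum plus IFFT-generated noise."""
    n = np.arange(frame_len)
    # Harmonic part: sum of sinusoids at multiples of F0, with converted
    # amplitudes and phases (cf. the harmonic component described above).
    harm = sum(a * np.cos(2 * np.pi * (l + 1) * f0 / fs * n + p)
               for l, (a, p) in enumerate(zip(amps, phases)))
    # Stochastic part: random-phase spectrum shaped by the noise envelope,
    # brought back to the time domain by inverse FFT.
    phase = np.exp(1j * np.random.uniform(0.0, 2 * np.pi, len(noise_spectrum)))
    noise = np.fft.irfft(noise_spectrum * phase, n=frame_len)
    return harm + noise
```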
  • The HNM model can be replaced by other models known to those skilled in the art, such as linear prediction coding (LPC) models, sinusoidal models, or MBE ("Multi-Band Excited") models.
  • Similarly, the GMM-based classification may be replaced by vector quantization (VQ) or fuzzy vector quantization (Fuzzy VQ).
  • the steps of the method are determined by the instructions of a program for estimating a voice conversion function incorporated in a server, and the method according to the invention is implemented when this program is loaded into a computer whose operation is then controlled by the execution of the program.
  • The invention also applies to a computer program, in particular a computer program on or in an information carrier, adapted to implement the invention.
  • This program can use any programming language, and be in the form of source code, object code, or an intermediate code between source code and object code, such as a partially compiled form, or in any other form desirable for implementing the method according to the invention.
  • the information carrier may be any entity or device capable of storing the program.
  • The medium may comprise storage means, such as a ROM, for example a CD-ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a floppy disk or a hard disk.
  • the information medium may be a transmissible medium such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, by radio or by other means.
  • The program according to the invention can in particular be downloaded over an Internet-type network.
  • the information carrier may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the method in question.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
  • Devices For Executing Special Programs (AREA)
PCT/FR2005/003308 2005-01-31 2005-12-28 Procede d'estimation d'une fonction de conversion de voix WO2006082287A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP05850632A EP1846918B1 (fr) 2005-01-31 2005-12-28 Procede d'estimation d'une fonction de conversion de voix
DE602005012998T DE602005012998D1 (de) 2005-01-31 2005-12-28 Verfahren zur schätzung einer sprachumsetzungsfunktion

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR0550278 2005-01-31
FR0550278 2005-01-31

Publications (1)

Publication Number Publication Date
WO2006082287A1 true WO2006082287A1 (fr) 2006-08-10

Family

ID=34954674

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FR2005/003308 WO2006082287A1 (fr) 2005-01-31 2005-12-28 Procede d'estimation d'une fonction de conversion de voix

Country Status (5)

Country Link
EP (1) EP1846918B1 (es)
AT (1) ATE424022T1 (es)
DE (1) DE602005012998D1 (es)
ES (1) ES2322909T3 (es)
WO (1) WO2006082287A1 (es)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1811497A2 (en) * 2006-01-19 2007-07-25 Kabushiki Kaisha Toshiba Apparatus and method for voice conversion
EP2017832A1 (en) * 2005-12-02 2009-01-21 Asahi Kasei Kogyo Kabushiki Kaisha Voice quality conversion system
WO2018090356A1 (en) * 2016-11-21 2018-05-24 Microsoft Technology Licensing, Llc Automatic dubbing method and apparatus
CN111179902A (zh) * 2020-01-06 2020-05-19 厦门快商通科技股份有限公司 基于高斯模型模拟共鸣腔的语音合成方法、设备及介质

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020173962A1 (en) * 2001-04-06 2002-11-21 International Business Machines Corporation Method for generating pesonalized speech from text

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020173962A1 (en) * 2001-04-06 2002-11-21 International Business Machines Corporation Method for generating pesonalized speech from text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KAIN A ET AL: "Spectral voice conversion for text-to-speech synthesis", ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 1998. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON SEATTLE, WA, USA 12-15 MAY 1998, NEW YORK, NY, USA,IEEE, US, vol. 1, 12 May 1998 (1998-05-12), pages 285 - 288, XP010279123, ISBN: 0-7803-4428-6 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2017832A1 (en) * 2005-12-02 2009-01-21 Asahi Kasei Kogyo Kabushiki Kaisha Voice quality conversion system
EP2017832A4 (en) * 2005-12-02 2009-10-21 Asahi Chemical Ind VOICE QUALITY CONVERSION SYSTEM
US8099282B2 (en) 2005-12-02 2012-01-17 Asahi Kasei Kabushiki Kaisha Voice conversion system
EP1811497A2 (en) * 2006-01-19 2007-07-25 Kabushiki Kaisha Toshiba Apparatus and method for voice conversion
EP1811497A3 (en) * 2006-01-19 2008-06-25 Kabushiki Kaisha Toshiba Apparatus and method for voice conversion
US7580839B2 (en) 2006-01-19 2009-08-25 Kabushiki Kaisha Toshiba Apparatus and method for voice conversion using attribute information
WO2018090356A1 (en) * 2016-11-21 2018-05-24 Microsoft Technology Licensing, Llc Automatic dubbing method and apparatus
US11514885B2 (en) 2016-11-21 2022-11-29 Microsoft Technology Licensing, Llc Automatic dubbing method and apparatus
CN111179902A (zh) * 2020-01-06 2020-05-19 厦门快商通科技股份有限公司 基于高斯模型模拟共鸣腔的语音合成方法、设备及介质
CN111179902B (zh) * 2020-01-06 2022-10-28 厦门快商通科技股份有限公司 基于高斯模型模拟共鸣腔的语音合成方法、设备及介质

Also Published As

Publication number Publication date
EP1846918A1 (fr) 2007-10-24
DE602005012998D1 (de) 2009-04-09
ATE424022T1 (de) 2009-03-15
ES2322909T3 (es) 2009-07-01
EP1846918B1 (fr) 2009-02-25

Similar Documents

Publication Publication Date Title
Ye et al. Quality-enhanced voice morphing using maximum likelihood transformations
EP1730728A1 (fr) Procede et systeme de conversion rapides d'un signal vocal
WO2005106852A1 (fr) Procede et systeme ameliores de conversion d'un signal vocal
Meyer et al. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes
EP1789953B1 (fr) Procede et dispositif de selection d'unites acoustiques et procede et dispositif de synthese vocale
LU88189A1 (fr) Procédés de codage de segments de parole et de controlôle de hauteur de son pour des synthèse de la parole
EP1769489B1 (fr) Procede et systeme de reconnaissance vocale adaptes aux caracteristiques de locuteurs non-natifs
EP1593116A1 (fr) Procede pour le traitement numerique differencie de la voix et de la musique, le filtrage de bruit, la creation d'effets speciaux et dispositif pour la mise en oeuvre dudit procede
EP1944755A1 (fr) Modification d'un signal de parole
Rao et al. Speech processing in mobile environments
EP1606792B1 (fr) Procede d analyse d informations de frequence fondament ale et procede et systeme de conversion de voix mettant en oeuvre un tel procede d analyse
EP1526508B1 (fr) Procédé de sélection d'unités de synthèse
EP1846918B1 (fr) Procede d'estimation d'une fonction de conversion de voix
Kakouros et al. Comparison of spectral tilt measures for sentence prominence in speech—Effects of dimensionality and adverse noise conditions
Csapó et al. Modeling irregular voice in statistical parametric speech synthesis with residual codebook based excitation
Al-Radhi et al. Continuous vocoder applied in deep neural network based voice conversion
Xiao et al. Speech intelligibility enhancement by non-parallel speech style conversion using CWT and iMetricGAN based CycleGAN
Al-Radhi et al. A continuous vocoder using sinusoidal model for statistical parametric speech synthesis
Guennec Study of unit selection text-to-speech synthesis algorithms
Gupta et al. G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost
Bous A neural voice transformation framework for modification of pitch and intensity
Henter et al. Analysing shortcomings of statistical parametric speech synthesis
US11302300B2 (en) Method and apparatus for forced duration in neural speech synthesis
Falk Blind estimation of perceptual quality for modern speech communications
Do et al. Objective evaluation of HMM-based speech synthesis system using Kullback-Leibler divergence

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2005850632

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

WWP Wipo information: published in national office

Ref document number: 2005850632

Country of ref document: EP