WO2006082287A1 - Method for estimating a voice conversion function (Procede d'estimation d'une fonction de conversion de voix) - Google Patents
Method for estimating a voice conversion function
- Publication number
- WO2006082287A1 (PCT/FR2005/003308)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- speaker
- recorded
- estimating
- conversion
- Prior art date
Links
- 238000006243 chemical reaction Methods 0.000 title claims abstract description 65
- 238000000034 method Methods 0.000 title claims abstract description 56
- 230000015572 biosynthetic process Effects 0.000 claims abstract description 45
- 238000003786 synthesis reaction Methods 0.000 claims abstract description 45
- 230000006870 function Effects 0.000 claims description 41
- 239000000203 mixture Substances 0.000 claims description 7
- 238000004590 computer program Methods 0.000 claims description 3
- 238000012549 training Methods 0.000 abstract description 4
- 230000003595 spectral effect Effects 0.000 description 10
- 239000013598 vector Substances 0.000 description 10
- 230000009466 transformation Effects 0.000 description 8
- 238000004422 calculation algorithm Methods 0.000 description 7
- 238000012545 processing Methods 0.000 description 6
- 230000006978 adaptation Effects 0.000 description 5
- 238000013139 quantization Methods 0.000 description 5
- 238000012986 modification Methods 0.000 description 4
- 230000004048 modification Effects 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 238000001228 spectrum Methods 0.000 description 3
- 238000007476 Maximum Likelihood Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 230000007246 mechanism Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000001360 synchronised effect Effects 0.000 description 2
- 230000001755 vocal effect Effects 0.000 description 2
- 238000003066 decision tree Methods 0.000 description 1
- 230000000593 degrading effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000005284 excitation Effects 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000012417 linear regression Methods 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 238000004377 microelectronic Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000010187 selection method Methods 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 238000011426 transformation method Methods 0.000 description 1
- 230000001052 transient effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Definitions
- the present invention relates to a method for estimating a voice conversion function between, on the one hand, the voice of a speaker defined from a voice message recorded by said speaker and, on the other hand, the voice of a reference speaker defined by a speech synthesis database. It also relates to a method for estimating a voice conversion function between, on the one hand, the voice of a source speaker defined from a first voice message recorded by said source speaker and, on the other hand, the voice of a target speaker defined from a second voice message recorded by said target speaker.
- the invention finds an advantageous application whenever it is desired to have one speaker say a voice message recorded by another speaker. It thus becomes possible, for example, to diversify the voices used in speech synthesis systems or, conversely, to anonymize messages recorded by different speakers. The method according to the invention can also be used for film dubbing.
- voice conversion consists in estimating a transformation function, or conversion function, which, applied to the voice of a first speaker defined from a recorded voice message, makes it possible to reproduce the voice of a second speaker as faithfully as possible.
- said second speaker may be a reference speaker, whose voice is defined by a voice synthesis database, or a so-called "target" speaker, whose voice is also defined from a recorded voice message, the first speaker then being called the "source".
- the principle of voice conversion relies, in known manner, on a training operation that aims to estimate a function linking the timbre of the first speaker's voice to that of the second speaker's voice.
- two parallel recordings of the two speakers, that is to say recordings containing the same voice message, are necessary.
- an analysis is conducted on each of the recordings in order to extract parameters representative of the timbre of the voice.
- Many transformation methods based on this principle have been proposed, for example, conversion by vector quantization (M. Abe, S. Nakamura, K. Shikano and H.
- HMM Hidden Markov Model
- a technical problem that the present invention sets out to solve is to propose a method of estimating a voice conversion function between, on the one hand, the voice of a speaker defined from a voice message recorded by said speaker and, on the other hand, the voice of a reference speaker defined by a voice synthesis database, which would provide converted speech of better quality than that provided by the known methods based on non-parallel corpora.
- the method according to the invention makes it possible to obtain two parallel recordings of the same voice message, one being recorded directly by the speaker, which constitutes in a way the basic message, and the other being a synthetic reproduction of this basic message.
- the estimation of the sought conversion function is then performed by a conventional training operation carried out on the two parallel recordings. The different stages of this processing will be described in detail below.
- two applications of the method according to the invention can be envisaged: on the one hand, the conversion of voice messages recorded by a source speaker into corresponding messages reproduced by said reference speaker and, on the other hand, the conversion of synthetic messages produced by a reference speaker into corresponding messages reproduced by a target speaker.
- the first application makes voice messages recorded by different speakers anonymous, since they are all reproduced by the same reference speaker.
- the second application aims, on the contrary, to diversify the voices used in speech synthesis.
- the same principle of parallelizing messages via a reference speaker can be applied to voice conversion between two speakers, according to a method of estimating a voice conversion function between, on the one hand, the voice of a source speaker defined from a first voice message recorded by said source speaker and, on the other hand, the voice of a target speaker defined from a second voice message recorded by said target speaker, which according to the invention is remarkable in that said method comprises the steps of:
- said voice synthesis database is a database of a concatenative speech synthesis system.
- said voice synthesis database is a database of a corpus-based speech synthesis system.
- the acoustic database is not restricted to a dictionary of mono-represented diphones, but contains these same elements recorded in different contexts (grammatical, syntactic, phonemic, phonological or prosodic). Each element thus manipulated, also called a "unit", is a segment of speech characterized by a set of symbolic descriptors relating to the context in which it was recorded.
- the problematic of the synthesis changes radically: it is no longer a matter of distorting the speech signal while degrading the quality of the timbre as little as possible, but rather of having a sufficiently rich database.
- the selection of units can therefore be likened to the minimization of a cost function composed of two types of metrics: a "target cost", which measures the adequacy of the units to the symbolic parameters produced by the system's language processing modules, and a "concatenation cost", which accounts for the acoustic compatibility of two consecutive units.
- the unit selection module generally operates in two steps: first a "pre-selection", which consists in selecting sets of candidate units for each target sequence, then a "final selection", which aims to determine the optimal sequence according to a predetermined cost function.
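The two-cost selection described above amounts to finding a cheapest path through the lattice of pre-selected candidates, which a Viterbi-style dynamic program solves exactly. The sketch below is an illustration only, not the patent's implementation: the candidate sets and both cost functions are placeholder callables supplied by the caller.

```python
def select_units(candidates, target_cost, concat_cost):
    """Pick one unit per target position, minimizing the summed
    target + concatenation cost with a Viterbi-style dynamic program.

    candidates[t] is the list of candidate units for target position t;
    target_cost(t, u) and concat_cost(u, v) return non-negative floats.
    """
    T = len(candidates)
    # best[t][j] = (cost of the best path ending in candidates[t][j], backpointer)
    best = [[(target_cost(0, u), -1) for u in candidates[0]]]
    for t in range(1, T):
        row = []
        for u in candidates[t]:
            c, j = min((best[t - 1][k][0] + concat_cost(v, u), k)
                       for k, v in enumerate(candidates[t - 1]))
            row.append((c + target_cost(t, u), j))
        best.append(row)
    # Backtrack from the cheapest final state.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for t in range(T - 1, 0, -1):
        j = best[t][j][1]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]
```

In a real system the "units" would be speech segments with symbolic descriptors, and the two metrics would combine several weighted sub-costs.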
- the pre-selection methods are mostly variants of the method called "Context Oriented Clustering" introduced by Nakajima (S. Nakajima and H.
- FIG. 1 is a block diagram showing the steps of a voice conversion method between a speaker and a reference speaker.
- FIG. 2 is a block diagram showing the steps of a voice conversion method between a source speaker and a target speaker.
- FIG. 3 is a diagram of a voice conversion system implementing the estimation method according to the invention.
- FIG. 1 illustrates a method for estimating a voice conversion function between a speaker and a reference speaker.
- the voice of said speaker is defined from a recorded voice message, while the voice of said reference speaker is defined from an acoustic database of a concatenative speech synthesis system, preferably corpus-based, although a mono-represented diphone synthesis system can also be used.
- a synthetic recording parallel to the voice message recorded by the speaker is generated from said voice synthesis database.
- a first block required for this generation is intended to extract, from the recording of the speaker concerned, symbolic information relating to the message contained in said recording.
- a first type of processing envisaged consists in extracting the delivered message in text form from the voice recording. This can be obtained automatically by a voice recognition system, or manually by listening to and transcribing the voice messages.
- the text thus recognized directly feeds the speech synthesis system 30, thereby generating the desired reference synthetic recording.
- HMM Hidden Markov Model
- a prosodic annotation algorithm can be integrated into the method, or a manual annotation phase of the corpus can be considered, to take into account melodic markers deemed relevant.
- the acoustic analysis is carried out, for example, by means of the Harmonic plus Noise Model (HNM), which assumes that a voiced segment (also called a frame) of the speech signal s(n) can be decomposed into a harmonic part h(n), representing the quasi-periodic component of the signal and consisting of a sum of L harmonic sinusoids of amplitudes A_l and phases φ_l, and a noise part b(n), representing the friction noise and the variation of the glottal excitation from one period to the next, modeled by an LPC (Linear Prediction Coefficients) filter excited by Gaussian white noise (Y. Stylianou, "Harmonic plus Noise Models for speech, combined with statistical methods, for speech and speaker modification", PhD thesis, Ecole Nationale Supérieure des Télécommunications, France, 1996).
- HNM Harmonic plus Noise Model
- the first step of the HNM analysis is to decide whether the analyzed frame is voiced. This processing is performed asynchronously, with an analysis step set at 10 ms.
- the fundamental frequency F0 and the maximum voicing frequency, that is to say the frequency beyond which the signal is considered to consist solely of noise, are determined first. Then an analysis synchronized on F0 makes it possible to estimate the parameters of the harmonic part (the amplitudes and the phases) as well as the parameters of the noise.
- the harmonic parameters are calculated by minimizing a weighted least squares criterion (see the article by Y. Stylianou cited above):
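The weighted least-squares criterion itself is not reproduced in this text. A minimal sketch of the idea follows, assuming a real-valued harmonic model and a Hanning weighting window (both are assumptions, not the patent's exact choices): on one frame, the amplitudes A_l and phases φ_l of the L harmonics of F0 are recovered by solving a linear least-squares problem in cos/sin coefficients.

```python
import numpy as np

def harmonic_ls(frame, f0, fs, L, weights=None):
    """Estimate amplitudes/phases of the first L harmonics of f0 by
    weighted least squares on one analysis frame (a sketch of the HNM
    analysis step; window choice is an assumption)."""
    n = np.arange(len(frame)) - len(frame) // 2  # centre the time axis
    w = np.hanning(len(frame)) if weights is None else weights
    # Design matrix: one cos/sin column pair per harmonic of f0.
    cols = []
    for l in range(1, L + 1):
        omega = 2 * np.pi * l * f0 / fs
        cols.append(np.cos(omega * n))
        cols.append(np.sin(omega * n))
    M = np.column_stack(cols)
    # Weighted LS: minimize || sqrt(w) * (M x - frame) ||^2
    sw = np.sqrt(w)
    x, *_ = np.linalg.lstsq(M * sw[:, None], frame * sw, rcond=None)
    a, b = x[0::2], x[1::2]          # frame ≈ sum a_l cos(ωln) + b_l sin(ωln)
    amps = np.hypot(a, b)            # harmonic amplitudes A_l
    phases = np.arctan2(-b, a)       # harmonic phases φ_l
    return amps, phases
```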
- the parts of the spectrum corresponding to noise are modeled using a simple linear prediction.
- the frequency response of the AR model thus estimated is then sampled at a constant step, which provides an estimate of the spectral envelope in the noisy regions.
- the parameters modeling this spectral envelope are deduced using the regularized discrete cepstrum method (O. Cappé, E. Moulines, "Regularization techniques for discrete cepstrum estimation", IEEE Signal Processing Letters, vol. 3(4), pp. 100-102, April 1996).
- the order of the cepstral modeling was set to 20.
- a Bark scale transformation is performed.
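The patent does not specify which Hz-to-Bark formula is used; a common choice is the Zwicker & Terhardt approximation, sketched here as an assumption:

```python
import numpy as np

def hz_to_bark(f):
    """Hz -> Bark using the Zwicker & Terhardt approximation
    (one common formula; the patent's exact mapping is not given)."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)
```

Warping the cepstral frequency axis this way gives perceptually motivated spacing: finer resolution at low frequencies, coarser at high frequencies.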
- LSF Line Spectral Frequency
- LAR Log Area Ratio
- DTW Dynamic Time Warping (dynamic alignment)
- the alignment path can be constrained so as to respect the segmentation marks.
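The DTW alignment of the two parallel recordings can be sketched as follows (Euclidean local distance; the segmentation-mark constraints mentioned above are omitted for brevity):

```python
import numpy as np

def dtw_path(X, Y):
    """Dynamic time warping between two sequences of acoustic vectors
    (rows of X and Y); returns the optimal warping path as a list of
    (i, j) index pairs."""
    nx, ny = len(X), len(Y)
    D = np.full((nx + 1, ny + 1), np.inf)   # cumulative-cost matrix
    D[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            d = np.linalg.norm(np.asarray(X[i - 1], float) - np.asarray(Y[j - 1], float))
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (nx, ny) to (0, 0) along the cheapest predecessor.
    i, j, path = nx, ny, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1, j - 1], i - 1, j - 1),
                      (D[i - 1, j], i - 1, j),
                      (D[i, j - 1], i, j - 1))
    path.reverse()
    return path
```

Constraining the path to pass through known segmentation marks amounts to forbidding cells of D outside the allowed corridor (setting them to infinity).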
- a joint classification of the acoustic vectors of the two aligned recordings is performed.
- N(z; μ, Σ) is the probability density of the normal law with mean μ and covariance matrix Σ, and αi are the coefficients of the mixture (αi is the a priori probability that z was generated by the i-th Gaussian).
- the estimation of the model parameters is carried out by applying a classical iterative procedure, namely the Expectation-Maximization (EM) algorithm (A. P. Dempster, N. M. Laird, D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society B, 39, pp. 1-38, 1977).
- EM Expectation-Maximization
- the determination of the initial parameters of the GMM model is obtained using a standard vector quantization technique.
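The training step described above (vector-quantization initialization followed by EM) can be sketched generically on the joint vectors z = [x; y]; this is an illustration, not the patent's exact procedure, and it assumes no mixture component ever empties during the VQ pass.

```python
import numpy as np

def _npdf(Z, m, C):
    # Multivariate normal density evaluated at every row of Z.
    d = Z.shape[1]
    Zc = Z - m
    e = np.einsum('ij,jk,ik->i', Zc, np.linalg.inv(C), Zc)
    return np.exp(-0.5 * e) / np.sqrt((2 * np.pi) ** d * np.linalg.det(C))

def fit_gmm(Z, K, iters=30):
    """Fit a K-component full-covariance GMM to the rows of Z with EM,
    initialized by a crude VQ (farthest-point seeding + k-means)."""
    n, d = Z.shape
    # --- VQ initialization ---
    mu = [Z[0]]
    for _ in range(K - 1):
        dist = np.min(((Z[:, None, :] - np.array(mu)[None]) ** 2).sum(-1), axis=1)
        mu.append(Z[np.argmax(dist)])
    mu = np.array(mu)
    for _ in range(10):
        lbl = np.argmin(((Z[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        mu = np.array([Z[lbl == k].mean(axis=0) for k in range(K)])
    alpha = np.bincount(lbl, minlength=K) / n
    cov = np.array([np.cov(Z[lbl == k].T) + 1e-6 * np.eye(d) for k in range(K)])
    # --- EM iterations ---
    for _ in range(iters):
        # E-step: responsibilities g[i, k] ∝ alpha_k N(z_i; mu_k, cov_k)
        g = np.stack([alpha[k] * _npdf(Z, mu[k], cov[k]) for k in range(K)], axis=1)
        g /= g.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances
        Nk = g.sum(axis=0)
        alpha = Nk / n
        mu = (g.T @ Z) / Nk[:, None]
        for k in range(K):
            Zc = Z - mu[k]
            cov[k] = (g[:, k, None] * Zc).T @ Zc / Nk[k] + 1e-6 * np.eye(d)
    return alpha, mu, cov
```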
- the GMM model can then be used, by regression, to determine a conversion function between the speaker and the reference speaker.
- for a conversion from a speaker x to a speaker y, it is written in the form:
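The conversion relation itself is not reproduced in this text. The classical GMM-regression form used in the voice-conversion literature (Stylianou; Kain) is sketched below as an assumption, given the means and covariance blocks of the joint GMM: each component contributes an affine prediction, weighted by its posterior probability given x.

```python
import numpy as np

def _gauss(x, m, C):
    # Normal density N(x; m, C) at a single point x.
    d = len(m)
    xc = x - m
    q = xc @ np.linalg.solve(C, xc)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** d * np.linalg.det(C))

def convert(x, alpha, mu_x, mu_y, S_xx, S_yx):
    """Map a source vector x to the target space with the classical
    GMM-regression conversion function (a standard form, assumed here
    since the patent's relation (6) is not reproduced in this text).

    alpha: (K,) mixture weights; mu_x, mu_y: (K, d) means;
    S_xx, S_yx: (K, d, d) source and cross covariance blocks.
    """
    K = len(alpha)
    # Posterior probability that component i generated x.
    w = np.array([alpha[i] * _gauss(x, mu_x[i], S_xx[i]) for i in range(K)])
    w = w / w.sum()
    # Posterior-weighted sum of per-component affine predictions.
    y = np.zeros(mu_y.shape[1])
    for i in range(K):
        y += w[i] * (mu_y[i] + S_yx[i] @ np.linalg.solve(S_xx[i], x - mu_x[i]))
    return y
```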
- FIG. 2 illustrates a method for estimating a voice conversion function between a source speaker and a target speaker whose voices are respectively defined from voice messages recorded by each of the speakers, these recordings being non-parallel.
- synthetic reference recordings are generated from said recorded voice messages according to a procedure similar to that just described with reference to FIG. 1.
- a voice conversion system incorporating the described estimation method is shown in FIG. 3.
- the analysis step still relies on HNM modeling, but this time it is conducted in a pitch-synchronous manner, because this allows pitch and spectral envelope modifications of better quality (see the article by Y. Stylianou cited above).
- the extracted spectral parameters are then transformed using a conversion module 80 performing the conversion determined by the relation (6).
- the modified parameters, as well as the residual information necessary for sound generation, are transmitted to an HNM synthesis module.
- the harmonic component of the signal, defined by equation (2) and present for the voiced signal frames, is generated by summing previously tabulated sinusoids whose amplitudes are calculated from the converted spectral parameters.
- the stochastic part is determined by an inverse Fourier transform (IFFT) of the spectrum calculated from the spectral parameters.
- IFFT inverse Fourier Transform
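The synthesis step described above (a sum of sinusoids for the harmonic part, an IFFT of a noise-shaped spectrum for the stochastic part) can be sketched as follows; windowing and overlap-add details of the actual system are omitted, and the random-phase noise generation is an assumption.

```python
import numpy as np

def synth_frame(amps, phases, f0, fs, N, noise_spectrum=None):
    """Generate one HNM frame of N samples: harmonic part as a sum of
    sinusoids at multiples of f0, plus a stochastic part obtained by
    inverse FFT of a magnitude spectrum with random phases (sketch)."""
    n = np.arange(N)
    # Harmonic part: L sinusoids at harmonics of f0.
    h = np.zeros(N)
    for l, (A, phi) in enumerate(zip(amps, phases), start=1):
        h += A * np.cos(2 * np.pi * l * f0 / fs * n + phi)
    # Stochastic part: random-phase IFFT of the given magnitude spectrum.
    b = np.zeros(N)
    if noise_spectrum is not None:
        rng = np.random.default_rng(0)
        ph = np.exp(1j * rng.uniform(0, 2 * np.pi, len(noise_spectrum)))
        b = np.fft.irfft(noise_spectrum * ph, n=N)
    return h + b
```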
- the HNM model can be replaced by other models known to those skilled in the art, such as linear prediction coding (LPC) models, sinusoidal models or MBE ("Multi-Band Excited") models.
- LPC linear prediction coding
- MBE Multi-Band Excited
- VQ vector quantization
- Fuzzy VQ fuzzy vector quantization
- the steps of the method are determined by the instructions of a program for estimating a voice conversion function incorporated in a server, and the method according to the invention is implemented when this program is loaded into a computer whose operation is then controlled by the execution of the program.
- the invention also applies to a computer program, in particular a computer program on or in an information carrier, adapted to implement the invention.
- This program can use any programming language, and be in the form of source code, object code, or intermediate code between source code and object code such as in a partially compiled form, or in any other form desirable to implement the method according to the invention.
- the information carrier may be any entity or device capable of storing the program.
- the medium may comprise storage means, such as a ROM, for example a CD-ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a floppy disk or a hard disk.
- the information medium may be a transmissible medium such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, by radio or by other means.
- the program according to the invention can in particular be downloaded over an Internet-type network.
- the information carrier may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the method in question.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP05850632A EP1846918B1 (fr) | 2005-01-31 | 2005-12-28 | Procede d'estimation d'une fonction de conversion de voix |
DE602005012998T DE602005012998D1 (de) | 2005-01-31 | 2005-12-28 | Verfahren zur schätzung einer sprachumsetzungsfunktion |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0550278 | 2005-01-31 | ||
FR0550278 | 2005-01-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006082287A1 true WO2006082287A1 (fr) | 2006-08-10 |
Family
ID=34954674
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/FR2005/003308 WO2006082287A1 (fr) | 2005-01-31 | 2005-12-28 | Procede d'estimation d'une fonction de conversion de voix |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1846918B1 (es) |
AT (1) | ATE424022T1 (es) |
DE (1) | DE602005012998D1 (es) |
ES (1) | ES2322909T3 (es) |
WO (1) | WO2006082287A1 (es) |
-
2005
- 2005-12-28 ES ES05850632T patent/ES2322909T3/es active Active
- 2005-12-28 EP EP05850632A patent/EP1846918B1/fr not_active Not-in-force
- 2005-12-28 DE DE602005012998T patent/DE602005012998D1/de active Active
- 2005-12-28 AT AT05850632T patent/ATE424022T1/de not_active IP Right Cessation
- 2005-12-28 WO PCT/FR2005/003308 patent/WO2006082287A1/fr active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020173962A1 (en) * | 2001-04-06 | 2002-11-21 | International Business Machines Corporation | Method for generating pesonalized speech from text |
Non-Patent Citations (1)
Title |
---|
KAIN A ET AL: "Spectral voice conversion for text-to-speech synthesis", ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 1998. PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON SEATTLE, WA, USA 12-15 MAY 1998, NEW YORK, NY, USA,IEEE, US, vol. 1, 12 May 1998 (1998-05-12), pages 285 - 288, XP010279123, ISBN: 0-7803-4428-6 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2017832A1 (en) * | 2005-12-02 | 2009-01-21 | Asahi Kasei Kogyo Kabushiki Kaisha | Voice quality conversion system |
EP2017832A4 (en) * | 2005-12-02 | 2009-10-21 | Asahi Chemical Ind | VOICE QUALITY CONVERSION SYSTEM |
US8099282B2 (en) | 2005-12-02 | 2012-01-17 | Asahi Kasei Kabushiki Kaisha | Voice conversion system |
EP1811497A2 (en) * | 2006-01-19 | 2007-07-25 | Kabushiki Kaisha Toshiba | Apparatus and method for voice conversion |
EP1811497A3 (en) * | 2006-01-19 | 2008-06-25 | Kabushiki Kaisha Toshiba | Apparatus and method for voice conversion |
US7580839B2 (en) | 2006-01-19 | 2009-08-25 | Kabushiki Kaisha Toshiba | Apparatus and method for voice conversion using attribute information |
WO2018090356A1 (en) * | 2016-11-21 | 2018-05-24 | Microsoft Technology Licensing, Llc | Automatic dubbing method and apparatus |
US11514885B2 (en) | 2016-11-21 | 2022-11-29 | Microsoft Technology Licensing, Llc | Automatic dubbing method and apparatus |
CN111179902A (zh) * | 2020-01-06 | 2020-05-19 | 厦门快商通科技股份有限公司 | 基于高斯模型模拟共鸣腔的语音合成方法、设备及介质 |
CN111179902B (zh) * | 2020-01-06 | 2022-10-28 | 厦门快商通科技股份有限公司 | 基于高斯模型模拟共鸣腔的语音合成方法、设备及介质 |
Also Published As
Publication number | Publication date |
---|---|
EP1846918A1 (fr) | 2007-10-24 |
DE602005012998D1 (de) | 2009-04-09 |
ATE424022T1 (de) | 2009-03-15 |
ES2322909T3 (es) | 2009-07-01 |
EP1846918B1 (fr) | 2009-02-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ye et al. | Quality-enhanced voice morphing using maximum likelihood transformations | |
EP1730728A1 (fr) | Procede et systeme de conversion rapides d'un signal vocal | |
WO2005106852A1 (fr) | Procede et systeme ameliores de conversion d'un signal vocal | |
Meyer et al. | Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes | |
EP1789953B1 (fr) | Procede et dispositif de selection d'unites acoustiques et procede et dispositif de synthese vocale | |
LU88189A1 (fr) | Procédés de codage de segments de parole et de controlôle de hauteur de son pour des synthèse de la parole | |
EP1769489B1 (fr) | Procede et systeme de reconnaissance vocale adaptes aux caracteristiques de locuteurs non-natifs | |
EP1593116A1 (fr) | Procede pour le traitement numerique differencie de la voix et de la musique, le filtrage de bruit, la creation d'effets speciaux et dispositif pour la mise en oeuvre dudit procede | |
EP1944755A1 (fr) | Modification d'un signal de parole | |
Rao et al. | Speech processing in mobile environments | |
EP1606792B1 (fr) | Procede d analyse d informations de frequence fondament ale et procede et systeme de conversion de voix mettant en oeuvre un tel procede d analyse | |
EP1526508B1 (fr) | Procédé de sélection d'unités de synthèse | |
EP1846918B1 (fr) | Procede d'estimation d'une fonction de conversion de voix | |
Kakouros et al. | Comparison of spectral tilt measures for sentence prominence in speech—Effects of dimensionality and adverse noise conditions | |
Csapó et al. | Modeling irregular voice in statistical parametric speech synthesis with residual codebook based excitation | |
Al-Radhi et al. | Continuous vocoder applied in deep neural network based voice conversion | |
Xiao et al. | Speech intelligibility enhancement by non-parallel speech style conversion using CWT and iMetricGAN based CycleGAN | |
Al-Radhi et al. | A continuous vocoder using sinusoidal model for statistical parametric speech synthesis | |
Guennec | Study of unit selection text-to-speech synthesis algorithms | |
Gupta et al. | G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost | |
Bous | A neural voice transformation framework for modification of pitch and intensity | |
Henter et al. | Analysing shortcomings of statistical parametric speech synthesis | |
US11302300B2 (en) | Method and apparatus for forced duration in neural speech synthesis | |
Falk | Blind estimation of perceptual quality for modern speech communications | |
Do et al. | Objective evaluation of HMM-based speech synthesis system using Kullback-Leibler divergence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2005850632 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWP | Wipo information: published in national office |
Ref document number: 2005850632 Country of ref document: EP |