EP1730729A1 - Verbessertes sprachsignalumsetzungsverfahren und -system - Google Patents

Verbessertes sprachsignalumsetzungsverfahren und -system (Improved voice signal conversion method and system)

Info

Publication number
EP1730729A1
Authority
EP
European Patent Office
Prior art keywords
speaker
source
fundamental frequency
spectral envelope
converted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP05736936A
Other languages
English (en)
French (fr)
Inventor
Touafik En-Najjary
Olivier Rosec
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange SA
Original Assignee
France Telecom SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by France Telecom SA filed Critical France Telecom SA
Publication of EP1730729A1 publication Critical patent/EP1730729A1/de
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing

Definitions

  • the present invention relates to a method for converting a voice signal spoken by a source speaker into a converted voice signal whose acoustic characteristics resemble those of a target speaker, and to a corresponding voice signal conversion system.
  • in voice conversion applications such as voice services, human-machine spoken dialogue applications or text-to-speech synthesis, the auditory rendering is essential and, in order to obtain acceptable quality, the acoustic parameters of the voice signals must be mastered.
  • the main acoustic or prosodic parameters modified during voice conversion processes are the parameters relating to the spectral envelope and, for voiced sounds involving the vibration of the vocal cords, the parameters relating to their periodic structure, i.e. the fundamental period, the inverse of which is called the fundamental frequency or "pitch".
  • Conventional voice conversion methods are essentially based on modifications of the spectral envelope characteristics and global modifications of the fundamental frequency characteristics.
  • the object of the present invention is to solve these problems by defining a simple and more efficient voice conversion method.
  • the subject of the present invention is a method of converting a voice signal pronounced by a source speaker into a converted voice signal whose acoustic characteristics resemble those of a target speaker, comprising: - the determination of at least one function for transforming the acoustic characteristics of the source speaker into acoustic characteristics close to those of the target speaker, from voice samples of the source and target speakers; and - the transformation of the acoustic characteristics of the voice signal to be converted from the source speaker, by the application of said at least one transformation function; characterized in that said determination comprises the determination of a joint transformation function of characteristics relating to the spectral envelope and characteristics relating to the fundamental frequency of the source speaker, and in that said transformation includes the application of said joint transformation function.
  • the method of the invention allows the simultaneous modification during a single operation of the characteristics of the spectral envelope and of fundamental frequency without creating any dependence between them.
  • - said determination of a joint transformation function comprises: - a step of analyzing the voice samples of the source and target speakers, grouped into frames, to obtain, for each frame of samples of a speaker, information relating to the spectral envelope and to the fundamental frequency; - a step of concatenating the information relating to the spectral envelope and to the fundamental frequency for each of the source and target speakers; - a step of determining a model representing common acoustic characteristics of the voice samples of the source speaker and the target speaker; and - a step of determining, from this model and the voice samples, said joint transformation function; - said steps of analyzing the voice samples of the source and target speakers are adapted to deliver said information relating to the spectral envelope in the form of cepstral coefficients; - said analysis steps each include modeling the voice samples as a sum of a harmonic signal with a noise signal;
  • - Said step of determining a model corresponds to the determination of a mixture model of Gaussian probability densities;
  • - Said step of determining a model comprises: - a sub-step of determining a model corresponding to a mixture of Gaussian probability densities, and - a sub-step of estimating the parameters of the mixture of Gaussian probability densities from the maximum likelihood estimation between the acoustic characteristics of the samples of the source and target speakers and the model;
  • - Said determination of at least one transformation function further comprises a step of normalizing the fundamental frequency of the sample frames of the source and target speakers, respectively, with respect to the averages of the fundamental frequencies of the analyzed samples of the source and target speakers;
  • the method comprises a step of temporal alignment of the acoustic characteristics of the source speaker with the acoustic characteristics of the target speaker, this step being carried out before said step of determining a model;
  • the method comprises a step of separating, in the voice samples, the voiced frames from the unvoiced frames;
  • the subject of the invention is also a system for converting a voice signal pronounced by a source speaker into a converted voice signal whose acoustic characteristics resemble those of a target speaker, comprising: - means for determining at least one function for transforming the acoustic characteristics of the source speaker into acoustic characteristics close to those of the target speaker, from voice samples spoken by the source and target speakers; and - means for transforming the acoustic characteristics of the voice signal to be converted from the source speaker by the application of said at least one transformation function, characterized in that said means for determining at least one transformation function comprise a unit for determining a joint transformation function of characteristics relating to the spectral envelope and characteristics relating to the fundamental frequency of the source speaker, and in that said transformation means comprise means for applying said joint transformation function.
  • this system further comprises: means for analyzing the voice signal to be converted, adapted to output information relating to the spectral envelope and the fundamental frequency of the voice signal to be converted; and - synthesis means making it possible to form a converted voice signal from at least said spectral envelope and fundamental frequency information transformed simultaneously;
  • Said means for determining at least one function for transforming acoustic characteristics further comprise a unit for determining a function for transforming the spectral envelope of the unvoiced frames, said unit for determining the joint transformation function being suitable for determining the joint transformation function only for voiced frames.
  • FIGS. 1A and 1B form a general flow diagram of a first embodiment of the method of the invention;
  • - FIGS. 2A and 2B form a general flow diagram of a second embodiment of the method of the invention;
  • - FIG. 3 is a graph representing an experimental record of the performance of the method of the invention;
  • - FIG. 4 is a block diagram of a system implementing a method according to the invention.
  • Voice conversion involves modifying the voice signal of a reference speaker called the source speaker, so that the signal produced seems to have been spoken by another speaker, called the target speaker.
  • Such a method comprises first of all the determination of functions for transforming acoustic or prosodic characteristics of the voice signals of the source speaker into acoustic characteristics close to those of the voice signals of the target speaker, from voice samples pronounced by the source speaker and the target speaker. More particularly, the determination 1 of transformation functions is performed on voice sample databases corresponding to the acoustic realization of the same phonetic sequences pronounced respectively by the source and target speakers. This determination is designated in FIG. 1A by the general reference numeral 1 and is also commonly called "learning". The method then comprises a transformation of the acoustic characteristics of a voice signal to be converted pronounced by the source speaker, using the function or functions previously determined. This transformation is designated by the general reference numeral 2 in FIG. 1B.
  • the method begins with steps 4X and 4Y for analyzing the vocal samples pronounced respectively by the source and target speakers. These steps make it possible to group the samples by frames, in order to obtain for each frame of samples, information relating to the spectral envelope and information relating to the fundamental frequency.
  • the analysis steps 4X and 4Y are based on the use of a sound signal model in the form of a sum of a harmonic signal with a noise signal, according to a model commonly called "HNM" (Harmonic plus Noise Model).
  • the HNM model includes the modeling of each voice signal frame into a harmonic part representing the periodic component of the signal, consisting of a sum of L harmonic sinusoids of amplitude A_i and phase φ_i, and a noise part representing friction noise and glottal excitation variations.
  • h(n) therefore represents the harmonic approximation of the signal s(n).
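  • As an illustration only (this code is not part of the patent), the harmonic part h(n) described above can be sketched as follows, assuming a frame-local fundamental frequency f0, a sampling rate fs and already-estimated amplitudes A_i and phases φ_i:

```python
import numpy as np

def harmonic_part(f0, fs, amplitudes, phases, n_samples):
    """Harmonic approximation h(n) of one voiced frame: a sum of L sinusoids
    at integer multiples of the fundamental frequency f0 (illustrative sketch)."""
    n = np.arange(n_samples)
    h = np.zeros(n_samples)
    for i, (a_i, phi_i) in enumerate(zip(amplitudes, phases), start=1):
        h += a_i * np.cos(2.0 * np.pi * i * f0 * n / fs + phi_i)
    return h

# the noise part of the HNM model then accounts for the residual s(n) - h(n)
```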
  • the embodiment described is based on a representation of the spectral envelope by the discrete cepstrum.
  • Steps 4X and 4Y comprise sub-steps 8X and 8Y of estimating, for each frame, the fundamental frequency, for example by means of an autocorrelation method.
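  • A minimal autocorrelation-based estimator of the kind alluded to here might look as follows (the 60–400 Hz search range and the voicing threshold are assumptions, not values from the patent):

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame by locating the highest
    normalized autocorrelation peak within a plausible pitch-period range.
    Returns 0.0 for frames deemed unvoiced (weak periodicity)."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if corr[0] <= 0.0:
        return 0.0
    corr = corr / corr[0]                      # normalize so that corr[0] == 1
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(corr) - 1)
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return fs / lag if corr[lag] > 0.3 else 0.0   # 0.3: arbitrary voicing threshold
```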
  • Sub-steps 8X and 8Y are each followed by a sub-step 10X and 10Y of analysis of each frame synchronized on its fundamental frequency, which makes it possible to estimate the parameters of the harmonic part as well as the parameters of the noise signal, and in particular the maximum voicing frequency.
  • this frequency can be arbitrarily fixed or be estimated by other known means.
  • this synchronized analysis corresponds to the determination of the parameters of the harmonics by minimization of a weighted least squares criterion between the complete signal and its harmonic decomposition, the residual corresponding, in the embodiment described, to the estimated noise signal.
  • the criterion, denoted E, is equal to:

E = Σ (from n = −T₀ to T₀) w²(n) · (s(n) − h(n))²

where w(n) is the analysis window and T₀ is the fundamental period of the current frame.
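  • Minimizing this criterion is an ordinary weighted linear least squares problem in the cosine/sine coefficients of the harmonics; a sketch under that assumption (the Hann window is an illustrative choice, not specified by the patent):

```python
import numpy as np

def fit_harmonics(frame, f0, fs, n_harmonics, window=None):
    """Estimate amplitudes A_i and phases phi_i of the harmonics of one voiced
    frame by weighted least squares between the frame s(n) and its harmonic
    reconstruction h(n), weighted by the analysis window w(n)."""
    n = np.arange(len(frame))
    w = np.hanning(len(frame)) if window is None else window
    # design matrix: cosine and sine terms at each harmonic of f0
    cols = []
    for i in range(1, n_harmonics + 1):
        cols.append(np.cos(2.0 * np.pi * i * f0 * n / fs))
        cols.append(np.sin(2.0 * np.pi * i * f0 * n / fs))
    B = np.stack(cols, axis=1)
    # weighted least squares: minimize sum_n w(n)^2 * (s(n) - B @ theta)^2
    theta, *_ = np.linalg.lstsq(w[:, None] * B, w * frame, rcond=None)
    a, b = theta[0::2], theta[1::2]
    amplitudes = np.hypot(a, b)
    phases = np.arctan2(-b, a)   # so that h(n) = sum_i A_i cos(2*pi*i*f0*n/fs + phi_i)
    return amplitudes, phases
```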
  • these analyses are made asynchronously, with a fixed analysis step and a fixed-size window.
  • the analysis steps 4X and 4Y finally include sub-steps 12X and 12Y for estimating the parameters of the spectral envelope of the signals, using for example a regularized discrete cepstrum method and a transformation to the Bark scale in order to reproduce the properties of the human ear as faithfully as possible.
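  • A compact sketch of a discrete-cepstrum fit of this kind, on a Bark-warped frequency axis (the Bark formula and the regularization term below are common choices, not taken from the patent):

```python
import numpy as np

def hz_to_bark(f):
    # one common Bark-scale approximation (other formulas exist)
    return 6.0 * np.arcsinh(np.asarray(f, dtype=float) / 600.0)

def discrete_cepstrum(harmonic_freqs, harmonic_amps, fs, order=20, lam=1e-4):
    """Fit 'order' cepstral coefficients c_0..c_{order-1} to the log harmonic
    amplitudes by regularized least squares on a Bark-warped axis, with the
    envelope model log|S(f_k)| ~ c_0 + 2 * sum_{m>=1} c_m cos(m * omega_k)."""
    omega = np.pi * hz_to_bark(harmonic_freqs) / hz_to_bark(fs / 2.0)
    M = np.ones((len(omega), order))
    for m in range(1, order):
        M[:, m] = 2.0 * np.cos(m * omega)
    target = np.log(np.maximum(harmonic_amps, 1e-10))
    # Tikhonov-style penalty on high-order coefficients (illustrative regularization)
    R = lam * np.diag(np.arange(order, dtype=float) ** 2)
    return np.linalg.solve(M.T @ M + R, M.T @ target)
```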
  • the analysis steps 4X and 4Y respectively deliver, for the voice samples pronounced by the source and target speakers and for each frame of rank n of the speech signals, a scalar denoted F_n representing the fundamental frequency and a vector denoted c_n comprising spectral envelope information in the form of a sequence of cepstral coefficients.
  • the method of calculating cepstral coefficients corresponds to a procedure known from the state of the art and, for this reason, will not be described in more detail.
  • the analysis steps 4X and 4Y are each followed by a step 14X and 14Y of normalization of the fundamental frequency value of each frame with respect, respectively, to the average fundamental frequencies of the source and target speakers, in order to replace, for each frame of voice samples, the fundamental frequency value by a normalized fundamental frequency value according to the following formula:
  • step 16X makes it possible to define, for each frame n, a vector denoted x_n grouping the cepstral coefficients c_x(n) and the normalized fundamental frequency g_x(n) according to the following equation:

x_n = [c_x(n)ᵀ, g_x(n)]ᵀ

where ᵀ designates the transposition operator.
  • step 16Y likewise makes it possible to form, for each frame n, a vector y_n incorporating the cepstral coefficients c_y(n) and the normalized fundamental frequency g_y(n) according to the following equation:

y_n = [c_y(n)ᵀ, g_y(n)]ᵀ
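  • The normalization formula itself is not reproduced in this text; a plausible choice, assumed here purely for illustration, is the log-ratio of each frame's fundamental frequency to the speaker's mean fundamental frequency, concatenated with the cepstral vector:

```python
import numpy as np

def normalize_f0(f0_frames, f0_mean):
    """Illustrative normalization (assumption): log-ratio of each voiced frame's
    fundamental frequency to the speaker's mean fundamental frequency."""
    return np.log(np.asarray(f0_frames, dtype=float) / f0_mean)

def build_joint_vectors(cepstra, f0_frames, f0_mean):
    """x_n = [c_x(n)^T, g_x(n)]^T : cepstral coefficients concatenated with the
    normalized fundamental frequency, one row per voiced frame."""
    g = normalize_f0(f0_frames, f0_mean)
    return np.hstack([np.asarray(cepstra, dtype=float), g[:, None]])

# after conversion, the denormalization of step 42 would invert the same formula
# using the target speaker's mean F0, e.g. f0 = f0_mean_target * np.exp(g_converted)
```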
  • Steps 16X and 16Y are followed by a step 18 of alignment between the source vectors x_n and the target vectors y_n, so as to form a pairing between these vectors, obtained by a conventional dynamic time alignment algorithm known as "DTW" (Dynamic Time Warping).
  • the alignment step 18 is implemented only from the cepstral coefficients without using the fundamental frequency information.
  • the alignment step 18 therefore delivers a pair vector formed of pairs of cepstral coefficients and fundamental frequency information from the source and target speakers, aligned in time.
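  • A minimal DTW pairing between the two cepstral sequences (Euclidean local cost and the standard three-way recursion; an illustrative sketch rather than the patent's exact implementation):

```python
import numpy as np

def dtw_pairs(X, Y):
    """Align two sequences of cepstral vectors (rows of X and Y) and return the
    list of (i, j) frame-index pairs along the optimal warping path."""
    nx, ny = len(X), len(Y)
    cost = np.full((nx + 1, ny + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # backtrack from the end to recover the warping path
    i, j, path = nx, ny, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```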
  • the alignment step 18 is followed by a step 20 of determining a model representing the common acoustic characteristics of the source speaker and the target speaker from the spectral envelope and fundamental frequency information of all the samples analyzed.
  • it is a probabilistic model of the acoustic characteristics of the target speaker and the source speaker, in the form of a mixture of Gaussian probability densities, commonly denoted "GMM", the parameters of which are estimated from the source and target vectors containing, for each speaker, the normalized fundamental frequency and the discrete cepstrum. The probability density of such a mixture is written:

p(z) = Σ (from i = 1 to Q) α_i · N(z; μ_i, Σ_i)

where Q denotes the number of components of the model, N(z; μ_i, Σ_i) is the probability density of the normal distribution with mean μ_i and covariance matrix Σ_i, and the coefficients α_i are the coefficients of the mixture. The coefficient α_i corresponds to the a priori probability that the random variable z is generated by the i-th Gaussian component of the mixture.
  • step 20 of determining the model includes a sub-step 22 of modeling the joint density p(z) of the source vectors denoted x and the target vectors denoted y, the joint variable being defined as z_n = [x_nᵀ, y_nᵀ]ᵀ.
  • Step 20 then comprises a sub-step 24 of estimating the GMM parameters (α, μ, Σ) of the density p(z).
  • This estimation can be carried out, for example, using a classic algorithm known as "EM" (Expectation-Maximisation), which provides a maximum likelihood estimator of these parameters.
  • the initial parameters of the GMM model are determined using a standard vector quantization technique.
  • the model determination step 20 thus delivers the parameters of a mixture of Gaussian densities, representative of the common acoustic characteristics, in particular of the spectral envelope and the fundamental frequency, of the voice samples of the source speaker and of the target speaker.
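  • A sketch of this learning stage using scikit-learn's GaussianMixture, with EM estimation and a k-means initialization standing in for the vector-quantization initialization mentioned above (the component count is illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=64):
    """Fit a GMM p(z) on the joint vectors z_n = [x_n^T, y_n^T]^T, where x_n and
    y_n are the time-aligned source and target vectors (cepstrum + normalized F0)."""
    Z = np.hstack([X, Y])                    # one joint vector per aligned frame pair
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          init_params="kmeans",   # stands in for VQ initialization
                          max_iter=200)
    gmm.fit(Z)                               # EM: maximum-likelihood estimation
    return gmm
```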
  • the method then comprises a step 30 of determining, from the model and the voice samples, a joint transformation function of the fundamental frequency and of the spectral envelope (represented by the cepstrum) of the signal, from the source speaker to the target speaker.
  • step 30 includes a sub-step 32 for determining the conditional expectation of the acoustic characteristics of the target speaker knowing the acoustic characteristic information of the source speaker.
  • h_i(x) corresponds to the posterior probability that the source vector x is generated by the i-th component of the Gaussian mixture model.
  • the determination of the conditional expectation thus makes it possible to obtain the joint transformation function of the spectral envelope and fundamental frequency characteristics between the source speaker and the target speaker. It therefore appears that the analysis method of the invention makes it possible, from the model and the voice samples, to obtain a joint transformation function of the fundamental frequency and spectral envelope acoustic characteristics.
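  • Written out, this conditional expectation is the classical GMM regression form F(x) = Σ_i h_i(x)·[μ_i^y + Σ_i^yx (Σ_i^xx)⁻¹ (x − μ_i^x)]; a sketch operating on a joint GMM such as the one fitted above (variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def convert_vector(x, gmm, dim_x):
    """Conditional expectation E[y | x] under a joint GMM on z = [x, y]:
    F(x) = sum_i h_i(x) * (mu_i^y + Sigma_i^yx @ inv(Sigma_i^xx) @ (x - mu_i^x)),
    where h_i(x) is the posterior probability of component i given x."""
    weights, means, covs = gmm.weights_, gmm.means_, gmm.covariances_
    # posterior probabilities h_i(x), computed from the marginal model of x
    px = np.array([w * multivariate_normal.pdf(x, m[:dim_x], C[:dim_x, :dim_x])
                   for w, m, C in zip(weights, means, covs)])
    h = px / px.sum()
    y = np.zeros(means.shape[1] - dim_x)
    for h_i, m, C in zip(h, means, covs):
        mu_x, mu_y = m[:dim_x], m[dim_x:]
        cov_xx, cov_yx = C[:dim_x, :dim_x], C[dim_x:, :dim_x]
        y += h_i * (mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x))
    return y
```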
  • Referring to FIG. 1B, the conversion method then comprises the transformation 2 of a voice signal to be converted pronounced by the source speaker, which signal may be different from the voice signals used previously.
  • This transformation 2 begins with an analysis step 36 carried out, in the embodiment described, using a decomposition according to the HNM model similar to those carried out in steps 4X and 4Y described previously.
  • This step 36 makes it possible to deliver spectral envelope information in the form of cepstral coefficients, fundamental frequency information as well as phase and maximum voicing frequency information.
  • Step 36 is followed by a step 38 of formatting the acoustic characteristics of the signal to be converted by normalization of the fundamental frequency and concatenation with the cepstral coefficients in order to form a single vector.
  • This single vector is used during a step 40 of transformation of the acoustic characteristics of the voice signal to be converted, by applying the transformation function determined in step 30 to the cepstral coefficients of the signal to be converted obtained in step 36, as well as to the fundamental frequency information.
  • each frame of samples of the signal to be converted from the source speaker is thus associated with spectral envelope and fundamental frequency information transformed simultaneously, whose characteristics are similar to those of the target speaker's samples.
  • the method then comprises a step 42 of denormalization of the transformed fundamental frequency information.
  • F_0[F(x)] corresponds to the denormalized transformed fundamental frequency, F_0avg(y) to the average of the fundamental frequency values of the target speaker, and F[g_x(n)] to the transform of the normalized fundamental frequency of the source speaker.
  • the conversion method then comprises a step 44 of synthesis of the output signal carried out, in the example described, by an HNM-type synthesis which directly delivers the converted voice signal from the transformed spectral envelope and fundamental frequency information delivered by step 40 and the phase and maximum voicing frequency information delivered by step 36.
  • the conversion method implementing the analysis method of the invention thus makes it possible to obtain a voice conversion jointly performing spectral envelope and fundamental frequency modifications, so as to obtain a good quality auditory rendering.
  • Referring to FIG. 2A, we will now describe the general flowchart of a second embodiment of the method of the invention. As before, this method includes the determination 1 of functions for transforming the acoustic characteristics of the source speaker into acoustic characteristics close to those of the target speaker.
  • This determination 1 begins with the implementation of steps 4X and 4Y of analysis of the voice samples spoken respectively by the source speaker and the target speaker. These steps 4X and 4Y are based on the use of the HNM model as described above and each deliver a scalar denoted F(n) representing the fundamental frequency and a vector denoted c(n) comprising spectral envelope information in the form of a sequence of cepstral coefficients.
  • these analysis steps 4X and 4Y are followed by a step 50 of alignment of the vectors of cepstral coefficients resulting from the analysis of the frames of the source speaker and of the frames of the target speaker. This step 50 is implemented by an algorithm such as the DTW algorithm, similarly to step 18 of the first embodiment.
  • the method has a pair vector formed of pairs of cepstral coefficients of the source speaker and the target speaker, aligned in time.
  • This pair vector is also associated with the fundamental frequency information.
  • the alignment step 50 is followed by a step 54 of separation, in the pair vector, of the voiced frames and of the unvoiced frames. Indeed, only the voiced frames have a fundamental frequency, and the sorting can be carried out by considering whether or not fundamental frequency information exists for each pair of the pair vector.
  • This separation step 54 then makes it possible to carry out the determination 56 of a joint transformation function of the spectral envelope and fundamental frequency characteristics of the voiced frames, and the determination 58 of a transformation function of the spectral envelope characteristics only of the unvoiced frames.
  • the determination 56 of a transformation function for the voiced frames begins with steps 60X and 60Y of normalization of the fundamental frequency information for the source and target speakers respectively. These steps 60X and 60Y are carried out in a manner similar to steps 14X and 14Y of the first embodiment and yield, for each voiced frame, the normalized fundamental frequency of the source speaker, denoted g_x(n), and that of the target speaker, denoted g_y(n). These normalization steps 60X and 60Y are each followed by a step 62X and 62Y of concatenation of the cepstral coefficients c_x and c_y of the source speaker and the target speaker respectively with the normalized frequencies g_x and g_y.
  • steps 62X and 62Y are performed in a similar manner to steps 16X and 16Y and make it possible to deliver a vector x_n containing spectral envelope and normalized fundamental frequency information for the voiced frames of the source speaker, and a vector y_n containing spectral envelope and normalized fundamental frequency information for the voiced frames of the target speaker.
  • the alignment between these two vectors is preserved as obtained at the end of step 50, the modifications made during the normalization steps 60X and 60Y and the concatenation steps 62X and 62Y being carried out directly within the vector delivered by the alignment step 50.
  • the method then includes a step 70 of determining a model representing the common characteristics of the source speaker and the target speaker. Unlike step 20 described with reference to FIG. 1A, this step 70 is implemented on the basis of the fundamental frequency and spectral envelope information of the voiced samples only.
  • this step 70 is based on a probabilistic model in the form of a mixture of Gaussian densities, called GMM.
  • Step 70 thus comprises a sub-step 72 of modeling the joint density of the vectors x and y, carried out in a similar manner to sub-step 22 described above.
  • This sub-step 72 is followed by a sub-step 74 of estimating the GMM parameters (α, μ, Σ) of the density p(z).
  • this estimation is carried out using an "EM"-type algorithm, which provides a maximum likelihood estimator between the data of the speech samples and the Gaussian mixture model.
  • Step 70 therefore delivers the parameters of a mixture of Gaussian densities, representative of the common spectral envelope and fundamental frequency acoustic characteristics of the voiced voice samples of the source speaker and the target speaker.
  • Step 70 is followed by a step 80 of determining a joint function for transforming the fundamental frequency and the spectral envelope of the voiced voice samples, from the source speaker to the target speaker.
  • This step 80 is implemented in a similar manner to step 30 of the first embodiment and in particular also includes a sub-step 82 of determining the conditional expectation of the acoustic characteristics of the target speaker knowing the acoustic characteristics of the source speaker, this sub-step being implemented according to the same formulas as above, applied to the voiced samples only.
  • Step 80 thus leads to the obtaining of a joint transformation function of the characteristics of the spectral envelope and of fundamental frequency between the source speaker and the target speaker, applicable to voiced frames.
  • the determination 58 of a transformation function of the spectral envelope characteristics only of the unvoiced frames is also implemented.
  • the determination 58 includes a step 90 of determining a filtering function defined globally on the spectral envelope parameters, from the pairs of unvoiced frames. This step 90 is carried out in a conventional manner by determining a GMM model, or by any other suitable known technique.
  • a function for transforming the spectral envelope characteristics of the unvoiced frames is obtained.
  • the method then comprises the transformation 2 of the acoustic characteristics of a voice signal to be converted.
  • this transformation 2 begins with a step of analysis 36 of the voice signal to be converted carried out according to an HNM model and a step 38 of formatting.
  • these steps 36 and 38 make it possible to deliver, in the form of a single vector, the information of spectral envelope and of normalized fundamental frequency.
  • step 36 also delivers phase and maximum voicing frequency information.
  • step 38 is followed by a step 100 of separating, in the analyzed signal to be converted, voiced frames and unvoiced frames. This separation is carried out using a criterion based on the presence of non-zero fundamental frequency information.
  • Step 100 is followed by a step 102 of transformation of the acoustic characteristics of the voice signal to be converted by the application of the transformation functions determined during steps 80 and 90. More particularly, this step 102 comprises a sub-step 104 of applying the joint transformation function of the spectral envelope and fundamental frequency information, determined in step 80, only to the voiced frames as separated at the end of step 100. At the same time, step 102 comprises a sub-step 106 of applying the transformation function of the spectral envelope information only, determined in step 90, only to the unvoiced frames as separated during step 100.
  • Sub-step 104 thus delivers for each frame of voiced samples of the signal to be converted from the source speaker, spectral envelope and fundamental frequency information transformed simultaneously and whose characteristics are similar to those of voiced samples from the target speaker.
  • Sub-step 106 delivers, for each frame of unvoiced samples of the signal to be converted from the source speaker, transformed spectral envelope information whose characteristics are similar to those of the unvoiced samples of the target speaker.
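  • Schematically, step 102 dispatches each analyzed frame to one of the two functions (F_joint and G_envelope below are hypothetical stand-ins for the functions determined in steps 80 and 90):

```python
def transform_frames(frames, F_joint, G_envelope):
    """Apply the joint (cepstrum + F0) transformation to voiced frames and the
    envelope-only transformation to unvoiced frames, as in sub-steps 104 and 106.
    Each frame is a dict with 'cepstrum' and 'f0' (f0 == 0.0 meaning unvoiced);
    F_joint is assumed to return a dict with transformed 'cepstrum' and 'f0'."""
    converted = []
    for frame in frames:
        if frame["f0"] > 0.0:                          # voiced: joint transformation
            converted.append({"voiced": True, **F_joint(frame)})
        else:                                          # unvoiced: spectral envelope only
            converted.append({"voiced": False,
                              "cepstrum": G_envelope(frame["cepstrum"])})
    return converted
```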
  • the method further comprises a step 108 of denormalizing the transformed fundamental frequency information, implemented on the information delivered by the transformation sub-step 104, in a similar manner to step 42 described with reference to FIG. 1B.
  • the conversion method then comprises a step 110 of synthesis of the output signal carried out, in the example described, by an HNM-type synthesis which delivers the converted voice signal from the transformed spectral envelope and fundamental frequency information as well as the phase and maximum voicing frequency information for the voiced frames, and from the transformed spectral envelope information for the unvoiced frames.
  • the method of the invention therefore makes it possible, in this embodiment, to carry out separate processing on the voiced frames and the unvoiced frames, the voiced frames undergoing a simultaneous transformation of their spectral envelope and fundamental frequency characteristics and the unvoiced frames undergoing a transformation of their spectral envelope characteristics only.
  • Such an embodiment allows a more precise transformation than the previous embodiment while retaining a limited complexity.
  • the efficiency of a conversion process can be assessed from identical voice samples spoken by the source speaker and the target speaker.
  • the voice signal pronounced by the source speaker is converted using the method of the invention and the resemblance of the converted signal with the signal pronounced by the target speaker is evaluated.
  • this resemblance is calculated as a ratio between the acoustic distance separating the converted signal from the target signal and the acoustic distance separating the target signal from the source signal.
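  • For instance, taking a mean cepstral distance as the acoustic distance (an assumption; the patent does not fix the metric at this point), the ratio can be computed on time-aligned frames as:

```python
import numpy as np

def conversion_ratio(converted, target, source):
    """Performance index: distance(converted, target) / distance(target, source),
    computed on time-aligned cepstral vectors (one row per frame); lower is better."""
    d_conv_target = np.mean(np.linalg.norm(converted - target, axis=1))
    d_target_source = np.mean(np.linalg.norm(target - source, axis=1))
    return d_conv_target / d_target_source
```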
  • FIG. 3 represents a graph of results obtained in the case of a conversion from a male voice to a female voice, the transformation functions being obtained from learning databases each containing 5 minutes of speech sampled at 16 kHz, the cepstral vectors used being of size 20 and the GMM model having 64 components.
  • This graph represents on the abscissa the frame numbers and on the ordinate the frequency in hertz of the signal. The results shown are characteristic for voiced frames which extend approximately from frames 20 to 85.
  • the curve Cx represents the fundamental frequency characteristics of the source signal and the curve Cy those of the target signal.
  • the curve Ci represents the fundamental frequency characteristics of a signal obtained by a conventional linear conversion. It appears that this signal has the same general shape as that of the source signal represented by the curve Cx.
  • the curve C_2 represents the fundamental frequency characteristics of a signal converted using the method of the invention as described with reference to FIGS. 2A and 2B. It is clear that the fundamental frequency curve of the signal converted using the method of the invention has a general shape very close to the target fundamental frequency curve Cy.
  • In FIG. 4, a functional block diagram of a voice conversion system implementing the method described with reference to FIGS. 2A and 2B has been represented.
  • This "system uses as input a 120 voice samples database spoken by the source speaker and a database 122 containing at least the same speech samples uttered by the target speaker.
  • These two databases are used by a module 124 for determining functions for transforming the acoustic characteristics of the source speaker into the acoustic characteristics of the target speaker.
  • This module 124 is suitable for the implementation of steps 56 and 58 of the method as described with reference to FIG. 2 and therefore allows the determination of a transformation function of the spectral envelope of the unvoiced frames and of a function of joint transformation of the spectral envelope and the fundamental frequency of the voiced frames.
  • the module 124 includes a unit 126 for determining the joint transformation function of the spectral envelope and the fundamental frequency of the voiced frames, and a unit 128 for determining the transformation function of the spectral envelope of the unvoiced frames.
  • the voice conversion system receives as input a voice signal 130 corresponding to a speech signal spoken by the source speaker and intended to be converted.
  • the signal 130 is introduced into a signal analysis module 132, implementing, for example, an HNM-type decomposition making it possible to dissociate, from the signal 130, spectral envelope information in the form of cepstral coefficients and fundamental frequency information.
  • the module 132 also delivers phase and maximum voicing frequency information obtained by the application of the HNM model.
  • the module 132 therefore implements step 36 of the method described above and advantageously step 38.
  • this analysis can be done beforehand and the information is stored for later use.
  • the system then comprises a module 134 for separating voiced frames and unvoiced frames in the analyzed speech signal to be converted.
  • the voiced frames, separated by the module 134 are transmitted to a transformation module 136 adapted to apply the joint transformation function determined by the unit 126.
  • the transformation module 136 implements step 104 described with reference to FIG. 2B.
  • the module 136 also implements the denormalization step 108.
  • the unvoiced frames, separated by the module 134, are transmitted to a transformation module 138 adapted to apply the transformation function determined by the unit 128 so as to transform the cepstral coefficients of the unvoiced frames.
  • the module 138 for transforming unvoiced frames implements step 106 described in FIG. 2B.
  • the system also includes a synthesis module 140 receiving as input, for the voiced frames the spectral envelope and fundamental frequency information transformed jointly and the phase and maximum voicing frequency information delivered by the module 136.
  • the module 140 also receives the cepstral coefficients of the transformed unvoiced frames delivered by the module 138.
  • the module 140 thus implements step 110 of the method described with reference to FIG. 2B and delivers a signal 150 corresponding to the converted voice signal.
  • the system described can be implemented in various ways, in particular using suitable computer programs connected to sound acquisition hardware.
  • in a variant, the system comprises, in module 124, a single unit for determining a joint transformation function of the spectral envelope and fundamental frequency.
  • in this case, the separation module 134 and the module 138 for applying the transformation function of the unvoiced frames are not necessary.
  • the module 136 then makes it possible to apply the single joint transformation function to all the frames of the voice signal to be converted and delivers the transformed frames to the synthesis module 140.
  • the system is suitable for the implementation of all the steps of the methods described with reference to FIGS. 1 and 2.
  • the system can also be implemented on specific databases in order to form databases of converted signals ready for use.
  • the analysis is done in deferred time and the parameters of the HNM analysis are stored for later use in steps 40 or 100 by the module 134.
  • the method of the invention and the corresponding system can be implemented in real time.
  • the HNM and GMM models can be replaced by other techniques and models known to those skilled in the art.
  • the analysis is carried out using techniques called LPC (Linear Predictive Coding), sinusoidal models or MBE (Multi Band Excited) models, and the spectral parameters are parameters called LSF (Line Spectrum Frequencies), or parameters related to formants or to a glottal signal.
  • the GMM model is replaced by a fuzzy vector quantization ("Fuzzy VQ").
  • the estimator implemented during step 30 is a maximum a posteriori criterion, called "MAP", corresponding to computing the expectation only for the model best representing the source/target vector pair.
  • the determination of a joint transformation function is carried out using a so-called least squares technique instead of the estimation of the joint density described.
  • the determination of a transformation function comprises modeling the probability density of the source vectors using a GMM model and then determining the parameters of the model using an EM algorithm.
  • the modeling thus takes into account speech segments of the source speaker for which no corresponding segments spoken by the target speaker are available.
  • the determination then includes the minimization of a least squares criterion between target and source parameters to obtain the transformation function. It should be noted that the estimator of this function is always expressed in the same way but that the parameters are estimated differently and that additional data are taken into account.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Complex Calculations (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)
EP05736936A 2004-03-31 2005-03-09 Verbessertes sprachsignalumsetzungsverfahren und -system Withdrawn EP1730729A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR0403403A FR2868586A1 (fr) 2004-03-31 2004-03-31 Procede et systeme ameliores de conversion d'un signal vocal
PCT/FR2005/000564 WO2005106852A1 (fr) 2004-03-31 2005-03-09 Procede et systeme ameliores de conversion d'un signal vocal

Publications (1)

Publication Number Publication Date
EP1730729A1 true EP1730729A1 (de) 2006-12-13

Family

ID=34944344

Family Applications (1)

Application Number Title Priority Date Filing Date
EP05736936A Withdrawn EP1730729A1 (de) 2004-03-31 2005-03-09 Verbessertes sprachsignalumsetzungsverfahren und -system

Country Status (4)

Country Link
US (1) US7765101B2 (de)
EP (1) EP1730729A1 (de)
FR (1) FR2868586A1 (de)
WO (1) WO2005106852A1 (de)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006099467A2 (en) * 2005-03-14 2006-09-21 Voxonic, Inc. An automatic donor ranking and selection system and method for voice conversion
JP4241736B2 (ja) * 2006-01-19 2009-03-18 株式会社東芝 音声処理装置及びその方法
US7480641B2 (en) * 2006-04-07 2009-01-20 Nokia Corporation Method, apparatus, mobile terminal and computer program product for providing efficient evaluation of feature transformation
JP4966048B2 (ja) * 2007-02-20 2012-07-04 株式会社東芝 声質変換装置及び音声合成装置
JP5088030B2 (ja) * 2007-07-26 2012-12-05 ヤマハ株式会社 演奏音の類似度を評価する方法、装置およびプログラム
US8224648B2 (en) * 2007-12-28 2012-07-17 Nokia Corporation Hybrid approach in voice conversion
EP3296992B1 (de) * 2008-03-20 2021-09-22 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Vorrichtung und verfahren zur modifizierung einer parameterisierten darstellung
US8140326B2 (en) * 2008-06-06 2012-03-20 Fuji Xerox Co., Ltd. Systems and methods for reducing speech intelligibility while preserving environmental sounds
JP5038995B2 (ja) * 2008-08-25 2012-10-03 株式会社東芝 声質変換装置及び方法、音声合成装置及び方法
ATE557388T1 (de) * 2008-12-19 2012-05-15 Koninkl Philips Electronics Nv Verfahren und system zur anpassung von kommunikation
WO2011004579A1 (ja) * 2009-07-06 2011-01-13 パナソニック株式会社 声質変換装置、音高変換装置および声質変換方法
JP5961950B2 (ja) * 2010-09-15 2016-08-03 ヤマハ株式会社 音声処理装置
US8719930B2 (en) * 2010-10-12 2014-05-06 Sonus Networks, Inc. Real-time network attack detection and mitigation infrastructure
TWI413104B (zh) * 2010-12-22 2013-10-21 Ind Tech Res Inst 可調控式韻律重估測系統與方法及電腦程式產品
US8682670B2 (en) * 2011-07-07 2014-03-25 International Business Machines Corporation Statistical enhancement of speech output from a statistical text-to-speech synthesis system
US9984700B2 (en) * 2011-11-09 2018-05-29 Speech Morphing Systems, Inc. Method for exemplary voice morphing
US9711134B2 (en) * 2011-11-21 2017-07-18 Empire Technology Development Llc Audio interface
JP5772739B2 (ja) * 2012-06-21 2015-09-02 ヤマハ株式会社 音声処理装置
US9922641B1 (en) * 2012-10-01 2018-03-20 Google Llc Cross-lingual speaker adaptation for multi-lingual speech synthesis
US9195656B2 (en) 2013-12-30 2015-11-24 Google Inc. Multilingual prosody generation
JP6271748B2 (ja) 2014-09-17 2018-01-31 株式会社東芝 音声処理装置、音声処理方法及びプログラム
JP6446993B2 (ja) 2014-10-20 2019-01-09 ヤマハ株式会社 音声制御装置およびプログラム
EP3340240B1 (de) * 2015-08-20 2021-04-14 Sony Corporation Informationsverarbeitungsvorrichtung, informationsverarbeitungsverfahren und programm
US20180018973A1 (en) 2016-07-15 2018-01-18 Google Inc. Speaker verification
US10706867B1 (en) * 2017-03-03 2020-07-07 Oben, Inc. Global frequency-warping transformation estimation for voice timbre approximation
US11410684B1 (en) * 2019-06-04 2022-08-09 Amazon Technologies, Inc. Text-to-speech (TTS) processing with transfer of vocal characteristics
WO2021127985A1 (zh) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 语音转换方法、系统、装置及存储介质
CN113643687B (zh) * 2021-07-08 2023-07-18 南京邮电大学 融合DSNet与EDSR网络的非平行多对多语音转换方法

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61252596A (ja) * 1985-05-02 1986-11-10 株式会社日立製作所 文字音声通信方式及び装置
JPH02239292A (ja) * 1989-03-13 1990-09-21 Canon Inc 音声合成装置
IT1229725B (it) * 1989-05-15 1991-09-07 Face Standard Ind Metodo e disposizione strutturale per la differenziazione tra elementi sonori e sordi del parlato
WO1993018505A1 (en) * 1992-03-02 1993-09-16 The Walt Disney Company Voice transformation system
US5504834A (en) * 1993-05-28 1996-04-02 Motrola, Inc. Pitch epoch synchronous linear predictive coding vocoder and method
US5574823A (en) * 1993-06-23 1996-11-12 Her Majesty The Queen In Right Of Canada As Represented By The Minister Of Communications Frequency selective harmonic coding
US5572624A (en) * 1994-01-24 1996-11-05 Kurzweil Applied Intelligence, Inc. Speech recognition system accommodating different sources
ATE277405T1 (de) * 1997-01-27 2004-10-15 Microsoft Corp Stimmumwandlung
US6029124A (en) * 1997-02-21 2000-02-22 Dragon Systems, Inc. Sequential, nonparametric speech recognition and speaker identification
US6041297A (en) * 1997-03-10 2000-03-21 At&T Corp Vocoder for coding speech by using a correlation between spectral magnitudes and candidate excitations
US6336092B1 (en) * 1997-04-28 2002-01-01 Ivl Technologies Ltd Targeted vocal transformation
US6098037A (en) * 1998-05-19 2000-08-01 Texas Instruments Incorporated Formant weighted vector quantization of LPC excitation harmonic spectral amplitudes
US6199036B1 (en) * 1999-08-25 2001-03-06 Nortel Networks Limited Tone detection using pitch period
US6879952B2 (en) * 2000-04-26 2005-04-12 Microsoft Corporation Sound source separation using convolutional mixing and a priori sound source knowledge
US7412377B2 (en) * 2003-12-19 2008-08-12 International Business Machines Corporation Voice model for speech processing based on ordered average ranks of spectral features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2005106852A1 *

Also Published As

Publication number Publication date
WO2005106852A1 (fr) 2005-11-10
US20070208566A1 (en) 2007-09-06
US7765101B2 (en) 2010-07-27
FR2868586A1 (fr) 2005-10-07

Similar Documents

Publication Publication Date Title
WO2005106852A1 (fr) Procede et systeme ameliores de conversion d'un signal vocal
EP1730728A1 (de) Verfahren und system zur schnellen umwandlung eines voice-signals
EP2415047B1 (de) Klassifizieren von in einem Tonsignal enthaltenem Hintergrundrauschen
McLoughlin Line spectral pairs
Geiser et al. Bandwidth extension for hierarchical speech and audio coding in ITU-T Rec. G. 729.1
EP1606792B1 (de) Verfahren zur analyse der grundfrequenz, verfahren und vorrichtung zur sprachkonversion unter dessen verwendung
EP1593116B1 (de) Verfahren zur differenzierten digitalen Sprach- und Musikbearbeitung, Rauschfilterung, Erzeugung von Spezialeffekten und Einrichtung zum Ausführen des Verfahrens
JPH075892A (ja) 音声認識方法
EP3040989B1 (de) Verbessertes trennverfahren und computerprogrammprodukt
US7505950B2 (en) Soft alignment based on a probability of time alignment
EP1526508B1 (de) Verfahren zum Auswählen von Syntheseneinheiten
Lim et al. Robust low rate speech coding based on cloned networks and wavenet
EP1846918B1 (de) Verfahren zur schätzung einer sprachumsetzungsfunktion
US7225124B2 (en) Methods and apparatus for multiple source signal separation
Kato et al. HMM-based speech enhancement using sub-word models and noise adaptation
Gupta et al. A new framework for artificial bandwidth extension using H∞ filtering
FR2627887A1 (fr) Systeme de reconnaissance de parole et procede de formation de modeles pouvant etre utilise dans ce systeme
Srivastava Fundamentals of linear prediction
EP1605440A1 (de) Verfahren zur Quellentrennung eines Signalgemisches
Falk Blind estimation of perceptual quality for modern speech communications
En-Najjary et al. Fast GMM-based voice conversion for text-to-speech synthesis systems.
EP1194923B1 (de) Verfahren und system für audio analyse und synthese
WO2008081141A2 (fr) Codage d'unites acoustiques par interpolation
Grekas On Speaker Interpolation and Speech Conversion for parallel corpora.
WO2002082424A1 (fr) Procede et dispositif d'extraction de parametres acoustiques d'un signal vocal

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20060915

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20110520

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: ORANGE

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20171003