CN104123933A - Self-adaptive non-parallel training based voice conversion method - Google Patents
- Publication number: CN104123933A
- Application number: CN201410377091.0A
- Authority: CN (China)
- Legal status: Pending (an assumption; Google has not performed a legal analysis)
Abstract
The invention discloses a voice conversion method based on adaptive non-parallel training. The method includes the steps of: detecting the effective speech signal in the collected speech samples and preprocessing it; extracting speech characteristic parameters from the preprocessed effective speech signal; performing universal background model (UBM) training on the characteristic parameters to obtain a speaker-independent UBM; obtaining independent, speaker-dependent speech models from the UBM, and obtaining the transfer functions of the spectral and fundamental-frequency parameters from those models; inputting the characteristic parameters of the speech to be converted into the transfer functions to obtain the converted characteristic parameters of the target speaker; and synthesizing the converted characteristic parameters to obtain the target speech. The method not only achieves good conversion performance but also offers good system extensibility.
Description
Technical field
The present invention relates to the fields of speech signal analysis, speech signal processing, voice conversion, and speech synthesis, and specifically to a voice conversion method based on adaptive non-parallel training, belonging to the voice conversion branch of the speech signal processing field.
Background technology
Voice conversion changes a speaker's personal characteristics while keeping the semantic content unchanged, so that the source speaker's voice sounds, after conversion, as if it were spoken by the target speaker. Voice conversion is a deepening of speech synthesis and recognition technology, and as a new branch of speech signal processing it has both high theoretical research value and broad application prospects. It draws on knowledge from speech analysis and synthesis, speech recognition, speech coding, speech enhancement, and speaker verification and identification, all of which provide technical support for its development; research on voice conversion in turn promotes progress in these fields and offers valuable reference for their further study.
At present, voice conversion can be divided into two broad classes: conversion within the same language and conversion across languages. For conversion within the same language, depending on the corpus chosen in the training stage, it is further divided into parallel-corpus training and non-parallel-corpus training. For cross-lingual conversion a parallel corpus cannot be obtained, so only non-parallel training is possible. Through several generations of effort, voice conversion research has developed considerably, and many scholars have proposed different conversion methods, roughly falling into the following classes: vector quantization, multivariate linear regression, artificial neural networks, multi-speaker interpolation, Gaussian mixture models, and so on. However, these methods are all based on joint training over a parallel corpus, which raises several practical problems: (1) in many situations a parallel corpus is very difficult or impossible to obtain; (2) training on joint feature vectors is computationally expensive and places high accuracy demands on the alignment of phonetic units; (3) the joint speech model, trained jointly, makes system expansion inconvenient and inflexible.
To address these problems, researchers have in recent years studied voice conversion under non-parallel corpora, but most of these methods only remove the parallel-corpus restriction while still adopting joint training, and thus fail to solve the second and third problems. For example, Mouchtaris et al., in "Nonparallel training for voice conversion based on a parameter adaptation approach" (IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, 2006), used a parameter adaptation method to convert the spectral envelope; Tao Jianhua et al., in "Supervisory Data Alignment for Text-Independent Voice Conversion" (IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, 2010), proposed a supervised data-alignment method to realize conversion on non-parallel corpora; Ling-Hui Chen et al., in "Non-Parallel Training For Voice Conversion Based On FT-GMM" (IEEE International Conference on Acoustics, Speech and Signal Processing, 2011), studied non-parallel training with a feature-transformed Gaussian mixture model (FT-GMM); and Daojian Zeng et al., in "Voice Conversion Using Structured Gaussian Mixture Model" (2010 IEEE 10th International Conference on Signal Processing), used a structured Gaussian mixture model to achieve voice conversion based on independent speaker models.
Because parallel-corpus conversion methods are subject to the above constraints, voice conversion technology has found it difficult to move fully into practical application. If independent speaker models could be obtained by non-parallel training, the source speaker's personal characteristic parameters could be removed and the target speaker's added, realizing source-to-target conversion; this would be a great contribution to the development of the voice conversion field.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a new voice conversion method trained on a non-parallel corpus, solving the following problems of parallel-corpus joint-training methods: (1) traditional conversion systems require parallel-corpus training to obtain the transfer function, and parallel corpora are difficult to obtain; (2) traditional systems must jointly train the feature vectors; (3) traditional systems are inconvenient to extend.
The method of the invention first extracts the fundamental frequency and short-time spectrum of all speech signals and derives the corresponding LPCC parameters from the short-time spectrum; it then trains a universal background model (UBM) on all characteristic parameters, derives speaker-specific models by maximum a posteriori (MAP) adaptation, and finally obtains the corresponding transfer functions to perform the conversion.
Specifically, the adaptive non-parallel training voice conversion method proposed by the invention comprises the following steps:
Step 1: detect the effective speech signal in the collected speech samples and preprocess it;
Step 2: extract speech characteristic parameters from the preprocessed effective speech signal;
Step 3: perform UBM training on the characteristic parameters to obtain a speaker-independent UBM;
Step 4: from the UBM, obtain the independent, speaker-dependent speech models, and from these models obtain the transfer functions of the spectral and fundamental-frequency parameters;
Step 5: input the characteristic parameters of the speech to be converted into the transfer functions obtained in step 4 to get the converted characteristic parameters of the target speaker;
Step 6: synthesize the converted characteristic parameters of the target speaker to obtain the target speech.
Compared with the prior art, the invention has the following advantages:
Traditional conversion methods mostly use a parallel corpus to train a joint source-target speech model and derive the conversion function from it; but in practice a fully parallel corpus is hard to obtain, training the joint model consumes a large amount of computation, and the system is inconvenient to extend. The invention avoids the harsh corpus requirements of parallel training: it trains and converts on a non-parallel corpus, needs no joint training, and extends flexibly.
Accompanying drawing explanation
Fig. 1 is a flowchart of the adaptive non-parallel training voice conversion method of the invention;
Fig. 2 is a schematic diagram of the derivation of the spectral-parameter transfer function of the invention.
Embodiment
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is a flowchart of the adaptive non-parallel training voice conversion method adopted by the invention. As shown in Fig. 1, the method comprises the following steps.
Step 1: detect the effective speech signal in the collected speech samples and preprocess it.
In an embodiment of the invention, the preprocessing includes pre-emphasis, Hamming windowing, framing, and similar operations.
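The preprocessing chain of step 1 (pre-emphasis, framing, Hamming windowing) can be sketched as follows. This is an illustrative NumPy sketch, not the patented implementation; the function name, frame length (400 samples), and hop size (160 samples) are assumptions, chosen as typical values for a 16 kHz signal.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing (step 1 preprocessing)."""
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to each frame
    return frames * np.hamming(frame_len)
```

The output is a matrix of windowed frames ready for the feature extraction of step 2.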
Step 2: extract speech characteristic parameters from the preprocessed effective speech signal.
The characteristic parameters may be the fundamental frequency, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), line spectral pairs (LSP), or similar parameters.
In an embodiment of the invention, the per-frame fundamental frequency F0 and short-time spectral parameters of every effective speech signal are obtained with the STRAIGHT platform; from the short-time spectral parameters, the LPC coefficients of each frame are computed with the Levinson-Durbin algorithm and then converted to LPCC coefficients, yielding the characteristic parameters of every speaker participating in training. The F0 model used to obtain the fundamental frequency is described by a single Gaussian distribution.
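The LPC-to-LPCC conversion mentioned above follows the standard cepstral recursion $c_1 = a_1$, $c_n = a_n + \sum_{k=1}^{n-1} (k/n)\, c_k\, a_{n-k}$. A minimal sketch is below; the function name is hypothetical, and the LPC coefficients are assumed to come from the Levinson-Durbin step with the $1 - \sum_k a_k z^{-k}$ sign convention.

```python
import numpy as np

def lpc_to_lpcc(a, n_cep=None):
    """Convert LPC coefficients a[1..p] (prediction polynomial
    1 - sum_k a_k z^-k) to LPCC cepstral coefficients c[1..n_cep]."""
    p = len(a)
    n_cep = n_cep or p
    c = np.zeros(n_cep)
    for n in range(1, n_cep + 1):
        # Direct LPC term exists only up to order p
        acc = a[n - 1] if n <= p else 0.0
        # Recursive cepstral term: sum over previous coefficients
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

For a single-pole model $1/(1 - a z^{-1})$ the cepstrum has the closed form $c_n = a^n / n$, which is a convenient sanity check on the recursion.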
Step 3: perform UBM training on the characteristic parameters to obtain a speaker-independent UBM.
In this step, before UBM training, the differences between speakers and the sizes of each speaker's training corpus are first balanced; the characteristic parameters of all training speech are then merged, and the UBM is trained with the EM algorithm. In the initial UBM, the weight of each component is initialized to 1/M, where M is the number of Gaussian components in the UBM.
The UBM (universal background model) is a global background model independent of any speaker. It is in essence a large Gaussian mixture model (GMM), generally trained on a large amount of speech from many speakers. The idea is to capture all speakers' information in the supervector formed by the mixture of Gaussian density functions; the model reflects the statistical average distribution of all speakers' vocal characteristics, thereby eliminating personal characteristics. As a reference model, the UBM covers multiple subspaces: each subspace corresponds to a cluster center, is described by a Gaussian probability density function, and represents a part of the feature space.
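The UBM training described above can be sketched as a diagonal-covariance GMM fitted with EM on the pooled feature frames. This is a minimal illustrative implementation, not the patent's: the weights start at 1/M as the description states, but the initialization of the means from random frames, the regularization constants, and the function name are assumptions.

```python
import numpy as np

def train_ubm(features, n_components=4, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to pooled feature frames with EM
    (a minimal stand-in for UBM training; weights start at 1/M)."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    w = np.full(n_components, 1.0 / n_components)   # initial weights 1/M
    mu = features[rng.choice(n, n_components, replace=False)]
    var = np.tile(features.var(axis=0), (n_components, 1)) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities from log Gaussian densities
        logp = (-0.5 * (((features[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, diagonal variances
        nk = resp.sum(axis=0) + 1e-12
        w = nk / n
        mu = (resp.T @ features) / nk[:, None]
        var = (resp.T @ (features ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

In practice a UBM uses many components (often 64 or more) and far more data; the tiny sizes here are only for illustration.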
Step 4: from the UBM, obtain the independent, speaker-dependent speech models, and from these models obtain the transfer functions of the spectral and fundamental-frequency parameters.
Step 4 further comprises the following steps:
Step 41: preprocess the source and target speakers' training speech separately;
In an embodiment of the invention, the preprocessing includes pre-emphasis, Hamming windowing, framing, and similar operations.
Step 42: extract both speakers' LPCC and fundamental-frequency parameters;
Step 43: based on the LPCC parameters, obtain the source and target speakers' GMMs from the UBM;
In an embodiment of the invention, the source and target speakers' GMMs are obtained from the UBM by MAP adaptation.
Each speaker's GMM is described by mean vectors, covariance matrices, and mixture weights:

$$p(x \mid \lambda) = \sum_{i=1}^{M} \omega_i\, p_i(x \mid \mu_i, \Sigma_i), \qquad \sum_{i=1}^{M} \omega_i = 1$$

where $\omega_i$ is the mixture weight of component $i$, $\mu_i$ its mean vector, $\Sigma_i$ its covariance matrix, and $M$ the order of the GMM; the probability density of an $M$-order GMM is the weighted sum of $M$ Gaussian probability densities.
Suppose training has produced a UBM $\lambda = \{\omega_i, \mu_i, \Sigma_i\}$, and a given speaker's feature vectors are $X = \{x_1, \dots, x_t, \dots, x_T\}$. The concrete steps for obtaining the source and target speakers' GMMs from the UBM by MAP adaptation are as follows.
First, compute the posterior weight of each Gaussian component of the GMM:

$$\Pr(i \mid x_t) = \frac{\omega_i\, p_i(x_t \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{M} \omega_j\, p_j(x_t \mid \mu_j, \Sigma_j)}$$

where $x_t$ is the feature vector of frame $t$; $\omega_i$ and $\omega_j$ are the weights of Gaussian components $i$ and $j$; and $p_i(x_t \mid \mu_i, \Sigma_i)$ and $p_j(x_t \mid \mu_j, \Sigma_j)$ are the corresponding Gaussian densities with mean vectors $\mu_i, \mu_j$ and covariance matrices $\Sigma_i, \Sigma_j$.
Then, use the weights $\Pr(i \mid x_t)$ and the feature vectors $x_t$ to compute the statistics used to update the means and variances:

$$n_i = \sum_{t=1}^{T} \Pr(i \mid x_t), \qquad E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t, \qquad E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t)\, x_t^2$$

where $T$ is the length (number of frames) of the training vector sequence.
Then, use the statistics $n_i$ together with the old UBM parameters to update the mean and variance of each Gaussian component $i$, which yields the source and target speakers' GMMs.
The formulas that use $n_i$ and the old UBM parameters to update the mean and variance of each Gaussian component $i$ are:

$$\hat{\mu}_i = \alpha_i E_i(x) + (1 - \alpha_i)\, \mu_i, \qquad \hat{\sigma}_i^2 = \alpha_i E_i(x^2) + (1 - \alpha_i)\,(\sigma_i^2 + \mu_i^2) - \hat{\mu}_i^2, \qquad \alpha_i = \frac{n_i}{n_i + \tau}$$

where $\hat{\mu}_i$ is the updated mean of Gaussian component $i$, $\hat{\sigma}_i^2$ its updated variance, $\sigma_i^2$ the variance of the original component $i$, and $\tau$ a fixed relevance factor; in an embodiment of the invention, $\tau = 20$.
Thus, speaker-dependent GMMs are obtained from the trained UBM by adaptation. Because every model is adapted from the same UBM baseline, each component of each model corresponds to the same component of the UBM, so the components of these models are automatically aligned in order. Model conversion therefore reduces to conversion between corresponding Gaussian components, from which the transfer functions of the spectral parameters and the fundamental frequency can be derived.
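The MAP adaptation above (posteriors, statistics $n_i$, $E_i(x)$, $E_i(x^2)$, then interpolation with relevance factor $\tau = 20$) can be sketched for diagonal covariances as follows; the function name and numerical-stability constants are assumptions.

```python
import numpy as np

def map_adapt(w, mu, var, X, tau=20.0):
    """MAP-adapt UBM means and (diagonal) variances to one speaker's
    frames X, with relevance factor tau = 20 as in the description."""
    # Posterior Pr(i | x_t) of each UBM component for each frame
    logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    # Sufficient statistics n_i, E_i[x], E_i[x^2]
    n_i = post.sum(axis=0) + 1e-12
    Ex = (post.T @ X) / n_i[:, None]
    Ex2 = (post.T @ (X ** 2)) / n_i[:, None]
    # Interpolate new statistics with the old UBM parameters
    alpha = (n_i / (n_i + tau))[:, None]
    mu_new = alpha * Ex + (1 - alpha) * mu
    var_new = alpha * Ex2 + (1 - alpha) * (var + mu ** 2) - mu_new ** 2
    return mu_new, np.maximum(var_new, 1e-6)
```

As $\tau$ grows the adapted model stays close to the UBM; as a component accumulates more data ($n_i \gg \tau$), its parameters move toward the speaker's statistics.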
Step 44: compute the mean and variance of the fundamental-frequency parameters and model them with a single Gaussian;
Step 45: from the GMMs obtained in step 43 and the fundamental-frequency model obtained in step 44, obtain the transfer functions of the spectral and fundamental-frequency parameters.
In this step, the transfer function of the spectral parameters is derived as follows.
Fig. 2 depicts the relation between a pair of Gaussian components of the source and target speakers' personal characteristics, denoted $(\mu^s, \sigma^s)$ and $(\mu^t, \sigma^t)$ respectively; $X$ is the source speaker's spectral parameter to be converted and $Y$ the converted spectral parameter of the target speaker. From Fig. 2 one can derive:

$$\frac{X - \mu^s}{\sigma^s} = \frac{Y - \mu^t}{\sigma^t}$$

and therefore:

$$Y = \mu^t + \frac{\sigma^t}{\sigma^s}\,(X - \mu^s)$$

Taking the weighted sum over all Gaussian components, the transfer function of the spectral parameters is expressed as:

$$F(X) = \sum_{i=1}^{Q} p_i(X)\left[\mu_i^t + (\Sigma_i^t)^{1/2}\,(\Sigma_i^s)^{-1/2}\,(X - \mu_i^s)\right]$$

where $p_i(X)$ is the posterior probability of the $i$-th Gaussian component of the source speaker's GMM, $Q$ is the number of Gaussian components, $\mu_i^s$ and $\Sigma_i^s$ are the mean and covariance matrix of the $i$-th component of the source speaker's GMM, and $\mu_i^t$ and $\Sigma_i^t$ those of the $i$-th component of the target speaker's GMM.
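The per-frame spectral transfer function, a posterior-weighted sum of per-component mean/variance normalizations, can be sketched as follows, assuming diagonal covariances so the covariance ratio reduces to an element-wise standard-deviation ratio; the function name is hypothetical.

```python
import numpy as np

def convert_spectrum(x, w_s, mu_s, var_s, mu_t, var_t):
    """Convert one spectral-feature frame x: weighted sum over components
    of mu_t_i + (sigma_t_i / sigma_s_i) * (x - mu_s_i), with weights
    p_i(x) taken as posteriors under the source GMM (diagonal covs)."""
    # Posterior p_i(x) of each source-GMM component
    logp = (-0.5 * (((x - mu_s) ** 2) / var_s
                    + np.log(2 * np.pi * var_s)).sum(-1) + np.log(w_s))
    logp -= logp.max()
    p = np.exp(logp)
    p /= p.sum()
    # Per-component normalization from source stats, denormalization to target
    scale = np.sqrt(var_t / var_s)
    return (p[:, None] * (mu_t + scale * (x - mu_s))).sum(axis=0)
```

Because the source and target components are MAP-adapted from the same UBM, component $i$ of the source model pairs with component $i$ of the target model with no explicit alignment step.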
In this step, the transfer function of the fundamental-frequency parameter is obtained by the Gaussian transformation method, which assumes that both the source and target speakers' fundamental frequencies are normally distributed. The transfer function is expressed as:

$$f_0^{conv} = \mu_t + \frac{\sigma_t}{\sigma_s}\,\bigl(f_0^{src} - \mu_s\bigr)$$

where $\mu_s$ and $\mu_t$ are the means of the source and target speakers' fundamental frequencies, $\sigma_s$ and $\sigma_t$ their standard deviations, and $f_0^{src}$ is the fundamental frequency of the source speaker's speech.
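The Gaussian F0 transformation can be sketched as below. As an assumption, it is applied to log-F0 (a common choice when modeling pitch with a single Gaussian; the patent text does not specify the domain), and unvoiced frames marked F0 = 0 are passed through unchanged.

```python
import numpy as np

def convert_f0(f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Gaussian-normalized F0 conversion, here in the log domain:
    normalize by the source mean/std, denormalize with the target's."""
    f0 = np.asarray(f0_src, dtype=float)
    out = np.zeros_like(f0)
    voiced = f0 > 0                  # unvoiced frames (F0 = 0) pass through
    z = (np.log(f0[voiced]) - mu_s) / sigma_s
    out[voiced] = np.exp(mu_t + sigma_t * z)
    return out
```

When the source and target statistics coincide the transform is the identity, which is a quick sanity check.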
Step 5: input the characteristic parameters of the speech to be converted into the transfer functions obtained in step 4 to get the converted characteristic parameters of the target speaker.
This step further comprises the following steps:
Step 51: extract the short-time spectrum and fundamental frequency F0 of the source speaker's speech to be converted;
In an embodiment of the invention, the short-time spectrum and F0 of the source speech are extracted with STRAIGHT.
Step 52: extract the LPCC parameters from the short-time spectral envelope;
Step 53: convert the source speaker's LPCC parameters and F0 with the spectral-parameter and fundamental-frequency transfer functions respectively, obtaining the target speaker's LPCC and fundamental-frequency parameters.
Step 6: synthesize the converted characteristic parameters of the target speaker to obtain the target speech.
This step further comprises the following steps:
Step 61: re-estimate the target speaker's short-time spectral envelope from the converted LPCC parameters;
Step 62: combine the short-time spectral envelope with the converted fundamental frequency F0 to obtain speech with the target speaker's characteristics.
In step 62, the short-time spectral envelope and the converted F0 are synthesized with the STRAIGHT platform.
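Since STRAIGHT is an external analysis/synthesis platform, a simple stand-in for the envelope re-estimation of step 61 is the minimum-phase cepstral expansion $\log|H(\omega)| = \sum_n c_n \cos(\omega n)$ (gain term omitted). This sketch is an assumption about one way to realize the step, not the patented synthesis:

```python
import numpy as np

def lpcc_to_envelope(c, n_fft=512):
    """Re-estimate a log-magnitude spectral envelope from LPCC
    coefficients c[1..N]: log|H(w)| = sum_n c_n cos(w n)."""
    omega = np.linspace(0, np.pi, n_fft // 2 + 1)   # frequency grid
    n = np.arange(1, len(c) + 1)
    return c @ np.cos(np.outer(n, omega))           # one value per bin
```

The resulting envelope, exponentiated and combined with the converted F0 excitation, would feed the final waveform synthesis.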
The specific embodiments described above further explain the object, technical solution, and beneficial effects of the present invention. It should be understood that the foregoing are only specific embodiments of the invention and do not limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
1. the phonetics transfer method based on the non-parallel training of self-adaptation, is characterized in that, the method comprises the following steps:
Step 1 detects effective voice signal from the speech samples collecting, and described efficient voice signal is carried out to pre-service;
Step 2, for the efficient voice signal extraction speech characteristic parameter obtaining after pre-service;
Step 3, carries out UBM training based on described speech characteristic parameter, obtain one with the irrelevant UBM model of speaker;
Step 4, based on described UBM model, obtains the independent speaker speech model relevant with speaker, based on described independent speaker's speech model, obtains the transfer function of frequency spectrum parameter and base frequency parameters;
Step 5, is input to the speech characteristic parameter of voice to be converted in the transfer function that described step 4 obtains the speech characteristic parameter of the target speaker after being changed;
Step 6, synthesizes the speech characteristic parameter of the target speaker after conversion, obtains target voice.
2. The method according to claim 1, characterized in that the preprocessing includes but is not limited to pre-emphasis, Hamming windowing, and framing.
3. The method according to claim 1, characterized in that the speech characteristic parameters include but are not limited to the fundamental frequency, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and line spectral pairs (LSP).
4. The method according to claim 1, characterized in that in step 2 the fundamental frequency F0 and short-time spectral parameters of each frame of effective speech are first obtained; the LPC coefficients of each frame are then computed from the short-time spectral parameters; and the LPC coefficients are then converted to LPCC coefficients.
5. The method according to claim 1, characterized in that in step 3, before UBM training, the differences between speakers and the sizes of each speaker's training corpus are first balanced; the characteristic parameters of all training speech are then merged, and the UBM is trained with the EM algorithm.
6. The method according to claim 1, characterized in that step 4 further comprises the following steps:
Step 41: preprocess the source and target speakers' training speech separately;
Step 42: extract both speakers' LPCC and fundamental-frequency parameters;
Step 43: based on the LPCC parameters, obtain the source and target speakers' GMMs from the UBM;
Step 44: compute the mean and variance of the fundamental-frequency parameters and model them with a single Gaussian;
Step 45: from the GMMs obtained in step 43 and the fundamental-frequency model obtained in step 44, obtain the transfer functions of the spectral and fundamental-frequency parameters.
7. The method according to claim 6, characterized in that the source and target speakers' GMMs are obtained from the UBM by MAP adaptation.
8. The method according to claim 6, characterized in that the transfer function of the spectral parameters is expressed as:

$$F(X) = \sum_{i=1}^{Q} p_i(X)\left[\mu_i^t + (\Sigma_i^t)^{1/2}\,(\Sigma_i^s)^{-1/2}\,(X - \mu_i^s)\right]$$

where $p_i(X)$ is the posterior probability of the $i$-th Gaussian component of the source speaker's GMM, $Q$ is the number of Gaussian components, $\mu_i^s$ and $\Sigma_i^s$ are the mean and covariance matrix of the $i$-th component of the source speaker's GMM, and $\mu_i^t$ and $\Sigma_i^t$ those of the $i$-th component of the target speaker's GMM; and the transfer function of the fundamental-frequency parameter is expressed as:

$$f_0^{conv} = \mu_t + \frac{\sigma_t}{\sigma_s}\,\bigl(f_0^{src} - \mu_s\bigr)$$

where $\mu_s$ and $\mu_t$ are the means of the source and target speakers' fundamental frequencies, $\sigma_s$ and $\sigma_t$ their standard deviations, and $f_0^{src}$ is the fundamental frequency of the source speaker's speech.
9. The method according to claim 1, characterized in that step 5 further comprises the following steps:
Step 51: extract the short-time spectrum and fundamental frequency F0 of the source speaker's speech to be converted;
Step 52: extract the LPCC parameters from the short-time spectral envelope;
Step 53: convert the source speaker's LPCC parameters and F0 with the spectral-parameter and fundamental-frequency transfer functions respectively, obtaining the target speaker's LPCC and fundamental-frequency parameters.
10. The method according to claim 1, characterized in that step 6 further comprises the following steps:
Step 61: re-estimate the target speaker's short-time spectral envelope from the converted LPCC parameters;
Step 62: combine the short-time spectral envelope with the converted fundamental frequency F0 to obtain speech with the target speaker's characteristics.
Priority Applications (1)
- CN201410377091.0A — filed 2014-08-01 — Self-adaptive non-parallel training based voice conversion method
Publications (1)
- CN104123933A — published 2014-10-29 — status: Pending
Family ID: 51769323
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104464744A (en) * | 2014-11-19 | 2015-03-25 | 河海大学常州校区 | Cluster voice transforming method and system based on mixture Gaussian random process |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | 科大讯飞股份有限公司 | Sound conversion method and sound conversion device |
CN105895080A (en) * | 2016-03-30 | 2016-08-24 | 乐视控股(北京)有限公司 | Voice recognition model training method, speaker type recognition method and device |
CN106448673A (en) * | 2016-09-18 | 2017-02-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Chinese electrolarynx speech conversion method |
CN107507619A (en) * | 2017-09-11 | 2017-12-22 | 厦门美图之家科技有限公司 | Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing |
CN108108357A (en) * | 2018-01-12 | 2018-06-01 | 京东方科技集团股份有限公司 | Accent conversion method and device, electronic equipment |
CN108766465A (en) * | 2018-06-06 | 2018-11-06 | 华中师范大学 | A kind of digital audio based on ENF universal background models distorts blind checking method |
CN108777140A (en) * | 2018-04-27 | 2018-11-09 | 南京邮电大学 | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus |
WO2018223796A1 (en) * | 2017-06-07 | 2018-12-13 | 腾讯科技(深圳)有限公司 | Speech recognition method, storage medium, and speech recognition device |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder |
CN109377986A (en) * | 2018-11-29 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of non-parallel corpus voice personalization conversion method |
CN109584893A (en) * | 2018-12-26 | 2019-04-05 | 南京邮电大学 | Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector |
CN110060690A (en) * | 2019-04-04 | 2019-07-26 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN and ResNet |
CN110164414A (en) * | 2018-11-30 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Method of speech processing, device and smart machine |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751921A (en) * | 2009-12-16 | 2010-06-23 | Nanjing University of Posts and Telecommunications | Real-time voice conversion method under conditions of minimal amount of training data |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | Nanjing University of Posts and Telecommunications | Method for voice conversion under non-parallel text conditions |
CN103021418A (en) * | 2012-12-13 | 2013-04-03 | Nanjing University of Posts and Telecommunications | Voice conversion method oriented to multi-time-scale prosodic features |
CN103280224A (en) * | 2013-04-24 | 2013-09-04 | Southeast University | Voice conversion method based on an adaptive algorithm under asymmetric corpus conditions |
- 2014-08-01: Application CN201410377091.0A filed in China (CN); published as CN104123933A; status: Pending
Non-Patent Citations (1)
Title |
---|
Zhu Chunlei: "Research on Optimized Adaptive Non-parallel Training Voice Conversion Algorithms", China Master's Theses Full-text Database, Information Science and Technology series * |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104464744A (en) * | 2014-11-19 | 2015-03-25 | Hohai University Changzhou Campus | Cluster voice conversion method and system based on mixture Gaussian random process |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | iFlytek Co., Ltd. | Voice conversion method and device |
CN105390141B (en) * | 2015-10-14 | 2019-10-18 | iFlytek Co., Ltd. | Voice conversion method and device |
CN105895080A (en) * | 2016-03-30 | 2016-08-24 | LeTV Holdings (Beijing) Co., Ltd. | Voice recognition model training method, speaker type recognition method and device |
CN106448673A (en) * | 2016-09-18 | 2017-02-22 | SYSU-CMU Shunde International Joint Research Institute, Guangdong | Chinese electrolarynx speech conversion method |
CN106448673B (en) * | 2016-09-18 | 2019-12-10 | SYSU-CMU Shunde International Joint Research Institute, Guangdong | Chinese electrolarynx speech conversion method |
WO2018223796A1 (en) * | 2017-06-07 | 2018-12-13 | Tencent Technology (Shenzhen) Co., Ltd. | Speech recognition method, storage medium, and speech recognition device |
CN107507619A (en) * | 2017-09-11 | 2017-12-22 | Xiamen Meituzhijia Technology Co., Ltd. | Voice conversion method and device, electronic equipment, and readable storage medium |
CN107507619B (en) * | 2017-09-11 | 2021-08-20 | Xiamen Meituzhijia Technology Co., Ltd. | Voice conversion method and device, electronic equipment, and readable storage medium |
CN108108357A (en) * | 2018-01-12 | 2018-06-01 | BOE Technology Group Co., Ltd. | Accent conversion method and device, and electronic equipment |
CN108108357B (en) * | 2018-01-12 | 2022-08-09 | BOE Technology Group Co., Ltd. | Accent conversion method and device, and electronic equipment |
CN108777140A (en) * | 2018-04-27 | 2018-11-09 | Nanjing University of Posts and Telecommunications | VAE-based voice conversion method under non-parallel corpus training |
CN108766465A (en) * | 2018-06-06 | 2018-11-06 | Central China Normal University | Blind detection method for digital audio tampering based on ENF universal background model |
CN108766465B (en) * | 2018-06-06 | 2020-07-28 | Central China Normal University | Blind detection method for digital audio tampering based on ENF universal background model |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on a text encoder under non-parallel text conditions |
CN109326283B (en) * | 2018-11-23 | 2021-01-26 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on a text encoder under non-parallel text conditions |
CN109377986A (en) * | 2018-11-29 | 2019-02-22 | Sichuan Changhong Electric Co., Ltd. | Non-parallel corpus voice personalized conversion method |
CN109377986B (en) * | 2018-11-29 | 2022-02-01 | Sichuan Changhong Electric Co., Ltd. | Non-parallel corpus voice personalized conversion method |
CN110164414B (en) * | 2018-11-30 | 2023-02-14 | Tencent Technology (Shenzhen) Co., Ltd. | Speech processing method and device, and intelligent device |
CN110164414A (en) * | 2018-11-30 | 2019-08-23 | Tencent Technology (Shenzhen) Co., Ltd. | Speech processing method and device, and intelligent device |
CN109584893A (en) * | 2018-12-26 | 2019-04-05 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion system based on VAE and i-vector under non-parallel text conditions |
CN109584893B (en) * | 2018-12-26 | 2021-09-14 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion system based on VAE and i-vector under non-parallel text conditions |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on StarWGAN-GP and x-vector |
CN109599091B (en) * | 2019-01-14 | 2021-01-26 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on StarWGAN-GP and x-vector |
CN110060690A (en) * | 2019-04-04 | 2019-07-26 | Nanjing University of Posts and Telecommunications | Many-to-many voice conversion method based on StarGAN and ResNet |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104123933A (en) | Self-adaptive non-parallel training based voice conversion method | |
Dave | Feature extraction methods LPC, PLP and MFCC in speech recognition | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization | |
CN109767778B (en) | Voice conversion method fusing Bi-LSTM and WaveNet | |
CN102779508B (en) | Speech corpus generation apparatus and method, and speech synthesis system and method | |
CN102332263B (en) | Speaker recognition method synthesizing an emotion model based on the nearest-neighbor principle | |
Song et al. | Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification | |
CN105593936B (en) | System and method for text-to-speech performance evaluation | |
Kumar et al. | Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm | |
CN102568476B (en) | Voice conversion method based on self-organizing feature map network cluster and radial basis network | |
CN106653056A (en) | Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof | |
CN102426834B (en) | Method for testing rhythm level of spoken English | |
CN102810311B (en) | Speaker estimation method and speaker estimation equipment | |
Das et al. | Bangladeshi dialect recognition using Mel frequency cepstral coefficient, delta, delta-delta and Gaussian mixture model | |
CN105206257A (en) | Voice conversion method and device | |
CN113506562B (en) | End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features | |
CN102237083A (en) | Portable interpretation system based on WinCE platform and language recognition method thereof | |
CN103456302A (en) | Emotion speaker recognition method based on emotion GMM model weight synthesis | |
CN110047501A (en) | Multi-to-multi phonetics transfer method based on beta-VAE | |
CN101419800B (en) | Emotional speaker recognition method based on frequency spectrum translation | |
CN106297769B (en) | A discriminative feature extraction method applied to language identification | |
Gamit et al. | Isolated words recognition using MFCC, LPC and neural network |
Garg et al. | Survey on acoustic modeling and feature extraction for speech recognition | |
Kaur et al. | Genetic algorithm for combined speaker and speech recognition using deep neural networks | |
Wu et al. | The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge. |
Legal Events
Code | Title | Description
---|---|---
C06 | Publication |
PB01 | Publication |
C10 | Entry into substantive examination |
SE01 | Entry into force of request for substantive examination |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 2014-10-29