CN107301859B - Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering - Google Patents
Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering
- Publication number
- CN107301859B (application CN201710474281.8A)
- Authority
- CN
- China
- Prior art keywords
- speech
- training
- gaussian
- voice
- conversion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a voice conversion method under the non-parallel text condition based on self-adaptive Gaussian clustering, belonging to the technical field of voice signal processing. First, a method combining unit selection with vocal tract length normalization is used to align the voice characteristic parameters of the non-parallel corpora; then a self-adaptive Gaussian mixture model and bilinear frequency warping plus amplitude adjustment are trained to obtain the conversion function required for voice conversion; finally, the conversion function is used to realize high-quality voice conversion. The invention overcomes the limitation of requiring parallel corpora in the training stage and realizes voice conversion under the non-parallel text condition, giving it stronger applicability and universality. Moreover, by replacing the traditional Gaussian mixture model with the self-adaptive Gaussian mixture model, it solves the inaccuracy of the Gaussian mixture model when classifying voice characteristic parameters; combined with bilinear frequency warping and amplitude adjustment, the converted speech has better individual similarity and voice quality.
Description
Technical Field
The invention relates to a voice conversion technology, in particular to a voice conversion method under a non-parallel text condition, and belongs to the technical field of voice signal processing.
Background
Speech conversion is an emerging branch of research in the field of speech signal processing in recent years, and is carried out and developed on the basis of research on speech analysis, recognition and synthesis.
The goal of speech conversion is to change the speech personality characteristics of the source speaker into those of the target speaker, i.e., to make speech uttered by one person sound, after conversion, as if uttered by another person, while preserving the semantic content.
Most speech conversion methods, especially the GMM-based ones, require that the corpus used for training be parallel text, i.e. the source and target speakers must utter sentences with the same speech content and duration, with pronunciation rhythm and emotion kept as consistent as possible. However, in practical applications of voice conversion it is very difficult, even impossible, to obtain large amounts of parallel corpora; in addition, the accuracy of the alignment of the speech characteristic parameter vectors during training also constrains the performance of the voice conversion system. Considering the universality and practicability of voice conversion systems, research on voice conversion methods under the non-parallel text condition has great practical significance and application value.
At present, there are mainly two methods for voice conversion under the non-parallel text condition: methods based on speech clustering and methods based on parameter adaptation. Speech-clustering methods select corresponding speech units for conversion by measuring the distance between speech frames or under the guidance of phoneme information; in essence, they convert the non-parallel text into parallel text under certain conditions. Such methods are simple in principle, but the content of the speech text must be pre-extracted, and the pre-extraction result directly affects the conversion quality. Parameter-adaptation methods process the parameters of the conversion model using speaker normalization or adaptation techniques from speech recognition; in essence, they convert a pre-established model into a model based on the target speaker. Such methods make reasonable use of pre-stored speaker information, but the adaptation process generally causes over-smoothing of the frequency spectrum, weakening the speaker's personality information in the converted voice.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the voice conversion method can adaptively determine the GMM mixing degree according to the difference of target speakers under the condition of non-parallel texts, thereby enhancing the individual characteristics of the speakers in the converted voice and improving the quality of the converted voice.
The invention adopts the following technical scheme for solving the technical problems:
the invention provides a voice conversion method under a non-parallel text condition based on self-adaptive Gaussian clustering, which comprises a training stage and a conversion stage, wherein the training stage comprises the following steps:
step 1, inputting non-parallel training corpora of a source speaker and a target speaker;
step 2, respectively extracting, by using the AHOcoder speech analysis model, the MFCC characteristic parameters X of the non-parallel training corpus of the source speaker, the MFCC characteristic parameters Y of the non-parallel training corpus of the target speaker, the source speech fundamental frequency log f0X, and the target speech fundamental frequency log f0Y;
Step 3, performing speech characteristic parameter alignment and dynamic time warping combining unit selection and vocal tract length normalization on the MFCC characteristic parameters X, Y in the step 2, thereby converting the non-parallel linguistic data into parallel linguistic data;
step 4, performing adaptive Gaussian mixture model AGMM training by using an expectation maximization EM algorithm, finishing the AGMM training to obtain a posterior conditional probability matrix P (X | lambda), and storing an AGMM parameter lambda;
step 5, using the source speech characteristic parameters X and the target speech characteristic parameters Y obtained in step 3, and the posterior conditional probability matrix P(X|λ) from step 4, carrying out bilinear frequency warping (BLFW) plus amplitude adjustment (AS) training to obtain the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ), so as to construct the BLFW+AS conversion function; and using the mean and variance of the logarithmic fundamental frequency to establish the fundamental frequency conversion function between the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;
the conversion phase comprises the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the MFCC characteristic parameters X′ and the logarithmic fundamental frequency log f0X′ of the source speaker voice by using the AHOcoder speech analysis model;
step 8, using the parameter λ obtained from the AGMM training in step 4 to obtain the posterior conditional probability matrix P′(X|λ);
step 9, obtaining the converted MFCC characteristic parameters Y′ by using the BLFW+AS conversion function obtained in step 5;
step 10, using the fundamental frequency conversion function obtained in step 5 to convert the logarithmic fundamental frequency log f0X′ into the converted logarithmic fundamental frequency log f0Y′;
step 11, synthesizing the converted speech from the converted MFCC characteristic parameters Y′ and the converted logarithmic fundamental frequency log f0Y′ by using the AHOdecoder speech synthesis model.
Further, the voice conversion method provided by the present invention comprises the following specific steps in step 3:
3-1) carrying out sound channel length normalization processing on the source speech MFCC characteristic parameters by adopting a bilinear frequency warping method;
3-2) for given N source speech MFCC characteristic parameter vectors {Xk}, dynamically searching N target speech characteristic parameter vectors {Yk} through formula (1) such that the distance cost function value C({Yk}) is minimized;
C({Yk})=C1({Yk})+C2({Yk}) (1)
wherein C1({Yk}) and C2({Yk}) are respectively represented by the following formulae:

C1({Yk}) = γ Σk=1..N D(Xk, Yk)  (2)

C2({Yk}) = (1 − γ) Σk=2..N D(Yk−1, Yk)  (3)

wherein the function D(Xk, Yk) represents the spectral distance between the source speech and target speech characteristic parameter vectors; the parameter γ represents the balance coefficient between the accuracy of characteristic parameter frame alignment and the continuity between frames, with 0 ≤ γ ≤ 1; C1({Yk}) represents the spectral distance cost function between the source speech characteristic parameter vectors and the target speech characteristic parameter vectors, and C2({Yk}) represents the spectral distance cost function between the unit-selected target speech characteristic parameter vectors;
3-3) obtaining the target speech characteristic parameter sequence set {Ŷk} aligned with the source speech characteristic parameter vectors by performing multiple linear regression analysis on formula (1), namely:

{Ŷk} = argmin{Yk} C({Yk})  (4)

through the above steps, the non-parallel MFCC characteristic parameters X, Y are transformed into parallel corpora.
Further, in the speech conversion method proposed by the present invention, the solution of formula (4) adopts the Viterbi search method to optimize the execution efficiency of the algorithm.
Further, the training process of step 4 of the speech conversion method provided by the present invention is as follows:
4-1) setting the AGMM initial mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between characteristic parameter vectors, and the covariance threshold σ;
4-2) obtaining the initial values for EM training by using the K-means iterative algorithm;
4-3) carrying out iterative training by using the EM algorithm; the Gaussian mixture model GMM is expressed as follows:

P(X|λ) = Σi=1..M P(wi) N(X; μi, Σi)  (5)

wherein X is a P-dimensional speech characteristic parameter vector, with P = 39; P(wi) represents the weight coefficient of each Gaussian component, satisfying Σi=1..M P(wi) = 1; M is the number of Gaussian components; N(X; μi, Σi) is the P-dimensional joint Gaussian probability distribution of a Gaussian component, expressed as follows:

N(X; μi, Σi) = (2π)^(−P/2) |Σi|^(−1/2) exp(−(1/2)(X − μi)T Σi^(−1)(X − μi))  (6)

wherein μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} is the model parameter set of the GMM model, and the estimation of λ is realized by the maximum likelihood estimation method; for the speech characteristic parameter vector set X = {Xn, n = 1,2,...,N}:

P(X|λ) = Πn=1..N P(Xn|λ)  (7)

at this time:

λ = argmaxλ P(X|λ)  (8)

formula (8) is solved using the EM algorithm, the iteration condition satisfying P(X|λk) ≥ P(X|λk−1) during the EM calculation process, where k is the iteration number, until the model parameters λ converge; the iterative formulas for the weight coefficient P(wi), mean vector μi and covariance matrix Σi of the Gaussian components are as follows:

P(wi) = (1/N) Σn=1..N p(wi|Xn, λ)  (9)

μi = Σn=1..N p(wi|Xn, λ) Xn / Σn=1..N p(wi|Xn, λ)  (10)

Σi = Σn=1..N p(wi|Xn, λ)(Xn − μi)(Xn − μi)T / Σn=1..N p(wi|Xn, λ)  (11)
4-4) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient less than t1, and the Euclidean distance between it and its nearest neighbour component N(P(wj), μj, Σj) is less than the threshold D, the two components are merged;
at this moment the number of Gaussian components becomes M−1 and the process returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is selected for merging;
4-5) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and the variance of at least one dimension in its covariance matrix is greater than σ, the Gaussian component is considered to contain too much information and is split;
wherein E is an all-ones column vector and n is used to adjust the Gaussian distribution; after splitting, the number of Gaussian components becomes M+1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is selected for splitting, and the process returns to step 4-3) for the next round of training;
4-6) finishing the AGMM training to obtain a posterior conditional probability matrix P (X | lambda), and storing the lambda.
Further, the BLFW + AS conversion function constructed in step 5 of the voice conversion method proposed by the present invention is expressed AS follows:
F(x)=Wα(x,λ)x+s(x,λ) (15)
wherein M is the number of Gaussian components of the Gaussian mixture model in step 4, α(x,λ) represents the frequency warping factor, and s(x,λ) represents the amplitude adjustment factor.
Further, the speech conversion method provided by the present invention establishes in step 5 the conversion relationship between the source speech pitch frequency and the target speech pitch frequency:

log f0Y′ = μY + (σY/σX)(log f0X′ − μX)  (19)

wherein μ and σ² denote the mean and variance of the logarithmic pitch frequency log f0, the subscripts X and Y referring to the source and target speech respectively.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. the invention realizes the voice conversion under the condition of non-parallel texts, solves the problem that parallel linguistic data are not easy to obtain, and improves the universality and the practicability of the voice conversion system.
2. The invention uses the combination of AGMM and BLFW + AS to realize a voice conversion system, and the system can adaptively adjust the classification number of GMM according to the voice characteristic parameter distribution of different speakers, improves the voice quality while enhancing the voice personality similarity, and realizes high-quality voice conversion.
Drawings
FIG. 1 is a schematic diagram of voice conversion under the non-parallel text condition according to the present invention.
FIG. 2 is a flow chart of adaptive Gaussian mixture model training.
FIG. 3 is a comparison of the speech spectra of the converted speech.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The high-quality voice conversion method of the invention is divided into two parts: the training part is used for obtaining parameters and conversion functions required by voice conversion, and the conversion part is used for realizing the conversion from the voice of a source speaker to the voice of a target speaker.
As shown in fig. 1, the training part implements the steps of:
Step 1, inputting the non-parallel speech corpora of a source speaker and a target speaker, wherein the non-parallel corpora are taken from the CMU_US_ARCTIC corpus, established by the Language Technologies Institute of Carnegie Mellon University; the speech in the corpus was recorded by 5 men and 2 women, each speaker recording 1132 utterances of 1–6 s in length.
Step 2, the invention uses the AHOcoder speech analysis model to respectively extract the Mel-Frequency Cepstral Coefficients (MFCC) X, Y and the logarithmic fundamental frequency parameters log f0X and log f0Y of the source and target speakers, wherein AHOcoder is a high-performance speech analysis and synthesis tool developed by the Aholab Signal Processing Laboratory in Bilbao, Spain;
and 3, performing voice characteristic parameter alignment and Dynamic Time Warping (DTW) combining unit selection (UnitSelection) and Vocal Tract Length normalization (VTLN, Vocal Tract Length No. 6 normalization) on the MFCC parameters X, Y of the source and target voices in the step 2. The specific process of aligning the voice characteristic parameters is as follows:
3-1) performing vocal tract length normalization processing on the source speech characteristic parameters by adopting the bilinear frequency warping method, so that the formants of the source speech approach those of the target speech, thereby increasing the accuracy of unit selection of the target speech characteristic parameters.
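The vocal-tract-length normalization of step 3-1) can be sketched as follows. This is a minimal illustration, not the AHOcoder processing path: it applies the standard bilinear (all-pass) frequency warping function to a magnitude spectrum sampled on a uniform grid, where the warping factor α is an assumed free parameter (α = 0 leaves the spectrum unchanged, positive α shifts formants upward).

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Bilinear frequency warping: maps omega in [0, pi] to the warped
    frequency; alpha in (-1, 1) controls the warp strength, and
    alpha = 0 is the identity map."""
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))

def warp_spectrum(mag, alpha):
    """Warp a magnitude spectrum sampled on K uniform frequencies in
    [0, pi] by resampling it at the inverse-warped frequencies
    (bilinear maps compose as a group, with alpha -> -alpha giving the
    inverse). Linear interpolation; a sketch only."""
    K = len(mag)
    omega = np.linspace(0.0, np.pi, K)
    source = bilinear_warp(omega, -alpha)
    return np.interp(source, omega, mag)
```

Because the map is monotonic and fixes the endpoints 0 and π, the warped spectrum keeps the same band edges while formant positions move.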
3-2) for given N source speech characteristic parameter vectors {Xk}, the N target speech characteristic parameter vectors {Yk} can be dynamically searched through formula (1) such that the distance cost function value C({Yk}) is minimized. Two factors are considered in the unit selection process: on one hand, the spectral distance between the aligned source speech characteristic parameter vectors and the characteristic parameter vectors of the target speech should be minimized, to enhance the matching degree of the phoneme information; on the other hand, the selected target speech characteristic parameter vectors should have frame continuity, so that the phoneme information is more complete.
C({Yk})=C1({Yk})+C2({Yk}) (1)
Wherein C1({Yk}) and C2({Yk}) may be respectively represented by the following formulae:

C1({Yk}) = γ Σk=1..N D(Xk, Yk)  (2)

C2({Yk}) = (1 − γ) Σk=2..N D(Yk−1, Yk)  (3)

wherein the function D(Xk, Yk) represents the spectral distance between the source and target characteristic parameter vectors; the Euclidean distance is used as the distance measure in the invention. The parameter γ represents the balance coefficient between the accuracy of characteristic parameter frame alignment and the continuity between frames, with 0 ≤ γ ≤ 1. C1({Yk}) is the spectral distance cost function between the source speech characteristic parameter vectors and the characteristic parameter vectors of the target speech, and C2({Yk}) represents the spectral distance cost function between the characteristic parameter vectors of the unit-selected target speech.
3-3) obtaining the characteristic parameter sequence set {Ŷk} aligned with the source speech characteristic parameter vectors by carrying out multiple linear regression analysis on formula (1), namely:

{Ŷk} = argmin{Yk} C({Yk})  (4)
for the solution of equation (4), a Viterbi (Viterbi) search method may be used to optimize the execution efficiency of the algorithm.
The non-parallel MFCC parameters X, Y are transformed to be parallel through the steps described above.
And 4, establishing an Adaptive Gaussian Mixture Model (AGMM), training by adopting the Expectation-Maximization (EM) algorithm, and obtaining the initial values for EM training by using the K-means iterative method. The AGMM parameters λ and P(X|λ) are obtained by training.
As shown in fig. 2, training the AGMM parameters with the adaptive clustering algorithm first requires a comprehensive analysis of the weight coefficients, mean vectors and covariance matrices of each Gaussian component and the Euclidean distances between characteristic parameter vectors, dynamically adjusting the Gaussian mixture number. The training process is as follows:
4-1) setting the AGMM initial mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between characteristic parameter vectors, and the covariance threshold σ.
4-2) obtaining the initial values for EM training by using the K-means iterative algorithm.
4-3) iterative training using EM algorithm.
The conventional Gaussian mixture model is represented as follows:

P(X|λ) = Σi=1..M P(wi) N(X; μi, Σi)  (5)

wherein X is a P-dimensional speech characteristic parameter vector; the invention adopts P = 39. P(wi) represents the weight coefficient of each Gaussian component, satisfying Σi=1..M P(wi) = 1; M is the number of Gaussian components; N(X; μi, Σi) is the P-dimensional joint Gaussian probability distribution of a Gaussian component, expressed as follows:

N(X; μi, Σi) = (2π)^(−P/2) |Σi|^(−1/2) exp(−(1/2)(X − μi)T Σi^(−1)(X − μi))  (6)

wherein μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} is the model parameter set of the GMM model. The estimation of λ can be realized by the Maximum Likelihood (ML) estimation method, maximizing the conditional probability P(X|λ); for the speech characteristic parameter vector set X = {Xn, n = 1,2,...,N}:

P(X|λ) = Πn=1..N P(Xn|λ)  (7)

at this time:

λ = argmaxλ P(X|λ)  (8)

Solving equation (8) may use the EM algorithm, the iteration condition satisfying P(X|λk) ≥ P(X|λk−1) during the EM calculation process, where k is the iteration number, until the model parameters λ converge. The iterative formulas for the weight coefficient P(wi), mean vector μi and covariance matrix Σi of the Gaussian components are as follows:

P(wi) = (1/N) Σn=1..N p(wi|Xn, λ)  (9)

μi = Σn=1..N p(wi|Xn, λ) Xn / Σn=1..N p(wi|Xn, λ)  (10)

Σi = Σn=1..N p(wi|Xn, λ)(Xn − μi)(Xn − μi)T / Σn=1..N p(wi|Xn, λ)  (11)
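The EM re-estimation of step 4-3) can be sketched as follows. This is a simplified stand-in, not the patent's implementation: diagonal covariances are assumed for brevity (the text does not restrict the covariance form), and the initialization is a crude deterministic substitute for the K-means initialization of step 4-2).

```python
import numpy as np

def em_gmm(X, M, iters=50):
    """Standard EM re-estimation for a GMM with diagonal covariances.
    Returns the weights P(w_i), means mu_i and per-dimension variances,
    i.e. the parameter set lambda of formulas (9)-(11)."""
    N, P = X.shape
    w = np.full(M, 1.0 / M)
    # crude deterministic init (stand-in for K-means): M frames spread
    # along the first coordinate
    idx = np.argsort(X[:, 0])[np.linspace(0, N - 1, M).astype(int)]
    mu = X[idx].astype(float).copy()
    var = np.tile(X.var(axis=0) + 1e-6, (M, 1))
    for _ in range(iters):
        # E-step: responsibilities p(w_i | X_n, lambda)
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimation formulas for weights, means, variances
        nk = r.sum(axis=0) + 1e-12
        w = nk / N
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```

Each iteration provably does not decrease the likelihood P(X|λ), matching the stated iteration condition P(X|λk) ≥ P(X|λk−1).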
4-4) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient less than t1, and the Euclidean distance between it and its nearest neighbour component N(P(wj), μj, Σj) is less than the threshold D, the two components are considered to contain little and similar information, and can be merged;
at this moment the number of Gaussian components becomes M−1 and the process returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is selected for merging.
4-5) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and the variance of at least one dimension in its covariance matrix (the diagonal elements of the covariance matrix are the variances) is greater than σ, the Gaussian component is considered to contain too much information and should be split;
wherein E is an all-ones column vector and n is used to adjust the Gaussian distribution; after splitting, the number of Gaussian components becomes M+1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is selected for splitting, and the process returns to step 4-3) for the next round of training.
4-6) finishing the AGMM training to obtain a posterior conditional probability matrix P (X | lambda), and storing the lambda.
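The adaptive merge/split checks of steps 4-4) and 4-5) can be sketched as one adjustment pass over diagonal-covariance components. The merge and split formulas below are plausible stand-ins only, since the patent gives formulas (12)–(14) as figures that are not reproduced here: the merge combines moments in proportion to the weights, and the split perturbs the mean along the all-ones vector E with adjustment factor n, as the text describes.

```python
import numpy as np

def adapt_components(w, mu, var, t1, t2, D, sigma, n=0.5):
    """One adaptive pass: merge a component with weight < t1 into its
    nearest neighbour if their Euclidean distance < D (4-4); otherwise
    split the largest component if its weight > t2 and some dimension's
    variance > sigma (4-5). Returns possibly updated (w, mu, var)."""
    M = len(w)
    # --- merge check (4-4), lowest-weight candidates first ---
    for i in np.argsort(w):
        if w[i] >= t1:
            break
        dist = np.linalg.norm(mu - mu[i], axis=1)
        dist[i] = np.inf
        j = int(np.argmin(dist))
        if dist[j] < D:
            wm = w[i] + w[j]
            mum = (w[i] * mu[i] + w[j] * mu[j]) / wm     # weighted mean
            varm = (w[i] * var[i] + w[j] * var[j]) / wm  # weighted variance
            keep = [k for k in range(M) if k not in (i, j)]
            return (np.append(w[keep], wm),
                    np.vstack([mu[keep], mum]),
                    np.vstack([var[keep], varm]))
    # --- split check (4-5) ---
    i = int(np.argmax(w))
    if w[i] > t2 and np.any(var[i] > sigma):
        E = np.ones_like(mu[i])              # all-ones column vector E
        shift = n * np.sqrt(var[i]) * E      # perturbation scaled by n
        w2 = np.concatenate([np.delete(w, i), [w[i] / 2, w[i] / 2]])
        mu2 = np.vstack([np.delete(mu, i, axis=0), mu[i] + shift, mu[i] - shift])
        var2 = np.vstack([np.delete(var, i, axis=0), var[i], var[i]])
        return w2, mu2, var2
    return w, mu, var
```

In the AGMM loop this pass would alternate with EM re-training (return to step 4-3)) until no component satisfies either condition.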
Step 5, training by using the source speech characteristic parameters X and target speech characteristic parameters Y obtained in step 3 and the posterior conditional probability matrix P(X|λ) obtained in step 4 to obtain the frequency warping factor and the amplitude adjustment factor, thereby constructing the Bilinear Frequency Warping (BLFW) plus Amplitude Adjustment (AS) speech conversion function, which is expressed as follows:
F(x)=Wα(x,λ)x+s(x,λ) (15)
establishing the conversion relationship between the source speech pitch frequency and the target speech pitch frequency:

log f0Y′ = μY + (σY/σX)(log f0X′ − μX)  (19)

wherein μ and σ² denote the mean and variance of the logarithmic pitch frequency log f0, with subscripts X and Y referring to the source and target speech respectively.
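The pitch conversion above amounts to matching the mean and variance of log f0 between the two speakers, and can be sketched in a few lines (function name and Hz interface are illustrative; the statistics μ, σ are computed over each speaker's training corpus):

```python
import math

def convert_f0(f0_src_hz, mu_x, sigma_x, mu_y, sigma_y):
    """Convert one source pitch value (Hz) by mean/variance matching in
    the log domain:
        log f0_Y' = mu_Y + (sigma_Y / sigma_X) * (log f0_X' - mu_X)
    mu_*, sigma_* are the mean and standard deviation of log f0 over the
    source (X) and target (Y) training corpora."""
    log_f0 = math.log(f0_src_hz)
    return math.exp(mu_y + (sigma_y / sigma_x) * (log_f0 - mu_x))
```

A frame at the source's average pitch is thus mapped exactly to the target's average pitch, while relative intonation excursions are rescaled by σY/σX.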
As shown in fig. 1, the conversion part includes the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the 39-order MFCC characteristic parameters X′ of the source speaker voice and the source speech logarithmic pitch frequency parameter log f0X′ by using the AHOcoder speech analysis model;
step 8, substituting the characteristic parameters X′ extracted in step 7 into formula (5), using λ = {P(wi), μi, Σi} obtained during the AGMM training in step 4, to obtain the posterior conditional probability matrix P′(X|λ);
step 9, substituting the frequency warping factor α(x,λ) and the amplitude adjustment factor s(x,λ) obtained by the BLFW+AS training in step 5, together with the posterior conditional probability matrix P′(X|λ) obtained in step 8, into formulas (15), (16), (17) and (18) to obtain the MFCC characteristic parameters Y′ of the converted speech;
step 10, substituting the source speech logarithmic pitch frequency parameter log f0X′ obtained in step 7 into formula (19) to obtain the logarithmic pitch frequency parameter log f0Y′ of the converted speech;
step 11, using the AHOdecoder speech synthesis model with Y′ from step 9 and log f0Y′ from step 10 as input to obtain the converted speech.
Further, as shown in fig. 3, the spectrogram of the converted speech obtained by the method of the present invention is compared with that of the INCA method; the conversion direction is F1–M2 (female voice 1 to male voice 2), which further verifies that the method of the present invention achieves higher spectral similarity than the INCA method. The INCA method is proposed in the literature (Erro D, Moreno A, Bonafonte A. INCA algorithm for training voice conversion systems from nonparallel corpora [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2010, 18(5): 944–953.).
The foregoing is only a partial embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A speech conversion method under the condition of non-parallel texts based on adaptive Gaussian clustering is characterized by comprising a training stage and a conversion stage, wherein the training stage comprises the following steps:
step 1, inputting non-parallel training corpora of a source speaker and a target speaker;
step 2, using the AHOcoder speech analysis model to respectively extract the MFCC feature parameters X of the source speaker's non-parallel training corpus, the MFCC feature parameters Y of the target speaker's non-parallel training corpus, the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;
Step 3, aligning the MFCC feature parameters X, Y of step 2 by dynamic time warping combined with unit selection and vocal tract length normalization, thereby converting the non-parallel corpora into parallel corpora;
step 4, performing adaptive Gaussian mixture model (AGMM) training using the expectation-maximization (EM) algorithm; upon completion of the AGMM training, obtaining the posterior conditional probability matrix P(X|λ) and storing the AGMM parameters λ;
step 5, using the source speech feature parameters X and the target speech feature parameters Y obtained in step 3 and the posterior conditional probability matrix P(X|λ) of step 4, performing bilinear frequency warping (BLFW) + amplitude adjustment (AS) training to obtain the frequency warping factor α(X, λ) and the amplitude adjustment factor s(X, λ), thereby constructing the BLFW + AS conversion function; and using the mean and variance of the logarithmic fundamental frequency, establishing a fundamental frequency conversion function between the source speech fundamental frequency log f0X and the target speech fundamental frequency log f0Y;
the conversion phase comprises the following steps:
step 6, inputting the voice of a source speaker to be converted;
step 7, extracting the MFCC feature parameters X′ and the logarithmic fundamental frequency log f0X′ of the source speaker's voice using the AHOcoder speech analysis model;
Step 8, obtaining the posterior conditional probability matrix P′(X|λ) using the parameters λ obtained in the AGMM training of step 4;
step 9, obtaining a converted MFCC characteristic parameter Y' by using the BLFW + AS conversion function obtained in the step 5;
step 10, applying the fundamental frequency conversion function obtained in step 5 to the logarithmic fundamental frequency log f0X′ to obtain the converted logarithmic fundamental frequency log f0Y′;
Step 11, using the AHOdecoder speech synthesis model to synthesize the converted speech from the converted MFCC feature parameters Y′ and the converted logarithmic fundamental frequency log f0Y′.
2. The speech conversion method according to claim 1, wherein the step 3 comprises the following steps:
3-1) carrying out vocal tract length normalization processing on the source speech MFCC feature parameters by means of bilinear frequency warping;
3-2) for the N given source-speech MFCC feature parameter vectors {Xk}, dynamically searching the N target speech feature parameter vectors {Yk} through formula (1) so that the distance cost function value C({Yk}) is minimized;
C({Yk})=C1({Yk})+C2({Yk}) (1)
wherein C1({Yk}) and C2({Yk}) are respectively expressed by the following formulae:

C1({Yk}) = Σk=1N D(Xk, Yk)   (2)

C2({Yk}) = γ Σk=2N D(Yk, Yk−1)   (3)

wherein the function D(Xk, Yk) represents the spectral distance between the source and target speech feature parameter vectors, the function D(Yk, Yk−1) represents the spectral distance between the unit-selected target speech feature parameter vectors, and the parameter γ (0 ≤ γ ≤ 1) is the balance coefficient between the accuracy of feature-parameter frame alignment and the inter-frame continuity; C1({Yk}) is the spectral distance cost function between the source and target speech feature parameter vectors, and C2({Yk}) is the spectral distance cost function between the unit-selected target speech feature parameter vectors;
3-3) obtaining the set of target speech feature parameter sequences {Ŷk} aligned with the source speech feature parameter vectors by performing multiple linear regression analysis on formula (1), namely:

{Ŷk} = arg min{Yk} C({Yk})   (4)
Through the above steps, the MFCC features X, Y of the non-parallel corpora are transformed into an aligned feature parameter set approximating a parallel corpus.
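By way of illustration (not part of the patent text), the alignment search of formulas (1)-(4) can be sketched as a Viterbi search over target frames; the Euclidean frame distance, the placement of the cost weights, and the value of gamma below are illustrative assumptions:

```python
import numpy as np

def align_unit_selection(X, Y, gamma=0.3):
    """Viterbi search for a target-frame index sequence (one per source
    frame X_k) minimizing an alignment cost plus a gamma-weighted
    inter-frame continuity cost, in the spirit of formula (1)."""
    N, T = len(X), len(Y)
    # D(X_k, Y_j): distance of every source frame to every target frame
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # D(Y_i, Y_j): continuity cost between consecutively selected frames
    cont = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    cost = dist[0].copy()
    back = np.zeros((N, T), dtype=int)
    for k in range(1, N):
        total = cost[:, None] + gamma * cont          # [previous, current]
        back[k] = np.argmin(total, axis=0)
        cost = total[back[k], np.arange(T)] + dist[k]
    # backtrack the optimal target-frame index sequence
    path = [int(np.argmin(cost))]
    for k in range(N - 1, 0, -1):
        path.append(int(back[k][path[-1]]))
    path.reverse()
    return path
```

With gamma = 0 the search degenerates to nearest-frame matching; larger gamma trades alignment accuracy for smoother target-frame sequences.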
3. The speech conversion method according to claim 2, wherein formula (4) is solved using a Viterbi search so as to improve the execution efficiency of the algorithm.
4. The speech conversion method of claim 1, wherein the training process of step 4 is as follows:
4-1) setting the initial AGMM mixture number M, the Gaussian component weight coefficient thresholds t1 and t2, the Euclidean distance threshold D between feature parameter vectors, and the covariance threshold σ;
4-2) obtaining initial values for the EM training using the K-means iterative algorithm;
4-3) carrying out iterative training using the EM algorithm; the Gaussian mixture model (GMM) is expressed as follows:

P(X|λ) = Σi=1M P(wi) N(X; μi, Σi)   (5)

wherein X is a P-dimensional speech feature parameter vector, P(wi) is the weight coefficient of the i-th Gaussian component with Σi=1M P(wi) = 1, M is the number of Gaussian components, and N(X; μi, Σi) is the P-dimensional joint Gaussian probability distribution of a Gaussian component, expressed as follows:

N(X; μi, Σi) = [1 / ((2π)P/2 |Σi|1/2)] exp(−(1/2)(X − μi)T Σi−1 (X − μi))   (6)
wherein μi is the mean vector and Σi is the covariance matrix; λ = {P(wi), μi, Σi} is the set of model parameters of the GMM, and the estimation of λ is realized by maximum likelihood estimation; for a speech feature parameter vector set X = {Xn, n = 1, 2, ..., N}:

P(X|λ) = Πn=1N P(Xn|λ)   (7)

at this time:

λ = arg maxλ P(X|λ)   (8)
Formula (8) is solved with the EM algorithm, the iteration condition satisfying P(X|λk) ≥ P(X|λk−1) during the EM calculation, where k is the number of iterations, until the model parameters λ are obtained; in the iteration, the weight coefficient P(wi), mean vector μi and covariance matrix Σi of each Gaussian component are updated as follows:

P(wi) = (1/N) Σn=1N P(wi|Xn, λ)   (9)

μi = [Σn=1N P(wi|Xn, λ) Xn] / [Σn=1N P(wi|Xn, λ)]   (10)

Σi = [Σn=1N P(wi|Xn, λ)(Xn − μi)(Xn − μi)T] / [Σn=1N P(wi|Xn, λ)]   (11)
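By way of illustration (not part of the patent text), one EM iteration in the spirit of formulas (9)-(11) can be sketched as follows; the sketch assumes diagonal covariances for simplicity, whereas the patent works with full covariance matrices:

```python
import numpy as np

def em_step(X, weights, means, variances):
    """One EM iteration for a diagonal-covariance GMM: E-step computes
    responsibilities P(w_i | X_n, lambda); M-step re-estimates weights,
    means and variances as in formulas (9)-(11)."""
    N, P = X.shape
    M = len(weights)
    # E-step: per-component log densities, then normalized responsibilities
    log_prob = np.zeros((N, M))
    for i in range(M):
        diff = X - means[i]
        log_prob[:, i] = (np.log(weights[i])
                          - 0.5 * np.sum(np.log(2.0 * np.pi * variances[i]))
                          - 0.5 * np.sum(diff ** 2 / variances[i], axis=1))
    log_prob -= log_prob.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(log_prob)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights (9), means (10) and variances (11)
    Nk = resp.sum(axis=0)
    new_weights = Nk / N
    new_means = (resp.T @ X) / Nk[:, None]
    new_vars = np.stack([(resp[:, i:i + 1] * (X - new_means[i]) ** 2).sum(axis=0) / Nk[i]
                         for i in range(M)])
    return new_weights, new_means, new_vars
```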
4-4) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient smaller than t1, and the Euclidean distance between it and its nearest neighboring component N(P(wj), μj, Σj) is smaller than the threshold D, the two are merged:

P(w′) = P(wi) + P(wj)   (12)

μ′ = [P(wi)μi + P(wj)μj] / P(w′),  Σ′ = [P(wi)Σi + P(wj)Σj] / P(w′)   (13)

at which point the number of Gaussian components becomes M − 1, and the process returns to step 4-3) for the next round of training; if several Gaussian components satisfy the merging condition, the pair with the smallest distance is merged;
4-5) if some Gaussian component N(P(wi), μi, Σi) in the trained model has a weight coefficient greater than t2, and the variance of at least one dimension in its covariance matrix is greater than σ, the Gaussian component is considered to contain too much information and is split:

P(wi1) = P(wi2) = P(wi)/2,  μi1 = μi + n·e,  μi2 = μi − n·e,  Σi1 = Σi2 = Σi   (14)

wherein e is a column vector of all ones and n is used to adjust the Gaussian distribution; after splitting, the number of Gaussian components becomes M + 1; if several Gaussian components satisfy the splitting condition, the component with the largest weight coefficient is split, and the process returns to step 4-3) for the next round of training;
4-6) finishing the AGMM training to obtain a posterior conditional probability matrix P (X | lambda), and storing the lambda.
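By way of illustration (not part of the patent text), the merging and splitting operations of steps 4-4) and 4-5) can be sketched as follows; the weighted averaging of the covariances in the merge and the value of n are illustrative assumptions:

```python
import numpy as np

def merge_components(weights, means, covs, i, j):
    """Merge components i and j: sum the weights and take the
    weight-averaged mean and covariance (step 4-4)."""
    w = weights[i] + weights[j]
    mu = (weights[i] * means[i] + weights[j] * means[j]) / w
    cov = (weights[i] * covs[i] + weights[j] * covs[j]) / w
    return w, mu, cov

def split_component(weights, means, covs, i, n=0.2):
    """Split component i into two components with halved weights and
    means perturbed by +/- n along the all-ones vector e (step 4-5)."""
    e = np.ones_like(means[i])
    half = weights[i] / 2.0
    return (half, means[i] + n * e, covs[i].copy()), \
           (half, means[i] - n * e, covs[i].copy())
```

After either operation the mixture size changes by one and EM training resumes, matching the adaptive loop described above.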
5. The method of claim 4, wherein P = 39.
6. The speech conversion method of claim 1, wherein the BLFW + AS conversion function constructed in step 5 is expressed as follows:
F(x)=Wα(x,λ)x+s(x,λ) (15)
wherein Wα(x,λ) is the bilinear frequency warping matrix determined by the warping factor α(x, λ), and α(x, λ) and s(x, λ) are the weighted combinations of the frequency warping factors and amplitude adjustment factors of all Gaussian components of the AGMM model λ:

α(x, λ) = Σm=1M pm(x, λ) αm   (16)

s(x, λ) = Σm=1M pm(x, λ) sm   (17)

pm(x, λ) = P(wm) N(x; μm, Σm) / Σj=1M P(wj) N(x; μj, Σj)   (18)

wherein M is the number of Gaussian components of the Gaussian mixture model in step 4, pm(x, λ) represents the posterior probability of the m-th Gaussian component of the speech feature vector x in the AGMM model λ, and αm and sm are respectively the frequency warping factor and the amplitude adjustment factor of the m-th Gaussian component in the AGMM model λ.
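By way of illustration (not part of the patent text), the posterior-weighted combination of per-component factors can be sketched as follows; constructing the bilinear warping matrix Wα itself is omitted, and the function name is hypothetical:

```python
import numpy as np

def weighted_factors(posterior, alpha_m, s_m):
    """Combine per-component warping factors alpha_m (length M) and
    amplitude vectors s_m (shape M x P) with the posterior probabilities
    p_m(x, lambda), in the spirit of formulas (16)-(17)."""
    alpha = float(posterior @ alpha_m)   # scalar warping factor alpha(x, lambda)
    s = posterior @ s_m                  # amplitude adjustment vector s(x, lambda)
    return alpha, s
```

The converted frame is then F(x) = Wα(x,λ) x + s(x,λ), with Wα built from the combined scalar factor.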
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710474281.8A CN107301859B (en) | 2017-06-21 | 2017-06-21 | Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107301859A CN107301859A (en) | 2017-10-27 |
CN107301859B true CN107301859B (en) | 2020-02-21 |
Family
ID=60136451
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107945791B (en) * | 2017-12-05 | 2021-07-20 | 华南理工大学 | Voice recognition method based on deep learning target detection |
CN108198566B (en) * | 2018-01-24 | 2021-07-20 | 咪咕文化科技有限公司 | Information processing method and device, electronic device and storage medium |
CN108777140B (en) * | 2018-04-27 | 2020-07-28 | 南京邮电大学 | Voice conversion method based on VAE under non-parallel corpus training |
CN109671423B (en) * | 2018-05-03 | 2023-06-02 | 南京邮电大学 | Non-parallel text-to-speech conversion method under limited training data |
CN110556092A (en) * | 2018-05-15 | 2019-12-10 | 中兴通讯股份有限公司 | Speech synthesis method and device, storage medium and electronic device |
US11605371B2 (en) * | 2018-06-19 | 2023-03-14 | Georgetown University | Method and system for parametric speech synthesis |
CN109377978B (en) * | 2018-11-12 | 2021-01-26 | 南京邮电大学 | Many-to-many speaker conversion method based on i vector under non-parallel text condition |
CN109326283B (en) * | 2018-11-23 | 2021-01-26 | 南京邮电大学 | Many-to-many voice conversion method based on text encoder under non-parallel text condition |
CN109671442B (en) * | 2019-01-14 | 2023-02-28 | 南京邮电大学 | Many-to-many speaker conversion method based on STARGAN and x vectors |
CN110164463B (en) * | 2019-05-23 | 2021-09-10 | 北京达佳互联信息技术有限公司 | Voice conversion method and device, electronic equipment and storage medium |
CN110782908B (en) * | 2019-11-05 | 2020-06-16 | 广州欢聊网络科技有限公司 | Audio signal processing method and device |
CN111640453B (en) * | 2020-05-13 | 2023-06-16 | 广州国音智能科技有限公司 | Spectrogram matching method, device, equipment and computer readable storage medium |
US20210383790A1 (en) * | 2020-06-05 | 2021-12-09 | Google Llc | Training speech synthesis neural networks using energy scores |
CN113112999B (en) * | 2021-05-28 | 2022-07-12 | 宁夏理工学院 | Short word and sentence voice recognition method and system based on DTW and GMM |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003022088A (en) * | 2001-07-10 | 2003-01-24 | Sharp Corp | Device and method for speaker's features extraction, voice recognition device, and program recording medium |
CN101751921A (en) * | 2009-12-16 | 2010-06-23 | 南京邮电大学 | Real-time voice conversion method under conditions of minimal amount of training data |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN103280224A (en) * | 2013-04-24 | 2013-09-04 | 东南大学 | Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||