CN110085254A - Many-to-many voice conversion method based on beta-VAE and i-vector - Google Patents
Many-to-many voice conversion method based on beta-VAE and i-vector
- Publication number
- CN110085254A CN110085254A CN201910323677.1A CN201910323677A CN110085254A CN 110085254 A CN110085254 A CN 110085254A CN 201910323677 A CN201910323677 A CN 201910323677A CN 110085254 A CN110085254 A CN 110085254A
- Authority
- CN
- China
- Prior art keywords
- speaker
- vae
- beta
- vector
- fundamental frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification; G10L17/04—Training, enrolment or model building
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility; G10L21/003—Changing voice quality, e.g. pitch or formants; G10L21/007—Changing voice quality characterised by the process used; G10L21/013—Adapting to target pitch; G10L2021/0135—Voice conversion or morphing
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00; G10L25/03—characterised by the type of extracted parameters; G10L25/18—the extracted parameters being spectral information of each sub-band; G10L25/27—characterised by the analysis technique; G10L25/30—using neural networks
Abstract
The invention discloses a many-to-many voice conversion method based on beta-VAE and i-vector. The variational autoencoder (VAE) framework is modified by introducing the adjustable parameters β and C, and the i-vector (identity feature vector) is combined with the improved VAE network. This strengthens the disentangling ability of the latent variable, remedies the limited coding capacity of the bottleneck layer, and enriches the speaker's personalized features, so that both the speaker similarity and the quality of the converted speech are improved, effectively raising the voice conversion performance of the existing VAE network.
Description
Technical field
The present invention relates to many-to-many voice conversion methods, and more particularly to a many-to-many voice conversion method based on beta-VAE and i-vector.
Background technique
After years of research, voice conversion technology has produced many classical conversion methods, including the Gaussian mixture model (Gaussian Mixed Model, GMM), frequency warping, deep neural networks (Deep Neural Network, DNN) and unit-selection-based methods. However, most of these methods require a parallel corpus for training: the source speaker and the target speaker must utter sentences with identical linguistic content and duration, and keep prosody and emotion as consistent as possible. In practical voice conversion applications, collecting a large parallel corpus is difficult or even impossible, and the accuracy of frame alignment of the speech features during training also limits the performance of the conversion system. Whether from the viewpoint of versatility or practicality, research on voice conversion under non-parallel text conditions therefore has great practical significance and application value.
The voice conversion method based on the variational autoencoder (Variational Autoencoder, VAE) model builds the conversion system directly from the speaker's identity label (one-hot). Such a system does not need to align the speech frames of the source and target speakers during model training, which removes the dependence on parallel text and yields a non-parallel voice conversion model. In the traditional non-parallel VAE-based conversion, the encoder extracts from the input speech parameters a latent variable representing the speaker-independent semantic content, and the decoder then reconstructs the parameters from this latent variable. However, because of the over-regularization effect on the VAE's latent variable, the latent variable lacks the capacity to characterize the speech data and is difficult to extend to more complex speech data, so speech converted from non-parallel corpora with the original VAE suffers from poor quality, heavy noise and other deficiencies. Moreover, the one-hot representation is only a speaker label: although it has an indicative function, it cannot supply richer speaker identity information. Engineers therefore need to study how to combine the i-vector (identity feature vector), which can fully express each speaker's personalized features, with the VAE model.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to provide a many-to-many voice conversion method based on beta-VAE and i-vector. It addresses the defect of the existing VAE network in which the speaker's individual information is represented only by a one-hot label and the speaker's personalized features therefore cannot be fully expressed. By combining the i-vector (identity feature vector) with the VAE model to enrich the speaker's individual information, the method better improves the speaker similarity and quality of the converted speech, effectively raising the voice conversion performance of the VAE network.
Technical solution: the many-to-many voice conversion method based on beta-VAE and i-vector of the present invention comprises a training stage and a conversion stage. The training stage comprises the following steps:
(1.1) obtaining a non-parallel training corpus containing source speakers and target speakers;
(1.2) extracting, through the WORLD speech analysis/synthesis model, the spectral envelope feature X, the aperiodic feature and the logarithmic fundamental frequency log f0 of each speaker's sentences in the training corpus;
(1.3) extracting the identity feature vector I of each speaker;
(1.4) inputting the spectral envelope feature X, the speaker label y and the identity feature vector I into a beta-VAE network composed of an encoder and a decoder for training, obtaining a trained beta-VAE network;
(1.5) constructing a fundamental frequency conversion function from the source speaker's fundamental frequency to the target speaker's fundamental frequency.
The conversion stage comprises the following steps:
(2.1) extracting, through the WORLD speech analysis/synthesis model, the spectral envelope feature X_s, the aperiodic feature and the logarithmic fundamental frequency log f0_s of each sentence of the source speaker's speech;
(2.2) inputting the source speaker's spectral feature X_s, the target speaker's label y_t and the target speaker's identity feature vector I_t into the beta-VAE network trained in step (1.4), and outputting the target speaker's spectral feature X_t;
(2.3) converting, with the fundamental frequency conversion function obtained in step (1.5), the source speaker's logarithmic fundamental frequency extracted in step (2.1) into the target speaker's logarithmic fundamental frequency;
(2.4) inputting the aperiodic feature obtained in step (2.1), the spectral feature X_t obtained in step (2.2) and the target speaker's logarithmic fundamental frequency obtained in step (2.3) into the WORLD speech analysis/synthesis model to obtain the converted target speaker's speech.
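The conversion stage above can be sketched as a skeleton. Every function here is a hypothetical stub standing in for WORLD, the trained beta-VAE network and the i-vector extractor; the arrays are random stand-ins, not real speech features.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- hypothetical stand-ins for the real components -------------------------
def world_analyze(wave):
    """Stub for WORLD analysis: returns (sp, ap, log_f0) per frame."""
    n = 100
    return rng.random((n, 513)), rng.random((n, 513)), np.full(n, np.log(150.0))

def world_synthesize(sp, ap, log_f0):
    """Stub for WORLD synthesis: returns a fake waveform."""
    return np.zeros(16000)

def beta_vae_convert(sp_source, y_target, ivec_target):
    """Stub for the trained beta-VAE: encoder -> z, decoder(z, y_t, I_t) -> sp_t."""
    return sp_source  # identity placeholder

def convert_lf0(log_f0, mu_s, sigma_s, mu_t, sigma_t):
    """Log-domain linear fundamental frequency transform."""
    return (sigma_t / sigma_s) * (log_f0 - mu_s) + mu_t

# --- conversion stage, steps (2.1)-(2.4) ------------------------------------
def convert(wave_source, y_target, ivec_target, f0_stats):
    sp, ap, lf0 = world_analyze(wave_source)             # (2.1)
    sp_t = beta_vae_convert(sp, y_target, ivec_target)   # (2.2)
    lf0_t = convert_lf0(lf0, *f0_stats)                  # (2.3)
    return world_synthesize(sp_t, ap, lf0_t)             # (2.4)

out = convert(np.zeros(16000), y_target=3, ivec_target=np.zeros(100),
              f0_stats=(np.log(150.0), 0.2, np.log(220.0), 0.25))
print(out.shape)   # (16000,)
```

In a real system the stubs would be replaced by pyworld calls and the trained network; only the data flow between the four steps is shown here.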
Further, the input and training steps in step (1.4) are as follows:
(1) inputting X into the encoder of the beta-VAE network, the encoder outputting the semantic feature z;
(2) inputting z, y and I into the decoder of the beta-VAE network, and minimizing the distance D(X, X_t') between X and the spectral envelope feature X_t' generated by the decoder;
(3) repeating the above steps until the number of iterations is reached;
(4) calculating the MCD value of the beta-VAE network, and selecting the model parameters β and C according to the smallest MCD value combined with the subjective evaluation standard Mean Opinion Score.
Further, D(X, X_t') is measured with the KL divergence:
D_KL = (1/2) Σ_{d=1}^{D} [ (μ_d)^2 + (σ_d)^2 − log (σ_d)^2 − 1 ]
where D is the dimension of z, and μ_d and (σ_d)^2 are the d-th components of the mean vector and variance vector of the approximate normal distribution of X.
Further, the input process in step (2.2) is as follows: inputting the source speaker's spectral feature X_s into the encoder of the beta-VAE network, then inputting the encoder's output together with y_t and I_t into the decoder of the beta-VAE network, the conversion yielding the target speaker's spectral feature X_t.
Further, the encoder uses a two-dimensional convolutional neural network comprising 5 convolutional layers and 1 fully connected layer; the filter size of each of the 5 convolutional layers is 7*1, the stride is 3, and the filter depths are 16, 32, 64, 128 and 256 respectively. The decoder uses a two-dimensional convolutional neural network comprising 4 convolutional layers; the filter sizes of the 4 convolutional layers are 9*1, 7*1, 7*1 and 1025*1, the strides are 3, 3, 3 and 1, and the filter depths are 32, 16, 8 and 1 respectively.
Further, the fundamental frequency conversion function is:
log f0_t' = (σ_t / σ_s) (log f0_s − μ_s) + μ_t
where log f0_s is the source speaker's fundamental frequency, log f0_t' is the converted target speaker's fundamental frequency, μ_s and σ_s are the mean and standard deviation of the source speaker's fundamental frequency in the log domain, and μ_t and σ_t are the mean and standard deviation of the target speaker's fundamental frequency in the log domain.
Beneficial effects: the method improves the existing VAE model and applies the i-vector to the improved VAE model. It not only better improves the quality of the converted speech, but also fully expresses each speaker's personalized features and enriches the speaker identity information.
Description of the drawings
Fig. 1 is the overall flow chart of the method.
Specific embodiment
As shown in Fig. 1, this embodiment provides a many-to-many voice conversion method based on beta-VAE and i-vector, divided into two stages: training and conversion.
1. Speaker voice training stage
1.1 Obtain a non-parallel training corpus. The speech database used here is VCC2018, which contains 8 source speakers (SF1, SF2, SM1, SM2, SF3, SF4, SM3, SM4) and 4 target speakers (TF1, TF2, TM1, TM2). The non-parallel training corpus chosen here consists of 4 source speakers (SF3, SF4, SM3, SM4) and 4 target speakers (TF1, TF2, TM1, TM2), where S (source) denotes a source speaker, T (target) a target speaker, F (female) a female speaker and M (male) a male speaker. Since the goal of this work is non-parallel voice conversion, the chosen training corpus is also non-parallel, i.e., the speech content of the source and target speakers differs. For each speaker, 81 sentences are used as training corpus for full training, and 35 sentences are used as test corpus for model evaluation.
1.2 Use the speech analysis/synthesis model WORLD to extract the features of each speaker's sentences, comprising the spectral envelope sp of each frame, the logarithmic fundamental frequency log f0 and the aperiodic feature ap, with speech sampling frequency fs = 16000. A 1024-point fast Fourier transform is performed here, so the obtained spectral envelope feature sp and aperiodic feature ap both have 1024/2 + 1 = 513 dimensions. ap and sp are n*513-dimensional two-dimensional matrices, the speaker label y is the number of each speaker's subset in the training speech set, and the spectral feature of each extracted frame is finally expressed as X = [sp].
1.3 Extract the identity feature vector i-vector of each speaker, denoted I.
The i-vector is a novel low-dimensional fixed-length feature vector proposed on the basis of the Gaussian mixture model–universal background model (Gaussian Mixture Model–Universal Background Model, GMM-UBM) and channel analysis; it is obtained by jointly modeling the speaker and channel variables.
Given a segment of speech, the speaker- and channel-dependent GMM supervector can be written as:
M = m + Tω
where M denotes the speaker's Gaussian mean supervector, m denotes the speaker- and channel-independent Gaussian mean supervector under the universal background model (UBM), T is the low-dimensional total variability space matrix, and ω is the total variability factor, which a priori obeys the standard normal distribution N(0, I); ω is the identity feature vector i-vector.
The total factor ω is a latent variable defined by its posterior distribution, which is also normal and can be extracted from the Baum-Welch statistics computed with the universal background model (UBM). Given the speech sequence {s_1, s_2, s_3, ..., s_L} of speaker s, for each Gaussian component c the Baum-Welch statistics of the mixture weight and the mean vector are defined here as:
N_c(s) = Σ_{t=1}^{L} P(c | s_t),  F_c(s) = Σ_{t=1}^{L} P(c | s_t) s_t
where c = 1, ..., C indexes the Gaussian components and P(c | s_t) is the posterior probability that mixture component c generates the vector s_t. To estimate the i-vector, the first-order Baum-Welch statistics centered on the UBM mixture-component means are also needed:
F̃_c(s) = Σ_{t=1}^{L} P(c | s_t)(s_t − m_c)
where m_c is the mean of UBM mixture component c. The ω factor of a given speaker s can then be obtained by:
ω(s) = (I + Tᵀ Σ⁻¹ N(s) T)⁻¹ Tᵀ Σ⁻¹ F̃(s)
where N(s) is the CF × CF diagonal matrix whose diagonal blocks are N_c I (c = 1, ..., C), F̃(s) is the CF × 1 supervector obtained by concatenating the centered first-order Baum-Welch statistics F̃_c(s) of the given speaker s, and Σ is a CF × CF diagonal covariance matrix.
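The posterior-mean formula for ω can be sketched numerically. This is a toy illustration with invented small dimensions (C components, F-dimensional features, rank R), not the Kaldi-based extractor used in the embodiment; it only exploits the fact that Σ and N(s) are diagonal.

```python
import numpy as np

def estimate_ivector(T, Sigma_diag, N_c, F_tilde):
    """Posterior mean omega(s) = (I + T' Sigma^-1 N(s) T)^-1 T' Sigma^-1 F_tilde(s).

    T          : (C*F, R) total variability matrix
    Sigma_diag : (C*F,)   diagonal of the covariance supermatrix Sigma
    N_c        : (C,)     zeroth-order Baum-Welch statistics per component
    F_tilde    : (C*F,)   centered first-order statistics, concatenated
    """
    CF, R = T.shape
    C = N_c.shape[0]
    F = CF // C
    N_diag = np.repeat(N_c, F)              # diagonal of N(s): blocks N_c * I
    Tt_Sinv = T.T / Sigma_diag              # T' Sigma^-1 (Sigma is diagonal)
    precision = np.eye(R) + Tt_Sinv @ (N_diag[:, None] * T)
    return np.linalg.solve(precision, Tt_Sinv @ F_tilde)

rng = np.random.default_rng(0)
C, F, R = 4, 3, 2                           # toy sizes (real systems: e.g. R = 100)
T = rng.standard_normal((C * F, R))
Sigma_diag = np.ones(C * F)
N_c = np.full(C, 5.0)
F_tilde = rng.standard_normal(C * F)
omega = estimate_ivector(T, Sigma_diag, N_c, F_tilde)
print(omega.shape)                          # (2,)
```

Note that with zero statistics the estimate falls back to the prior mean 0, matching the N(0, I) prior on ω.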
The i-vector contains both speaker information and channel information. The channel information is removed by linear discriminant analysis (Linear Discriminant Analysis, LDA) and within-class covariance normalization (Within-Class Covariance Normalization, WCCN). In practice, the i-vector can be extracted with the Kaldi framework; in this embodiment the i-vector is a 100-dimensional feature parameter.
1.4 For the training of the beta-VAE network, the spectral feature X from 1.2 is fed to the encoder of the VAE model for model training. The speaker-independent semantic feature z output by the encoder, the speaker label y and the identity feature vector I representing the speaker's i-vector form the joint vector (z, y, I), which is fed to the decoder of the VAE model, completing the training of the voice conversion model. During training, the VAE encoder in Fig. 1 uses a two-dimensional convolutional neural network with 5 convolutional layers and 1 fully connected layer; the filter size of each of the 5 convolutional layers is 7*1, the stride is 3, and the filter depths are 16, 32, 64, 128 and 256 respectively. The decoder uses a two-dimensional convolutional neural network with 4 convolutional layers; their filter sizes are 9*1, 7*1, 7*1 and 1025*1, their strides are 3, 3, 3 and 1, and their filter depths are 32, 16, 8 and 1 respectively.
1.5 The original VAE uses the inference model q_φ(z|X) to approximate the true posterior p_θ(z|X), and measures the similarity of the two distributions with the KL divergence, as shown in equation 1-1:
D_KL(q_φ(z|X) ‖ p_θ(z|X))   (1-1)
which denotes the KL divergence between the inference model q_φ(z|X) and the true posterior model p_θ(z|X).
Applying Bayes' formula to equation 1-1 and rearranging yields equation 1-2:
log p_θ(X) = D_KL(q_φ(z|X) ‖ p_θ(z|X)) + Γ(θ, φ; X)   (1-2)
In the VAE framework, the log probability of each frame x^(i) can thus be rewritten as equation 1-3:
log p_θ(x^(i)) = D_KL(q_φ(z|x^(i)) ‖ p_θ(z|x^(i))) + Γ(θ, φ; x^(i))   (1-3)
where q_φ(z|x^(i)) is the variational posterior, p_θ(z|x^(i)) is the true posterior, D_KL(·‖·) is the KL divergence between the two, and Γ(θ, φ; x^(i)) is the variational lower bound of the marginal probability. The bound can further be written as equation 1-4:
Γ(θ, φ; x^(i)) = −D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) + E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)]   (1-4)
Equation 1-4 is the objective function of the original VAE network.
The beta-VAE and i-vector network model described here modifies the original VAE framework: the adjustable parameters β and C are introduced into the basic VAE framework, and the i-vector, denoted I here, which contains richer speaker individual information, is also introduced.
The resulting objective is equation 1-5:
Γ(θ, φ; x^(i), β) = −β | D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) − C | + E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, y, I)]   (1-5)
In equation 1-5, the KL divergence in the first term on the right is the latent-layer loss, and the second term on the right, E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, y, I)], is the generation loss. Changing β changes the degree of pressure applied during model learning, yielding different latent-variable disentangling abilities. When β = 1 the model is the original VAE; when β > 1 a stronger constraint is applied to the latent bottleneck, giving a better ability to disentangle the data. Disentangling means that a single latent variable is sensitive to changes in a single generative factor while being relatively insensitive to changes in the other factors. Disentangled variables usually have good interpretability and generalize easily across tasks, but precisely this improvement in disentangling limits the ability of the VAE bottleneck layer to encode features efficiently, distorting the reconstructed data. Therefore, while setting β greater than 1, the parameter C is also added to increase the coding capacity of the bottleneck layer. That is, while the latent variable z gains disentangling ability, it also gains the ability to characterize the speech data better, so that p_θ(x^(i)|z) approaches p_θ(x^(i)) more closely and the system performance improves.
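A minimal numerical sketch of the objective in equation 1-5, assuming a diagonal-Gaussian q(z|x) with the closed-form KL and a squared-error stand-in for the generation loss. The β = 150, C = 20 values come from the parameter-selection section below; all shapes are illustrative, not the trained system.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian q."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

def beta_vae_loss(x, x_recon, mu, log_var, beta=150.0, C=20.0):
    """Negative of the equation 1-5 bound: beta * |KL - C| plus generation loss.

    The true generation loss is -log p(x | z, y, I); a Gaussian decoder with
    unit variance reduces it (up to constants) to squared error, used here.
    """
    kl = kl_to_standard_normal(mu, log_var)
    recon = 0.5 * np.sum((x - x_recon)**2)
    return beta * np.abs(kl - C) + recon

mu = np.zeros(16)
log_var = np.zeros(16)                   # sigma = 1 -> KL = 0
x = np.ones(513)
loss = beta_vae_loss(x, x, mu, log_var)  # perfect reconstruction, KL = 0
print(loss)                              # beta * |0 - C| = 150 * 20 = 3000.0
```

The |KL − C| term shows why C raises the bottleneck capacity: the KL is pulled toward C rather than toward 0, so the latent code is allowed to carry about C nats of information.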
Meanwhile speaker's label y in original VAE model is only an one-hot label, and one-hot label is only
It is that, although having certain indicative function, more speakers can not be provided for distinguishing different speakers label
Identity information, therefore promoted and be not obvious in the individual character similarity of converting speech.Herein by one speaker characteristic of addition
Vector I, Lai Fengfu speaker's identity characteristic information promotes the individual character similarity of voice after conversion.
The overall objective function Γ(θ, φ; x^(i), β) of the beta-VAE network in equation 1-5 is used to optimize the encoder parameters φ and the decoder parameters θ. The expectation term is usually estimated by sampling, i.e. equation 1-6:
E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)] ≈ (1/L) Σ_{l=1}^{L} log p_θ(x^(i) | z^(i,l))   (1-6)
where L is the number of samples drawn per frame. The reparameterization trick is generally used: a standard normal random variable is generated, and the sample from the distribution of z is obtained through a deterministic, data-driven function:
z^(i,l) = f_μ(x^(i); φ_1) + f_σ(x^(i); φ_2) ∘ ε^(l),  ε^(l) ~ N(0, I)
where ∘ denotes the element-wise product, f_μ and f_σ are nonlinear functions composed of feedforward neural networks, and φ = {φ_1, φ_2} is the parameter set of the encoder; f_μ generates the mean of the latent variable z and f_σ generates its variance. With the reparameterization, equation 1-6 turns the bound into equation 1-7:
Γ(θ, φ; x^(i), β) ≈ −β | D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) − C | + (1/L) Σ_{l=1}^{L} log p_θ(x^(i) | z^(i,l))   (1-7)
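The reparameterization step can be sketched as follows; the encoder outputs are faked with fixed arrays, since the real f_μ and f_σ are the convolutional networks described above.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """z = mu + sigma ∘ eps with eps ~ N(0, I): a differentiable sample from q(z|x)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(42)
mu = np.array([1.0, -2.0, 0.5])     # stand-in for f_mu(x; phi_1)
sigma = np.array([0.1, 0.1, 0.1])   # stand-in for f_sigma(x; phi_2)
samples = np.stack([reparameterize(mu, sigma, rng) for _ in range(20000)])
print(samples.mean(axis=0).round(1))   # close to mu
```

Because the randomness lives entirely in ε, gradients can flow through mu and sigma to the encoder, which is what makes single-sample (L = 1) training below feasible.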
Setting L to 1 simplifies the above formula and yields the final objective function:
Γ(θ, φ; x^(i), β) = −β | D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) − C | + log p_θ(x^(i) | z^(i))   (1-8)
The beta-VAE model assumes that z is distributed as an isotropic standard normal distribution, so the latent-variable loss (the KL divergence) can be rewritten as:
D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) = (1/2) Σ_{d=1}^{D} [ (μ_d)^2 + (σ_d)^2 − log (σ_d)^2 − 1 ]   (1-9)
where D is the dimension of the latent variable z, and μ_d and (σ_d)^2 are the d-th components of the mean vector and variance vector of the approximate normal distribution.
Assume that the visible feature variable X (the log spectrum) obeys a Gaussian distribution with a diagonal variance matrix, i.e.:
p_θ(x^(i)|z) = N(x^(i); f_μ'(z; θ_1), f_σ'(z; θ_2))
where f_μ' and f_σ' are nonlinear functions composed of feedforward neural networks and θ = {θ_1, θ_2} is the parameter set of the decoder. The log-probability term in equation 1-8 can therefore be rewritten as:
log p_θ(x^(i)|z) = −(1/2) Σ_{d=1}^{Dx} [ log 2π(σ'_d)^2 + (x_d^(i) − μ'_d)^2 / (σ'_d)^2 ]   (1-10)
where Dx is the dimension of the feature x^(i).
Substituting equations 1-9 and 1-10 into 1-8 gives the final objective function; training the beta-VAE amounts to iteratively finding the parameters that maximize the variational lower bound:
(θ*, φ*) = arg max_{θ,φ} Σ_i Γ(θ, φ; x^(i), β)   (1-11)
The above formula is generally optimized with stochastic gradient descent; in this experiment the number of iterations is chosen as 20000.
After the model is trained, spectrum conversion only requires specifying the target speaker's y_t and I_t: the encoder turns the input spectral frame into the latent variable z, and the decoder then reconstructs (z, y_t, I_t) into X_t.
1.6 At this point the training of the beta-VAE and i-vector model is complete.
2. Speaker speech conversion stage
2.1 Use the WORLD speech analysis/synthesis model to extract the source speaker's speech feature parameters, including the spectral envelope sp, the logarithmic fundamental frequency log f0 and the aperiodic feature ap; the spectral feature of each extracted frame is finally expressed as X_s = [sp];
2.2 Input the source speaker's spectral feature X_s, the target speaker's label y_t and the target speaker's identity feature vector I_t into the trained conversion model, where the target speaker's label y_t serves as the control condition of the decoding process of the beta-VAE and i-vector spectrum conversion network, obtaining the converted target speaker's speech spectral parameters X_t;
2.3 Convert the source speaker's logarithmic fundamental frequency log f0 extracted in 2.1 into the target speaker's fundamental frequency using the log-domain linear transform.
The log-domain linear transform is simple and at present the most widely used fundamental frequency conversion method. It is based on the assumption that each speaker's fundamental frequency obeys a Gaussian distribution in the log domain. Then, as long as the mean and standard deviation of each speaker's logarithmic fundamental frequency are computed, the fundamental frequency conversion relation between the two speakers can be constructed:
log f0_t' = (σ_t / σ_s) (log f0_s − μ_s) + μ_t
where μ_s and σ_s are the mean and standard deviation of the source speaker's fundamental frequency in the log domain, μ_t and σ_t are the mean and standard deviation of the target speaker's fundamental frequency in the log domain, log f0_s is the source speaker's fundamental frequency, and log f0_t' is the converted target speaker's fundamental frequency.
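The log-domain linear transform above is straightforward to implement. The Gaussian statistics below are invented toy values (a low-pitched source mapped toward a higher-pitched target), not statistics measured from VCC2018.

```python
import numpy as np

def convert_lf0(lf0_source, mu_s, sigma_s, mu_t, sigma_t):
    """log f0_t' = (sigma_t / sigma_s) * (log f0_s - mu_s) + mu_t.

    Unvoiced frames are conventionally marked with lf0 <= 0 and left untouched.
    """
    lf0 = np.asarray(lf0_source, dtype=float)
    voiced = lf0 > 0
    out = lf0.copy()
    out[voiced] = (sigma_t / sigma_s) * (lf0[voiced] - mu_s) + mu_t
    return out

# toy statistics: source around 120 Hz, target around 220 Hz
mu_s, sigma_s = np.log(120.0), 0.15
mu_t, sigma_t = np.log(220.0), 0.20
lf0_src = np.array([np.log(120.0), np.log(130.0), 0.0])   # last frame unvoiced
lf0_conv = convert_lf0(lf0_src, mu_s, sigma_s, mu_t, sigma_t)
print(np.exp(lf0_conv[0]))    # a mean-pitch source frame maps to the target mean, 220 Hz
```

By construction, a frame at the source mean is mapped exactly to the target mean, and deviations from the mean are rescaled by the ratio of the log-domain standard deviations.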
2.4 Finally, synthesize the converted target speaker's speech with the speech synthesis tool WORLD from the target speaker's spectral feature X_t, the aperiodic feature ap and the converted target speaker's fundamental frequency.
3. Parameter selection
3.1 For the choice of the specific values of β and C, the embodiment uses a combination of the objective evaluation standard Mel-cepstral distortion (Mel-Cepstral Distortion, MCD) and the subjective evaluation standard Mean Opinion Score (Mean Opinion Score, MOS). MCD is a computational model of the objective distortion degree using the Mel-frequency cepstrum coefficients (Mel Frequency Cepstrum Coefficient, MFCC) as the characteristic parameters of the speech signal, and the MOS score is a key index measuring speech quality (clarity and naturalness). Here the MCD value between the converted speaker's speech and the target speaker's speech is used as the objective standard to compare the conversion performance of different systems. The MCD value is calculated by the following formula:
MCD[dB] = (10 / ln 10) √( 2 Σ_{d=1}^{N} (c_d − c_d')^2 )
where c_d and c_d' are the d-th Mel-cepstral coefficients of the target speech and the converted speech respectively, and N is the dimension of the Mel cepstrum. The smaller the MCD value, the smaller the distortion between the converted speech and the target speech, i.e., the more similar the converted speaker's personal characteristic parameters are to the target speaker's, and the better the conversion model performs. MCD is a fairly objective quality evaluation method and is widely used in practice.
3.2 Models with different combinations of β and C values are trained, and the average MCD under the SF3-TM1 conversion case is calculated, as shown in Table 1:
Table 1: average MCD of models with different parameter combinations under the SF3-TM1 conversion case
The data in Table 1 show that the MCD value decreases gradually as β increases, though the reduction is small, while with increasing C the MCD value first decreases and then gradually increases. Taking the subjective evaluation standard into account as well, with the model parameters β = 150 and C = 20 the speech quality and speaker similarity are optimal compared with the other combinations, and the listening effect is best.
Claims (6)
1. A many-to-many voice conversion method based on beta-VAE and i-vector, characterized by comprising a training stage and a conversion stage, the training stage comprising the following steps:
(1.1) obtaining a non-parallel training corpus containing speech of the source speakers and target speakers;
(1.2) extracting, for each speaker's sentences in the training corpus, the spectral envelope feature X, the aperiodic feature, and the logarithmic fundamental frequency log f0 via the WORLD speech analysis/synthesis model;
(1.3) extracting the identity feature vector I of each speaker;
(1.4) inputting the spectral envelope feature X, the speaker label y, and the identity feature vector I into a beta-VAE network composed of an encoder and a decoder and training it, obtaining a trained beta-VAE network;
(1.5) constructing a fundamental frequency transfer function from the pitch of the source speaker's speech to the pitch of the target speaker's speech;
the conversion stage comprising the following steps:
(2.1) extracting, for every sentence of the source speaker's speech, the spectral envelope feature Xs, the aperiodic feature, and the logarithmic fundamental frequency log f0s via the WORLD speech analysis/synthesis model;
(2.2) inputting the source speaker's spectral feature Xs, the target speaker's label yt, and the target speaker's identity feature vector It into the beta-VAE network trained in step (1.4), which outputs the target speaker's spectral feature Xt;
(2.3) obtaining the fundamental frequency transfer function from step (1.5), and converting the source speaker's logarithmic fundamental frequency log f0s extracted in step (2.1) into the target speaker's logarithmic fundamental frequency log f0t;
(2.4) inputting the aperiodic feature obtained in step (2.1), the spectral feature Xt obtained in step (2.2), and the target speaker's logarithmic fundamental frequency log f0t obtained in step (2.3) into the WORLD speech analysis/synthesis model, obtaining the converted target speaker's speech.
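The data flow of conversion steps (2.1)-(2.4) can be sketched as follows. This is an illustrative sketch only: `world_analyze`, `beta_vae_convert`, and `world_synthesize` are hypothetical stand-ins for the WORLD vocoder and the trained beta-VAE network (a real implementation would use, e.g., the WORLD vocoder library and the trained model), and the feature dimensions are assumed values:

```python
import numpy as np

def world_analyze(wave, fs):
    """Stand-in for WORLD analysis: returns (spectral envelope, aperiodicity, log-f0).
    Shapes mimic typical WORLD output; values here are random placeholders."""
    n_frames, sp_dim = 100, 1025
    sp = np.abs(np.random.randn(n_frames, sp_dim))      # spectral envelope Xs
    ap = np.random.rand(n_frames, sp_dim)               # aperiodic feature
    log_f0 = np.random.randn(n_frames) * 0.1 + np.log(150.0)
    return sp, ap, log_f0

def beta_vae_convert(sp_src, y_tgt, ivec_tgt):
    """Stand-in for the trained beta-VAE: encoder output plus (yt, It) into the decoder."""
    return sp_src  # identity placeholder for the learned spectral mapping

def convert_f0(log_f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Log-domain linear fundamental frequency transfer function (claim 6)."""
    return mu_t + (sigma_t / sigma_s) * (log_f0_src - mu_s)

def world_synthesize(sp, ap, log_f0, fs):
    """Stand-in for WORLD synthesis; returns a waveform of plausible length."""
    return np.zeros(sp.shape[0] * fs // 200)

# Conversion stage, steps (2.1)-(2.4)
fs = 16000
src_wave = np.random.randn(fs)                                   # 1 s of source speech
sp_s, ap_s, logf0_s = world_analyze(src_wave, fs)                # (2.1)
sp_t = beta_vae_convert(sp_s, y_tgt=3, ivec_tgt=np.zeros(100))   # (2.2)
logf0_t = convert_f0(logf0_s, logf0_s.mean(), logf0_s.std(),     # (2.3)
                     np.log(220.0), 0.1)
out_wave = world_synthesize(sp_t, ap_s, logf0_t, fs)             # (2.4)
```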
2. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the input and training in step (1.4) comprise the following steps:
(1) inputting X into the encoder of the beta-VAE network, the encoder outputting the semantic feature z;
(2) inputting z, y, and I into the decoder of the beta-VAE network, and minimizing the distance D(X, Xt') between X and Xt', where Xt' is the spectral envelope feature generated by the decoder;
(3) repeating the above steps until the set number of iterations is reached;
(4) calculating the MCD value of the beta-VAE network, and selecting the model parameters β and C according to the smallest MCD value combined with the subjective evaluation standard Mean Opinion Score.
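The MCD value used in step (4) is conventionally the mel-cepstral distortion between aligned target and converted mel-cepstra, averaged over frames. A minimal numpy sketch of the standard formula (the cepstral order and the exclusion of the energy coefficient c0 are assumptions, not specified in the claim):

```python
import numpy as np

def mcd(mc_target, mc_converted):
    """Mel-cepstral distortion in dB, averaged over frames.
    mc_*: (n_frames, order) aligned mel-cepstral matrices, c0 assumed excluded."""
    diff = mc_target - mc_converted
    # per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

mc = np.random.randn(50, 24)
print(mcd(mc, mc))  # 0.0 -- identical cepstra give zero distortion
```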
3. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 2, characterized in that D(X, Xt') is measured using the KL divergence, the KL divergence being
D(X, Xt') = (1/2) Σ_{i=1}^{D} ( (μ^(i))^2 + (σ^(i))^2 − log (σ^(i))^2 − 1 )
where D is the dimension of z, and μ^(i) and (σ^(i))^2 are the i-th components of the mean vector and variance vector, respectively, of the approximate normal distribution of X.
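In the standard VAE formulation this expression is the closed-form KL divergence between the encoder's diagonal-Gaussian posterior N(μ, diag(σ²)) and the unit-Gaussian prior; a minimal numpy check (variable names are illustrative):

```python
import numpy as np

def kl_term(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ), summed over the D latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + sigma2 - np.log(sigma2) - 1.0)

# the divergence vanishes exactly when the posterior equals the prior
print(kl_term(np.zeros(16), np.ones(16)))  # 0.0
```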
4. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the input process in step (2.2) is as follows: inputting the source speaker's spectral feature Xs into the encoder of the beta-VAE network, then inputting the encoder's output together with yt and It into the decoder of the beta-VAE network, the conversion yielding the target speaker's spectral feature Xt.
5. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that: the encoder uses a two-dimensional convolutional neural network comprising 5 convolutional layers and 1 fully connected layer; the filter size of the 5 convolutional layers is 7×1, the stride is 3, and the filter depths are 16, 32, 64, 128, and 256, respectively; the decoder uses a two-dimensional convolutional neural network comprising 4 convolutional layers, whose filter sizes are 9×1, 7×1, 7×1, and 1025×1, whose strides are 3, 3, 3, and 1, and whose filter depths are 32, 16, 8, and 1, respectively.
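Assuming unpadded ("valid") convolutions, the per-layer feature lengths of the encoder in claim 5 follow from the usual output-size formula; the 1025-sample input length is an assumption read off the decoder's final 1025×1 filter, not stated in the claim:

```python
def conv_out_len(length, kernel, stride):
    """Output length of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1

# encoder: 5 convolutional layers, 7x1 filters, stride 3 (depths 16,32,64,128,256)
lengths = [1025]
for _ in range(5):
    lengths.append(conv_out_len(lengths[-1], kernel=7, stride=3))
print(lengths[1:])  # [340, 112, 36, 10, 2]
```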
6. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the fundamental frequency transfer function is:
log f0_t' = μt + (σt/σs)(log f0_s − μs)
where log f0_s is the fundamental frequency of the source speaker, log f0_t' is the fundamental frequency of the converted target speaker, the mean and variance of the source speaker's fundamental frequency in the log domain are μs and σs respectively, and the mean and variance of the target speaker's fundamental frequency in the log domain are μt and σt respectively.
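The transfer function of claim 6 is the familiar log-domain Gaussian normalization of pitch; a minimal numpy sketch, treating σs and σt as log-domain standard deviations (an assumption about the claim's notation):

```python
import numpy as np

def transfer_f0(log_f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Map source log-f0 frames onto the target speaker's log-f0 statistics."""
    return mu_t + (sigma_t / sigma_s) * (log_f0_src - mu_s)

# a source frame exactly at the source mean maps exactly to the target mean
print(transfer_f0(np.array([5.0]), mu_s=5.0, sigma_s=0.2, mu_t=5.4, sigma_t=0.3))  # [5.4]
```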
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910323677.1A CN110085254A (en) | 2019-04-22 | 2019-04-22 | Multi-to-multi phonetics transfer method based on beta-VAE and i-vector |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110085254A (en) | 2019-08-02 |
Family
ID=67416095
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910323677.1A Pending CN110085254A (en) | 2019-04-22 | 2019-04-22 | Multi-to-multi phonetics transfer method based on beta-VAE and i-vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110085254A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100198600A1 (en) * | 2005-12-02 | 2010-08-05 | Tsuyoshi Masuda | Voice Conversion System |
CN104217721A (en) * | 2014-08-14 | 2014-12-17 | 东南大学 | Speech conversion method based on asymmetric speech database conditions of speaker model alignment |
CN108461079A (en) * | 2018-02-02 | 2018-08-28 | 福州大学 | A kind of song synthetic method towards tone color conversion |
WO2018159612A1 (en) * | 2017-02-28 | 2018-09-07 | 国立大学法人電気通信大学 | Voice quality conversion device, voice quality conversion method and program |
CN108777140A (en) * | 2018-04-27 | 2018-11-09 | 南京邮电大学 | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder |
CN109377978A (en) * | 2018-11-12 | 2019-02-22 | 南京邮电大学 | Multi-to-multi voice conversion method under non-parallel text condition based on i vector |
CN109584893A (en) * | 2018-12-26 | 2019-04-05 | 南京邮电大学 | Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition |
Non-Patent Citations (3)
Title |
---|
Irina Higgins et al.: "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework", ICLR * |
Ling Yunzhi: "Research on high-quality voice conversion based on variational autoencoder models and bottleneck features under non-parallel text", China Master's Theses Full-text Database, Information Science and Technology * |
Huang Guojie et al.: "Enhanced variational autoencoder for non-parallel corpus voice conversion", Journal of Signal Processing * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110600047A (en) * | 2019-09-17 | 2019-12-20 | 南京邮电大学 | Perceptual STARGAN-based many-to-many speaker conversion method |
CN110853616A (en) * | 2019-10-22 | 2020-02-28 | 武汉水象电子科技有限公司 | Speech synthesis method, system and storage medium based on neural network |
CN111247585A (en) * | 2019-12-27 | 2020-06-05 | 深圳市优必选科技股份有限公司 | Voice conversion method, device, equipment and storage medium |
WO2021128256A1 (en) * | 2019-12-27 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Voice conversion method, apparatus and device, and storage medium |
CN111247585B (en) * | 2019-12-27 | 2024-03-29 | 深圳市优必选科技股份有限公司 | Voice conversion method, device, equipment and storage medium |
CN113077810A (en) * | 2021-03-19 | 2021-07-06 | 杨予诺 | Sound source separation method based on beta-VAE algorithm |
CN114420142A (en) * | 2022-03-28 | 2022-04-29 | 北京沃丰时代数据科技有限公司 | Voice conversion method, device, equipment and storage medium |
CN115050087A (en) * | 2022-08-16 | 2022-09-13 | 之江实验室 | Method and device for decoupling identity and expression of key points of human face |
CN115050087B (en) * | 2022-08-16 | 2022-11-18 | 之江实验室 | Method and device for decoupling identity and expression of key points of human face |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110085254A (en) | Multi-to-multi phonetics transfer method based on beta-VAE and i-vector | |
CN110600047B (en) | Perceptual STARGAN-based multi-to-multi speaker conversion method | |
CN109671442B (en) | Many-to-many speaker conversion method based on STARGAN and x vectors | |
CN101064104B (en) | Emotion voice creating method based on voice conversion | |
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN110060701B (en) | Many-to-many voice conversion method based on VAWGAN-AC | |
CN108305616A (en) | A kind of audio scene recognition method and device based on long feature extraction in short-term | |
CN110060690A (en) | Multi-to-multi voice conversion method based on STARGAN and ResNet | |
CN109599091B (en) | Star-WAN-GP and x-vector based many-to-many speaker conversion method | |
CN110047501A (en) | Multi-to-multi phonetics transfer method based on beta-VAE | |
CN108461079A (en) | A kind of song synthetic method towards tone color conversion | |
CN105096955B (en) | A kind of speaker's method for quickly identifying and system based on model growth cluster | |
CN103544963A (en) | Voice emotion recognition method based on core semi-supervised discrimination and analysis | |
CN105023570B (en) | A kind of method and system for realizing sound conversion | |
CN102568476B (en) | Voice conversion method based on self-organizing feature map network cluster and radial basis network | |
CN104123933A (en) | Self-adaptive non-parallel training based voice conversion method | |
CN103456302B (en) | A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight | |
CN109584893A (en) | Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition | |
CN106997765B (en) | Quantitative characterization method for human voice timbre | |
CN110246488A (en) | Half optimizes the phonetics transfer method and device of CycleGAN model | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
CN110060691A (en) | Multi-to-multi phonetics transfer method based on i vector sum VARSGAN | |
CN104240706A (en) | Speaker recognition method based on GMM Token matching similarity correction scores | |
CN103413548B (en) | A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine | |
CN110189766A (en) | A kind of voice style transfer method neural network based |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190802 |