CN110085254A - Many-to-many voice conversion method based on beta-VAE and i-vector - Google Patents
Many-to-many voice conversion method based on beta-VAE and i-vector
- Publication number
- CN110085254A CN110085254A CN201910323677.1A CN201910323677A CN110085254A CN 110085254 A CN110085254 A CN 110085254A CN 201910323677 A CN201910323677 A CN 201910323677A CN 110085254 A CN110085254 A CN 110085254A
- Authority
- CN
- China
- Prior art keywords
- speaker
- vae
- beta
- vector
- fundamental frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification; G10L17/04—Training, enrolment or model building
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility; G10L21/003—Changing voice quality, e.g. pitch or formants; G10L21/007—Changing voice quality characterised by the process used; G10L21/013—Adapting to target pitch; G10L2021/0135—Voice conversion or morphing
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00; G10L25/03—characterised by the type of extracted parameters; G10L25/18—the extracted parameters being spectral information of each sub-band; G10L25/27—characterised by the analysis technique; G10L25/30—using neural networks
Abstract
The invention discloses a many-to-many voice conversion method based on beta-VAE and i-vector. The variational autoencoder (VAE) framework is modified by introducing the adjustable parameters β and C, and the i-vector (identity feature vector) is combined with the improved VAE network. This strengthens the disentangling ability of the latent variable, remedies the limited coding capacity of the bottleneck layer, and enriches the speaker's personalized features, so that both the speaker similarity and the quality of the converted speech are improved, effectively raising the voice conversion performance of the existing VAE network.
Description
Technical field
The present invention relates to many-to-many voice conversion methods, and more particularly to a many-to-many voice conversion method based on beta-VAE and i-vector.
Background technique
After years of research, voice conversion technology has produced many classical conversion methods, including the Gaussian mixture model (Gaussian Mixed Model, GMM), frequency warping, deep neural networks (Deep Neural Network, DNN) and unit-selection-based methods. However, most of these methods require a parallel corpus for training: the source speaker and the target speaker must utter sentences with identical linguistic content and duration, and keep prosody and emotion as consistent as possible. In practical voice conversion applications, collecting a large parallel corpus is difficult or even impossible, and the accuracy of frame alignment of the speech features during training also limits the performance of the conversion system. Whether from the viewpoint of versatility or practicality, research on voice conversion under non-parallel text conditions therefore has great practical significance and application value.
The voice conversion method based on the variational autoencoder (Variational Autoencoder, VAE) model builds the conversion system directly from the speaker's identity label (one-hot). Such a system does not need to align the speech frames of the source and target speakers during model training, which removes the dependence on parallel text and yields a non-parallel voice conversion model. In the traditional non-parallel VAE-based conversion, the encoder extracts from the input speech parameters a latent variable representing the speaker-independent semantic content, and the decoder then reconstructs the parameters from this latent variable. However, because of the over-regularization effect on the VAE's latent variable, the latent variable lacks the capacity to characterize the speech data and is difficult to extend to more complex speech data, so speech converted from non-parallel corpora with the original VAE suffers from poor quality, heavy noise and other deficiencies. Moreover, the one-hot representation is only a speaker label: although it has an indicative function, it cannot supply richer speaker identity information. Engineers therefore need to study how to combine the i-vector (identity feature vector), which can fully express each speaker's personalized features, with the VAE model.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to provide a many-to-many voice conversion method based on beta-VAE and i-vector. It addresses the defect of the existing VAE network in which the speaker's individual information is represented only by a one-hot label and the speaker's personalized features therefore cannot be fully expressed. By combining the i-vector (identity feature vector) with the VAE model to enrich the speaker's individual information, the method better improves the speaker similarity and quality of the converted speech, effectively raising the voice conversion performance of the VAE network.
Technical solution: the many-to-many voice conversion method based on beta-VAE and i-vector of the present invention comprises a training stage and a conversion stage. The training stage comprises the following steps:
(1.1) obtaining a non-parallel training corpus containing source speakers and target speakers;
(1.2) extracting, through the WORLD speech analysis/synthesis model, the spectral envelope feature X, the aperiodic feature and the logarithmic fundamental frequency log f0 of each speaker's sentences in the training corpus;
(1.3) extracting the identity feature vector I of each speaker;
(1.4) inputting the spectral envelope feature X, the speaker label y and the identity feature vector I into a beta-VAE network composed of an encoder and a decoder for training, obtaining a trained beta-VAE network;
(1.5) constructing a fundamental frequency conversion function from the source speaker's fundamental frequency to the target speaker's fundamental frequency.
The conversion stage comprises the following steps:
(2.1) extracting, through the WORLD speech analysis/synthesis model, the spectral envelope feature X_s, the aperiodic feature and the logarithmic fundamental frequency log f0_s of each sentence of the source speaker's speech;
(2.2) inputting the source speaker's spectral feature X_s, the target speaker's label y_t and the target speaker's identity feature vector I_t into the beta-VAE network trained in step (1.4), and outputting the target speaker's spectral feature X_t;
(2.3) converting, with the fundamental frequency conversion function obtained in step (1.5), the source speaker's logarithmic fundamental frequency extracted in step (2.1) into the target speaker's logarithmic fundamental frequency;
(2.4) inputting the aperiodic feature obtained in step (2.1), the spectral feature X_t obtained in step (2.2) and the target speaker's logarithmic fundamental frequency obtained in step (2.3) into the WORLD speech analysis/synthesis model to obtain the converted target speaker's speech.
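The conversion stage above can be sketched as a skeleton. Every function here is a hypothetical stub standing in for WORLD, the trained beta-VAE network and the i-vector extractor; the arrays are random stand-ins, not real speech features.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- hypothetical stand-ins for the real components -------------------------
def world_analyze(wave):
    """Stub for WORLD analysis: returns (sp, ap, log_f0) per frame."""
    n = 100
    return rng.random((n, 513)), rng.random((n, 513)), np.full(n, np.log(150.0))

def world_synthesize(sp, ap, log_f0):
    """Stub for WORLD synthesis: returns a fake waveform."""
    return np.zeros(16000)

def beta_vae_convert(sp_source, y_target, ivec_target):
    """Stub for the trained beta-VAE: encoder -> z, decoder(z, y_t, I_t) -> sp_t."""
    return sp_source  # identity placeholder

def convert_lf0(log_f0, mu_s, sigma_s, mu_t, sigma_t):
    """Log-domain linear fundamental frequency transform."""
    return (sigma_t / sigma_s) * (log_f0 - mu_s) + mu_t

# --- conversion stage, steps (2.1)-(2.4) ------------------------------------
def convert(wave_source, y_target, ivec_target, f0_stats):
    sp, ap, lf0 = world_analyze(wave_source)             # (2.1)
    sp_t = beta_vae_convert(sp, y_target, ivec_target)   # (2.2)
    lf0_t = convert_lf0(lf0, *f0_stats)                  # (2.3)
    return world_synthesize(sp_t, ap, lf0_t)             # (2.4)

out = convert(np.zeros(16000), y_target=3, ivec_target=np.zeros(100),
              f0_stats=(np.log(150.0), 0.2, np.log(220.0), 0.25))
print(out.shape)   # (16000,)
```

In a real system the stubs would be replaced by pyworld calls and the trained network; only the data flow between the four steps is shown here.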
Further, the input and training steps in step (1.4) are as follows:
(1) inputting X into the encoder of the beta-VAE network, the encoder outputting the semantic feature z;
(2) inputting z, y and I into the decoder of the beta-VAE network, and minimizing the distance D(X, X_t') between X and the spectral envelope feature X_t' generated by the decoder;
(3) repeating the above steps until the number of iterations is reached;
(4) calculating the MCD value of the beta-VAE network, and selecting the model parameters β and C according to the smallest MCD value combined with the subjective evaluation standard Mean Opinion Score.
Further, D(X, X_t') is measured with the KL divergence:
D_KL = (1/2) Σ_{d=1}^{D} [ (μ_d)^2 + (σ_d)^2 − log (σ_d)^2 − 1 ]
where D is the dimension of z, and μ_d and (σ_d)^2 are the d-th components of the mean vector and variance vector of the approximate normal distribution of X.
Further, the input process in step (2.2) is as follows: inputting the source speaker's spectral feature X_s into the encoder of the beta-VAE network, then inputting the encoder's output together with y_t and I_t into the decoder of the beta-VAE network, the conversion yielding the target speaker's spectral feature X_t.
Further, the encoder uses a two-dimensional convolutional neural network comprising 5 convolutional layers and 1 fully connected layer; the filter size of each of the 5 convolutional layers is 7*1, the stride is 3, and the filter depths are 16, 32, 64, 128 and 256 respectively. The decoder uses a two-dimensional convolutional neural network comprising 4 convolutional layers; the filter sizes of the 4 convolutional layers are 9*1, 7*1, 7*1 and 1025*1, the strides are 3, 3, 3 and 1, and the filter depths are 32, 16, 8 and 1 respectively.
Further, the fundamental frequency conversion function is:
log f0_t' = (σ_t / σ_s) (log f0_s − μ_s) + μ_t
where log f0_s is the source speaker's fundamental frequency, log f0_t' is the converted target speaker's fundamental frequency, μ_s and σ_s are the mean and standard deviation of the source speaker's fundamental frequency in the log domain, and μ_t and σ_t are the mean and standard deviation of the target speaker's fundamental frequency in the log domain.
Beneficial effects: the method improves the existing VAE model and applies the i-vector to the improved VAE model. It not only better improves the quality of the converted speech, but also fully expresses each speaker's personalized features and enriches the speaker identity information.
Description of the drawings
Fig. 1 is the overall flow chart of the method.
Specific embodiment
As shown in Fig. 1, this embodiment provides a many-to-many voice conversion method based on beta-VAE and i-vector, divided into two stages: training and conversion.
1. Speaker voice training stage
1.1 Obtain a non-parallel training corpus. The speech database used here is VCC2018, which contains 8 source speakers (SF1, SF2, SM1, SM2, SF3, SF4, SM3, SM4) and 4 target speakers (TF1, TF2, TM1, TM2). The non-parallel training corpus chosen here consists of 4 source speakers (SF3, SF4, SM3, SM4) and 4 target speakers (TF1, TF2, TM1, TM2), where S (source) denotes a source speaker, T (target) a target speaker, F (female) a female speaker and M (male) a male speaker. Since the goal of this work is non-parallel voice conversion, the chosen training corpus is also non-parallel, i.e., the speech content of the source and target speakers differs. For each speaker, 81 sentences are used as training corpus for full training, and 35 sentences are used as test corpus for model evaluation.
1.2 Use the speech analysis/synthesis model WORLD to extract the features of each speaker's sentences, comprising the spectral envelope sp of each frame, the logarithmic fundamental frequency log f0 and the aperiodic feature ap, with speech sampling frequency fs = 16000. A 1024-point fast Fourier transform is performed here, so the obtained spectral envelope feature sp and aperiodic feature ap both have 1024/2 + 1 = 513 dimensions. ap and sp are n*513-dimensional two-dimensional matrices, the speaker label y is the number of each speaker's subset in the training speech set, and the spectral feature of each extracted frame is finally expressed as X = [sp].
1.3 Extract the identity feature vector i-vector of each speaker, denoted I.
The i-vector is a novel low-dimensional fixed-length feature vector proposed on the basis of the Gaussian mixture model–universal background model (Gaussian Mixture Model–Universal Background Model, GMM-UBM) and channel analysis; it is obtained by jointly modeling the speaker and channel variables.
Given a segment of speech, the speaker- and channel-dependent GMM supervector can be written as:
M = m + Tω
where M denotes the speaker's Gaussian mean supervector, m denotes the speaker- and channel-independent Gaussian mean supervector under the universal background model (UBM), T is the low-dimensional total variability space matrix, and ω is the total variability factor, which a priori obeys the standard normal distribution N(0, I); ω is the identity feature vector i-vector.
The total factor ω is a latent variable defined by its posterior distribution, which is also normal and can be extracted from the Baum-Welch statistics computed with the universal background model (UBM). Given the speech sequence {s_1, s_2, s_3, ..., s_L} of speaker s, for each Gaussian component c the Baum-Welch statistics of the mixture weight and the mean vector are defined here as:
N_c(s) = Σ_{t=1}^{L} P(c | s_t),  F_c(s) = Σ_{t=1}^{L} P(c | s_t) s_t
where c = 1, ..., C indexes the Gaussian components and P(c | s_t) is the posterior probability that mixture component c generates the vector s_t. To estimate the i-vector, the first-order Baum-Welch statistics centered on the UBM mixture-component means are also needed:
F̃_c(s) = Σ_{t=1}^{L} P(c | s_t)(s_t − m_c)
where m_c is the mean of UBM mixture component c. The ω factor of a given speaker s can then be obtained by:
ω(s) = (I + Tᵀ Σ⁻¹ N(s) T)⁻¹ Tᵀ Σ⁻¹ F̃(s)
where N(s) is the CF × CF diagonal matrix whose diagonal blocks are N_c I (c = 1, ..., C), F̃(s) is the CF × 1 supervector obtained by concatenating the centered first-order Baum-Welch statistics F̃_c(s) of the given speaker s, and Σ is a CF × CF diagonal covariance matrix.
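The posterior-mean formula for ω can be sketched numerically. This is a toy illustration with invented small dimensions (C components, F-dimensional features, rank R), not the Kaldi-based extractor used in the embodiment; it only exploits the fact that Σ and N(s) are diagonal.

```python
import numpy as np

def estimate_ivector(T, Sigma_diag, N_c, F_tilde):
    """Posterior mean omega(s) = (I + T' Sigma^-1 N(s) T)^-1 T' Sigma^-1 F_tilde(s).

    T          : (C*F, R) total variability matrix
    Sigma_diag : (C*F,)   diagonal of the covariance supermatrix Sigma
    N_c        : (C,)     zeroth-order Baum-Welch statistics per component
    F_tilde    : (C*F,)   centered first-order statistics, concatenated
    """
    CF, R = T.shape
    C = N_c.shape[0]
    F = CF // C
    N_diag = np.repeat(N_c, F)              # diagonal of N(s): blocks N_c * I
    Tt_Sinv = T.T / Sigma_diag              # T' Sigma^-1 (Sigma is diagonal)
    precision = np.eye(R) + Tt_Sinv @ (N_diag[:, None] * T)
    return np.linalg.solve(precision, Tt_Sinv @ F_tilde)

rng = np.random.default_rng(0)
C, F, R = 4, 3, 2                           # toy sizes (real systems: e.g. R = 100)
T = rng.standard_normal((C * F, R))
Sigma_diag = np.ones(C * F)
N_c = np.full(C, 5.0)
F_tilde = rng.standard_normal(C * F)
omega = estimate_ivector(T, Sigma_diag, N_c, F_tilde)
print(omega.shape)                          # (2,)
```

Note that with zero statistics the estimate falls back to the prior mean 0, matching the N(0, I) prior on ω.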
The i-vector contains both speaker information and channel information. The channel information is removed by linear discriminant analysis (Linear Discriminant Analysis, LDA) and within-class covariance normalization (Within-Class Covariance Normalization, WCCN). In practice, the i-vector can be extracted with the Kaldi framework; in this embodiment the i-vector is a 100-dimensional feature parameter.
1.4 For the training of the beta-VAE network, the spectral feature X from 1.2 is fed to the encoder of the VAE model for model training. The speaker-independent semantic feature z output by the encoder, the speaker label y and the identity feature vector I representing the speaker's i-vector form the joint vector (z, y, I), which is fed to the decoder of the VAE model, completing the training of the voice conversion model. During training, the VAE encoder in Fig. 1 uses a two-dimensional convolutional neural network with 5 convolutional layers and 1 fully connected layer; the filter size of each of the 5 convolutional layers is 7*1, the stride is 3, and the filter depths are 16, 32, 64, 128 and 256 respectively. The decoder uses a two-dimensional convolutional neural network with 4 convolutional layers; their filter sizes are 9*1, 7*1, 7*1 and 1025*1, their strides are 3, 3, 3 and 1, and their filter depths are 32, 16, 8 and 1 respectively.
1.5 The original VAE uses the inference model q_φ(z|X) to approximate the true posterior p_θ(z|X), and measures the similarity of the two distributions with the KL divergence, as shown in equation 1-1:
D_KL(q_φ(z|X) ‖ p_θ(z|X))   (1-1)
which denotes the KL divergence between the inference model q_φ(z|X) and the true posterior model p_θ(z|X).
Applying Bayes' formula to equation 1-1 and rearranging yields equation 1-2:
log p_θ(X) = D_KL(q_φ(z|X) ‖ p_θ(z|X)) + Γ(θ, φ; X)   (1-2)
In the VAE framework, the log probability of each frame x^(i) can thus be rewritten as equation 1-3:
log p_θ(x^(i)) = D_KL(q_φ(z|x^(i)) ‖ p_θ(z|x^(i))) + Γ(θ, φ; x^(i))   (1-3)
where q_φ(z|x^(i)) is the variational posterior, p_θ(z|x^(i)) is the true posterior, D_KL(·‖·) is the KL divergence between the two, and Γ(θ, φ; x^(i)) is the variational lower bound of the marginal probability. The bound can further be written as equation 1-4:
Γ(θ, φ; x^(i)) = −D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) + E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)]   (1-4)
Equation 1-4 is the objective function of the original VAE network.
The beta-VAE and i-vector network model described here modifies the original VAE framework: the adjustable parameters β and C are introduced into the basic VAE framework, and the i-vector, denoted I here, which contains richer speaker individual information, is also introduced.
The resulting objective is equation 1-5:
Γ(θ, φ; x^(i), β) = −β | D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) − C | + E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, y, I)]   (1-5)
In equation 1-5, the KL divergence in the first term on the right is the latent-layer loss, and the second term on the right, E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, y, I)], is the generation loss. Changing β changes the degree of pressure applied during model learning, yielding different latent-variable disentangling abilities. When β = 1 the model is the original VAE; when β > 1 a stronger constraint is applied to the latent bottleneck, giving a better ability to disentangle the data. Disentangling means that a single latent variable is sensitive to changes in a single generative factor while being relatively insensitive to changes in the other factors. Disentangled variables usually have good interpretability and generalize easily across tasks, but precisely this improvement in disentangling limits the ability of the VAE bottleneck layer to encode features efficiently, distorting the reconstructed data. Therefore, while setting β greater than 1, the parameter C is also added to increase the coding capacity of the bottleneck layer. That is, while the latent variable z gains disentangling ability, it also gains the ability to characterize the speech data better, so that p_θ(x^(i)|z) approaches p_θ(x^(i)) more closely and the system performance improves.
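A minimal numerical sketch of the objective in equation 1-5, assuming a diagonal-Gaussian q(z|x) with the closed-form KL and a squared-error stand-in for the generation loss. The β = 150, C = 20 values come from the parameter-selection section below; all shapes are illustrative, not the trained system.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian q."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

def beta_vae_loss(x, x_recon, mu, log_var, beta=150.0, C=20.0):
    """Negative of the equation 1-5 bound: beta * |KL - C| plus generation loss.

    The true generation loss is -log p(x | z, y, I); a Gaussian decoder with
    unit variance reduces it (up to constants) to squared error, used here.
    """
    kl = kl_to_standard_normal(mu, log_var)
    recon = 0.5 * np.sum((x - x_recon)**2)
    return beta * np.abs(kl - C) + recon

mu = np.zeros(16)
log_var = np.zeros(16)                   # sigma = 1 -> KL = 0
x = np.ones(513)
loss = beta_vae_loss(x, x, mu, log_var)  # perfect reconstruction, KL = 0
print(loss)                              # beta * |0 - C| = 150 * 20 = 3000.0
```

The |KL − C| term shows why C raises the bottleneck capacity: the KL is pulled toward C rather than toward 0, so the latent code is allowed to carry about C nats of information.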
Meanwhile speaker's label y in original VAE model is only an one-hot label, and one-hot label is only
It is that, although having certain indicative function, more speakers can not be provided for distinguishing different speakers label
Identity information, therefore promoted and be not obvious in the individual character similarity of converting speech.Herein by one speaker characteristic of addition
Vector I, Lai Fengfu speaker's identity characteristic information promotes the individual character similarity of voice after conversion.
The overall objective function Γ(θ, φ; x^(i), β) of the beta-VAE network in equation 1-5 is used to optimize the encoder parameters φ and the decoder parameters θ. The expectation term is usually estimated by sampling, i.e. equation 1-6:
E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)] ≈ (1/L) Σ_{l=1}^{L} log p_θ(x^(i) | z^(i,l))   (1-6)
where L is the number of samples drawn per frame. The reparameterization trick is generally used: a standard normal random variable is generated, and the sample from the distribution of z is obtained through a deterministic, data-driven function:
z^(i,l) = f_μ(x^(i); φ_1) + f_σ(x^(i); φ_2) ∘ ε^(l),  ε^(l) ~ N(0, I)
where ∘ denotes the element-wise product, f_μ and f_σ are nonlinear functions composed of feedforward neural networks, and φ = {φ_1, φ_2} is the parameter set of the encoder; f_μ generates the mean of the latent variable z and f_σ generates its variance. With the reparameterization, equation 1-6 turns the bound into equation 1-7:
Γ(θ, φ; x^(i), β) ≈ −β | D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) − C | + (1/L) Σ_{l=1}^{L} log p_θ(x^(i) | z^(i,l))   (1-7)
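The reparameterization step can be sketched as follows; the encoder outputs are faked with fixed arrays, since the real f_μ and f_σ are the convolutional networks described above.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """z = mu + sigma ∘ eps with eps ~ N(0, I): a differentiable sample from q(z|x)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(42)
mu = np.array([1.0, -2.0, 0.5])     # stand-in for f_mu(x; phi_1)
sigma = np.array([0.1, 0.1, 0.1])   # stand-in for f_sigma(x; phi_2)
samples = np.stack([reparameterize(mu, sigma, rng) for _ in range(20000)])
print(samples.mean(axis=0).round(1))   # close to mu
```

Because the randomness lives entirely in ε, gradients can flow through mu and sigma to the encoder, which is what makes single-sample (L = 1) training below feasible.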
Setting L to 1 simplifies the above formula and yields the final objective function:
Γ(θ, φ; x^(i), β) = −β | D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) − C | + log p_θ(x^(i) | z^(i))   (1-8)
The beta-VAE model assumes that z is distributed as an isotropic standard normal distribution, so the latent-variable loss (the KL divergence) can be rewritten as:
D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) = (1/2) Σ_{d=1}^{D} [ (μ_d)^2 + (σ_d)^2 − log (σ_d)^2 − 1 ]   (1-9)
where D is the dimension of the latent variable z, and μ_d and (σ_d)^2 are the d-th components of the mean vector and variance vector of the approximate normal distribution.
Assume that the visible feature variable X (the log spectrum) obeys a Gaussian distribution with a diagonal variance matrix, i.e.:
p_θ(x^(i)|z) = N(x^(i); f_μ'(z; θ_1), f_σ'(z; θ_2))
where f_μ' and f_σ' are nonlinear functions composed of feedforward neural networks and θ = {θ_1, θ_2} is the parameter set of the decoder. The log-probability term in equation 1-8 can therefore be rewritten as:
log p_θ(x^(i)|z) = −(1/2) Σ_{d=1}^{Dx} [ log 2π(σ'_d)^2 + (x_d^(i) − μ'_d)^2 / (σ'_d)^2 ]   (1-10)
where Dx is the dimension of the feature x^(i).
Substituting equations 1-9 and 1-10 into 1-8 gives the final objective function; training the beta-VAE amounts to iteratively finding the parameters that maximize the variational lower bound:
(θ*, φ*) = arg max_{θ,φ} Σ_i Γ(θ, φ; x^(i), β)   (1-11)
The above formula is generally optimized with stochastic gradient descent; in this experiment the number of iterations is chosen as 20000.
After the model is trained, spectrum conversion only requires specifying the target speaker's y_t and I_t: the encoder turns the input spectral frame into the latent variable z, and the decoder then reconstructs (z, y_t, I_t) into X_t.
1.6 At this point the training of the beta-VAE and i-vector model is complete.
2. Speaker speech conversion stage
2.1 Use the WORLD speech analysis/synthesis model to extract the source speaker's speech feature parameters, including the spectral envelope sp, the logarithmic fundamental frequency log f0 and the aperiodic feature ap; the spectral feature of each extracted frame is finally expressed as X_s = [sp];
2.2 Input the source speaker's spectral feature X_s, the target speaker's label y_t and the target speaker's identity feature vector I_t into the trained conversion model, where the target speaker's label y_t serves as the control condition of the decoding process of the beta-VAE and i-vector spectrum conversion network, obtaining the converted target speaker's speech spectral parameters X_t;
2.3 Convert the source speaker's logarithmic fundamental frequency log f0 extracted in 2.1 into the target speaker's fundamental frequency using the log-domain linear transform.
The log-domain linear transform is simple and at present the most widely used fundamental frequency conversion method. It is based on the assumption that each speaker's fundamental frequency obeys a Gaussian distribution in the log domain. Then, as long as the mean and standard deviation of each speaker's logarithmic fundamental frequency are computed, the fundamental frequency conversion relation between the two speakers can be constructed:
log f0_t' = (σ_t / σ_s) (log f0_s − μ_s) + μ_t
where μ_s and σ_s are the mean and standard deviation of the source speaker's fundamental frequency in the log domain, μ_t and σ_t are the mean and standard deviation of the target speaker's fundamental frequency in the log domain, log f0_s is the source speaker's fundamental frequency, and log f0_t' is the converted target speaker's fundamental frequency.
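The log-domain linear transform above is straightforward to implement. The Gaussian statistics below are invented toy values (a low-pitched source mapped toward a higher-pitched target), not statistics measured from VCC2018.

```python
import numpy as np

def convert_lf0(lf0_source, mu_s, sigma_s, mu_t, sigma_t):
    """log f0_t' = (sigma_t / sigma_s) * (log f0_s - mu_s) + mu_t.

    Unvoiced frames are conventionally marked with lf0 <= 0 and left untouched.
    """
    lf0 = np.asarray(lf0_source, dtype=float)
    voiced = lf0 > 0
    out = lf0.copy()
    out[voiced] = (sigma_t / sigma_s) * (lf0[voiced] - mu_s) + mu_t
    return out

# toy statistics: source around 120 Hz, target around 220 Hz
mu_s, sigma_s = np.log(120.0), 0.15
mu_t, sigma_t = np.log(220.0), 0.20
lf0_src = np.array([np.log(120.0), np.log(130.0), 0.0])   # last frame unvoiced
lf0_conv = convert_lf0(lf0_src, mu_s, sigma_s, mu_t, sigma_t)
print(np.exp(lf0_conv[0]))    # a mean-pitch source frame maps to the target mean, 220 Hz
```

By construction, a frame at the source mean is mapped exactly to the target mean, and deviations from the mean are rescaled by the ratio of the log-domain standard deviations.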
2.4 Finally, synthesize the converted target speaker's speech with the speech synthesis tool WORLD from the target speaker's spectral feature X_t, the aperiodic feature ap and the converted target speaker's fundamental frequency.
3. Parameter selection
3.1 For the choice of the specific values of β and C, the embodiment uses a combination of the objective evaluation standard Mel-cepstral distortion (Mel-Cepstral Distortion, MCD) and the subjective evaluation standard Mean Opinion Score (Mean Opinion Score, MOS). MCD is a computational model of the objective distortion degree using the Mel-frequency cepstrum coefficients (Mel Frequency Cepstrum Coefficient, MFCC) as the characteristic parameters of the speech signal, and the MOS score is a key index measuring speech quality (clarity and naturalness). Here the MCD value between the converted speaker's speech and the target speaker's speech is used as the objective standard to compare the conversion performance of different systems. The MCD value is calculated by the following formula:
MCD[dB] = (10 / ln 10) √( 2 Σ_{d=1}^{N} (c_d − c_d')^2 )
where c_d and c_d' are the d-th Mel-cepstral coefficients of the target speech and the converted speech respectively, and N is the dimension of the Mel cepstrum. The smaller the MCD value, the smaller the distortion between the converted speech and the target speech, i.e., the more similar the converted speaker's personal characteristic parameters are to the target speaker's, and the better the conversion model performs. MCD is a fairly objective quality evaluation method and is widely used in practice.
3.2 Models with different combinations of β and C values are trained, and the average MCD under the SF3-TM1 conversion case is calculated, as shown in Table 1:
Table 1: average MCD of models with different parameter combinations under the SF3-TM1 conversion case
The data in Table 1 show that the MCD value decreases gradually as β increases, though the reduction is small, while with increasing C the MCD value first decreases and then gradually increases. Taking the subjective evaluation standard into account as well, with the model parameters β = 150 and C = 20 the speech quality and speaker similarity are optimal compared with the other combinations, and the listening effect is best.
Claims (6)
1. A many-to-many voice conversion method based on beta-VAE and i-vector, characterized by comprising a training stage and a conversion stage, the training stage comprising the following steps:
(1.1) obtaining a non-parallel training corpus containing speech of the source speakers and target speakers;
(1.2) extracting, for each speaker's sentences in the training corpus, the spectral envelope feature X, the aperiodic feature, and the logarithmic fundamental frequency log f0 via the WORLD speech analysis/synthesis model;
(1.3) extracting the identity feature vector I of each speaker;
(1.4) inputting the spectral envelope feature X, the speaker label y, and the identity feature vector I into a beta-VAE network composed of an encoder and a decoder and training it, obtaining a trained beta-VAE network;
(1.5) constructing a fundamental frequency transfer function from the pitch of the source speaker's speech to the pitch of the target speaker's speech;
the conversion stage comprising the following steps:
(2.1) extracting, for every sentence of the source speaker's speech, the spectral envelope feature Xs, the aperiodic feature, and the logarithmic fundamental frequency log f0s via the WORLD speech analysis/synthesis model;
(2.2) inputting the source speaker's spectral feature Xs, the target speaker's label yt, and the target speaker's identity feature vector It into the beta-VAE network trained in step (1.4), which outputs the target speaker's spectral feature Xt;
(2.3) obtaining the fundamental frequency transfer function from step (1.5), and converting the source speaker's logarithmic fundamental frequency log f0s extracted in step (2.1) into the target speaker's logarithmic fundamental frequency log f0t;
(2.4) inputting the aperiodic feature obtained in step (2.1), the spectral feature Xt obtained in step (2.2), and the target speaker's logarithmic fundamental frequency log f0t obtained in step (2.3) into the WORLD speech analysis/synthesis model, obtaining the converted target speaker's speech.
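The data flow of conversion steps (2.1)-(2.4) can be sketched as follows. This is an illustrative sketch only: `world_analyze`, `beta_vae_convert`, and `world_synthesize` are hypothetical stand-ins for the WORLD vocoder and the trained beta-VAE network (a real implementation would use, e.g., the WORLD vocoder library and the trained model), and the feature dimensions are assumed values:

```python
import numpy as np

def world_analyze(wave, fs):
    """Stand-in for WORLD analysis: returns (spectral envelope, aperiodicity, log-f0).
    Shapes mimic typical WORLD output; values here are random placeholders."""
    n_frames, sp_dim = 100, 1025
    sp = np.abs(np.random.randn(n_frames, sp_dim))      # spectral envelope Xs
    ap = np.random.rand(n_frames, sp_dim)               # aperiodic feature
    log_f0 = np.random.randn(n_frames) * 0.1 + np.log(150.0)
    return sp, ap, log_f0

def beta_vae_convert(sp_src, y_tgt, ivec_tgt):
    """Stand-in for the trained beta-VAE: encoder output plus (yt, It) into the decoder."""
    return sp_src  # identity placeholder for the learned spectral mapping

def convert_f0(log_f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Log-domain linear fundamental frequency transfer function (claim 6)."""
    return mu_t + (sigma_t / sigma_s) * (log_f0_src - mu_s)

def world_synthesize(sp, ap, log_f0, fs):
    """Stand-in for WORLD synthesis; returns a waveform of plausible length."""
    return np.zeros(sp.shape[0] * fs // 200)

# Conversion stage, steps (2.1)-(2.4)
fs = 16000
src_wave = np.random.randn(fs)                                   # 1 s of source speech
sp_s, ap_s, logf0_s = world_analyze(src_wave, fs)                # (2.1)
sp_t = beta_vae_convert(sp_s, y_tgt=3, ivec_tgt=np.zeros(100))   # (2.2)
logf0_t = convert_f0(logf0_s, logf0_s.mean(), logf0_s.std(),     # (2.3)
                     np.log(220.0), 0.1)
out_wave = world_synthesize(sp_t, ap_s, logf0_t, fs)             # (2.4)
```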
2. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the input and training in step (1.4) comprise the following steps:
(1) inputting X into the encoder of the beta-VAE network, the encoder outputting the semantic feature z;
(2) inputting z, y, and I into the decoder of the beta-VAE network, and minimizing the distance D(X, Xt') between X and Xt', where Xt' is the spectral envelope feature generated by the decoder;
(3) repeating the above steps until the set number of iterations is reached;
(4) calculating the MCD value of the beta-VAE network, and selecting the model parameters β and C according to the smallest MCD value combined with the subjective evaluation standard Mean Opinion Score.
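The MCD value used in step (4) is conventionally the mel-cepstral distortion between aligned target and converted mel-cepstra, averaged over frames. A minimal numpy sketch of the standard formula (the cepstral order and the exclusion of the energy coefficient c0 are assumptions, not specified in the claim):

```python
import numpy as np

def mcd(mc_target, mc_converted):
    """Mel-cepstral distortion in dB, averaged over frames.
    mc_*: (n_frames, order) aligned mel-cepstral matrices, c0 assumed excluded."""
    diff = mc_target - mc_converted
    # per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

mc = np.random.randn(50, 24)
print(mcd(mc, mc))  # 0.0 -- identical cepstra give zero distortion
```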
3. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 2, characterized in that D(X, Xt') is measured using the KL divergence, the KL divergence being
D(X, Xt') = (1/2) Σ_{i=1}^{D} ( (μ^(i))^2 + (σ^(i))^2 − log (σ^(i))^2 − 1 )
where D is the dimension of z, and μ^(i) and (σ^(i))^2 are the i-th components of the mean vector and variance vector, respectively, of the approximate normal distribution of X.
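In the standard VAE formulation this expression is the closed-form KL divergence between the encoder's diagonal-Gaussian posterior N(μ, diag(σ²)) and the unit-Gaussian prior; a minimal numpy check (variable names are illustrative):

```python
import numpy as np

def kl_term(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ), summed over the D latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + sigma2 - np.log(sigma2) - 1.0)

# the divergence vanishes exactly when the posterior equals the prior
print(kl_term(np.zeros(16), np.ones(16)))  # 0.0
```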
4. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the input process in step (2.2) is as follows: inputting the source speaker's spectral feature Xs into the encoder of the beta-VAE network, then inputting the encoder's output together with yt and It into the decoder of the beta-VAE network, the conversion yielding the target speaker's spectral feature Xt.
5. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that: the encoder uses a two-dimensional convolutional neural network comprising 5 convolutional layers and 1 fully connected layer; the filter size of the 5 convolutional layers is 7×1, the stride is 3, and the filter depths are 16, 32, 64, 128, and 256, respectively; the decoder uses a two-dimensional convolutional neural network comprising 4 convolutional layers, whose filter sizes are 9×1, 7×1, 7×1, and 1025×1, whose strides are 3, 3, 3, and 1, and whose filter depths are 32, 16, 8, and 1, respectively.
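Assuming unpadded ("valid") convolutions, the per-layer feature lengths of the encoder in claim 5 follow from the usual output-size formula; the 1025-sample input length is an assumption read off the decoder's final 1025×1 filter, not stated in the claim:

```python
def conv_out_len(length, kernel, stride):
    """Output length of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1

# encoder: 5 convolutional layers, 7x1 filters, stride 3 (depths 16,32,64,128,256)
lengths = [1025]
for _ in range(5):
    lengths.append(conv_out_len(lengths[-1], kernel=7, stride=3))
print(lengths[1:])  # [340, 112, 36, 10, 2]
```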
6. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the fundamental frequency transfer function is:
log f0_t' = μt + (σt/σs)(log f0_s − μs)
where log f0_s is the fundamental frequency of the source speaker, log f0_t' is the fundamental frequency of the converted target speaker, the mean and variance of the source speaker's fundamental frequency in the log domain are μs and σs respectively, and the mean and variance of the target speaker's fundamental frequency in the log domain are μt and σt respectively.
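The transfer function of claim 6 is the familiar log-domain Gaussian normalization of pitch; a minimal numpy sketch, treating σs and σt as log-domain standard deviations (an assumption about the claim's notation):

```python
import numpy as np

def transfer_f0(log_f0_src, mu_s, sigma_s, mu_t, sigma_t):
    """Map source log-f0 frames onto the target speaker's log-f0 statistics."""
    return mu_t + (sigma_t / sigma_s) * (log_f0_src - mu_s)

# a source frame exactly at the source mean maps exactly to the target mean
print(transfer_f0(np.array([5.0]), mu_s=5.0, sigma_s=0.2, mu_t=5.4, sigma_t=0.3))  # [5.4]
```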
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910323677.1A CN110085254A (en) | 2019-04-22 | 2019-04-22 | Multi-to-multi phonetics transfer method based on beta-VAE and i-vector |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110085254A (en) | 2019-08-02 |
Family
ID=67416095
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910323677.1A Pending CN110085254A (en) | 2019-04-22 | 2019-04-22 | Multi-to-multi phonetics transfer method based on beta-VAE and i-vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110085254A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100198600A1 (en) * | 2005-12-02 | 2010-08-05 | Tsuyoshi Masuda | Voice Conversion System |
CN104217721A (en) * | 2014-08-14 | 2014-12-17 | 东南大学 | Speech conversion method based on asymmetric speech database conditions of speaker model alignment |
CN108461079A (en) * | 2018-02-02 | 2018-08-28 | 福州大学 | A kind of song synthetic method towards tone color conversion |
WO2018159612A1 (en) * | 2017-02-28 | 2018-09-07 | 国立大学法人電気通信大学 | Voice quality conversion device, voice quality conversion method and program |
CN108777140A (en) * | 2018-04-27 | 2018-11-09 | 南京邮电大学 | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder |
CN109377978A (en) * | 2018-11-12 | 2019-02-22 | 南京邮电大学 | Multi-to-multi voice conversion method under non-parallel text condition based on i vector |
CN109584893A (en) * | 2018-12-26 | 2019-04-05 | 南京邮电大学 | Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition |
Non-Patent Citations (3)
Title |
---|
Irina Higgins et al.: "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework", ICLR * |
Ling Yunzhi: "Research on high-quality voice conversion based on variational autoencoder models and bottleneck features under non-parallel text", China Master's Theses Full-text Database, Information Science and Technology * |
Huang Guojie et al.: "Enhanced variational autoencoder for non-parallel corpus voice conversion", Journal of Signal Processing * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110600047A (en) * | 2019-09-17 | 2019-12-20 | 南京邮电大学 | Perceptual STARGAN-based many-to-many speaker conversion method |
CN110853616A (en) * | 2019-10-22 | 2020-02-28 | 武汉水象电子科技有限公司 | Speech synthesis method, system and storage medium based on neural network |
CN111247585A (en) * | 2019-12-27 | 2020-06-05 | 深圳市优必选科技股份有限公司 | Voice conversion method, device, equipment and storage medium |
WO2021128256A1 (en) * | 2019-12-27 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Voice conversion method, apparatus and device, and storage medium |
CN111247585B (en) * | 2019-12-27 | 2024-03-29 | 深圳市优必选科技股份有限公司 | Voice conversion method, device, equipment and storage medium |
CN113077810A (en) * | 2021-03-19 | 2021-07-06 | 杨予诺 | Sound source separation method based on beta-VAE algorithm |
CN114420142A (en) * | 2022-03-28 | 2022-04-29 | 北京沃丰时代数据科技有限公司 | Voice conversion method, device, equipment and storage medium |
CN115050087A (en) * | 2022-08-16 | 2022-09-13 | 之江实验室 | Method and device for decoupling identity and expression of key points of human face |
CN115050087B (en) * | 2022-08-16 | 2022-11-18 | 之江实验室 | Method and device for decoupling identity and expression of key points of human face |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110085254A (en) | Multi-to-multi phonetics transfer method based on beta-VAE and i-vector | |
CN110600047B (en) | Perceptual STARGAN-based multi-to-multi speaker conversion method | |
CN109671442B (en) | Many-to-many speaker conversion method based on STARGAN and x vectors | |
CN101064104B (en) | Emotion voice creating method based on voice conversion | |
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN110060701B (en) | Many-to-many voice conversion method based on VAWGAN-AC | |
CN108305616A (en) | A kind of audio scene recognition method and device based on long feature extraction in short-term | |
CN110060690A (en) | Multi-to-multi voice conversion method based on STARGAN and ResNet | |
CN109599091B (en) | Star-WAN-GP and x-vector based many-to-many speaker conversion method | |
CN110047501A (en) | Multi-to-multi phonetics transfer method based on beta-VAE | |
CN108461079A (en) | A kind of song synthetic method towards tone color conversion | |
CN105096955B (en) | A kind of speaker's method for quickly identifying and system based on model growth cluster | |
CN103544963A (en) | Voice emotion recognition method based on core semi-supervised discrimination and analysis | |
CN105023570B (en) | A kind of method and system for realizing sound conversion | |
CN102568476B (en) | Voice conversion method based on self-organizing feature map network cluster and radial basis network | |
CN104123933A (en) | Self-adaptive non-parallel training based voice conversion method | |
CN103456302B (en) | A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight | |
CN109584893A (en) | Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition | |
CN106997765B (en) | Quantitative characterization method for human voice timbre | |
CN110246488A (en) | Half optimizes the phonetics transfer method and device of CycleGAN model | |
Rammo et al. | Detecting the speaker language using CNN deep learning algorithm | |
CN110060691A (en) | Multi-to-multi phonetics transfer method based on i vector sum VARSGAN | |
CN104240706A (en) | Speaker recognition method based on GMM Token matching similarity correction scores | |
CN103413548B (en) | A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine | |
CN110189766A (en) | A kind of voice style transfer method neural network based |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190802 |