CN110085254A - Many-to-many voice conversion method based on beta-VAE and i-vector - Google Patents

Many-to-many voice conversion method based on beta-VAE and i-vector

Info

Publication number
CN110085254A
CN110085254A (application CN201910323677.1A)
Authority
CN
China
Prior art keywords
speaker
vae
beta
vector
fundamental frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910323677.1A
Other languages
Chinese (zh)
Inventor
李燕萍
张成飞
许吉良
张燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910323677.1A
Publication of CN110085254A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G10L 2021/0135 Voice conversion or morphing
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks

Abstract

The invention discloses a many-to-many voice conversion method based on beta-VAE and i-vector. The variational autoencoder (VAE) framework is modified by introducing the adjustable parameters β and C, and the i-vector (identity feature vector) is combined with the improved VAE network. This strengthens the disentangling ability of the latent variable, remedies the limited coding capacity of the bottleneck layer, and enriches the personalized features of the speaker, so that the individuality similarity and speech quality of the converted speech are better improved and the voice conversion performance of the existing VAE network is effectively raised.

Description

Many-to-many voice conversion method based on beta-VAE and i-vector
Technical field
The present invention relates to many-to-many voice conversion methods, and more particularly to a many-to-many voice conversion method based on beta-VAE and i-vector.
Background technique
After years of research, voice conversion technology has produced many classical conversion methods, including the Gaussian mixture model (Gaussian Mixed Model, GMM), frequency warping, the deep neural network (Deep Neural Network, DNN), and unit-selection-based methods. However, most of these voice conversion methods require a parallel corpus for training, i.e., the source speaker and the target speaker must utter sentences with the same linguistic content and the same speech duration, and their speaking rhythm and emotion should be as consistent as possible. In the practical application of voice conversion, obtaining a large amount of parallel corpus is very difficult or even impossible, and the accuracy of aligning the speech feature parameters during training also becomes a restriction on the performance of the voice conversion system. Whether considered from the generality or the practicality of voice conversion systems, research on voice conversion methods under non-parallel text conditions therefore has great practical significance and application value.
The voice conversion method based on the variational autoencoder (Variational Autoencoder, VAE) model builds the conversion system directly from the speaker's identity label (one-hot vector). Such a conversion system does not need to align the speech frames of the source and target speakers during model training, so it removes the dependence on parallel text and is a non-parallel voice conversion model. In the traditional VAE-based non-parallel voice conversion, the encoder extracts from the input speech parameters a latent variable representing the speaker-independent semantic content, and the decoder then reconstructs the parameters from this latent variable. However, because of the over-regularization effect on the VAE latent variable, the latent variable has insufficient capacity to characterize the speech data and is difficult to extend to more complex speech data; as a result, the speech converted from non-parallel corpora by the original VAE suffers from poor quality, heavy noise, and other deficiencies. In addition, the one-hot representation is only a speaker label: although it has an indicative function, it cannot provide richer speaker identity information. How to combine the i-vector (identity feature vector), which can fully express the personalized features of each speaker, with the VAE model therefore needs to be studied.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to provide a many-to-many voice conversion method based on beta-VAE and i-vector, which overcomes the defect of the existing VAE network that representing speaker identity information with a one-hot vector cannot fully express the speaker's personalized features. By combining the i-vector (identity feature vector) with the VAE model to enrich the speaker identity information, the method can better improve the individuality similarity and speech quality of the converted speech and effectively raise the voice conversion performance of the VAE network.
Technical solution: the many-to-many voice conversion method based on beta-VAE and i-vector of the present invention includes a training stage and a conversion stage, the training stage comprising the following steps:
(1.1) Obtain a non-parallel training corpus containing source speakers and target speakers;
(1.2) Extract the spectral envelope feature X, the aperiodic feature, and the logarithmic fundamental frequency log f0 of each speaker's sentences in the training corpus with the WORLD speech analysis/synthesis model;
(1.3) Extract the identity feature vector I of each speaker;
(1.4) Input the spectral envelope feature X, the speaker label y, and the identity feature vector I into the beta-VAE network composed of an encoder and a decoder, and train it to obtain a trained beta-VAE network;
(1.5) Construct the fundamental frequency transfer function from the speech fundamental frequency of the source speaker to the speech fundamental frequency of the target speaker.
The conversion stage comprises the following steps:
(2.1) Extract the spectral envelope feature Xs, the aperiodic feature, and the logarithmic fundamental frequency log f0s of every sentence of the source speaker's speech with the WORLD speech analysis/synthesis model;
(2.2) Input the source speaker's spectral feature Xs, the target speaker's label yt, and the target speaker's identity feature vector It into the beta-VAE network trained in step (1.4), and output the target speaker's spectral feature Xt;
(2.3) Using the fundamental frequency transfer function obtained in step (1.5), convert the source speaker's logarithmic fundamental frequency extracted in step (2.1) into the target speaker's logarithmic fundamental frequency;
(2.4) Input the aperiodic feature obtained in step (2.1), the spectral feature Xt obtained in step (2.2), and the target speaker's logarithmic fundamental frequency obtained in step (2.3) into the WORLD speech analysis/synthesis model to obtain the converted target speaker's speech (a high-level sketch of both stages is given after this list).
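The two stages can be summarized as a minimal Python pseudocode sketch, given below. Every helper named here (extract_features, extract_ivector, BetaVAE, log_f0_stats, analyze_with_world, convert_f0, synthesize_with_world) is a placeholder standing in for the components detailed in the embodiment, not a real library call.

```python
# Hypothetical high-level driver for the training and conversion stages.

def training_stage(corpus, n_iters=20000):
    # corpus: {speaker_name: [waveforms]}
    feats = {spk: extract_features(utts) for spk, utts in corpus.items()}    # step (1.2): sp, ap, log f0
    ivecs = {spk: extract_ivector(utts) for spk, utts in corpus.items()}     # step (1.3): identity vector I
    model = BetaVAE(beta=150, capacity=20)                                   # step (1.4): beta-VAE network
    model.fit(feats, speakers=sorted(corpus), ivectors=ivecs, n_iters=n_iters)
    f0_stats = {spk: log_f0_stats(feats[spk]) for spk in corpus}             # step (1.5): per-speaker log-F0 statistics
    return model, f0_stats

def conversion_stage(model, f0_stats, src_wav, src, tgt, ivecs):
    sp_s, ap_s, f0_s = analyze_with_world(src_wav)                           # step (2.1)
    sp_t = model.convert(sp_s, speaker_label=tgt, ivector=ivecs[tgt])        # step (2.2)
    f0_t = convert_f0(f0_s, f0_stats[src], f0_stats[tgt])                    # step (2.3)
    return synthesize_with_world(sp_t, ap_s, f0_t)                           # step (2.4)
```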
Further, the input and training in step (1.4) comprise the following steps:
(1) Input X into the encoder of the beta-VAE network, and the encoder outputs the semantic feature z;
(2) Input z, y, and I into the decoder of the beta-VAE network, and minimize the distance D(X, Xt') between X and Xt', where Xt' is the spectral envelope feature generated by the decoder;
(3) Repeat the above steps until the number of iterations is reached;
(4) Compute the MCD value of the beta-VAE network, and select the model parameters β and C according to the smallest MCD value combined with the subjective evaluation standard Mean Opinion Score.
Further, D(X, Xt') is measured using the KL divergence, which is
$D_{KL}=\frac{1}{2}\sum_{i=1}^{D}\left((\mu^{(i)})^{2}+(\sigma^{(i)})^{2}-\log(\sigma^{(i)})^{2}-1\right)$
where D is the dimension of z, and μ(i) and (σ(i))² are the i-th components of the mean vector and the variance vector of the corresponding normal distribution of X.
Further, the input process in step (2.2) is as follows: input the source speaker's spectral feature Xs into the encoder of the beta-VAE network, then input the output of the encoder together with yt and It into the decoder of the beta-VAE network, and the conversion yields the target speaker's spectral feature Xt.
Further, the encoder uses a two-dimensional convolutional neural network comprising 5 convolutional layers and 1 fully connected layer; the filter size of the 5 convolutional layers is 7×1, the stride is 3, and the filter depths are 16, 32, 64, 128 and 256 respectively. The decoder uses a two-dimensional convolutional neural network comprising 4 convolutional layers; the filter sizes of the 4 convolutional layers are 9×1, 7×1, 7×1 and 1025×1 respectively, the strides are 3, 3, 3 and 1 respectively, and the filter depths are 32, 16, 8 and 1 respectively.
Further, the fundamental frequency transfer function is:
$\log f_{0t}' = \frac{\sigma_t}{\sigma_s}\left(\log f_{0s} - \mu_s\right) + \mu_t$
where log f0s is the fundamental frequency of the source speaker, log f0t' is the fundamental frequency of the converted target speaker, μs and σs are the mean and standard deviation of the source speaker's fundamental frequency in the logarithmic domain, and μt and σt are the mean and standard deviation of the target speaker's fundamental frequency in the logarithmic domain.
Beneficial effects: the method improves the existing VAE model and applies the i-vector to the improved VAE model, so that not only is the quality of the converted speech better improved, but the personalized features of each speaker are also fully expressed and the speaker identity information is enriched.
Detailed description of the invention
Fig. 1 is the overall flow chart of the method.
Specific embodiment
As shown in Fig. 1, this embodiment provides a many-to-many voice conversion method based on beta-VAE and i-vector, which is divided into two parts: training and conversion.
1. Speaker voice training stage
1.1 Obtain a non-parallel training corpus. The speech database used here is VCC2018, which contains 8 source speakers (SF1, SF2, SM1, SM2, SF3, SF4, SM3, SM4) and 4 target speakers (TF1, TF2, TM1, TM2). The non-parallel training corpus chosen here consists of 4 source speakers (SF3, SF4, SM3, SM4) and the 4 target speakers (TF1, TF2, TM1, TM2), where S (source) denotes a source speaker, T (target) denotes a target speaker, F (female) denotes female, and M (male) denotes male. Since the goal of this work is non-parallel speech conversion, the selected training corpus is also non-parallel, i.e., the speech content of the source and target speakers is different. For each speaker, 81 sentences are used as training corpus for sufficient training, and 35 sentences are used as test corpus for model evaluation.
1.2 The features of each speaker's sentences extracted with the speech analysis/synthesis model WORLD include the spectral envelope sp of each frame, the logarithmic fundamental frequency log f0 of the speech, and the aperiodic feature ap, where the speech sampling frequency is fs = 16000 Hz. A 1024-point fast Fourier transform is performed here, so the resulting spectral envelope feature sp and aperiodic feature ap both have dimension 1024/2 + 1 = 513. ap and sp are n×513 two-dimensional matrices, the speaker label y is the index of each speaker's subset in the training speech set, and the spectral feature of each extracted frame is finally expressed as X = [sp].
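As an illustration of this feature extraction step, the following is a minimal sketch using the pyworld Python binding of the WORLD vocoder together with soundfile for I/O (both are assumptions; the patent itself only names the WORLD analysis/synthesis model).

```python
import numpy as np
import pyworld
import soundfile as sf

def extract_world_features(wav_path, fs_expected=16000, fft_size=1024):
    """Extract WORLD features for one utterance: spectral envelope sp (n x 513),
    aperiodicity ap (n x 513) and the voiced-frame log fundamental frequency."""
    x, fs = sf.read(wav_path)
    assert fs == fs_expected, "the embodiment uses 16 kHz speech"
    x = np.ascontiguousarray(x, dtype=np.float64)

    f0, t = pyworld.harvest(x, fs)                            # frame-wise F0 contour
    sp = pyworld.cheaptrick(x, f0, t, fs, fft_size=fft_size)  # spectral envelope, 1024/2 + 1 = 513 bins
    ap = pyworld.d4c(x, f0, t, fs, fft_size=fft_size)         # aperiodicity, 513 bins
    log_f0 = np.log(f0[f0 > 0])                               # log F0 over voiced frames only

    X = sp                                                    # per-frame spectral feature X = [sp]
    return X, ap, log_f0
```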
1.3 Extract the identity feature vector i-vector of each speaker, denoted as the identity feature vector I.
The i-vector is a novel low-dimensional fixed-length feature vector proposed on the basis of the Gaussian mixture model-universal background model (Gaussian Mixture Model-Universal Background Model, GMM-UBM) and channel analysis; it is obtained by jointly modeling the speaker and channel variables.
Given a segment of speech, the speaker- and channel-dependent GMM supervector can be written as:
M = m + Tω
where M denotes the Gaussian mean supervector of the speaker, m denotes the speaker- and channel-independent Gaussian mean supervector under the universal background model (UBM), T is the low-dimensional total variability space matrix, and ω is the total variability factor, which a priori obeys the standard normal distribution N(0, I); ω is the identity feature vector i-vector.
The total variability factor ω is a latent variable defined by its posterior distribution, which is also normal and can be estimated from Baum-Welch statistics computed with the universal background model (UBM). Given speaker s and its L-frame speech sequence {s1, s2, s3, …, sL}, for each Gaussian component c the Baum-Welch statistics corresponding to the mixture weight and the mean vector are defined as:
$N_c(s)=\sum_{t=1}^{L}P(c\,|\,s_t),\qquad F_c(s)=\sum_{t=1}^{L}P(c\,|\,s_t)\,s_t$
where c = 1, …, C is the Gaussian index and P(c | st) is the posterior probability that mixture component c generated the vector st. To estimate the i-vector, the first-order Baum-Welch statistics centered on the UBM mixture component means must also be computed:
$\tilde{F}_c(s)=\sum_{t=1}^{L}P(c\,|\,s_t)\,(s_t-m_c)$
where mc is the mean of UBM mixture component c. The ω factor of the given speaker s can then be obtained by:
$\omega(s)=\left(\mathbf{I}+T^{\top}\Sigma^{-1}N(s)\,T\right)^{-1}T^{\top}\Sigma^{-1}\tilde{F}(s)$
where N(s) is defined as the CF×CF diagonal matrix whose diagonal blocks are NcI (c = 1, …, C), F̃(s) is the CF×1 supervector obtained by concatenating the centered first-order Baum-Welch statistics F̃c of the given speaker s, and Σ is a CF×CF diagonal covariance matrix.
The i-vector contains both speaker information and channel information; the channel information is removed by linear discriminant analysis (Linear Discriminant Analysis, LDA) and within-class covariance normalization (Within-Class Covariance Normalization, WCCN). In the concrete implementation, the i-vector is extracted with the Kaldi framework; in this embodiment the i-vector is a 100-dimensional feature parameter.
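The embodiment extracts i-vectors with Kaldi; the numpy sketch below only illustrates the Baum-Welch statistics and the posterior mean of ω described above, under the usual assumptions (diagonal UBM covariances, precomputed responsibilities, and a given total-variability matrix T). All variable names are illustrative.

```python
import numpy as np

def ivector_posterior_mean(resp, frames, ubm_means, ubm_vars, T):
    """Posterior mean of the total-variability factor omega (the raw i-vector).

    resp      : (L, C) responsibilities P(c | s_t) of the UBM components
    frames    : (L, F) acoustic frames s_1 ... s_L of speaker s
    ubm_means : (C, F) UBM component means m_c
    ubm_vars  : (C, F) diagonal UBM covariances
    T         : (C*F, D) total-variability matrix
    """
    C, F = ubm_means.shape
    D = T.shape[1]

    # Zeroth-order Baum-Welch statistics: N_c = sum_t P(c | s_t)
    N = resp.sum(axis=0)                                   # (C,)
    # Centered first-order statistics: F_c = sum_t P(c | s_t) (s_t - m_c)
    Fc = resp.T @ frames - N[:, None] * ubm_means          # (C, F)

    # omega = (I + T' Sigma^-1 N(s) T)^-1 T' Sigma^-1 F_tilde(s)
    T_blocks = T.reshape(C, F, D)
    precision = np.eye(D)
    projection = np.zeros(D)
    for c in range(C):
        TS = T_blocks[c] / ubm_vars[c][:, None]            # Sigma_c^-1 T_c, shape (F, D)
        precision += N[c] * (T_blocks[c].T @ TS)
        projection += TS.T @ Fc[c]
    return np.linalg.solve(precision, projection)          # (D,) i-vector
```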
1.4 For the training of the beta-VAE network, the spectral feature X from 1.2 is input to the encoder of the VAE model for model training; the speaker-independent semantic feature z output by the encoder, the speaker label y, and the identity feature vector I representing the speaker identity feature vector i-vector are composed into the joint vector (z, y, I) and input to the decoder of the VAE model, completing the training of the voice conversion model. In the VAE network of Fig. 1, the encoder uses a two-dimensional convolutional neural network with 5 convolutional layers and 1 fully connected layer; the filter size of the 5 convolutional layers is 7×1, the stride is 3, and the filter depths are 16, 32, 64, 128 and 256. The decoder uses a two-dimensional convolutional neural network with 4 convolutional layers; their filter sizes are 9×1, 7×1, 7×1 and 1025×1, the strides are 3, 3, 3 and 1, and the filter depths are 32, 16, 8 and 1.
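A sketch of this encoder/decoder in PyTorch is given below. It is an assumption: the patent does not name a framework, and padding values, activations, the latent dimension, and the initial decoder reshape are not specified in the text, so they are chosen here only to make the stated kernel sizes, strides and depths fit a 513-dimensional spectral frame.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Five 7x1 conv layers with stride 3 plus one fully connected layer producing
    the mean and log-variance of z (padding, activation and z_dim are assumed)."""
    def __init__(self, z_dim=64):
        super().__init__()
        chans = [1, 16, 32, 64, 128, 256]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], (7, 1), stride=(3, 1), padding=(3, 0)),
                nn.LeakyReLU(0.1))
            for i in range(5)])
        self.fc = nn.Linear(256 * 3, 2 * z_dim)     # frequency bins: 513 -> 171 -> 57 -> 19 -> 7 -> 3

    def forward(self, x):                           # x: (batch, 1, 513, 1), one spectral frame
        h = self.convs(x).flatten(1)
        mu, logvar = self.fc(h).chunk(2, dim=1)
        return mu, logvar


class Decoder(nn.Module):
    """Four transposed-conv layers (9x1, 7x1, 7x1, 1025x1; strides 3, 3, 3, 1) mapping
    (z, speaker label y, i-vector I) back to a 513-dim spectral frame.  The initial
    reshape to 19 frequency bins and all padding values are assumptions."""
    def __init__(self, z_dim=64, n_speakers=8, ivec_dim=100):
        super().__init__()
        self.fc = nn.Linear(z_dim + n_speakers + ivec_dim, 64 * 19)
        self.deconvs = nn.Sequential(
            nn.ConvTranspose2d(64, 32, (9, 1), stride=(3, 1), padding=(3, 0)), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(32, 16, (7, 1), stride=(3, 1), padding=(2, 0)), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(16, 8, (7, 1), stride=(3, 1), padding=(2, 0)), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(8, 1, (1025, 1), stride=(1, 1), padding=(512, 0)))

    def forward(self, z, y, ivec):
        h = self.fc(torch.cat([z, y, ivec], dim=1)).view(-1, 64, 19, 1)
        return self.deconvs(h)                      # (batch, 1, 513, 1), frequency bins 19 -> 57 -> 171 -> 513
```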
1.5 The original VAE model uses the recognition model $q_{\phi}(z|X)$ to approximate the true posterior probability $p_{\theta}(z|X)$, and the similarity of the two distributions is measured by the KL divergence, as shown in equation (1-1):
$D_{KL}\left(q_{\phi}(z|X)\,\|\,p_{\theta}(z|X)\right)$
where $D_{KL}(q_{\phi}(z|X)\,\|\,p_{\theta}(z|X))$ denotes the KL divergence between the recognition model $q_{\phi}(z|X)$ and the true posterior model $p_{\theta}(z|X)$.
Applying the Bayesian formula to equation (1-1) and rearranging gives equation (1-2). In the VAE framework, the log probability of each frame can then be rewritten as equation (1-3):
$\log p_{\theta}(x^{(i)})=D_{KL}\left(q_{\phi}(z|x^{(i)})\,\|\,p_{\theta}(z|x^{(i)})\right)+\Gamma(\theta,\phi;x^{(i)})$
where $q_{\phi}(z|x^{(i)})$ is the variational posterior, $p_{\theta}(z|x^{(i)})$ is the true posterior, $D_{KL}(\cdot\|\cdot)$ computes the KL divergence between the two, and $\Gamma(\theta,\phi;x^{(i)})$ is the variational lower bound of the marginal probability.
This can be further written as equation (1-4):
$\Gamma(\theta,\phi;x^{(i)})=-D_{KL}\left(q_{\phi}(z|x^{(i)})\,\|\,p_{\theta}(z)\right)+\mathbb{E}_{q_{\phi}(z|x^{(i)})}\left[\log p_{\theta}(x^{(i)}|z)\right]$
Equation (1-4) above is the objective function of the original VAE network.
The beta-VAE and i-vector network model described here is modified on the basis of the original VAE framework: the adjustable parameters β and C are introduced into the basic VAE framework, and the i-vector, which contains richer speaker identity information and is denoted here by I, is also introduced, giving the objective function of equation (1-5).
In equation (1-5), the first term on the right, the β-weighted KL divergence $D_{KL}(q_{\phi}(z|x^{(i)})\,\|\,p_{\theta}(z))$, is the latent (hidden-layer) loss, and the second term on the right, the expected log-likelihood $\mathbb{E}_{q_{\phi}(z|x^{(i)})}[\log p_{\theta}(x^{(i)}|z,y,I)]$, is the generation loss. Changing β changes the degree of pressure applied during model learning and thereby yields different latent-variable disentangling abilities. When β = 1 the model is the original VAE; when β > 1 a stronger constraint is imposed on the latent bottleneck, giving a better ability to disentangle the data. Disentanglement means that a single latent variable is sensitive to changes in a single generative factor while being relatively insensitive to changes in the other factors. A benefit that usually comes with disentangled variables is good interpretability and easy generalization to various tasks, but it is precisely this improvement in disentanglement that limits the ability of the bottleneck layer of the VAE model to encode features efficiently, so that the reconstructed data become distorted.
Therefore, while the value of β is set greater than 1 here, the parameter C is also increased to improve the coding capacity of the bottleneck layer. That is, while the latent variable z gains disentangling ability, it also gains a better ability to characterize the speech data, so that $p_{\theta}(x^{(i)}|z)$ is closer to $p_{\theta}(x^{(i)})$ and the system performance is improved.
Meanwhile speaker's label y in original VAE model is only an one-hot label, and one-hot label is only It is that, although having certain indicative function, more speakers can not be provided for distinguishing different speakers label Identity information, therefore promoted and be not obvious in the individual character similarity of converting speech.Herein by one speaker characteristic of addition Vector I, Lai Fengfu speaker's identity characteristic information promotes the individual character similarity of voice after conversion.
The general objective function $\Gamma(\theta,\phi;x^{(i)},\beta)$ of the beta-VAE network in equation (1-5) is used to optimize the encoder parameters φ and the decoder parameters θ. The expectation term in the above formula is usually estimated by sampling, that is, equation (1-6):
$\mathbb{E}_{q_{\phi}(z|x^{(i)})}\left[\log p_{\theta}(x^{(i)}|z)\right]\approx\frac{1}{L}\sum_{l=1}^{L}\log p_{\theta}\left(x^{(i)}|z^{(i,l)}\right)$
where L is the number of samples drawn per frame. The reparameterization trick is generally used: a standard normal random variable is generated in order to sample from the distribution of z, applied to a deterministic, data-driven function:
$z^{(i,l)}=\mu^{(i)}+\sigma^{(i)}\circ\varepsilon^{(l)},\qquad \varepsilon^{(l)}\sim N(0,\mathbf{I})$
where ∘ denotes the element-wise product, $\mu^{(i)}=f_{\phi_1}(x^{(i)})$ and $\sigma^{(i)}=f_{\phi_2}(x^{(i)})$ are nonlinear functions composed of feedforward neural networks, and $\phi=\{\phi_1,\phi_2\}$ is the parameter set of the encoder; $f_{\phi_1}$ generates the mean of the latent variable z and $f_{\phi_2}$ generates the variance of the latent variable. Equation (1-6) can then be rewritten by reparameterization as equation (1-7). Setting L to 1 simplifies the formula and yields the final objective function, equation (1-8).
The beta-VAE model assumes that z is distributed as an isotropic standard normal distribution, so the latent loss (the KL divergence) can be rewritten as equation (1-9):
$D_{KL}\left(q_{\phi}(z|x^{(i)})\,\|\,p_{\theta}(z)\right)=\frac{1}{2}\sum_{i=1}^{D}\left((\mu^{(i)})^{2}+(\sigma^{(i)})^{2}-\log(\sigma^{(i)})^{2}-1\right)$
where D is the dimension of the latent variable z, and μ(i) and (σ(i))² represent the i-th components of the mean vector and the variance vector of the corresponding normal distribution.
It is assumed that the visible variable X (the logarithmic spectrum) obeys a Gaussian distribution with a diagonal variance matrix, that is, equation (1-10):
$p_{\theta}\left(x^{(i)}|z\right)=N\left(x^{(i)};\,f_{\theta_1}(z),\,f_{\theta_2}(z)\,\mathbf{I}\right)$
where $f_{\theta_1}$ and $f_{\theta_2}$ are nonlinear functions composed of feedforward neural networks and $\theta=\{\theta_1,\theta_2\}$ is the parameter set of the decoder. The log-probability term in equation (1-8) can therefore be rewritten as the log-likelihood of this diagonal Gaussian,
$\log p_{\theta}\left(x^{(i)}|z\right)=-\frac{1}{2}\sum_{d=1}^{D}\left(\log 2\pi\sigma_{\theta,d}^{2}+\frac{(x_{d}^{(i)}-\mu_{\theta,d})^{2}}{\sigma_{\theta,d}^{2}}\right)$
where μθ,d and σθ,d² are the components of $f_{\theta_1}(z)$ and $f_{\theta_2}(z)$ and D is the corresponding dimension.
The final objective function, equation (1-11), is obtained by substituting equations (1-9) and (1-10) into equation (1-8); the process of training the beta-VAE is then equivalent to iteratively finding the parameters that maximize the variational lower bound. The objective is generally optimized with stochastic gradient descent; in this experiment the number of iterations is set to 20000.
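To make the training objective concrete, the sketch below (PyTorch, an assumption) shows the reparameterization step and a per-batch loss in which the closed-form KL term follows equation (1-9) and the Gaussian reconstruction term is reduced to a squared error. How β and C are combined is written here as β·|KL − C|, following the capacity-constrained β-VAE formulation; the patent text only states that β weights the KL term and that C raises the bottleneck capacity, so this exact combination is an assumption.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def beta_vae_loss(x, x_hat, mu, logvar, beta=150.0, capacity=20.0):
    """Reconstruction term plus beta-weighted, capacity-limited KL term (assumed form)."""
    # Gaussian reconstruction with fixed unit variance reduces (up to constants) to a squared error.
    recon = F.mse_loss(x_hat, x, reduction="sum") / x.size(0)
    # Closed-form KL(q(z|x) || N(0, I)) = 0.5 * sum_i (mu_i^2 + sigma_i^2 - log sigma_i^2 - 1).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0) / x.size(0)
    # Assumed combination of beta and the capacity C.
    return recon + beta * torch.abs(kl - capacity)
```

In use, this loss would simply be minimized with a stochastic optimizer (loss.backward(); optimizer.step()) over the stated 20000 iterations.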
After the model is trained, spectrum conversion only requires specifying the target speaker's yt and It: the encoder transforms the input spectral frames into the latent variable z, and the decoder reconstructs (z, yt, It) into Xt.
1.6 At this point the training of the beta-VAE and i-vector model is complete.
2. Speaker speech synthesis stage
2.1 Extract the source speaker's speech feature parameters with the WORLD speech analysis/synthesis model, including the spectral envelope sp, the logarithmic fundamental frequency log f0 of the speech, and the aperiodic feature ap; the spectral feature of each extracted frame is finally expressed as Xs = [sp].
2.2 Input the source speaker's spectral feature Xs, the target speaker's label yt, and the target speaker's identity feature vector It into the trained conversion model, where the target speaker label yt serves as the control condition of the decoding process of the beta-VAE and i-vector spectrum conversion network; the converted target speaker's speech spectral parameters Xt are thereby obtained.
2.3 Convert the source speaker's logarithmic fundamental frequency log f0 extracted in 2.1 into the target speaker's fundamental frequency using a log-domain linear transformation.
The log-domain linear transformation is a simple and currently the most widely used fundamental frequency conversion method. It is based on the assumption that the fundamental frequency of each speaker obeys a Gaussian distribution in the logarithmic domain. Therefore, as long as the mean and variance of the logarithmic fundamental frequency of each speaker are computed, the fundamental frequency conversion relation between the two speakers can be constructed:
$\log f_{0t}' = \frac{\sigma_t}{\sigma_s}\left(\log f_{0s} - \mu_s\right) + \mu_t$
where μs and σs are the mean and standard deviation of the source speaker's fundamental frequency in the logarithmic domain, μt and σt are the mean and standard deviation of the target speaker's fundamental frequency in the logarithmic domain, log f0s is the fundamental frequency of the source speaker, and log f0t' is the fundamental frequency of the converted target speaker.
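A minimal numpy sketch of this log-domain linear transformation is given below, where mu and sigma denote the mean and standard deviation of the voiced log-F0 values (the statistics collected in step 1.5); leaving unvoiced frames (F0 = 0) untouched is an assumption.

```python
import numpy as np

def log_f0_statistics(log_f0_voiced):
    """Mean and standard deviation of a speaker's voiced log-F0 values."""
    return np.mean(log_f0_voiced), np.std(log_f0_voiced)

def linear_logf0_transform(f0_src, src_stats, tgt_stats):
    """Map the source F0 contour onto the target speaker's log-F0 statistics."""
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0                                   # unvoiced frames stay at 0
    f0_conv[voiced] = np.exp((np.log(f0_src[voiced]) - mu_s) * sigma_t / sigma_s + mu_t)
    return f0_conv
```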
2.4 Finally, pass the target speaker's spectral feature Xt, the aperiodic feature ap, and the converted target speaker's fundamental frequency through the speech synthesis tool WORLD to synthesize the converted target speaker's speech.
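Correspondingly, the final synthesis step can be sketched with pyworld (again an assumption for the WORLD implementation):

```python
import numpy as np
import pyworld
import soundfile as sf

def synthesize_converted(sp_t, ap, f0_t, fs=16000, frame_period=5.0, out_path="converted.wav"):
    """Resynthesize the converted utterance from the converted spectral envelope sp_t,
    the source aperiodicity ap and the converted fundamental frequency contour f0_t."""
    wav = pyworld.synthesize(np.ascontiguousarray(f0_t, dtype=np.float64),
                             np.ascontiguousarray(sp_t, dtype=np.float64),
                             np.ascontiguousarray(ap, dtype=np.float64),
                             fs, frame_period)
    sf.write(out_path, wav, fs)
    return wav
```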
3. Parameter selection
3.1 For the selection of the specific values of the parameters β and C, the embodiment uses a combination of the objective evaluation standard Mel-cepstral distortion (Mel-Cepstral Distortion, MCD) and the subjective evaluation standard Mean Opinion Score (Mean Opinion Score, MOS). MCD is an objective distortion measure computed with the Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC) as the characteristic parameters of the speech signal, and the MOS score is a key index for measuring speech quality (clarity and naturalness). The MCD value between the converted speaker's speech and the target speaker's speech is used here as the objective evaluation standard to compare the conversion performance of different systems. The MCD value is calculated by the following formula:
$MCD[\mathrm{dB}]=\frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{N}\left(c_{d}-c_{d}'\right)^{2}}$
where cd and cd' are the d-th Mel-cepstral coefficients of the target speech and the converted speech respectively, and N is the dimension of the Mel cepstrum. The smaller the MCD value, the smaller the distortion between the converted speech and the target speech, i.e., the more similar the converted speaker's personalized feature parameters are to the target speaker's, and the better the performance of the conversion model. MCD is a relatively objective quality evaluation method and is widely used in practice.
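A minimal sketch of this objective measure, assuming the converted and target mel-cepstra have already been time-aligned (e.g. by DTW) and the energy coefficient removed:

```python
import numpy as np

def mel_cepstral_distortion(mc_target, mc_converted):
    """Frame-averaged MCD in dB between aligned (n_frames, N) mel-cepstrum matrices."""
    diff = mc_target - mc_converted
    frame_mcd = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(frame_mcd))
```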
3.2 The combined models with different β and C values are trained, and the average MCD under the SF3-TM1 conversion case is calculated, as shown in Table 1:
Table 1. Average MCD of different parameter combinations under the SF3-TM1 conversion case
The data in Table 1 show that the MCD value decreases gradually as β increases, although the reduction is small, and that as C increases the MCD value first decreases and then gradually rises. Taking the subjective evaluation standard into account, this experiment finds that with the model parameters β = 150 and C = 20 the speech quality and individuality similarity are the best among the tested combinations, and the auditory effect is the best.

Claims (6)

1. A many-to-many voice conversion method based on beta-VAE and i-vector, characterized by comprising a training stage and a conversion stage, the training stage comprising the following steps:
(1.1) obtaining a non-parallel training corpus containing source speakers and target speakers;
(1.2) extracting the spectral envelope feature X, the aperiodic feature, and the logarithmic fundamental frequency log f0 of each speaker's sentences in the training corpus with the WORLD speech analysis/synthesis model;
(1.3) extracting the identity feature vector I of each speaker;
(1.4) inputting the spectral envelope feature X, the speaker label y, and the identity feature vector I into a beta-VAE network composed of an encoder and a decoder, and training it to obtain a trained beta-VAE network;
(1.5) constructing the fundamental frequency transfer function from the speech fundamental frequency of the source speaker to the speech fundamental frequency of the target speaker;
the conversion stage comprising the following steps:
(2.1) extracting the spectral envelope feature Xs, the aperiodic feature, and the logarithmic fundamental frequency log f0s of every sentence of the source speaker's speech with the WORLD speech analysis/synthesis model;
(2.2) inputting the source speaker's spectral feature Xs, the target speaker's label yt, and the target speaker's identity feature vector It into the beta-VAE network trained in step (1.4), and outputting the target speaker's spectral feature Xt;
(2.3) converting the source speaker's logarithmic fundamental frequency extracted in step (2.1) into the target speaker's logarithmic fundamental frequency using the fundamental frequency transfer function obtained in step (1.5);
(2.4) inputting the aperiodic feature obtained in step (2.1), the spectral feature Xt obtained in step (2.2), and the target speaker's logarithmic fundamental frequency obtained in step (2.3) into the WORLD speech analysis/synthesis model to obtain the converted target speaker's speech.
2. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the input and training in step (1.4) comprise the following steps:
(1) inputting X into the encoder of the beta-VAE network, the encoder outputting the semantic feature z;
(2) inputting z, y, and I into the decoder of the beta-VAE network, and minimizing the distance D(X, Xt') between X and Xt', where Xt' is the spectral envelope feature generated by the decoder;
(3) repeating the above steps until the number of iterations is reached;
(4) computing the MCD value of the beta-VAE network, and selecting the model parameters β and C according to the smallest MCD value combined with the subjective evaluation standard Mean Opinion Score.
3. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 2, characterized in that D(X, Xt') is measured using the KL divergence, the KL divergence being $\frac{1}{2}\sum_{i=1}^{D}\left((\mu^{(i)})^{2}+(\sigma^{(i)})^{2}-\log(\sigma^{(i)})^{2}-1\right)$, where D is the dimension of z, and μ(i) and (σ(i))² are the i-th components of the mean vector and the variance vector of the corresponding normal distribution of X.
4. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the input process in step (2.2) is as follows: inputting the source speaker's spectral feature Xs into the encoder of the beta-VAE network, inputting the output of the encoder together with yt and It into the decoder of the beta-VAE network, and converting to obtain the target speaker's spectral feature Xt.
5. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the encoder uses a two-dimensional convolutional neural network comprising 5 convolutional layers and 1 fully connected layer, the filter size of the 5 convolutional layers being 7×1, the stride being 3, and the filter depths being 16, 32, 64, 128 and 256 respectively; and the decoder uses a two-dimensional convolutional neural network comprising 4 convolutional layers, the filter sizes of the 4 convolutional layers being 9×1, 7×1, 7×1 and 1025×1 respectively, the strides being 3, 3, 3 and 1 respectively, and the filter depths being 32, 16, 8 and 1 respectively.
6. The many-to-many voice conversion method based on beta-VAE and i-vector according to claim 1, characterized in that the fundamental frequency transfer function is:
$\log f_{0t}' = \frac{\sigma_t}{\sigma_s}\left(\log f_{0s} - \mu_s\right) + \mu_t$
where log f0s is the fundamental frequency of the source speaker, log f0t' is the fundamental frequency of the converted target speaker, μs and σs are the mean and standard deviation of the source speaker's fundamental frequency in the logarithmic domain, and μt and σt are the mean and standard deviation of the target speaker's fundamental frequency in the logarithmic domain.
CN201910323677.1A 2019-04-22 2019-04-22 Many-to-many voice conversion method based on beta-VAE and i-vector Pending CN110085254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910323677.1A CN110085254A (en) 2019-04-22 2019-04-22 Many-to-many voice conversion method based on beta-VAE and i-vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910323677.1A CN110085254A (en) 2019-04-22 2019-04-22 Many-to-many voice conversion method based on beta-VAE and i-vector

Publications (1)

Publication Number Publication Date
CN110085254A true CN110085254A (en) 2019-08-02

Family

ID=67416095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910323677.1A Pending CN110085254A (en) 2019-04-22 2019-04-22 Many-to-many voice conversion method based on beta-VAE and i-vector

Country Status (1)

Country Link
CN (1) CN110085254A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN111247585A (en) * 2019-12-27 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, device, equipment and storage medium
CN113077810A (en) * 2021-03-19 2021-07-06 杨予诺 Sound source separation method based on beta-VAE algorithm
CN114420142A (en) * 2022-03-28 2022-04-29 北京沃丰时代数据科技有限公司 Voice conversion method, device, equipment and storage medium
CN115050087A (en) * 2022-08-16 2022-09-13 之江实验室 Method and device for decoupling identity and expression of key points of human face

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100198600A1 (en) * 2005-12-02 2010-08-05 Tsuyoshi Masuda Voice Conversion System
CN104217721A (en) * 2014-08-14 2014-12-17 东南大学 Speech conversion method based on asymmetric speech database conditions of speaker model alignment
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
WO2018159612A1 (en) * 2017-02-28 2018-09-07 国立大学法人電気通信大学 Voice quality conversion device, voice quality conversion method and program
CN108777140A (en) * 2018-04-27 2018-11-09 南京邮电大学 Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN109377978A (en) * 2018-11-12 2019-02-22 南京邮电大学 Multi-to-multi voice conversion method under non-parallel text condition based on i vector
CN109584893A (en) * 2018-12-26 2019-04-05 南京邮电大学 Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100198600A1 (en) * 2005-12-02 2010-08-05 Tsuyoshi Masuda Voice Conversion System
CN104217721A (en) * 2014-08-14 2014-12-17 东南大学 Speech conversion method based on asymmetric speech database conditions of speaker model alignment
WO2018159612A1 (en) * 2017-02-28 2018-09-07 国立大学法人電気通信大学 Voice quality conversion device, voice quality conversion method and program
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
CN108777140A (en) * 2018-04-27 2018-11-09 南京邮电大学 Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
CN109377978A (en) * 2018-11-12 2019-02-22 南京邮电大学 Multi-to-multi voice conversion method under non-parallel text condition based on i vector
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN109584893A (en) * 2018-12-26 2019-04-05 南京邮电大学 Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
IRINA HIGGINS ET AL.: "β-VAE: LEARNING BASIC VISUAL CONCEPTS WITH A CONSTRAINED VARIATIONAL FRAMEWORK", ICLR *
凌云志: "Research on high-quality voice conversion based on variational autoencoder models and bottleneck features under non-parallel text conditions", China Master's Theses Full-text Database, Information Science and Technology Series *
黄国捷 et al.: "Enhanced variational autoencoder for non-parallel corpus voice conversion", Journal of Signal Processing *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN111247585A (en) * 2019-12-27 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, device, equipment and storage medium
WO2021128256A1 (en) * 2019-12-27 2021-07-01 深圳市优必选科技股份有限公司 Voice conversion method, apparatus and device, and storage medium
CN111247585B (en) * 2019-12-27 2024-03-29 深圳市优必选科技股份有限公司 Voice conversion method, device, equipment and storage medium
CN113077810A (en) * 2021-03-19 2021-07-06 杨予诺 Sound source separation method based on beta-VAE algorithm
CN114420142A (en) * 2022-03-28 2022-04-29 北京沃丰时代数据科技有限公司 Voice conversion method, device, equipment and storage medium
CN115050087A (en) * 2022-08-16 2022-09-13 之江实验室 Method and device for decoupling identity and expression of key points of human face
CN115050087B (en) * 2022-08-16 2022-11-18 之江实验室 Method and device for decoupling identity and expression of key points of human face

Similar Documents

Publication Publication Date Title
CN110085254A (en) Many-to-many voice conversion method based on beta-VAE and i-vector
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN101064104B (en) Emotion voice creating method based on voice conversion
CN108777140A (en) Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
CN110060701B (en) Many-to-many voice conversion method based on VAWGAN-AC
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110060690A (en) Multi-to-multi voice conversion method based on STARGAN and ResNet
CN109599091B (en) Star-WAN-GP and x-vector based many-to-many speaker conversion method
CN110047501A (en) Multi-to-multi phonetics transfer method based on beta-VAE
CN108461079A (en) A kind of song synthetic method towards tone color conversion
CN105096955B (en) A kind of speaker's method for quickly identifying and system based on model growth cluster
CN103544963A (en) Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN105023570B (en) A kind of method and system for realizing sound conversion
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN103456302B (en) A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight
CN109584893A (en) Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition
CN106997765B (en) Quantitative characterization method for human voice timbre
CN110246488A (en) Half optimizes the phonetics transfer method and device of CycleGAN model
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN110060691A (en) Multi-to-multi phonetics transfer method based on i vector sum VARSGAN
CN104240706A (en) Speaker recognition method based on GMM Token matching similarity correction scores
CN103413548B (en) A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine
CN110189766A (en) A kind of voice style transfer method neural network based

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190802

RJ01 Rejection of invention patent application after publication