CN108777140A - Voice conversion method based on VAE under non-parallel corpus training - Google Patents

Voice conversion method based on VAE under non-parallel corpus training

Info

Publication number
CN108777140A
CN108777140A (application CN201810393556.XA)
Authority
CN
China
Prior art keywords
feature
frame
vae
bottleneck
training
Prior art date
Legal status
Granted
Application number
CN201810393556.XA
Other languages
Chinese (zh)
Other versions
CN108777140B (en)
Inventor
李燕萍
凌云志
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810393556.XA
Publication of CN108777140A
Application granted
Publication of CN108777140B
Status: Active
Anticipated expiration


Classifications

    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L19/02 Speech or audio signal analysis-synthesis techniques for redundancy reduction, using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Abstract

The invention discloses a voice conversion method based on a VAE under the non-parallel corpus training condition. Under the non-parallel text condition, bottleneck features (Bottleneck features) are extracted with a deep neural network, and the learning and modelling of the conversion function are then realized with a variational autoencoder model; in the conversion stage, conversion from multiple speakers to multiple speakers can be realized. The advantages of the present invention lie in three aspects: 1) the dependence on parallel text is removed, and the training process does not need any alignment operation; 2) the conversion systems of multiple source-target speaker pairs can be integrated into a single conversion model, realizing many-to-many conversion; 3) a many-to-many conversion system under the non-parallel text condition provides technical support for the technological direction of practical speech interaction.

Description

Voice conversion method based on VAE under non-parallel corpus training
Technical field
The invention belongs to the field of speech signal processing, and in particular relates to a voice conversion method based on a variational autoencoder (Variational Autoencoder, VAE) trained on non-parallel corpora.
Background technology
Voice conversion is a research branch of speech signal processing, covering content from fields such as speaker recognition, speech recognition and speech synthesis. It aims to change the personalized information of speech while keeping the original semantic information unchanged, so that the speech of a specific speaker (the source speaker) sounds like the speech of another specific speaker (the target speaker). The main tasks of voice conversion are to extract the characteristic parameters of the two speakers' speech and perform mapping and conversion, and then to decode and reconstruct the converted parameters into converted speech. In this process, the auditory quality of the converted speech and the accuracy of the personal characteristics after conversion must be ensured. After years of development, a variety of methods have emerged in the field of voice conversion, among which the statistical conversion method represented by the Gaussian mixture model has become the classical method of the field. However, such algorithms still have certain defects. For example, the classical voice conversion methods based on Gaussian mixture models are mostly one-to-one conversion tasks and require the source speaker and the target speaker to use training sentences of the same content; dynamic time warping (Dynamic Time Warping, DTW) must be applied to align the spectral features frame by frame before the mapping relationship between the spectral features can be obtained by model training, so such voice conversion methods are not flexible enough in practical applications. When the Gaussian mixture model is trained, the mapping function is estimated from global variables and by iterating over the training data, which causes the computational load to surge; moreover, the Gaussian mixture model only reaches a good conversion effect when the training data are abundant, which is unsuitable for limited computing resources and equipment.
In recent years, research in the field of deep learning has accelerated the training speed of deep neural networks and improved their effectiveness, and researchers have continuously proposed new models and new learning methods with strong modelling ability that can learn deeper features from complex data.
The AHOcoder feature extraction model is a speech codec (speech analysis/synthesis system) developed by Daniel Erro at the AHOLAB signal processing laboratory of the University of the Basque Country. AHOcoder decomposes 16 kHz, 16-bit monophonic WAV speech into three parts: the fundamental frequency (F0), the spectrum (mel cepstral coefficients, MFCC) and the maximum voiced frequency. The AHOcoder analysis/synthesis model provides accurate speech analysis and high-quality speech waveform reconstruction.
The fundamental frequency is an important parameter influencing the prosodic characteristics of speech. For the conversion of the fundamental frequency, the voice conversion method designed by the present invention uses the traditional Gaussian normalization method. Assuming that the log-F0 of the voiced segments of the source speaker and of the target speaker follows a Gaussian distribution, the mean and variance of the voiced-segment log-F0 Gaussian distributions of the source speaker and of the target speaker are computed separately. The source speaker's voiced-segment log-F0 is then converted to the target speaker's voiced-segment log-F0 by
log F0_conv = μ_tgt + (σ_tgt / σ_src) (log F0_src - μ_src),
while unvoiced segments are left unchanged.
Here μ_src and σ_src denote the mean and standard deviation of the source speaker's voiced-segment log-F0, μ_tgt and σ_tgt denote those of the target speaker, F0_src denotes the source speaker's fundamental frequency, and F0_conv denotes the converted fundamental frequency.
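For illustration only (this code is not part of the patent text): a minimal Python sketch of the Gaussian log-F0 normalization described above. It assumes unvoiced frames are marked by F0 = 0 and that the per-speaker statistics have already been estimated on voiced frames; all function and variable names are illustrative.

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Match the source speaker's voiced-segment log-F0 statistics to the
    target speaker's: scale by the ratio of standard deviations and shift
    to the target mean. Unvoiced frames (F0 == 0) are left unchanged."""
    f0_conv = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    log_f0_conv = mu_tgt + (sigma_tgt / sigma_src) * (log_f0 - mu_src)
    f0_conv[voiced] = np.exp(log_f0_conv)
    return f0_conv

# statistics estimated beforehand from each speaker's voiced frames, e.g.
# mu_src, sigma_src = log_f0_src.mean(), log_f0_src.std()
```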
Summary of the invention
To solve the above problems, the present invention proposes a voice conversion method based on VAE under non-parallel corpus training, which removes the dependence on parallel text, realizes conversion from multiple speakers to multiple speakers, improves flexibility, and solves the technical problem that voice conversion is hard to realize under limited resource and equipment conditions.
The present invention adopts the following technical scheme. A voice conversion method based on VAE under non-parallel corpus training comprises a training step and a voice conversion step:
Training step:
1) Use the AHOcoder speech codec to extract the mel cepstral feature parameters X of the speech of each speaker participating in training;
2) Apply difference processing to each frame of the extracted mel cepstral feature parameters X and splice the result with the original feature parameters X; in the time domain, splice the resulting feature parameters with those of the preceding and following frames to form the joint feature parameters x_n;
3) Use the joint feature parameters x_n and the speaker classification label features y_n to train a deep neural network (Deep Neural Network, DNN); adjust the DNN weights to reduce the classification error until the network converges, obtaining the DNN based on the speaker-recognition task, and extract the bottleneck feature of each frame, i.e. the Bottleneck features b_n;
4) Use the joint feature parameters x_n and the corresponding per-frame Bottleneck features b_n to train the variational autoencoder (Variational Autoencoder, VAE) model until training converges, and extract the sampling features z_n of each frame of the VAE latent space z;
5) Splice the sampling features z_n with the label features y_n of the speaker of each corresponding frame to obtain the training data of the Bottleneck feature mapping network (a BP network), train the Bottleneck feature mapping network with the per-frame Bottleneck features b_n as supervision, and minimize the output error of the Bottleneck feature mapping network with the stochastic gradient descent algorithm, obtaining the Bottleneck feature mapping network;
The trained DNN, VAE and Bottleneck feature mapping networks are combined to constitute the voice conversion system based on the VAE and Bottleneck features.
Voice conversion step:
6) Pass the joint feature parameters X_p of the speech to be converted through the encoder module of the VAE model to obtain the sampling features z_n of each frame of the latent space z;
7) Splice the sampling features z_n with the label features y_n of the target speaker frame by frame and feed them into the Bottleneck feature mapping network to obtain the Bottleneck features b̂_n of the target speaker;
8) Splice the Bottleneck features b̂_n with the sampling features z_n frame by frame and reconstruct, through the decoder module of the VAE model, the joint feature parameters X_p' of the converted speech;
9) Reconstruct the speech signal with the AHOcoder speech codec.
Preferably, extracting the mel cepstral features of the speech of the speakers participating in training in step 1) means using the AHOcoder speech codec to extract the mel cepstral features of each training speaker's speech separately and reading the mel cepstral features into the Matlab platform.
Preferably, obtaining the joint feature parameters in step 2) is specifically: compute the first-order and second-order differences of each frame of the extracted feature parameters X and splice them with the original features to obtain the feature parameters X_t = (X, ΔX, Δ²X); in the time domain, splice the spliced feature parameters X_t with those of the preceding and following frames to form the joint feature parameters x_n = (X_{t-1}, X_t, X_{t+1}).
Preferably, extracting the Bottleneck features b_n in step 3) comprises the following steps:
31) Obtain, on the MATLAB platform, the speaker classification label features y_n corresponding to each frame of the joint feature parameters x_n;
32) Perform unsupervised pre-training of the DNN with the layer-wise greedy pre-training method, using the ReLU function as the activation function of the hidden layers;
33) Set the DNN output layer to a softmax classification output, use the speaker classification label features y_n as the supervision for supervised training of the DNN, adjust the network weights with the stochastic gradient descent algorithm, and minimize the error between the DNN classification output and the speaker classification label features y_n until convergence, obtaining the DNN based on the speaker-recognition task;
34) Feed the joint feature parameters x_n into the DNN frame by frame with the feed-forward algorithm and extract the activation values of the Bottleneck layer of each frame, i.e. the Bottleneck features b_n corresponding to each frame's mel cepstral feature parameters.
Preferably, the VAE model training in step 4) comprises the following steps:
41) Use the joint feature parameters x_n as the training data of the encoder module of the VAE model and the Bottleneck features b_n as the training data for decoding and reconstruction by the decoder module, and train the VAE model; in the decoder module of the VAE model, the Bottleneck features b_n serve as the control information of the spectrum reconstruction process, i.e. the Bottleneck features b_n are spliced frame by frame with the sampling features z_n and pass through the decoder module of the VAE model to reconstruct the speech spectral features during training;
42) Use the ADAM optimizer to optimize the KL divergence and mean square error in the VAE parameter-estimation process and thereby adjust the VAE network weights, obtaining the VAE speech spectrum conversion model;
43) Feed the joint feature parameters x_n into the VAE speech spectrum conversion model frame by frame and obtain the latent sampling features z_n through the sampling process.
Preferably, obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) Splice the sampling features z_n with the speaker classification label features y_n of each corresponding frame as the training data of the Bottleneck feature mapping network; the Bottleneck feature mapping network uses a structure of one input layer, one hidden layer and one output layer, the hidden-layer activation function is the sigmoid function, and the output layer is a linear output;
52) According to the mean-square-error minimization criterion, optimize the weights of the Bottleneck feature mapping network with the back-propagation stochastic gradient descent algorithm, minimizing the error between the network output Bottleneck features b̂_n and the Bottleneck features b_n corresponding to each frame.
Preferably, obtaining the joint feature parameters X_p of the speech to be converted in step 6) is specifically: extract the mel cepstral feature parameters of the speech to be converted with AHOcoder, compute on the MATLAB platform the first-order and second-order differences of each extracted frame's feature parameters and splice them with the original features; in the time domain, splice the resulting feature parameters with those of the preceding and following frames to form the joint feature parameters, i.e. the feature parameters X_p of the spectrum of the speech to be converted.
Preferably, reconstructing the speech signal in step 9) is specifically: restore the converted speech feature parameters X_p' to the mel cepstral feature form, i.e. remove the time-domain context splicing and the difference terms, and then synthesize the converted speech with the AHOcoder speech codec.
Beneficial effects of the invention: the present invention is a voice conversion method based on a VAE trained on non-parallel corpora; it removes the dependence on parallel text, realizes many-to-many speaker conversion, improves flexibility, and solves the technical problem that voice conversion is hard to realize under limited resource and equipment conditions. The advantages of the invention are:
1) Using the VAE model, the phoneme information in the speech spectral features that is unrelated to the speaker's personal characteristics can be separated into the latent layer through modelling and learning, so that voice conversion can be learned by the VAE model from non-parallel speech data; this removes the limitation of traditional voice conversion models, which require parallel corpora of the source and target speakers for training and require the speech spectral features to be aligned, greatly improves the practicality and flexibility of the voice conversion system, and provides convenience for designing cross-lingual voice conversion systems;
2) The voice conversion network obtained by training the VAE model can handle multiple conversion cases; compared with a traditional one-to-one voice conversion system, multiple conversion tasks can be completed by training only one model, which greatly improves the efficiency of training voice conversion models;
3) In the decoder module of the VAE model, the Bottleneck features b_n are used as the speaker's personal characteristics to reconstruct the converted speech spectral features; compared with a system that uses the speaker classification label features y_n as the features characterizing the speaker's identity information, the conversion effect and sound quality of the resulting converted speech are better.
Description of the drawings
Fig. 1 is a block diagram of the training procedure of the system of the present invention;
Fig. 2 is a block diagram of the conversion procedure of the system of the present invention;
Fig. 3 is a structure diagram of the DNN based on the speaker-recognition task of the present invention;
Fig. 4 is a structure diagram of the VAE speech spectral feature conversion network of the present invention;
Fig. 5 is a structure diagram of the Bottleneck feature mapping network of the present invention;
Fig. 6 is a schematic diagram of the variational-Bayes parameter estimation of the VAE model;
Fig. 7 compares the MCD values of converted speech for different conversion cases when different features are used, based on the VAE model, to characterize the speaker's personal characteristics.
Detailed description of the embodiments
The technical scheme of the present invention is further elaborated below with reference to the drawings and in conjunction with the embodiments.
The present invention adopts the following technical scheme: a voice conversion method based on a VAE trained on non-parallel corpora. The mel cepstral features of the speech are extracted with the AHOcoder speech codec and, on the MATLAB platform, spliced with their first-order and second-order difference features; the feature parameters of the preceding and following frames are then spliced on to form the joint feature parameters x_n. The x_n are used as training data to train a DNN based on the speaker-recognition task; after the network training converges, the x_n are fed into the DNN frame by frame and the output of the Bottleneck layer of each frame is obtained, giving the Bottleneck feature parameters b_n that carry the speaker's personal characteristics. The x_n are then used as the training data of the encoder module of the VAE model and the b_n as the training data for decoding and reconstruction by the decoder module; the VAE model is trained so that its encoder module obtains, in the latent space z, the phoneme information z_n carrying the semantic features (the sampling features), while its decoder module reconstructs the speech spectral features from the phoneme information z_n containing the semantic features and the Bottleneck features b_n containing the speaker's personal characteristics. Afterwards, the joint features formed by splicing the phoneme information z_n with the speaker classification label features y_n are used as the training data of a BP network to train the target-speaker Bottleneck feature mapping network, with the goal of minimizing the error between the network output and the Bottleneck features b_n of each frame. During conversion, the spectral features of the speech to be converted are first passed through the encoder module of the VAE model to extract the corresponding phoneme information z_n containing the semantic features, which is spliced frame by frame with the classification label features y_n of the target speaker to form joint features; these joint features are fed into the BP network to obtain the Bottleneck features b̂_n of each frame of the target speaker. The joint features formed by splicing, frame by frame, the phoneme information z_n containing the semantic features with the per-frame Bottleneck features b̂_n of the target speaker are then reconstructed by the decoder module of the VAE model into the converted speech spectral features, and the speech is finally synthesized again with AHOcoder. The method specifically comprises a training step and a voice conversion step:
Fig. 1 is a block diagram of the training procedure of the system of the present invention. Training step:
1) Use the AHOcoder speech codec to extract the mel cepstral feature parameters X of the speech of each speaker participating in training.
Extracting the mel cepstral features of the speech of the speakers participating in training means using the AHOcoder speech codec to extract the mel cepstral features of each training speaker's speech separately and reading them into the Matlab platform. The present invention uses 19-dimensional mel cepstral features; the speech content of the speakers may differ, and no DTW alignment is needed.
2) Apply difference processing to each frame of the extracted mel cepstral feature parameters X and splice the result with the original feature parameters; in the time domain, splice the resulting feature parameters with those of the preceding and following frames to form the joint feature parameters x_n.
Compute the first-order and second-order differences of each extracted frame's feature parameters X and splice them with the original features to obtain the 57-dimensional difference feature parameters X_t = (X, ΔX, Δ²X); in the time domain, splice the spliced feature parameters X_t with those of the preceding and following frames to form the 171-dimensional joint feature parameters x_n = (X_{t-1}, X_t, X_{t+1}).
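As a sketch of the splicing in step 2) (illustrative code, not part of the patent text; the exact delta computation, e.g. simple adjacent-frame differences versus a regression window, is an implementation choice the patent does not fix, and all names are assumptions):

```python
import numpy as np

def splice_features(mcep):
    """mcep: (T, 19) mel cepstral coefficients of one utterance.
    Returns the 171-dimensional joint feature parameters x_n."""
    # first- and second-order differences along time (adjacent-frame deltas)
    d1 = np.diff(mcep, axis=0, prepend=mcep[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    xt = np.concatenate([mcep, d1, d2], axis=1)        # X_t, shape (T, 57)
    # append the previous and the next frame: x_n = (X_{t-1}, X_t, X_{t+1})
    prev = np.vstack([xt[:1], xt[:-1]])
    nxt = np.vstack([xt[1:], xt[-1:]])
    return np.concatenate([prev, xt, nxt], axis=1)     # shape (T, 171)
```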
3) Use the joint feature parameters x_n and the speaker classification label features y_n to train the DNN; adjust the DNN weights to reduce the classification error until the network converges, obtaining the DNN based on the speaker-recognition task, and extract the Bottleneck features b_n of each frame.
The structure of the DNN used for Bottleneck feature extraction in the present invention is shown in Fig. 3. The number of input-layer nodes corresponds to the dimensionality of the speech spectral features participating in training; the output is the softmax classification output over the speakers, with the number of nodes depending on the number of speakers participating in training. Extracting the Bottleneck features b_n comprises the following steps:
31) Obtain, on the MATLAB platform, the speaker classification label features y_n corresponding to each frame of the joint feature parameters x_n; at this point no distinction is made between source and target speakers, and the feature parameters of each frame are only distinguished by the speaker classification label features y_n;
32) The DNN is a fully connected neural network; a 9-layer DNN model is used, with 171 input-layer nodes corresponding to the 171-dimensional features x_n of each frame and 7 intermediate hidden layers with 1200, 1200, 1200, 57, 1200, 1200 and 1200 nodes respectively, the hidden layer with the smaller number of nodes being the Bottleneck layer. The connection weights between the layers of the DNN are pre-trained without supervision using the layer-wise greedy pre-training method, and the activation function of the hidden layers is the ReLU function, which is closer to the behaviour of biological neurons, i.e.:
f(x) = max(0, x)
Because the ReLU function has one-sided inhibition, sparse activation and a relatively wide excitation boundary, it is considered to have a better ability to express primitive features.
The activation value of the (k+1)-th hidden layer is: h_{k+1} = f(w_k h_k + B_k)
where h_{k+1} and h_k are the activation values of the (k+1)-th and k-th hidden layers respectively, w_k is the connection weight between the (k+1)-th and k-th layers, and B_k is the bias of the k-th layer.
33) Set the DNN output layer to a softmax classification output. Spectral feature parameters of 100 utterances from each of 5 speakers are selected for training, so the output layer has 5 nodes, corresponding to the label features of the 5 speakers. The speaker classification label features y_n are used as the supervision for supervised training of the DNN, the network weights are adjusted with the stochastic gradient descent algorithm, and the error between the DNN classification output and the speaker classification label features y_n is minimized until convergence, yielding the DNN based on the speaker-recognition task, i.e. the Bottleneck feature extraction network;
34) Feed the joint feature parameters x_n into the DNN frame by frame with the feed-forward algorithm and extract the activation values of the Bottleneck layer of each frame, i.e. the Bottleneck features b_n corresponding to each frame's feature parameters. In the present invention the Bottleneck layer is the 4th hidden layer, i.e.:
b_n = f(w_3 h_3 + B_3)
where h_3 is the activation value of the 3rd hidden layer, w_3 is the connection weight between the 3rd and 4th layers, and B_3 is the bias of the 3rd layer.
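The following PyTorch sketch (illustrative only, not the patent's implementation) shows a speaker-classification DNN with the layer sizes given above, whose narrow fourth hidden layer provides the Bottleneck features b_n; the layer-wise greedy pre-training of step 32) is omitted and all names are assumptions:

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """171 -> 1200 -> 1200 -> 1200 -> 57 -> 1200 -> 1200 -> 1200 -> softmax(5);
    the 57-unit hidden layer is the Bottleneck layer."""
    def __init__(self, in_dim=171, n_speakers=5):
        super().__init__()
        sizes = [in_dim, 1200, 1200, 1200, 57, 1200, 1200, 1200]
        self.hidden = nn.ModuleList(
            [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)])
        self.out = nn.Linear(sizes[-1], n_speakers)

    def forward(self, x, return_bottleneck=False):
        for i, layer in enumerate(self.hidden):
            x = torch.relu(layer(x))                 # ReLU hidden activations
            if return_bottleneck and i == 3:         # 4th hidden layer = Bottleneck
                return x
        return self.out(x)                           # logits for softmax classification

# training (step 33): cross-entropy against the speaker labels y_n with SGD;
# after convergence (step 34): b_n = model(x_n, return_bottleneck=True)
```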
4) Use the joint feature parameters x_n and the corresponding per-frame Bottleneck features b_n to train the VAE model until training converges, and extract the sampling features z_n of each frame of the VAE latent space z.
The variational autoencoder (Variational Auto-encoder, VAE) used in the present invention is a generative learning method; the concrete structure of the model used in the present invention is shown in Fig. 4, where x_{s,n} denotes the feature parameters of the source speech, x̂_n denotes the feature parameters of the target speaker's speech obtained after conversion, b_n denotes the Bottleneck features of the corresponding frame of the target speaker, μ and σ are the vector representations of the mean and covariance of each Gaussian component respectively, z denotes the latent space of the VAE model obtained through the sampling process, and z_n is the sampling feature. The parameter-estimation process of VAE model training is shown in Fig. 6. VAE model training comprises the following steps:
41) Use the joint feature parameters x_n as the training data of the encoder module of the VAE model and the Bottleneck features b_n as the training data for decoding and reconstruction by the decoder module, and train the VAE model; in the decoder module of the VAE model, the Bottleneck features b_n serve as the control information of the spectrum reconstruction process, i.e. the Bottleneck features b_n are spliced frame by frame with the sampling features z_n and pass through the decoder module of the VAE model to reconstruct the speech spectral features during training.
In the present invention the encoder input layer of the VAE model has 171 nodes, followed by two hidden layers: the first has 500 nodes and the second 64 nodes. In the second layer, the first 32 nodes compute the means of the components of the Gaussian mixture and the last 32 nodes compute their variances (so the neural network computes a Gaussian mixture that better fits the input signal). The latent space layer has 32 nodes, and the value of each node is obtained by sampling from the second hidden layer. The decoder module contains one hidden layer with 500 nodes, and its output layer has 171 nodes. Except for the latent space layer, which has a linear output, the other hidden layers use ReLU activations.
42) Following the variational-Bayes principle of the VAE model, use the ADAM optimizer to optimize the KL (Kullback-Leibler divergence) term and the mean square error in the parameter-estimation process of the VAE model shown in Fig. 4, adjusting the VAE network weights and obtaining the VAE speech spectrum conversion model;
43) Feed the joint feature parameters x_n into the VAE speech spectrum conversion model frame by frame and obtain the latent sampling features z_n through the sampling process.
More intuitively, this method uses the decoder module of the VAE model to modulate the speaker's personal characteristics b_n onto the phoneme information z_n that carries the semantic features.
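The conditional VAE of steps 41) to 43) can be sketched as follows (illustrative PyTorch code, not part of the patent; it uses two separate linear heads for the 32 means and 32 log-variances instead of one 64-node layer, and a diagonal Gaussian rather than the mixture mentioned above; names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    """Encoder 171 -> 500 -> (mu, logvar) of dim 32; decoder (z, b_n) -> 500 -> 171."""
    def __init__(self, feat_dim=171, bn_dim=57, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 500)
        self.enc_mu = nn.Linear(500, z_dim)
        self.enc_logvar = nn.Linear(500, z_dim)
        self.dec = nn.Linear(z_dim + bn_dim, 500)
        self.dec_out = nn.Linear(500, feat_dim)

    def encode(self, x):
        h = torch.relu(self.enc(x))
        return self.enc_mu(h), self.enc_logvar(h)

    def decode(self, z, b):
        # the Bottleneck features b_n are spliced with z_n as decoder input
        h = torch.relu(self.dec(torch.cat([z, b], dim=-1)))
        return self.dec_out(h)

    def forward(self, x, b):
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sampling
        return self.decode(z, b), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    # reconstruction error (mean square error) plus KL divergence to N(0, I), step 42)
    rec = F.mse_loss(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# optimizer = torch.optim.Adam(vae.parameters())  # ADAM optimizer, as in step 42)
```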
5) Splice the sampling features z_n with the classification label features y_n of the speaker of each corresponding frame to obtain the training data of the Bottleneck feature mapping network (a BP network), train the Bottleneck feature mapping network with the per-frame Bottleneck features b_n as supervision, and minimize the output error of the Bottleneck feature mapping network with the stochastic gradient descent algorithm, obtaining the Bottleneck feature mapping network.
The target-speaker Bottleneck feature mapping network used in the present invention is a BP network whose structure is shown in Fig. 5; its input parameters are z_n + y_n, where z_n are the latent-layer features of the variational autoencoder and y_n are the label features of the speakers participating in training, and its output is the Bottleneck features b_n of the target speaker. Obtaining the Bottleneck feature mapping network comprises the following steps:
51) Splice the sampling features z_n of the VAE latent space with the speaker classification label features y_n of each corresponding frame as the training data of the Bottleneck feature mapping network. The Bottleneck feature mapping network is a three-layer feed-forward fully connected neural network comprising one input layer, one hidden layer and one output layer. The input layer has 37 nodes, of which 32 correspond to the dimensionality of the sampling features z_n of the VAE model and 5 correspond to the 5-dimensional speaker classification label features y_n formed by the five speakers participating in training; the output layer has 57 nodes, corresponding to the 57-dimensional Bottleneck features. In between there is one hidden layer with 1200 nodes; the hidden-layer activation function is the sigmoid function, which introduces a nonlinearity, and the output layer is a linear output. The expression of the sigmoid function is:
f(x) = 1 / (1 + e^(-x))
52) According to the mean-square-error minimization criterion, optimize the weights of the Bottleneck feature mapping network with the back-propagation stochastic gradient descent algorithm, minimizing the error between the network output Bottleneck features b̂_n and the Bottleneck features b_n corresponding to each frame, i.e. minimizing Σ_n ‖b̂_n - b_n‖².
Optimizing the weights of the whole network finally yields a BP mapping network that can obtain the target speaker's Bottleneck features b̂_n from the sampling features z_n and the classification label features y_n of the target speaker.
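A minimal sketch of the mapping network of steps 51) and 52) (illustrative PyTorch code with the layer sizes stated above; names and the learning rate are assumptions):

```python
import torch
import torch.nn as nn

class BottleneckMapper(nn.Module):
    """Maps the spliced [z_n, y_n] (32 + 5 = 37 dims) to the 57-dimensional
    Bottleneck features of the target speaker."""
    def __init__(self, z_dim=32, n_speakers=5, bn_dim=57, hidden=1200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_speakers, hidden),
            nn.Sigmoid(),                  # sigmoid hidden layer, step 51)
            nn.Linear(hidden, bn_dim))     # linear output layer

    def forward(self, z, y):
        return self.net(torch.cat([z, y], dim=-1))

# training, step 52): minimize the per-frame mean square error with SGD
# loss = nn.MSELoss()(mapper(z_n, y_n), b_n)
# optimizer = torch.optim.SGD(mapper.parameters(), lr=1e-3)
```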
The trained DNN, VAE and Bottleneck feature mapping networks are combined to constitute the voice conversion system based on the VAE and Bottleneck features.
Voice conversion is performed according to the spectrum conversion flow shown in Fig. 2. Voice conversion step:
6) Pass the joint feature parameters X_p of the source speaker's speech to be converted through the encoder module of the VAE model to obtain the sampling features z_n of each frame of the latent space z.
Obtaining the joint feature parameters X_p of the speech to be converted is specifically: extract the mel cepstral feature parameters of the speech to be converted with AHOcoder, compute on the MATLAB platform the first-order and second-order differences of each extracted frame's feature parameters and splice them with the original features; in the time domain, splice the resulting feature parameters with those of the preceding and following frames to form the joint feature parameters, i.e. the feature parameters X_p of the spectrum of the speech to be converted.
7) Splice the sampling features z_n with the classification label features y_n of the target speaker frame by frame and feed them into the Bottleneck feature mapping network to obtain the Bottleneck features b̂_n of the target speaker;
8) Splice the Bottleneck features b̂_n with the sampling features z_n frame by frame and reconstruct, through the decoder module of the VAE model, the joint feature parameters X_p' of the converted speech;
9) Reconstruct the speech signal with the AHOcoder speech codec.
Reconstructing the speech signal is specifically: restore the converted speech feature parameters X_p' to the mel cepstral feature form, i.e. remove the time-domain context splicing and the difference terms, and then synthesize the converted speech with the AHOcoder speech codec.
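Putting steps 6) to 8) together, the spectral conversion stage can be sketched as follows (illustrative code building on the sketches above; x_p is assumed to be the joint-feature matrix of the utterance to be converted and y_tgt the target speaker's one-hot label, both torch tensors):

```python
import torch

def convert_utterance(x_p, y_tgt, vae, mapper):
    """Steps 6)-8): encode, map to the target speaker's Bottleneck features,
    and decode the converted joint feature parameters X_p'."""
    with torch.no_grad():
        mu, logvar = vae.encode(x_p)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # sampling features z_n
        b_tgt = mapper(z, y_tgt.expand(z.size(0), -1))            # target Bottleneck features
        x_conv = vae.decode(z, b_tgt)                             # converted joint features
    return x_conv  # reduce to 19-dim mel cepstra and synthesize with AHOcoder, step 9)
```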
Mel cepstral distortion (Mel-Cepstrum Distortion, MCD) is an objective criterion for measuring the conversion performance in voice conversion: the smaller the MCD value between the converted speech and the target speech, the better the conversion performance of the corresponding voice conversion system. Fig. 7 compares the MCD values of the converted speech for different conversion cases obtained by training the VAE model on non-parallel corpora when different feature parameters are used to characterize the speaker's personal characteristics; it can be seen from the figure that the system that uses Bottleneck features to characterize the speaker's identity information performs better than the conversion system that uses speaker labels to characterize the speaker's identity information.
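For reference, a common way to compute the MCD between time-aligned converted and target mel cepstra is sketched below (illustrative code; the patent only names MCD as the metric, so the constant and the handling of the coefficients follow the usual convention rather than details taken from the patent):

```python
import numpy as np

def mel_cepstral_distortion(mc_conv, mc_tgt):
    """Frame-averaged MCD in dB between (T, D) converted and target mel cepstra."""
    diff = mc_conv - mc_tgt
    return float(np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))
```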
The variational autoencoder (VAE) is a generative learning network model in deep learning. Compared with other deep learning models such as the deep belief network (DBN) and the convolutional neural network (CNN), the encoder of the VAE model can, during training, learn a probability distribution that fits the original input signal through the variational-Bayes principle and obtain, through the sampling process, the latent-space features of the original signal; the decoder then reconstructs the original signal from the sampling features, so that the error between the reconstructed signal and the original signal (or the difference between their probability distributions) is as small as possible. This property of the VAE model can be applied to style transfer. In voice conversion, it can be considered that the VAE model separates, in the latent space, the phoneme information that is unrelated to the speaker's personal characteristics and related to the semantic features, and that the latent-space information can be combined with parameters characterizing the speaker's personal characteristics to reconstruct the speech spectral signal. In the present invention, the Bottleneck features extracted with the DNN based on the speaker-recognition task are used to characterize the speaker's personal characteristics; the mapping network obtained by BP network training provides the mapping relationship between the joint features formed by the phoneme information and the speaker labels and the Bottleneck features, so that the target speaker's Bottleneck features are obtained indirectly from the source speaker's speech spectral features; finally, the decoder module of the VAE reconstructs the latent-space phoneme information and the target speaker's Bottleneck features into the converted speech spectral features.
Aiming at the problem that the traditional GMM conversion method and methods that realize speech spectrum conversion with BP networks require parallel corpora and DTW alignment before model training, the present invention, combining the characteristics of the VAE model, proposes a voice conversion method under non-parallel corpora. The invention has three key points: first, using the DNN based on the speaker-recognition task to extract the Bottleneck features b_n characterizing the speaker's personal characteristics; second, using a BP neural network to establish the mapping relationship between the joint features formed by the sampling features z_n and the speaker classification label features y_n and the Bottleneck features b̂_n; third, using the decoder module of the trained VAE model to reconstruct the joint features formed by the Bottleneck features b̂_n and the sampling features z_n into the converted speech spectral features.
The innovations of the present invention are: 1. exploiting the property of the VAE model of separating, in the latent space, the phoneme information that is unrelated to the speaker's personal characteristics and related to the semantic features, voice conversion under non-parallel corpus training can be realized, and the method can complete multiple conversion tasks for different speakers with a single trained model; 2. the Bottleneck features extracted from the DNN based on the speaker-recognition task participate, as the speaker's personal characteristics, in the decoder reconstruction process of the VAE model, which improves the voice conversion performance.
For some medical assistance systems, for example when providing phonation-assistance equipment for patients who cannot phonate normally because of physiological defects or diseases of the vocal organs, some of the principles involved in the present invention may be adopted. The present invention has good extensibility and provides a solution idea for concrete problems in voice conversion, including the many-to-many (M2M) voice conversion problem.
The above is only the preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements can also be made without departing from the principle of the present invention, and these improvements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A voice conversion method based on VAE under non-parallel corpus training, characterized by comprising a training step and a voice conversion step:
Training step:
1) Use the AHOcoder speech codec to extract the mel cepstral feature parameters X of the speech of each speaker participating in training;
2) Apply difference processing to each frame of the extracted mel cepstral feature parameters X and splice the result with the original feature parameters X; in the time domain, splice the resulting feature parameters X_t with those of the preceding and following frames to form the joint feature parameters x_n;
3) Use the joint feature parameters x_n and the speaker classification label features y_n to train a DNN; adjust the DNN weights to reduce the classification error until the network converges, obtaining the DNN based on the speaker-recognition task, and extract the Bottleneck features b_n of each frame;
4) Use the joint feature parameters x_n and the corresponding per-frame Bottleneck features b_n to train the VAE model until training converges, and extract the sampling features z_n of each frame of the VAE latent space z;
5) Splice the sampling features z_n with the classification label features y_n of the speaker of each corresponding frame to obtain the training data of the Bottleneck feature mapping network, train the Bottleneck feature mapping network with the per-frame Bottleneck features b_n as supervision, and minimize the output error of the Bottleneck feature mapping network with the stochastic gradient descent algorithm, obtaining the Bottleneck feature mapping network;
Voice conversion step:
6) Pass the joint feature parameters X_p of the speech to be converted through the encoder module of the VAE model to obtain the sampling features z_n of each frame of the latent space z;
7) Splice the sampling features z_n with the classification label features y_n of the target speaker frame by frame and feed them into the Bottleneck feature mapping network to obtain the Bottleneck features b̂_n of the target speaker;
8) Splice the Bottleneck features b̂_n with the sampling features z_n frame by frame and reconstruct, through the decoder module of the VAE model, the joint feature parameters X_p' of the converted speech;
9) Reconstruct the speech signal with the AHOcoder speech codec.
2. The voice conversion method based on VAE under non-parallel corpus training according to claim 1, characterized in that extracting the mel cepstral features of the speech of the speakers participating in training in step 1) means using the AHOcoder speech codec to extract the mel cepstral features of each training speaker's speech separately and reading the mel cepstral features into the Matlab platform.
3. The voice conversion method based on VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the joint feature parameters in step 2) is specifically: compute the first-order and second-order differences of each frame of the extracted feature parameters X and splice them with the original feature parameters X to obtain the feature parameters X_t = (X, ΔX, Δ²X); in the time domain, splice the spliced feature parameters X_t with those of the preceding and following frames to form the joint feature parameters x_n = (X_{t-1}, X_t, X_{t+1}).
4. The voice conversion method based on VAE under non-parallel corpus training according to claim 1, characterized in that extracting the Bottleneck features b_n in step 3) comprises the following steps:
31) Obtain, on the MATLAB platform, the speaker classification label features y_n corresponding to each frame of the joint feature parameters x_n;
32) Perform unsupervised pre-training of the DNN with the layer-wise greedy pre-training method, using the ReLU function as the activation function of the hidden layers;
33) Set the DNN output layer to a softmax classification output, use the speaker classification label features y_n as the supervision for supervised training of the DNN, adjust the network weights with the stochastic gradient descent algorithm, and minimize the error between the DNN classification output and the speaker classification label features y_n until convergence, obtaining the DNN based on the speaker-recognition task;
34) Feed the joint feature parameters x_n into the DNN frame by frame with the feed-forward algorithm and extract the activation values of the Bottleneck layer of each frame, i.e. the Bottleneck features b_n corresponding to each frame's mel cepstral feature parameters.
5. The voice conversion method based on VAE under non-parallel corpus training according to claim 1, characterized in that the VAE model training in step 4) comprises the following steps:
41) Use the joint feature parameters x_n as the training data of the encoder module of the VAE model and the Bottleneck features b_n as the training data for decoding and reconstruction by the decoder module, and train the VAE model; in the decoder module of the VAE model, the Bottleneck features b_n serve as the control information of the spectrum reconstruction process, i.e. the Bottleneck features b_n are spliced frame by frame with the sampling features z_n and pass through the decoder module of the VAE model to reconstruct the speech spectral features during training;
42) Use the ADAM optimizer to optimize the KL divergence and mean square error in the VAE parameter-estimation process and thereby adjust the VAE network weights, obtaining the VAE speech spectrum conversion model;
43) Feed the joint feature parameters x_n into the VAE speech spectrum conversion model frame by frame and obtain the latent sampling features z_n through the sampling process.
6. The voice conversion method based on VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) Splice the sampling features z_n of the VAE speech spectrum conversion model with the speaker classification label features y_n of each corresponding frame as the training data of the Bottleneck feature mapping network; the Bottleneck feature mapping network uses a structure of one input layer, one hidden layer and one output layer, the hidden-layer activation function is the sigmoid function, and the output layer is a linear output;
52) According to the mean-square-error minimization criterion, optimize the weights of the Bottleneck feature mapping network with the back-propagation stochastic gradient descent algorithm, minimizing the error between the network output Bottleneck features b̂_n and the Bottleneck features b_n corresponding to each frame.
7. The voice conversion method based on VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the joint feature parameters X_p of the speech to be converted in step 6) is specifically: extract the mel cepstral feature parameters of the speech to be converted with AHOcoder, compute on the MATLAB platform the first-order and second-order differences of each extracted frame's feature parameters and splice them with the original features; in the time domain, splice the resulting feature parameters with those of the preceding and following frames to form the joint feature parameters, i.e. the feature parameters X_p of the spectrum of the speech to be converted.
8. The voice conversion method based on VAE under non-parallel corpus training according to claim 1, characterized in that reconstructing the speech signal in step 9) is specifically: restore the converted speech feature parameters X_p' to the mel cepstral feature form, remove the time-domain context splicing and the difference terms, and then synthesize the converted speech with the AHOcoder speech codec.
CN201810393556.XA 2018-04-27 2018-04-27 Voice conversion method based on VAE under non-parallel corpus training Active CN108777140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810393556.XA CN108777140B (en) 2018-04-27 2018-04-27 Voice conversion method based on VAE under non-parallel corpus training


Publications (2)

Publication Number Publication Date
CN108777140A true CN108777140A (en) 2018-11-09
CN108777140B CN108777140B (en) 2020-07-28

Family

ID=64026673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810393556.XA Active CN108777140B (en) 2018-04-27 2018-04-27 Voice conversion method based on VAE under non-parallel corpus training

Country Status (1)

Country Link
CN (1) CN108777140B (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04248598A (en) * 1990-10-30 1992-09-04 Internatl Business Mach Corp <Ibm> Device and method of editing composite sound information
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
CN102063899B (en) * 2010-10-27 2012-05-23 南京邮电大学 Method for voice conversion under unparallel text condition
CN103258531A (en) * 2013-05-29 2013-08-21 安宁 Harmonic wave feature extracting method for irrelevant speech emotion recognition of speaker
CN103258531B (en) * 2013-05-29 2015-11-11 安宁 A kind of harmonic characteristic extracting method of the speech emotion recognition had nothing to do for speaker
CN104123933A (en) * 2014-08-01 2014-10-29 中国科学院自动化研究所 Self-adaptive non-parallel training based voice conversion method
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
WO2016207978A1 (en) * 2015-06-23 2016-12-29 株式会社大入 Method and device for manufacturing book with audio, and method and device for reproducing acoustic waveform
US20170069306A1 (en) * 2015-09-04 2017-03-09 Foundation of the Idiap Research Institute (IDIAP) Signal processing method and apparatus based on structured sparsity of phonological features
CN106778700A (en) * 2017-01-22 2017-05-31 福州大学 Chinese Sign Language recognition method based on variational autoencoder
CN107301859A (en) * 2017-06-21 2017-10-27 南京邮电大学 Voice conversion method under non-parallel text condition based on adaptive Gaussian clustering
CN107274029A (en) * 2017-06-23 2017-10-20 深圳市唯特视科技有限公司 Future prediction method for interactive media in dynamic scenes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李燕萍 (Li Yanping): "An adaptive frequency scale transform for speaker identification", 《南京理工大学学报》 (Journal of Nanjing University of Science and Technology) *

Cited By (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377978A (en) * 2018-11-12 2019-02-22 南京邮电大学 Many-to-many voice conversion method based on i-vector under non-parallel text condition
CN109377978B (en) * 2018-11-12 2021-01-26 南京邮电大学 Many-to-many speaker conversion method based on i-vector under non-parallel text condition
CN109326283B (en) * 2018-11-23 2021-01-26 南京邮电大学 Many-to-many voice conversion method based on text encoder under non-parallel text condition
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Many-to-many voice conversion method based on text encoder under non-parallel text condition
CN109377986A (en) * 2018-11-29 2019-02-22 四川长虹电器股份有限公司 Non-parallel corpus voice personalized conversion method
CN109377986B (en) * 2018-11-29 2022-02-01 四川长虹电器股份有限公司 Non-parallel corpus voice personalized conversion method
CN109584893A (en) * 2018-12-26 2019-04-05 南京邮电大学 Many-to-many voice conversion system based on VAE and i-vector under non-parallel text condition
CN109584893B (en) * 2018-12-26 2021-09-14 南京邮电大学 VAE and i-vector based many-to-many voice conversion system under non-parallel text condition
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Many-to-many voice conversion method based on STARWGAN-GP and x-vector
CN109671442A (en) * 2019-01-14 2019-04-23 南京邮电大学 Many-to-many voice conversion method based on STARGAN and x-vector
CN109599091B (en) * 2019-01-14 2021-01-26 南京邮电大学 STARWGAN-GP and x-vector based many-to-many speaker conversion method
CN110033096B (en) * 2019-03-07 2021-04-02 北京大学 State data generation method and system for reinforcement learning
CN110033096A (en) * 2019-03-07 2019-07-19 北京大学 State data generation method and system for reinforcement learning
CN110070895A (en) * 2019-03-11 2019-07-30 江苏大学 Mixed sound event detection method based on supervised variational encoder factor decomposition
CN110047501A (en) * 2019-04-04 2019-07-23 南京邮电大学 Many-to-many voice conversion method based on beta-VAE
CN110047501B (en) * 2019-04-04 2021-09-07 南京邮电大学 Many-to-many voice conversion method based on beta-VAE
CN110060690A (en) * 2019-04-04 2019-07-26 南京邮电大学 Many-to-many voice conversion method based on STARGAN and ResNet
CN110060691B (en) * 2019-04-16 2023-02-28 南京邮电大学 Many-to-many voice conversion method based on i-vector and VARSGAN
CN110060691A (en) * 2019-04-16 2019-07-26 南京邮电大学 Many-to-many voice conversion method based on i-vector and VARSGAN
CN110085254A (en) * 2019-04-22 2019-08-02 南京邮电大学 Many-to-many voice conversion method based on beta-VAE and i-vector
US11854562B2 (en) 2019-05-14 2023-12-26 International Business Machines Corporation High-quality non-parallel many-to-many voice conversion
CN110164463A (en) * 2019-05-23 2019-08-23 北京达佳互联信息技术有限公司 Voice conversion method, device, electronic equipment and storage medium
CN110164463B (en) * 2019-05-23 2021-09-10 北京达佳互联信息技术有限公司 Voice conversion method and device, electronic equipment and storage medium
CN110211575B (en) * 2019-06-13 2021-06-04 思必驰科技股份有限公司 Voice noise adding method and system for data enhancement
CN110211575A (en) * 2019-06-13 2019-09-06 苏州思必驰信息科技有限公司 Voice noise adding method and system for data enhancement
CN110648658B (en) * 2019-09-06 2022-04-08 北京达佳互联信息技术有限公司 Method and device for generating voice recognition model and electronic equipment
CN110648658A (en) * 2019-09-06 2020-01-03 北京达佳互联信息技术有限公司 Method and device for generating voice recognition model and electronic equipment
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device
WO2021212954A1 (en) * 2020-04-21 2021-10-28 升智信息科技(南京)有限公司 Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN111724809A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Vocoder implementation method and device based on variational autoencoder
CN112309365A (en) * 2020-10-21 2021-02-02 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
WO2022083083A1 (en) * 2020-10-21 2022-04-28 南京硅基智能科技有限公司 Sound conversion system and training method for same
US11875775B2 (en) 2020-10-21 2024-01-16 Nanjing Silicon Intelligence Technology Co., Ltd. Voice conversion system and training method therefor
CN112382271B (en) * 2020-11-30 2024-03-26 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN112382271A (en) * 2020-11-30 2021-02-19 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN113032558B (en) * 2021-03-11 2023-08-29 昆明理工大学 Variational semi-supervised Baidu Encyclopedia classification method integrating wiki knowledge
CN113032558A (en) * 2021-03-11 2021-06-25 昆明理工大学 Variational semi-supervised Baidu Encyclopedia classification method fusing wiki knowledge
CN113299267A (en) * 2021-07-26 2021-08-24 北京语言大学 Voice stimulation continuum synthesis method and device based on variational autoencoder
CN113571039B (en) * 2021-08-09 2022-04-08 北京百度网讯科技有限公司 Voice conversion method, system, electronic equipment and readable storage medium
CN113571039A (en) * 2021-08-09 2021-10-29 北京百度网讯科技有限公司 Voice conversion method, system, electronic equipment and readable storage medium
CN113763987A (en) * 2021-09-06 2021-12-07 中国科学院声学研究所 Training method and device of voice conversion model
CN113763924A (en) * 2021-11-08 2021-12-07 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
CN114360557A (en) * 2021-12-22 2022-04-15 北京百度网讯科技有限公司 Voice tone conversion method, model training method, device, equipment and medium
CN114360557B (en) * 2021-12-22 2022-11-01 北京百度网讯科技有限公司 Voice tone conversion method, model training method, device, equipment and medium
WO2024069726A1 (en) * 2022-09-27 2024-04-04 日本電信電話株式会社 Learning device, conversion device, training method, conversion method, and program

Also Published As

Publication number Publication date
CN108777140B (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN108777140A (en) Voice conversion method based on VAE under non-parallel corpus training
CN112767958B (en) Zero-shot learning-based cross-language tone conversion system and method
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on neural network
CN110060690A (en) Many-to-many voice conversion method based on STARGAN and ResNet
CN109671442A (en) Many-to-many voice conversion method based on STARGAN and x-vector
Luo et al. Emotional voice conversion using deep neural networks with MCC and F0 features
Casale et al. Multistyle classification of speech under stress using feature subset selection based on genetic algorithms
CN109671423A (en) Non-parallel text voice conversion method with limited training data
CN110060657A (en) Many-to-many voice conversion method based on SN
Azizah et al. Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages
Liu et al. Voice conversion with transformer network
Lai et al. Phone-aware LSTM-RNN for voice conversion
Moon et al. Mist-tacotron: End-to-end emotional speech synthesis using mel-spectrogram image style transfer
Wang et al. A study on acoustic modeling for child speech based on multi-task learning
Bi et al. Deep feed-forward sequential memory networks for speech synthesis
Latif et al. Generative emotional AI for speech emotion recognition: The case for synthetic emotional speech augmentation
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
Choi et al. A melody-unsupervision model for singing voice synthesis
Kang et al. Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion
CN113470622A (en) Conversion method and device capable of converting any voice into multiple voices
Zhao et al. Research on voice cloning with a few samples
Wu et al. Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression
Li et al. Many-to-many voice conversion based on bottleneck features with variational autoencoder for non-parallel training data
Nazir et al. Deep learning end to end speech synthesis: A review
Kaushik et al. End-to-end speaker age and height estimation using attention mechanism and triplet loss

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant