CN108777140A - Voice conversion method based on a VAE under non-parallel corpus training - Google Patents
- Publication number: CN108777140A (application CN201810393556.XA)
- Authority
- CN
- China
- Prior art keywords
- feature
- frame
- vae
- bottleneck
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L19/00—Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—... using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants, characterised by the process used
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—... characterised by the type of extracted parameters
- G10L25/24—... the extracted parameters being the cepstrum
- G10L25/27—... characterised by the analysis technique
- G10L25/30—... using neural networks
Abstract
The invention discloses a voice conversion method based on a variational autoencoder (VAE) trained on a non-parallel corpus. Under non-parallel text conditions, bottleneck features (Bottleneck features) are first extracted by a deep neural network; the conversion function is then learned and modelled with a variational autoencoder; in the conversion stage, many-to-many speaker conversion can be achieved. The invention has three advantages: 1) the dependence on parallel text is removed, and the training process requires no alignment operation; 2) the conversion systems of multiple source-target speaker pairs can be merged into a single conversion model, realizing many-to-many conversion; 3) a many-to-many conversion system under non-parallel text conditions provides technical support for practical interactive speech applications.
Description
Technical field
The invention belongs to the field of speech signal processing, and in particular relates to a voice conversion method based on a variational autoencoder (Variational Autoencoder, VAE) trained on a non-parallel corpus.
Background technology
Voice conversion is a research branch of speech signal processing that draws on the fields of speaker recognition, speech recognition and speech synthesis. Its aim is to change the personalized (speaker-identity) information of an utterance while keeping the original semantic content unchanged, so that the voice of one particular speaker (the source speaker) sounds like that of another particular speaker (the target speaker). The main tasks of voice conversion are to extract the characteristic parameters of the two speakers' voices, map the source parameters to the target, and then decode the converted parameters back into a speech waveform. In this process, both the auditory quality of the converted speech and the accuracy of the converted speaker characteristics must be ensured. After years of development, a variety of methods have emerged in the voice conversion field, among which statistical conversion methods represented by the Gaussian mixture model (GMM) have become the classic approach. However, such algorithms still have certain defects. The classic GMM-based methods are mostly built for one-to-one conversion tasks: they require the source and target speakers to utter the same training sentences, and the spectral features must be aligned frame by frame with dynamic time warping (Dynamic Time Warping, DTW) before the mapping between spectral features can be learned by model training, which makes these methods inflexible in practical applications. Moreover, training the GMM mapping function involves global variables and iterative passes over the training data, causing a sharp increase in computation, and a GMM only reaches a good conversion effect when training data are abundant, which is unsuitable for settings with limited computing resources and equipment.
In recent years, research in deep learning has accelerated the training of deep neural networks and improved their effectiveness, and researchers keep proposing new models and learning methods. Deep networks have strong modelling ability and can learn deeper features from complex data.
The AHOcoder characteristic-parameter extraction model is a speech codec (speech analysis/synthesis system) developed by Daniel Erro at the AHOLAB signal processing laboratory of the University of the Basque Country. AHOcoder decomposes 16 kHz, 16-bit monophonic WAV speech into three parts: fundamental frequency (F0), spectrum (mel cepstral coefficients, MFCC) and maximum voiced frequency. The AHOcoder speech analysis/synthesis model provides accurate speech analysis and high-quality speech waveform reconstruction.
Fundamental frequency is an important parameter influencing speech prosody. For the conversion of fundamental frequency, the voice conversion method designed by the present invention uses the traditional Gaussian-normalization conversion. Assuming that the log fundamental frequencies of the voiced segments of the source speaker and the target speaker follow Gaussian distributions, the mean and standard deviation of the voiced-segment log F0 of each speaker are computed separately. The voiced-segment log F0 of the source speaker is then converted to that of the target speaker with the following formula, while unvoiced segments are left unchanged:

log F0_conv = μ_tgt + (σ_tgt / σ_src) · (log F0_src − μ_src)

where μ_src and σ_src denote the mean and standard deviation of the source speaker's voiced-segment log F0, μ_tgt and σ_tgt denote those of the target speaker, F0_src denotes the fundamental frequency of the source speaker, and F0_conv denotes the converted fundamental frequency.
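As an illustration, the per-frame Gaussian-normalized conversion above can be sketched in Python (a minimal sketch, not the patent's code; the function name and the use of 0 to mark unvoiced frames are assumptions, and σ denotes the standard deviation of voiced-segment log F0):

```python
import math

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Gaussian-normalized log-F0 conversion.

    f0_src: per-frame F0 values in Hz; unvoiced frames are marked 0
    and are passed through unchanged, as the text specifies.
    mu/sigma: mean and standard deviation of voiced-segment log F0
    for the source and target speakers.
    """
    converted = []
    for f0 in f0_src:
        if f0 <= 0:
            # unvoiced frame: leave unchanged
            converted.append(f0)
        else:
            log_f0 = math.log(f0)
            log_conv = mu_tgt + (sigma_tgt / sigma_src) * (log_f0 - mu_src)
            converted.append(math.exp(log_conv))
    return converted
```

A frame whose log F0 equals the source mean μ_src is mapped exactly to the target mean μ_tgt, which is the defining property of this normalization.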
Summary of the invention
To solve the above problems, the present invention proposes a voice conversion method based on a VAE trained on a non-parallel corpus. It removes the dependence on parallel text, realizes many-to-many speaker conversion, improves flexibility, and solves the technical problem that voice conversion is difficult to realize under conditions of limited resources and equipment.
The present invention adopts the following technical scheme. The voice conversion method based on a VAE under non-parallel corpus training comprises a training step and a voice conversion step:
Training step:
1) Use the AHOcoder speech coder to extract the mel cepstral characteristic parameters X of each speaker's voice participating in training;
2) Apply difference processing to each frame of the extracted mel cepstral characteristic parameters X and splice the differences with the original parameters X; then splice the resulting parameters with those of the preceding and following frames in the time domain to form the joint characteristic parameters x_n;
3) Use the joint characteristic parameters x_n and the speaker classification label features y_n to train a deep neural network (Deep Neural Network, DNN); adjust the DNN weights to reduce the classification error until the network converges, obtaining a DNN for the speaker recognition task; extract the bottleneck feature of each frame, i.e. the Bottleneck features b_n;
4) Use the joint characteristic parameters x_n and the corresponding per-frame Bottleneck features b_n to train the VAE model until training converges, yielding the variational autoencoder (Variational Autoencoder, VAE) model; extract the sampled feature z_n of each frame from the latent space z of the VAE;
5) Splice the sampled features z_n with the corresponding per-frame speaker label features y_n to obtain the training data of the Bottleneck feature mapping network (a BP network), and train this mapping network with the per-frame Bottleneck features b_n as supervision; minimize the output error of the mapping network by stochastic gradient descent, obtaining the Bottleneck feature mapping network;
The trained DNN, the VAE and the Bottleneck feature mapping network are combined to constitute the voice conversion system based on the VAE and Bottleneck features.
Voice conversion step:
6) Pass the joint characteristic parameters X_p of the voice to be converted through the encoder module of the VAE model to obtain the sampled feature z_n of each frame in the latent space z;
7) Splice the sampled features z_n with the label features y_n of the target speaker frame by frame and input them into the Bottleneck feature mapping network, obtaining the target speaker's Bottleneck features b̂_n;
8) Splice the Bottleneck features b̂_n with the sampled features z_n frame by frame and pass them through the decoder module of the VAE model to reconstruct the joint characteristic parameters X_p' of the converted voice;
9) Use the AHOcoder speech coder to reconstruct the speech signal.
Preferably, extracting the mel cepstral features of the speakers' voices participating in training in step 1) means using the AHOcoder speech coder to extract the mel cepstral features of each speaker's voice separately and reading the mel cepstral features into the Matlab platform.
Preferably, obtaining the joint characteristic parameters in step 2) is specifically: compute the first-order and second-order differences of each extracted frame of characteristic parameters X and splice them with the original features to obtain the characteristic parameters X_t = (X, ΔX, Δ²X); then splice the spliced parameters X_t with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameters x_n = (X_{t-1}, X_t, X_{t+1}).
Preferably, extracting the Bottleneck features b_n in step 3) comprises the following steps:
31) obtain on the MATLAB platform the speaker classification label feature y_n corresponding to each frame of the joint characteristic parameters x_n;
32) carry out unsupervised pre-training of the DNN using the layer-wise greedy pre-training method, with ReLU as the activation function of the hidden layers;
33) set the DNN output layer to a softmax classification output, use the speaker classification label features y_n as the supervision for supervised training of the DNN, adjust the network weights with stochastic gradient descent, and minimize the error between the DNN classification output and y_n until convergence, obtaining the DNN for the speaker recognition task;
34) input the joint characteristic parameters x_n into the DNN frame by frame through the feedforward algorithm and extract the activation values of the Bottleneck layer for each frame, i.e. the Bottleneck features b_n corresponding to each frame of mel cepstral characteristic parameters.
Preferably, the VAE model training in step 4) comprises the following steps:
41) use the joint characteristic parameters x_n as the training data of the VAE encoder module and the Bottleneck features b_n as the training data for the decoder module's decoding and reconstruction; in the decoder module of the VAE, use b_n as the control information of the speech spectrum reconstruction process, i.e. splice b_n with the sampled features z_n frame by frame and train the decoder module to reconstruct the speech spectral features;
42) use the ADAM optimizer to optimize the KL divergence and mean squared error in the VAE parameter-estimation process, adjusting the VAE network weights to obtain the VAE speech spectrum conversion model;
43) input the joint characteristic parameters x_n into the VAE speech spectrum conversion model frame by frame and obtain the latent sampled features z_n through the sampling process.
Preferably, obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) splice the sampled features z_n with the speaker classification label features y_n of the corresponding frames as the training data of the Bottleneck feature mapping network; the mapping network uses a structure of one input layer, one hidden layer and one output layer, with a sigmoid activation function in the hidden layer and a linear output layer;
52) according to the minimum mean-squared-error criterion, optimize the weights of the Bottleneck feature mapping network with stochastic gradient descent and error back-propagation, minimizing the error between the network output b̂_n and the per-frame Bottleneck features b_n.
Preferably, obtaining the joint characteristic parameters X_p of the voice to be converted in step 6) is specifically: extract the mel cepstral characteristic parameters of the voice to be converted using AHOcoder, compute on the MATLAB platform the first-order and second-order differences of each extracted frame and splice them with the original features, then splice the resulting parameters with those of the preceding and following frames in the time domain to form the joint characteristic parameters, i.e. the spectral characteristic parameters X_p of the voice to be converted.
Preferably, reconstructing the speech signal in step 9) is specifically: restore the converted speech characteristic parameters X_p' to the mel cepstral feature form, i.e. remove the time-domain context and difference terms, and then synthesize the converted voice with the AHOcoder speech coder.
Advantageous effects of the invention: the present invention is a voice conversion method based on a VAE trained on a non-parallel corpus. It removes the dependence on parallel text, realizes many-to-many speaker conversion, improves flexibility, and solves the technical problem that voice conversion is difficult to realize under conditions of limited resources and equipment. The advantages of the invention are:
1) by modelling and learning with the VAE, the phoneme information in the speech spectral features that is unrelated to speaker identity can be separated into the latent layer, so that the VAE can learn voice conversion from non-parallel voice data; this breaks away from the limitation of traditional conversion models, which need parallel corpus data of the source and target speakers and frame alignment of the spectral features, greatly improves the practicality and flexibility of the conversion system, and provides convenience for designing cross-lingual voice conversion systems;
2) the conversion network obtained by training the VAE can handle a variety of conversion cases; compared with a traditional one-to-one conversion system, a variety of conversion tasks can be completed by training a single model, greatly improving the efficiency of conversion-model training;
3) in the decoder module of the VAE, the Bottleneck features b_n are used as the speaker's personal characteristics to reconstruct the converted speech spectral features; compared with a system that uses the speaker classification label features y_n to characterize speaker identity, the conversion effect and sound quality of the resulting converted speech are better.
Description of the drawings
Fig. 1 is a block diagram of the system training process of the present invention;
Fig. 2 is a block diagram of the system conversion process of the present invention;
Fig. 3 is the structure of the speaker-recognition DNN of the present invention;
Fig. 4 is the structure of the VAE spectral-feature conversion network of the present invention;
Fig. 5 is the structure of the Bottleneck feature mapping network of the present invention;
Fig. 6 is a schematic diagram of parameter estimation in the variational Bayes procedure of the VAE model;
Fig. 7 compares MCD values of converted speech under different conversion cases when different features are used to characterize speaker identity in the VAE model.
Detailed description of the embodiments
The technical scheme of the present invention is further elaborated below with reference to the attached drawings and embodiments.
The present invention adopts the following technical scheme. In the voice conversion method based on a VAE under non-parallel corpus training, the mel cepstral features of the voice are extracted by the AHOcoder speech coder and spliced on the MATLAB platform with their first-order and second-order difference features; the characteristic parameters of the preceding and following frames are then spliced on to form the joint characteristic parameters x_n. x_n is used as training data for a DNN trained on the speaker recognition task; after training converges, x_n is input into the DNN frame by frame and the output of the Bottleneck layer of each frame is taken as the Bottleneck characteristic parameters b_n carrying the speaker's personal characteristics. x_n is then used as the training data of the VAE encoder module and b_n as the training data for the decoder module's decoding and reconstruction; the VAE is trained so that the encoder module obtains, in the latent space z, the phoneme information z_n carrying the semantic features (the sampled feature), and the decoder module reconstructs the speech spectral features from z_n and the Bottleneck features b_n carrying the speaker's personal characteristics. Afterwards, the joint features formed by splicing z_n with the speaker classification label features y_n are used as the training data of a BP network to train the target-speaker Bottleneck feature mapping network, minimizing the error between the network output and the per-frame Bottleneck features b_n. In conversion, the encoder module of the VAE first extracts from the spectral features of the voice to be converted the corresponding phoneme information z_n carrying the semantic features; z_n is spliced frame by frame with the classification label features y_n of the target speaker, and this joint feature is input into the BP network to obtain the Bottleneck features b̂_n of each frame of the target speaker; the joint features formed by splicing z_n with b̂_n frame by frame are then passed through the decoder module of the VAE to reconstruct the converted speech spectral features, which are finally synthesized into voice by AHOcoder.
The method specifically comprises a training step and a voice conversion step:
Fig. 1 is the block diagram of the training process of the system of the present invention. Training step:
1) Use the AHOcoder speech coder to extract the mel cepstral characteristic parameters X of each speaker's voice participating in training;
Extracting the mel cepstral features of the training speakers' voices means using the AHOcoder speech coder to extract them separately and reading the mel cepstral features into the Matlab platform. The present invention uses 19-dimensional mel cepstral features; the voice content of each speaker can be different, and no DTW alignment is needed.
2) Apply difference processing to each frame of the extracted mel cepstral characteristic parameters X and splice it with the original parameters; then splice the resulting parameters with those of the preceding and following frames in the time domain to form the joint characteristic parameters x_n;
Compute the first-order and second-order differences of each extracted frame of characteristic parameters X and splice them with the original features to obtain the 57-dimensional difference characteristic parameters X_t = (X, ΔX, Δ²X); then splice X_t with the characteristic parameters of the preceding and following frames to form the 171-dimensional joint characteristic parameters x_n = (X_{t-1}, X_t, X_{t+1}).
3) Use the joint characteristic parameters x_n and the speaker classification label features y_n to train the DNN; adjust the DNN weights to reduce the classification error until the network converges, obtaining the DNN for the speaker recognition task; extract the Bottleneck features b_n of each frame;
The structure of the Bottleneck feature extraction DNN used in the present invention is shown in Fig. 3. The number of input-layer nodes corresponds to the dimension of the spectral features participating in training, and the output is a softmax classification over speakers, with the number of output nodes depending on the number of speakers participating in training. Extracting the Bottleneck features b_n comprises the following steps:
31) obtain on the MATLAB platform the speaker classification label feature y_n corresponding to each frame of x_n; at this point no distinction is made between source and target speakers, and the per-frame characteristic parameters are only distinguished by the speaker classification label features y_n;
32) the DNN is a fully connected neural network; a 9-layer DNN model is used, with 171 input nodes corresponding to the 171-dimensional per-frame features x_n and 7 intermediate hidden layers whose node counts are 1200, 1200, 1200, 57, 1200, 1200 and 1200, respectively; the hidden layer with fewer nodes is the Bottleneck layer. Layer-wise greedy pre-training is used to carry out unsupervised pre-training of the connection weights between the layers of the DNN, and the hidden-layer activation function is ReLU, which is closer to the behaviour of biological neurons:
f(x) = max(0, x)
Because the ReLU function has one-sided suppression, sparse activation and a relatively wide excitation boundary, it is considered to have a stronger ability to express primitive features.
The activation value of the (k+1)-th hidden layer is: h^{k+1} = f(w^k h^k + B^k)
where h^{k+1} and h^k are the activation values of the (k+1)-th and k-th hidden layers respectively, w^k is the connection weight between the (k+1)-th and k-th layers, and B^k is the bias of the k-th layer.
33) set the DNN output layer to a softmax classification output; spectral feature parameters of 100 utterances from each of 5 speakers are selected for training, so the output layer has 5 nodes corresponding to the label features of the 5 speakers; use the speaker classification label features y_n as the supervision for supervised training of the DNN, adjust the network weights with stochastic gradient descent, and minimize the error between the DNN classification output and y_n until convergence, obtaining the DNN for the speaker recognition task, i.e. the Bottleneck feature extraction network;
34) input x_n into the DNN frame by frame through the feedforward algorithm and extract the activation values of the Bottleneck layer for each frame, i.e. the Bottleneck features b_n corresponding to each frame of characteristic parameters. In the present invention the Bottleneck layer is the 4th hidden layer, i.e.:
b_n = f(w^3 h^3 + B^3)
where h^3 is the activation value of the 3rd hidden layer, w^3 is the connection weight between the 3rd and 4th layers, and B^3 is the bias of the 3rd layer.
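The feedforward extraction of the bottleneck activations can be sketched as follows (a minimal numpy illustration; randomly initialized weights stand in for the trained speaker-recognition DNN, and the layer sizes follow the text up to the 57-node bottleneck layer):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Sizes from the text: input 171, three 1200-node hidden layers,
# then the 4th hidden layer (57 nodes), which is the bottleneck.
sizes = [171, 1200, 1200, 1200, 57]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def bottleneck_features(x_n):
    """Feed a batch of 171-dim joint features forward up to the
    bottleneck layer and return its ReLU activations b_n."""
    h = x_n
    for w, b in zip(weights, biases):
        h = relu(h @ w + b)
    return h

b_n = bottleneck_features(rng.standard_normal((10, 171)))
```

In the patent the remaining layers up to the softmax output exist only for training; at extraction time the forward pass stops at the bottleneck layer, as shown here.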
4) Use the joint characteristic parameters x_n and the corresponding per-frame Bottleneck features b_n to train the VAE model until training converges, and extract the sampled features z_n of each frame from the latent space z of the VAE;
The variational autoencoder (Variational Auto-encoder, VAE) used in the present invention is a generative learning method. The concrete structure of the model used in the present invention is shown in Fig. 4, where x_{s,n} denotes the characteristic parameters of the source voice, x̂_{t,n} denotes the characteristic parameters of the target speaker's voice obtained after conversion, b_n denotes the target speaker's per-frame Bottleneck features, μ and σ are the vector representations of the mean and covariance of each Gaussian component, z denotes the latent space of the VAE obtained through the sampling process, and z_n is the sampled feature. The parameter estimation process of VAE training is shown in Fig. 6.
VAE training comprises the following steps:
41) use the joint characteristic parameters x_n as the training data of the VAE encoder module and the Bottleneck features b_n as the training data for the decoder module's decoding and reconstruction; in the decoder module of the VAE, use b_n as the control information of the speech spectrum reconstruction process, i.e. splice b_n with the sampled features z_n frame by frame and train the decoder module to reconstruct the speech spectral features;
In the present invention the encoder input layer of the VAE has 171 nodes, followed by two hidden layers: the first has 500 nodes and the second has 64 nodes. In the second hidden layer, the first 32 nodes compute the mean of each Gaussian component and the last 32 nodes compute the variance of each component (the neural network thereby computes a Gaussian distribution that better fits the input signal). The latent space layer contains 32 nodes, the value of each obtained by sampling from the second hidden layer. The decoder module contains one hidden layer of 500 nodes and an output layer of 171 nodes. Except for the latent space layer, which has a linear output, the hidden layers use ReLU activations;
42) according to the variational Bayes principle of the VAE, use the ADAM optimizer to optimize the KL (Kullback-Leibler) divergence and mean squared error in the parameter estimation of the VAE model shown in Fig. 4, adjusting the VAE network weights to obtain the VAE speech spectrum conversion model;
43) input the joint characteristic parameters x_n into the VAE speech spectrum conversion model frame by frame and obtain the latent sampled features z_n through the sampling process.
Put more intuitively, the method modulates the phoneme information z_n carrying the semantic features with the speaker personal characteristics b_n in the decoder module of the VAE.
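The encoder-plus-sampling step can be sketched as follows (a minimal numpy illustration with the 171-500-32 encoder sizes from the text; the log-variance parameterization, the omission of biases and the standard-normal prior for the KL term are assumptions, and the random weights stand in for the trained encoder):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0.0, x)

# Encoder sizes from the text: 171 -> 500 -> (32 mean + 32 variance nodes).
w1 = rng.standard_normal((171, 500)) * 0.01
w_mu = rng.standard_normal((500, 32)) * 0.01
w_logvar = rng.standard_normal((500, 32)) * 0.01

def encode_and_sample(x_n):
    """Map one frame to the 32-dim latent space and draw z_n with the
    reparameterization trick z = mu + sigma * eps."""
    h = relu(x_n @ w1)
    mu, logvar = h @ w_mu, h @ w_logvar
    eps = rng.standard_normal(mu.shape)
    z_n = mu + np.exp(0.5 * logvar) * eps
    # KL divergence of N(mu, sigma^2) from the N(0, I) prior,
    # summed over latent components (always non-negative)
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return z_n, kl

z_n, kl = encode_and_sample(rng.standard_normal(171))
```

During training the KL term computed here would be minimized together with the decoder's mean squared reconstruction error, as step 42) describes.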
5) Splice the sampled features z_n with the corresponding per-frame speaker classification label features y_n to obtain the training data of the Bottleneck feature mapping network (BP network), and train the mapping network with the per-frame Bottleneck features b_n as supervision; minimize the output error of the mapping network by stochastic gradient descent, obtaining the Bottleneck feature mapping network;
The target-speaker Bottleneck feature mapping network used in the present invention is a BP network whose structure is shown in Fig. 5. The input parameters are z_n + y_n, where z_n is the latent feature of the variational autoencoder and y_n is the label feature of the speakers participating in training; the output is the target speaker's Bottleneck features b_n. Obtaining the Bottleneck feature mapping network comprises the following steps:
51) the sampling features zn of the VAE latent space are spliced with the classification label features yn of the speaker of each corresponding frame as the training data of the Bottleneck feature mapping network. The Bottleneck feature mapping network is a three-layer fully connected feedforward neural network comprising one input layer, one hidden layer and one output layer. The input layer has 37 nodes, of which 32 correspond to the dimension of the sampling feature zn of the VAE model and 5 correspond to the 5-dimensional speaker classification label feature yn formed by the five speakers participating in training; the output layer has 57 nodes, corresponding to the 57-dimensional Bottleneck feature; in between is one hidden layer with 1200 nodes. The activation function of the hidden layer is the sigmoid function, which introduces a nonlinearity, and the output layer is linear;
The expression of the sigmoid function is:
f(x) = 1/(1+e^(-x))
52) according to the minimum mean-squared-error criterion, the stochastic gradient descent algorithm with back-propagation of errors is used to optimize the weights of the Bottleneck feature mapping network, minimizing the error between the Bottleneck features b̂n output by the network and the Bottleneck features bn of each frame, i.e.:
E = (1/N) Σ_(n=1..N) ‖b̂n − bn‖²
The weights of the whole network are optimized, finally yielding a BP mapping network that obtains the target speaker's Bottleneck features b̂n from the sampling features zn and the classification label features yn of the target speaker.
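The forward pass of the mapping network in steps 51)-52) can be sketched in NumPy with the dimensions given above (37-dim input = 32-dim zn plus a 5-dim speaker label, 1200-node sigmoid hidden layer, 57-dim linear output). The weights below are random placeholders standing in for values learned by back-propagated stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

W1 = rng.normal(0, 0.01, (37, 1200))   # input -> sigmoid hidden layer
W2 = rng.normal(0, 0.01, (1200, 57))   # hidden -> linear output

def map_bottleneck(z_n, y_n):
    u = np.concatenate([z_n, y_n], axis=-1)   # splice z_n with speaker label y_n
    return sigmoid(u @ W1) @ W2               # predicted Bottleneck feature b̂_n

z_n = rng.standard_normal((1, 32))            # one frame's latent sampling feature
y_n = np.eye(5)[[2]]                          # one-hot label of the target speaker
b_hat = map_bottleneck(z_n, y_n)              # shape (1, 57)

# Training criterion: mean squared error against the true Bottleneck b_n
b_n = rng.standard_normal((1, 57))
mse = np.mean((b_hat - b_n) ** 2)
```

At conversion time the same forward pass is reused with the target speaker's label in place of the source speaker's.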
The trained DNN network, VAE network and Bottleneck feature mapping network are combined to constitute the speech conversion system based on the VAE and Bottleneck features;
Voice conversion is realized according to the spectrum conversion flow shown in Fig. 2; the voice conversion steps are:
6) the joint feature parameters Xp of the speech of the source speaker to be converted are passed through the encoder module of the VAE model to obtain the sampling features zn of each frame of the latent space z;
The joint feature parameters Xp of the speech to be converted are obtained as follows: the mel-cepstral feature parameters of the speech to be converted are extracted with AHOcoder; on the MATLAB platform, first- and second-order differences of each frame's feature parameters are computed and spliced with the original features; the spliced feature parameters are then spliced in the time domain with the feature parameters of the preceding and following frames to form the joint feature parameters, i.e. the feature parameters Xp of the speech spectrum to be converted.
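The joint-feature construction just described can be sketched in NumPy: each frame's mel-cepstral vector X is extended with first- and second-order differences to Xt = (X, ΔX, Δ²X), then spliced with its neighbouring frames to xn = (Xt-1, Xt, Xt+1). The simple `np.gradient` difference and the edge-frame repetition are illustrative choices, not specified in the text.

```python
import numpy as np

def joint_features(mcep):
    """mcep: (T, D) mel-cepstra -> (T, 9*D) joint feature parameters."""
    d1 = np.gradient(mcep, axis=0)                 # first-order difference ΔX
    d2 = np.gradient(d1, axis=0)                   # second-order difference Δ²X
    xt = np.concatenate([mcep, d1, d2], axis=1)    # X_t = (X, ΔX, Δ²X)
    prev = np.vstack([xt[:1], xt[:-1]])            # X_{t-1} (first frame repeated)
    nxt = np.vstack([xt[1:], xt[-1:]])             # X_{t+1} (last frame repeated)
    return np.concatenate([prev, xt, nxt], axis=1) # x_n = (X_{t-1}, X_t, X_{t+1})

frames = np.random.default_rng(2).standard_normal((100, 19))
x = joint_features(frames)                         # shape (100, 171)
```

With a 19-dimensional mel-cepstrum the nine-fold splicing yields a 171-dimensional joint feature, matching the VAE's 171-node output layer; the base dimension of 19 is inferred from that figure rather than stated in this passage.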
7) the sampling features zn and the classification label features yn of the target speaker are spliced frame by frame and input into the Bottleneck feature mapping network, obtaining the Bottleneck features b̂n of the target speaker;
8) the Bottleneck features b̂n and the sampling features zn are spliced frame by frame, and the decoder module of the VAE model reconstructs the joint feature parameters Xp' of the converted speech;
9) the speech signal is reconstructed with the AHOcoder speech coder.
Reconstructing the speech signal is specifically: the speech feature parameters Xp' obtained after conversion are reduced to mel-cepstral form by removing the time-domain splicing and the difference terms, and the speech coder AHOcoder is then used to synthesize the converted speech.
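Conversion steps 6)-9) reduce to a chain of frame-wise matrix operations. In the schematic NumPy sketch below, three random linear maps stand in for the trained VAE encoder, Bottleneck mapping network and VAE decoder, so only the data flow is illustrated, not a real conversion.

```python
import numpy as np

rng = np.random.default_rng(3)
W_enc = rng.normal(0, 0.1, (171, 32))      # stand-in for the VAE encoder module
W_map = rng.normal(0, 0.1, (37, 57))       # stand-in for the mapping network
W_dec = rng.normal(0, 0.1, (89, 171))      # stand-in for the VAE decoder module

def convert(X_p, y_target):
    """X_p: (T, 171) source joint features; y_target: (5,) speaker one-hot."""
    Z = X_p @ W_enc                                    # step 6: sampling features z_n
    Y = np.repeat(y_target[None, :], len(Z), axis=0)   # target label for every frame
    B_hat = np.concatenate([Z, Y], axis=1) @ W_map     # step 7: target Bottleneck b̂_n
    return np.concatenate([B_hat, Z], axis=1) @ W_dec  # step 8: converted spectra X_p'

X_p = rng.standard_normal((50, 171))
X_conv = convert(X_p, np.eye(5)[1])  # step 9 would strip deltas and pass to AHOcoder
```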
Mel-cepstral distortion (Mel-Cepstrum Distortion, MCD) is an objective measure of conversion quality in voice conversion. The smaller the MCD value between the converted speech and the target speech, the better the conversion performance of the corresponding speech conversion system. Fig. 7 compares the MCD values of converted speech, under different conversion conditions, obtained by VAE models trained on non-parallel corpora with different feature parameters characterizing the speaker's personal characteristics. As can be seen from the figure, the system that uses Bottleneck features to characterize the speaker's individual information performs better than the system that uses speaker labels.
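The MCD comparison above uses the standard formula MCD = (10/ln 10)·sqrt(2·Σ_d (c_d − c′_d)²), averaged over time-aligned frames. A NumPy sketch follows; excluding the 0th (energy) coefficient is a common convention assumed here, not stated in the text.

```python
import numpy as np

def mcd(c_ref, c_conv):
    """Aligned (T, D) mel-cepstra -> mean mel-cepstral distortion in dB."""
    diff = c_ref[:, 1:] - c_conv[:, 1:]    # drop the 0th (energy) coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))

ref = np.random.default_rng(4).standard_normal((100, 20))
print(round(mcd(ref, ref), 3))             # identical sequences give 0.0 dB
```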
The variational autoencoder (VAE) is a generative learning network model in deep learning. Compared with other deep-learning models such as the deep belief network (DBN) and the convolutional neural network (CNN), during training the encoder of a VAE model can learn, via the variational Bayes principle, a probability distribution that fits the original input signal, obtain latent-space features of the original signal through a sampling process, and then reconstruct the original signal from the sampled features through the decoder, so that the error between the reconstructed signal and the original signal is as small as possible (or the difference between their probability distributions is small). This property of the VAE model can be applied to style transfer; in voice conversion, it is believed that the VAE model can isolate, in the latent space, phoneme information that is independent of the speaker's personal characteristics and related to the semantic features, and that this latent-space information, combined with parameters characterizing the speaker's personal characteristics, can be used to reconstruct the speech spectrum signal. In the present invention, the Bottleneck features extracted by the DNN network based on a speaker-recognition task characterize the speaker's personal characteristics; the mapping network obtained by BP-network training establishes the mapping between the joint feature formed by the phoneme information and the speaker label, and the Bottleneck features, so that the target speaker's Bottleneck features are obtained indirectly from the source speaker's speech spectrum features; finally, the decoder module of the VAE reconstructs the converted speech spectrum features from the latent-space phoneme information and the target speaker's Bottleneck features.
Aiming at the problems of the traditional Gaussian-mixture-model conversion method and of methods realizing speech-spectrum conversion with a BP network, namely that a parallel corpus is required and that DTW alignment must be performed before model training, the present invention, exploiting the characteristics of the VAE model, proposes a voice conversion method under non-parallel corpora. The present invention has three key points: first, extracting the Bottleneck features characterizing the speaker's personal characteristics with a DNN network based on a speaker-recognition task; second, establishing with a BP neural network the mapping between the joint feature composed of the sampling feature zn and the speaker classification label feature yn, and the Bottleneck features b̂n; third, reconstructing, with the decoder module of the trained VAE model, the joint feature composed of the Bottleneck features b̂n and the sampling features zn into the converted speech spectrum features.
The innovations of the present invention are: 1. exploiting the property of the VAE model that phoneme information independent of the speaker's personal characteristics and related to the semantic features can be isolated in the latent space, so that voice conversion under non-parallel corpus training is realized, and the method can complete multiple conversion tasks for different speakers with a single model training; 2. using the Bottleneck features extracted from the DNN network based on a speaker-recognition task as the speaker's personal characteristics in the decoder reconstruction process of the VAE model, improving voice conversion performance.
For some medical assistance systems, for example when providing vocal-assistance equipment for patients who cannot phonate normally because of physiological defects or diseases of the vocal organs, some principles of the method involved in the present invention may be adopted. The present invention has good extensibility and provides a concrete solution to problems in voice conversion, including the many-to-many (M2M) voice conversion problem.
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements can also be made without departing from the principle of the present invention, and these improvements should also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A voice conversion method based on a VAE under non-parallel corpus training, characterized by comprising a training step and a voice conversion step:
Training step:
1) the AHOcoder speech coder is used to extract the mel-cepstral feature parameters X of the speech of each speaker participating in training;
2) difference processing is performed on each frame of the extracted mel-cepstral feature parameters X and the result is spliced with the original feature parameters X; the spliced feature parameters Xt obtained in the time domain are then spliced with the feature parameters of the preceding and following frames to form the joint feature parameters xn;
3) the joint feature parameters xn and the speaker classification label features yn are used to train a DNN network, adjusting the network weights to reduce the classification error until the network converges, obtaining a DNN network based on a speaker-recognition task, and the Bottleneck features bn of each frame are extracted;
4) the joint feature parameters xn and the corresponding Bottleneck features bn of each frame are used to train a VAE model until the model training converges, and the sampling features zn of each frame of the VAE latent space z are extracted;
5) the sampling features zn are spliced with the classification label features yn of the speaker of each corresponding frame to form the training data of a Bottleneck feature mapping network, with the Bottleneck features bn of each frame as the supervision information for its training; the output error of the Bottleneck feature mapping network is minimized by a stochastic gradient descent algorithm, obtaining the Bottleneck feature mapping network;
Voice conversion step:
6) the joint feature parameters Xp of the speech to be converted are passed through the encoder module of the VAE model to obtain the sampling features zn of each frame of the latent space z;
7) the sampling features zn and the classification label features yn of the target speaker are spliced frame by frame and input into the Bottleneck feature mapping network, obtaining the Bottleneck features b̂n of the target speaker;
8) the Bottleneck features b̂n and the sampling features zn are spliced frame by frame, and the decoder module of the VAE model reconstructs the joint feature parameters Xp' of the converted speech;
9) the AHOcoder speech coder is used to reconstruct the speech signal.
2. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that in step 1) the mel-cepstral features of the speech of the speakers participating in training are extracted with the AHOcoder speech coder and read into the Matlab platform.
3. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the joint feature parameters in step 2) is specifically: first- and second-order differences of each frame of the extracted feature parameters X are computed and spliced with the original feature parameters X to obtain the feature parameters Xt=(X, ΔX, Δ²X); the feature parameters Xt obtained by splicing in the time domain are then spliced with the feature parameters of the preceding and following frames to form the joint feature parameters xn=(Xt-1, Xt, Xt+1).
4. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that extracting the Bottleneck features bn in step 3) comprises the following steps:
31) the classification label features yn of the speaker corresponding to each frame of the joint feature parameters xn are obtained on the MATLAB platform;
32) unsupervised layer-wise greedy pre-training is performed on the DNN network, with the ReLU function as the activation function of the hidden layers;
33) the output layer of the DNN network is set to a softmax classification output; with the speaker classification label features yn as the supervision information for supervised training of the DNN network, the network weights are adjusted with a stochastic gradient descent algorithm, minimizing the error between the DNN classification output and the speaker classification label features yn until convergence, obtaining the DNN network based on the speaker-recognition task;
34) the joint feature parameters xn are input frame by frame into the DNN network by the feed-forward algorithm, and the activation values of the Bottleneck layer corresponding to each frame, i.e. the Bottleneck features bn corresponding to each frame's mel-cepstral feature parameters, are extracted.
5. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that training the VAE model in step 4) comprises the following steps:
41) the joint feature parameters xn serve as the training data of the encoder module of the VAE model, and the Bottleneck features bn as the training data for the decoding and reconstruction of the decoder module; in the decoder module of the VAE model, the Bottleneck features bn are used as the control information of the speech-spectrum reconstruction process, i.e. the Bottleneck features bn and the sampling features zn are spliced frame by frame and the decoder module of the VAE model is trained to reconstruct the speech spectrum features;
42) the ADAM optimizer is used to optimize the KL divergence and the mean squared error in the parameter estimation of the VAE model, adjusting the network weights of the VAE model to obtain the VAE speech-spectrum conversion model;
43) the joint feature parameters xn are input frame by frame into the VAE speech-spectrum conversion model, and the latent sampling features zn are obtained through the sampling process.
6. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) the sampling features zn of the VAE speech-spectrum conversion model are spliced with the classification label features yn of the speaker of each corresponding frame as the training data of the Bottleneck feature mapping network; the Bottleneck feature mapping network adopts a structure of one input layer, one hidden layer and one output layer, the activation function of the hidden layer is the sigmoid function, and the output layer is linear;
52) according to the minimum mean-squared-error criterion, the stochastic gradient descent algorithm with back-propagation of errors is used to optimize the weights of the Bottleneck feature mapping network, minimizing the error between the Bottleneck features b̂n output by the network and the Bottleneck features bn of each frame.
7. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the joint feature parameters Xp of the speech to be converted in step 6) is specifically: the mel-cepstral feature parameters of the speech to be converted are extracted with AHOcoder; on the MATLAB platform, first- and second-order differences of each frame's feature parameters are computed and spliced with the original features; the spliced feature parameters are then spliced in the time domain with the feature parameters of the preceding and following frames to form the joint feature parameters, i.e. the feature parameters Xp of the speech spectrum to be converted.
8. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that reconstructing the speech signal in step 9) is specifically: the speech feature parameters Xp' obtained after conversion are reduced to mel-cepstral form by removing the time-domain splicing and the difference terms, and the AHOcoder speech coder is then used to synthesize the converted speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810393556.XA CN108777140B (en) | 2018-04-27 | 2018-04-27 | Voice conversion method based on VAE under non-parallel corpus training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108777140A true CN108777140A (en) | 2018-11-09 |
CN108777140B CN108777140B (en) | 2020-07-28 |
Family
ID=64026673
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810393556.XA Active CN108777140B (en) | 2018-04-27 | 2018-04-27 | Voice conversion method based on VAE under non-parallel corpus training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108777140B (en) |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder |
CN109377986A (en) * | 2018-11-29 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of non-parallel corpus voice personalization conversion method |
CN109377978A (en) * | 2018-11-12 | 2019-02-22 | 南京邮电大学 | Multi-to-multi voice conversion method under non-parallel text condition based on i vector |
CN109584893A (en) * | 2018-12-26 | 2019-04-05 | 南京邮电大学 | Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector |
CN109671442A (en) * | 2019-01-14 | 2019-04-23 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN Yu x vector |
CN110033096A (en) * | 2019-03-07 | 2019-07-19 | 北京大学 | A kind of status data generation method and system for intensified learning |
CN110047501A (en) * | 2019-04-04 | 2019-07-23 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on beta-VAE |
CN110060691A (en) * | 2019-04-16 | 2019-07-26 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on i vector sum VARSGAN |
CN110060690A (en) * | 2019-04-04 | 2019-07-26 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN and ResNet |
CN110070895A (en) * | 2019-03-11 | 2019-07-30 | 江苏大学 | A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition |
CN110085254A (en) * | 2019-04-22 | 2019-08-02 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on beta-VAE and i-vector |
CN110164463A (en) * | 2019-05-23 | 2019-08-23 | 北京达佳互联信息技术有限公司 | A kind of phonetics transfer method, device, electronic equipment and storage medium |
CN110211575A (en) * | 2019-06-13 | 2019-09-06 | 苏州思必驰信息科技有限公司 | Voice for data enhancing adds method for de-noising and system |
CN110648658A (en) * | 2019-09-06 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Method and device for generating voice recognition model and electronic equipment |
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111724809A (en) * | 2020-06-15 | 2020-09-29 | 苏州意能通信息技术有限公司 | Vocoder implementation method and device based on variational self-encoder |
CN112309365A (en) * | 2020-10-21 | 2021-02-02 | 北京大米科技有限公司 | Training method and device of speech synthesis model, storage medium and electronic equipment |
CN112382271A (en) * | 2020-11-30 | 2021-02-19 | 北京百度网讯科技有限公司 | Voice processing method, device, electronic equipment and storage medium |
CN113032558A (en) * | 2021-03-11 | 2021-06-25 | 昆明理工大学 | Variational semi-supervised hundred-degree encyclopedia classification method fusing wiki knowledge |
CN113299267A (en) * | 2021-07-26 | 2021-08-24 | 北京语言大学 | Voice stimulation continuum synthesis method and device based on variational self-encoder |
WO2021212954A1 (en) * | 2020-04-21 | 2021-10-28 | 升智信息科技(南京)有限公司 | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources |
CN113571039A (en) * | 2021-08-09 | 2021-10-29 | 北京百度网讯科技有限公司 | Voice conversion method, system, electronic equipment and readable storage medium |
CN113763924A (en) * | 2021-11-08 | 2021-12-07 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
CN113763987A (en) * | 2021-09-06 | 2021-12-07 | 中国科学院声学研究所 | Training method and device of voice conversion model |
CN114360557A (en) * | 2021-12-22 | 2022-04-15 | 北京百度网讯科技有限公司 | Voice tone conversion method, model training method, device, equipment and medium |
WO2022083083A1 (en) * | 2020-10-21 | 2022-04-28 | 南京硅基智能科技有限公司 | Sound conversion system and training method for same |
US11854562B2 (en) | 2019-05-14 | 2023-12-26 | International Business Machines Corporation | High-quality non-parallel many-to-many voice conversion |
WO2024069726A1 (en) * | 2022-09-27 | 2024-04-04 | 日本電信電話株式会社 | Learning device, conversion device, training method, conversion method, and program |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04248598A (en) * | 1990-10-30 | 1992-09-04 | Internatl Business Mach Corp <Ibm> | Device and method of editing composite sound information |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN103258531A (en) * | 2013-05-29 | 2013-08-21 | 安宁 | Harmonic wave feature extracting method for irrelevant speech emotion recognition of speaker |
CN104123933A (en) * | 2014-08-01 | 2014-10-29 | 中国科学院自动化研究所 | Self-adaptive non-parallel training based voice conversion method |
CN104361620A (en) * | 2014-11-27 | 2015-02-18 | 韩慧健 | Mouth shape animation synthesis method based on comprehensive weighted algorithm |
WO2016207978A1 (en) * | 2015-06-23 | 2016-12-29 | 株式会社大入 | Method and device for manufacturing book with audio, and method and device for reproducing acoustic waveform |
US20170069306A1 (en) * | 2015-09-04 | 2017-03-09 | Foundation of the Idiap Research Institute (IDIAP) | Signal processing method and apparatus based on structured sparsity of phonological features |
CN106778700A (en) * | 2017-01-22 | 2017-05-31 | 福州大学 | One kind is based on change constituent encoder Chinese Sign Language recognition methods |
CN107274029A (en) * | 2017-06-23 | 2017-10-20 | 深圳市唯特视科技有限公司 | A kind of future anticipation method of interaction medium in utilization dynamic scene |
CN107301859A (en) * | 2017-06-21 | 2017-10-27 | 南京邮电大学 | Phonetics transfer method under the non-parallel text condition clustered based on adaptive Gauss |
Non-Patent Citations (1)
Title |
---|
LI YANPING: "An adaptive frequency scale transform suitable for speaker identification", Journal of Nanjing University of Science and Technology * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN112767958B (en) | Zero-order learning-based cross-language tone conversion system and method | |
CN102800316B (en) | Optimal codebook design method for a voiceprint recognition system based on a neural network | |
CN110060690A (en) | Many-to-many voice conversion method based on StarGAN and ResNet | |
CN109671442A (en) | Many-to-many voice conversion method based on StarGAN and x-vector | |
Luo et al. | Emotional voice conversion using deep neural networks with MCC and F0 features | |
Casale et al. | Multistyle classification of speech under stress using feature subset selection based on genetic algorithms | |
CN109671423A (en) | Non-parallel-text voice conversion method under limited training data | |
CN110060657A (en) | Many-to-many voice conversion method based on SN | |
Azizah et al. | Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages | |
Liu et al. | Voice conversion with transformer network | |
Lai et al. | Phone-aware LSTM-RNN for voice conversion | |
Moon et al. | Mist-tacotron: End-to-end emotional speech synthesis using mel-spectrogram image style transfer | |
Wang et al. | A study on acoustic modeling for child speech based on multi-task learning | |
Bi et al. | Deep feed-forward sequential memory networks for speech synthesis | |
Latif et al. | Generative emotional AI for speech emotion recognition: The case for synthetic emotional speech augmentation | |
Guo et al. | Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training | |
Choi et al. | A melody-unsupervision model for singing voice synthesis | |
Kang et al. | Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion | |
CN113470622A (en) | Method and device for converting an arbitrary voice into multiple voices | |
Zhao et al. | Research on voice cloning with a few samples | |
Wu et al. | Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression | |
Li et al. | Many-to-many voice conversion based on bottleneck features with variational autoencoder for non-parallel training data | |
Nazir et al. | Deep learning end to end speech synthesis: A review | |
Kaushik et al. | End-to-end speaker age and height estimation using attention mechanism and triplet loss |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||