CN109671442A - Many-to-many voice conversion method based on STARGAN and x-vectors - Google Patents

Many-to-many voice conversion method based on STARGAN and x-vectors

Info

Publication number
CN109671442A
Authority
CN
China
Prior art keywords
vector
speaker
generator
feature
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910030578.4A
Other languages
Chinese (zh)
Other versions
CN109671442B (en)
Inventor
Li Yanping (李燕萍)
Cao Pan (曹盼)
Zhang Yan (张燕)
Xu Dongxiang (徐东祥)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201910030578.4A priority Critical patent/CN109671442B/en
Publication of CN109671442A publication Critical patent/CN109671442A/en
Application granted granted Critical
Publication of CN109671442B publication Critical patent/CN109671442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 - Adapting to target pitch
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 - Adapting to target pitch
    • G10L2021/0135 - Voice conversion or morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a many-to-many voice conversion method based on STARGAN and x-vectors, comprising a training stage and a conversion stage. Combining STARGAN with x-vectors to build the voice conversion system better improves the speaker similarity and speech quality of the converted voice; since x-vectors characterize short utterances particularly well, the conversion quality is higher, and the over-smoothing problem of C-VAE is overcome, realizing a high-quality voice conversion method. Furthermore, the method can convert voices under non-parallel text conditions, and the training process requires no alignment, which improves the versatility and practicality of the voice conversion system. The method can also integrate multiple source-target speaker pairs into a single conversion model, realizing many-to-many voice conversion among multiple speakers, and has promising applications in cross-lingual voice conversion, film dubbing, speech translation and related fields.

Description

Many-to-many voice conversion method based on STARGAN and x-vectors
Technical field
The present invention relates to a many-to-many voice conversion method, and in particular to a many-to-many voice conversion method based on STARGAN and x-vectors.
Background art
Voice conversion is a research branch of speech signal processing, developed as an extension of research on speech analysis, recognition and synthesis. The goal of voice conversion is to change the personal voice characteristics of a source speaker into those of a target speaker, i.e. to make speech uttered by one person sound, after conversion, as if it were spoken by another person, while preserving the semantic content.
Voice conversion technology has matured through years of research and produced many classical conversion methods, including the Gaussian mixture model (Gaussian Mixture Model, GMM), the recurrent neural network (Recurrent Neural Network, RNN) and the deep neural network (Deep Neural Networks, DNN). However, most of these methods require parallel training corpora: the source speaker and the target speaker must utter sentences with the same linguistic content and duration, with pronunciation rhythm and mood kept as consistent as possible. The alignment accuracy of the speech feature parameters during training then becomes a limiting factor of conversion performance. Moreover, in practical applications such as cross-lingual conversion or medically assisted conversion of patients' voices, parallel speech cannot be obtained at all. Therefore, whether considered from the versatility or from the practicality of voice conversion systems, research on voice conversion methods under non-parallel text conditions has great practical significance and application value.
Existing voice conversion methods under non-parallel text conditions include methods based on Cycle-Consistent Adversarial Networks (Cycle-GAN) and methods based on the Conditional Variational Auto-Encoder (C-VAE). The C-VAE based method builds the conversion system directly from the speakers' identity labels: the encoder separates the semantic and speaker information of the speech, and the decoder reconstructs the speech from the semantics and the speaker identity label, thereby removing the dependence on parallel text. However, because C-VAE rests on the idealized assumption that the observed data follow a Gaussian distribution, the decoder output is over-smoothed and the quality of the converted speech is not high. The Cycle-GAN based method uses an adversarial loss together with a cycle-consistency loss to learn the forward and inverse mappings of the acoustic features simultaneously, which effectively alleviates over-smoothing and improves converted speech quality, but Cycle-GAN can only realize one-to-one voice conversion.
The voice conversion method based on the Star Generative Adversarial Network (STARGAN) model combines the advantages of C-VAE and Cycle-GAN: because its generator has an encoder-decoder structure it can learn many-to-many mappings simultaneously, and the attributes of the generator output are controlled by speaker identity labels, so many-to-many voice conversion under non-parallel conditions can be realized. However, speaker identity labels cannot fully express the personalized characteristics of a speaker, so the speaker similarity of the converted speech is still not greatly improved by this method.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to provide a many-to-many speaker conversion method based on STARGAN and x-vectors that can fully express the personalized features of speakers and effectively improve the speaker similarity of the converted speech.
Technical solution: the many-to-many voice conversion method based on STARGAN and x-vectors according to the present invention comprises a training stage and a conversion stage, the training stage comprising the following steps:
(1.1) obtaining a training corpus composed of the corpora of several speakers, including source speakers and target speakers;
(1.2) extracting, through the WORLD speech analysis/synthesis model, the spectral envelope feature x of each speaker's sentences in the training corpus, the fundamental frequency feature, and the x-vector X-vector representing each speaker's personalized features;
(1.3) inputting the source speaker's spectral envelope feature x_s, the target speaker's spectral envelope feature x_t, the source speaker's label c_s and x-vector X-vector_s, and the target speaker's label c_t and x-vector X-vector_t into a STARGAN network for training, the STARGAN network consisting of a generator G, a discriminator D and a classifier C, and the generator G consisting of an encoding network and a decoding network;
(1.4) making the loss function of the generator, the loss function of the discriminator and the loss function of the classifier as small as possible during training, until the set number of iterations is reached, to obtain the trained STARGAN network;
(1.5) constructing the fundamental frequency transfer function from the source speaker's speech fundamental frequency to the target speaker's speech fundamental frequency;
the conversion stage comprising the following steps:
(2.1) extracting, through the WORLD speech analysis/synthesis model, the spectral envelope feature x_s', the aperiodicity feature and the fundamental frequency from the source speaker's voice in the corpus to be converted;
(2.2) inputting the source speaker's spectral envelope feature x_s', the target speaker's label feature c_t' and the target speaker's x-vector X-vector_t' into the STARGAN network trained in (1.4), to reconstruct the target speaker's spectral envelope feature x_tc';
(2.3) converting the source speaker's fundamental frequency extracted in (2.1) into the fundamental frequency of the target speaker through the fundamental frequency transfer function obtained in (1.5);
(2.4) synthesizing the converted speaker's voice through the WORLD speech analysis/synthesis model from the target speaker's spectral envelope feature x_tc' obtained in (2.2), the target speaker's fundamental frequency obtained in (2.3) and the aperiodicity feature extracted in (2.1).
Further, the training process in steps (1.3) and (1.4) comprises the following steps:
(1) inputting the source speaker's spectral envelope feature x_s into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s);
(2) inputting the semantic feature G(x_s) obtained above, together with the target speaker's label feature c_t and x-vector X-vector_t, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the target speaker's spectral envelope feature x_tc;
(3) inputting the target speaker's spectral envelope feature x_tc obtained above into the encoding network of the generator G again, to obtain the speaker-independent semantic feature G(x_tc);
(4) inputting the semantic feature G(x_tc) obtained above, together with the source speaker's label feature c_s and x-vector X-vector_s, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the reconstructed source speaker's spectral envelope feature x_sc;
(5) inputting the target speaker's spectral envelope feature x_tc, the target speaker's spectral feature x_t and the target speaker's label feature c_t together into the discriminator D for training, minimizing the loss function of the discriminator;
(6) inputting the target speaker's spectral envelope feature x_tc and the target speaker's spectral envelope feature x_t into the classifier C for training, minimizing the loss function of the classifier;
(7) returning to step (1) and repeating the above steps until the number of iterations is reached, to obtain the trained STARGAN network.
Further, the input process in step (2.2) comprises the following steps:
(1) inputting the source speaker's spectral envelope feature x_s' into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s)';
(2) inputting the semantic feature G(x_s)' obtained above, together with the target speaker's label feature c_t' and x-vector X-vector_t', into the decoding network of the generator G, to obtain the target speaker's spectral envelope feature x_tc'.
Further, the generator G uses a two-dimensional convolutional neural network with loss function

L_G(G) = L_adv^G(G) + λ_cls·L_cls^G(G) + λ_cyc·L_cyc(G) + λ_id·L_id(G)

wherein λ_cls >= 0, λ_cyc >= 0 and λ_id >= 0 are regularization parameters denoting the weights of the classification loss, the cycle-consistency loss and the feature-mapping loss respectively, and L_adv^G(G), L_cls^G(G), L_cyc(G) and L_id(G) denote the adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss and the feature-mapping loss respectively;
the discriminator D uses a two-dimensional convolutional neural network with loss function

L_adv^D(D) = -E_{x_t,c_t}[log D(x_t, c_t)] - E_{x_s,c_t}[log(1 - D(G(x_s, c_t, X-vector_t), c_t))]

wherein D(x_t, c_t) denotes the discriminator D judging a real spectral feature, G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator G, D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator judging a generated spectral feature, E_{x_s,c_t}[·] denotes the expectation over the probability distribution generated by the generator G, and E_{x_t,c_t}[·] denotes the expectation over the true probability distribution;
the classifier uses a two-dimensional convolutional neural network C with loss function

L_cls^C(C) = -E_{x_t,c_t}[log p_C(c_t | x_t)]

wherein p_C(c_t | x_t) denotes the probability, judged by the classifier, that the target speaker feature is a real spectrum with label c_t.
Further,

L_adv^G(G) = -E_{x_s,c_t}[log D(G(x_s, c_t, X-vector_t), c_t)]

wherein E_{x_s,c_t}[·] denotes the expectation over the probability distribution generated by the generator and G(x_s, c_t, X-vector_t) denotes the spectral feature generated by the generator;

L_cls^G(G) = -E_{x_s,c_t}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

wherein p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability, judged by the classifier, that the label of the generated target speaker spectrum belongs to c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator;

L_cyc(G) = E_{x_s,c_s,c_t}[ || G(G(x_s, c_t, X-vector_t), c_s) - x_s || ]

wherein G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature and the expectation is taken over the loss between the reconstructed source speaker spectrum and the real source speaker spectrum;

L_id(G) = E_{x_s,c_s}[ || G(x_s, c_s, X-vector_s) - x_s || ]

wherein G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after inputting the source speaker spectrum, the speaker label and the x-vector into the generator, and the expectation is taken over the loss between x_s and G(x_s, c_s, X-vector_s).
Further, the encoding network of the generator G comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers being 3*9, 4*8, 4*8, 3*5 and 9*5 respectively, the strides being 1*1, 2*2, 2*2, 1*1 and 9*1 respectively, and the filter depths being 32, 64, 128, 64 and 5 respectively; the decoding network of the generator G comprises 5 deconvolutional layers, the filter sizes of the 5 deconvolutional layers being 9*5, 3*5, 4*8, 4*8 and 3*9 respectively, the strides being 9*1, 1*1, 2*2, 2*2 and 1*1 respectively, and the filter depths being 64, 128, 64, 32 and 1 respectively.
Further, the discriminator D comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers being 3*9, 3*8, 3*8, 3*6 and 36*5 respectively, the strides being 1*1, 1*2, 1*2, 1*2 and 36*1 respectively, and the filter depths being 32, 32, 32, 32 and 1 respectively.
Further, the classifier C comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers being 4*4, 4*4, 4*4, 3*4 and 1*4 respectively, the strides being 2*2, 2*2, 2*2, 1*2 and 1*2 respectively, and the filter depths being 8, 16, 32, 16 and 4 respectively.
Further, the fundamental frequency transfer function is

log f0_t' = (σ_t / σ_s)·(log f0_s - μ_s) + μ_t

wherein μ_s and σ_s are the mean and standard deviation of the source speaker's fundamental frequency in the log domain, μ_t and σ_t are the mean and standard deviation of the target speaker's fundamental frequency in the log domain, log f0_s is the log fundamental frequency of the source speaker, and log f0_t' is the converted log fundamental frequency.
Beneficial effects: by combining STARGAN with the X-vector, this method realizes many-to-many speaker voice conversion under both parallel and non-parallel text conditions and fully expresses the personalized features of speakers. Since the X-vector characterizes short utterances particularly well, the speaker similarity and speech quality of the converted voice are better improved, and the over-smoothing problem of C-VAE is overcome, realizing a high-quality voice conversion method. In addition, the method can convert voices under non-parallel text conditions, the training process requires no alignment, and the versatility and practicality of the voice conversion system are improved. The method can also integrate multiple source-target speaker pairs into one conversion model, realizing many-to-many voice conversion among multiple speakers, and has promising applications in cross-lingual voice conversion, film dubbing, speech translation and related fields.
Detailed description of the invention
Fig. 1 is the overall flow chart of the method.
Specific embodiment
As shown in Figure 1, the method of the present invention is divided into two parts: a training part, which obtains the parameters and transfer functions required for voice conversion, and a conversion part, which converts the source speaker's voice into the target speaker's voice.
The implementation steps of the training stage are as follows:
1.1) Obtain a training corpus of non-parallel text, composed of the corpora of several speakers, including the source speakers and the target speakers. The training corpus is taken from the VCC2018 speech corpus, whose training set contains 6 male and 6 female speakers, each with 81 utterances. Since this method can realize conversion under parallel as well as non-parallel text conditions, the training corpus may also be non-parallel text.
1.2) Extract, through the WORLD speech analysis/synthesis model, the spectral envelope feature x, the aperiodicity feature and the log fundamental frequency log f0 of each speaker's sentences in the training corpus, and at the same time extract the x-vector X-vector representing each speaker's personalized features. Since the fast Fourier transform (Fast Fourier Transformation, FFT) length is set to 1024, the resulting spectral envelope feature x and aperiodicity feature are both 1024/2+1 = 513 dimensional. Each speech block has 512 frames, 36-dimensional Mel-cepstral coefficient (MCEP) features are extracted from the spectral envelope feature, and 8 speech blocks are taken at a time during training, so the dimension of one training batch is 8*36*512.
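As a non-limiting illustration of step 1.2), the WORLD analysis and MCEP extraction could be implemented with the open-source pyworld and pysptk bindings; the choice of these libraries, the all-pass constant alpha and the helper name below are assumptions of this sketch rather than part of the embodiment.

import numpy as np
import pyworld as pw
import pysptk
import soundfile as sf

def extract_world_features(wav_path, fft_size=1024, mcep_dim=36, alpha=0.42):
    # Read the waveform and run WORLD analysis (F0, 513-dim spectral envelope, aperiodicity).
    x, fs = sf.read(wav_path)
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pw.harvest(x, fs)                              # fundamental frequency contour
    sp = pw.cheaptrick(x, f0, t, fs, fft_size=fft_size)    # spectral envelope, shape (frames, 513)
    ap = pw.d4c(x, f0, t, fs, fft_size=fft_size)           # aperiodicity, shape (frames, 513)
    # 36-dimensional Mel-cepstral coefficients (MCEPs) from the spectral envelope.
    mcep = pysptk.sp2mc(sp, order=mcep_dim - 1, alpha=alpha)
    log_f0 = np.log(f0[f0 > 0])                            # log F0 over voiced frames only
    return mcep, sp, ap, f0, log_f0, fs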
In practical applications, the speech of the person to be converted is relatively short, and conversion using the traditional i-vector speaker representation gives only mediocre results. The X-vector is a novel low-dimensional fixed-length embedding extracted with a DNN; since the DNN has a very strong feature extraction capability, the X-vector characterizes short utterances better. The network is implemented with the nnet3 neural network library in the Kaldi speech recognition toolkit. The main difference between the X-vector and the i-vector lies in the extraction method. The structure of the X-vector extraction system is shown in Table 1: it consists of frame layers, a stats pooling layer, segment layers and a softmax layer. T denotes all input speech frames and N denotes the number of training speakers; since the training corpus is taken from the VCC2018 speech corpus, N is 12.
Table 1. Structure of the X-vector extraction system
The DNN in the X-vector system has a time-delay structure: first a context of 5 frames is spliced into one new frame set; then, centered on that new frame set, a context of 4 frames is spliced into another new frame set; and so on, until a frame set spanning 15 frames is obtained and used as the DNN input. The input features are 23-dimensional MFCCs with a frame length of 25 ms. The stats pooling layer aggregates the outputs of the frame5 layer over all T frames and computes their mean and standard deviation. These 1500-dimensional statistics are computed once for each input speech segment, concatenated, and passed to the segment layers. Finally, the softmax layer outputs a posterior probability, with the number of output neurons equal to the number of speakers in the training set, and the system classifies the training speakers accordingly.
The DNN is trained with the multi-class cross-entropy loss

E = - Σ_n Σ_k d_nk · ln P(spkr_k | x_n^(1:T))

where n indexes the input utterances and k the speakers, P(spkr_k | x_n^(1:T)) is the posterior probability given by the softmax layer that input utterance n belongs to speaker k, and d_nk equals 1 only when the speaker of utterance n is k, and 0 otherwise.
The DNN is not only a classifier but the combination of a feature extractor and a classifier, each layer having strong feature extraction capability. After training, the segment layers can be used to extract the X-vector of an utterance: as shown in Table 1, the 512-dimensional X-vector is extracted at segment6, the layers after it being discarded. After extraction, as with i-vectors, the similarity between X-vectors is computed with a probabilistic linear discriminant analysis (PLDA) back end.
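For illustration only, the frame / stats-pooling / segment / softmax structure of Table 1 could be sketched in PyTorch as below; the exact frame-layer contexts and hidden widths follow the widely used Kaldi x-vector recipe and are assumptions wherever the text does not fix them.

import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    """TDNN x-vector extractor: frame layers -> stats pooling -> segment layers -> softmax."""
    def __init__(self, feat_dim=23, num_speakers=12, embed_dim=512):
        super().__init__()
        # Frame-level layers realised as dilated 1-D convolutions over time.
        self.frame = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),  # 5-frame context
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),                  # frame5: 1500 dims
        )
        self.segment6 = nn.Linear(3000, embed_dim)        # the 512-dim x-vector is taken here
        self.segment7 = nn.Linear(embed_dim, embed_dim)
        self.output = nn.Linear(embed_dim, num_speakers)  # softmax layer over N = 12 speakers

    def forward(self, mfcc):                              # mfcc: (batch, feat_dim, frames)
        h = self.frame(mfcc)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)   # stats pooling: mean + std
        xvec = self.segment6(stats)                       # embedding used after training
        h = torch.relu(self.segment7(torch.relu(xvec)))
        return self.output(h), xvec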
1.3) The STARGAN network in this embodiment builds on the Cycle-GAN model, improving the GAN structure and adding a classifier to strengthen the Cycle-GAN. STARGAN consists of three parts: a generator G that produces realistic spectra, a discriminator D that judges whether its input is a real or a generated spectrum, and a classifier C that judges whether the label of a generated spectrum belongs to c_t.
The objective function of the STARGAN network combines the generator loss L_G(G), the discriminator loss L_adv^D(D) and the classifier loss L_cls^C(C), each of which is minimized with respect to its own network. The loss function of the generator is

L_G(G) = L_adv^G(G) + λ_cls·L_cls^G(G) + λ_cyc·L_cyc(G) + λ_id·L_id(G)

where λ_cls >= 0, λ_cyc >= 0 and λ_id >= 0 are regularization parameters denoting the weights of the classification loss, the cycle-consistency loss and the feature-mapping loss respectively, and L_adv^G(G), L_cls^G(G), L_cyc(G) and L_id(G) denote the adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss and the feature-mapping loss respectively.
The loss function of the discriminator is

L_adv^D(D) = -E_{x_t,c_t}[log D(x_t, c_t)] - E_{x_s,c_t}[log(1 - D(G(x_s, c_t, X-vector_t), c_t))]

where D(x_t, c_t) denotes the discriminator judging a real spectral feature, G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator, and D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator judging a generated spectral feature; E_{x_s,c_t}[·] denotes the expectation over the probability distribution generated by the generator and E_{x_t,c_t}[·] the expectation over the true probability distribution.
The loss function of the classifier (a two-dimensional convolutional neural network) is

L_cls^C(C) = -E_{x_t,c_t}[log p_C(c_t | x_t)]

where p_C(c_t | x_t) denotes the probability, judged by the classifier, that the target speaker feature is a real spectrum with label c_t.
1.4) Take the source speaker spectral envelope feature x_s extracted in 1.2) together with the target speaker label feature c_t and x-vector X-vector_t as the joint feature (x_s, c_t, X-vector_t) and input it into the generator for training. Train the generator so that its loss function L_G is as small as possible, obtaining the generated target speaker spectral envelope feature x_tc.
The generator uses a two-dimensional convolutional neural network composed of an encoding network and a decoding network. The encoding network comprises 5 convolutional layers with filter sizes 3*9, 4*8, 4*8, 3*5, 9*5, strides 1*1, 2*2, 2*2, 1*1, 9*1 and filter depths 32, 64, 128, 64, 5, respectively. The decoding network comprises 5 deconvolutional layers with filter sizes 9*5, 3*5, 4*8, 4*8, 3*9, strides 9*1, 1*1, 2*2, 2*2, 1*1 and filter depths 64, 128, 64, 32, 1, respectively.
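A minimal PyTorch sketch of this encoder-decoder generator is given below; the paddings, activations and the way the target label and x-vector are broadcast onto the bottleneck are not specified in the text and are assumptions of this sketch, so exact output shapes would need tuning.

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Encoder-decoder generator with the kernel sizes, strides and depths listed above.
    How c_t and the x-vector condition the decoder is an implementation choice; here they
    are tiled over the bottleneck and concatenated as extra channels."""
    def __init__(self, num_speakers=12, xvec_dim=512):
        super().__init__()
        enc_cfg = [  # (in_channels, out_channels, kernel, stride)
            (1, 32, (3, 9), (1, 1)),
            (32, 64, (4, 8), (2, 2)),
            (64, 128, (4, 8), (2, 2)),
            (128, 64, (3, 5), (1, 1)),
            (64, 5, (9, 5), (9, 1)),
        ]
        self.encoder = nn.ModuleList(
            nn.Conv2d(i, o, k, s, padding=(k[0] // 2, k[1] // 2)) for i, o, k, s in enc_cfg)
        cond = num_speakers + xvec_dim
        dec_cfg = [
            (5 + cond, 64, (9, 5), (9, 1)),
            (64, 128, (3, 5), (1, 1)),
            (128, 64, (4, 8), (2, 2)),
            (64, 32, (4, 8), (2, 2)),
            (32, 1, (3, 9), (1, 1)),
        ]
        self.decoder = nn.ModuleList(
            nn.ConvTranspose2d(i, o, k, s, padding=(k[0] // 2, k[1] // 2)) for i, o, k, s in dec_cfg)

    def encode(self, x):                       # x: (batch, 1, mcep_dim, frames)
        for conv in self.encoder:
            x = torch.relu(conv(x))
        return x                               # speaker-independent semantic features G(x)

    def decode(self, h, c, xvec):
        cond = torch.cat([c, xvec], dim=1)     # (batch, num_speakers + xvec_dim)
        cond = cond[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        y = torch.cat([h, cond], dim=1)
        for deconv in self.decoder:
            y = torch.relu(deconv(y))
        return y

    def forward(self, x, c, xvec):
        return self.decode(self.encode(x), c, xvec)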
1.5) Take the generated target speaker spectral envelope feature x_tc obtained in 1.4), the target speaker spectral envelope feature x_t of the training corpus obtained in 1.2) and the target speaker label c_t together as the input of the discriminator, and train the discriminator so that its loss function L_adv^D is as small as possible.
The discriminator uses a two-dimensional convolutional neural network comprising 5 convolutional layers, with filter sizes 3*9, 3*8, 3*8, 3*6, 36*5, strides 1*1, 1*2, 1*2, 1*2, 36*1 and filter depths 32, 32, 32, 32, 1, respectively.
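Correspondingly, the discriminator could be sketched as follows; conditioning on the label c_t by tiling it as extra input channels is one common choice and is assumed here rather than prescribed by the text.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """2-D CNN discriminator with the layer sizes listed above; the target label c_t is
    tiled over the input spectrum and concatenated as extra channels (an assumed choice)."""
    def __init__(self, num_speakers=12):
        super().__init__()
        cfg = [
            (1 + num_speakers, 32, (3, 9), (1, 1)),
            (32, 32, (3, 8), (1, 2)),
            (32, 32, (3, 8), (1, 2)),
            (32, 32, (3, 6), (1, 2)),
            (32, 1, (36, 5), (36, 1)),
        ]
        self.layers = nn.ModuleList(
            nn.Conv2d(i, o, k, s, padding=(k[0] // 2, k[1] // 2)) for i, o, k, s in cfg)

    def forward(self, x, c):                   # x: (batch, 1, mcep_dim, frames), c: one-hot label
        cond = c[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        h = torch.cat([x, cond], dim=1)
        for conv in self.layers[:-1]:
            h = torch.relu(conv(h))
        return torch.sigmoid(self.layers[-1](h))   # probability that the input spectrum is real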
The loss function of the discriminator is

L_adv^D(D) = -E_{x_t,c_t}[log D(x_t, c_t)] - E_{x_s,c_t}[log(1 - D(G(x_s, c_t, X-vector_t), c_t))]

and the optimization objective is to minimize L_adv^D(D) with respect to the discriminator parameters.
1.6) Input the target speaker spectral envelope feature x_tc obtained above into the encoding network of the generator G again, to obtain the speaker-independent semantic feature G(x_tc); then input the semantic feature G(x_tc), together with the source speaker label feature c_s and x-vector X-vector_s, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the reconstructed source speaker spectral envelope feature x_sc. The generator loss minimized during training includes the adversarial loss of the generator, the cycle-consistency loss, the feature-mapping loss and the classification loss of the generator. Training the cycle-consistency loss makes the source speaker spectral feature x_sc, reconstructed from x_s after passing through the generator G, as consistent with x_s as possible. Training the feature-mapping loss guarantees that the speaker label of x_s remains c_s after passing through the generator G. The classification loss refers to the loss on the probability, judged by the classifier, that the target speaker spectrum x_tc generated by the generator belongs to label c_t.
The loss function of the generator is

L_G(G) = L_adv^G(G) + λ_cls·L_cls^G(G) + λ_cyc·L_cyc(G) + λ_id·L_id(G)

and the optimization objective is to minimize L_G(G) with respect to the generator parameters, where λ_cls >= 0, λ_cyc >= 0 and λ_id >= 0 are regularization parameters denoting the weights of the classification loss, the cycle-consistency loss and the feature-mapping loss respectively.
L_adv^G(G) denotes the adversarial loss of the generator in the GAN:

L_adv^G(G) = -E_{x_s,c_t}[log D(G(x_s, c_t, X-vector_t), c_t)]

where E_{x_s,c_t}[·] denotes the expectation over the probability distribution generated by the generator and G(x_s, c_t, X-vector_t) denotes the spectral feature generated by the generator. Together with the discriminator loss L_adv^D(D) it forms the adversarial loss commonly used in GANs, which judges whether the spectrum fed to the discriminator is real or generated. During training L_adv^G(G) is made as small as possible, and the generator is optimized continually until it generates spectral features G(x_s, c_t, X-vector_t) realistic enough that the discriminator can hardly tell real from fake.
L_cls^G(G) is the classification loss with which the classifier C optimizes the generator:

L_cls^G(G) = -E_{x_s,c_t}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

where p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability, judged by the classifier, that the label of the generated target speaker spectrum belongs to c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator. During training, L_cls^G(G) is made as small as possible, so that the spectrum G(x_s, c_t, X-vector_t) generated by the generator G can be correctly classified by the classifier as label c_t.
L_cyc(G) and L_id(G) are adopted from the generator losses of the Cycle-GAN model. L_cyc(G) is the cycle-consistency loss of the generator G:

L_cyc(G) = E_{x_s,c_s,c_t}[ || G(G(x_s, c_t, X-vector_t), c_s) - x_s || ]

where G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature and the expectation is taken over the loss between the reconstructed source speaker spectrum and the real source speaker spectrum. In training the generator, L_cyc(G) is made as small as possible, so that after the generated target spectrum G(x_s, c_t, X-vector_t) and the source speaker label c_s are fed into the generator again, the reconstructed source speaker spectrum is as similar to x_s as possible. Training with L_cyc(G) effectively guarantees that the semantic features of the speaker's voice are not lost during encoding by the generator.
L_id(G) is the feature-mapping loss of the generator G:

L_id(G) = E_{x_s,c_s}[ || G(x_s, c_s, X-vector_s) - x_s || ]

where G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after inputting the source speaker spectrum, the speaker label and the x-vector into the generator, and the expectation is taken over the loss between x_s and G(x_s, c_s, X-vector_s). Training with L_id(G) effectively ensures that the label c_s and the speaker representation X-vector_s of the input spectrum remain unchanged after passing through the generator.
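The four generator terms above can be assembled as in the following sketch; the use of an L1 distance for L_cyc and L_id, the one-hot label handling and the small epsilon inside the logarithms are assumptions of this illustration, not requirements of the embodiment.

import torch
import torch.nn.functional as F

def generator_losses(G, D, C, x_s, c_s, xv_s, c_t, xv_t,
                     lambda_cls=1.0, lambda_cyc=1.0, lambda_id=1.0, eps=1e-8):
    """Assemble L_G = L_adv^G + lambda_cls*L_cls^G + lambda_cyc*L_cyc + lambda_id*L_id."""
    x_tc = G(x_s, c_t, xv_t)                               # generated target speaker spectrum
    # Adversarial loss: generated spectra should be judged real by D.
    l_adv = -torch.log(D(x_tc, c_t) + eps).mean()
    # Classification loss: C should assign the generated spectrum to label c_t.
    l_cls = F.cross_entropy(C(x_tc), c_t.argmax(dim=1))
    # Cycle-consistency loss: mapping back with (c_s, X-vector_s) should recover x_s.
    x_sc = G(x_tc, c_s, xv_s)
    l_cyc = torch.mean(torch.abs(x_sc - x_s))
    # Feature-mapping (identity) loss: mapping x_s to its own identity should change nothing.
    x_ss = G(x_s, c_s, xv_s)
    l_id = torch.mean(torch.abs(x_ss - x_s))
    return l_adv + lambda_cls * l_cls + lambda_cyc * l_cyc + lambda_id * l_id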
1.7) Input the generated target speaker spectral envelope feature x_tc and the target speaker spectral envelope feature x_t into the classifier for training, minimizing the loss function of the classifier.
The classifier uses a two-dimensional convolutional neural network C comprising 5 convolutional layers, with filter sizes 4*4, 4*4, 4*4, 3*4, 1*4, strides 2*2, 2*2, 2*2, 1*2, 1*2 and filter depths 8, 16, 32, 16, 4, respectively.
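A corresponding sketch of this classifier is shown below; the final pooling and linear projection from the 4-channel feature map to per-speaker logits is an assumed detail.

import torch
import torch.nn as nn

class Classifier(nn.Module):
    """2-D CNN speaker classifier with the layer sizes listed above."""
    def __init__(self, num_speakers=12):
        super().__init__()
        cfg = [
            (1, 8, (4, 4), (2, 2)),
            (8, 16, (4, 4), (2, 2)),
            (16, 32, (4, 4), (2, 2)),
            (32, 16, (3, 4), (1, 2)),
            (16, 4, (1, 4), (1, 2)),
        ]
        self.layers = nn.ModuleList(
            nn.Conv2d(i, o, k, s, padding=(k[0] // 2, k[1] // 2)) for i, o, k, s in cfg)
        self.out = nn.LazyLinear(num_speakers)         # speaker logits (softmax applied in the loss)

    def forward(self, x):                              # x: (batch, 1, mcep_dim, frames)
        for conv in self.layers:
            x = torch.relu(conv(x))
        return self.out(x.flatten(1))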
The loss function of the classifier (a two-dimensional convolutional neural network) is

L_cls^C(C) = -E_{x_t,c_t}[log p_C(c_t | x_t)]

and the optimization objective is to minimize L_cls^C(C) with respect to the classifier parameters.
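The discriminator and classifier objectives of 1.5) and 1.7) can be written as in the sketch below, with the generated spectrum detached so that only D and C are updated; the epsilon and the one-hot label handling are again assumptions.

import torch
import torch.nn.functional as F

def discriminator_classifier_losses(G, D, C, x_s, x_t, c_t, xv_t, eps=1e-8):
    """L_adv^D on a real/generated pair and L_cls^C on the real target spectrum."""
    with torch.no_grad():                       # generated sample, detached for the D/C updates
        x_tc = G(x_s, c_t, xv_t)
    # Discriminator: real target spectra -> 1, generated spectra -> 0.
    l_d = -(torch.log(D(x_t, c_t) + eps).mean()
            + torch.log(1.0 - D(x_tc, c_t) + eps).mean())
    # Classifier: real target spectra should be classified as label c_t.
    l_c = F.cross_entropy(C(x_t), c_t.argmax(dim=1))
    return l_d, l_c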
1.8) Repeat 1.4), 1.5), 1.6) and 1.7) until the number of iterations is reached, obtaining the trained STARGAN network; the trained parameters are the generator parameters φ, the discriminator parameters θ and the classifier parameters ψ. Since the specific network settings and the performance of the experimental equipment differ, the chosen number of iterations also differs; in this experiment the number of iterations is 20000.
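Putting steps 1.4) to 1.8) together, one possible outer loop is sketched below; it assumes the generator_losses and discriminator_classifier_losses helpers sketched above are in scope, and the optimizer type, learning rate and batch format are placeholders rather than values fixed by the embodiment.

import itertools
import torch

def train_stargan(G, D, C, loader, num_iterations=20000, lr=1e-4):
    """Alternately update D and C, then G, for the set number of iterations.
    Each batch is assumed to yield (x_s, c_s, xv_s, x_t, c_t, xv_t) tensors."""
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_dc = torch.optim.Adam(itertools.chain(D.parameters(), C.parameters()), lr=lr)
    batches = iter(itertools.cycle(loader))
    for step in range(num_iterations):
        x_s, c_s, xv_s, x_t, c_t, xv_t = next(batches)
        # Steps 1.5) and 1.7): discriminator and classifier updates on real/generated spectra.
        l_d, l_c = discriminator_classifier_losses(G, D, C, x_s, x_t, c_t, xv_t)
        opt_dc.zero_grad()
        (l_d + l_c).backward()
        opt_dc.step()
        # Steps 1.4) and 1.6): generator update (conversion, cycle reconstruction, identity mapping).
        l_g = generator_losses(G, D, C, x_s, c_s, xv_s, c_t, xv_t)
        opt_g.zero_grad()
        l_g.backward()
        opt_g.step()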
1.9) Establish the fundamental frequency conversion relation from the mean and variance of the log fundamental frequency log f0: compute the mean and variance of each speaker's log fundamental frequency, and convert the source speaker's log fundamental frequency log f0_s into the target speaker's log fundamental frequency log f0_t' by a log-domain linear transformation.
The fundamental frequency transfer function is

log f0_t' = (σ_t / σ_s)·(log f0_s - μ_s) + μ_t

where μ_s and σ_s are the mean and standard deviation of the source speaker's fundamental frequency in the log domain, and μ_t and σ_t are the mean and standard deviation of the target speaker's fundamental frequency in the log domain.
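A direct implementation of this log-domain linear transform is sketched below; leaving unvoiced frames (F0 = 0) unchanged is an assumption of the sketch.

import numpy as np

def convert_f0(f0_source, mu_s, sigma_s, mu_t, sigma_t):
    """Log-domain linear transform of the source F0 contour to the target speaker's statistics."""
    f0_converted = np.zeros_like(f0_source)
    voiced = f0_source > 0                       # unvoiced frames (F0 == 0) are kept at zero
    log_f0 = np.log(f0_source[voiced])
    f0_converted[voiced] = np.exp((log_f0 - mu_s) / sigma_s * sigma_t + mu_t)
    return f0_converted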
The implementation steps of the conversion stage are as follows:
2.1) Pass the source speaker's voice through the WORLD speech analysis/synthesis model and extract the spectral envelope feature x_s', the aperiodicity feature and the fundamental frequency of the source speaker's sentences. Since the fast Fourier transform (FFT) length is set to 1024, the resulting spectral envelope feature x_s' and aperiodicity feature are both 1024/2+1 = 513 dimensional.
2.2) Take the spectral envelope feature x_s' of the source speaker's voice extracted in 2.1), the target speaker's label feature c_t' and the target speaker's x-vector X-vector_t' as the joint feature (x_s', c_t', X-vector_t') and input it into the STARGAN network trained in 1.8), to reconstruct the target speaker's spectral envelope feature x_tc'.
2.3) Convert the source speaker's fundamental frequency extracted in 2.1) into the target speaker's fundamental frequency through the fundamental frequency transfer function obtained in 1.9).
2.4) Synthesize the converted speaker's voice through the WORLD speech analysis/synthesis model from the target speaker's spectral envelope feature x_tc' obtained in 2.2), the target speaker's fundamental frequency obtained in 2.3) and the aperiodicity feature extracted in 2.1).
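For illustration, steps 2.2) to 2.4) could be wired together as below; the tensor shaping around the generator follows the generator sketch above, and alpha, fft_size and the frame layout are assumptions mirroring the analysis settings rather than values fixed by the text.

import numpy as np
import torch
import pyworld as pw
import pysptk

def convert_and_synthesize(G, mcep_src, ap_src, f0_converted, c_t, xv_t, fs,
                           fft_size=1024, alpha=0.42):
    """Run the trained generator on the source MCEPs and resynthesise speech with WORLD."""
    x = torch.from_numpy(mcep_src.T.astype(np.float32))[None, None]   # (1, 1, mcep_dim, frames)
    with torch.no_grad():
        mcep_conv = G(x, c_t, xv_t)[0, 0].numpy().T.astype(np.float64)
    sp_conv = pysptk.mc2sp(mcep_conv, alpha=alpha, fftlen=fft_size)   # back to a spectral envelope
    wav = pw.synthesize(np.ascontiguousarray(f0_converted),
                        np.ascontiguousarray(sp_conv),
                        np.ascontiguousarray(ap_src), fs)
    return wav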

Claims (9)

1. A many-to-many voice conversion method based on STARGAN and x-vectors, characterized by comprising a training stage and a conversion stage, the training stage comprising the following steps:
(1.1) obtaining a training corpus composed of the corpora of several speakers, including source speakers and target speakers;
(1.2) extracting, through the WORLD speech analysis/synthesis model, the spectral envelope feature x of each speaker's sentences in the training corpus, the fundamental frequency feature, and the x-vector X-vector representing each speaker's personalized features;
(1.3) inputting the source speaker's spectral envelope feature x_s, the target speaker's spectral envelope feature x_t, the source speaker's label c_s and x-vector X-vector_s, and the target speaker's label c_t and x-vector X-vector_t into a STARGAN network for training, the STARGAN network consisting of a generator G, a discriminator D and a classifier C, and the generator G consisting of an encoding network and a decoding network;
(1.4) making the loss function of the generator, the loss function of the discriminator and the loss function of the classifier as small as possible during training, until a set number of iterations is reached, to obtain the trained STARGAN network;
(1.5) constructing the fundamental frequency transfer function from the source speaker's speech fundamental frequency to the target speaker's speech fundamental frequency;
the conversion stage comprising the following steps:
(2.1) extracting, through the WORLD speech analysis/synthesis model, the spectral envelope feature x_s', the aperiodicity feature and the fundamental frequency from the source speaker's voice in the corpus to be converted;
(2.2) inputting the source speaker's spectral envelope feature x_s', the target speaker's label feature c_t' and the target speaker's x-vector X-vector_t' into the STARGAN network trained in (1.4), to reconstruct the target speaker's spectral envelope feature x_tc';
(2.3) converting the source speaker's fundamental frequency extracted in (2.1) into the fundamental frequency of the target speaker through the fundamental frequency transfer function obtained in (1.5);
(2.4) synthesizing the converted speaker's voice through the WORLD speech analysis/synthesis model from the target speaker's spectral envelope feature x_tc' obtained in (2.2), the target speaker's fundamental frequency obtained in (2.3) and the aperiodicity feature extracted in (2.1).
2. The many-to-many voice conversion method based on STARGAN and x-vectors according to claim 1, characterized in that the training process in steps (1.3) and (1.4) comprises the following steps:
(1) inputting the source speaker's spectral envelope feature x_s into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s);
(2) inputting the semantic feature G(x_s) obtained above, together with the target speaker's label feature c_t and x-vector X-vector_t, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the target speaker's spectral envelope feature x_tc;
(3) inputting the target speaker's spectral envelope feature x_tc obtained above into the encoding network of the generator G again, to obtain the speaker-independent semantic feature G(x_tc);
(4) inputting the semantic feature G(x_tc) obtained above, together with the source speaker's label feature c_s and x-vector X-vector_s, into the decoding network of the generator G for training, minimizing the loss function of the generator G during training, to obtain the reconstructed source speaker's spectral envelope feature x_sc;
(5) inputting the target speaker's spectral envelope feature x_tc, the target speaker's spectral feature x_t and the target speaker's label feature c_t together into the discriminator D for training, minimizing the loss function of the discriminator;
(6) inputting the target speaker's spectral envelope feature x_tc and the target speaker's spectral envelope feature x_t into the classifier C for training, minimizing the loss function of the classifier;
(7) returning to step (1) and repeating the above steps until the number of iterations is reached, to obtain the trained STARGAN network.
3. The many-to-many voice conversion method based on STARGAN and x-vectors according to claim 1, characterized in that the input process in step (2.2) comprises the following steps:
(1) inputting the source speaker's spectral envelope feature x_s' into the encoding network of the generator G to obtain the speaker-independent semantic feature G(x_s)';
(2) inputting the semantic feature G(x_s)' obtained above, together with the target speaker's label feature c_t' and x-vector X-vector_t', into the decoding network of the generator G, to obtain the target speaker's spectral envelope feature x_tc'.
4. The many-to-many voice conversion method based on STARGAN and x-vectors according to claim 1, characterized in that the generator G uses a two-dimensional convolutional neural network with loss function

L_G(G) = L_adv^G(G) + λ_cls·L_cls^G(G) + λ_cyc·L_cyc(G) + λ_id·L_id(G)

wherein λ_cls >= 0, λ_cyc >= 0 and λ_id >= 0 are regularization parameters denoting the weights of the classification loss, the cycle-consistency loss and the feature-mapping loss respectively, and L_adv^G(G), L_cls^G(G), L_cyc(G) and L_id(G) denote the adversarial loss of the generator, the classification loss with which the classifier optimizes the generator, the cycle-consistency loss and the feature-mapping loss respectively;
the discriminator D uses a two-dimensional convolutional neural network with loss function

L_adv^D(D) = -E_{x_t,c_t}[log D(x_t, c_t)] - E_{x_s,c_t}[log(1 - D(G(x_s, c_t, X-vector_t), c_t))]

wherein D(x_t, c_t) denotes the discriminator D judging a real spectral feature, G(x_s, c_t, X-vector_t) denotes the target speaker spectral feature generated by the generator G, D(G(x_s, c_t, X-vector_t), c_t) denotes the discriminator judging a generated spectral feature, E_{x_s,c_t}[·] denotes the expectation over the probability distribution generated by the generator G, and E_{x_t,c_t}[·] denotes the expectation over the true probability distribution;
the classifier uses a two-dimensional convolutional neural network C with loss function

L_cls^C(C) = -E_{x_t,c_t}[log p_C(c_t | x_t)]

wherein p_C(c_t | x_t) denotes the probability, judged by the classifier, that the target speaker feature is a real spectrum with label c_t.
5. The many-to-many voice conversion method based on STARGAN and x-vectors according to claim 4, characterized in that:

L_adv^G(G) = -E_{x_s,c_t}[log D(G(x_s, c_t, X-vector_t), c_t)]

wherein E_{x_s,c_t}[·] denotes the expectation over the probability distribution generated by the generator and G(x_s, c_t, X-vector_t) denotes the spectral feature generated by the generator;

L_cls^G(G) = -E_{x_s,c_t}[log p_C(c_t | G(x_s, c_t, X-vector_t))]

wherein p_C(c_t | G(x_s, c_t, X-vector_t)) denotes the probability, judged by the classifier, that the label of the generated target speaker spectrum belongs to c_t, and G(x_s, c_t, X-vector_t) denotes the target speaker spectrum generated by the generator;

L_cyc(G) = E_{x_s,c_s,c_t}[ || G(G(x_s, c_t, X-vector_t), c_s) - x_s || ]

wherein G(G(x_s, c_t, X-vector_t), c_s) is the reconstructed source speaker spectral feature and the expectation is taken over the loss between the reconstructed source speaker spectrum and the real source speaker spectrum;

L_id(G) = E_{x_s,c_s}[ || G(x_s, c_s, X-vector_s) - x_s || ]

wherein G(x_s, c_s, X-vector_s) is the source speaker spectral feature obtained after inputting the source speaker spectrum, the speaker label and the x-vector into the generator, and the expectation is taken over the loss between x_s and G(x_s, c_s, X-vector_s).
6. The many-to-many voice conversion method based on STARGAN and x-vectors according to claim 5, characterized in that the encoding network of the generator G comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers being 3*9, 4*8, 4*8, 3*5 and 9*5 respectively, the strides being 1*1, 2*2, 2*2, 1*1 and 9*1 respectively, and the filter depths being 32, 64, 128, 64 and 5 respectively; and the decoding network of the generator G comprises 5 deconvolutional layers, the filter sizes of the 5 deconvolutional layers being 9*5, 3*5, 4*8, 4*8 and 3*9 respectively, the strides being 9*1, 1*1, 2*2, 2*2 and 1*1 respectively, and the filter depths being 64, 128, 64, 32 and 1 respectively.
7. The many-to-many voice conversion method based on STARGAN and x-vectors according to claim 5, characterized in that the discriminator D comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers being 3*9, 3*8, 3*8, 3*6 and 36*5 respectively, the strides being 1*1, 1*2, 1*2, 1*2 and 36*1 respectively, and the filter depths being 32, 32, 32, 32 and 1 respectively.
8. The many-to-many voice conversion method based on STARGAN and x-vectors according to claim 5, characterized in that the classifier C comprises 5 convolutional layers, the filter sizes of the 5 convolutional layers being 4*4, 4*4, 4*4, 3*4 and 1*4 respectively, the strides being 2*2, 2*2, 2*2, 1*2 and 1*2 respectively, and the filter depths being 8, 16, 32, 16 and 4 respectively.
9. The many-to-many voice conversion method based on STARGAN and x-vectors according to any one of claims 1 to 8, characterized in that the fundamental frequency transfer function is

log f0_t' = (σ_t / σ_s)·(log f0_s - μ_s) + μ_t

wherein μ_s and σ_s are the mean and standard deviation of the source speaker's fundamental frequency in the log domain, μ_t and σ_t are the mean and standard deviation of the target speaker's fundamental frequency in the log domain, log f0_s is the log fundamental frequency of the source speaker, and log f0_t' is the converted log fundamental frequency.
CN201910030578.4A 2019-01-14 2019-01-14 Many-to-many speaker conversion method based on STARGAN and x vectors Active CN109671442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910030578.4A CN109671442B (en) 2019-01-14 2019-01-14 Many-to-many speaker conversion method based on STARGAN and x vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910030578.4A CN109671442B (en) 2019-01-14 2019-01-14 Many-to-many speaker conversion method based on STARGAN and x vectors

Publications (2)

Publication Number Publication Date
CN109671442A true CN109671442A (en) 2019-04-23
CN109671442B CN109671442B (en) 2023-02-28

Family

ID=66150583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910030578.4A Active CN109671442B (en) 2019-01-14 2019-01-14 Many-to-many speaker conversion method based on STARGAN and x vectors

Country Status (1)

Country Link
CN (1) CN109671442B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136686A (en) * 2019-05-14 2019-08-16 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu i vector
CN110459232A (en) * 2019-07-24 2019-11-15 浙江工业大学 A kind of phonetics transfer method generating confrontation network based on circulation
CN110600046A (en) * 2019-09-17 2019-12-20 南京邮电大学 Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method
CN110956971A (en) * 2019-12-03 2020-04-03 广州酷狗计算机科技有限公司 Audio processing method, device, terminal and storage medium
CN111261177A (en) * 2020-01-19 2020-06-09 平安科技(深圳)有限公司 Voice conversion method, electronic device and computer readable storage medium
CN111414888A (en) * 2020-03-31 2020-07-14 杭州博雅鸿图视频技术有限公司 Low-resolution face recognition method, system, device and storage medium
CN111462768A (en) * 2020-03-12 2020-07-28 南京邮电大学 Multi-scale StarGAN voice conversion method based on shared training
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN111785258A (en) * 2020-07-13 2020-10-16 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling
CN111833855A (en) * 2020-03-16 2020-10-27 南京邮电大学 Many-to-many speaker conversion method based on DenseNet STARGAN
CN112115771A (en) * 2020-08-05 2020-12-22 暨南大学 Gait image synthesis method based on star-shaped generation confrontation network
CN113129914A (en) * 2019-12-30 2021-07-16 明日基金知识产权有限公司 Cross-language speech conversion system and method
CN113421576A (en) * 2021-06-29 2021-09-21 平安科技(深圳)有限公司 Voice conversion method, device, equipment and storage medium
CN115294970A (en) * 2022-10-09 2022-11-04 苏州大学 Voice conversion method, device and storage medium for pathological voice

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130262096A1 (en) * 2011-09-23 2013-10-03 Lessac Technologies, Inc. Methods for aligning expressive speech utterances with text and systems therefor
US20160140951A1 (en) * 2014-11-13 2016-05-19 Google Inc. Method and System for Building Text-to-Speech Voice from Diverse Recordings
CN106205623A (en) * 2016-06-17 2016-12-07 福建星网视易信息系统有限公司 A kind of sound converting method and device
CN107301859A (en) * 2017-06-21 2017-10-27 南京邮电大学 Phonetics transfer method under the non-parallel text condition clustered based on adaptive Gauss
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
CN108777140A (en) * 2018-04-27 2018-11-09 南京邮电大学 Phonetics transfer method based on VAE under a kind of training of non-parallel corpus

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136686A (en) * 2019-05-14 2019-08-16 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu i vector
CN110459232A (en) * 2019-07-24 2019-11-15 浙江工业大学 A kind of phonetics transfer method generating confrontation network based on circulation
CN110600046A (en) * 2019-09-17 2019-12-20 南京邮电大学 Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN110600047A (en) * 2019-09-17 2019-12-20 南京邮电大学 Perceptual STARGAN-based many-to-many speaker conversion method
CN110956971A (en) * 2019-12-03 2020-04-03 广州酷狗计算机科技有限公司 Audio processing method, device, terminal and storage medium
CN113129914A (en) * 2019-12-30 2021-07-16 明日基金知识产权有限公司 Cross-language speech conversion system and method
CN111261177A (en) * 2020-01-19 2020-06-09 平安科技(深圳)有限公司 Voice conversion method, electronic device and computer readable storage medium
CN111462768A (en) * 2020-03-12 2020-07-28 南京邮电大学 Multi-scale StarGAN voice conversion method based on shared training
CN111833855B (en) * 2020-03-16 2024-02-23 南京邮电大学 Multi-to-multi speaker conversion method based on DenseNet STARGAN
CN111833855A (en) * 2020-03-16 2020-10-27 南京邮电大学 Many-to-many speaker conversion method based on DenseNet STARGAN
CN111414888A (en) * 2020-03-31 2020-07-14 杭州博雅鸿图视频技术有限公司 Low-resolution face recognition method, system, device and storage medium
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation
CN111785261B (en) * 2020-05-18 2023-07-21 南京邮电大学 Cross-language voice conversion method and system based on entanglement and explanatory characterization
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling
CN111816156B (en) * 2020-06-02 2023-07-21 南京邮电大学 Multi-to-multi voice conversion method and system based on speaker style feature modeling
CN111785258B (en) * 2020-07-13 2022-02-01 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
CN111785258A (en) * 2020-07-13 2020-10-16 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
CN112115771B (en) * 2020-08-05 2022-04-01 暨南大学 Gait image synthesis method based on star-shaped generation confrontation network
CN112115771A (en) * 2020-08-05 2020-12-22 暨南大学 Gait image synthesis method based on star-shaped generation confrontation network
CN113421576A (en) * 2021-06-29 2021-09-21 平安科技(深圳)有限公司 Voice conversion method, device, equipment and storage medium
CN113421576B (en) * 2021-06-29 2024-05-24 平安科技(深圳)有限公司 Voice conversion method, device, equipment and storage medium
CN115294970A (en) * 2022-10-09 2022-11-04 苏州大学 Voice conversion method, device and storage medium for pathological voice

Also Published As

Publication number Publication date
CN109671442B (en) 2023-02-28

Similar Documents

Publication Publication Date Title
CN109671442A (en) Multi-to-multi voice conversion method based on STARGAN Yu x vector
Łańcucki Fastpitch: Parallel text-to-speech with pitch prediction
CN109326283B (en) Many-to-many voice conversion method based on text encoder under non-parallel text condition
Saito et al. Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors
CN110060690A (en) Multi-to-multi voice conversion method based on STARGAN and ResNet
CN109377978B (en) Many-to-many speaker conversion method based on i vector under non-parallel text condition
CN109599091A (en) Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN110136686A (en) Multi-to-multi voice conversion method based on STARGAN Yu i vector
CN108777140A (en) Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
Kekre et al. Speaker identification by using vector quantization
CN110060657A (en) Multi-to-multi voice conversion method based on SN
CN110060691A (en) Multi-to-multi phonetics transfer method based on i vector sum VARSGAN
Wang et al. Accent and speaker disentanglement in many-to-many voice conversion
CN109584893A (en) Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
Zhao et al. Research on voice cloning with a few samples
Kim et al. Linguistic-coupled age-to-age voice translation to improve speech recognition performance in real environments
Nazir et al. Deep learning end to end speech synthesis: A review
Lee et al. HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
Othmane et al. Enhancement of esophageal speech using voice conversion techniques
Barman et al. State of the art review of speech recognition using genetic algorithm
Hong et al. Emotion recognition from Korean language using MFCC HMM and speech speed
Ijima et al. Prosody Aware Word-Level Encoder Based on BLSTM-RNNs for DNN-Based Speech Synthesis.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant