CN108777140A - Voice conversion method based on a VAE under non-parallel corpus training - Google Patents
- Publication number: CN108777140A (application CN201810393556.XA)
- Authority
- CN
- China
- Prior art keywords
- feature
- frame
- vae
- bottleneck
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L19/00—Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—... using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants, characterised by the process used
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—... characterised by the type of extracted parameters
- G10L25/24—... the extracted parameters being the cepstrum
- G10L25/27—... characterised by the analysis technique
- G10L25/30—... using neural networks
Abstract
The invention discloses a voice conversion method based on a variational autoencoder (VAE) trained on a non-parallel corpus. Under non-parallel text conditions, bottleneck features (Bottleneck features) are first extracted by a deep neural network; the conversion function is then learned and modelled with a variational autoencoder; in the conversion stage, many-to-many speaker conversion can be achieved. The invention has three advantages: 1) the dependence on parallel text is removed, and the training process requires no alignment operation; 2) the conversion systems of multiple source-target speaker pairs can be merged into a single conversion model, realizing many-to-many conversion; 3) a many-to-many conversion system under non-parallel text conditions provides technical support for practical interactive speech applications.
Description
Technical field
The invention belongs to the field of speech signal processing, and in particular relates to a voice conversion method based on a variational autoencoder (Variational Autoencoder, VAE) trained on a non-parallel corpus.
Background technology
Voice conversion is a research branch of speech signal processing that draws on the fields of speaker recognition, speech recognition and speech synthesis. Its aim is to change the personalized (speaker-identity) information of an utterance while keeping the original semantic content unchanged, so that the voice of one particular speaker (the source speaker) sounds like that of another particular speaker (the target speaker). The main tasks of voice conversion are to extract the characteristic parameters of the two speakers' voices, map the source parameters to the target, and then decode the converted parameters back into a speech waveform. In this process, both the auditory quality of the converted speech and the accuracy of the converted speaker characteristics must be ensured. After years of development, a variety of methods have emerged in the voice conversion field, among which statistical conversion methods represented by the Gaussian mixture model (GMM) have become the classic approach. However, such algorithms still have certain defects. The classic GMM-based methods are mostly built for one-to-one conversion tasks: they require the source and target speakers to utter the same training sentences, and the spectral features must be aligned frame by frame with dynamic time warping (Dynamic Time Warping, DTW) before the mapping between spectral features can be learned by model training, which makes these methods inflexible in practical applications. Moreover, training the GMM mapping function involves global variables and iterative passes over the training data, causing a sharp increase in computation, and a GMM only reaches a good conversion effect when training data are abundant, which is unsuitable for settings with limited computing resources and equipment.
In recent years, research in deep learning has accelerated the training of deep neural networks and improved their effectiveness, and researchers keep proposing new models and learning methods. Deep networks have strong modelling ability and can learn deeper features from complex data.
The AHOcoder characteristic-parameter extraction model is a speech codec (speech analysis/synthesis system) developed by Daniel Erro at the AHOLAB signal processing laboratory of the University of the Basque Country. AHOcoder decomposes 16 kHz, 16-bit monophonic WAV speech into three parts: fundamental frequency (F0), spectrum (mel cepstral coefficients, MFCC) and maximum voiced frequency. The AHOcoder speech analysis/synthesis model provides accurate speech analysis and high-quality speech waveform reconstruction.
Fundamental frequency is an important parameter influencing speech prosody. For the conversion of fundamental frequency, the voice conversion method designed by the present invention uses the traditional Gaussian-normalization conversion. Assuming that the log fundamental frequencies of the voiced segments of the source speaker and the target speaker follow Gaussian distributions, the mean and standard deviation of the voiced-segment log F0 of each speaker are computed separately. The voiced-segment log F0 of the source speaker is then converted to that of the target speaker with the following formula, while unvoiced segments are left unchanged:

log F0_conv = μ_tgt + (σ_tgt / σ_src) · (log F0_src − μ_src)

where μ_src and σ_src denote the mean and standard deviation of the source speaker's voiced-segment log F0, μ_tgt and σ_tgt denote those of the target speaker, F0_src denotes the fundamental frequency of the source speaker, and F0_conv denotes the converted fundamental frequency.
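As an illustration, the per-frame Gaussian-normalized conversion above can be sketched in Python (a minimal sketch, not the patent's code; the function name and the use of 0 to mark unvoiced frames are assumptions, and σ denotes the standard deviation of voiced-segment log F0):

```python
import math

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Gaussian-normalized log-F0 conversion.

    f0_src: per-frame F0 values in Hz; unvoiced frames are marked 0
    and are passed through unchanged, as the text specifies.
    mu/sigma: mean and standard deviation of voiced-segment log F0
    for the source and target speakers.
    """
    converted = []
    for f0 in f0_src:
        if f0 <= 0:
            # unvoiced frame: leave unchanged
            converted.append(f0)
        else:
            log_f0 = math.log(f0)
            log_conv = mu_tgt + (sigma_tgt / sigma_src) * (log_f0 - mu_src)
            converted.append(math.exp(log_conv))
    return converted
```

A frame whose log F0 equals the source mean μ_src is mapped exactly to the target mean μ_tgt, which is the defining property of this normalization.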
Summary of the invention
To solve the above problems, the present invention proposes a voice conversion method based on a VAE trained on a non-parallel corpus. It removes the dependence on parallel text, realizes many-to-many speaker conversion, improves flexibility, and solves the technical problem that voice conversion is difficult to realize under conditions of limited resources and equipment.
The present invention adopts the following technical scheme. The voice conversion method based on a VAE under non-parallel corpus training comprises a training step and a voice conversion step:
Training step:
1) Use the AHOcoder speech coder to extract the mel cepstral characteristic parameters X of each speaker's voice participating in training;
2) Apply difference processing to each frame of the extracted mel cepstral characteristic parameters X and splice the differences with the original parameters X; then splice the resulting parameters with those of the preceding and following frames in the time domain to form the joint characteristic parameters x_n;
3) Use the joint characteristic parameters x_n and the speaker classification label features y_n to train a deep neural network (Deep Neural Network, DNN); adjust the DNN weights to reduce the classification error until the network converges, obtaining a DNN for the speaker recognition task; extract the bottleneck feature of each frame, i.e. the Bottleneck features b_n;
4) Use the joint characteristic parameters x_n and the corresponding per-frame Bottleneck features b_n to train the VAE model until training converges, yielding the variational autoencoder (Variational Autoencoder, VAE) model; extract the sampled feature z_n of each frame from the latent space z of the VAE;
5) Splice the sampled features z_n with the corresponding per-frame speaker label features y_n to obtain the training data of the Bottleneck feature mapping network (a BP network), and train this mapping network with the per-frame Bottleneck features b_n as supervision; minimize the output error of the mapping network by stochastic gradient descent, obtaining the Bottleneck feature mapping network;
The trained DNN, the VAE and the Bottleneck feature mapping network are combined to constitute the voice conversion system based on the VAE and Bottleneck features.
Voice conversion step:
6) Pass the joint characteristic parameters X_p of the voice to be converted through the encoder module of the VAE model to obtain the sampled feature z_n of each frame in the latent space z;
7) Splice the sampled features z_n with the label features y_n of the target speaker frame by frame and input them into the Bottleneck feature mapping network, obtaining the target speaker's Bottleneck features b̂_n;
8) Splice the Bottleneck features b̂_n with the sampled features z_n frame by frame and pass them through the decoder module of the VAE model to reconstruct the joint characteristic parameters X_p' of the converted voice;
9) Use the AHOcoder speech coder to reconstruct the speech signal.
Preferably, extracting the mel cepstral features of the speakers' voices participating in training in step 1) means using the AHOcoder speech coder to extract the mel cepstral features of each speaker's voice separately and reading the mel cepstral features into the Matlab platform.
Preferably, obtaining the joint characteristic parameters in step 2) is specifically: compute the first-order and second-order differences of each extracted frame of characteristic parameters X and splice them with the original features to obtain the characteristic parameters X_t = (X, ΔX, Δ²X); then splice the spliced parameters X_t with the characteristic parameters of the preceding and following frames in the time domain to form the joint characteristic parameters x_n = (X_{t-1}, X_t, X_{t+1}).
Preferably, extracting the Bottleneck features b_n in step 3) comprises the following steps:
31) obtain on the MATLAB platform the speaker classification label feature y_n corresponding to each frame of the joint characteristic parameters x_n;
32) carry out unsupervised pre-training of the DNN using the layer-wise greedy pre-training method, with ReLU as the activation function of the hidden layers;
33) set the DNN output layer to a softmax classification output, use the speaker classification label features y_n as the supervision for supervised training of the DNN, adjust the network weights with stochastic gradient descent, and minimize the error between the DNN classification output and y_n until convergence, obtaining the DNN for the speaker recognition task;
34) input the joint characteristic parameters x_n into the DNN frame by frame through the feedforward algorithm and extract the activation values of the Bottleneck layer for each frame, i.e. the Bottleneck features b_n corresponding to each frame of mel cepstral characteristic parameters.
Preferably, the VAE model training in step 4) comprises the following steps:
41) use the joint characteristic parameters x_n as the training data of the VAE encoder module and the Bottleneck features b_n as the training data for the decoder module's decoding and reconstruction; in the decoder module of the VAE, use b_n as the control information of the speech spectrum reconstruction process, i.e. splice b_n with the sampled features z_n frame by frame and train the decoder module to reconstruct the speech spectral features;
42) use the ADAM optimizer to optimize the KL divergence and mean squared error in the VAE parameter-estimation process, adjusting the VAE network weights to obtain the VAE speech spectrum conversion model;
43) input the joint characteristic parameters x_n into the VAE speech spectrum conversion model frame by frame and obtain the latent sampled features z_n through the sampling process.
Preferably, obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) splice the sampled features z_n with the speaker classification label features y_n of the corresponding frames as the training data of the Bottleneck feature mapping network; the mapping network uses a structure of one input layer, one hidden layer and one output layer, with a sigmoid activation function in the hidden layer and a linear output layer;
52) according to the minimum mean-squared-error criterion, optimize the weights of the Bottleneck feature mapping network with stochastic gradient descent and error back-propagation, minimizing the error between the network output b̂_n and the per-frame Bottleneck features b_n.
Preferably, obtaining the joint characteristic parameters X_p of the voice to be converted in step 6) is specifically: extract the mel cepstral characteristic parameters of the voice to be converted using AHOcoder, compute on the MATLAB platform the first-order and second-order differences of each extracted frame and splice them with the original features, then splice the resulting parameters with those of the preceding and following frames in the time domain to form the joint characteristic parameters, i.e. the spectral characteristic parameters X_p of the voice to be converted.
Preferably, reconstructing the speech signal in step 9) is specifically: restore the converted speech characteristic parameters X_p' to the mel cepstral feature form, i.e. remove the time-domain context and difference terms, and then synthesize the converted voice with the AHOcoder speech coder.
Advantageous effects of the invention: the present invention is a voice conversion method based on a VAE trained on a non-parallel corpus. It removes the dependence on parallel text, realizes many-to-many speaker conversion, improves flexibility, and solves the technical problem that voice conversion is difficult to realize under conditions of limited resources and equipment. The advantages of the invention are:
1) by modelling and learning with the VAE, the phoneme information in the speech spectral features that is unrelated to speaker identity can be separated into the latent layer, so that the VAE can learn voice conversion from non-parallel voice data; this breaks away from the limitation of traditional conversion models, which need parallel corpus data of the source and target speakers and frame alignment of the spectral features, greatly improves the practicality and flexibility of the conversion system, and provides convenience for designing cross-lingual voice conversion systems;
2) the conversion network obtained by training the VAE can handle a variety of conversion cases; compared with a traditional one-to-one conversion system, a variety of conversion tasks can be completed by training a single model, greatly improving the efficiency of conversion-model training;
3) in the decoder module of the VAE, the Bottleneck features b_n are used as the speaker's personal characteristics to reconstruct the converted speech spectral features; compared with a system that uses the speaker classification label features y_n to characterize speaker identity, the conversion effect and sound quality of the resulting converted speech are better.
Description of the drawings
Fig. 1 is a block diagram of the system training process of the present invention;
Fig. 2 is a block diagram of the system conversion process of the present invention;
Fig. 3 is the structure of the speaker-recognition DNN of the present invention;
Fig. 4 is the structure of the VAE spectral-feature conversion network of the present invention;
Fig. 5 is the structure of the Bottleneck feature mapping network of the present invention;
Fig. 6 is a schematic diagram of parameter estimation in the variational Bayes procedure of the VAE model;
Fig. 7 compares MCD values of converted speech under different conversion cases when different features are used to characterize speaker identity in the VAE model.
Detailed description of the embodiments
The technical scheme of the present invention is further elaborated below with reference to the attached drawings and embodiments.
The present invention adopts the following technical scheme. In the voice conversion method based on a VAE under non-parallel corpus training, the mel cepstral features of the voice are extracted by the AHOcoder speech coder and spliced on the MATLAB platform with their first-order and second-order difference features; the characteristic parameters of the preceding and following frames are then spliced on to form the joint characteristic parameters x_n. x_n is used as training data for a DNN trained on the speaker recognition task; after training converges, x_n is input into the DNN frame by frame and the output of the Bottleneck layer of each frame is taken as the Bottleneck characteristic parameters b_n carrying the speaker's personal characteristics. x_n is then used as the training data of the VAE encoder module and b_n as the training data for the decoder module's decoding and reconstruction; the VAE is trained so that the encoder module obtains, in the latent space z, the phoneme information z_n carrying the semantic features (the sampled feature), and the decoder module reconstructs the speech spectral features from z_n and the Bottleneck features b_n carrying the speaker's personal characteristics. Afterwards, the joint features formed by splicing z_n with the speaker classification label features y_n are used as the training data of a BP network to train the target-speaker Bottleneck feature mapping network, minimizing the error between the network output and the per-frame Bottleneck features b_n. In conversion, the encoder module of the VAE first extracts from the spectral features of the voice to be converted the corresponding phoneme information z_n carrying the semantic features; z_n is spliced frame by frame with the classification label features y_n of the target speaker, and this joint feature is input into the BP network to obtain the Bottleneck features b̂_n of each frame of the target speaker; the joint features formed by splicing z_n with b̂_n frame by frame are then passed through the decoder module of the VAE to reconstruct the converted speech spectral features, which are finally synthesized into voice by AHOcoder.
The method specifically comprises a training step and a voice conversion step:
Fig. 1 is the block diagram of the training process of the system of the present invention. Training step:
1) Use the AHOcoder speech coder to extract the mel cepstral characteristic parameters X of each speaker's voice participating in training;
Extracting the mel cepstral features of the training speakers' voices means using the AHOcoder speech coder to extract them separately and reading the mel cepstral features into the Matlab platform. The present invention uses 19-dimensional mel cepstral features; the voice content of each speaker can be different, and no DTW alignment is needed.
2) Apply difference processing to each frame of the extracted mel cepstral characteristic parameters X and splice it with the original parameters; then splice the resulting parameters with those of the preceding and following frames in the time domain to form the joint characteristic parameters x_n;
Compute the first-order and second-order differences of each extracted frame of characteristic parameters X and splice them with the original features to obtain the 57-dimensional difference characteristic parameters X_t = (X, ΔX, Δ²X); then splice X_t with the characteristic parameters of the preceding and following frames to form the 171-dimensional joint characteristic parameters x_n = (X_{t-1}, X_t, X_{t+1}).
3) Use the joint characteristic parameters x_n and the speaker classification label features y_n to train the DNN; adjust the DNN weights to reduce the classification error until the network converges, obtaining the DNN for the speaker recognition task; extract the Bottleneck features b_n of each frame;
The structure of the Bottleneck feature extraction DNN used in the present invention is shown in Fig. 3. The number of input-layer nodes corresponds to the dimension of the spectral features participating in training, and the output is a softmax classification over speakers, with the number of output nodes depending on the number of speakers participating in training. Extracting the Bottleneck features b_n comprises the following steps:
31) obtain on the MATLAB platform the speaker classification label feature y_n corresponding to each frame of x_n; at this point no distinction is made between source and target speakers, and the per-frame characteristic parameters are only distinguished by the speaker classification label features y_n;
32) the DNN is a fully connected neural network; a 9-layer DNN model is used, with 171 input nodes corresponding to the 171-dimensional per-frame features x_n and 7 intermediate hidden layers whose node counts are 1200, 1200, 1200, 57, 1200, 1200 and 1200, respectively; the hidden layer with fewer nodes is the Bottleneck layer. Layer-wise greedy pre-training is used to carry out unsupervised pre-training of the connection weights between the layers of the DNN, and the hidden-layer activation function is ReLU, which is closer to the behaviour of biological neurons:
f(x) = max(0, x)
Because the ReLU function has one-sided suppression, sparse activation and a relatively wide excitation boundary, it is considered to have a stronger ability to express primitive features.
The activation value of the (k+1)-th hidden layer is: h^{k+1} = f(w^k h^k + B^k)
where h^{k+1} and h^k are the activation values of the (k+1)-th and k-th hidden layers respectively, w^k is the connection weight between the (k+1)-th and k-th layers, and B^k is the bias of the k-th layer.
33) set the DNN output layer to a softmax classification output; spectral feature parameters of 100 utterances from each of 5 speakers are selected for training, so the output layer has 5 nodes corresponding to the label features of the 5 speakers; use the speaker classification label features y_n as the supervision for supervised training of the DNN, adjust the network weights with stochastic gradient descent, and minimize the error between the DNN classification output and y_n until convergence, obtaining the DNN for the speaker recognition task, i.e. the Bottleneck feature extraction network;
34) input x_n into the DNN frame by frame through the feedforward algorithm and extract the activation values of the Bottleneck layer for each frame, i.e. the Bottleneck features b_n corresponding to each frame of characteristic parameters. In the present invention the Bottleneck layer is the 4th hidden layer, i.e.:
b_n = f(w^3 h^3 + B^3)
where h^3 is the activation value of the 3rd hidden layer, w^3 is the connection weight between the 3rd and 4th layers, and B^3 is the bias of the 3rd layer.
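The feedforward extraction of the bottleneck activations can be sketched as follows (a minimal numpy illustration; randomly initialized weights stand in for the trained speaker-recognition DNN, and the layer sizes follow the text up to the 57-node bottleneck layer):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Sizes from the text: input 171, three 1200-node hidden layers,
# then the 4th hidden layer (57 nodes), which is the bottleneck.
sizes = [171, 1200, 1200, 1200, 57]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def bottleneck_features(x_n):
    """Feed a batch of 171-dim joint features forward up to the
    bottleneck layer and return its ReLU activations b_n."""
    h = x_n
    for w, b in zip(weights, biases):
        h = relu(h @ w + b)
    return h

b_n = bottleneck_features(rng.standard_normal((10, 171)))
```

In the patent the remaining layers up to the softmax output exist only for training; at extraction time the forward pass stops at the bottleneck layer, as shown here.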
4) Use the joint characteristic parameters x_n and the corresponding per-frame Bottleneck features b_n to train the VAE model until training converges, and extract the sampled features z_n of each frame from the latent space z of the VAE;
The variational autoencoder (Variational Auto-encoder, VAE) used in the present invention is a generative learning method. The concrete structure of the model used in the present invention is shown in Fig. 4, where x_{s,n} denotes the characteristic parameters of the source voice, x̂_{t,n} denotes the characteristic parameters of the target speaker's voice obtained after conversion, b_n denotes the target speaker's per-frame Bottleneck features, μ and σ are the vector representations of the mean and covariance of each Gaussian component, z denotes the latent space of the VAE obtained through the sampling process, and z_n is the sampled feature. The parameter estimation process of VAE training is shown in Fig. 6.
VAE training comprises the following steps:
41) use the joint characteristic parameters x_n as the training data of the VAE encoder module and the Bottleneck features b_n as the training data for the decoder module's decoding and reconstruction; in the decoder module of the VAE, use b_n as the control information of the speech spectrum reconstruction process, i.e. splice b_n with the sampled features z_n frame by frame and train the decoder module to reconstruct the speech spectral features;
In the present invention the encoder input layer of the VAE has 171 nodes, followed by two hidden layers: the first has 500 nodes and the second has 64 nodes. In the second hidden layer, the first 32 nodes compute the mean of each Gaussian component and the last 32 nodes compute the variance of each component (the neural network thereby computes a Gaussian distribution that better fits the input signal). The latent space layer contains 32 nodes, the value of each obtained by sampling from the second hidden layer. The decoder module contains one hidden layer of 500 nodes and an output layer of 171 nodes. Except for the latent space layer, which has a linear output, the hidden layers use ReLU activations;
42) according to the variational Bayes principle of the VAE, use the ADAM optimizer to optimize the KL (Kullback-Leibler) divergence and mean squared error in the parameter estimation of the VAE model shown in Fig. 4, adjusting the VAE network weights to obtain the VAE speech spectrum conversion model;
43) input the joint characteristic parameters x_n into the VAE speech spectrum conversion model frame by frame and obtain the latent sampled features z_n through the sampling process.
Put more intuitively, the method modulates the phoneme information z_n carrying the semantic features with the speaker personal characteristics b_n in the decoder module of the VAE.
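The encoder-plus-sampling step can be sketched as follows (a minimal numpy illustration with the 171-500-32 encoder sizes from the text; the log-variance parameterization, the omission of biases and the standard-normal prior for the KL term are assumptions, and the random weights stand in for the trained encoder):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(0.0, x)

# Encoder sizes from the text: 171 -> 500 -> (32 mean + 32 variance nodes).
w1 = rng.standard_normal((171, 500)) * 0.01
w_mu = rng.standard_normal((500, 32)) * 0.01
w_logvar = rng.standard_normal((500, 32)) * 0.01

def encode_and_sample(x_n):
    """Map one frame to the 32-dim latent space and draw z_n with the
    reparameterization trick z = mu + sigma * eps."""
    h = relu(x_n @ w1)
    mu, logvar = h @ w_mu, h @ w_logvar
    eps = rng.standard_normal(mu.shape)
    z_n = mu + np.exp(0.5 * logvar) * eps
    # KL divergence of N(mu, sigma^2) from the N(0, I) prior,
    # summed over latent components (always non-negative)
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return z_n, kl

z_n, kl = encode_and_sample(rng.standard_normal(171))
```

During training the KL term computed here would be minimized together with the decoder's mean squared reconstruction error, as step 42) describes.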
5) Splice the sampled features z_n with the corresponding per-frame speaker classification label features y_n to obtain the training data of the Bottleneck feature mapping network (BP network), and train the mapping network with the per-frame Bottleneck features b_n as supervision; minimize the output error of the mapping network by stochastic gradient descent, obtaining the Bottleneck feature mapping network;
The target-speaker Bottleneck feature mapping network used in the present invention is a BP network whose structure is shown in Fig. 5. The input parameters are z_n + y_n, where z_n is the latent feature of the variational autoencoder and y_n is the label feature of the speakers participating in training; the output is the target speaker's Bottleneck features b_n. Obtaining the Bottleneck feature mapping network comprises the following steps:
51) the sampling features zn of the VAE latent space are spliced with the classification label features yn of the speaker of each corresponding frame as the training data of the Bottleneck feature mapping network. The Bottleneck feature mapping network is a three-layer fully connected feedforward neural network comprising one input layer, one hidden layer and one output layer. The input layer has 37 nodes, of which 32 correspond to the dimension of the sampling feature zn of the VAE model and 5 correspond to the 5-dimensional speaker classification label feature yn formed by the five speakers participating in training; the output layer has 57 nodes, corresponding to the 57-dimensional Bottleneck feature; in between is one hidden layer with 1200 nodes. The activation function of the hidden layer is the sigmoid function, which introduces a nonlinearity, and the output layer is linear;
The expression of the sigmoid function is:
f(x) = 1/(1+e^(-x))
52) according to the minimum mean-squared-error criterion, the stochastic gradient descent algorithm with back-propagation of errors is used to optimize the weights of the Bottleneck feature mapping network, minimizing the error between the Bottleneck features b̂n output by the network and the Bottleneck features bn of each frame, i.e.:
E = (1/N) Σ_(n=1..N) ‖b̂n − bn‖²
The weights of the whole network are optimized, finally yielding a BP mapping network that obtains the target speaker's Bottleneck features b̂n from the sampling features zn and the classification label features yn of the target speaker.
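The forward pass of the mapping network in steps 51)-52) can be sketched in NumPy with the dimensions given above (37-dim input = 32-dim zn plus a 5-dim speaker label, 1200-node sigmoid hidden layer, 57-dim linear output). The weights below are random placeholders standing in for values learned by back-propagated stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

W1 = rng.normal(0, 0.01, (37, 1200))   # input -> sigmoid hidden layer
W2 = rng.normal(0, 0.01, (1200, 57))   # hidden -> linear output

def map_bottleneck(z_n, y_n):
    u = np.concatenate([z_n, y_n], axis=-1)   # splice z_n with speaker label y_n
    return sigmoid(u @ W1) @ W2               # predicted Bottleneck feature b̂_n

z_n = rng.standard_normal((1, 32))            # one frame's latent sampling feature
y_n = np.eye(5)[[2]]                          # one-hot label of the target speaker
b_hat = map_bottleneck(z_n, y_n)              # shape (1, 57)

# Training criterion: mean squared error against the true Bottleneck b_n
b_n = rng.standard_normal((1, 57))
mse = np.mean((b_hat - b_n) ** 2)
```

At conversion time the same forward pass is reused with the target speaker's label in place of the source speaker's.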
The trained DNN network, VAE network and Bottleneck feature mapping network are combined to constitute the speech conversion system based on the VAE and Bottleneck features;
Voice conversion is realized according to the spectrum conversion flow shown in Fig. 2; the voice conversion steps are:
6) the joint feature parameters Xp of the speech of the source speaker to be converted are passed through the encoder module of the VAE model to obtain the sampling features zn of each frame of the latent space z;
The joint feature parameters Xp of the speech to be converted are obtained as follows: the mel-cepstral feature parameters of the speech to be converted are extracted with AHOcoder; on the MATLAB platform, first- and second-order differences of each frame's feature parameters are computed and spliced with the original features; the spliced feature parameters are then spliced in the time domain with the feature parameters of the preceding and following frames to form the joint feature parameters, i.e. the feature parameters Xp of the speech spectrum to be converted.
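The joint-feature construction just described can be sketched in NumPy: each frame's mel-cepstral vector X is extended with first- and second-order differences to Xt = (X, ΔX, Δ²X), then spliced with its neighbouring frames to xn = (Xt-1, Xt, Xt+1). The simple `np.gradient` difference and the edge-frame repetition are illustrative choices, not specified in the text.

```python
import numpy as np

def joint_features(mcep):
    """mcep: (T, D) mel-cepstra -> (T, 9*D) joint feature parameters."""
    d1 = np.gradient(mcep, axis=0)                 # first-order difference ΔX
    d2 = np.gradient(d1, axis=0)                   # second-order difference Δ²X
    xt = np.concatenate([mcep, d1, d2], axis=1)    # X_t = (X, ΔX, Δ²X)
    prev = np.vstack([xt[:1], xt[:-1]])            # X_{t-1} (first frame repeated)
    nxt = np.vstack([xt[1:], xt[-1:]])             # X_{t+1} (last frame repeated)
    return np.concatenate([prev, xt, nxt], axis=1) # x_n = (X_{t-1}, X_t, X_{t+1})

frames = np.random.default_rng(2).standard_normal((100, 19))
x = joint_features(frames)                         # shape (100, 171)
```

With a 19-dimensional mel-cepstrum the nine-fold splicing yields a 171-dimensional joint feature, matching the VAE's 171-node output layer; the base dimension of 19 is inferred from that figure rather than stated in this passage.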
7) the sampling features zn and the classification label features yn of the target speaker are spliced frame by frame and input into the Bottleneck feature mapping network, obtaining the Bottleneck features b̂n of the target speaker;
8) the Bottleneck features b̂n and the sampling features zn are spliced frame by frame, and the decoder module of the VAE model reconstructs the joint feature parameters Xp' of the converted speech;
9) the speech signal is reconstructed with the AHOcoder speech coder.
Reconstructing the speech signal is specifically: the speech feature parameters Xp' obtained after conversion are reduced to mel-cepstral form by removing the time-domain splicing and the difference terms, and the speech coder AHOcoder is then used to synthesize the converted speech.
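Conversion steps 6)-9) reduce to a chain of frame-wise matrix operations. In the schematic NumPy sketch below, three random linear maps stand in for the trained VAE encoder, Bottleneck mapping network and VAE decoder, so only the data flow is illustrated, not a real conversion.

```python
import numpy as np

rng = np.random.default_rng(3)
W_enc = rng.normal(0, 0.1, (171, 32))      # stand-in for the VAE encoder module
W_map = rng.normal(0, 0.1, (37, 57))       # stand-in for the mapping network
W_dec = rng.normal(0, 0.1, (89, 171))      # stand-in for the VAE decoder module

def convert(X_p, y_target):
    """X_p: (T, 171) source joint features; y_target: (5,) speaker one-hot."""
    Z = X_p @ W_enc                                    # step 6: sampling features z_n
    Y = np.repeat(y_target[None, :], len(Z), axis=0)   # target label for every frame
    B_hat = np.concatenate([Z, Y], axis=1) @ W_map     # step 7: target Bottleneck b̂_n
    return np.concatenate([B_hat, Z], axis=1) @ W_dec  # step 8: converted spectra X_p'

X_p = rng.standard_normal((50, 171))
X_conv = convert(X_p, np.eye(5)[1])  # step 9 would strip deltas and pass to AHOcoder
```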
Mel-cepstral distortion (Mel-Cepstrum Distortion, MCD) is an objective measure of conversion quality in voice conversion. The smaller the MCD value between the converted speech and the target speech, the better the conversion performance of the corresponding speech conversion system. Fig. 7 compares the MCD values of converted speech, under different conversion conditions, obtained by VAE models trained on non-parallel corpora with different feature parameters characterizing the speaker's personal characteristics. As can be seen from the figure, the system that uses Bottleneck features to characterize the speaker's individual information performs better than the system that uses speaker labels.
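The MCD comparison above uses the standard formula MCD = (10/ln 10)·sqrt(2·Σ_d (c_d − c′_d)²), averaged over time-aligned frames. A NumPy sketch follows; excluding the 0th (energy) coefficient is a common convention assumed here, not stated in the text.

```python
import numpy as np

def mcd(c_ref, c_conv):
    """Aligned (T, D) mel-cepstra -> mean mel-cepstral distortion in dB."""
    diff = c_ref[:, 1:] - c_conv[:, 1:]    # drop the 0th (energy) coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))

ref = np.random.default_rng(4).standard_normal((100, 20))
print(round(mcd(ref, ref), 3))             # identical sequences give 0.0 dB
```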
The variational autoencoder (VAE) is a generative learning network model in deep learning. Compared with other deep-learning models such as the deep belief network (DBN) and the convolutional neural network (CNN), during training the encoder of a VAE model can learn, via the variational Bayes principle, a probability distribution that fits the original input signal, obtain latent-space features of the original signal through a sampling process, and then reconstruct the original signal from the sampled features through the decoder, so that the error between the reconstructed signal and the original signal is as small as possible (or the difference between their probability distributions is small). This property of the VAE model can be applied to style transfer; in voice conversion, it is believed that the VAE model can isolate, in the latent space, phoneme information that is independent of the speaker's personal characteristics and related to the semantic features, and that this latent-space information, combined with parameters characterizing the speaker's personal characteristics, can be used to reconstruct the speech spectrum signal. In the present invention, the Bottleneck features extracted by the DNN network based on a speaker-recognition task characterize the speaker's personal characteristics; the mapping network obtained by BP-network training establishes the mapping between the joint feature formed by the phoneme information and the speaker label, and the Bottleneck features, so that the target speaker's Bottleneck features are obtained indirectly from the source speaker's speech spectrum features; finally, the decoder module of the VAE reconstructs the converted speech spectrum features from the latent-space phoneme information and the target speaker's Bottleneck features.
Aiming at the problems of the traditional Gaussian-mixture-model conversion method and of methods realizing speech-spectrum conversion with a BP network, namely that a parallel corpus is required and that DTW alignment must be performed before model training, the present invention, exploiting the characteristics of the VAE model, proposes a voice conversion method under non-parallel corpora. The present invention has three key points: first, extracting the Bottleneck features characterizing the speaker's personal characteristics with a DNN network based on a speaker-recognition task; second, establishing with a BP neural network the mapping between the joint feature composed of the sampling feature zn and the speaker classification label feature yn, and the Bottleneck features b̂n; third, reconstructing, with the decoder module of the trained VAE model, the joint feature composed of the Bottleneck features b̂n and the sampling features zn into the converted speech spectrum features.
The innovations of the present invention are: 1. exploiting the property of the VAE model that phoneme information independent of the speaker's personal characteristics and related to the semantic features can be isolated in the latent space, so that voice conversion under non-parallel corpus training is realized, and the method can complete multiple conversion tasks for different speakers with a single model training; 2. using the Bottleneck features extracted from the DNN network based on a speaker-recognition task as the speaker's personal characteristics in the decoder reconstruction process of the VAE model, improving voice conversion performance.
For some medical assistance systems, for example when providing vocal-assistance equipment for patients who cannot phonate normally because of physiological defects or diseases of the vocal organs, some principles of the method involved in the present invention may be adopted. The present invention has good extensibility and provides a concrete solution to problems in voice conversion, including the many-to-many (M2M) voice conversion problem.
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements can also be made without departing from the principle of the present invention, and these improvements should also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A voice conversion method based on a VAE under non-parallel corpus training, characterized by comprising a training step and a voice conversion step:
Training step:
1) the AHOcoder speech coder is used to extract the mel-cepstral feature parameters X of the speech of each speaker participating in training;
2) difference processing is performed on each frame of the extracted mel-cepstral feature parameters X and the result is spliced with the original feature parameters X; the spliced feature parameters Xt obtained in the time domain are then spliced with the feature parameters of the preceding and following frames to form the joint feature parameters xn;
3) the joint feature parameters xn and the speaker classification label features yn are used to train a DNN network, adjusting the network weights to reduce the classification error until the network converges, obtaining a DNN network based on a speaker-recognition task, and the Bottleneck features bn of each frame are extracted;
4) the joint feature parameters xn and the corresponding Bottleneck features bn of each frame are used to train a VAE model until the model training converges, and the sampling features zn of each frame of the VAE latent space z are extracted;
5) the sampling features zn are spliced with the classification label features yn of the speaker of each corresponding frame to form the training data of a Bottleneck feature mapping network, with the Bottleneck features bn of each frame as the supervision information for its training; the output error of the Bottleneck feature mapping network is minimized by a stochastic gradient descent algorithm, obtaining the Bottleneck feature mapping network;
Voice conversion step:
6) the joint feature parameters Xp of the speech to be converted are passed through the encoder module of the VAE model to obtain the sampling features zn of each frame of the latent space z;
7) the sampling features zn and the classification label features yn of the target speaker are spliced frame by frame and input into the Bottleneck feature mapping network, obtaining the Bottleneck features b̂n of the target speaker;
8) the Bottleneck features b̂n and the sampling features zn are spliced frame by frame, and the decoder module of the VAE model reconstructs the joint feature parameters Xp' of the converted speech;
9) the AHOcoder speech coder is used to reconstruct the speech signal.
2. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that in step 1) the mel-cepstral features of the speech of the speakers participating in training are extracted with the AHOcoder speech coder and read into the Matlab platform.
3. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the joint feature parameters in step 2) is specifically: first- and second-order differences of each frame of the extracted feature parameters X are computed and spliced with the original feature parameters X to obtain the feature parameters Xt=(X, ΔX, Δ²X); the feature parameters Xt obtained by splicing in the time domain are then spliced with the feature parameters of the preceding and following frames to form the joint feature parameters xn=(Xt-1, Xt, Xt+1).
4. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that extracting the Bottleneck features bn in step 3) comprises the following steps:
31) the classification label features yn of the speaker corresponding to each frame of the joint feature parameters xn are obtained on the MATLAB platform;
32) unsupervised layer-wise greedy pre-training is performed on the DNN network, with the ReLU function as the activation function of the hidden layers;
33) the output layer of the DNN network is set to a softmax classification output; with the speaker classification label features yn as the supervision information for supervised training of the DNN network, the network weights are adjusted with a stochastic gradient descent algorithm, minimizing the error between the DNN classification output and the speaker classification label features yn until convergence, obtaining the DNN network based on the speaker-recognition task;
34) the joint feature parameters xn are input frame by frame into the DNN network by the feed-forward algorithm, and the activation values of the Bottleneck layer corresponding to each frame, i.e. the Bottleneck features bn corresponding to each frame's mel-cepstral feature parameters, are extracted.
5. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that training the VAE model in step 4) comprises the following steps:
41) the joint feature parameters xn serve as the training data of the encoder module of the VAE model, and the Bottleneck features bn as the training data for the decoding and reconstruction of the decoder module; in the decoder module of the VAE model, the Bottleneck features bn are used as the control information of the speech-spectrum reconstruction process, i.e. the Bottleneck features bn and the sampling features zn are spliced frame by frame and the decoder module of the VAE model is trained to reconstruct the speech spectrum features;
42) the ADAM optimizer is used to optimize the KL divergence and the mean squared error in the parameter estimation of the VAE model, adjusting the network weights of the VAE model to obtain the VAE speech-spectrum conversion model;
43) the joint feature parameters xn are input frame by frame into the VAE speech-spectrum conversion model, and the latent sampling features zn are obtained through the sampling process.
6. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) the sampling features zn of the VAE speech-spectrum conversion model are spliced with the classification label features yn of the speaker of each corresponding frame as the training data of the Bottleneck feature mapping network; the Bottleneck feature mapping network adopts a structure of one input layer, one hidden layer and one output layer, the activation function of the hidden layer is the sigmoid function, and the output layer is linear;
52) according to the minimum mean-squared-error criterion, the stochastic gradient descent algorithm with back-propagation of errors is used to optimize the weights of the Bottleneck feature mapping network, minimizing the error between the Bottleneck features b̂n output by the network and the Bottleneck features bn of each frame.
7. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that obtaining the joint feature parameters Xp of the speech to be converted in step 6) is specifically: the mel-cepstral feature parameters of the speech to be converted are extracted with AHOcoder; on the MATLAB platform, first- and second-order differences of each frame's feature parameters are computed and spliced with the original features; the spliced feature parameters are then spliced in the time domain with the feature parameters of the preceding and following frames to form the joint feature parameters, i.e. the feature parameters Xp of the speech spectrum to be converted.
8. The voice conversion method based on a VAE under non-parallel corpus training according to claim 1, characterized in that reconstructing the speech signal in step 9) is specifically: the speech feature parameters Xp' obtained after conversion are reduced to mel-cepstral form by removing the time-domain splicing and the difference terms, and the AHOcoder speech coder is then used to synthesize the converted speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810393556.XA CN108777140B (en) | 2018-04-27 | 2018-04-27 | Voice conversion method based on VAE under non-parallel corpus training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108777140A true CN108777140A (en) | 2018-11-09 |
CN108777140B CN108777140B (en) | 2020-07-28 |
Family
ID=64026673
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810393556.XA Active CN108777140B (en) | 2018-04-27 | 2018-04-27 | Voice conversion method based on VAE under non-parallel corpus training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108777140B (en) |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder |
CN109377986A (en) * | 2018-11-29 | 2019-02-22 | 四川长虹电器股份有限公司 | A kind of non-parallel corpus voice personalization conversion method |
CN109377978A (en) * | 2018-11-12 | 2019-02-22 | 南京邮电大学 | Multi-to-multi voice conversion method under non-parallel text condition based on i vector |
CN109584893A (en) * | 2018-12-26 | 2019-04-05 | 南京邮电大学 | Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector |
CN109671442A (en) * | 2019-01-14 | 2019-04-23 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN Yu x vector |
CN110033096A (en) * | 2019-03-07 | 2019-07-19 | 北京大学 | A kind of status data generation method and system for intensified learning |
CN110047501A (en) * | 2019-04-04 | 2019-07-23 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on beta-VAE |
CN110060691A (en) * | 2019-04-16 | 2019-07-26 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on i vector sum VARSGAN |
CN110060690A (en) * | 2019-04-04 | 2019-07-26 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN and ResNet |
CN110070895A (en) * | 2019-03-11 | 2019-07-30 | 江苏大学 | A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition |
CN110085254A (en) * | 2019-04-22 | 2019-08-02 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on beta-VAE and i-vector |
CN110164463A (en) * | 2019-05-23 | 2019-08-23 | 北京达佳互联信息技术有限公司 | A kind of phonetics transfer method, device, electronic equipment and storage medium |
CN110211575A (en) * | 2019-06-13 | 2019-09-06 | 苏州思必驰信息科技有限公司 | Voice for data enhancing adds method for de-noising and system |
CN110648658A (en) * | 2019-09-06 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Method and device for generating voice recognition model and electronic equipment |
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111724809A (en) * | 2020-06-15 | 2020-09-29 | 苏州意能通信息技术有限公司 | Vocoder implementation method and device based on variational self-encoder |
CN112309365A (en) * | 2020-10-21 | 2021-02-02 | 北京大米科技有限公司 | Training method and device of speech synthesis model, storage medium and electronic equipment |
CN112382271A (en) * | 2020-11-30 | 2021-02-19 | 北京百度网讯科技有限公司 | Voice processing method, device, electronic equipment and storage medium |
CN113032558A (en) * | 2021-03-11 | 2021-06-25 | 昆明理工大学 | Variational semi-supervised hundred-degree encyclopedia classification method fusing wiki knowledge |
CN113299267A (en) * | 2021-07-26 | 2021-08-24 | 北京语言大学 | Voice stimulation continuum synthesis method and device based on variational self-encoder |
WO2021212954A1 (en) * | 2020-04-21 | 2021-10-28 | 升智信息科技(南京)有限公司 | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources |
CN113571039A (en) * | 2021-08-09 | 2021-10-29 | 北京百度网讯科技有限公司 | Voice conversion method, system, electronic equipment and readable storage medium |
CN113763924A (en) * | 2021-11-08 | 2021-12-07 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
CN113763987A (en) * | 2021-09-06 | 2021-12-07 | 中国科学院声学研究所 | Training method and device of voice conversion model |
CN114360557A (en) * | 2021-12-22 | 2022-04-15 | 北京百度网讯科技有限公司 | Voice tone conversion method, model training method, device, equipment and medium |
WO2022083083A1 (en) * | 2020-10-21 | 2022-04-28 | 南京硅基智能科技有限公司 | Sound conversion system and training method for same |
US11854562B2 (en) | 2019-05-14 | 2023-12-26 | International Business Machines Corporation | High-quality non-parallel many-to-many voice conversion |
WO2024069726A1 (en) * | 2022-09-27 | 2024-04-04 | 日本電信電話株式会社 | Learning device, conversion device, training method, conversion method, and program |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04248598A (en) * | 1990-10-30 | 1992-09-04 | Internatl Business Mach Corp <Ibm> | Device and method of editing composite sound information |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN103258531A (en) * | 2013-05-29 | 2013-08-21 | 安宁 | Harmonic wave feature extracting method for irrelevant speech emotion recognition of speaker |
CN104123933A (en) * | 2014-08-01 | 2014-10-29 | 中国科学院自动化研究所 | Self-adaptive non-parallel training based voice conversion method |
CN104361620A (en) * | 2014-11-27 | 2015-02-18 | 韩慧健 | Mouth shape animation synthesis method based on comprehensive weighted algorithm |
WO2016207978A1 (en) * | 2015-06-23 | 2016-12-29 | 株式会社大入 | Method and device for manufacturing book with audio, and method and device for reproducing acoustic waveform |
US20170069306A1 (en) * | 2015-09-04 | 2017-03-09 | Foundation of the Idiap Research Institute (IDIAP) | Signal processing method and apparatus based on structured sparsity of phonological features |
CN106778700A (en) * | 2017-01-22 | 2017-05-31 | 福州大学 | One kind is based on change constituent encoder Chinese Sign Language recognition methods |
CN107274029A (en) * | 2017-06-23 | 2017-10-20 | 深圳市唯特视科技有限公司 | A kind of future anticipation method of interaction medium in utilization dynamic scene |
CN107301859A (en) * | 2017-06-21 | 2017-10-27 | 南京邮电大学 | Phonetics transfer method under the non-parallel text condition clustered based on adaptive Gauss |
Non-Patent Citations (1)
Title |
---|
LI YANPING: "An adaptive frequency scale transform suitable for speaker identification", Journal of Nanjing University of Science and Technology * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108777140A (en) | Phonetics transfer method based on VAE under a kind of training of non-parallel corpus | |
CN112767958B (en) | Zero-order learning-based cross-language tone conversion system and method | |
CN102800316B (en) | Optimal codebook design method for a voiceprint recognition system based on a neural network | |
CN110060690A (en) | Many-to-many voice conversion method based on StarGAN and ResNet | |
CN109671442A (en) | Many-to-many voice conversion method based on StarGAN and x-vector | |
Luo et al. | Emotional voice conversion using deep neural networks with MCC and F0 features | |
Casale et al. | Multistyle classification of speech under stress using feature subset selection based on genetic algorithms | |
CN109671423A (en) | Non-parallel-text voice conversion method under limited training data | |
CN110060657A (en) | Many-to-many voice conversion method based on SN | |
Azizah et al. | Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages | |
Liu et al. | Voice conversion with transformer network | |
Lai et al. | Phone-aware LSTM-RNN for voice conversion | |
Moon et al. | Mist-tacotron: End-to-end emotional speech synthesis using mel-spectrogram image style transfer | |
Wang et al. | A study on acoustic modeling for child speech based on multi-task learning | |
Bi et al. | Deep feed-forward sequential memory networks for speech synthesis | |
Latif et al. | Generative emotional AI for speech emotion recognition: The case for synthetic emotional speech augmentation | |
Guo et al. | Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training | |
Choi et al. | A melody-unsupervision model for singing voice synthesis | |
Kang et al. | Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion | |
CN113470622A (en) | Method and device for converting an arbitrary voice into multiple voices | |
Zhao et al. | Research on voice cloning with a few samples | |
Wu et al. | Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression | |
Li et al. | Many-to-many voice conversion based on bottleneck features with variational autoencoder for non-parallel training data | |
Nazir et al. | Deep learning end to end speech synthesis: A review | |
Kaushik et al. | End-to-end speaker age and height estimation using attention mechanism and triplet loss |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||