CN103413548B - Voice conversion method based on joint spectral modeling with restricted Boltzmann machines - Google Patents


Publication number
CN103413548B
CN103413548B (application CN201310360234.2A)
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310360234.2A
Other languages
Chinese (zh)
Other versions
CN103413548A (en)
Inventor
刘利娟 (Liu Lijuan)
陈凌辉 (Chen Linghui)
凌震华 (Ling Zhenhua)
戴礼荣 (Dai Lirong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority claimed from application CN201310360234.2A
Publication of CN103413548A
Application granted
Publication of CN103413548B

Landscapes

  • Complex Calculations (AREA)

Abstract

A voice conversion method based on joint spectral modeling with restricted Boltzmann machines. Its steps are: extract the spectral-envelope features of the speech, extract the high-level spectral features of the speech, perform dynamic time warping, train a GMM model, partition the joint spectral-envelope features into acoustic subspaces, train a Gaussian-Bernoulli RBM model or a Gaussian-Gaussian RBM model, convert the spectrum, and synthesize the converted speech. The invention increases the accuracy of spectral modeling and improves the sound quality and naturalness of the converted speech.

Description

Voice conversion method based on joint spectral modeling with restricted Boltzmann machines
Technical field
The present invention relates to voice conversion in speech synthesis, and specifically to a voice conversion method based on joint spectral modeling with a restricted Boltzmann machine (RBM).
Background art
The object of voice conversion (also known as speech conversion) is to transform the speech of one speaker (the source speaker) so that it sounds as if it were uttered by another speaker (the target speaker), while keeping the semantic content of the speech unchanged. At present, joint spectral modeling based on Gaussian mixture models (GMM) (see Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131-142, Mar. 1998) is the mainstream approach to voice conversion. Its basic principle is, in the training stage, to fit the joint probability distribution of the source and target spectral features with a mixture of Gaussian distributions according to the maximum-likelihood criterion. In the conversion stage, the source speaker's spectrum is converted according to the maximum-conditional-probability output criterion, and finally the converted spectrum and the converted fundamental frequency are fed into a speech vocoder to synthesize the converted speech. Voice conversion with GMM modeling can produce speech with a certain degree of intelligibility and similarity, and the introduction of dynamic parameters (see T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, Nov. 2007) has also clearly improved the continuity of the converted speech. However, this model suffers from a severe over-smoothing effect, which makes the overall sound quality of the converted speech unsatisfactory.
The insufficient spectral modeling of the traditional GMM-based voice conversion method is the main cause of the above problems, specifically the following two points:
(1) Using high-level spectral features as training features loses the fine detail of the spectrum. GMM training generally uses high-level spectral features, such as mel-cepstral parameters or line-spectral-pair parameters. These are high-level representations of the spectrum with low dimensionality, which eases model training, but their extraction discards much spectral detail;
(2) GMM-based joint spectral modeling cannot adequately model the relations between spectral dimensions. A GMM uses covariance matrices to characterize the relations between the dimensions of the spectral features and between the source and target speaker features. However, model training involves a large number of matrix multiplications and inversions; limited by computational complexity and numerical precision, training with full covariance matrices is often infeasible, so the covariance matrices are usually simplified to diagonal form. The model then cannot fully express the relations between spectral dimensions, and the conversion of the source speaker's feature parameters suffers in two ways. On the one hand, the converted spectrum obtained under the maximum-conditional-probability output criterion is directly related to the mean of the conditional Gaussian distribution:
E_{m,t}^(y) = μ_m^(y) + Σ_m^(yx) (Σ_m^(xx))^(−1) (x_t − μ_m^(x))    (1)
Because the cross-covariance Σ_m^(yx), which characterizes the relation between source and target, captures little under the diagonal simplification, the correction term Σ_m^(yx) (Σ_m^(xx))^(−1) (x_t − μ_m^(x)) becomes very small, and the converted spectrum is approximately equal to the mean μ_m^(y) of the target speaker's feature distribution, i.e. a weighted average of the training samples; much of the spectral detail is thus smoothed away. On the other hand, consecutive frames belonging to the same phoneme differ very little in their acoustic features, so they are all converted with the Gaussian parameters of the same acoustic subspace. By the analysis above, their converted spectra are all approximately equal to the target mean of that acoustic subspace, so the spectrum is over-smoothed in the time domain as well. In summary, the converted speech is over-smoothed in both the frequency and the time domain, which makes it sound muffled and ultimately degrades the sound quality of the synthesized speech.
Summary of the invention
The problem solved by the present invention: to alleviate the over-smoothing problem of existing voice conversion methods, a voice conversion method based on joint spectral modeling with restricted Boltzmann machines is provided, which increases the accuracy of spectral modeling and improves the sound quality and naturalness of the converted speech.
The object of the invention is achieved by the following measures:
Technical solution one of the present invention: a voice conversion method based on joint spectral modeling with restricted Boltzmann machines, with the following steps:
Step 1: extract the spectral-envelope features of the speech
(1) Analyze the corpora of the source and target speakers frame by frame with the STRAIGHT analysis/synthesis vocoder, obtaining the fundamental-frequency sequences and the static spectral-envelope features x_t^SPE and y_t^SPE, where x_t^SPE and y_t^SPE are the t-th frame static spectral-envelope vectors of the source and target speaker respectively, of dimensionality 513, and T_1 and T_2 are the numbers of source and target feature frames;
(2) From the static spectral-envelope features, obtain the first-order dynamic spectral-envelope features Δx_t^SPE and Δy_t^SPE according to formulas (2)(3), and the second-order dynamic spectral-envelope features Δ²x_t^SPE and Δ²y_t^SPE according to formulas (4)(5):
Δc_t = 0.5·c_{t+1} − 0.5·c_{t−1}, ∀t ∈ [2, T−1]    (2)
Δc_1 = Δc_2, Δc_T = Δc_{T−1}    (3)
Δ²c_t = c_{t+1} − 2c_t + c_{t−1}, ∀t ∈ [2, T−1]    (4)
Δ²c_1 = Δ²c_2, Δ²c_T = Δ²c_{T−1}    (5)
where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames of the feature sequence, and c_t is the t-th frame feature vector;
(3) Splice the static and dynamic features together to obtain the source speaker's spectral-envelope feature sequence X^SPE = [X_1^SPE^T, X_2^SPE^T, ..., X_{T_1}^SPE^T]^T, where the t-th frame spectral-envelope feature is X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T and [·]^T denotes vector transposition. Likewise splice the target features to obtain the target speaker's spectral-envelope feature sequence Y^SPE = [Y_1^SPE^T, Y_2^SPE^T, ..., Y_{T_2}^SPE^T]^T, where the t-th frame spectral-envelope feature is Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T.
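As an illustrative sketch (not part of the patent text), the dynamic-feature computation and splicing of formulas (2)-(5) can be expressed in a few lines of NumPy; the function name and array layout below are assumptions:

```python
import numpy as np

def add_deltas(c):
    """Append first- and second-order dynamic features to a static
    feature sequence c of shape (T, D), following Eqs. (2)-(5):
    delta_c_t  = 0.5*c_{t+1} - 0.5*c_{t-1}   (interior frames)
    delta2_c_t = c_{t+1} - 2*c_t + c_{t-1}   (interior frames)
    with boundary frames copied from their neighbors."""
    T = c.shape[0]
    d1 = np.zeros_like(c)
    d2 = np.zeros_like(c)
    d1[1:T-1] = 0.5 * c[2:] - 0.5 * c[:T-2]          # Eq. (2)
    d2[1:T-1] = c[2:] - 2.0 * c[1:T-1] + c[:T-2]     # Eq. (4)
    d1[0], d1[T-1] = d1[1], d1[T-2]                  # Eq. (3)
    d2[0], d2[T-1] = d2[1], d2[T-2]                  # Eq. (5)
    return np.hstack([c, d1, d2])                    # (T, 3D) spliced features
```

The same helper would be applied to both the 513-dimensional spectral-envelope frames and the mel-cepstral frames of Step 2.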
Step 2: extract the high-level spectral features of the speech
(1) On the basis of the static spectral-envelope features, further extract a high-level spectral feature for every speech frame, here a 40th-order mel-cepstral feature, obtaining the static high-level spectral features x_t^MCEP and y_t^MCEP of the source and target speakers;
(2) From these, obtain the first-order dynamic high-level spectral features according to formulas (2)(3) and the second-order dynamic high-level spectral features according to formulas (4)(5);
(3) Splice them together to obtain the source speaker's high-level spectral feature sequence X^MCEP = [X_1^MCEP^T, X_2^MCEP^T, ..., X_{T_1}^MCEP^T]^T, where the t-th frame high-level spectral feature is X_t^MCEP = [x_t^MCEP^T, Δx_t^MCEP^T, Δ²x_t^MCEP^T]^T. Likewise obtain the target speaker's high-level spectral feature sequence Y^MCEP = [Y_1^MCEP^T, Y_2^MCEP^T, ..., Y_{T_2}^MCEP^T]^T, where the t-th frame high-level spectral feature is Y_t^MCEP = [y_t^MCEP^T, Δy_t^MCEP^T, Δ²y_t^MCEP^T]^T.
Step 3: dynamic time warping
(1) Compute the alignment function between X^MCEP and Y^MCEP with the dynamic time warping (DTW) algorithm, align X^MCEP and Y^MCEP according to this function, and splice the aligned X^MCEP and Y^MCEP to obtain the joint high-level spectral feature sequence Z^MCEP = [Z_1^MCEP^T, Z_2^MCEP^T, ..., Z_T^MCEP^T]^T, where the t-th frame joint high-level spectral feature is Z_t^MCEP = [X_t^MCEP^T, Y_t^MCEP^T]^T and T is the number of frames after alignment;
(2) Align X^SPE and Y^SPE with the alignment function obtained in (1), and splice the aligned X^SPE and Y^SPE to obtain the joint spectral-envelope feature sequence Z^SPE = [Z_1^SPE^T, Z_2^SPE^T, ..., Z_T^SPE^T]^T, where the t-th frame joint spectral-envelope feature is Z_t^SPE = [X_t^SPE^T, Y_t^SPE^T]^T;
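The frame alignment in this step is standard dynamic time warping. A minimal, unoptimized DTW sketch under the assumption of Euclidean frame distances (the function name and return format are illustrative, not from the patent):

```python
import numpy as np

def dtw_align(x, y):
    """Minimal DTW between feature sequences x (Tx, D) and y (Ty, D).
    Returns the warping path as a list of (i, j) frame-index pairs,
    which would then be used to align and splice X^MCEP/Y^MCEP and
    X^SPE/Y^SPE into the joint sequences."""
    Tx, Ty = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):          # accumulate minimal path cost
        for j in range(1, Ty + 1):
            D[i, j] = dist[i-1, j-1] + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    path, i, j = [], Tx, Ty             # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i-1, j-1], D[i-1, j], D[i, j-1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Production systems typically add path constraints and slope weights; this sketch shows only the core recursion.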
Step 4: GMM model training
Using the joint high-level spectral features Z^MCEP obtained in the previous step, train a GMM model with the EM algorithm according to the maximum-likelihood criterion, obtaining the model parameters λ_GMM = {ω_m, μ_m, Σ_m, m = 1, ..., M}, where M is the number of Gaussian mixture components in the GMM and ω_m, μ_m, Σ_m are the weight, mean vector, and covariance matrix of the m-th mixture component respectively;
Step 5: acoustic-subspace partition of the joint spectral-envelope features
After GMM training is complete, use the obtained GMM parameters λ_GMM to partition the joint high-level spectral features Z^MCEP into acoustic subspaces according to the maximum-posterior criterion, obtaining the index sequence m = [m_1, m_2, ..., m_t, ..., m_T] composed of the acoustic-subspace index of every frame of Z^MCEP (formula (6));
Partition the joint spectral-envelope features Z^SPE into acoustic subspaces according to the index sequence m, grouping together the joint spectral-envelope feature frames that share the same subspace index, as the training feature set of the Gaussian-Bernoulli RBM model of that acoustic subspace;
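Steps 4 and 5 together amount to fitting a GMM by EM under the maximum-likelihood criterion and then labeling every frame with its maximum-posterior mixture component. A minimal diagonal-covariance sketch in NumPy, with invented function and variable names (the real input would be the DTW-aligned joint MCEP sequence):

```python
import numpy as np

def train_gmm_diag(Z, M, n_iter=50):
    """Minimal EM for a diagonal-covariance GMM (Step 4), followed by
    maximum-posterior frame labeling (Step 5).  Z: (T, D) joint feature
    frames; M: number of mixture components / acoustic subspaces.
    Returns (weights, means, variances, labels)."""
    T, D = Z.shape
    w = np.full(M, 1.0 / M)
    mu = Z[np.linspace(0, T - 1, M).astype(int)].copy()  # simple deterministic init
    var = np.tile(Z.var(axis=0), (M, 1)) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities gamma_{t,m} proportional to w_m N(z_t; mu_m, var_m)
        logp = (-0.5 * (((Z[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        Nm = gamma.sum(axis=0) + 1e-10
        w = Nm / T
        mu = (gamma.T @ Z) / Nm[:, None]
        var = (gamma.T @ (Z ** 2)) / Nm[:, None] - mu ** 2 + 1e-6
    labels = gamma.argmax(axis=1)   # formula (6): argmax_m P(m | z_t, lambda)
    return w, mu, var, labels
```

The per-subspace RBM training sets of Step 5 are then simply `Z[labels == m]` for each subspace index m.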
Step 6: Gaussian-Bernoulli RBM model training
Because the spectral-envelope feature values are continuous real numbers, each visible-layer node is assumed to follow a continuous probability distribution so that the feature distribution can be described more accurately; here a Gaussian distribution is assumed, while the hidden nodes are assumed to follow a binary {0, 1} (Bernoulli) distribution;
According to the partition result of Step 5, an RBM model is trained independently for each acoustic subspace. The energy function of the Gaussian-Bernoulli RBM is:
E(v, h) = Σ_{i=1}^{V} (v_i − a_i)² / (2σ_i²) − Σ_{j=1}^{H} b_j h_j − Σ_{i=1}^{V} Σ_{j=1}^{H} (v_i/σ_i) w_ij h_j    (7)
where the variable v = [v_1, v_2, ..., v_V]^T corresponds to the visible-layer nodes of the RBM, V being their number, and the variable h = [h_1, h_2, ..., h_H]^T corresponds to the hidden nodes, H being their number. θ = {W, a, b} are the model parameters: W = {w_ij}_{V×H}, where w_ij is the connection weight between visible node v_i and hidden node h_j, and a = [a_1, a_2, ..., a_V]^T and b = [b_1, b_2, ..., b_H]^T are bias parameters. The variance σ_i² of visible node v_i is fixed during training and not updated; for notational convenience it is set to 1 here. The joint probability distribution of the visible layer v and the hidden nodes h is defined as:
P(v, h) = exp(−E(v, h)) / Z    (8)
where the partition function is Z = Σ_h ∫ exp(−E(v, h)) dv.
From formulas (7)(8), the marginal distribution of the visible layer is obtained:
log P(v) = −log Z − Σ_{i=1}^{V} (v_i − a_i)²/2 + Σ_{j=1}^{H} log(1 + exp(b_j + v^T w_j))    (10)
Using the training data of each acoustic subspace obtained in Step 5, estimate the model parameters with the contrastive divergence (CD) algorithm according to the maximum-likelihood criterion (see G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002), where {W_m, a_m, b_m} are the model parameters of the m-th Gaussian-Bernoulli RBM;
Step 7: spectral conversion
(1) In the conversion stage, extract the static spectral-envelope features of the speech to be converted, obtain their first- and second-order dynamic spectral-envelope features according to formulas (2)(3) and (4)(5), and splice the static, first-order, and second-order features together to obtain the spectral-envelope features to be converted; the t-th frame is denoted X_t^SPE. On the basis of the static spectral-envelope features, extract the static high-level spectral features, obtain their first- and second-order dynamics according to formulas (2)(3) and (4)(5), and splice them together to obtain the high-level spectral features of the speech to be converted; the t-th frame is denoted X_t^MCEP. The acoustic-subspace index m of the t-th frame is computed according to the maximum-posterior criterion:
m = argmax_m P(m | X_t^MCEP, λ_GMM)    (11)
(2) The spectral-envelope feature X_t^SPE to be converted is transformed according to the maximum-conditional-probability output criterion; the converted spectral-envelope feature is:
Ỹ_t^SPE = argmax_{Y_t^SPE} P(Y_t^SPE | X_t^SPE) = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE) / P(X_t^SPE)    (12)
which can be further simplified to:
Ỹ_t^SPE = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE)    (13)
Since formula (13) has no closed-form solution, the converted spectral-envelope parameters are obtained by gradient search; the update formula of the gradient algorithm is:
Y_t^SPE(i+1) = Y_t^SPE(i) + α · ∂log P(X_t^SPE, Y_t^SPE)/∂Y_t^SPE |_{Y_t^SPE = Y_t^SPE(i)}    (14)
where i is the iteration index and α is the step size. From formula (10), the partial derivative of log P(X_t^SPE, Y_t^SPE) with respect to Y_t^SPE is:
∂log P(X_t^SPE, Y_t^SPE)/∂Y_t^SPE = −(Y_t^SPE − a_m^(y)) + Σ_{j=1}^{H} [exp(b_{m,j} + v_t^T w_{m,j}) / (1 + exp(b_{m,j} + v_t^T w_{m,j}))] · w_{m,j}^(y)    (15)
where a_m, b_m = [b_{m,1}, ..., b_{m,j}, ..., b_{m,H}]^T, and W_m = [w_{m,1}, ..., w_{m,j}, ..., w_{m,H}]_{V×H} are the parameters of the m-th Gaussian-Bernoulli RBM model, w_{m,j} is the j-th column of the matrix W_m, and a_m^(y) and w_{m,j}^(y) are the entries of a_m and w_{m,j} corresponding to the target features.
The mode of the RBM model (see Z. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis," in Proc. ICASSP, 2013) is adopted as the initial value of the search algorithm;
In the log domain, log P(X_t^SPE, Y_t^SPE) contains the function term f(x) = log(1 + exp(x)). When |x| > 4, f*(x) approximates f(x) accurately:
f*(x) = x if x ≥ 0, 0 if x < 0    (16)
With this approximation, formula (13) can be solved directly, and the converted spectral-envelope feature is:
Ỹ_t^SPE = a_m^(y) + Σ_{j: b_{m,j} + v^T w_{m,j} > 0} w_{m,j}^(y)    (17)
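The approximate closed-form conversion of formula (17) can be sketched as follows; the partition of the parameters into source and target parts, the initialization of the target half of the visible vector, and all names are illustrative assumptions:

```python
import numpy as np

def convert_frame(x, y_init, a, b, W, Dx):
    """Approximate conversion per Eq. (17): with f(t) = log(1+exp(t))
    replaced by max(t, 0), the optimal target part of the visible
    vector is the target bias plus the weight columns of every hidden
    unit whose input b_j + v^T w_j is positive.
    x:      source spectral-envelope frame (length Dx)
    y_init: starting guess for the target part (e.g. the search
            initialization of the RBM mode)
    a, b, W: visible biases, hidden biases, weights of the subspace RBM
    Dx:     source dimensionality (v = [x; y])."""
    v = np.concatenate([x, y_init])
    act = b + v @ W                      # hidden-unit inputs b_j + v^T w_j
    on = act > 0                         # hidden units treated as "on"
    a_y = a[Dx:]                         # target part of the visible bias
    W_y = W[Dx:, :]                      # target rows of the weight matrix
    return a_y + W_y[:, on].sum(axis=1)  # Eq. (17)
```

Each "on" hidden unit contributes its target weight column, which is what lets different frames of the same subspace receive different converted spectra.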
Step 8: synthesize the converted speech
Finally, feed the converted fundamental-frequency sequence and the converted spectral-envelope feature sequence obtained in Step 7 into the STRAIGHT vocoder to synthesize the converted speech.
Technical solution two of the present invention: a voice conversion method based on joint spectral modeling with restricted Boltzmann machines, with the following steps:
Step 1: extract the spectral-envelope features of the speech
(1) Analyze the corpora of the source and target speakers frame by frame with the STRAIGHT analysis/synthesis vocoder, obtaining the fundamental-frequency sequences and the static spectral-envelope features x_t^SPE and y_t^SPE, where x_t^SPE and y_t^SPE are the t-th frame static spectral-envelope vectors of the source and target speaker respectively, of dimensionality 513, and T_1 and T_2 are the numbers of source and target feature frames;
(2) From the static spectral-envelope features, obtain the first-order dynamic spectral-envelope features according to formulas (2)(3) and the second-order dynamic spectral-envelope features according to formulas (4)(5):
Δc_t = 0.5·c_{t+1} − 0.5·c_{t−1}, ∀t ∈ [2, T−1]    (2)
Δc_1 = Δc_2, Δc_T = Δc_{T−1}    (3)
Δ²c_t = c_{t+1} − 2c_t + c_{t−1}, ∀t ∈ [2, T−1]    (4)
Δ²c_1 = Δ²c_2, Δ²c_T = Δ²c_{T−1}    (5)
where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames of the feature sequence, and c_t is the t-th frame feature vector;
(3) Splice the static and dynamic features together to obtain the source speaker's spectral-envelope feature sequence X^SPE = [X_1^SPE^T, X_2^SPE^T, ..., X_{T_1}^SPE^T]^T, where the t-th frame spectral-envelope feature is X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T and [·]^T denotes vector transposition. Likewise splice the target features to obtain the target speaker's spectral-envelope feature sequence Y^SPE = [Y_1^SPE^T, Y_2^SPE^T, ..., Y_{T_2}^SPE^T]^T, where the t-th frame spectral-envelope feature is Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T;
Step 2: extract the high-level spectral features of the speech
(1) On the basis of the static spectral-envelope features, further extract a high-level spectral feature for every speech frame, here a 40th-order mel-cepstral feature, obtaining the static high-level spectral features x_t^MCEP and y_t^MCEP of the source and target speakers;
(2) From these, obtain the first-order dynamic high-level spectral features according to formulas (2)(3) and the second-order dynamic high-level spectral features according to formulas (4)(5);
(3) Splice them together to obtain the source speaker's high-level spectral feature sequence X^MCEP = [X_1^MCEP^T, X_2^MCEP^T, ..., X_{T_1}^MCEP^T]^T, where the t-th frame high-level spectral feature is X_t^MCEP = [x_t^MCEP^T, Δx_t^MCEP^T, Δ²x_t^MCEP^T]^T. Likewise obtain the target speaker's high-level spectral feature sequence Y^MCEP = [Y_1^MCEP^T, Y_2^MCEP^T, ..., Y_{T_2}^MCEP^T]^T, where the t-th frame high-level spectral feature is Y_t^MCEP = [y_t^MCEP^T, Δy_t^MCEP^T, Δ²y_t^MCEP^T]^T.
Step 3: dynamic time warping
(1) Compute the alignment function between X^MCEP and Y^MCEP with the dynamic time warping (DTW) algorithm, align X^MCEP and Y^MCEP according to this function, and splice the aligned X^MCEP and Y^MCEP to obtain the joint high-level spectral feature sequence Z^MCEP = [Z_1^MCEP^T, Z_2^MCEP^T, ..., Z_T^MCEP^T]^T, where the t-th frame joint high-level spectral feature is Z_t^MCEP = [X_t^MCEP^T, Y_t^MCEP^T]^T and T is the number of frames after alignment;
(2) Align X^SPE and Y^SPE with the alignment function obtained in (1), and splice the aligned X^SPE and Y^SPE to obtain the joint spectral-envelope feature sequence Z^SPE, where the t-th frame joint spectral-envelope feature is Z_t^SPE = [X_t^SPE^T, Y_t^SPE^T]^T;
Step 4: GMM model training
Using the joint high-level spectral features Z^MCEP obtained in the previous step, train a GMM model with the EM algorithm according to the maximum-likelihood criterion, obtaining the model parameters λ_GMM = {ω_m, μ_m, Σ_m, m = 1, ..., M}, where M is the number of Gaussian mixture components in the GMM and ω_m, μ_m, Σ_m are the weight, mean vector, and covariance matrix of the m-th mixture component respectively;
Step 5: acoustic-subspace partition of the joint spectral-envelope features
After GMM training is complete, use the obtained GMM parameters λ_GMM to partition the joint high-level spectral features Z^MCEP into acoustic subspaces according to the maximum-posterior criterion, obtaining the index sequence m = [m_1, m_2, ..., m_t, ..., m_T] composed of the acoustic-subspace index of every frame of Z^MCEP:
m = argmax_m P(m | Z^MCEP, λ_GMM)    (6)
Partition the joint spectral-envelope features Z^SPE into acoustic subspaces according to the index sequence m, grouping together the joint spectral-envelope feature frames that share the same subspace index, as the training feature set of the Gaussian-Gaussian RBM model of that acoustic subspace;
Step 6: Gaussian-Gaussian RBM model training
In the Gaussian-Gaussian RBM model the visible-layer node variables are likewise assumed to follow Gaussian distributions, but unlike the Gaussian-Bernoulli RBM model no hidden nodes are used (the number of hidden nodes is set to 0); instead, a full connection is established directly between the source speaker's visible nodes x and the target speaker's visible nodes y. The connection matrix is W = {w_ij}_{D×D}, where D is the feature-vector dimensionality and w_ij is the connection weight between source visible node x_i and target visible node y_j. The corresponding energy function is:
E(x, y) = Σ_{i=1}^{D} x_i²/(2σ_i²) + Σ_{j=1}^{D} y_j²/(2σ_j²) − Σ_{i=1}^{D} Σ_{j=1}^{D} (x_i/σ_i) w_ij (y_j/σ_j)    (18)
where θ = {W} are the model parameters; the variances σ_i² and σ_j² of the source and target feature nodes are fixed to 1 here. The joint probability distribution of x and y is then:
P(x, y) = exp(−E(x, y)) / Z    (19)
where the partition function is Z = ∫∫ exp(−E(x, y)) dx dy    (20)
and the full weight matrix over the joint visible vector [x; y] has the block form:
W̄ = [0, W; W^T, 0]    (21)
Using the joint spectral-envelope feature sets of the acoustic subspaces obtained in Step 5, train the corresponding Gaussian-Gaussian RBM models with the CD algorithm according to the maximum-likelihood criterion (see G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002), obtaining the parameters {W_m}, where W_m is the model parameter of the m-th Gaussian-Gaussian RBM;
Step 7: spectral conversion
(1) In the conversion stage, extract the static spectral-envelope features of the speech to be converted, obtain their first- and second-order dynamic spectral-envelope features according to formulas (2)(3) and (4)(5), and splice the static, first-order, and second-order features together to obtain the spectral-envelope features to be converted; the t-th frame is denoted X_t^SPE. On the basis of the static spectral-envelope features, extract the static high-level spectral features, obtain their first- and second-order dynamics according to formulas (2)(3) and (4)(5), and splice them together to obtain the high-level spectral features of the speech to be converted; the t-th frame is denoted X_t^MCEP. The acoustic-subspace index m of the t-th frame is computed according to the maximum-posterior criterion:
m = argmax_m P(m | X_t^MCEP, λ_GMM)    (22)
(2) The spectral-envelope feature X_t^SPE to be converted is transformed according to the maximum-conditional-probability output criterion (as in formulas (12)(13)). From formulas (18)(19)(20), setting
∂log P(X_t^SPE, Y_t^SPE) / ∂Y_t^SPE = 0    (23)
yields the converted spectral-envelope parameters:
Ỹ_t^SPE = W_m^T X_t^SPE    (24)
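Formula (24) reduces conversion under the Gaussian-Gaussian RBM to a single matrix product per frame. A tiny numeric sketch (the weight matrix below is invented for illustration, not learned):

```python
import numpy as np

# With unit variances and no hidden layer, setting the gradient of
# log P(X, Y) with respect to Y to zero (Eq. (23)) gives the linear
# closed-form conversion Y = W_m^T X of Eq. (24), using the weight
# matrix W_m of the frame's acoustic subspace.
W_m = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0],
                [0.0, 0.0, 0.5]])    # illustrative subspace weights
x_t = np.array([1.0, 1.0, 2.0])     # source spectral-envelope frame
y_t = W_m.T @ x_t                   # Eq. (24): converted frame
```

Unlike the GMM conditional mean, W_m is learned jointly over all feature dimensions, so the converted frame varies with the source frame instead of collapsing to a subspace average.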
Step 8: synthesize the converted speech
Finally, feed the converted fundamental-frequency sequence and the converted spectral-envelope feature sequence obtained in Step 7 into the STRAIGHT vocoder to synthesize the converted speech.
Principle of the present invention:
(1) The spectral-envelope features extracted directly by the STRAIGHT vocoder are used for spectral modeling instead of high-level spectral features. STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) is a high-quality speech vocoder that effectively separates the speech source from the filter, decomposing the speech signal into the fundamental frequency and a spectral envelope from which the influence of the fundamental frequency has been removed. Traditional spectral modeling methods extract high-level spectral features from the raw spectral envelope. In the modeling method proposed by the present invention, no high-level feature extraction is performed; the spectral envelope itself is used as the modeling feature, so as to reduce the loss of spectral detail in feature extraction;
(2) Within the GMM framework, a restricted Boltzmann machine replaces the single Gaussian distribution to model the joint spectral features of each acoustic subspace. Compared with high-level spectral features, spectral-envelope features make spectral modeling harder: on the one hand they have higher dimensionality, and on the other hand they have stronger correlations between dimensions. The restricted Boltzmann machine has two advantages here. First, it is not prone to over-fitting and, compared with a single Gaussian distribution, can better model the correlations between the dimensions of high-dimensional features; second, in a restricted Boltzmann machine the feature with maximum output probability is no longer the sample mean, which effectively alleviates the over-smoothing of the converted features. We therefore use a restricted Boltzmann machine (RBM) instead of the traditional single Gaussian distribution to describe the output probability of the spectral features in each acoustic subspace.
Compared with the prior art, the advantages of the present invention, i.e. the beneficial effects of realizing voice conversion with the proposed RBM-based joint spectral modeling of source and target speakers, are mainly the following three points:
(1) The raw spectral-envelope features extracted by the STRAIGHT vocoder are modeled directly, retaining more spectral information and avoiding the loss of fine detail;
(2) A restricted Boltzmann machine replaces the single Gaussian distribution as the output probability of each acoustic subspace, enabling the learning of the correlations between the dimensions of the high-dimensional spectral features and describing the joint feature distribution of source and target speech more accurately;
(3) The converted spectrum is richer in detail and the over-smoothing effect is reduced. When a Gaussian-Bernoulli RBM with H hidden nodes is used, the acoustic subspace is in effect described by 2^H distinct states, characterizing the transformation between source and target spectra in much finer detail. During conversion, after the acoustic subspace of the spectrum to be converted is determined, the hidden nodes further select the transformation relation of the corresponding state, so consecutive frames of the same phoneme are converted more precisely according to their differing states, which effectively avoids the time-domain over-smoothing of the conventional model. Likewise, when a Gaussian-Gaussian RBM is used, although, as with the traditional GMM model, the converted spectrum obtained under the maximum-posterior output criterion is still the conditional mean of the distribution, the RBM can effectively model the correlations between the dimensions of the spectral envelope: the learned matrix characterizing the relations between the source and target speakers' spectral dimensions, playing the role of Σ^(xy) in formula (1), takes effect during conversion, so the conversion result is corrected differently for different source spectral parameters. Even for consecutive speech frames of the same phoneme, the converted spectra then differ, making the converted speech vary more naturally in the time domain. Meanwhile, since the converted spectra obtained under the RBM model are no longer weighted averages of the training data, the frequency-domain over-smoothing is also effectively reduced.
Description of the drawings
Fig. 1 is the implementation flowchart of the present invention.
Embodiment
In the present invention, according to the thought of the joint spectrum modeling based on limited Boltzmann machine proposed, the idiographic flow of accomplished sound conversion as shown in Figure 1.Be different from and based in GMM model conversion method, each single Gauss in acoustics subspace be described, in the present invention, adopt limited Boltzmann machine (RBM) model to be described.At RBM training module, the RBM of different structure can be adopted according to RBM model concrete form difference, such as Gaussian-BernoulliRBM, Gaussian-GaussianRBM etc.
A restricted Boltzmann machine (see R. Salakhutdinov, "Learning deep generative models," Ph.D. dissertation, University of Toronto, 2009) is a two-layer undirected graphical model describing the dependencies within a group of random variables. It consists of a set of visible nodes v = [v_1, v_2, ..., v_V]^T and a set of hidden nodes h = [h_1, h_2, ..., h_H]^T, where V and H are the numbers of visible and hidden nodes respectively, and every visible node is connected to every hidden node. We take the joint source-target spectral envelope feature vector as the visible-layer random variable, i.e. the RBM is used to learn the mapping between the spectral envelope features of the source and target speakers.
Depending on the probability distributions assumed for the visible and hidden node variables, a restricted Boltzmann machine can take different forms. The joint modeling and voice conversion steps under the Gaussian-Bernoulli and Gaussian-Gaussian forms are described in detail below.
1. Technical solution one: voice conversion method with joint spectral modeling based on a Gaussian-Bernoulli RBM
Step one: extract speech spectral characteristics
The STRAIGHT analysis-synthesis tool is used to analyze the corpora of the source and target speakers frame by frame, yielding the fundamental frequency (F0) sequences and the sequences x^SPE and y^SPE of 513-dimensional spectral envelope feature vectors. On the basis of the static spectral envelope features, dynamic features (here first- and second-order differences) are further computed. The first-order difference features are computed according to formulas (2) and (3):
Δc_t = 0.5(c_{t+1} - c_{t-1})    (2)
Δc_1 = Δc_2,  Δc_T = Δc_{T-1}    (3)

The second-order difference features are computed according to formulas (4) and (5):

Δ²c_t = c_{t+1} - 2c_t + c_{t-1}    (4)
Δ²c_1 = Δ²c_2,  Δ²c_T = Δ²c_{T-1}    (5)

where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames in the feature sequence, and c_t is the feature vector of frame t;
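A minimal sketch of formulas (2)-(5) in Python (the function name and the centered-difference form of the first-order delta are this illustration's assumptions):

```python
import numpy as np

def delta_features(c):
    """Append first- and second-order dynamic features to a (T, D) sequence,
    following formulas (2)-(5): centered differences in the interior,
    boundary frames copied from their neighbors."""
    d1 = np.zeros_like(c)
    d1[1:-1] = 0.5 * (c[2:] - c[:-2])          # formula (2)
    d1[0], d1[-1] = d1[1], d1[-2]              # formula (3)
    d2 = np.zeros_like(c)
    d2[1:-1] = c[2:] - 2.0 * c[1:-1] + c[:-2]  # formula (4)
    d2[0], d2[-1] = d2[1], d2[-2]              # formula (5)
    return np.hstack([c, d1, d2])              # (T, 3D): static + dynamic
```

For a linear ramp the first-order delta is constant and the second-order delta vanishes, which is a quick sanity check.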
The spectral envelope feature sequences X^SPE and Y^SPE of the source and target speakers are finally obtained, where the feature vector of frame t has the form X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T and Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T, with [·]^T denoting vector transposition;
On the basis of the static spectral envelope sequences, the high-level spectral feature of each frame is further extracted; 40th-order mel-cepstral features are used here, and their first- and second-order differences are computed according to formulas (2)-(5), giving the high-level spectral feature sequences X^MCEP and Y^MCEP of the source and target speakers.
The high-level spectral feature sequences X^MCEP and Y^MCEP of the source and target speakers are aligned with the dynamic time warping (DTW) algorithm, and the joint high-level spectral feature sequence Z^MCEP is generated, where T is the total number of frames after alignment.
The spectral envelope sequences X^SPE and Y^SPE are aligned with the same alignment function obtained on the high-level spectral features, generating the joint spectral envelope feature sequence Z^SPE.
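The DTW alignment step can be illustrated with a minimal implementation over Euclidean frame distances (a sketch; the function name is an assumption, and production alignments typically add path constraints):

```python
import numpy as np

def dtw_path(X, Y):
    """Minimal DTW between two (T, D) feature sequences.
    Returns the optimal alignment path as (source, target) frame pairs."""
    Tx, Ty = len(X), len(Y)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack from the end of both sequences
    path, i, j = [], Tx, Ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The same path is then reused to align the spectral envelope sequences, so both feature streams share one time warp.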
Step 2: GMM model training
Using the joint high-level spectral features obtained in the previous step, a GMM is trained with the EM algorithm under the maximum-likelihood criterion, yielding the model parameters λ^GMM = {ω_m, μ_m, Σ_m}_{m=1}^M, where M is the number of Gaussian mixture components and ω_m, μ_m, Σ_m are the weight, mean vector and covariance matrix of the m-th component; diagonal covariance matrices are adopted here.
Step 3: acoustic subspace partition of the joint spectral envelope features
After GMM training, the obtained GMM parameters are used to partition the high-level spectral features into acoustic subspaces according to the maximum a posteriori criterion, yielding the subspace index sequence m = [m_1, m_2, ..., m_t, ..., m_T] over the training frames:

m_t = argmax_m P(m | Z_t^MCEP, λ^GMM)    (6)
According to the index sequence m, the joint spectral envelope features Z^SPE are partitioned into acoustic subspaces: the spectral envelope features sharing the same subspace index are grouped together as the training data set of the Gaussian-Bernoulli RBM of that acoustic subspace.
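Steps 2 and 3 (EM training of the GMM under maximum likelihood, then the maximum a posteriori subspace assignment of formula (6)) can be sketched with scikit-learn; the mixture count and the toy data below are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for the joint high-level feature matrix Z^MCEP (T frames x dims).
rng = np.random.default_rng(0)
Z_mcep = rng.normal(size=(500, 8))

gmm = GaussianMixture(n_components=4,         # M mixtures (assumed value)
                      covariance_type="diag",  # diagonal Σ_m as in the text
                      max_iter=100, random_state=0)
gmm.fit(Z_mcep)                # EM under the maximum-likelihood criterion
m = gmm.predict(Z_mcep)        # MAP subspace index m_t per frame, formula (6)
```

`gmm.predict` returns exactly the argmax-posterior index m_t of formula (6), and grouping frames by this index yields the per-subspace RBM training sets.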
Step 4: Gaussian-Bernoulli RBM model training
Since spectral envelope parameter values are continuous real numbers, each visible node is assumed to follow a continuous distribution so as to describe them more accurately; here a Gaussian distribution is assumed, while the hidden nodes are assumed to follow a binary {0, 1} distribution.
According to the partition result of step 3, an RBM is trained independently for each acoustic subspace. The energy function of the Gaussian-Bernoulli RBM adopted in this scheme is:
E(v, h) = Σ_{i=1}^V (v_i - a_i)² / (2σ_i²) - Σ_{j=1}^H b_j h_j - Σ_{i=1}^V Σ_{j=1}^H (v_i / σ_i) w_ij h_j    (7)
where v = [v_1, v_2, ..., v_V]^T are the visible nodes of the RBM, V is the number of visible nodes, h = [h_1, h_2, ..., h_H]^T are the hidden nodes, and H is the number of hidden nodes. θ = {W, a, b} are the model parameters: W = {w_ij}_{V×H}, with w_ij the connection weight between visible node v_i and hidden node h_j, and a = [a_1, ..., a_V]^T and b = [b_1, ..., b_H]^T the bias parameters. σ_i² is the variance of visible node v_i; it is fixed during training and, for notational convenience, set to 1 here. The joint distribution of the visible nodes v and hidden nodes h is defined as:

P(v, h; θ) = (1/Z(θ)) exp(-E(v, h))    (8)

where Z(θ) = ∫_v Σ_h exp(-E(v, h)) dv is the partition function.
According to formulas (7) and (8), the marginal distribution of the visible nodes is obtained as

P(v; θ) = (1/Z(θ)) exp(-Σ_{i=1}^V (v_i - a_i)²/2) Π_{j=1}^H (1 + exp(b_j + Σ_{i=1}^V v_i w_ij))    (10)
Using the training data of each acoustic subspace obtained in step 3, the model parameters λ^RBM = {W_m, a_m, b_m}_{m=1}^M are estimated under the maximum-likelihood criterion with the Contrastive Divergence (CD) algorithm (see G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002), where {W_m, a_m, b_m} are the parameters of the m-th Gaussian-Bernoulli RBM;
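A simplified sketch of CD-1 training for one subspace's Gaussian-Bernoulli RBM with unit visible variance (the hidden size, learning rate, epoch count and mean-field reconstruction are this sketch's assumptions, not the patent's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_gb_rbm(V, H=64, lr=1e-3, epochs=10, seed=0):
    """CD-1 training of a Gaussian-Bernoulli RBM, energy as in formula (7)
    with sigma_i = 1.  V is the (N, Vdim) training matrix of one subspace."""
    rng = np.random.default_rng(seed)
    Vdim = V.shape[1]
    W = 0.01 * rng.normal(size=(Vdim, H))
    a = V.mean(axis=0)                 # visible bias initialized at data mean
    b = np.zeros(H)
    n = len(V)
    for _ in range(epochs):
        # positive phase: hidden probabilities and a binary sample
        ph = sigmoid(V @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
        # negative phase: one Gibbs step back to the visibles (mean-field mean)
        v_neg = a + h @ W.T
        ph_neg = sigmoid(v_neg @ W + b)
        # CD-1 parameter updates
        W += lr * (V.T @ ph - v_neg.T @ ph_neg) / n
        a += lr * (V - v_neg).mean(axis=0)
        b += lr * (ph - ph_neg).mean(axis=0)
    return W, a, b
```

One RBM is trained per acoustic subspace on that subspace's joint source-target envelope frames.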
Step 5: spectral conversion
(1) At the conversion stage, the static spectral envelope features of the speech to be converted are extracted, their first- and second-order dynamic features are computed according to formulas (2)-(5), and the static and dynamic features are concatenated to form the spectral envelope feature to be converted, denoted X_t^SPE for frame t. The static high-level spectral features are likewise extracted from the static spectral envelope, their dynamic features computed according to formulas (2)-(5), and the concatenation denoted X_t^MCEP for frame t. The acoustic subspace index m of the t-th frame is computed according to the maximum a posteriori criterion:

m = argmax_m P(m | X_t^MCEP, λ^GMM)    (11)
(2) The spectral envelope feature to be converted is transformed under the maximum conditional probability criterion; the converted spectral envelope feature is

Ỹ_t^SPE = argmax_{Y_t^SPE} P(Y_t^SPE | X_t^SPE) = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE) / P(X_t^SPE)    (12)

Since P(X_t^SPE) does not depend on Y_t^SPE, this simplifies further to

Ỹ_t^SPE = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE)    (13)
Since formula (13) has no closed-form solution, a gradient-based search is used to obtain the converted spectral parameters:

Y_t^SPE(i+1) = Y_t^SPE(i) + α · ∂ log P(X_t^SPE, Y_t^SPE) / ∂Y_t^SPE |_{Y_t^SPE = Y_t^SPE(i)}    (14)

where i is the iteration index and α the step size. According to formula (10), the partial derivative with respect to Y_t^SPE is

∂ log P(X_t^SPE, Y_t^SPE) / ∂Y_t^SPE = -(Y_t^SPE - a_m^(y)) + Σ_{j=1}^H [exp(b_{m,j} + v_t^T w_{m,j}) / (1 + exp(b_{m,j} + v_t^T w_{m,j}))] w_{m,j}^(y)    (15)

where a_m, b_m = [b_{m,1}, ..., b_{m,j}, ..., b_{m,H}]^T and W_m = [w_{m,1}, ..., w_{m,j}, ..., w_{m,H}]_{V×H} are the parameters of the m-th Gaussian-Bernoulli RBM, w_{m,j} is the j-th column of W_m, and a_m^(y), w_{m,j}^(y) are the sub-vectors of a_m and w_{m,j} corresponding to the target feature dimensions.
Since the gradient search is sensitive to its initial value, the mode of the RBM (see Z. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis," in Proc. ICASSP, 2013) is adopted as the initial value of the search.
In practice the gradient search is too time-consuming to meet real-time conversion requirements, so the conversion is further approximated. Observe that the log-domain expression contains the function term f(x) = log(1 + exp(x)), which is accurately approximated by f*(x) when |x| > 4:

f*(x) = x for x ≥ 0,  f*(x) = 0 for x < 0    (16)

Furthermore, experimental statistics show that the hidden-node activation s = b_{m,j} + v^T w_{m,j} is either very large or very small (for example, P(|s| > 4) = 0.94 on the training set), so this approximation is reasonable for this model. With it, a closed-form solution for the converted spectral envelope feature is obtained, simplifying the conversion:

Ỹ_t^SPE = a_m^(y) + Σ_{j: b_{m,j} + v^T w_{m,j} > 0} w_{m,j}^(y)    (17)
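The approximate closed-form conversion of formula (17) reduces to a hard gate on the hidden activations. A sketch, where the visible vector v is the source features concatenated with a current target estimate (e.g. the RBM mode) and d_src, the source half's dimensionality, is an assumption of this illustration:

```python
import numpy as np

def convert_frame(v, W_m, a_m, b_m, d_src):
    """Formula (17): the target part of the visible bias plus the target
    rows of every hidden unit whose activation b_j + v^T w_j is positive."""
    s = b_m + v @ W_m                    # hidden activations s_j
    active = s > 0                       # hard gate from the f*(x) approximation
    return a_m[d_src:] + W_m[d_src:, active].sum(axis=1)
```

Because the gate is a simple sign test, each frame is converted in one pass instead of an iterative gradient search.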
Step 6: synthesize the converted speech
Finally, the converted F0 sequence (obtained here with the Gaussian transformation method) and the converted spectral envelope sequence obtained in step 5 are fed into the STRAIGHT synthesizer to synthesize the converted speech.
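The Gaussian F0 transformation referred to here is conventionally a mean/variance normalization in the log-F0 domain; a sketch under that assumption (function and parameter names are illustrative):

```python
import numpy as np

def convert_f0(f0_src, stats_src, stats_tgt):
    """Gaussian log-F0 transformation: match the source speaker's log-F0
    mean/std to the target speaker's.  Unvoiced frames (f0 == 0) pass through."""
    mu_x, sd_x = stats_src
    mu_y, sd_y = stats_tgt
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    f0_out[voiced] = np.exp(mu_y + (np.log(f0_src[voiced]) - mu_x) * sd_y / sd_x)
    return f0_out
```

The statistics (mu, sd) are estimated once per speaker over the voiced frames of the training corpus.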
2. Technical solution two: voice conversion method with joint spectral modeling based on a Gaussian-Gaussian RBM
Steps 1, 2 and 3 are identical to steps 1, 2 and 3 of solution one and are not repeated.
Step 4: Gaussian-Gaussian RBM model training
In the Gaussian-Gaussian RBM the visible node variables are likewise assumed to follow Gaussian distributions. Unlike the Gaussian-Bernoulli RBM, however, no connection between visible and hidden nodes is considered, i.e. the number of hidden nodes is set to 0, and a full connection is established directly between the source-speaker visible nodes x and the target-speaker visible nodes y. The connection matrix is W = {w_ij}_{D×D}, where D is the feature vector dimension and w_ij is the connection weight between source visible node x_i and target visible node y_j. The corresponding energy function is:
E(x, y) = Σ_{i=1}^D x_i²/(2σ_i²) + Σ_{j=1}^D y_j²/(2σ_j²) - Σ_{i=1}^D Σ_{j=1}^D (x_i / σ_i) w_ij (y_j / σ_j)    (18)
where θ = {W} are the model parameters; the bias terms are set to 0 and not updated, and for notational convenience the source and target node variances σ_i² and σ_j² are fixed to 1 here. The joint distribution of x and y then follows:

P(x, y; θ) = (1/Z(θ)) exp(-E(x, y))    (19)

where Z(θ) = ∫∫ exp(-E(x, y)) dx dy is the partition function    (20)

Equivalently, defining the symmetric block connection matrix

W̃ = [ 0  W ; W^T  0 ]    (21)

the joint distribution of the concatenated visible vector [x^T, y^T]^T is a zero-mean Gaussian with precision matrix I - W̃.
Likewise, the joint spectral envelope feature parameter sets of each acoustic subspace obtained in step 3 are used to train the parameters λ^RBM = {W_m}_{m=1}^M of the corresponding Gaussian-Gaussian RBMs under the maximum-likelihood criterion with the CD algorithm (see G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002), where {W_m} are the parameters of the m-th Gaussian-Gaussian RBM;
Step 5: spectral conversion
(1) At the conversion stage, the static spectral envelope features of the speech to be converted are extracted, their first- and second-order dynamic features computed according to formulas (2)-(5), and the static and dynamic features concatenated into the spectral envelope feature to be converted, denoted X_t^SPE for frame t. The static high-level spectral features are extracted from the static spectral envelope, their dynamic features computed according to formulas (2)-(5), and the concatenation denoted X_t^MCEP for frame t. The acoustic subspace index m of the t-th frame is computed according to the maximum a posteriori criterion:

m = argmax_m P(m | X_t^MCEP, λ^GMM)    (22)
(2) The spectral envelope feature X_t^SPE is converted under the maximum conditional probability criterion (as in formulas (12) and (13)). According to formulas (18)-(20), setting

∂ log P(X_t^SPE, Y_t^SPE) / ∂Y_t^SPE = 0    (23)

yields the converted spectral envelope feature parameter

Ỹ_t^SPE = W_m^T X_t^SPE    (24)
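Formula (24) is a per-frame linear map; a minimal sketch:

```python
import numpy as np

def gg_convert(W_m, X_spe):
    """Formula (24): with unit variances and zero biases, setting the
    derivative of log P(x, y) with respect to y to zero (formula (23))
    gives -y + W_m^T x = 0, i.e. y = W_m^T x for each frame."""
    return X_spe @ W_m        # (T, D) frames mapped frame-wise by W_m^T
```

The same W_m that CD training learns as the source-target connection acts directly as the conversion matrix, which is why no iterative search is needed in this scheme.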
Step 6: synthesize the converted speech
Finally, the converted F0 sequence (obtained here with the Gaussian transformation method) and the converted spectral envelope sequence obtained in step 5 are fed into the STRAIGHT synthesizer to synthesize the converted speech.

Claims (2)

1. A voice conversion method with joint spectral modeling based on restricted Boltzmann machines, characterized in that the steps are as follows:
Step 1: extract speech spectral envelope features
(1) Use the STRAIGHT analysis-synthesis tool to analyze the corpora of the source and target speakers frame by frame, obtaining the F0 sequences and the static spectral envelope features x^SPE and y^SPE, where x_t^SPE and y_t^SPE are the 513-dimensional static spectral envelope vectors of frame t of the source and target speakers, and T_1 and T_2 are the numbers of source and target feature frames;
(2) From the static spectral envelope features x^SPE and y^SPE, obtain the first-order dynamic features according to formulas (2) and (3) and the second-order dynamic features according to formulas (4) and (5);
Δc_t = 0.5(c_{t+1} - c_{t-1})    (2)
Δc_1 = Δc_2, Δc_T = Δc_{T-1}    (3)
Δ²c_t = c_{t+1} - 2c_t + c_{t-1}    (4)
Δ²c_1 = Δ²c_2, Δ²c_T = Δ²c_{T-1}    (5)
where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames in the feature sequence, and c_t is the feature vector of frame t;
(3) Concatenate the static and dynamic features to obtain the source speaker's spectral envelope feature X^SPE, whose frame-t vector is X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T, with [·]^T denoting vector transposition, and likewise the target speaker's spectral envelope feature Y^SPE with frame-t vector Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T;
Step 2: extract high-level spectral features of speech
(1) On the basis of the static spectral envelope features, further extract the high-level spectral feature of each frame, here 40th-order mel-cepstral features, obtaining the static high-level spectral features x^MCEP and y^MCEP of the source and target speakers;
(2) From these, obtain the first-order dynamic high-level features according to formulas (2) and (3) and the second-order dynamic high-level features according to formulas (4) and (5);
(3) Concatenate the static and dynamic features to obtain the source speaker's high-level spectral feature X^MCEP, with frame-t vector X_t^MCEP = [x_t^MCEP^T, Δx_t^MCEP^T, Δ²x_t^MCEP^T]^T, and the target speaker's high-level spectral feature Y^MCEP, with frame-t vector Y_t^MCEP = [y_t^MCEP^T, Δy_t^MCEP^T, Δ²y_t^MCEP^T]^T;
Step 3: dynamic time warping
(1) Compute the alignment function between X^MCEP and Y^MCEP with the dynamic time warping (DTW) algorithm, align X^MCEP and Y^MCEP according to it, and concatenate the aligned sequences to obtain the joint high-level spectral feature Z^MCEP, whose frame-t vector is Z_t^MCEP = [X_t^MCEP^T, Y_t^MCEP^T]^T, T being the number of frames after alignment;
(2) Align X^SPE and Y^SPE with the alignment function obtained in (1), and concatenate the aligned sequences to obtain the joint spectral envelope feature Z^SPE, whose frame-t vector is Z_t^SPE = [X_t^SPE^T, Y_t^SPE^T]^T;
Step 4: GMM model training
Using the joint high-level spectral feature Z^MCEP obtained in the previous step, train a GMM with the EM algorithm under the maximum-likelihood criterion, obtaining the model parameters λ^GMM = {ω_m, μ_m, Σ_m}_{m=1}^M, where M is the number of Gaussian mixture components and ω_m, μ_m, Σ_m are the weight, mean vector and covariance matrix of the m-th component;
Step 5: acoustic subspace partition of the joint spectral envelope features
After GMM training, use the obtained GMM parameters λ^GMM to partition the joint high-level spectral feature Z^MCEP into acoustic subspaces according to the maximum a posteriori criterion, obtaining the index sequence m = [m_1, m_2, ..., m_t, ..., m_T] of the acoustic subspace indices of every frame of Z^MCEP;
According to the index sequence m, partition the joint spectral envelope feature Z^SPE into acoustic subspaces, grouping the joint spectral envelope frames with the same subspace index together as the training feature parameter set of the Gaussian-Bernoulli restricted Boltzmann machine (RBM) of that acoustic subspace;
Step 6: Gaussian-Bernoulli RBM model training
Since spectral envelope parameter values are continuous real numbers, each visible node is assumed to follow a continuous distribution (here a Gaussian) so as to describe them more accurately, while the hidden nodes are assumed to follow a binary {0, 1} distribution;
According to the partition result of step 5, train an RBM independently for each acoustic subspace; the energy function of the Gaussian-Bernoulli RBM adopted is:
E(v, h) = Σ_{i=1}^V (v_i - a_i)²/(2σ_i²) - Σ_{j=1}^H b_j h_j - Σ_{i=1}^V Σ_{j=1}^H (v_i/σ_i) w_ij h_j    (7)
where v = [v_1, v_2, ..., v_V]^T are the visible nodes of the RBM, V is the number of visible nodes, h = [h_1, h_2, ..., h_H]^T are the hidden nodes, and H is the number of hidden nodes; θ = {W, a, b} are the model parameters, W = {w_ij}_{V×H}, with w_ij the connection weight between visible node v_i and hidden node h_j, and a = [a_1, ..., a_V]^T and b = [b_1, ..., b_H]^T the bias parameters; σ_i² is the variance of visible node v_i, fixed during training and set to 1 here for notational convenience; the joint distribution of v and h is defined by formula (8), P(v, h; θ) = exp(-E(v, h))/Z(θ), where Z(θ) is the partition function;
According to formulas (7) and (8), the marginal distribution of the visible nodes is obtained;
Using the training data of each acoustic subspace obtained in step 5, estimate the model parameters λ^RBM = {W_m, a_m, b_m}_{m=1}^M under the maximum-likelihood criterion with the Contrastive Divergence (CD) algorithm, where {W_m, a_m, b_m} are the parameters of the m-th Gaussian-Bernoulli RBM;
Step 7: spectral conversion
(1) At the conversion stage, extract the static spectral envelope features of the speech to be converted, obtain their first- and second-order dynamic features according to formulas (2)-(5), and concatenate static and dynamic features into the spectral envelope feature to be converted, denoted X_t^SPE for frame t; extract the static high-level spectral features on the basis of the static spectral envelope, obtain their dynamic features according to formulas (2)-(5), and concatenate them into the high-level spectral feature X_t^MCEP of frame t; compute the acoustic subspace index m of frame t according to the maximum a posteriori criterion;
(2) Convert the spectral envelope feature X_t^SPE under the maximum conditional probability criterion; the converted spectral envelope feature is
Ỹ_t^SPE = argmax_{Y_t^SPE} P(Y_t^SPE | X_t^SPE)    (12)
which can be further simplified to
Ỹ_t^SPE = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE)    (13)
Since formula (13) has no closed-form solution, adopt a gradient-based search to obtain the converted spectral envelope parameters, with update formula
Y_t^SPE(i+1) = Y_t^SPE(i) + α · ∂ log P(X_t^SPE, Y_t^SPE)/∂Y_t^SPE |_{Y_t^SPE = Y_t^SPE(i)}    (14)
where i is the iteration index and α the step size; according to formula (10), the partial derivative of log P(X_t^SPE, Y_t^SPE) with respect to Y_t^SPE is given by formula (15), where a_m, b_m = [b_{m,1}, ..., b_{m,j}, ..., b_{m,H}]^T and W_m = [w_{m,1}, ..., w_{m,j}, ..., w_{m,H}]_{V×H} are the parameters of the m-th Gaussian-Bernoulli RBM, w_{m,j} is the j-th column of W_m, and a_m^(y), w_{m,j}^(y) are the sub-vectors of a_m and w_{m,j} corresponding to the target feature dimensions;
Adopt the mode of the RBM as the initial value of the gradient search;
Since the log-domain expression contains the function term f(x) = log(1 + exp(x)), and f*(x) (f*(x) = x for x ≥ 0, f*(x) = 0 for x < 0) approximates f(x) accurately when |x| > 4, use this approximation to solve formula (13), obtaining the converted spectral envelope feature
Ỹ_t^SPE = a_m^(y) + Σ_{j: b_{m,j} + v^T w_{m,j} > 0} w_{m,j}^(y)    (17)
Step 8: synthesize the converted speech
Finally, feed the converted F0 sequence and the converted spectral envelope sequence obtained in step 7 into the STRAIGHT synthesizer to synthesize the converted speech.
2. A voice conversion method with joint spectral modeling based on restricted Boltzmann machines, characterized in that the steps are as follows:
Step 1: extract speech spectral envelope features
(1) Use the STRAIGHT analysis-synthesis tool to analyze the corpora of the source and target speakers frame by frame, obtaining the F0 sequences and the static spectral envelope features x^SPE and y^SPE, where x_t^SPE and y_t^SPE are the 513-dimensional static spectral envelope vectors of frame t of the source and target speakers, and T_1 and T_2 are the numbers of source and target feature frames;
(2) From the static spectral envelope features, obtain the first-order dynamic features according to formulas (2) and (3) and the second-order dynamic features according to formulas (4) and (5);
Δc_t = 0.5(c_{t+1} - c_{t-1})    (2)
Δc_1 = Δc_2, Δc_T = Δc_{T-1}    (3)
Δ²c_t = c_{t+1} - 2c_t + c_{t-1}    (4)
Δ²c_1 = Δ²c_2, Δ²c_T = Δ²c_{T-1}    (5)
where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames in the feature sequence, and c_t is the feature vector of frame t;
(3) Concatenate the static and dynamic features to obtain the source speaker's spectral envelope feature X^SPE, whose frame-t vector is X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T, with [·]^T denoting vector transposition, and the target speaker's spectral envelope feature Y^SPE with frame-t vector Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T;
Step 2: extract high-level spectral features of speech
(1) On the basis of the static spectral envelope features, further extract the high-level spectral feature of each frame, here 40th-order mel-cepstral features, obtaining the static high-level spectral features x^MCEP and y^MCEP of the source and target speakers;
(2) From these, obtain the first-order dynamic high-level features according to formulas (2) and (3) and the second-order dynamic high-level features according to formulas (4) and (5);
(3) Concatenate the static and dynamic features to obtain the source speaker's high-level spectral feature X^MCEP, with frame-t vector X_t^MCEP = [x_t^MCEP^T, Δx_t^MCEP^T, Δ²x_t^MCEP^T]^T, and the target speaker's high-level spectral feature Y^MCEP, with frame-t vector Y_t^MCEP = [y_t^MCEP^T, Δy_t^MCEP^T, Δ²y_t^MCEP^T]^T;
Step 3: dynamic time warping
(1) Compute the alignment function between X^MCEP and Y^MCEP with the dynamic time warping (DTW) algorithm, align X^MCEP and Y^MCEP according to it, and concatenate the aligned sequences to obtain the joint high-level spectral feature Z^MCEP, whose frame-t vector is Z_t^MCEP = [X_t^MCEP^T, Y_t^MCEP^T]^T, T being the number of frames after alignment;
(2) Align X^SPE and Y^SPE with the alignment function obtained in (1), and concatenate the aligned sequences to obtain the joint spectral envelope feature Z^SPE, whose frame-t vector is Z_t^SPE = [X_t^SPE^T, Y_t^SPE^T]^T;
Step 4: GMM model training
Using the joint high-level spectral feature Z^MCEP obtained in the previous step, train a GMM with the EM algorithm under the maximum-likelihood criterion, obtaining the model parameters λ^GMM = {ω_m, μ_m, Σ_m}_{m=1}^M, where M is the number of Gaussian mixture components and ω_m, μ_m, Σ_m are the weight, mean vector and covariance matrix of the m-th component;
Step 5: acoustic subspace partition of the joint spectral envelope features
After GMM training, use the obtained GMM parameters λ^GMM to partition the joint high-level spectral feature Z^MCEP into acoustic subspaces according to the maximum a posteriori criterion, obtaining the index sequence m = [m_1, m_2, ..., m_t, ..., m_T] of the acoustic subspace indices of every frame of Z^MCEP;
According to the index sequence m, partition the joint spectral envelope feature Z^SPE into acoustic subspaces, grouping the joint spectral envelope frames with the same subspace index together as the training feature parameter set of the Gaussian-Gaussian RBM of that acoustic subspace;
Step 6: Gaussian-Gaussian RBM model training
In the Gaussian-Gaussian RBM the visible node variables are likewise assumed to follow Gaussian distributions, but unlike the Gaussian-Bernoulli RBM no connection between visible and hidden nodes is considered, i.e. the number of hidden nodes is set to 0, and a full connection is established directly between the source-speaker visible nodes x and the target-speaker visible nodes y; the connection matrix is W = {w_ij}_{D×D}, where D is the feature vector dimension and w_ij is the connection weight between source visible node x_i and target visible node y_j; the corresponding energy function is
E(x, y) = Σ_{i=1}^D x_i²/(2σ_i²) + Σ_{j=1}^D y_j²/(2σ_j²) - Σ_{i=1}^D Σ_{j=1}^D (x_i/σ_i) w_ij (y_j/σ_j)    (18)
where θ = {W} are the model parameters, and the variances σ_i² and σ_j² of the source and target feature nodes x_i and y_j are fixed to 1 here; the joint distribution of x and y then follows as P(x, y; θ) = exp(-E(x, y))/Z(θ), where Z(θ) is the partition function;
Using the joint spectral envelope feature parameter sets of each acoustic subspace obtained in step 5, train the parameters λ^RBM = {W_m}_{m=1}^M of the corresponding Gaussian-Gaussian RBMs under the maximum-likelihood criterion with the Contrastive Divergence (CD) algorithm, where {W_m} are the parameters of the m-th Gaussian-Gaussian RBM;
Step 7: spectral conversion
(1) At the conversion stage, extract the static spectral envelope features of the speech to be converted, obtain their first- and second-order dynamic features according to formulas (2)-(5), and concatenate static and dynamic features into the spectral envelope feature to be converted, denoted X_t^SPE for frame t; extract the static high-level spectral features on the basis of the static spectral envelope, obtain their dynamic features according to formulas (2)-(5), and concatenate them into the high-level spectral feature X_t^MCEP of frame t; compute the acoustic subspace index m of frame t according to the maximum a posteriori criterion;
(2) Convert the spectral envelope feature X_t^SPE under the maximum conditional probability criterion; according to formulas (18)-(20), setting
∂ log P(X_t^SPE, Y_t^SPE)/∂Y_t^SPE = 0    (23)
yields the converted spectral envelope feature parameter
Ỹ_t^SPE = W_m^T X_t^SPE    (24)
Step 8: synthesize the converted speech
Finally, feed the converted F0 sequence and the converted spectral envelope sequence obtained in step 7 into the STRAIGHT synthesizer to synthesize the converted speech.
CN201310360234.2A 2013-08-16 2013-08-16 A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine Active CN103413548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310360234.2A CN103413548B (en) 2013-08-16 2013-08-16 A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine


Publications (2)

Publication Number Publication Date
CN103413548A CN103413548A (en) 2013-11-27
CN103413548B true CN103413548B (en) 2016-02-03

Family

ID=49606551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310360234.2A Active CN103413548B (en) 2013-08-16 2013-08-16 A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine

Country Status (1)

Country Link
CN (1) CN103413548B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217721B (en) * 2014-08-14 2017-03-08 东南大学 Based on the phonetics transfer method under the conditions of the asymmetric sound bank that speaker model aligns
CN106997476B (en) * 2017-03-01 2020-04-28 西安交通大学 Transmission system performance degradation evaluation method for multi-source label-free data learning modeling
CN106782520B (en) * 2017-03-14 2019-11-26 华中师范大学 Phonetic feature mapping method under a kind of complex environment
CN108198566B (en) * 2018-01-24 2021-07-20 咪咕文化科技有限公司 Information processing method and device, electronic device and storage medium
CN108764340A (en) * 2018-05-29 2018-11-06 上海大学 A kind of quantitative analysis method of Type B ultrasound and Ultrasonic elasticity bimodal image
CN111772422A (en) * 2020-06-12 2020-10-16 广州城建职业学院 Intelligent crib

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN103035236A (en) * 2012-11-27 2013-04-10 河海大学常州校区 High-quality voice conversion method based on modeling of signal timing characteristics
CN103226946A * 2013-03-26 2013-07-31 中国科学技术大学 Speech synthesis method based on restricted Boltzmann machines

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2489473B (en) * 2011-03-29 2013-09-18 Toshiba Res Europ Ltd A voice conversion method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhen-Hua Ling et al. "Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis." 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013. *

Also Published As

Publication number Publication date
CN103413548A (en) 2013-11-27

Similar Documents

Publication Publication Date Title
CN103413548B (en) A voice conversion method using joint spectral modeling based on restricted Boltzmann machines
Sun et al. Voice conversion using deep bidirectional long short-term memory based recurrent neural networks
CN101178896B (en) Unit selection speech synthesis method based on statistical acoustic models
CN104392718B (en) A robust speech recognition method based on acoustic model arrays
CN109767778B (en) Voice conversion method fusing Bi-LSTM and WaveNet
CN102568476B (en) Voice conversion method based on self-organizing feature map network clustering and radial basis function networks
CN105845140A (en) Speaker verification method and device for short-utterance conditions
Sivaraman et al. Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion
CN104616663A (en) Music separation method combining an MFCC (Mel-frequency cepstral coefficient) multi-repetition model with HPSS (harmonic/percussive sound separation)
CN104123933A (en) Voice conversion method based on adaptive non-parallel training
CN113506562B (en) End-to-end speech synthesis method and system based on fusion of acoustic features and text emotion features
CN106128450A (en) A Chinese-Tibetan bilingual cross-lingual voice conversion method and system
CN102592607A (en) Voice conversion system and method using blind speech separation
CN110047501B (en) Many-to-many voice conversion method based on beta-VAE
CN102306492A (en) Voice conversion method based on convolutive non-negative matrix factorization
CN110648684B (en) Bone-conduction speech enhancement waveform generation method based on WaveNet
Das et al. Bangladeshi dialect recognition using Mel frequency cepstral coefficient, delta, delta-delta and Gaussian mixture model
CN108766409A (en) Opera synthesis method and device, and computer-readable storage medium
CN110085254A (en) Many-to-many voice conversion method based on beta-VAE and i-vector
Bhardwaj et al. Development of robust automatic speech recognition system for children's using Kaldi toolkit
CN106847248A (en) Chord recognition method based on robust scale contour features and support vector machines
CN109377981A (en) Phoneme alignment method and device
CN105023570A (en) Speech transformation method and system
CN106782599A (en) Voice conversion method based on Gaussian process output post-filtering
CN109036376A (en) A Minnan (Southern Fujian dialect) speech synthesis method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant