CN103413548B - Voice conversion method based on joint spectral modeling with restricted Boltzmann machines - Google Patents


Publication number
CN103413548B
CN103413548B (application CN201310360234.2A)
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310360234.2A
Other languages
Chinese (zh)
Other versions
CN103413548A (en)
Inventor
刘利娟 (Liu Lijuan)
陈凌辉 (Chen Linghui)
凌震华 (Ling Zhenhua)
戴礼荣 (Dai Lirong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority claimed from application CN201310360234.2A
Publication of CN103413548A
Application granted
Publication of CN103413548B

Landscapes

  • Complex Calculations (AREA)

Abstract

A voice conversion method based on joint spectral modeling with restricted Boltzmann machines. Its steps are: extract the spectral-envelope features of the speech, extract the high-level spectral features of the speech, perform dynamic time warping, train a GMM model, partition the joint spectral-envelope features into acoustic subspaces, train a Gaussian-Bernoulli RBM model or a Gaussian-Gaussian RBM model, convert the spectrum, and synthesize the converted speech. The invention increases the accuracy of spectral modeling and improves the sound quality and naturalness of the converted speech.

Description

Voice conversion method based on joint spectral modeling with restricted Boltzmann machines
Technical field
The present invention relates to voice conversion in speech synthesis, and specifically to a voice conversion method based on joint spectral modeling with a restricted Boltzmann machine (RBM).
Background art
The object of voice conversion (also known as speech conversion) is to transform the speech of one speaker (the source speaker) so that it sounds as if it were uttered by another speaker (the target speaker), while keeping the semantic content of the speech unchanged. At present, joint spectral modeling based on Gaussian mixture models (GMM) (see Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131-142, Mar. 1998) is the mainstream approach to voice conversion. Its basic principle is, in the training stage, to fit the joint probability distribution of the source and target spectral features with a mixture of Gaussian distributions according to the maximum-likelihood criterion. In the conversion stage, the source speaker's spectrum is converted according to the maximum-conditional-probability output criterion, and finally the converted spectrum and the converted fundamental frequency are fed into a speech vocoder to synthesize the converted speech. Voice conversion with GMM modeling can produce speech with a certain degree of intelligibility and similarity, and the introduction of dynamic parameters (see T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, Nov. 2007) has also clearly improved the continuity of the converted speech. However, this model suffers from a severe over-smoothing effect, which makes the overall sound quality of the converted speech unsatisfactory.
The insufficient spectral modeling of the traditional GMM-based voice conversion method is the main cause of the above problems, specifically the following two points:
(1) Using high-level spectral features as training features loses the fine detail of the spectrum. GMM training generally uses high-level spectral features, such as mel-cepstral parameters or line-spectral-pair parameters. These are high-level representations of the spectrum with low dimensionality, which eases model training, but their extraction discards much spectral detail;
(2) GMM-based joint spectral modeling cannot adequately model the relations between spectral dimensions. A GMM uses covariance matrices to characterize the relations between the dimensions of the spectral features and between the source and target speaker features. However, model training involves a large number of matrix multiplications and inversions; limited by computational complexity and numerical precision, training with full covariance matrices is often infeasible, so the covariance matrices are usually simplified to diagonal form. The model then cannot fully express the relations between spectral dimensions, and the conversion of the source speaker's feature parameters suffers in two ways. On the one hand, the converted spectrum obtained under the maximum-conditional-probability output criterion is directly related to the mean of the conditional Gaussian distribution:
E_{m,t}^(y) = μ_m^(y) + Σ_m^(yx) (Σ_m^(xx))^(−1) (x_t − μ_m^(x))    (1)
Because the cross-covariance Σ_m^(yx), which characterizes the relation between source and target, captures little under the diagonal simplification, the correction term Σ_m^(yx) (Σ_m^(xx))^(−1) (x_t − μ_m^(x)) becomes very small, and the converted spectrum is approximately equal to the mean μ_m^(y) of the target speaker's feature distribution, i.e. a weighted average of the training samples; much of the spectral detail is thus smoothed away. On the other hand, consecutive frames belonging to the same phoneme differ very little in their acoustic features, so they are all converted with the Gaussian parameters of the same acoustic subspace. By the analysis above, their converted spectra are all approximately equal to the target mean of that acoustic subspace, so the spectrum is over-smoothed in the time domain as well. In summary, the converted speech is over-smoothed in both the frequency and the time domain, which makes it sound muffled and ultimately degrades the sound quality of the synthesized speech.
Summary of the invention
The problem solved by the present invention: to alleviate the over-smoothing problem of existing voice conversion methods, a voice conversion method based on joint spectral modeling with restricted Boltzmann machines is provided, which increases the accuracy of spectral modeling and improves the sound quality and naturalness of the converted speech.
The object of the invention is achieved by the following measures:
Technical solution one of the present invention: a voice conversion method based on joint spectral modeling with restricted Boltzmann machines, with the following steps:
Step 1: extract the spectral-envelope features of the speech
(1) Analyze the corpora of the source and target speakers frame by frame with the STRAIGHT analysis/synthesis vocoder, obtaining the fundamental-frequency sequences and the static spectral-envelope features x_t^SPE and y_t^SPE, where x_t^SPE and y_t^SPE are the t-th frame static spectral-envelope vectors of the source and target speaker respectively, of dimensionality 513, and T_1 and T_2 are the numbers of source and target feature frames;
(2) From the static spectral-envelope features, obtain the first-order dynamic spectral-envelope features Δx_t^SPE and Δy_t^SPE according to formulas (2)(3), and the second-order dynamic spectral-envelope features Δ²x_t^SPE and Δ²y_t^SPE according to formulas (4)(5):
Δc_t = 0.5·c_{t+1} − 0.5·c_{t−1}, ∀t ∈ [2, T−1]    (2)
Δc_1 = Δc_2, Δc_T = Δc_{T−1}    (3)
Δ²c_t = c_{t+1} − 2c_t + c_{t−1}, ∀t ∈ [2, T−1]    (4)
Δ²c_1 = Δ²c_2, Δ²c_T = Δ²c_{T−1}    (5)
where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames of the feature sequence, and c_t is the t-th frame feature vector;
(3) Splice the static and dynamic features together to obtain the source speaker's spectral-envelope feature sequence X^SPE = [X_1^SPE^T, X_2^SPE^T, ..., X_{T_1}^SPE^T]^T, where the t-th frame spectral-envelope feature is X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T and [·]^T denotes vector transposition. Likewise splice the target features to obtain the target speaker's spectral-envelope feature sequence Y^SPE = [Y_1^SPE^T, Y_2^SPE^T, ..., Y_{T_2}^SPE^T]^T, where the t-th frame spectral-envelope feature is Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T.
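As an illustrative sketch (not part of the patent text), the dynamic-feature computation and splicing of formulas (2)-(5) can be expressed in a few lines of NumPy; the function name and array layout below are assumptions:

```python
import numpy as np

def add_deltas(c):
    """Append first- and second-order dynamic features to a static
    feature sequence c of shape (T, D), following Eqs. (2)-(5):
    delta_c_t  = 0.5*c_{t+1} - 0.5*c_{t-1}   (interior frames)
    delta2_c_t = c_{t+1} - 2*c_t + c_{t-1}   (interior frames)
    with boundary frames copied from their neighbors."""
    T = c.shape[0]
    d1 = np.zeros_like(c)
    d2 = np.zeros_like(c)
    d1[1:T-1] = 0.5 * c[2:] - 0.5 * c[:T-2]          # Eq. (2)
    d2[1:T-1] = c[2:] - 2.0 * c[1:T-1] + c[:T-2]     # Eq. (4)
    d1[0], d1[T-1] = d1[1], d1[T-2]                  # Eq. (3)
    d2[0], d2[T-1] = d2[1], d2[T-2]                  # Eq. (5)
    return np.hstack([c, d1, d2])                    # (T, 3D) spliced features
```

The same helper would be applied to both the 513-dimensional spectral-envelope frames and the mel-cepstral frames of Step 2.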
Step 2: extract the high-level spectral features of the speech
(1) On the basis of the static spectral-envelope features, further extract a high-level spectral feature for every speech frame, here a 40th-order mel-cepstral feature, obtaining the static high-level spectral features x_t^MCEP and y_t^MCEP of the source and target speakers;
(2) From these, obtain the first-order dynamic high-level spectral features according to formulas (2)(3) and the second-order dynamic high-level spectral features according to formulas (4)(5);
(3) Splice them together to obtain the source speaker's high-level spectral feature sequence X^MCEP = [X_1^MCEP^T, X_2^MCEP^T, ..., X_{T_1}^MCEP^T]^T, where the t-th frame high-level spectral feature is X_t^MCEP = [x_t^MCEP^T, Δx_t^MCEP^T, Δ²x_t^MCEP^T]^T. Likewise obtain the target speaker's high-level spectral feature sequence Y^MCEP = [Y_1^MCEP^T, Y_2^MCEP^T, ..., Y_{T_2}^MCEP^T]^T, where the t-th frame high-level spectral feature is Y_t^MCEP = [y_t^MCEP^T, Δy_t^MCEP^T, Δ²y_t^MCEP^T]^T.
Step 3: dynamic time warping
(1) Compute the alignment function between X^MCEP and Y^MCEP with the dynamic time warping (DTW) algorithm, align X^MCEP and Y^MCEP according to this function, and splice the aligned X^MCEP and Y^MCEP to obtain the joint high-level spectral feature sequence Z^MCEP = [Z_1^MCEP^T, Z_2^MCEP^T, ..., Z_T^MCEP^T]^T, where the t-th frame joint high-level spectral feature is Z_t^MCEP = [X_t^MCEP^T, Y_t^MCEP^T]^T and T is the number of frames after alignment;
(2) Align X^SPE and Y^SPE with the alignment function obtained in (1), and splice the aligned X^SPE and Y^SPE to obtain the joint spectral-envelope feature sequence Z^SPE = [Z_1^SPE^T, Z_2^SPE^T, ..., Z_T^SPE^T]^T, where the t-th frame joint spectral-envelope feature is Z_t^SPE = [X_t^SPE^T, Y_t^SPE^T]^T;
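The frame alignment in this step is standard dynamic time warping. A minimal, unoptimized DTW sketch under the assumption of Euclidean frame distances (the function name and return format are illustrative, not from the patent):

```python
import numpy as np

def dtw_align(x, y):
    """Minimal DTW between feature sequences x (Tx, D) and y (Ty, D).
    Returns the warping path as a list of (i, j) frame-index pairs,
    which would then be used to align and splice X^MCEP/Y^MCEP and
    X^SPE/Y^SPE into the joint sequences."""
    Tx, Ty = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):          # accumulate minimal path cost
        for j in range(1, Ty + 1):
            D[i, j] = dist[i-1, j-1] + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    path, i, j = [], Tx, Ty             # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i-1, j-1], D[i-1, j], D[i, j-1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Production systems typically add path constraints and slope weights; this sketch shows only the core recursion.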
Step 4: GMM model training
Using the joint high-level spectral features Z^MCEP obtained in the previous step, train a GMM model with the EM algorithm according to the maximum-likelihood criterion, obtaining the model parameters λ_GMM = {ω_m, μ_m, Σ_m, m = 1, ..., M}, where M is the number of Gaussian mixture components in the GMM and ω_m, μ_m, Σ_m are the weight, mean vector, and covariance matrix of the m-th mixture component respectively;
Step 5: acoustic-subspace partition of the joint spectral-envelope features
After GMM training is complete, use the obtained GMM parameters λ_GMM to partition the joint high-level spectral features Z^MCEP into acoustic subspaces according to the maximum-posterior criterion, obtaining the index sequence m = [m_1, m_2, ..., m_t, ..., m_T] composed of the acoustic-subspace index of every frame of Z^MCEP (formula (6));
Partition the joint spectral-envelope features Z^SPE into acoustic subspaces according to the index sequence m, grouping together the joint spectral-envelope feature frames that share the same subspace index, as the training feature set of the Gaussian-Bernoulli RBM model of that acoustic subspace;
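Steps 4 and 5 together amount to fitting a GMM by EM under the maximum-likelihood criterion and then labeling every frame with its maximum-posterior mixture component. A minimal diagonal-covariance sketch in NumPy, with invented function and variable names (the real input would be the DTW-aligned joint MCEP sequence):

```python
import numpy as np

def train_gmm_diag(Z, M, n_iter=50):
    """Minimal EM for a diagonal-covariance GMM (Step 4), followed by
    maximum-posterior frame labeling (Step 5).  Z: (T, D) joint feature
    frames; M: number of mixture components / acoustic subspaces.
    Returns (weights, means, variances, labels)."""
    T, D = Z.shape
    w = np.full(M, 1.0 / M)
    mu = Z[np.linspace(0, T - 1, M).astype(int)].copy()  # simple deterministic init
    var = np.tile(Z.var(axis=0), (M, 1)) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities gamma_{t,m} proportional to w_m N(z_t; mu_m, var_m)
        logp = (-0.5 * (((Z[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        Nm = gamma.sum(axis=0) + 1e-10
        w = Nm / T
        mu = (gamma.T @ Z) / Nm[:, None]
        var = (gamma.T @ (Z ** 2)) / Nm[:, None] - mu ** 2 + 1e-6
    labels = gamma.argmax(axis=1)   # formula (6): argmax_m P(m | z_t, lambda)
    return w, mu, var, labels
```

The per-subspace RBM training sets of Step 5 are then simply `Z[labels == m]` for each subspace index m.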
Step 6: Gaussian-Bernoulli RBM model training
Because the spectral-envelope feature values are continuous real numbers, each visible-layer node is assumed to follow a continuous probability distribution so that the feature distribution can be described more accurately; here a Gaussian distribution is assumed, while the hidden nodes are assumed to follow a binary {0, 1} (Bernoulli) distribution;
According to the partition result of Step 5, an RBM model is trained independently for each acoustic subspace. The energy function of the Gaussian-Bernoulli RBM is:
E(v, h) = Σ_{i=1}^{V} (v_i − a_i)² / (2σ_i²) − Σ_{j=1}^{H} b_j h_j − Σ_{i=1}^{V} Σ_{j=1}^{H} (v_i/σ_i) w_ij h_j    (7)
where the variable v = [v_1, v_2, ..., v_V]^T corresponds to the visible-layer nodes of the RBM, V being their number, and the variable h = [h_1, h_2, ..., h_H]^T corresponds to the hidden nodes, H being their number. θ = {W, a, b} are the model parameters: W = {w_ij}_{V×H}, where w_ij is the connection weight between visible node v_i and hidden node h_j, and a = [a_1, a_2, ..., a_V]^T and b = [b_1, b_2, ..., b_H]^T are bias parameters. The variance σ_i² of visible node v_i is fixed during training and not updated; for notational convenience it is set to 1 here. The joint probability distribution of the visible layer v and the hidden nodes h is defined as:
P(v, h) = exp(−E(v, h)) / Z    (8)
where the partition function is Z = Σ_h ∫ exp(−E(v, h)) dv.
From formulas (7)(8), the marginal distribution of the visible layer is obtained:
log P(v) = −log Z − Σ_{i=1}^{V} (v_i − a_i)²/2 + Σ_{j=1}^{H} log(1 + exp(b_j + v^T w_j))    (10)
Using the training data of each acoustic subspace obtained in Step 5, estimate the model parameters with the contrastive divergence (CD) algorithm according to the maximum-likelihood criterion (see G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002), where {W_m, a_m, b_m} are the model parameters of the m-th Gaussian-Bernoulli RBM;
Step 7: spectral conversion
(1) In the conversion stage, extract the static spectral-envelope features of the speech to be converted, obtain their first- and second-order dynamic spectral-envelope features according to formulas (2)(3) and (4)(5), and splice the static, first-order, and second-order features together to obtain the spectral-envelope features to be converted; the t-th frame is denoted X_t^SPE. On the basis of the static spectral-envelope features, extract the static high-level spectral features, obtain their first- and second-order dynamics according to formulas (2)(3) and (4)(5), and splice them together to obtain the high-level spectral features of the speech to be converted; the t-th frame is denoted X_t^MCEP. The acoustic-subspace index m of the t-th frame is computed according to the maximum-posterior criterion:
m = argmax_m P(m | X_t^MCEP, λ_GMM)    (11)
(2) The spectral-envelope feature X_t^SPE to be converted is transformed according to the maximum-conditional-probability output criterion; the converted spectral-envelope feature is:
Ỹ_t^SPE = argmax_{Y_t^SPE} P(Y_t^SPE | X_t^SPE) = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE) / P(X_t^SPE)    (12)
which can be further simplified to:
Ỹ_t^SPE = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE)    (13)
Since formula (13) has no closed-form solution, the converted spectral-envelope parameters are obtained by gradient search; the update formula of the gradient algorithm is:
Y_t^SPE(i+1) = Y_t^SPE(i) + α · ∂log P(X_t^SPE, Y_t^SPE)/∂Y_t^SPE |_{Y_t^SPE = Y_t^SPE(i)}    (14)
where i is the iteration index and α is the step size. From formula (10), the partial derivative of log P(X_t^SPE, Y_t^SPE) with respect to Y_t^SPE is:
∂log P(X_t^SPE, Y_t^SPE)/∂Y_t^SPE = −(Y_t^SPE − a_m^(y)) + Σ_{j=1}^{H} [exp(b_{m,j} + v_t^T w_{m,j}) / (1 + exp(b_{m,j} + v_t^T w_{m,j}))] · w_{m,j}^(y)    (15)
where a_m, b_m = [b_{m,1}, ..., b_{m,j}, ..., b_{m,H}]^T, and W_m = [w_{m,1}, ..., w_{m,j}, ..., w_{m,H}]_{V×H} are the parameters of the m-th Gaussian-Bernoulli RBM model, w_{m,j} is the j-th column of the matrix W_m, and a_m^(y) and w_{m,j}^(y) are the entries of a_m and w_{m,j} corresponding to the target features.
The mode of the RBM model (see Z. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis," in Proc. ICASSP, 2013) is adopted as the initial value of the search algorithm;
In the log domain, log P(X_t^SPE, Y_t^SPE) contains the function term f(x) = log(1 + exp(x)). When |x| > 4, f*(x) approximates f(x) accurately:
f*(x) = x if x ≥ 0, 0 if x < 0    (16)
With this approximation, formula (13) can be solved directly, and the converted spectral-envelope feature is:
Ỹ_t^SPE = a_m^(y) + Σ_{j: b_{m,j} + v^T w_{m,j} > 0} w_{m,j}^(y)    (17)
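The approximate closed-form conversion of formula (17) can be sketched as follows; the partition of the parameters into source and target parts, the initialization of the target half of the visible vector, and all names are illustrative assumptions:

```python
import numpy as np

def convert_frame(x, y_init, a, b, W, Dx):
    """Approximate conversion per Eq. (17): with f(t) = log(1+exp(t))
    replaced by max(t, 0), the optimal target part of the visible
    vector is the target bias plus the weight columns of every hidden
    unit whose input b_j + v^T w_j is positive.
    x:      source spectral-envelope frame (length Dx)
    y_init: starting guess for the target part (e.g. the search
            initialization of the RBM mode)
    a, b, W: visible biases, hidden biases, weights of the subspace RBM
    Dx:     source dimensionality (v = [x; y])."""
    v = np.concatenate([x, y_init])
    act = b + v @ W                      # hidden-unit inputs b_j + v^T w_j
    on = act > 0                         # hidden units treated as "on"
    a_y = a[Dx:]                         # target part of the visible bias
    W_y = W[Dx:, :]                      # target rows of the weight matrix
    return a_y + W_y[:, on].sum(axis=1)  # Eq. (17)
```

Each "on" hidden unit contributes its target weight column, which is what lets different frames of the same subspace receive different converted spectra.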
Step 8: synthesize the converted speech
Finally, feed the converted fundamental-frequency sequence and the converted spectral-envelope feature sequence obtained in Step 7 into the STRAIGHT vocoder to synthesize the converted speech.
Technical solution two of the present invention: a voice conversion method based on joint spectral modeling with restricted Boltzmann machines, with the following steps:
Step 1: extract the spectral-envelope features of the speech
(1) Analyze the corpora of the source and target speakers frame by frame with the STRAIGHT analysis/synthesis vocoder, obtaining the fundamental-frequency sequences and the static spectral-envelope features x_t^SPE and y_t^SPE, where x_t^SPE and y_t^SPE are the t-th frame static spectral-envelope vectors of the source and target speaker respectively, of dimensionality 513, and T_1 and T_2 are the numbers of source and target feature frames;
(2) From the static spectral-envelope features, obtain the first-order dynamic spectral-envelope features according to formulas (2)(3) and the second-order dynamic spectral-envelope features according to formulas (4)(5):
Δc_t = 0.5·c_{t+1} − 0.5·c_{t−1}, ∀t ∈ [2, T−1]    (2)
Δc_1 = Δc_2, Δc_T = Δc_{T−1}    (3)
Δ²c_t = c_{t+1} − 2c_t + c_{t−1}, ∀t ∈ [2, T−1]    (4)
Δ²c_1 = Δ²c_2, Δ²c_T = Δ²c_{T−1}    (5)
where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames of the feature sequence, and c_t is the t-th frame feature vector;
(3) Splice the static and dynamic features together to obtain the source speaker's spectral-envelope feature sequence X^SPE = [X_1^SPE^T, X_2^SPE^T, ..., X_{T_1}^SPE^T]^T, where the t-th frame spectral-envelope feature is X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T and [·]^T denotes vector transposition. Likewise splice the target features to obtain the target speaker's spectral-envelope feature sequence Y^SPE = [Y_1^SPE^T, Y_2^SPE^T, ..., Y_{T_2}^SPE^T]^T, where the t-th frame spectral-envelope feature is Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T;
Step 2: extract the high-level spectral features of the speech
(1) On the basis of the static spectral-envelope features, further extract a high-level spectral feature for every speech frame, here a 40th-order mel-cepstral feature, obtaining the static high-level spectral features x_t^MCEP and y_t^MCEP of the source and target speakers;
(2) From these, obtain the first-order dynamic high-level spectral features according to formulas (2)(3) and the second-order dynamic high-level spectral features according to formulas (4)(5);
(3) Splice them together to obtain the source speaker's high-level spectral feature sequence X^MCEP = [X_1^MCEP^T, X_2^MCEP^T, ..., X_{T_1}^MCEP^T]^T, where the t-th frame high-level spectral feature is X_t^MCEP = [x_t^MCEP^T, Δx_t^MCEP^T, Δ²x_t^MCEP^T]^T. Likewise obtain the target speaker's high-level spectral feature sequence Y^MCEP = [Y_1^MCEP^T, Y_2^MCEP^T, ..., Y_{T_2}^MCEP^T]^T, where the t-th frame high-level spectral feature is Y_t^MCEP = [y_t^MCEP^T, Δy_t^MCEP^T, Δ²y_t^MCEP^T]^T.
Step 3: dynamic time warping
(1) Compute the alignment function between X^MCEP and Y^MCEP with the dynamic time warping (DTW) algorithm, align X^MCEP and Y^MCEP according to this function, and splice the aligned X^MCEP and Y^MCEP to obtain the joint high-level spectral feature sequence Z^MCEP = [Z_1^MCEP^T, Z_2^MCEP^T, ..., Z_T^MCEP^T]^T, where the t-th frame joint high-level spectral feature is Z_t^MCEP = [X_t^MCEP^T, Y_t^MCEP^T]^T and T is the number of frames after alignment;
(2) Align X^SPE and Y^SPE with the alignment function obtained in (1), and splice the aligned X^SPE and Y^SPE to obtain the joint spectral-envelope feature sequence Z^SPE, where the t-th frame joint spectral-envelope feature is Z_t^SPE = [X_t^SPE^T, Y_t^SPE^T]^T;
Step 4: GMM model training
Using the joint high-level spectral features Z^MCEP obtained in the previous step, train a GMM model with the EM algorithm according to the maximum-likelihood criterion, obtaining the model parameters λ_GMM = {ω_m, μ_m, Σ_m, m = 1, ..., M}, where M is the number of Gaussian mixture components in the GMM and ω_m, μ_m, Σ_m are the weight, mean vector, and covariance matrix of the m-th mixture component respectively;
Step 5: acoustic-subspace partition of the joint spectral-envelope features
After GMM training is complete, use the obtained GMM parameters λ_GMM to partition the joint high-level spectral features Z^MCEP into acoustic subspaces according to the maximum-posterior criterion, obtaining the index sequence m = [m_1, m_2, ..., m_t, ..., m_T] composed of the acoustic-subspace index of every frame of Z^MCEP:
m = argmax_m P(m | Z^MCEP, λ_GMM)    (6)
Partition the joint spectral-envelope features Z^SPE into acoustic subspaces according to the index sequence m, grouping together the joint spectral-envelope feature frames that share the same subspace index, as the training feature set of the Gaussian-Gaussian RBM model of that acoustic subspace;
Step 6: Gaussian-Gaussian RBM model training
In the Gaussian-Gaussian RBM model the visible-layer node variables are likewise assumed to follow Gaussian distributions, but unlike the Gaussian-Bernoulli RBM model no hidden nodes are used (the number of hidden nodes is set to 0); instead, a full connection is established directly between the source speaker's visible nodes x and the target speaker's visible nodes y. The connection matrix is W = {w_ij}_{D×D}, where D is the feature-vector dimensionality and w_ij is the connection weight between source visible node x_i and target visible node y_j. The corresponding energy function is:
E(x, y) = Σ_{i=1}^{D} x_i²/(2σ_i²) + Σ_{j=1}^{D} y_j²/(2σ_j²) − Σ_{i=1}^{D} Σ_{j=1}^{D} (x_i/σ_i) w_ij (y_j/σ_j)    (18)
where θ = {W} are the model parameters; the variances σ_i² and σ_j² of the source and target feature nodes are fixed to 1 here. The joint probability distribution of x and y is then:
P(x, y) = exp(−E(x, y)) / Z    (19)
where the partition function is Z = ∫∫ exp(−E(x, y)) dx dy    (20)
and the full weight matrix over the joint visible vector [x; y] has the block form:
W̄ = [0, W; W^T, 0]    (21)
Using the joint spectral-envelope feature sets of the acoustic subspaces obtained in Step 5, train the corresponding Gaussian-Gaussian RBM models with the CD algorithm according to the maximum-likelihood criterion (see G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002), obtaining the parameters {W_m}, where W_m is the model parameter of the m-th Gaussian-Gaussian RBM;
Step 7: spectral conversion
(1) In the conversion stage, extract the static spectral-envelope features of the speech to be converted, obtain their first- and second-order dynamic spectral-envelope features according to formulas (2)(3) and (4)(5), and splice the static, first-order, and second-order features together to obtain the spectral-envelope features to be converted; the t-th frame is denoted X_t^SPE. On the basis of the static spectral-envelope features, extract the static high-level spectral features, obtain their first- and second-order dynamics according to formulas (2)(3) and (4)(5), and splice them together to obtain the high-level spectral features of the speech to be converted; the t-th frame is denoted X_t^MCEP. The acoustic-subspace index m of the t-th frame is computed according to the maximum-posterior criterion:
m = argmax_m P(m | X_t^MCEP, λ_GMM)    (22)
(2) The spectral-envelope feature X_t^SPE to be converted is transformed according to the maximum-conditional-probability output criterion (as in formulas (12)(13)). From formulas (18)(19)(20), setting
∂log P(X_t^SPE, Y_t^SPE) / ∂Y_t^SPE = 0    (23)
yields the converted spectral-envelope parameters:
Ỹ_t^SPE = W_m^T X_t^SPE    (24)
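Formula (24) reduces conversion under the Gaussian-Gaussian RBM to a single matrix product per frame. A tiny numeric sketch (the weight matrix below is invented for illustration, not learned):

```python
import numpy as np

# With unit variances and no hidden layer, setting the gradient of
# log P(X, Y) with respect to Y to zero (Eq. (23)) gives the linear
# closed-form conversion Y = W_m^T X of Eq. (24), using the weight
# matrix W_m of the frame's acoustic subspace.
W_m = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0],
                [0.0, 0.0, 0.5]])    # illustrative subspace weights
x_t = np.array([1.0, 1.0, 2.0])     # source spectral-envelope frame
y_t = W_m.T @ x_t                   # Eq. (24): converted frame
```

Unlike the GMM conditional mean, W_m is learned jointly over all feature dimensions, so the converted frame varies with the source frame instead of collapsing to a subspace average.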
Step 8: synthesize the converted speech
Finally, feed the converted fundamental-frequency sequence and the converted spectral-envelope feature sequence obtained in Step 7 into the STRAIGHT vocoder to synthesize the converted speech.
Principle of the present invention:
(1) The spectral-envelope features extracted directly by the STRAIGHT vocoder are used for spectral modeling instead of high-level spectral features. STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) is a high-quality speech vocoder that effectively separates the speech source from the filter, decomposing the speech signal into the fundamental frequency and a spectral envelope from which the influence of the fundamental frequency has been removed. Traditional spectral modeling methods extract high-level spectral features from the raw spectral envelope. In the modeling method proposed by the present invention, no high-level feature extraction is performed; the spectral envelope itself is used as the modeling feature, so as to reduce the loss of spectral detail in feature extraction;
(2) Within the GMM framework, a restricted Boltzmann machine replaces the single Gaussian distribution to model the joint spectral features of each acoustic subspace. Compared with high-level spectral features, spectral-envelope features make spectral modeling harder: on the one hand they have higher dimensionality, and on the other hand they have stronger correlations between dimensions. The restricted Boltzmann machine has two advantages here. First, it is not prone to over-fitting and, compared with a single Gaussian distribution, can better model the correlations between the dimensions of high-dimensional features; second, in a restricted Boltzmann machine the feature with maximum output probability is no longer the sample mean, which effectively alleviates the over-smoothing of the converted features. We therefore use a restricted Boltzmann machine (RBM) instead of the traditional single Gaussian distribution to describe the output probability of the spectral features in each acoustic subspace.
Compared with the prior art, the advantages of the present invention, i.e. the beneficial effects of realizing voice conversion with the proposed RBM-based joint spectral modeling of source and target speakers, are mainly the following three points:
(1) The raw spectral-envelope features extracted by the STRAIGHT vocoder are modeled directly, retaining more spectral information and avoiding the loss of fine detail;
(2) A restricted Boltzmann machine replaces the single Gaussian distribution as the output probability of each acoustic subspace, enabling the learning of the correlations between the dimensions of the high-dimensional spectral features and describing the joint feature distribution of source and target speech more accurately;
(3) The converted spectrum is richer in detail and the over-smoothing effect is reduced. When a Gaussian-Bernoulli RBM with H hidden nodes is used, the acoustic subspace is in effect described by 2^H distinct states, characterizing the transformation between source and target spectra in much finer detail. During conversion, after the acoustic subspace of the spectrum to be converted is determined, the hidden nodes further select the transformation relation of the corresponding state, so consecutive frames of the same phoneme are converted more precisely according to their differing states, which effectively avoids the time-domain over-smoothing of the conventional model. Likewise, when a Gaussian-Gaussian RBM is used, although, as with the traditional GMM model, the converted spectrum obtained under the maximum-posterior output criterion is still the conditional mean of the distribution, the RBM can effectively model the correlations between the dimensions of the spectral envelope: the learned matrix characterizing the relations between the source and target speakers' spectral dimensions, playing the role of Σ^(xy) in formula (1), takes effect during conversion, so the conversion result is corrected differently for different source spectral parameters. Even for consecutive speech frames of the same phoneme, the converted spectra then differ, making the converted speech vary more naturally in the time domain. Meanwhile, since the converted spectra obtained under the RBM model are no longer weighted averages of the training data, the frequency-domain over-smoothing is also effectively reduced.
Description of the drawings
Fig. 1 is the implementation flowchart of the present invention.
Embodiment
In the present invention, according to the thought of the joint spectrum modeling based on limited Boltzmann machine proposed, the idiographic flow of accomplished sound conversion as shown in Figure 1.Be different from and based in GMM model conversion method, each single Gauss in acoustics subspace be described, in the present invention, adopt limited Boltzmann machine (RBM) model to be described.At RBM training module, the RBM of different structure can be adopted according to RBM model concrete form difference, such as Gaussian-BernoulliRBM, Gaussian-GaussianRBM etc.
A restricted Boltzmann machine (see R. Salakhutdinov, "Learning deep generative models," Ph.D. dissertation, University of Toronto, 2009) is a two-layer undirected graphical model describing the dependencies within a group of random variables. It consists of a set of visible nodes v = [v_1, v_2, ..., v_V]^T and a set of hidden nodes h = [h_1, h_2, ..., h_H]^T, where V and H are the numbers of visible and hidden nodes respectively, and every visible node is connected to every hidden node. We take the joint source-target spectral envelope feature vector as the visible-layer random variable, i.e. the RBM is used to learn the mapping between the spectral envelope features of the source and target speakers.
Depending on the probability distributions assumed for the visible and hidden node variables, a restricted Boltzmann machine can take different forms. The joint modeling and voice conversion steps under the Gaussian-Bernoulli and Gaussian-Gaussian forms are described in detail below.
1. Technical solution one: voice conversion method with joint spectral modeling based on a Gaussian-Bernoulli RBM
Step one: extract speech spectral characteristics
The STRAIGHT analysis-synthesis tool is used to analyze the corpora of the source and target speakers frame by frame, yielding the fundamental frequency (F0) sequences and the sequences x^SPE and y^SPE of 513-dimensional spectral envelope feature vectors. On the basis of the static spectral envelope features, dynamic features (here first- and second-order differences) are further computed. The first-order difference features are computed according to formulas (2) and (3):
Δc_t = 0.5(c_{t+1} - c_{t-1})    (2)
Δc_1 = Δc_2,  Δc_T = Δc_{T-1}    (3)

The second-order difference features are computed according to formulas (4) and (5):

Δ²c_t = c_{t+1} - 2c_t + c_{t-1}    (4)
Δ²c_1 = Δ²c_2,  Δ²c_T = Δ²c_{T-1}    (5)

where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames in the feature sequence, and c_t is the feature vector of frame t;
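A minimal sketch of formulas (2)-(5) in Python (the function name and the centered-difference form of the first-order delta are this illustration's assumptions):

```python
import numpy as np

def delta_features(c):
    """Append first- and second-order dynamic features to a (T, D) sequence,
    following formulas (2)-(5): centered differences in the interior,
    boundary frames copied from their neighbors."""
    d1 = np.zeros_like(c)
    d1[1:-1] = 0.5 * (c[2:] - c[:-2])          # formula (2)
    d1[0], d1[-1] = d1[1], d1[-2]              # formula (3)
    d2 = np.zeros_like(c)
    d2[1:-1] = c[2:] - 2.0 * c[1:-1] + c[:-2]  # formula (4)
    d2[0], d2[-1] = d2[1], d2[-2]              # formula (5)
    return np.hstack([c, d1, d2])              # (T, 3D): static + dynamic
```

For a linear ramp the first-order delta is constant and the second-order delta vanishes, which is a quick sanity check.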
The spectral envelope feature sequences X^SPE and Y^SPE of the source and target speakers are finally obtained, where the feature vector of frame t has the form X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T and Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T, with [·]^T denoting vector transposition;
On the basis of the static spectral envelope sequences, the high-level spectral feature of each frame is further extracted; 40th-order mel-cepstral features are used here, and their first- and second-order differences are computed according to formulas (2)-(5), giving the high-level spectral feature sequences X^MCEP and Y^MCEP of the source and target speakers.
The high-level spectral feature sequences X^MCEP and Y^MCEP of the source and target speakers are aligned with the dynamic time warping (DTW) algorithm, and the joint high-level spectral feature sequence Z^MCEP is generated, where T is the total number of frames after alignment.
The spectral envelope sequences X^SPE and Y^SPE are aligned with the same alignment function obtained on the high-level spectral features, generating the joint spectral envelope feature sequence Z^SPE.
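The DTW alignment step can be illustrated with a minimal implementation over Euclidean frame distances (a sketch; the function name is an assumption, and production alignments typically add path constraints):

```python
import numpy as np

def dtw_path(X, Y):
    """Minimal DTW between two (T, D) feature sequences.
    Returns the optimal alignment path as (source, target) frame pairs."""
    Tx, Ty = len(X), len(Y)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack from the end of both sequences
    path, i, j = [], Tx, Ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The same path is then reused to align the spectral envelope sequences, so both feature streams share one time warp.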
Step 2: GMM model training
Using the joint high-level spectral features obtained in the previous step, a GMM is trained with the EM algorithm under the maximum-likelihood criterion, yielding the model parameters λ^GMM = {ω_m, μ_m, Σ_m}_{m=1}^M, where M is the number of Gaussian mixture components and ω_m, μ_m, Σ_m are the weight, mean vector and covariance matrix of the m-th component; diagonal covariance matrices are adopted here.
Step 3: acoustic subspace partition of the joint spectral envelope features
After GMM training, the obtained GMM parameters are used to partition the high-level spectral features into acoustic subspaces according to the maximum a posteriori criterion, yielding the subspace index sequence m = [m_1, m_2, ..., m_t, ..., m_T] over the training frames:

m_t = argmax_m P(m | Z_t^MCEP, λ^GMM)    (6)
According to the index sequence m, the joint spectral envelope features Z^SPE are partitioned into acoustic subspaces: the spectral envelope features sharing the same subspace index are grouped together as the training data set of the Gaussian-Bernoulli RBM of that acoustic subspace.
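Steps 2 and 3 (EM training of the GMM under maximum likelihood, then the maximum a posteriori subspace assignment of formula (6)) can be sketched with scikit-learn; the mixture count and the toy data below are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for the joint high-level feature matrix Z^MCEP (T frames x dims).
rng = np.random.default_rng(0)
Z_mcep = rng.normal(size=(500, 8))

gmm = GaussianMixture(n_components=4,         # M mixtures (assumed value)
                      covariance_type="diag",  # diagonal Σ_m as in the text
                      max_iter=100, random_state=0)
gmm.fit(Z_mcep)                # EM under the maximum-likelihood criterion
m = gmm.predict(Z_mcep)        # MAP subspace index m_t per frame, formula (6)
```

`gmm.predict` returns exactly the argmax-posterior index m_t of formula (6), and grouping frames by this index yields the per-subspace RBM training sets.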
Step 4: Gaussian-Bernoulli RBM model training
Since spectral envelope parameter values are continuous real numbers, each visible node is assumed to follow a continuous distribution so as to describe them more accurately; here a Gaussian distribution is assumed, while the hidden nodes are assumed to follow a binary {0, 1} distribution.
According to the partition result of step 3, an RBM is trained independently for each acoustic subspace. The energy function of the Gaussian-Bernoulli RBM adopted in this scheme is:
E(v, h) = Σ_{i=1}^V (v_i - a_i)² / (2σ_i²) - Σ_{j=1}^H b_j h_j - Σ_{i=1}^V Σ_{j=1}^H (v_i / σ_i) w_ij h_j    (7)
where v = [v_1, v_2, ..., v_V]^T are the visible nodes of the RBM, V is the number of visible nodes, h = [h_1, h_2, ..., h_H]^T are the hidden nodes, and H is the number of hidden nodes. θ = {W, a, b} are the model parameters: W = {w_ij}_{V×H}, with w_ij the connection weight between visible node v_i and hidden node h_j, and a = [a_1, ..., a_V]^T and b = [b_1, ..., b_H]^T the bias parameters. σ_i² is the variance of visible node v_i; it is fixed during training and, for notational convenience, set to 1 here. The joint distribution of the visible nodes v and hidden nodes h is defined as:

P(v, h; θ) = (1/Z(θ)) exp(-E(v, h))    (8)

where Z(θ) = ∫_v Σ_h exp(-E(v, h)) dv is the partition function.
According to formulas (7) and (8), the marginal distribution of the visible nodes is obtained as

P(v; θ) = (1/Z(θ)) exp(-Σ_{i=1}^V (v_i - a_i)²/2) Π_{j=1}^H (1 + exp(b_j + Σ_{i=1}^V v_i w_ij))    (10)
Using the training data of each acoustic subspace obtained in step 3, the model parameters λ^RBM = {W_m, a_m, b_m}_{m=1}^M are estimated under the maximum-likelihood criterion with the Contrastive Divergence (CD) algorithm (see G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002), where {W_m, a_m, b_m} are the parameters of the m-th Gaussian-Bernoulli RBM;
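A simplified sketch of CD-1 training for one subspace's Gaussian-Bernoulli RBM with unit visible variance (the hidden size, learning rate, epoch count and mean-field reconstruction are this sketch's assumptions, not the patent's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_gb_rbm(V, H=64, lr=1e-3, epochs=10, seed=0):
    """CD-1 training of a Gaussian-Bernoulli RBM, energy as in formula (7)
    with sigma_i = 1.  V is the (N, Vdim) training matrix of one subspace."""
    rng = np.random.default_rng(seed)
    Vdim = V.shape[1]
    W = 0.01 * rng.normal(size=(Vdim, H))
    a = V.mean(axis=0)                 # visible bias initialized at data mean
    b = np.zeros(H)
    n = len(V)
    for _ in range(epochs):
        # positive phase: hidden probabilities and a binary sample
        ph = sigmoid(V @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
        # negative phase: one Gibbs step back to the visibles (mean-field mean)
        v_neg = a + h @ W.T
        ph_neg = sigmoid(v_neg @ W + b)
        # CD-1 parameter updates
        W += lr * (V.T @ ph - v_neg.T @ ph_neg) / n
        a += lr * (V - v_neg).mean(axis=0)
        b += lr * (ph - ph_neg).mean(axis=0)
    return W, a, b
```

One RBM is trained per acoustic subspace on that subspace's joint source-target envelope frames.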
Step 5: spectral conversion
(1) At the conversion stage, the static spectral envelope features of the speech to be converted are extracted, their first- and second-order dynamic features are computed according to formulas (2)-(5), and the static and dynamic features are concatenated to form the spectral envelope feature to be converted, denoted X_t^SPE for frame t. The static high-level spectral features are likewise extracted from the static spectral envelope, their dynamic features computed according to formulas (2)-(5), and the concatenation denoted X_t^MCEP for frame t. The acoustic subspace index m of the t-th frame is computed according to the maximum a posteriori criterion:

m = argmax_m P(m | X_t^MCEP, λ^GMM)    (11)
(2) The spectral envelope feature to be converted is transformed under the maximum conditional probability criterion; the converted spectral envelope feature is

Ỹ_t^SPE = argmax_{Y_t^SPE} P(Y_t^SPE | X_t^SPE) = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE) / P(X_t^SPE)    (12)

Since P(X_t^SPE) does not depend on Y_t^SPE, this simplifies further to

Ỹ_t^SPE = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE)    (13)
Since formula (13) has no closed-form solution, a gradient-based search is used to obtain the converted spectral parameters:

Y_t^SPE(i+1) = Y_t^SPE(i) + α · ∂ log P(X_t^SPE, Y_t^SPE) / ∂Y_t^SPE |_{Y_t^SPE = Y_t^SPE(i)}    (14)

where i is the iteration index and α the step size. According to formula (10), the partial derivative with respect to Y_t^SPE is

∂ log P(X_t^SPE, Y_t^SPE) / ∂Y_t^SPE = -(Y_t^SPE - a_m^(y)) + Σ_{j=1}^H [exp(b_{m,j} + v_t^T w_{m,j}) / (1 + exp(b_{m,j} + v_t^T w_{m,j}))] w_{m,j}^(y)    (15)

where a_m, b_m = [b_{m,1}, ..., b_{m,j}, ..., b_{m,H}]^T and W_m = [w_{m,1}, ..., w_{m,j}, ..., w_{m,H}]_{V×H} are the parameters of the m-th Gaussian-Bernoulli RBM, w_{m,j} is the j-th column of W_m, and a_m^(y), w_{m,j}^(y) are the sub-vectors of a_m and w_{m,j} corresponding to the target feature dimensions.
Since the gradient search is sensitive to its initial value, the mode of the RBM (see Z. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis," in Proc. ICASSP, 2013) is adopted as the initial value of the search.
In practice the gradient search is too time-consuming to meet real-time conversion requirements, so the conversion is further approximated. Observe that the log-domain expression contains the function term f(x) = log(1 + exp(x)), which is accurately approximated by f*(x) when |x| > 4:

f*(x) = x for x ≥ 0,  f*(x) = 0 for x < 0    (16)

Furthermore, experimental statistics show that the hidden-node activation s = b_{m,j} + v^T w_{m,j} is either very large or very small (for example, P(|s| > 4) = 0.94 on the training set), so this approximation is reasonable for this model. With it, a closed-form solution for the converted spectral envelope feature is obtained, simplifying the conversion:

Ỹ_t^SPE = a_m^(y) + Σ_{j: b_{m,j} + v^T w_{m,j} > 0} w_{m,j}^(y)    (17)
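The approximate closed-form conversion of formula (17) reduces to a hard gate on the hidden activations. A sketch, where the visible vector v is the source features concatenated with a current target estimate (e.g. the RBM mode) and d_src, the source half's dimensionality, is an assumption of this illustration:

```python
import numpy as np

def convert_frame(v, W_m, a_m, b_m, d_src):
    """Formula (17): the target part of the visible bias plus the target
    rows of every hidden unit whose activation b_j + v^T w_j is positive."""
    s = b_m + v @ W_m                    # hidden activations s_j
    active = s > 0                       # hard gate from the f*(x) approximation
    return a_m[d_src:] + W_m[d_src:, active].sum(axis=1)
```

Because the gate is a simple sign test, each frame is converted in one pass instead of an iterative gradient search.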
Step 6: synthesize the converted speech
Finally, the converted F0 sequence (obtained here with the Gaussian transformation method) and the converted spectral envelope sequence obtained in step 5 are fed into the STRAIGHT synthesizer to synthesize the converted speech.
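The Gaussian F0 transformation referred to here is conventionally a mean/variance normalization in the log-F0 domain; a sketch under that assumption (function and parameter names are illustrative):

```python
import numpy as np

def convert_f0(f0_src, stats_src, stats_tgt):
    """Gaussian log-F0 transformation: match the source speaker's log-F0
    mean/std to the target speaker's.  Unvoiced frames (f0 == 0) pass through."""
    mu_x, sd_x = stats_src
    mu_y, sd_y = stats_tgt
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    f0_out[voiced] = np.exp(mu_y + (np.log(f0_src[voiced]) - mu_x) * sd_y / sd_x)
    return f0_out
```

The statistics (mu, sd) are estimated once per speaker over the voiced frames of the training corpus.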
2. Technical solution two: voice conversion method with joint spectral modeling based on a Gaussian-Gaussian RBM
Steps 1, 2 and 3 are identical to steps 1, 2 and 3 of solution one and are not repeated.
Step 4: Gaussian-Gaussian RBM model training
In the Gaussian-Gaussian RBM the visible node variables are likewise assumed to follow Gaussian distributions. Unlike the Gaussian-Bernoulli RBM, however, no connection between visible and hidden nodes is considered, i.e. the number of hidden nodes is set to 0, and a full connection is established directly between the source-speaker visible nodes x and the target-speaker visible nodes y. The connection matrix is W = {w_ij}_{D×D}, where D is the feature vector dimension and w_ij is the connection weight between source visible node x_i and target visible node y_j. The corresponding energy function is:
E(x, y) = Σ_{i=1}^D x_i²/(2σ_i²) + Σ_{j=1}^D y_j²/(2σ_j²) - Σ_{i=1}^D Σ_{j=1}^D (x_i / σ_i) w_ij (y_j / σ_j)    (18)
where θ = {W} are the model parameters; the bias terms are set to 0 and not updated, and for notational convenience the source and target node variances σ_i² and σ_j² are fixed to 1 here. The joint distribution of x and y then follows:

P(x, y; θ) = (1/Z(θ)) exp(-E(x, y))    (19)

where Z(θ) = ∫∫ exp(-E(x, y)) dx dy is the partition function    (20)

Equivalently, defining the symmetric block connection matrix

W̃ = [ 0  W ; W^T  0 ]    (21)

the joint distribution of the concatenated visible vector [x^T, y^T]^T is a zero-mean Gaussian with precision matrix I - W̃.
Likewise, the joint spectral envelope feature parameter sets of each acoustic subspace obtained in step 3 are used to train the parameters λ^RBM = {W_m}_{m=1}^M of the corresponding Gaussian-Gaussian RBMs under the maximum-likelihood criterion with the CD algorithm (see G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771-1800, 2002), where {W_m} are the parameters of the m-th Gaussian-Gaussian RBM;
Step 5: spectral conversion
(1) At the conversion stage, the static spectral envelope features of the speech to be converted are extracted, their first- and second-order dynamic features computed according to formulas (2)-(5), and the static and dynamic features concatenated into the spectral envelope feature to be converted, denoted X_t^SPE for frame t. The static high-level spectral features are extracted from the static spectral envelope, their dynamic features computed according to formulas (2)-(5), and the concatenation denoted X_t^MCEP for frame t. The acoustic subspace index m of the t-th frame is computed according to the maximum a posteriori criterion:

m = argmax_m P(m | X_t^MCEP, λ^GMM)    (22)
(2) The spectral envelope feature X_t^SPE is converted under the maximum conditional probability criterion (as in formulas (12) and (13)). According to formulas (18)-(20), setting

∂ log P(X_t^SPE, Y_t^SPE) / ∂Y_t^SPE = 0    (23)

yields the converted spectral envelope feature parameter

Ỹ_t^SPE = W_m^T X_t^SPE    (24)
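Formula (24) is a per-frame linear map; a minimal sketch:

```python
import numpy as np

def gg_convert(W_m, X_spe):
    """Formula (24): with unit variances and zero biases, setting the
    derivative of log P(x, y) with respect to y to zero (formula (23))
    gives -y + W_m^T x = 0, i.e. y = W_m^T x for each frame."""
    return X_spe @ W_m        # (T, D) frames mapped frame-wise by W_m^T
```

The same W_m that CD training learns as the source-target connection acts directly as the conversion matrix, which is why no iterative search is needed in this scheme.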
Step 6: synthesize the converted speech
Finally, the converted F0 sequence (obtained here with the Gaussian transformation method) and the converted spectral envelope sequence obtained in step 5 are fed into the STRAIGHT synthesizer to synthesize the converted speech.

Claims (2)

1. A voice conversion method with joint spectral modeling based on restricted Boltzmann machines, characterized in that the steps are as follows:
Step 1: extract speech spectral envelope features
(1) Use the STRAIGHT analysis-synthesis tool to analyze the corpora of the source and target speakers frame by frame, obtaining the F0 sequences and the static spectral envelope features x^SPE and y^SPE, where x_t^SPE and y_t^SPE are the 513-dimensional static spectral envelope vectors of frame t of the source and target speakers, and T_1 and T_2 are the numbers of source and target feature frames;
(2) From the static spectral envelope features x^SPE and y^SPE, obtain the first-order dynamic features according to formulas (2) and (3) and the second-order dynamic features according to formulas (4) and (5);
Δc_t = 0.5(c_{t+1} - c_{t-1})    (2)
Δc_1 = Δc_2, Δc_T = Δc_{T-1}    (3)
Δ²c_t = c_{t+1} - 2c_t + c_{t-1}    (4)
Δ²c_1 = Δ²c_2, Δ²c_T = Δ²c_{T-1}    (5)
where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames in the feature sequence, and c_t is the feature vector of frame t;
(3) Concatenate the static and dynamic features to obtain the source speaker's spectral envelope feature X^SPE, whose frame-t vector is X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T, with [·]^T denoting vector transposition, and likewise the target speaker's spectral envelope feature Y^SPE with frame-t vector Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T;
Step 2: extract high-level spectral features of speech
(1) On the basis of the static spectral envelope features, further extract the high-level spectral feature of each frame, here 40th-order mel-cepstral features, obtaining the static high-level spectral features x^MCEP and y^MCEP of the source and target speakers;
(2) From these, obtain the first-order dynamic high-level features according to formulas (2) and (3) and the second-order dynamic high-level features according to formulas (4) and (5);
(3) Concatenate the static and dynamic features to obtain the source speaker's high-level spectral feature X^MCEP, with frame-t vector X_t^MCEP = [x_t^MCEP^T, Δx_t^MCEP^T, Δ²x_t^MCEP^T]^T, and the target speaker's high-level spectral feature Y^MCEP, with frame-t vector Y_t^MCEP = [y_t^MCEP^T, Δy_t^MCEP^T, Δ²y_t^MCEP^T]^T;
Step 3: dynamic time warping
(1) Compute the alignment function between X^MCEP and Y^MCEP with the dynamic time warping (DTW) algorithm, align X^MCEP and Y^MCEP according to it, and concatenate the aligned sequences to obtain the joint high-level spectral feature Z^MCEP, whose frame-t vector is Z_t^MCEP = [X_t^MCEP^T, Y_t^MCEP^T]^T, T being the number of frames after alignment;
(2) Align X^SPE and Y^SPE with the alignment function obtained in (1), and concatenate the aligned sequences to obtain the joint spectral envelope feature Z^SPE, whose frame-t vector is Z_t^SPE = [X_t^SPE^T, Y_t^SPE^T]^T;
Step 4: GMM model training
Using the joint high-level spectral feature Z^MCEP obtained in the previous step, train a GMM with the EM algorithm under the maximum-likelihood criterion, obtaining the model parameters λ^GMM = {ω_m, μ_m, Σ_m}_{m=1}^M, where M is the number of Gaussian mixture components and ω_m, μ_m, Σ_m are the weight, mean vector and covariance matrix of the m-th component;
Step 5: acoustic subspace partition of the joint spectral envelope features
After GMM training, use the obtained GMM parameters λ^GMM to partition the joint high-level spectral feature Z^MCEP into acoustic subspaces according to the maximum a posteriori criterion, obtaining the index sequence m = [m_1, m_2, ..., m_t, ..., m_T] of the acoustic subspace indices of every frame of Z^MCEP;
According to the index sequence m, partition the joint spectral envelope feature Z^SPE into acoustic subspaces, grouping the joint spectral envelope frames with the same subspace index together as the training feature parameter set of the Gaussian-Bernoulli restricted Boltzmann machine (RBM) of that acoustic subspace;
Step 6: Gaussian-Bernoulli RBM model training
Since spectral envelope parameter values are continuous real numbers, each visible node is assumed to follow a continuous distribution (here a Gaussian) so as to describe them more accurately, while the hidden nodes are assumed to follow a binary {0, 1} distribution;
According to the partition result of step 5, train an RBM independently for each acoustic subspace; the energy function of the Gaussian-Bernoulli RBM adopted is:
E(v, h) = Σ_{i=1}^V (v_i - a_i)²/(2σ_i²) - Σ_{j=1}^H b_j h_j - Σ_{i=1}^V Σ_{j=1}^H (v_i/σ_i) w_ij h_j    (7)
where v = [v_1, v_2, ..., v_V]^T are the visible nodes of the RBM, V is the number of visible nodes, h = [h_1, h_2, ..., h_H]^T are the hidden nodes, and H is the number of hidden nodes; θ = {W, a, b} are the model parameters, W = {w_ij}_{V×H}, with w_ij the connection weight between visible node v_i and hidden node h_j, and a = [a_1, ..., a_V]^T and b = [b_1, ..., b_H]^T the bias parameters; σ_i² is the variance of visible node v_i, fixed during training and set to 1 here for notational convenience; the joint distribution of v and h is defined by formula (8), P(v, h; θ) = exp(-E(v, h))/Z(θ), where Z(θ) is the partition function;
According to formulas (7) and (8), the marginal distribution of the visible nodes is obtained;
Using the training data of each acoustic subspace obtained in step 5, estimate the model parameters λ^RBM = {W_m, a_m, b_m}_{m=1}^M under the maximum-likelihood criterion with the Contrastive Divergence (CD) algorithm, where {W_m, a_m, b_m} are the parameters of the m-th Gaussian-Bernoulli RBM;
Step 7: spectral conversion
(1) At the conversion stage, extract the static spectral envelope features of the speech to be converted, obtain their first- and second-order dynamic features according to formulas (2)-(5), and concatenate static and dynamic features into the spectral envelope feature to be converted, denoted X_t^SPE for frame t; extract the static high-level spectral features on the basis of the static spectral envelope, obtain their dynamic features according to formulas (2)-(5), and concatenate them into the high-level spectral feature X_t^MCEP of frame t; compute the acoustic subspace index m of frame t according to the maximum a posteriori criterion;
(2) Convert the spectral envelope feature X_t^SPE under the maximum conditional probability criterion; the converted spectral envelope feature is
Ỹ_t^SPE = argmax_{Y_t^SPE} P(Y_t^SPE | X_t^SPE)    (12)
which can be further simplified to
Ỹ_t^SPE = argmax_{Y_t^SPE} P(X_t^SPE, Y_t^SPE)    (13)
Since formula (13) has no closed-form solution, adopt a gradient-based search to obtain the converted spectral envelope parameters, with update formula
Y_t^SPE(i+1) = Y_t^SPE(i) + α · ∂ log P(X_t^SPE, Y_t^SPE)/∂Y_t^SPE |_{Y_t^SPE = Y_t^SPE(i)}    (14)
where i is the iteration index and α the step size; according to formula (10), the partial derivative of log P(X_t^SPE, Y_t^SPE) with respect to Y_t^SPE is given by formula (15), where a_m, b_m = [b_{m,1}, ..., b_{m,j}, ..., b_{m,H}]^T and W_m = [w_{m,1}, ..., w_{m,j}, ..., w_{m,H}]_{V×H} are the parameters of the m-th Gaussian-Bernoulli RBM, w_{m,j} is the j-th column of W_m, and a_m^(y), w_{m,j}^(y) are the sub-vectors of a_m and w_{m,j} corresponding to the target feature dimensions;
Adopt the mode of the RBM as the initial value of the gradient search;
Since the log-domain expression contains the function term f(x) = log(1 + exp(x)), and f*(x) (f*(x) = x for x ≥ 0, f*(x) = 0 for x < 0) approximates f(x) accurately when |x| > 4, use this approximation to solve formula (13), obtaining the converted spectral envelope feature
Ỹ_t^SPE = a_m^(y) + Σ_{j: b_{m,j} + v^T w_{m,j} > 0} w_{m,j}^(y)    (17)
Step 8: synthesize the converted speech
Finally, feed the converted F0 sequence and the converted spectral envelope sequence obtained in step 7 into the STRAIGHT synthesizer to synthesize the converted speech.
2. A voice conversion method with joint spectral modeling based on restricted Boltzmann machines, characterized in that the steps are as follows:
Step 1: extract speech spectral envelope features
(1) Use the STRAIGHT analysis-synthesis tool to analyze the corpora of the source and target speakers frame by frame, obtaining the F0 sequences and the static spectral envelope features x^SPE and y^SPE, where x_t^SPE and y_t^SPE are the 513-dimensional static spectral envelope vectors of frame t of the source and target speakers, and T_1 and T_2 are the numbers of source and target feature frames;
(2) From the static spectral envelope features, obtain the first-order dynamic features according to formulas (2) and (3) and the second-order dynamic features according to formulas (4) and (5);
Δc_t = 0.5(c_{t+1} - c_{t-1})    (2)
Δc_1 = Δc_2, Δc_T = Δc_{T-1}    (3)
Δ²c_t = c_{t+1} - 2c_t + c_{t-1}    (4)
Δ²c_1 = Δ²c_2, Δ²c_T = Δ²c_{T-1}    (5)
where Δ denotes the first-order difference, Δ² the second-order difference, T is the number of frames in the feature sequence, and c_t is the feature vector of frame t;
(3) Concatenate the static and dynamic features to obtain the source speaker's spectral envelope feature X^SPE, whose frame-t vector is X_t^SPE = [x_t^SPE^T, Δx_t^SPE^T, Δ²x_t^SPE^T]^T, with [·]^T denoting vector transposition, and the target speaker's spectral envelope feature Y^SPE with frame-t vector Y_t^SPE = [y_t^SPE^T, Δy_t^SPE^T, Δ²y_t^SPE^T]^T;
Step 2: extract high-level spectral features of speech
(1) On the basis of the static spectral envelope features, further extract the high-level spectral feature of each frame, here 40th-order mel-cepstral features, obtaining the static high-level spectral features x^MCEP and y^MCEP of the source and target speakers;
(2) From these, obtain the first-order dynamic high-level features according to formulas (2) and (3) and the second-order dynamic high-level features according to formulas (4) and (5);
(3) Concatenate the static and dynamic features to obtain the source speaker's high-level spectral feature X^MCEP, with frame-t vector X_t^MCEP = [x_t^MCEP^T, Δx_t^MCEP^T, Δ²x_t^MCEP^T]^T, and the target speaker's high-level spectral feature Y^MCEP, with frame-t vector Y_t^MCEP = [y_t^MCEP^T, Δy_t^MCEP^T, Δ²y_t^MCEP^T]^T;
Step 3: dynamic time warping
(1) Compute the alignment function between X^MCEP and Y^MCEP with the dynamic time warping (DTW) algorithm, align X^MCEP and Y^MCEP according to it, and concatenate the aligned sequences to obtain the joint high-level spectral feature Z^MCEP, whose frame-t vector is Z_t^MCEP = [X_t^MCEP^T, Y_t^MCEP^T]^T, T being the number of frames after alignment;
(2) Align X^SPE and Y^SPE with the alignment function obtained in (1), and concatenate the aligned sequences to obtain the joint spectral envelope feature Z^SPE, whose frame-t vector is Z_t^SPE = [X_t^SPE^T, Y_t^SPE^T]^T;
Step 4: GMM model training
Using the joint high-level spectral feature Z^MCEP obtained in the previous step, train a GMM with the EM algorithm under the maximum-likelihood criterion, obtaining the model parameters λ^GMM = {ω_m, μ_m, Σ_m}_{m=1}^M, where M is the number of Gaussian mixture components and ω_m, μ_m, Σ_m are the weight, mean vector and covariance matrix of the m-th component;
Step 5: acoustic subspace partition of the joint spectral envelope features
After GMM training, use the obtained GMM parameters λ^GMM to partition the joint high-level spectral feature Z^MCEP into acoustic subspaces according to the maximum a posteriori criterion, obtaining the index sequence m = [m_1, m_2, ..., m_t, ..., m_T] of the acoustic subspace indices of every frame of Z^MCEP;
According to the index sequence m, partition the joint spectral envelope feature Z^SPE into acoustic subspaces, grouping the joint spectral envelope frames with the same subspace index together as the training feature parameter set of the Gaussian-Gaussian RBM of that acoustic subspace;
Step 6: Gaussian-Gaussian RBM model training
In the Gaussian-Gaussian RBM the visible node variables are likewise assumed to follow Gaussian distributions, but unlike the Gaussian-Bernoulli RBM no connection between visible and hidden nodes is considered, i.e. the number of hidden nodes is set to 0, and a full connection is established directly between the source-speaker visible nodes x and the target-speaker visible nodes y; the connection matrix is W = {w_ij}_{D×D}, where D is the feature vector dimension and w_ij is the connection weight between source visible node x_i and target visible node y_j; the corresponding energy function is
E(x, y) = Σ_{i=1}^D x_i²/(2σ_i²) + Σ_{j=1}^D y_j²/(2σ_j²) - Σ_{i=1}^D Σ_{j=1}^D (x_i/σ_i) w_ij (y_j/σ_j)    (18)
where θ = {W} are the model parameters, and the variances σ_i² and σ_j² of the source and target feature nodes x_i and y_j are fixed to 1 here; the joint distribution of x and y then follows as P(x, y; θ) = exp(-E(x, y))/Z(θ), where Z(θ) is the partition function;
Using the joint spectral envelope feature parameter sets of each acoustic subspace obtained in step 5, train the parameters λ^RBM = {W_m}_{m=1}^M of the corresponding Gaussian-Gaussian RBMs under the maximum-likelihood criterion with the Contrastive Divergence (CD) algorithm, where {W_m} are the parameters of the m-th Gaussian-Gaussian RBM;
Step 7: spectral conversion
(1) At the conversion stage, extract the static spectral envelope features of the speech to be converted, obtain their first- and second-order dynamic features according to formulas (2)-(5), and concatenate static and dynamic features into the spectral envelope feature to be converted, denoted X_t^SPE for frame t; extract the static high-level spectral features on the basis of the static spectral envelope, obtain their dynamic features according to formulas (2)-(5), and concatenate them into the high-level spectral feature X_t^MCEP of frame t; compute the acoustic subspace index m of frame t according to the maximum a posteriori criterion;
(2) Convert the spectral envelope feature X_t^SPE under the maximum conditional probability criterion; according to formulas (18)-(20), setting
∂ log P(X_t^SPE, Y_t^SPE)/∂Y_t^SPE = 0    (23)
yields the converted spectral envelope feature parameter
Ỹ_t^SPE = W_m^T X_t^SPE    (24)
Step 8: synthesize the converted speech
Finally, feed the converted F0 sequence and the converted spectral envelope sequence obtained in step 7 into the STRAIGHT synthesizer to synthesize the converted speech.
CN201310360234.2A 2013-08-16 2013-08-16 A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine Active CN103413548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310360234.2A CN103413548B (en) 2013-08-16 2013-08-16 A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine


Publications (2)

Publication Number Publication Date
CN103413548A CN103413548A (en) 2013-11-27
CN103413548B true CN103413548B (en) 2016-02-03

Family

ID=49606551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310360234.2A Active CN103413548B (en) 2013-08-16 2013-08-16 A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine

Country Status (1)

Country Link
CN (1) CN103413548B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217721B (en) * 2014-08-14 2017-03-08 东南大学 Based on the phonetics transfer method under the conditions of the asymmetric sound bank that speaker model aligns
CN106997476B (en) * 2017-03-01 2020-04-28 西安交通大学 Transmission system performance degradation evaluation method for multi-source label-free data learning modeling
CN106782520B (en) * 2017-03-14 2019-11-26 华中师范大学 Phonetic feature mapping method under a kind of complex environment
CN108198566B (en) * 2018-01-24 2021-07-20 咪咕文化科技有限公司 Information processing method and device, electronic device and storage medium
CN108764340A (en) * 2018-05-29 2018-11-06 上海大学 A kind of quantitative analysis method of Type B ultrasound and Ultrasonic elasticity bimodal image
CN111772422A (en) * 2020-06-12 2020-10-16 广州城建职业学院 Intelligent crib

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
CN103035236A (en) * 2012-11-27 2013-04-10 河海大学常州校区 High-quality voice conversion method based on modeling of signal timing characteristics
CN103226946A * 2013-03-26 2013-07-31 中国科学技术大学 Speech synthesis method based on restricted Boltzmann machines

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2489473B (en) * 2011-03-29 2013-09-18 Toshiba Res Europ Ltd A voice conversion method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhen-Hua Ling et al. "Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis." 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013. *

Also Published As

Publication number Publication date
CN103413548A (en) 2013-11-27

Similar Documents

Publication Publication Date Title
CN103413548B (en) A voice conversion method using joint spectral modeling based on restricted Boltzmann machines
Sun et al. Voice conversion using deep bidirectional long short-term memory based recurrent neural networks
CN101178896B (en) Unit selection speech synthesis method based on statistical acoustic models
CN104392718B (en) A robust speech recognition method based on acoustic model arrays
CN109767778B (en) Voice conversion method fusing Bi-LSTM and WaveNet
CN102568476B (en) Voice conversion method based on self-organizing feature map network clustering and radial basis function networks
CN105845140A (en) Speaker verification method and device for short-utterance conditions
Sivaraman et al. Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion
CN104616663A (en) Music separation method combining an MFCC (Mel-frequency cepstral coefficient) multi-repetition model with HPSS (harmonic/percussive sound separation)
CN104123933A (en) Voice conversion method based on adaptive non-parallel training
CN113506562B (en) End-to-end speech synthesis method and system based on fusion of acoustic features and text emotion features
CN106128450A (en) A Chinese-Tibetan bilingual cross-lingual voice conversion method and system
CN102592607A (en) Voice conversion system and method using blind speech separation
CN110047501B (en) Many-to-many voice conversion method based on beta-VAE
CN102306492A (en) Voice conversion method based on convolutive non-negative matrix factorization
CN110648684B (en) Bone-conduction speech enhancement waveform generation method based on WaveNet
Das et al. Bangladeshi dialect recognition using Mel frequency cepstral coefficient, delta, delta-delta and Gaussian mixture model
CN108766409A (en) Opera synthesis method and device, and computer-readable storage medium
CN110085254A (en) Many-to-many voice conversion method based on beta-VAE and i-vector
Bhardwaj et al. Development of robust automatic speech recognition system for children's using Kaldi toolkit
CN106847248A (en) Chord recognition method based on robust scale contour features and support vector machines
CN109377981A (en) Phoneme alignment method and device
CN105023570A (en) Speech transformation method and system
CN106782599A (en) Voice conversion method based on Gaussian process output post-filtering
CN109036376A (en) A Minnan (Southern Fujian dialect) speech synthesis method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant