CN103035236B - High-quality voice conversion method based on modeling of signal timing characteristics - Google Patents

High-quality voice conversion method based on modeling of signal timing characteristics

Info

Publication number
CN103035236B
CN103035236B (application CN201210490464.6A; publication of application CN103035236A)
Authority
CN
China
Prior art keywords
signal
parameter
kalman filter
model
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210490464.6A
Other languages
Chinese (zh)
Other versions
CN103035236A (en)
Inventor
徐宁
鲍静益
汤一彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN TENGRUIFENG TECHNOLOGY CO.,LTD.
Original Assignee
Changzhou Campus of Hohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Campus of Hohai University filed Critical Changzhou Campus of Hohai University
Priority to CN201210490464.6A priority Critical patent/CN103035236B/en
Publication of CN103035236A publication Critical patent/CN103035236A/en
Application granted granted Critical
Publication of CN103035236B publication Critical patent/CN103035236B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a high-quality voice conversion method based on modeling of signal timing characteristics. For parallel data of a source and a target speaker, the method models and tracks their temporal characteristics with a hybrid Kalman filter, estimates the structural parameters of the model under an expectation-maximization criterion, and uses the model to map the feature-parameter sets of speech, finally achieving a high-quality voice conversion effect. The method makes full use of the strong correlation between speech-signal parameters: by simulating the physical process in which the parameters change over time, a novel hybrid Kalman filter is constructed and used for the parameter-mapping stage of voice conversion, and a dedicated conversion algorithm that associates the Kalman filter parameters with the physical properties of the speech signal is designed, so that the conversion of the speaker's personality traits is achieved.

Description

High-quality voice conversion method based on modeling of signal timing characteristics
Technical field
The present invention relates to voice conversion technology, i.e. technology that combines speech recognition and speech synthesis to transform the voice of one speaker so that it sounds like the voice of another, specified speaker; in particular, it relates to a high-quality voice conversion method based on modeling of signal timing characteristics.
Background technology
Voice conversion is a research branch of the speech signal processing field that has emerged in recent years, covering topics in speech recognition and speech synthesis. Its goal is, while keeping the semantic content unchanged, to alter the individual vocal characteristics of a specific speaker (called the source speaker) so that listeners perceive his or her speech as having been spoken by another specific speaker (called the target speaker). The main tasks of voice conversion are to extract characteristic parameters that represent speaker identity, transform them mathematically, and then reconstruct speech from the converted parameters. Throughout this process, the acoustic quality of the reconstructed speech must be preserved while the converted speech accurately carries the target speaker's personal characteristics.
After years of development, a number of highly effective algorithms have emerged in the voice conversion field; among them, statistical conversion methods based on Gaussian mixture models have become the de facto standard. Such algorithms nevertheless have drawbacks: for example, the data are artificially assumed to be independent and identically distributed, and feature conversion is carried out frame by frame in a stationary manner. Ignoring the inter-frame dependence of the parameters greatly simplifies the problem and reduces the difficulty of the solution, but it contradicts the fact that speech signals are strongly correlated in time, weakens the model's ability to describe the time-varying characteristics of the signal, and ultimately degrades the conversion quality.
Several countermeasures to the above problem currently exist. A typical one exploits "delta (differential) feature parameters": when building the Gaussian mixture model, the original joint feature vectors are extended to include first-order differences. The roll-off behaviour of the parameters across frames is thereby absorbed into the new feature parameters, compensating to some extent for the model's lack of dynamic modelling. On the other hand, to avoid the independence assumption inherent in Gaussian mixture models altogether, some newer voice conversion schemes adopt hidden Markov models as the basic mapping model. The main advantages of this model are that it can accurately control the temporal structure of the signal and that, at the physical level, it is closely related to how speech signals are generated and evolve.
Summary of the invention
Object of the invention: in order to overcome the deficiencies of the prior art, the present invention provides a high-quality voice conversion method based on modeling of signal timing characteristics. It introduces a hybrid Kalman filter and gives the model an algorithm for updating its internal parameters from raw data; and, under the condition of parallel data, it assigns the semantic information and the speaker-identity information contained in the speech signal to the hidden layer and to the observation layer of the model respectively, yielding a method that flexibly converts speaker-identity information while keeping the semantic information unchanged.
Technical solution: in order to achieve the above object, the technical solution adopted by the present invention is as follows.
A high-quality voice conversion method based on modeling of signal timing characteristics: for parallel data of a source and a target, their temporal characteristics are modelled and tracked with a hybrid Kalman filter, whose structural parameters are estimated under an expectation-maximization criterion; the feature-parameter sets of speech are then mapped with these models, realizing a high-quality voice conversion effect. The method specifically comprises the following steps:
(1) an original speech signal is analysed with a speech analysis model;
(2) a set of phoneme-related feature parameters is extracted from the parameters obtained by the analysis;
(3) the feature-parameter sets of the source and the target are normalized, so that the parameter sets are aligned;
(4) the aligned parameter sets are used respectively as the input and the output of the hybrid Kalman filter, and the model parameters are trained and estimated;
(5) the trained Kalman filter is regarded as a generic mapping function, and arbitrary speech-signal parameters are mapped with the feature-parameter mapping method;
(6) the converted feature parameters are inverse-transformed, i.e. parameter interpolation and phase compensation are performed, and finally high-quality speech is synthesized with a speech synthesis model.
In the above steps, steps (1)-(4) are the training steps and steps (5)-(6) are the conversion steps. The hybrid Kalman filter adds a new hidden layer to the classical Kalman filter structure; this hidden layer describes the gradual transitions between the states of the time-varying signal.
In the hybrid Kalman filter, the hidden layer allows the observed variable at each instant to be in a different state. For the variable observed at each instant, the corresponding state probability, observation probability and posterior probability are computed, yielding classification knowledge about the underlying attributes of the observed data at different times. Using this classification knowledge, variable-transition rules are designed to describe how the signal changes over time. Bayesian inference keeps the estimation of the model parameters uncertain, i.e. the posterior probability of each state is retained, thereby defining the so-called mixing degrees. This hybrid Kalman filter overcomes the divergence that a classical Kalman filter suffers when tracking rapidly changing time-varying signals, making the result more accurate.
The operation of the speech analysis model in step (1) comprises the following steps:
(a1) the speech signal is divided into frames of fixed duration, and the fundamental frequency is estimated with a cross-correlation method;
(a2) in the voiced portion of the signal, a maximum voiced frequency component is set to separate the dominant energy regions of the harmonic component and of the random component; the discrete harmonic amplitude and phase values are then estimated with a least-squares algorithm;
(a3) the unvoiced portion is analysed with classical linear prediction analysis to obtain the linear prediction coefficients.
Corresponding to the speech analysis model of step (1), the operation of the speech synthesis model in step (6) comprises the following steps:
(b1) the discrete harmonic amplitude and phase values of the voiced portion are used as the amplitudes and phases of sinusoidal signals, which are superposed; interpolation and phase compensation are applied so that the reconstructed signal is not distorted in the time-domain waveform;
(b2) for the unvoiced portion, a white-noise signal is passed through an all-pole filter to obtain an approximate reconstruction of the signal;
(b3) the voiced and unvoiced portions are superposed to obtain the reconstructed speech signal.
Step (2) includes estimating, from the discrete harmonic amplitude values, line spectral frequency coefficients suited to the voice conversion task; this procedure comprises the following steps:
(b1) the discrete harmonic amplitudes are squared;
(b2) using the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients is obtained and solved;
(b3) the linear prediction coefficients are converted into line spectral frequency coefficients.
The alignment criterion for the parameter sets in step (3) is: for two feature-parameter sequences of unequal length, the time axis of one is non-linearly mapped onto the time axis of the other using dynamic programming, establishing a one-to-one matching relationship; during the alignment of the parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, and the time-matching function is finally obtained.
The feature-parameter mapping method in step (5) comprises the following steps:
(c1) exploiting the fact that parallel data contain identical semantic information but different speaker-identity information, and assuming that the hidden-layer state variables represent the semantic information, the hidden-layer structures of the source and target hybrid Kalman filters are kept shared; the statistical properties of the observation-layer variables are then estimated under the expectation-maximization criterion;
(c2) on the basis of step (c1), the difference between the source and target model structures is regarded as one embodiment of the speakers' different identities;
(c3) using the Kalman filter's ability to describe time-varying signals, this difference is mapped from the source feature space to the target feature space, completing the parameter conversion.
Beneficial effects: the high-quality voice conversion method based on signal timing characteristics provided by the invention makes full use of the strong correlation between speech-signal parameters. By simulating the physical process in which the parameters change over time, a novel hybrid Kalman filter is constructed and used for the parameter-mapping stage of voice conversion, and a dedicated conversion algorithm that associates the Kalman filter parameters with the physical properties of the speech signal is designed, so that the conversion of the speaker's personality traits is realized.
Brief description of the drawings
Fig. 1 shows the structure of the hybrid Kalman filter;
Fig. 2 is the training block diagram of the system according to the invention;
Fig. 3 is the conversion block diagram of the system according to the invention.
Embodiment
The present invention is further described below with reference to the accompanying drawings.
A high-quality voice conversion method based on modeling of signal timing characteristics: for parallel data of a source and a target, their temporal characteristics are modelled and tracked with a hybrid Kalman filter, whose structural parameters are estimated under an expectation-maximization criterion; the feature-parameter sets of speech are then mapped with these models, realizing a high-quality voice conversion effect. The method specifically comprises the following steps:
(1) an original speech signal is analysed with a speech analysis model;
(2) a set of phoneme-related feature parameters is extracted from the parameters obtained by the analysis;
(3) the feature-parameter sets of the source and the target are normalized, so that the parameter sets are aligned;
(4) the aligned parameter sets are used respectively as the input and the output of the hybrid Kalman filter, and the model parameters are trained and estimated;
(5) the trained Kalman filter is regarded as a generic mapping function, and arbitrary speech-signal parameters are mapped with the feature-parameter mapping method;
(6) the converted feature parameters are inverse-transformed, i.e. parameter interpolation and phase compensation are performed, and finally high-quality speech is synthesized with a speech synthesis model.
In the above steps, steps (1)-(4) are the training steps and steps (5)-(6) are the conversion steps. The hybrid Kalman filter adds a new hidden layer to the classical Kalman filter structure; this hidden layer describes the gradual transitions between the states of the time-varying signal.
Aiming at the problems of Gaussian mixture models in voice conversion, this invention proposes a new solution with two key points: first, a hybrid Kalman filter is designed, together with an algorithm by which the model updates its internal parameters from raw data; second, under the condition of parallel data, the semantic information and the speaker-identity information contained in the speech signal are assigned to the hidden layer and the observation layer of the model respectively, yielding a method that flexibly converts speaker-identity information while keeping the semantic information unchanged.
The structure of the hybrid Kalman filter is shown in Fig. 1, where the shaded circles denote observed variables and the white squares denote hidden variables. As can be seen from the figure, the hybrid Kalman filter has two hidden layers. One of them, denoted by the variables Z = {z_1, z_2, ..., z_t, ...}, describes the classification (regime) of the state variables and is one of the innovations of the present invention. In addition, X = {x_1, x_2, ..., x_t, ...} denotes the continuous state variables and Y = {y_1, y_2, ..., y_t, ...} denotes the observed variables themselves. The whole process can be expressed by the following equations:
x_t = A_t x_{t-1} + w_t        (1)
y_t = B_t x_t + v_t        (2)
where:
A_t ∈ {A_m, m = 1, 2, ..., M},  B_t ∈ {B_m, m = 1, 2, ..., M}        (3)
w_t ∈ {w_m, m = 1, 2, ..., M},  v_t ∈ {v_m, m = 1, 2, ..., M}
Equations (1)-(3) together state that all parameters have M classes. At each instant, the model predicts which of the M candidate classes the current process belongs to, and then estimates the data with that class's model parameters. Assuming that w_m and v_m follow zero-mean multivariate Gaussian distributions with covariances Q_m and R_m respectively, the full set of unknown model parameters can be written as Θ = {Θ_1, Θ_2, ..., Θ_m, ..., Θ_M}, where Θ_m = {A_m, B_m, Q_m, R_m}.
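For illustration only, and not as part of the claimed method, the following Python sketch shows one possible encoding of the switching state-space model of equations (1)-(3): M candidate regimes, each with its own transition matrix A_m, observation matrix B_m and noise covariances Q_m, R_m. The class layout and function names are assumptions of the sketch, not prescribed by the patent.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HybridKalmanModel:
    A: np.ndarray    # (M, d, d) state-transition matrices, one per regime
    B: np.ndarray    # (M, k, d) observation matrices
    Q: np.ndarray    # (M, d, d) process-noise covariances
    R: np.ndarray    # (M, k, k) observation-noise covariances
    pi: np.ndarray   # (M,)      prior probability of each regime

def step(model, x_prev, m, rng):
    """One time step under regime m: x_t = A_m x_{t-1} + w_t and
    y_t = B_m x_t + v_t (equations (1)-(2)), with Gaussian noise."""
    d = model.A.shape[1]
    k = model.B.shape[1]
    x_t = model.A[m] @ x_prev + rng.multivariate_normal(np.zeros(d), model.Q[m])
    y_t = model.B[m] @ x_t + rng.multivariate_normal(np.zeros(k), model.R[m])
    return x_t, y_t
```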
In the present invention, the model parameters of the hybrid Kalman filter are estimated with the expectation-maximization method, whose objective function is defined as:
Q(Θ, Θ^(i-1)) = E[ log P(X, Y, Z | Θ) | Y, Θ^(i-1) ]
             = ∫∫ log P(X, Y, Z | Θ) · P(X, Z | Y, Θ^(i-1)) dX dZ        (4)
             = ∫∫ log P(X, Y, Z | Θ) · P(X | Z, Y, Θ^(i-1)) · P(Z | Y, Θ^(i-1)) dX dZ
where Θ^(i-1) is the parameter estimate obtained after the previous iteration and Θ is the parameter set to be optimized in the current one. Expectation maximization estimates the model parameter values by loop iteration: the expectations of the hidden quantities are first evaluated under the current parameter estimates, and the optimal parameter values are then obtained by optimization; the iterations are repeated until the algorithm converges. Concretely, equation (4) is equivalent to:
Q(Θ, Θ^(i-1)) = Σ_Z { ∫ [ log P(X, Y | Z, Θ) + log P(Z | Θ) ] · P(X | Y, Z, Θ^(i-1)) dX } × P(Z | Y, Θ^(i-1))
             = Σ_Z { ∫ log P(X, Y | Z, Θ) · P(X | Y, Z, Θ^(i-1)) dX + log P(Z | Θ) } × P(Z | Y, Θ^(i-1))        (5)
             = Σ_Z E[ log P(X, Y | Z, Θ) | Y, Z, Θ^(i-1) ] · P(Z | Y, Θ^(i-1)) + Σ_Z log P(Z | Θ) · P(Z | Y, Θ^(i-1))
             = Q_1 + Q_2
The steps below optimize Q_1 and Q_2 respectively. Under the assumption that both the observed variables and the hidden variables follow Gaussian distributions, substituting into Q_1 gives the following results:
Â_m = ( Σ_{t=2}^{T} ω_t^m · E[x_t x_{t-1}^T] ) · ( Σ_{t=2}^{T} ω_t^m · E[x_{t-1} x_{t-1}^T] )^{-1}        (6)
B̂_m = ( Σ_{t=1}^{T} ω_t^m · E[y_t x_t^T] ) · ( Σ_{t=1}^{T} ω_t^m · E[x_t x_t^T] )^{-1}        (7)
Q̂_m = Σ_{t=2}^{T} ω_t^m · E[ (x_t − Â_m x_{t-1})(x_t − Â_m x_{t-1})^T ] / Σ_{t=2}^{T} ω_t^m        (8)
R̂_m = Σ_{t=1}^{T} ω_t^m · E[ (y_t − B̂_m x_t)(y_t − B̂_m x_t)^T ] / Σ_{t=1}^{T} ω_t^m        (9)
On the other hand, by introducing Lagrange multipliers to solve the constrained problem for Q_2, the following result is obtained:
ω_t^m = p(m | y_t, Θ^(i-1)) = p(y_t | m, Θ^(i-1)) · p(m | Θ^(i-1)) / Σ_{i=1}^{M} p(y_t | i, Θ^(i-1)) · p(i | Θ^(i-1))        (10)
Combining equations (6)-(10) finally yields the estimates of the model parameters. Note that the above formulas contain mathematical expectations over unknown random variables; these seemingly complicated expectations can be obtained from the classical Kalman forward-backward filtering recursions, so the whole problem is readily solved.
In summary, estimating the structural parameters of the hybrid Kalman filter with the expectation-maximization approach can be summarized as follows: 1. set the iteration counter i = 0, randomly initialize the model parameters Θ^(0), and set the maximum number of iterations ζ; 2. set i = i + 1, compute frame by frame the mathematical expectations appearing in equations (6)-(9), evaluate equation (10), and substitute into equations (6)-(9) to obtain the estimated model parameter set; 3. if the iteration counter i < ζ, jump back to step 2 and continue, otherwise terminate the algorithm.
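As a schematic illustration of steps 1.-3. above, the sketch below shows the shape of the expectation-maximization loop. The Kalman forward-backward smoother that supplies the expectations is passed in as a callable and is not reproduced; regime_weights approximates equation (10) with frame-wise Gaussian likelihoods. All names and array layouts are assumptions of the sketch, not the patent's own implementation.

```python
import numpy as np

def regime_weights(Y, means, covs, priors):
    """Equation (10): posterior probability of each of the M regimes per frame,
    using Gaussian likelihoods of the observations (an approximation)."""
    T, M = len(Y), len(priors)
    w = np.zeros((T, M))
    for m in range(M):
        diff = Y - means[m]
        inv = np.linalg.inv(covs[m])
        expo = -0.5 * np.einsum('ti,ij,tj->t', diff, inv, diff)
        norm = np.sqrt(np.linalg.det(2 * np.pi * covs[m]))
        w[:, m] = priors[m] * np.exp(expo) / norm
    return w / w.sum(axis=1, keepdims=True)

def em_train(Y, params, smoother, max_iter=50):
    """Y: (T, k) observations; params: dict with 'A','B','Q','R' stacked over M
    regimes; smoother(Y, params) -> (w, Ex, Exx, Exx_lag) supplying omega_t^m
    and E[x_t], E[x_t x_t^T], E[x_t x_{t-1}^T] (the E-step)."""
    for _ in range(max_iter):                             # step 2: iterate
        w, Ex, Exx, Exx_lag = smoother(Y, params)         # E-step expectations
        for m in range(params['A'].shape[0]):             # M-step, eqs. (6)-(9)
            S1 = np.einsum('t,tij->ij', w[1:, m], Exx_lag[1:])
            S0 = np.einsum('t,tij->ij', w[1:, m], Exx[:-1])
            params['A'][m] = S1 @ np.linalg.inv(S0)       # equation (6)
            Syx = np.einsum('t,ti,tj->ij', w[:, m], Y, Ex)
            Sxx = np.einsum('t,tij->ij', w[:, m], Exx)
            params['B'][m] = Syx @ np.linalg.inv(Sxx)     # equation (7)
            # Q_m and R_m follow analogously from equations (8) and (9)
    return params                                         # step 3: stop at max_iter
```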
The second innovation of the invention lies in embedding the above hybrid Kalman filter organically into the voice conversion system so that it plays its role there. Specifically, since the chosen database is a parallel corpus, source and target utterances necessarily contain the same semantic information. Exploiting the structure of the hybrid Kalman filter, the hidden-layer information is extracted from the hidden and observation layers and regarded as an equivalent representation of the semantic information, while the speaker-identity information is assigned to the observation layer and processed there. Under this assumption, only a slight modification of the model is required: the hidden-layer knowledge is shared, so that during modelling the source and target Kalman models capture the common semantic features and the distinguishing speaker-identity characteristics respectively. The concrete operating steps are described below.
Training stage:
1. The feature-parameter sets of the source and target are aligned with a dynamic time warping algorithm, so that the aligned parameter sets meet the parallel-data requirement.
2. The expectation-maximization algorithm is used to estimate the parameters of the source model, and the hidden-layer sequence is then solved for in reverse. Each hidden-layer node is merged according to the probabilities of the classes it may belong to, i.e. the node information is characterized by a linear combination of the possible classes, finally yielding the estimate of the source hidden-layer sequence for the training stage.
3. Under the hidden-layer information-sharing assumption, the target hidden-layer sequence of the training stage equals the source hidden-layer sequence. Using this estimated hidden-layer sequence together with Kalman forward-backward filtering, the difference information of the target model can be obtained.
Conversion stage:
1. The source feature-parameter sequence of the conversion stage is obtained by analysing the input speech with the speech analysis model.
2. Given the feature-parameter sequence and the model structure parameters obtained in training, the hidden-layer information is inferred by iterating equation (1) step by step; the class to which the feature parameters of the current instant belong can be approximately estimated with equation (10).
3. Combining the source hidden-layer sequence of the conversion stage with the target model parameters obtained in training, the target observation sequence of the conversion stage is predicted by iteratively applying equation (2). In this process a mixing operation is required: the candidates of all mixing degrees are weighted and summed according to their posterior probabilities, and the fused observation is finally taken as the approximate estimate of the prediction.
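A compact sketch of the fusion described in step 3: the target observation is predicted per regime through equation (2) (with the noise term omitted) and the per-regime candidates are combined with the posterior weights of equation (10). Array shapes and names are assumptions of the sketch.

```python
import numpy as np

def predict_target(x_seq, B_target, weights):
    """x_seq: (T, d) inferred hidden-layer states; B_target: (M, k, d) target
    observation matrices; weights: (T, M) posterior regime probabilities."""
    candidates = np.einsum('mkd,td->tmk', B_target, x_seq)    # y_t per regime, eq. (2)
    return np.einsum('tm,tmk->tk', weights, candidates)       # posterior-weighted fusion
```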
The invention is further described below with an example.
In the training stage:
1. The speech of the source and target speakers is decomposed with a harmonic-plus-noise model to obtain the fundamental-frequency track and the amplitude and phase values of the vocal-tract spectral parameters. The details are as follows:
a. The speech signal is divided into frames, with a frame length of 20 ms and a frame shift of 10 ms.
b. In each frame the fundamental frequency is estimated with the correlation method; if the frame is unvoiced, the fundamental frequency is set to zero (a code sketch of steps b and c is given after step d).
c. For voiced frames (frames with non-zero fundamental frequency), the speech signal is assumed to be a superposition of a series of sine waves:
s_h(n) = Σ_{l=-L}^{L} C_l e^{j l ω_0 n}        (11)
where L is the number of sine waves and {C_l} are their complex amplitudes. Let s_h denote the vector formed by the samples of s_h(n) within one frame; equation (11) can then be rewritten in matrix form as s_h = B x (equation (12)), where x collects the complex amplitudes {C_l}. The {C_l} are determined with the least-squares algorithm, minimizing:
ε = Σ_{n=-N/2}^{N/2} w²(n) · ( s(n) − s_h(n) )²        (13)
where s(n) is the actual speech signal and w(n) is a window function, generally a Hamming window. The window function is also written in matrix form as the diagonal matrix W (equation (14)).
The optimal x is then obtained as:
W B x = W s  ⇒  x_opt = (B^H W^H W B)^{-1} B^H W^H W s        (15)
d. Having obtained {C_l}, the harmonic amplitude values are given below, and the corresponding phase values are the arguments of the C_l:
A_l = 2|C_l| = 2|C_{-l}|
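For illustration, the sketch below covers steps b-d above: a normalised-autocorrelation pitch estimate for one frame and the weighted least-squares harmonic fit of equations (11)-(15). The sampling rate, search range, voicing threshold and windowing details are assumptions of the sketch, not values prescribed by the patent.

```python
import numpy as np

def estimate_f0(frame, fs=16000, f0_min=60.0, f0_max=400.0, voicing_thresh=0.3):
    """Step b: per-frame pitch from the normalised autocorrelation; 0 means unvoiced."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + np.argmax(ac[lo:hi])
    return 0.0 if ac[lag] < voicing_thresh else fs / lag

def harmonic_ls(frame, f0, fs):
    """Steps c-d: weighted least-squares fit of the harmonic model (11)-(15)."""
    N = len(frame)
    n = np.arange(N) - N // 2                    # frame centred at n = 0
    L = int((fs / 2) // f0)                      # harmonics below Nyquist
    l = np.arange(-L, L + 1)
    B = np.exp(1j * 2 * np.pi * f0 / fs * np.outer(n, l))   # basis of eq. (12)
    w = np.hamming(N)                            # window of eq. (13)
    C, *_ = np.linalg.lstsq(w[:, None] * B,      # weighted LS solve, eq. (15)
                            (w * frame).astype(complex), rcond=None)
    amps = 2.0 * np.abs(C[L + 1:])               # A_l = 2|C_l| (positive l only)
    phases = np.angle(C[L + 1:])                 # phase = arg(C_l)
    return amps, phases
```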
2. Because the original harmonic-plus-noise model parameters are of high dimension and inconvenient for subsequent computation, they must be reduced in dimension. Since the pitch contour is one-dimensional, the main objects of the dimensionality reduction are the vocal-tract amplitude-spectrum parameters and the phase parameters. At the same time, the reduced vocal-tract parameters are converted into classical linear prediction coefficients, from which the line spectral frequency parameters suitable for the voice conversion system are produced. The procedure is as follows:
a. The squares of the discrete amplitude values A_l are computed and regarded as samples P(ω_l) of the discrete power spectrum.
b. Since the power spectral density function and the autocorrelation function form a Fourier transform pair (the Wiener-Khinchin relation), a preliminary estimate of the linear prediction coefficients can be obtained by solving the Toeplitz (Yule-Walker) system of equation (17) formed from these autocorrelation values, where a_1, a_2, ..., a_p are the p-th-order linear prediction coefficients (a sketch of this step and of the final conversion to line spectral frequencies is given after equation (22)).
c. The all-pole model represented by the p-th-order linear prediction coefficients is converted into a time-domain impulse-response function h*[n]:
h*[n] = (1/L) · Re{ Σ_l [ 1 / A(e^{jω_l}) ] · e^{jω_l n} }        (18)
where A(e^{jω_l}) = A(z)|_{z = e^{jω_l}}, with A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p}. It can be shown that h* and the estimated autocorrelation sequence R* satisfy:
Σ_{i=0}^{p} a_i R*(n − i) = h*[−n]        (19)
When the Itakura-Saito (IS) distance is minimized, the true autocorrelation R and the estimate R* are related by:
Σ_{i=0}^{p} a_i R*(n − i) = Σ_{i=0}^{p} a_i R(n − i)        (20)
d. Equation (19) is substituted into equation (20), and the linear prediction coefficients are re-estimated as in equation (17), giving the updated coefficients (equation (21)).
e. The estimation error is evaluated with the IS criterion; if it exceeds the preset threshold, steps c-e are repeated, otherwise the iteration stops.
The resulting linear prediction coefficients are converted into line spectral frequency parameters by jointly solving the two equations below:
P(z) = A(z) + z^{-(p+1)} A(z^{-1})
Q(z) = A(z) − z^{-(p+1)} A(z^{-1})        (22)
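The following sketch illustrates step b (solving the Toeplitz/Yule-Walker system, here with the standard Levinson-Durbin recursion) and the final conversion of equation (22), in which the line spectral frequencies are taken from the unit-circle roots of P(z) and Q(z). The iterative IS-distance refinement of steps c-e is not reproduced; these routines are standard techniques given only as an illustration, not the patent's own numerical procedure.

```python
import numpy as np

def levinson_durbin(R, p):
    """Solve the Yule-Walker/Toeplitz system sum_{i=0}^{p} a_i R(n-i) = 0,
    n = 1..p, with a_0 = 1, for the LP coefficients a_1..a_p (step b)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = R[0]
    for i in range(1, p + 1):
        acc = R[i] + np.dot(a[1:i], R[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_new = a.copy()
        a_new[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a_new[i] = k
        a, err = a_new, err * (1.0 - k * k)
    return a                                # [1, a_1, ..., a_p]

def lpc_to_lsf(a):
    """Equation (22): form P(z) and Q(z) and take the angles of their
    unit-circle roots as line spectral frequencies (radians in (0, pi))."""
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]                 # A(z) + z^{-(p+1)} A(z^{-1})
    Q = a_ext - a_ext[::-1]                 # A(z) - z^{-(p+1)} A(z^{-1})
    ang = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    return np.sort(ang[(ang > 1e-9) & (ang < np.pi - 1e-9)])
```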
3. The line spectral frequency parameters of the source and target obtained in step 2 are aligned with a dynamic time warping algorithm. "Alignment" means that corresponding source and target line spectral frequencies have the minimum distortion distance under the chosen distortion criterion. Its purpose is to associate the feature sequences of the source and target speakers at the parameter level, so that the subsequent statistical model can learn the mapping rule between them. The dynamic time warping algorithm is briefly summarized as follows:
For the same sentence, let the acoustic feature-parameter sequence of the source speaker have length N_x and that of the target speaker have length N_y, with N_x ≠ N_y. Taking the source sequence as the reference template, dynamic time warping searches for a time-warping function that non-linearly maps the time axis n_y of the target feature sequence onto the time axis n_x of the source feature-parameter sequence so that the total cumulative distortion is minimized.
Here the local distance d(·,·) measures, under some chosen metric, the distance between the target speaker's feature parameters in frame n_y and the source speaker's feature parameters in the matched source frame. During warping, the time-warping function must satisfy boundary conditions and continuity conditions.
Dynamic time warping is an optimization algorithm: it turns a decision over N stages into N successive single-stage decisions, i.e. into N sub-problems solved one by one, which simplifies the computation. The warping is generally computed starting from the last stage (or, equivalently, from the first), and its recursion can be expressed as:
D(n_y + 1, n_x) = d(n_y + 1, n_x) + min[ D(n_y, n_x) · g(n_y, n_x), D(n_y, n_x − 1), D(n_y, n_x − 2) ]        (26)
where g(n_y, n_x) is a weight introduced so that the values of n_y and n_x satisfy the constraints on the time-warping function.
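As an illustration of the recursion in equation (26), the sketch below implements dynamic time warping with the same three-predecessor step pattern; the weight g(·) is omitted, and the Euclidean local distance as well as the function and variable names are assumptions of the sketch.

```python
import numpy as np

def dtw_align(X, Y):
    """X: (Nx, d) source frames (reference template), Y: (Ny, d) target frames.
    Returns phi, where phi[i] is the source frame index aligned with target frame i."""
    X, Y = np.asarray(X), np.asarray(Y)
    Nx, Ny = len(X), len(Y)
    cost = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=-1)   # local distortion d(.,.)
    D = np.full((Ny, Nx), np.inf)
    D[0, 0] = cost[0, 0]                         # boundary condition
    for i in range(1, Ny):
        for j in range(Nx):
            preds = [D[i - 1, j]]                # predecessors as in equation (26)
            if j >= 1:
                preds.append(D[i - 1, j - 1])
            if j >= 2:
                preds.append(D[i - 1, j - 2])
            D[i, j] = cost[i, j] + min(preds)
    phi = np.zeros(Ny, dtype=int)                # backtrack the minimising path
    j = Nx - 1
    for i in range(Ny - 1, -1, -1):
        phi[i] = j
        if i == 0:
            break
        cands = [(D[i - 1, j], j)]
        if j >= 1:
            cands.append((D[i - 1, j - 1], j - 1))
        if j >= 2:
            cands.append((D[i - 1, j - 2], j - 2))
        j = min(cands, key=lambda c: c[0])[1]
    return phi
```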
4. The average fundamental-frequency ratio is computed: the fundamental-frequency sequences of the source and target are assumed to follow single Gaussian distributions, and the parameters of each Gaussian model, i.e. the mean μ and the variance σ, are estimated.
5. The feature parameters aligned by dynamic time warping are used as the input of the hybrid Kalman filter, and its structural parameters are learned with the expectation-maximization method. At the same time, according to the hidden-layer information-sharing principle, the structural parameters of the target model are deduced. The concrete operating steps are given in the Summary of the invention.
In the conversion stage:
1. The speech to be converted is analysed with the harmonic-plus-noise model to obtain the fundamental-frequency track and the amplitude and phase values of the vocal-tract spectral parameters; this step is identical to the first step of the training stage.
2. As in the training stage, the harmonic-plus-noise model parameters are converted into line spectral frequency parameters.
3. Using the fundamental-frequency model parameters obtained in the training stage, the fundamental-frequency conversion function is designed as:
log f_0' = μ_y + (σ_y / σ_x) · ( log f_0 − μ_x )        (27)
where f_0' is the converted fundamental frequency, μ_x and μ_y are the means of the trained source and target Gaussian models respectively, and σ_x and σ_y are likewise the variances of the source and target Gaussian models.
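Equation (27) transcribed directly as code; keeping unvoiced frames (F0 = 0) unchanged is an assumption consistent with step b of the analysis stage, not an explicit requirement of the patent.

```python
import numpy as np

def convert_f0(f0_track, mu_x, sigma_x, mu_y, sigma_y):
    """Equation (27): log-domain Gaussian normalisation of F0 from source (x) to target (y)."""
    f0 = np.asarray(f0_track, dtype=float)
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.exp(mu_y + (sigma_y / sigma_x) * (np.log(f0[voiced]) - mu_x))
    return out                                   # unvoiced frames stay at zero
```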
4. The trained hybrid Kalman filter is regarded as a mapping functional for converting the source feature parameters: the source line spectral frequency parameters are fed to the model as input for iterative prediction, finally yielding the target feature-parameter set. The concrete steps are given in the relevant part of the Summary of the invention.
5. The converted line spectral frequency parameters are transformed back into harmonic-plus-noise model coefficients and then, together with the modified pitch contour, the converted speech is synthesized. The detailed steps are as follows:
a. Using the obtained amplitudes, phases and fundamental frequency, the k-th speech frame s^(k) is synthesized according to the definition of the sinusoidal model (equation (28)).
b. To reduce the error produced at frame transitions, the whole utterance is synthesized by overlap-add; for any two adjacent frames:
s(kN + m) = ((N − m)/N) · s^(k)(m) + (m/N) · s^(k+1)(m − N),  0 ≤ m ≤ N        (29)
where N is the number of samples contained in one frame of speech.
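A sketch of the frame synthesis and of the cross-fade in equation (29): each frame is synthesised over local indices -N..N-1 so that adjacent frames overlap by one frame shift, and neighbouring frames are blended with the triangular weights of (29). The real-valued harmonic form and the sampling-rate argument are assumptions of the sketch.

```python
import numpy as np

def synth_frame(amps, phases, f0, fs, N):
    """Synthesise one voiced frame over local samples n = -N..N-1 as a sum of
    harmonics A_l * cos(l * omega0 * n + phi_l) (real form of the sinusoidal model)."""
    n = np.arange(-N, N)
    l = np.arange(1, len(amps) + 1)
    return (amps * np.cos(2 * np.pi * f0 / fs * np.outer(n, l) + phases)).sum(axis=1)

def overlap_add(frames, N):
    """Equation (29): s(kN+m) = ((N-m)/N) s^(k)(m) + (m/N) s^(k+1)(m-N)."""
    m = np.arange(N)
    out = np.zeros((len(frames) - 1) * N)
    for k in range(len(frames) - 1):
        out[k * N:(k + 1) * N] = ((N - m) / N) * frames[k][N:] \
                                 + (m / N) * frames[k + 1][:N]
    return out
```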
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (6)

1. A high-quality voice conversion method based on modeling of signal timing characteristics, characterized in that: for parallel data of a source and a target, their temporal characteristics are modelled and tracked with a hybrid Kalman filter, whose structural parameters are estimated under an expectation-maximization criterion; the feature-parameter sets of speech are then mapped with these models, realizing voice conversion; the method specifically comprises the following steps:
(1) an original speech signal is analysed with a speech analysis model;
(2) a set of phoneme-related feature parameters is extracted from the parameters obtained by the analysis;
(3) the feature-parameter sets of the source and the target are normalized, so that the parameter sets are aligned;
(4) the aligned parameter sets are used respectively as the input and the output of the hybrid Kalman filter, and the model parameters are trained and estimated;
(5) the trained Kalman filter is regarded as a generic mapping function, and arbitrary speech-signal parameters are mapped with the feature-parameter mapping method;
(6) the converted feature parameters are inverse-transformed, i.e. parameter interpolation and phase compensation are performed, and finally high-quality speech is synthesized with a speech synthesis model;
in the above steps, steps (1)-(4) are the training steps and steps (5)-(6) are the conversion steps; the hybrid Kalman filter adds a new hidden layer to the classical Kalman filter structure, and this hidden layer describes the gradual transitions between the states of the time-varying signal.
2. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 1, characterized in that the operation of the speech analysis model in step (1) comprises the following steps:
(a1) the speech signal is divided into frames of fixed duration, and the fundamental frequency is estimated with a cross-correlation method;
(a2) in the voiced portion of the signal, a maximum voiced frequency component is set to separate the dominant energy regions of the harmonic component and of the random component; the discrete harmonic amplitude and phase values are then estimated with a least-squares algorithm;
(a3) the unvoiced portion is analysed with classical linear prediction analysis to obtain the linear prediction coefficients.
3. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 2, characterized in that step (2) includes estimating, from the discrete harmonic amplitude values, line spectral frequency coefficients suited to the voice conversion task; this procedure comprises the following steps:
(b1) the discrete harmonic amplitudes are squared;
(b2) using the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients is obtained and solved;
(b3) the linear prediction coefficients are converted into line spectral frequency coefficients.
4. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 1, characterized in that the alignment criterion for the parameter sets in step (3) is: for two feature-parameter sequences of unequal length, the time axis of one is non-linearly mapped onto the time axis of the other using dynamic programming, establishing a one-to-one matching relationship; during the alignment of the parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, and the time-matching function is finally obtained.
5. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 1, characterized in that the feature-parameter mapping method in step (5) comprises the following steps:
(c1) exploiting the fact that parallel data contain identical semantic information but different speaker-identity information, and assuming that the hidden-layer state variables represent the semantic information, the hidden-layer structures of the source and target hybrid Kalman filters are kept shared; the statistical properties of the observation-layer variables are then estimated under the expectation-maximization criterion;
(c2) on the basis of step (c1), the difference between the source and target model structures is regarded as one embodiment of the speakers' different identities;
(c3) using the Kalman filter's ability to describe time-varying signals, this difference is mapped from the source feature space to the target feature space, completing the parameter conversion.
6. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 2, characterized in that the operation of the speech synthesis model in step (6) comprises the following steps:
(b1) the discrete harmonic amplitude and phase values of the voiced portion are used as the amplitudes and phases of sinusoidal signals, which are superposed; interpolation and phase compensation are applied so that the reconstructed signal is not distorted in the time-domain waveform;
(b2) for the unvoiced portion, a white-noise signal is passed through an all-pole filter to obtain an approximate reconstruction of the signal;
(b3) the voiced and unvoiced portions are superposed to obtain the reconstructed speech signal.
CN201210490464.6A 2012-11-27 2012-11-27 High-quality voice conversion method based on modeling of signal timing characteristics Expired - Fee Related CN103035236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210490464.6A CN103035236B (en) 2012-11-27 2012-11-27 High-quality voice conversion method based on modeling of signal timing characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210490464.6A CN103035236B (en) 2012-11-27 2012-11-27 High-quality voice conversion method based on modeling of signal timing characteristics

Publications (2)

Publication Number Publication Date
CN103035236A CN103035236A (en) 2013-04-10
CN103035236B true CN103035236B (en) 2014-12-17

Family

ID=48022068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210490464.6A Expired - Fee Related CN103035236B (en) 2012-11-27 2012-11-27 High-quality voice conversion method based on modeling of signal timing characteristics

Country Status (1)

Country Link
CN (1) CN103035236B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108613679A (en) * 2018-06-14 2018-10-02 河北工业大学 A kind of mobile robot Extended Kalman filter synchronous superposition method

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413548B (en) * 2013-08-16 2016-02-03 中国科学技术大学 A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine
CN105425319B (en) * 2015-09-16 2017-10-13 河海大学 Rainfall satellite heavy rain assimilation method based on ground survey Data correction
CN106782599A (en) * 2016-12-21 2017-05-31 河海大学常州校区 The phonetics transfer method of post filtering is exported based on Gaussian process
CN107068165B (en) * 2016-12-31 2020-07-24 南京邮电大学 Voice conversion method
CN107103914B (en) * 2017-03-20 2020-06-16 南京邮电大学 High-quality voice conversion method
CN108681709B (en) * 2018-05-16 2020-01-17 深圳大学 Intelligent input method and system based on bone conduction vibration and machine learning
CN113112030B (en) * 2019-04-28 2023-12-26 第四范式(北京)技术有限公司 Method and system for training model and method and system for predicting sequence data
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921B (en) * 2009-12-16 2011-09-14 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108613679A (en) * 2018-06-14 2018-10-02 河北工业大学 A kind of mobile robot Extended Kalman filter synchronous superposition method
CN108613679B (en) * 2018-06-14 2020-06-16 河北工业大学 Method for synchronous positioning and map construction of extended Kalman filtering of mobile robot

Also Published As

Publication number Publication date
CN103035236A (en) 2013-04-10

Similar Documents

Publication Publication Date Title
CN103035236B (en) High-quality voice conversion method based on modeling of signal timing characteristics
CN103531205B (en) The asymmetrical voice conversion method mapped based on deep neural network feature
CN105023580B (en) Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method
Sun et al. Unseen noise estimation using separable deep auto encoder for speech enhancement
Yu et al. Continuous F0 modeling for HMM based statistical parametric speech synthesis
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
Cui et al. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR
CN104685562B (en) Method and apparatus for reconstructing echo signal from noisy input signal
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
Juvela et al. Speaker-independent raw waveform model for glottal excitation
CN103021418A (en) Voice conversion method facing to multi-time scale prosodic features
CN102306492A (en) Voice conversion method based on convolutive nonnegative matrix factorization
Nørholm et al. Instantaneous fundamental frequency estimation with optimal segmentation for nonstationary voiced speech
Nirmal et al. Voice conversion using general regression neural network
CN106782599A (en) The phonetics transfer method of post filtering is exported based on Gaussian process
Juvela et al. Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks
Chetouani et al. Investigation on LP-residual representations for speaker identification
Narendra et al. Estimation of the glottal source from coded telephone speech using deep neural networks
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
Tobing et al. Voice conversion with CycleRNN-based spectral mapping and finely tuned WaveNet vocoder
US20220172703A1 (en) Acoustic model learning apparatus, method and program and speech synthesis apparatus, method and program
CN104538026A (en) Fundamental frequency modeling method used for parametric speech synthesis
Ai et al. Denoising-and-dereverberation hierarchical neural vocoder for statistical parametric speech synthesis
Wu et al. Nonlinear speech coding model based on genetic programming
Aroon et al. Statistical parametric speech synthesis: A review

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
CB03 Change of inventor or designer information

Inventor after: Xu Ningtao

Inventor after: Liu Pingsheng

Inventor after: Xie Daokuang

Inventor before: Xu Ning

Inventor before: Bao Jingyi

Inventor before: Tang Yibin

COR Change of bibliographic data
TR01 Transfer of patent right

Effective date of registration: 20160504

Address after: 518042 Guangdong city of Shenzhen province Futian District Che Kung Temple Cheonan Digital City Tienhsiang building 7B1

Patentee after: SHENZHEN TENGRUIFENG TECHNOLOGY CO.,LTD.

Address before: 213022 Changzhou Jin Ling North Road, Jiangsu, No. 200

Patentee before: CHANGZHOU CAMPUS OF HOHAI University

CB03 Change of inventor or designer information

Inventor after: Xu Ningtao

Inventor after: Liu Pingsheng

Inventor after: Xie Daokuang

Inventor before: Xu Ningtao

Inventor before: Liu Pingsheng

Inventor before: Xie Daokuang

COR Change of bibliographic data
PP01 Preservation of patent right
PP01 Preservation of patent right

Effective date of registration: 20190814

Granted publication date: 20141217

PD01 Discharge of preservation of patent
PD01 Discharge of preservation of patent

Date of cancellation: 20210814

Granted publication date: 20141217

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141217

Termination date: 20191127