CN103035236B - High-quality voice conversion method based on modeling of signal timing characteristics - Google Patents

High-quality voice conversion method based on modeling of signal timing characteristics

Info

Publication number
CN103035236B
CN103035236B (application CN201210490464.6A; publication of application CN103035236A)
Authority
CN
China
Prior art keywords
signal
parameter
kalman filter
model
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210490464.6A
Other languages
Chinese (zh)
Other versions
CN103035236A (en)
Inventor
徐宁
鲍静益
汤一彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN TENGRUIFENG TECHNOLOGY CO.,LTD.
Original Assignee
Changzhou Campus of Hohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Campus of Hohai University filed Critical Changzhou Campus of Hohai University
Priority to CN201210490464.6A priority Critical patent/CN103035236B/en
Publication of CN103035236A publication Critical patent/CN103035236A/en
Application granted granted Critical
Publication of CN103035236B publication Critical patent/CN103035236B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a high-quality voice conversion method based on modeling of signal timing characteristics. For parallel data of a source and a target speaker, the method models and tracks their temporal characteristics with a hybrid Kalman filter, estimates the structural parameters of the model under an expectation-maximization criterion, and uses the model to map the feature-parameter sets of speech, finally achieving a high-quality voice conversion effect. The method makes full use of the strong correlation between speech-signal parameters: by simulating the physical process in which the parameters change over time, a novel hybrid Kalman filter is constructed and used for the parameter-mapping stage of voice conversion, and a dedicated conversion algorithm that associates the Kalman filter parameters with the physical properties of the speech signal is designed, so that the conversion of the speaker's personality traits is achieved.

Description

High-quality voice conversion method based on modeling of signal timing characteristics
Technical field
The present invention relates to voice conversion technology, i.e. technology that combines speech recognition and speech synthesis to transform the voice of one speaker so that it sounds like the voice of another, specified speaker; in particular, it relates to a high-quality voice conversion method based on modeling of signal timing characteristics.
Background technology
Voice conversion is a research branch of the speech signal processing field that has emerged in recent years, covering topics in speech recognition and speech synthesis. Its goal is, while keeping the semantic content unchanged, to alter the individual vocal characteristics of a specific speaker (called the source speaker) so that listeners perceive his or her speech as having been spoken by another specific speaker (called the target speaker). The main tasks of voice conversion are to extract characteristic parameters that represent speaker identity, transform them mathematically, and then reconstruct speech from the converted parameters. Throughout this process, the acoustic quality of the reconstructed speech must be preserved while the converted speech accurately carries the target speaker's personal characteristics.
After years of development, a number of highly effective algorithms have emerged in the voice conversion field; among them, statistical conversion methods based on Gaussian mixture models have become the de facto standard. Such algorithms nevertheless have drawbacks: for example, the data are artificially assumed to be independent and identically distributed, and feature conversion is carried out frame by frame in a stationary manner. Ignoring the inter-frame dependence of the parameters greatly simplifies the problem and reduces the difficulty of the solution, but it contradicts the fact that speech signals are strongly correlated in time, weakens the model's ability to describe the time-varying characteristics of the signal, and ultimately degrades the conversion quality.
Several countermeasures to the above problem currently exist. A typical one exploits "delta (differential) feature parameters": when building the Gaussian mixture model, the original joint feature vectors are extended to include first-order differences. The roll-off behaviour of the parameters across frames is thereby absorbed into the new feature parameters, compensating to some extent for the model's lack of dynamic modelling. On the other hand, to avoid the independence assumption inherent in Gaussian mixture models altogether, some newer voice conversion schemes adopt hidden Markov models as the basic mapping model. The main advantages of this model are that it can accurately control the temporal structure of the signal and that, at the physical level, it is closely related to how speech signals are generated and evolve.
Summary of the invention
Object of the invention: in order to overcome the deficiencies of the prior art, the present invention provides a high-quality voice conversion method based on modeling of signal timing characteristics. It introduces a hybrid Kalman filter and gives the model an algorithm for updating its internal parameters from raw data; and, under the condition of parallel data, it assigns the semantic information and the speaker-identity information contained in the speech signal to the hidden layer and to the observation layer of the model respectively, yielding a method that flexibly converts speaker-identity information while keeping the semantic information unchanged.
Technical solution: in order to achieve the above object, the technical solution adopted by the present invention is as follows.
A high-quality voice conversion method based on modeling of signal timing characteristics: for parallel data of a source and a target, their temporal characteristics are modelled and tracked with a hybrid Kalman filter, whose structural parameters are estimated under an expectation-maximization criterion; the feature-parameter sets of speech are then mapped with these models, realizing a high-quality voice conversion effect. The method specifically comprises the following steps:
(1) an original speech signal is analysed with a speech analysis model;
(2) a set of phoneme-related feature parameters is extracted from the parameters obtained by the analysis;
(3) the feature-parameter sets of the source and the target are normalized, so that the parameter sets are aligned;
(4) the aligned parameter sets are used respectively as the input and the output of the hybrid Kalman filter, and the model parameters are trained and estimated;
(5) the trained Kalman filter is regarded as a generic mapping function, and arbitrary speech-signal parameters are mapped with the feature-parameter mapping method;
(6) the converted feature parameters are inverse-transformed, i.e. parameter interpolation and phase compensation are performed, and finally high-quality speech is synthesized with a speech synthesis model.
In the above steps, steps (1)-(4) are the training steps and steps (5)-(6) are the conversion steps. The hybrid Kalman filter adds a new hidden layer to the classical Kalman filter structure; this hidden layer describes the gradual transitions between the states of the time-varying signal.
In the hybrid Kalman filter, the hidden layer allows the observed variable at each instant to be in a different state. For the variable observed at each instant, the corresponding state probability, observation probability and posterior probability are computed, yielding classification knowledge about the underlying attributes of the observed data at different times. Using this classification knowledge, variable-transition rules are designed to describe how the signal changes over time. Bayesian inference keeps the estimation of the model parameters uncertain, i.e. the posterior probability of each state is retained, thereby defining the so-called mixing degrees. This hybrid Kalman filter overcomes the divergence that a classical Kalman filter suffers when tracking rapidly changing time-varying signals, making the result more accurate.
The operation of the speech analysis model in step (1) comprises the following steps:
(a1) the speech signal is divided into frames of fixed duration, and the fundamental frequency is estimated with a cross-correlation method;
(a2) in the voiced portion of the signal, a maximum voiced frequency component is set to separate the dominant energy regions of the harmonic component and of the random component; the discrete harmonic amplitude and phase values are then estimated with a least-squares algorithm;
(a3) the unvoiced portion is analysed with classical linear prediction analysis to obtain the linear prediction coefficients.
Corresponding to the speech analysis model of step (1), the operation of the speech synthesis model in step (6) comprises the following steps:
(b1) the discrete harmonic amplitude and phase values of the voiced portion are used as the amplitudes and phases of sinusoidal signals, which are superposed; interpolation and phase compensation are applied so that the reconstructed signal is not distorted in the time-domain waveform;
(b2) for the unvoiced portion, a white-noise signal is passed through an all-pole filter to obtain an approximate reconstruction of the signal;
(b3) the voiced and unvoiced portions are superposed to obtain the reconstructed speech signal.
Step (2) includes estimating, from the discrete harmonic amplitude values, line spectral frequency coefficients suited to the voice conversion task; this procedure comprises the following steps:
(b1) the discrete harmonic amplitudes are squared;
(b2) using the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients is obtained and solved;
(b3) the linear prediction coefficients are converted into line spectral frequency coefficients.
The alignment criterion for the parameter sets in step (3) is: for two feature-parameter sequences of unequal length, the time axis of one is non-linearly mapped onto the time axis of the other using dynamic programming, establishing a one-to-one matching relationship; during the alignment of the parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, and the time-matching function is finally obtained.
The feature-parameter mapping method in step (5) comprises the following steps:
(c1) exploiting the fact that parallel data contain identical semantic information but different speaker-identity information, and assuming that the hidden-layer state variables represent the semantic information, the hidden-layer structures of the source and target hybrid Kalman filters are kept shared; the statistical properties of the observation-layer variables are then estimated under the expectation-maximization criterion;
(c2) on the basis of step (c1), the difference between the source and target model structures is regarded as one embodiment of the speakers' different identities;
(c3) using the Kalman filter's ability to describe time-varying signals, this difference is mapped from the source feature space to the target feature space, completing the parameter conversion.
Beneficial effects: the high-quality voice conversion method based on signal timing characteristics provided by the invention makes full use of the strong correlation between speech-signal parameters. By simulating the physical process in which the parameters change over time, a novel hybrid Kalman filter is constructed and used for the parameter-mapping stage of voice conversion, and a dedicated conversion algorithm that associates the Kalman filter parameters with the physical properties of the speech signal is designed, so that the conversion of the speaker's personality traits is realized.
Brief description of the drawings
Fig. 1 shows the structure of the hybrid Kalman filter;
Fig. 2 is the training block diagram of the system according to the invention;
Fig. 3 is the conversion block diagram of the system according to the invention.
Embodiment
The present invention is further described below with reference to the accompanying drawings.
A high-quality voice conversion method based on modeling of signal timing characteristics: for parallel data of a source and a target, their temporal characteristics are modelled and tracked with a hybrid Kalman filter, whose structural parameters are estimated under an expectation-maximization criterion; the feature-parameter sets of speech are then mapped with these models, realizing a high-quality voice conversion effect. The method specifically comprises the following steps:
(1) an original speech signal is analysed with a speech analysis model;
(2) a set of phoneme-related feature parameters is extracted from the parameters obtained by the analysis;
(3) the feature-parameter sets of the source and the target are normalized, so that the parameter sets are aligned;
(4) the aligned parameter sets are used respectively as the input and the output of the hybrid Kalman filter, and the model parameters are trained and estimated;
(5) the trained Kalman filter is regarded as a generic mapping function, and arbitrary speech-signal parameters are mapped with the feature-parameter mapping method;
(6) the converted feature parameters are inverse-transformed, i.e. parameter interpolation and phase compensation are performed, and finally high-quality speech is synthesized with a speech synthesis model.
In the above steps, steps (1)-(4) are the training steps and steps (5)-(6) are the conversion steps. The hybrid Kalman filter adds a new hidden layer to the classical Kalman filter structure; this hidden layer describes the gradual transitions between the states of the time-varying signal.
Aiming at the problems of Gaussian mixture models in voice conversion, this invention proposes a new solution with two key points: first, a hybrid Kalman filter is designed, together with an algorithm by which the model updates its internal parameters from raw data; second, under the condition of parallel data, the semantic information and the speaker-identity information contained in the speech signal are assigned to the hidden layer and the observation layer of the model respectively, yielding a method that flexibly converts speaker-identity information while keeping the semantic information unchanged.
The structure of the hybrid Kalman filter is shown in Fig. 1, where the shaded circles denote observed variables and the white squares denote hidden variables. As can be seen from the figure, the hybrid Kalman filter has two hidden layers. One of them, denoted by the variables Z = {z_1, z_2, ..., z_t, ...}, describes the classification (regime) of the state variables and is one of the innovations of the present invention. In addition, X = {x_1, x_2, ..., x_t, ...} denotes the continuous state variables and Y = {y_1, y_2, ..., y_t, ...} denotes the observed variables themselves. The whole process can be expressed by the following equations:
x_t = A_t x_{t-1} + w_t        (1)
y_t = B_t x_t + v_t        (2)
where:
A_t ∈ {A_m, m = 1, 2, ..., M},  B_t ∈ {B_m, m = 1, 2, ..., M}        (3)
w_t ∈ {w_m, m = 1, 2, ..., M},  v_t ∈ {v_m, m = 1, 2, ..., M}
Equations (1)-(3) together state that all parameters have M classes. At each instant, the model predicts which of the M candidate classes the current process belongs to, and then estimates the data with that class's model parameters. Assuming that w_m and v_m follow zero-mean multivariate Gaussian distributions with covariances Q_m and R_m respectively, the full set of unknown model parameters can be written as Θ = {Θ_1, Θ_2, ..., Θ_m, ..., Θ_M}, where Θ_m = {A_m, B_m, Q_m, R_m}.
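For illustration only, and not as part of the claimed method, the following Python sketch shows one possible encoding of the switching state-space model of equations (1)-(3): M candidate regimes, each with its own transition matrix A_m, observation matrix B_m and noise covariances Q_m, R_m. The class layout and function names are assumptions of the sketch, not prescribed by the patent.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HybridKalmanModel:
    A: np.ndarray    # (M, d, d) state-transition matrices, one per regime
    B: np.ndarray    # (M, k, d) observation matrices
    Q: np.ndarray    # (M, d, d) process-noise covariances
    R: np.ndarray    # (M, k, k) observation-noise covariances
    pi: np.ndarray   # (M,)      prior probability of each regime

def step(model, x_prev, m, rng):
    """One time step under regime m: x_t = A_m x_{t-1} + w_t and
    y_t = B_m x_t + v_t (equations (1)-(2)), with Gaussian noise."""
    d = model.A.shape[1]
    k = model.B.shape[1]
    x_t = model.A[m] @ x_prev + rng.multivariate_normal(np.zeros(d), model.Q[m])
    y_t = model.B[m] @ x_t + rng.multivariate_normal(np.zeros(k), model.R[m])
    return x_t, y_t
```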
In the present invention, the model parameters of the hybrid Kalman filter are estimated with the expectation-maximization method, whose objective function is defined as:
Q(Θ, Θ^(i-1)) = E[ log P(X, Y, Z | Θ) | Y, Θ^(i-1) ]
             = ∫∫ log P(X, Y, Z | Θ) · P(X, Z | Y, Θ^(i-1)) dX dZ        (4)
             = ∫∫ log P(X, Y, Z | Θ) · P(X | Z, Y, Θ^(i-1)) · P(Z | Y, Θ^(i-1)) dX dZ
where Θ^(i-1) is the parameter estimate obtained after the previous iteration and Θ is the parameter set to be optimized in the current one. Expectation maximization estimates the model parameter values by loop iteration: the expectations of the hidden quantities are first evaluated under the current parameter estimates, and the optimal parameter values are then obtained by optimization; the iterations are repeated until the algorithm converges. Concretely, equation (4) is equivalent to:
Q(Θ, Θ^(i-1)) = Σ_Z { ∫ [ log P(X, Y | Z, Θ) + log P(Z | Θ) ] · P(X | Y, Z, Θ^(i-1)) dX } × P(Z | Y, Θ^(i-1))
             = Σ_Z { ∫ log P(X, Y | Z, Θ) · P(X | Y, Z, Θ^(i-1)) dX + log P(Z | Θ) } × P(Z | Y, Θ^(i-1))        (5)
             = Σ_Z E[ log P(X, Y | Z, Θ) | Y, Z, Θ^(i-1) ] · P(Z | Y, Θ^(i-1)) + Σ_Z log P(Z | Θ) · P(Z | Y, Θ^(i-1))
             = Q_1 + Q_2
The steps below optimize Q_1 and Q_2 respectively. Under the assumption that both the observed variables and the hidden variables follow Gaussian distributions, substituting into Q_1 gives the following results:
Â_m = ( Σ_{t=2}^{T} ω_t^m · E[x_t x_{t-1}^T] ) · ( Σ_{t=2}^{T} ω_t^m · E[x_{t-1} x_{t-1}^T] )^{-1}        (6)
B̂_m = ( Σ_{t=1}^{T} ω_t^m · E[y_t x_t^T] ) · ( Σ_{t=1}^{T} ω_t^m · E[x_t x_t^T] )^{-1}        (7)
Q̂_m = Σ_{t=2}^{T} ω_t^m · E[ (x_t − Â_m x_{t-1})(x_t − Â_m x_{t-1})^T ] / Σ_{t=2}^{T} ω_t^m        (8)
R̂_m = Σ_{t=1}^{T} ω_t^m · E[ (y_t − B̂_m x_t)(y_t − B̂_m x_t)^T ] / Σ_{t=1}^{T} ω_t^m        (9)
On the other hand, by introducing Lagrange multipliers to solve the constrained problem for Q_2, the following result is obtained:
ω_t^m = p(m | y_t, Θ^(i-1)) = p(y_t | m, Θ^(i-1)) · p(m | Θ^(i-1)) / Σ_{i=1}^{M} p(y_t | i, Θ^(i-1)) · p(i | Θ^(i-1))        (10)
Combining equations (6)-(10) finally yields the estimates of the model parameters. Note that the above formulas contain mathematical expectations over unknown random variables; these seemingly complicated expectations can be obtained from the classical Kalman forward-backward filtering recursions, so the whole problem is readily solved.
In summary, estimating the structural parameters of the hybrid Kalman filter with the expectation-maximization approach can be summarized as follows: 1. set the iteration counter i = 0, randomly initialize the model parameters Θ^(0), and set the maximum number of iterations ζ; 2. set i = i + 1, compute frame by frame the mathematical expectations appearing in equations (6)-(9), evaluate equation (10), and substitute into equations (6)-(9) to obtain the estimated model parameter set; 3. if the iteration counter i < ζ, jump back to step 2 and continue, otherwise terminate the algorithm.
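As a schematic illustration of steps 1.-3. above, the sketch below shows the shape of the expectation-maximization loop. The Kalman forward-backward smoother that supplies the expectations is passed in as a callable and is not reproduced; regime_weights approximates equation (10) with frame-wise Gaussian likelihoods. All names and array layouts are assumptions of the sketch, not the patent's own implementation.

```python
import numpy as np

def regime_weights(Y, means, covs, priors):
    """Equation (10): posterior probability of each of the M regimes per frame,
    using Gaussian likelihoods of the observations (an approximation)."""
    T, M = len(Y), len(priors)
    w = np.zeros((T, M))
    for m in range(M):
        diff = Y - means[m]
        inv = np.linalg.inv(covs[m])
        expo = -0.5 * np.einsum('ti,ij,tj->t', diff, inv, diff)
        norm = np.sqrt(np.linalg.det(2 * np.pi * covs[m]))
        w[:, m] = priors[m] * np.exp(expo) / norm
    return w / w.sum(axis=1, keepdims=True)

def em_train(Y, params, smoother, max_iter=50):
    """Y: (T, k) observations; params: dict with 'A','B','Q','R' stacked over M
    regimes; smoother(Y, params) -> (w, Ex, Exx, Exx_lag) supplying omega_t^m
    and E[x_t], E[x_t x_t^T], E[x_t x_{t-1}^T] (the E-step)."""
    for _ in range(max_iter):                             # step 2: iterate
        w, Ex, Exx, Exx_lag = smoother(Y, params)         # E-step expectations
        for m in range(params['A'].shape[0]):             # M-step, eqs. (6)-(9)
            S1 = np.einsum('t,tij->ij', w[1:, m], Exx_lag[1:])
            S0 = np.einsum('t,tij->ij', w[1:, m], Exx[:-1])
            params['A'][m] = S1 @ np.linalg.inv(S0)       # equation (6)
            Syx = np.einsum('t,ti,tj->ij', w[:, m], Y, Ex)
            Sxx = np.einsum('t,tij->ij', w[:, m], Exx)
            params['B'][m] = Syx @ np.linalg.inv(Sxx)     # equation (7)
            # Q_m and R_m follow analogously from equations (8) and (9)
    return params                                         # step 3: stop at max_iter
```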
The second innovation of the invention lies in embedding the above hybrid Kalman filter organically into the voice conversion system so that it plays its role there. Specifically, since the chosen database is a parallel corpus, source and target utterances necessarily contain the same semantic information. Exploiting the structure of the hybrid Kalman filter, the hidden-layer information is extracted from the hidden and observation layers and regarded as an equivalent representation of the semantic information, while the speaker-identity information is assigned to the observation layer and processed there. Under this assumption, only a slight modification of the model is required: the hidden-layer knowledge is shared, so that during modelling the source and target Kalman models capture the common semantic features and the distinguishing speaker-identity characteristics respectively. The concrete operating steps are described below.
Training stage:
1. The feature-parameter sets of the source and target are aligned with a dynamic time warping algorithm, so that the aligned parameter sets meet the parallel-data requirement.
2. The expectation-maximization algorithm is used to estimate the parameters of the source model, and the hidden-layer sequence is then solved for in reverse. Each hidden-layer node is merged according to the probabilities of the classes it may belong to, i.e. the node information is characterized by a linear combination of the possible classes, finally yielding the estimate of the source hidden-layer sequence for the training stage.
3. Under the hidden-layer information-sharing assumption, the target hidden-layer sequence of the training stage equals the source hidden-layer sequence. Using this estimated hidden-layer sequence together with Kalman forward-backward filtering, the difference information of the target model can be obtained.
Conversion stage:
1. The source feature-parameter sequence of the conversion stage is obtained by analysing the input speech with the speech analysis model.
2. Given the feature-parameter sequence and the model structure parameters obtained in training, the hidden-layer information is inferred by iterating equation (1) step by step; the class to which the feature parameters of the current instant belong can be approximately estimated with equation (10).
3. Combining the source hidden-layer sequence of the conversion stage with the target model parameters obtained in training, the target observation sequence of the conversion stage is predicted by iteratively applying equation (2). In this process a mixing operation is required: the candidates of all mixing degrees are weighted and summed according to their posterior probabilities, and the fused observation is finally taken as the approximate estimate of the prediction.
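A compact sketch of the fusion described in step 3: the target observation is predicted per regime through equation (2) (with the noise term omitted) and the per-regime candidates are combined with the posterior weights of equation (10). Array shapes and names are assumptions of the sketch.

```python
import numpy as np

def predict_target(x_seq, B_target, weights):
    """x_seq: (T, d) inferred hidden-layer states; B_target: (M, k, d) target
    observation matrices; weights: (T, M) posterior regime probabilities."""
    candidates = np.einsum('mkd,td->tmk', B_target, x_seq)    # y_t per regime, eq. (2)
    return np.einsum('tm,tmk->tk', weights, candidates)       # posterior-weighted fusion
```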
The invention is further described below with an example.
In the training stage:
1. The speech of the source and target speakers is decomposed with a harmonic-plus-noise model to obtain the fundamental-frequency track and the amplitude and phase values of the vocal-tract spectral parameters. The details are as follows:
a. The speech signal is divided into frames, with a frame length of 20 ms and a frame shift of 10 ms.
b. In each frame the fundamental frequency is estimated with the correlation method; if the frame is unvoiced, the fundamental frequency is set to zero (a code sketch of steps b and c is given after step d).
c. For voiced frames (frames with non-zero fundamental frequency), the speech signal is assumed to be a superposition of a series of sine waves:
s_h(n) = Σ_{l=-L}^{L} C_l e^{j l ω_0 n}        (11)
where L is the number of sine waves and {C_l} are their complex amplitudes. Let s_h denote the vector formed by the samples of s_h(n) within one frame; equation (11) can then be rewritten in matrix form as s_h = B x (equation (12)), where x collects the complex amplitudes {C_l}. The {C_l} are determined with the least-squares algorithm, minimizing:
ε = Σ_{n=-N/2}^{N/2} w²(n) · ( s(n) − s_h(n) )²        (13)
where s(n) is the actual speech signal and w(n) is a window function, generally a Hamming window. The window function is also written in matrix form as the diagonal matrix W (equation (14)).
The optimal x is then obtained as:
W B x = W s  ⇒  x_opt = (B^H W^H W B)^{-1} B^H W^H W s        (15)
d. Having obtained {C_l}, the harmonic amplitude values are given below, and the corresponding phase values are the arguments of the C_l:
A_l = 2|C_l| = 2|C_{-l}|
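For illustration, the sketch below covers steps b-d above: a normalised-autocorrelation pitch estimate for one frame and the weighted least-squares harmonic fit of equations (11)-(15). The sampling rate, search range, voicing threshold and windowing details are assumptions of the sketch, not values prescribed by the patent.

```python
import numpy as np

def estimate_f0(frame, fs=16000, f0_min=60.0, f0_max=400.0, voicing_thresh=0.3):
    """Step b: per-frame pitch from the normalised autocorrelation; 0 means unvoiced."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + np.argmax(ac[lo:hi])
    return 0.0 if ac[lag] < voicing_thresh else fs / lag

def harmonic_ls(frame, f0, fs):
    """Steps c-d: weighted least-squares fit of the harmonic model (11)-(15)."""
    N = len(frame)
    n = np.arange(N) - N // 2                    # frame centred at n = 0
    L = int((fs / 2) // f0)                      # harmonics below Nyquist
    l = np.arange(-L, L + 1)
    B = np.exp(1j * 2 * np.pi * f0 / fs * np.outer(n, l))   # basis of eq. (12)
    w = np.hamming(N)                            # window of eq. (13)
    C, *_ = np.linalg.lstsq(w[:, None] * B,      # weighted LS solve, eq. (15)
                            (w * frame).astype(complex), rcond=None)
    amps = 2.0 * np.abs(C[L + 1:])               # A_l = 2|C_l| (positive l only)
    phases = np.angle(C[L + 1:])                 # phase = arg(C_l)
    return amps, phases
```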
2. Because the original harmonic-plus-noise model parameters are of high dimension and inconvenient for subsequent computation, they must be reduced in dimension. Since the pitch contour is one-dimensional, the main objects of the dimensionality reduction are the vocal-tract amplitude-spectrum parameters and the phase parameters. At the same time, the reduced vocal-tract parameters are converted into classical linear prediction coefficients, from which the line spectral frequency parameters suitable for the voice conversion system are produced. The procedure is as follows:
a. The squares of the discrete amplitude values A_l are computed and regarded as samples P(ω_l) of the discrete power spectrum.
b. Since the power spectral density function and the autocorrelation function form a Fourier transform pair (the Wiener-Khinchin relation), a preliminary estimate of the linear prediction coefficients can be obtained by solving the Toeplitz (Yule-Walker) system of equation (17) formed from these autocorrelation values, where a_1, a_2, ..., a_p are the p-th-order linear prediction coefficients (a sketch of this step and of the final conversion to line spectral frequencies is given after equation (22)).
c. The all-pole model represented by the p-th-order linear prediction coefficients is converted into a time-domain impulse-response function h*[n]:
h*[n] = (1/L) · Re{ Σ_l [ 1 / A(e^{jω_l}) ] · e^{jω_l n} }        (18)
where A(e^{jω_l}) = A(z)|_{z = e^{jω_l}}, with A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p}. It can be shown that h* and the estimated autocorrelation sequence R* satisfy:
Σ_{i=0}^{p} a_i R*(n − i) = h*[−n]        (19)
When the Itakura-Saito (IS) distance is minimized, the true autocorrelation R and the estimate R* are related by:
Σ_{i=0}^{p} a_i R*(n − i) = Σ_{i=0}^{p} a_i R(n − i)        (20)
d. Equation (19) is substituted into equation (20), and the linear prediction coefficients are re-estimated as in equation (17), giving the updated coefficients (equation (21)).
e. The estimation error is evaluated with the IS criterion; if it exceeds the preset threshold, steps c-e are repeated, otherwise the iteration stops.
The resulting linear prediction coefficients are converted into line spectral frequency parameters by jointly solving the two equations below:
P(z) = A(z) + z^{-(p+1)} A(z^{-1})
Q(z) = A(z) − z^{-(p+1)} A(z^{-1})        (22)
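The following sketch illustrates step b (solving the Toeplitz/Yule-Walker system, here with the standard Levinson-Durbin recursion) and the final conversion of equation (22), in which the line spectral frequencies are taken from the unit-circle roots of P(z) and Q(z). The iterative IS-distance refinement of steps c-e is not reproduced; these routines are standard techniques given only as an illustration, not the patent's own numerical procedure.

```python
import numpy as np

def levinson_durbin(R, p):
    """Solve the Yule-Walker/Toeplitz system sum_{i=0}^{p} a_i R(n-i) = 0,
    n = 1..p, with a_0 = 1, for the LP coefficients a_1..a_p (step b)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = R[0]
    for i in range(1, p + 1):
        acc = R[i] + np.dot(a[1:i], R[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_new = a.copy()
        a_new[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a_new[i] = k
        a, err = a_new, err * (1.0 - k * k)
    return a                                # [1, a_1, ..., a_p]

def lpc_to_lsf(a):
    """Equation (22): form P(z) and Q(z) and take the angles of their
    unit-circle roots as line spectral frequencies (radians in (0, pi))."""
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]                 # A(z) + z^{-(p+1)} A(z^{-1})
    Q = a_ext - a_ext[::-1]                 # A(z) - z^{-(p+1)} A(z^{-1})
    ang = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    return np.sort(ang[(ang > 1e-9) & (ang < np.pi - 1e-9)])
```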
3. The line spectral frequency parameters of the source and target obtained in step 2 are aligned with a dynamic time warping algorithm. "Alignment" means that corresponding source and target line spectral frequencies have the minimum distortion distance under the chosen distortion criterion. Its purpose is to associate the feature sequences of the source and target speakers at the parameter level, so that the subsequent statistical model can learn the mapping rule between them. The dynamic time warping algorithm is briefly summarized as follows:
For the same sentence, let the acoustic feature-parameter sequence of the source speaker have length N_x and that of the target speaker have length N_y, with N_x ≠ N_y. Taking the source sequence as the reference template, dynamic time warping searches for a time-warping function that non-linearly maps the time axis n_y of the target feature sequence onto the time axis n_x of the source feature-parameter sequence so that the total cumulative distortion is minimized.
Here the local distance d(·,·) measures, under some chosen metric, the distance between the target speaker's feature parameters in frame n_y and the source speaker's feature parameters in the matched source frame. During warping, the time-warping function must satisfy boundary conditions and continuity conditions.
Dynamic time warping is an optimization algorithm: it turns a decision over N stages into N successive single-stage decisions, i.e. into N sub-problems solved one by one, which simplifies the computation. The warping is generally computed starting from the last stage (or, equivalently, from the first), and its recursion can be expressed as:
D(n_y + 1, n_x) = d(n_y + 1, n_x) + min[ D(n_y, n_x) · g(n_y, n_x), D(n_y, n_x − 1), D(n_y, n_x − 2) ]        (26)
where g(n_y, n_x) is a weight introduced so that the values of n_y and n_x satisfy the constraints on the time-warping function.
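As an illustration of the recursion in equation (26), the sketch below implements dynamic time warping with the same three-predecessor step pattern; the weight g(·) is omitted, and the Euclidean local distance as well as the function and variable names are assumptions of the sketch.

```python
import numpy as np

def dtw_align(X, Y):
    """X: (Nx, d) source frames (reference template), Y: (Ny, d) target frames.
    Returns phi, where phi[i] is the source frame index aligned with target frame i."""
    X, Y = np.asarray(X), np.asarray(Y)
    Nx, Ny = len(X), len(Y)
    cost = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=-1)   # local distortion d(.,.)
    D = np.full((Ny, Nx), np.inf)
    D[0, 0] = cost[0, 0]                         # boundary condition
    for i in range(1, Ny):
        for j in range(Nx):
            preds = [D[i - 1, j]]                # predecessors as in equation (26)
            if j >= 1:
                preds.append(D[i - 1, j - 1])
            if j >= 2:
                preds.append(D[i - 1, j - 2])
            D[i, j] = cost[i, j] + min(preds)
    phi = np.zeros(Ny, dtype=int)                # backtrack the minimising path
    j = Nx - 1
    for i in range(Ny - 1, -1, -1):
        phi[i] = j
        if i == 0:
            break
        cands = [(D[i - 1, j], j)]
        if j >= 1:
            cands.append((D[i - 1, j - 1], j - 1))
        if j >= 2:
            cands.append((D[i - 1, j - 2], j - 2))
        j = min(cands, key=lambda c: c[0])[1]
    return phi
```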
4. The average fundamental-frequency ratio is computed: the fundamental-frequency sequences of the source and target are assumed to follow single Gaussian distributions, and the parameters of each Gaussian model, i.e. the mean μ and the variance σ, are estimated.
5. The feature parameters aligned by dynamic time warping are used as the input of the hybrid Kalman filter, and its structural parameters are learned with the expectation-maximization method. At the same time, according to the hidden-layer information-sharing principle, the structural parameters of the target model are deduced. The concrete operating steps are given in the Summary of the invention.
In the conversion stage:
1. The speech to be converted is analysed with the harmonic-plus-noise model to obtain the fundamental-frequency track and the amplitude and phase values of the vocal-tract spectral parameters; this step is identical to the first step of the training stage.
2. As in the training stage, the harmonic-plus-noise model parameters are converted into line spectral frequency parameters.
3. Using the fundamental-frequency model parameters obtained in the training stage, the fundamental-frequency conversion function is designed as:
log f_0' = μ_y + (σ_y / σ_x) · ( log f_0 − μ_x )        (27)
where f_0' is the converted fundamental frequency, μ_x and μ_y are the means of the trained source and target Gaussian models respectively, and σ_x and σ_y are likewise the variances of the source and target Gaussian models.
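Equation (27) transcribed directly as code; keeping unvoiced frames (F0 = 0) unchanged is an assumption consistent with step b of the analysis stage, not an explicit requirement of the patent.

```python
import numpy as np

def convert_f0(f0_track, mu_x, sigma_x, mu_y, sigma_y):
    """Equation (27): log-domain Gaussian normalisation of F0 from source (x) to target (y)."""
    f0 = np.asarray(f0_track, dtype=float)
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.exp(mu_y + (sigma_y / sigma_x) * (np.log(f0[voiced]) - mu_x))
    return out                                   # unvoiced frames stay at zero
```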
4. The trained hybrid Kalman filter is regarded as a mapping functional for converting the source feature parameters: the source line spectral frequency parameters are fed to the model as input for iterative prediction, finally yielding the target feature-parameter set. The concrete steps are given in the relevant part of the Summary of the invention.
5. The converted line spectral frequency parameters are transformed back into harmonic-plus-noise model coefficients and then, together with the modified pitch contour, the converted speech is synthesized. The detailed steps are as follows:
a. Using the obtained amplitudes, phases and fundamental frequency, the k-th speech frame s^(k) is synthesized according to the definition of the sinusoidal model (equation (28)).
b. To reduce the error produced at frame transitions, the whole utterance is synthesized by overlap-add; for any two adjacent frames:
s(kN + m) = ((N − m)/N) · s^(k)(m) + (m/N) · s^(k+1)(m − N),  0 ≤ m ≤ N        (29)
where N is the number of samples contained in one frame of speech.
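A sketch of the frame synthesis and of the cross-fade in equation (29): each frame is synthesised over local indices -N..N-1 so that adjacent frames overlap by one frame shift, and neighbouring frames are blended with the triangular weights of (29). The real-valued harmonic form and the sampling-rate argument are assumptions of the sketch.

```python
import numpy as np

def synth_frame(amps, phases, f0, fs, N):
    """Synthesise one voiced frame over local samples n = -N..N-1 as a sum of
    harmonics A_l * cos(l * omega0 * n + phi_l) (real form of the sinusoidal model)."""
    n = np.arange(-N, N)
    l = np.arange(1, len(amps) + 1)
    return (amps * np.cos(2 * np.pi * f0 / fs * np.outer(n, l) + phases)).sum(axis=1)

def overlap_add(frames, N):
    """Equation (29): s(kN+m) = ((N-m)/N) s^(k)(m) + (m/N) s^(k+1)(m-N)."""
    m = np.arange(N)
    out = np.zeros((len(frames) - 1) * N)
    for k in range(len(frames) - 1):
        out[k * N:(k + 1) * N] = ((N - m) / N) * frames[k][N:] \
                                 + (m / N) * frames[k + 1][:N]
    return out
```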
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (6)

1. A high-quality voice conversion method based on modeling of signal timing characteristics, characterized in that: for parallel data of a source and a target, their temporal characteristics are modelled and tracked with a hybrid Kalman filter, whose structural parameters are estimated under an expectation-maximization criterion; the feature-parameter sets of speech are then mapped with these models, realizing voice conversion; the method specifically comprises the following steps:
(1) an original speech signal is analysed with a speech analysis model;
(2) a set of phoneme-related feature parameters is extracted from the parameters obtained by the analysis;
(3) the feature-parameter sets of the source and the target are normalized, so that the parameter sets are aligned;
(4) the aligned parameter sets are used respectively as the input and the output of the hybrid Kalman filter, and the model parameters are trained and estimated;
(5) the trained Kalman filter is regarded as a generic mapping function, and arbitrary speech-signal parameters are mapped with the feature-parameter mapping method;
(6) the converted feature parameters are inverse-transformed, i.e. parameter interpolation and phase compensation are performed, and finally high-quality speech is synthesized with a speech synthesis model;
in the above steps, steps (1)-(4) are the training steps and steps (5)-(6) are the conversion steps; the hybrid Kalman filter adds a new hidden layer to the classical Kalman filter structure, and this hidden layer describes the gradual transitions between the states of the time-varying signal.
2. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 1, characterized in that the operation of the speech analysis model in step (1) comprises the following steps:
(a1) the speech signal is divided into frames of fixed duration, and the fundamental frequency is estimated with a cross-correlation method;
(a2) in the voiced portion of the signal, a maximum voiced frequency component is set to separate the dominant energy regions of the harmonic component and of the random component; the discrete harmonic amplitude and phase values are then estimated with a least-squares algorithm;
(a3) the unvoiced portion is analysed with classical linear prediction analysis to obtain the linear prediction coefficients.
3. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 2, characterized in that step (2) includes estimating, from the discrete harmonic amplitude values, line spectral frequency coefficients suited to the voice conversion task; this procedure comprises the following steps:
(b1) the discrete harmonic amplitudes are squared;
(b2) using the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients is obtained and solved;
(b3) the linear prediction coefficients are converted into line spectral frequency coefficients.
4. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 1, characterized in that the alignment criterion for the parameter sets in step (3) is: for two feature-parameter sequences of unequal length, the time axis of one is non-linearly mapped onto the time axis of the other using dynamic programming, establishing a one-to-one matching relationship; during the alignment of the parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, and the time-matching function is finally obtained.
5. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 1, characterized in that the feature-parameter mapping method in step (5) comprises the following steps:
(c1) exploiting the fact that parallel data contain identical semantic information but different speaker-identity information, and assuming that the hidden-layer state variables represent the semantic information, the hidden-layer structures of the source and target hybrid Kalman filters are kept shared; the statistical properties of the observation-layer variables are then estimated under the expectation-maximization criterion;
(c2) on the basis of step (c1), the difference between the source and target model structures is regarded as one embodiment of the speakers' different identities;
(c3) using the Kalman filter's ability to describe time-varying signals, this difference is mapped from the source feature space to the target feature space, completing the parameter conversion.
6. The high-quality voice conversion method based on modeling of signal timing characteristics according to claim 2, characterized in that the operation of the speech synthesis model in step (6) comprises the following steps:
(b1) the discrete harmonic amplitude and phase values of the voiced portion are used as the amplitudes and phases of sinusoidal signals, which are superposed; interpolation and phase compensation are applied so that the reconstructed signal is not distorted in the time-domain waveform;
(b2) for the unvoiced portion, a white-noise signal is passed through an all-pole filter to obtain an approximate reconstruction of the signal;
(b3) the voiced and unvoiced portions are superposed to obtain the reconstructed speech signal.
CN201210490464.6A 2012-11-27 2012-11-27 High-quality voice conversion method based on modeling of signal timing characteristics Expired - Fee Related CN103035236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210490464.6A CN103035236B (en) 2012-11-27 2012-11-27 High-quality voice conversion method based on modeling of signal timing characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210490464.6A CN103035236B (en) 2012-11-27 2012-11-27 High-quality voice conversion method based on modeling of signal timing characteristics

Publications (2)

Publication Number Publication Date
CN103035236A CN103035236A (en) 2013-04-10
CN103035236B true CN103035236B (en) 2014-12-17

Family

ID=48022068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210490464.6A Expired - Fee Related CN103035236B (en) 2012-11-27 2012-11-27 High-quality voice conversion method based on modeling of signal timing characteristics

Country Status (1)

Country Link
CN (1) CN103035236B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108613679A (en) * 2018-06-14 2018-10-02 河北工业大学 A kind of mobile robot Extended Kalman filter synchronous superposition method

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413548B (en) * 2013-08-16 2016-02-03 中国科学技术大学 A kind of sound converting method of the joint spectrum modeling based on limited Boltzmann machine
CN105425319B (en) * 2015-09-16 2017-10-13 河海大学 Rainfall satellite heavy rain assimilation method based on ground survey Data correction
CN106782599A (en) * 2016-12-21 2017-05-31 河海大学常州校区 The phonetics transfer method of post filtering is exported based on Gaussian process
CN107068165B (en) * 2016-12-31 2020-07-24 南京邮电大学 Voice conversion method
CN107103914B (en) * 2017-03-20 2020-06-16 南京邮电大学 High-quality voice conversion method
CN108681709B (en) * 2018-05-16 2020-01-17 深圳大学 Intelligent input method and system based on bone conduction vibration and machine learning
CN113112030B (en) * 2019-04-28 2023-12-26 第四范式(北京)技术有限公司 Method and system for training model and method and system for predicting sequence data
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751921B (en) * 2009-12-16 2011-09-14 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108613679A (en) * 2018-06-14 2018-10-02 河北工业大学 A kind of mobile robot Extended Kalman filter synchronous superposition method
CN108613679B (en) * 2018-06-14 2020-06-16 河北工业大学 Method for synchronous positioning and map construction of extended Kalman filtering of mobile robot

Also Published As

Publication number Publication date
CN103035236A (en) 2013-04-10

Similar Documents

Publication Publication Date Title
CN103035236B (en) High-quality voice conversion method based on modeling of signal timing characteristics
CN103531205B (en) The asymmetrical voice conversion method mapped based on deep neural network feature
CN105023580B (en) Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method
Sun et al. Unseen noise estimation using separable deep auto encoder for speech enhancement
Yu et al. Continuous F0 modeling for HMM based statistical parametric speech synthesis
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
Cui et al. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR
CN104685562B (en) Method and apparatus for reconstructing echo signal from noisy input signal
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
Juvela et al. Speaker-independent raw waveform model for glottal excitation
CN103021418A (en) Voice conversion method facing to multi-time scale prosodic features
CN102306492A (en) Voice conversion method based on convolutive nonnegative matrix factorization
Nørholm et al. Instantaneous fundamental frequency estimation with optimal segmentation for nonstationary voiced speech
Nirmal et al. Voice conversion using general regression neural network
CN106782599A (en) The phonetics transfer method of post filtering is exported based on Gaussian process
Juvela et al. Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks
Chetouani et al. Investigation on LP-residual representations for speaker identification
Narendra et al. Estimation of the glottal source from coded telephone speech using deep neural networks
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
Tobing et al. Voice conversion with CycleRNN-based spectral mapping and finely tuned WaveNet vocoder
US20220172703A1 (en) Acoustic model learning apparatus, method and program and speech synthesis apparatus, method and program
CN104538026A (en) Fundamental frequency modeling method used for parametric speech synthesis
Ai et al. Denoising-and-dereverberation hierarchical neural vocoder for statistical parametric speech synthesis
Wu et al. Nonlinear speech coding model based on genetic programming
Aroon et al. Statistical parametric speech synthesis: A review

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
CB03 Change of inventor or designer information

Inventor after: Xu Ningtao

Inventor after: Liu Pingsheng

Inventor after: Xie Daokuang

Inventor before: Xu Ning

Inventor before: Bao Jingyi

Inventor before: Tang Yibin

COR Change of bibliographic data
TR01 Transfer of patent right

Effective date of registration: 20160504

Address after: 518042 Guangdong city of Shenzhen province Futian District Che Kung Temple Cheonan Digital City Tienhsiang building 7B1

Patentee after: SHENZHEN TENGRUIFENG TECHNOLOGY CO.,LTD.

Address before: 213022 Changzhou Jin Ling North Road, Jiangsu, No. 200

Patentee before: CHANGZHOU CAMPUS OF HOHAI University

CB03 Change of inventor or designer information

Inventor after: Xu Ningtao

Inventor after: Liu Pingsheng

Inventor after: Xie Daokuang

Inventor before: Xu Ningtao

Inventor before: Liu Pingsheng

Inventor before: Xie Daokuang

COR Change of bibliographic data
PP01 Preservation of patent right
PP01 Preservation of patent right

Effective date of registration: 20190814

Granted publication date: 20141217

PD01 Discharge of preservation of patent
PD01 Discharge of preservation of patent

Date of cancellation: 20210814

Granted publication date: 20141217

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141217

Termination date: 20191127