CN104392717A - Sound track spectrum Gaussian mixture model based rapid voice conversion system and method - Google Patents

Sound track spectrum Gaussian mixture model based rapid voice conversion system and method

Info

Publication number
CN104392717A
CN104392717A (application CN201410742549.8A; also published as CN 104392717 A)
Authority
CN
China
Prior art keywords
characteristic parameter
signal
value
parameter
weight coefficient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410742549.8A
Other languages
Chinese (zh)
Inventor
鲍静益
徐宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Institute of Technology
Original Assignee
Changzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Institute of Technology filed Critical Changzhou Institute of Technology
Priority to CN201410742549.8A priority Critical patent/CN104392717A/en
Publication of CN104392717A publication Critical patent/CN104392717A/en
Pending legal-status Critical Current

Abstract

The invention discloses a rapid voice conversion system and method based on Gaussian mixture modeling of the vocal tract spectrum. The method comprises the steps of parameter extraction and synthesis, characteristic-parameter time alignment, and characteristic-parameter training and conversion. By fixing the Gaussian means at preset mel frequencies, adjusting the Gaussian variances adaptively, and taking sample points of the log-magnitude spectrum as the weight coefficients, the computational complexity of speech parameterization is greatly reduced and the running speed is markedly improved.

Description

A rapid voice conversion system based on Gaussian mixture modeling of the vocal tract spectrum, and a method thereof
Technical field
The present invention relates to speech processing technology, and in particular to a rapid voice conversion system based on Gaussian mixture modeling of the vocal tract spectrum and to a method thereof.
Background technology
Voice conversion is accomplished in several steps: characteristic-parameter extraction, construction of the mapping relations between parameters, and real-time parameter conversion. Each step involves complex signal-processing operations, places high demands on hardware and software, and takes a long time to run, which hinders the deployment of voice conversion technology on widely used mobile and embedded devices. In particular, the characteristic-parameter extraction stage of a conventional voice conversion system usually requires transformations between the time, frequency, and cepstral domains, and its computational load is especially heavy. Moreover, on constrained hardware, an overly complex parameter-extraction algorithm can also yield inaccurate results.
Summary of the invention
In view of the above problems, the object of the present invention is to provide a rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum. By presetting fixed frequency values, sampling the logarithm of the spectral-envelope amplitude, and adaptively adjusting the variances applied to the vocal-tract spectral envelope, the parameterization process requires only elementary arithmetic (addition, subtraction, multiplication, and division) and no complex signal-processing machinery, which substantially reduces computational complexity and shortens running time. When this characteristic parameter is used in a voice conversion system, experimental results show performance better than that of classical methods.
To achieve the above object, the present invention adopts the following technical solution: a rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum, whose steps comprise characteristic-parameter extraction and synthesis, characteristic-parameter time alignment, and characteristic-parameter training and conversion;
The characteristic-parameter extraction decomposes the original speech signal, and the characteristic-parameter synthesis is the inverse of the extraction;
The characteristic-parameter time alignment arranges and screens the characteristic parameters of the conversion-source and conversion-target speech to obtain parameter sets that are synchronous in time;
The characteristic-parameter training learns the mapping relations between the feature-parameter sets of the conversion source and the conversion target, yielding a mapping rule; the characteristic-parameter conversion applies this mapping rule to convert the source speech into speech of the target speaker.
The characteristic-parameter extraction proceeds as follows:
(a1) frame the speech signal into 20 ms frames and estimate the fundamental frequency by a correlation method;
(a2) from the fundamental frequency, decide whether the frame is unvoiced or voiced; if the frame is voiced, set a maximum voiced frequency that divides the spectrum into a dominant harmonic region and a random region; estimate the discrete harmonic amplitudes and phases with a least-squares algorithm; interpolate the discrete harmonic amplitudes to obtain the spectral envelope;
(a3) if the frame is unvoiced, analyze it with linear-prediction analysis to obtain the linear-prediction coefficients;
(a4) warp the frequency axis of the spectral envelope nonlinearly to obtain mel frequencies; preset 24 mel-frequency values and take them as the means of the components of a Gaussian mixture model; take the logarithm of the envelope amplitude axis, sample it at the Gaussian means, and save the sampled values as the weight coefficients; following auditory-perception theory, make each weight coefficient approximately inversely proportional to the variance of its Gaussian distribution, thereby determining the variances of all mixture components in turn;
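The parameterization in step (a4) requires only interpolation and elementary arithmetic. The following Python sketch illustrates the idea; the mel-warping formula, the 50 Hz lower bound, and the proportionality constant `k` for the inverse weight-variance relation are illustrative assumptions, not values fixed by this description.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale warping of the frequency axis (assumed formula).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def parameterize_envelope(freqs_hz, envelope, n_mix=24, fs=16000):
    """Fixed-mean Gaussian-mixture parameterization of a spectral envelope.

    The 24 Gaussian means are preset, equally spaced on the mel axis; the
    log-magnitude envelope sampled at each mean becomes that component's
    weight coefficient, and the variance is set approximately inversely
    proportional to the weight (hypothetical constant k).
    """
    mel_means = np.linspace(hz_to_mel(50.0), hz_to_mel(fs / 2.0), n_mix)
    hz_means = mel_to_hz(mel_means)
    log_env = np.log(np.maximum(envelope, 1e-10))
    # Sample the log envelope at the fixed Gaussian means -> weights.
    weights = np.interp(hz_means, freqs_hz, log_env)
    # Variances inversely proportional to the weights (assumed k = 1).
    k = 1.0
    variances = k / np.maximum(np.abs(weights), 1e-3)
    return hz_means, weights, variances
```

Because the Gaussian means are fixed in advance, only the 24 weight coefficients vary from frame to frame, which is what keeps the per-frame cost low.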
The characteristic-parameter synthesis proceeds as follows:
(b1) apply amplitude correction to the weight-coefficient sequence; the correction scale is determined by, and is approximately proportional to, the maximum of the distribution;
(b2) interpolate the corrected weight-coefficient sequence into a spectral-envelope curve with frequency on the abscissa and amplitude on the ordinate;
(b3) sample the spectral envelope of the voiced signal at multiples of the fundamental frequency to obtain the discrete harmonic amplitudes;
(b4) use the discrete harmonic amplitudes and phases of the voiced signal as the amplitudes and phases of sinusoids and superpose them; use interpolation and phase compensation so that the reconstructed time-domain waveform is not distorted;
(b5) pass an arbitrary white-noise signal through an all-pole filter to obtain an approximate reconstruction of the unvoiced signal;
(b6) superpose the voiced and unvoiced signals to obtain the reconstructed speech signal.
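Steps (b1) to (b6) can be sketched for a single frame as follows. This is illustrative only: the frame length and sampling rate are assumed values, and the amplitude-correction and phase-compensation details of steps (b1) and (b4) are omitted.

```python
import numpy as np

def synthesize_frame(hz_means, weights, f0, phases, lpc_coefs,
                     n=320, fs=16000, voiced=True):
    """Reconstruct one frame from the weight-coefficient sequence (sketch).

    Voiced: interpolate the weights back into a log envelope, sample it
    at the harmonics of f0, and sum the harmonics as sinusoids.
    Unvoiced: pass white noise through the all-pole filter 1/A(z) given
    by the linear-prediction coefficients (lpc_coefs[0] == 1).
    """
    t = np.arange(n) / fs
    if voiced:
        n_harm = min(len(phases), int((fs / 2.0) // f0))
        harm_f = f0 * np.arange(1, n_harm + 1)
        # (b2)-(b3): weights -> log envelope, sampled at the harmonics.
        log_env = np.interp(harm_f, hz_means, weights)
        amps = np.exp(log_env)
        # (b4): superpose the sinusoids.
        frame = np.zeros(n)
        for a, f, p in zip(amps, harm_f, phases[:n_harm]):
            frame += a * np.cos(2.0 * np.pi * f * t + p)
        return frame
    # (b5): white noise through the all-pole (AR) filter.
    noise = np.random.randn(n)
    out = np.zeros(n)
    for i in range(n):
        acc = noise[i]
        for k in range(1, len(lpc_coefs)):
            if i - k >= 0:
                acc -= lpc_coefs[k] * out[i - k]
        out[i] = acc
    return out
```

Step (b6) is then simply the sample-wise sum of the voiced and unvoiced frames.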
The characteristic-parameter time alignment proceeds as follows:
(c1) for the two unequal-length characteristic-parameter sequences of the conversion-source and conversion-target speech signals, use a dynamic time warping algorithm to map the time axis of one nonlinearly onto that of the other, establishing a one-to-one matching;
(c2) during the alignment of the parameter sets, iteratively optimize a preset cumulative distortion function under a constrained search region to obtain the final time-matching function.
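A minimal dynamic-time-warping sketch of steps (c1) and (c2) is given below; it accumulates a Euclidean distortion and backtracks the minimum-cost path, but for brevity omits the search-region constraint mentioned in (c2).

```python
import numpy as np

def dtw_align(x, y):
    """Align two unequal-length feature sequences (frames x dims).

    Returns the accumulated distortion and the warping path, a plain
    O(N*M) dynamic-programming sketch with steps (i-1,j), (i,j-1),
    (i-1,j-1).
    """
    nx, ny = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((nx, ny), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(nx):
        for j in range(ny):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = dist[i, j] + prev
    # Backtrack the minimum-distortion path.
    i, j = nx - 1, ny - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    return acc[-1, -1], path[::-1]
```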
The characteristic-parameter training proceeds as follows:
Join the aligned source and target speech-signal feature parameters into an augmented matrix, preset the number of mixtures N, and learn the Gaussian-mixture-model parameters iteratively by the expectation-maximization rule; the means, variances, and weight coefficients of this mixture model are the parameters to be estimated. The joint posterior probability density of the weight coefficients and the model parameters (the means and variances of the Gaussian processes) is approximated by Markov chain Monte Carlo: first assume that the weight coefficients and the model parameters are mutually independent, then estimate both probability densities progressively by iteration. In each iteration, fix one group of unknowns and sample the other, approximating its probability distribution from a large number of samples. Finally, multiply the distribution of the weight coefficients by that of the model parameters to obtain the joint posterior probability function; marginalizing the joint density yields estimates of the distributions of the weight coefficients and of the model parameters respectively, and the mixed-Gaussian random-process model structure is thereby determined.
The characteristic-parameter conversion proceeds as follows:
(d1) given the set of input observation vectors, use the structural parameters of the trained mixed-Gaussian random process to compute the posterior membership value of the current speech frame;
(d2) within the subspace of each cluster's mixture component, compute the conditional expectation of the output variable and take its mean as the conversion output;
(d3) superpose the outputs of all components, weighted by the posterior membership values, to obtain the mapped speech feature parameters.
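Steps (d1) to (d3) amount to standard mixture-based regression. The sketch below assumes diagonal covariances and a joint model already split into source-marginal and cross-covariance parts; all names and shapes are illustrative assumptions.

```python
import numpy as np

def gmm_convert(x, weights, means_x, means_y, cov_xx, cov_yx):
    """Map a source feature vector x through a trained joint mixture model.

    (d1) posterior membership of each component given x;
    (d2) conditional mean of the target in each cluster;
    (d3) posterior-weighted superposition of the component outputs.
    Diagonal covariances assumed: cov_xx and cov_yx are (n_mix, dim).
    """
    n_mix = len(means_x)
    # (d1) posterior membership values p(i | x), via log densities.
    log_p = np.empty(n_mix)
    for i in range(n_mix):
        diff = x - means_x[i]
        log_p[i] = (np.log(weights[i])
                    - 0.5 * np.sum(np.log(2.0 * np.pi * cov_xx[i]))
                    - 0.5 * np.sum(diff ** 2 / cov_xx[i]))
    post = np.exp(log_p - log_p.max())
    post /= post.sum()
    # (d2) conditional expectation E[y | x, i] per component.
    y_i = np.array([means_y[i] + cov_yx[i] * (x - means_x[i]) / cov_xx[i]
                    for i in range(n_mix)])
    # (d3) weighted superposition of all component outputs.
    return post @ y_i
```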
With the above technical solution, the present invention has at least the following advantages:
1. The voice conversion scheme is suited to large-data environments.
In a large-data environment, the data exhibit strong correlation and overlap. For Chinese speech in particular, beneath the rich surface variation the underlying phonetic metadata are limited. By building a voice conversion method with a mixture structure, the speech data can therefore be modeled in clusters, making full use of the large data set and improving system performance.
2. The voice conversion algorithm has a nonlinear mapping property and can simulate the complex data relations of real environments well.
By building the voice conversion method on Gaussian random processes, their nonlinear mapping capability can be fully exploited, which is especially suitable for highly variable signals such as complex speech.
Description of the accompanying drawing
Fig. 1 is a system block diagram of the present invention.
Embodiment
The present invention is further described below in conjunction with the accompanying drawing.
As shown in Fig. 1, a rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum comprises the steps of characteristic-parameter extraction and synthesis, characteristic-parameter time alignment, and characteristic-parameter training and conversion;
The characteristic-parameter extraction decomposes the original speech signal, and the characteristic-parameter synthesis is the inverse of the extraction;
The characteristic-parameter time alignment arranges and screens the characteristic parameters of the conversion-source and conversion-target speech to obtain parameter sets that are synchronous in time;
The characteristic-parameter training learns the mapping relations between the feature-parameter sets of the conversion source and the conversion target, yielding a mapping rule; the characteristic-parameter conversion applies this mapping rule to convert the source speech into speech of the target speaker.
Characteristic-parameter extraction comprises the following operations:
(a1) frame the speech signal into fixed-duration frames with a frame length of 20 ms and a frame shift of 10 ms; within each frame, compute the autocorrelation function of the speech, use the first side-lobe peak of the autocorrelation to approximate the pitch period, and take the reciprocal of the pitch period as the fundamental frequency;
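Step (a1) can be sketched as follows; the 60 to 400 Hz search range for the autocorrelation peak is an assumed choice, not stated in the description.

```python
import numpy as np

def estimate_f0(frame, fs=16000, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency of one 20 ms frame from the
    first side-lobe peak of its autocorrelation function."""
    frame = frame - frame.mean()
    # Non-negative lags of the autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs / f0_max)
    hi = min(int(fs / f0_min), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    # The reciprocal of the pitch period is the fundamental frequency.
    return fs / lag
```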
(a2) from the fundamental-frequency value obtained in step (a1) (zero for unvoiced frames, nonzero for voiced frames), decide whether the frame is unvoiced or voiced. If voiced, set a maximum voiced frequency that divides the spectrum into a dominant harmonic region and a random region. In the band below the maximum voiced frequency, model the signal as a superposition of several sinusoids and solve for their discrete amplitudes and phases by constrained least squares; leave the band above the maximum voiced frequency unprocessed;
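The least-squares estimation of the discrete harmonic amplitudes and phases in step (a2) can be posed as an ordinary linear least-squares problem over cosine/sine basis pairs, as in the following sketch (the basis construction is an assumed formulation).

```python
import numpy as np

def harmonic_ls(frame, f0, max_voiced_hz, fs=16000):
    """Least-squares estimate of discrete harmonic amplitudes and phases
    for the harmonics of f0 below the maximum voiced frequency."""
    n_harm = max(1, int(max_voiced_hz // f0))
    t = np.arange(len(frame)) / fs
    # Design matrix of cosine/sine pairs for each harmonic.
    cols = []
    for k in range(1, n_harm + 1):
        cols.append(np.cos(2.0 * np.pi * k * f0 * t))
        cols.append(np.sin(2.0 * np.pi * k * f0 * t))
    a = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(a, frame, rcond=None)
    c, s = coef[0::2], coef[1::2]
    amps = np.hypot(c, s)
    # So that frame ~= sum_k amps[k] * cos(2*pi*k*f0*t + phases[k]).
    phases = np.arctan2(-s, c)
    return amps, phases
```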
(a3) if the frame is unvoiced, analyze it with the classical linear-prediction analysis method: establish an all-pole model and solve for the model coefficients by constrained least squares, obtaining the linear-prediction coefficients.
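Step (a3), the classical autocorrelation route to an all-pole model, can be sketched with a Levinson-Durbin recursion as follows; the prediction order of 12 is an assumed choice.

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """Linear-prediction coefficients for an unvoiced frame via the
    autocorrelation method and the Levinson-Durbin recursion.

    Returns a with a[0] == 1, so the all-pole model is 1 / A(z),
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, plus the final
    prediction-error power.
    """
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```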
(a4) warp the frequency axis of the spectral envelope nonlinearly to obtain mel frequencies; preset 24 mel-frequency values and take them as the means of the components of a Gaussian mixture model; take the logarithm of the envelope amplitude axis, sample it at the Gaussian means, and save the sampled values as the weight coefficients; following auditory-perception theory, make each weight coefficient approximately inversely proportional to the variance of its Gaussian distribution, thereby determining the variances of all mixture components in turn.
Characteristic-parameter synthesis comprises the following operations:
(b1) apply amplitude correction to the weight-coefficient sequence; the correction scale is determined by, and is approximately proportional to, the maximum of the distribution;
(b2) interpolate the corrected weight-coefficient sequence into a spectral-envelope curve with frequency on the abscissa and amplitude on the ordinate;
(b3) sample the spectral envelope of the voiced signal at multiples of the fundamental frequency to obtain the discrete harmonic amplitudes;
(b4) use the discrete harmonic amplitudes and phases of the voiced signal as the amplitudes and phases of sinusoids and superpose them; use interpolation and phase compensation so that the reconstructed time-domain waveform is not distorted;
(b5) for the unvoiced signal, pass an arbitrary white-noise signal through an all-pole filter to obtain an approximate reconstruction of the unvoiced signal;
(b6) superpose the voiced and unvoiced signals to obtain the reconstructed speech signal.
Characteristic-parameter time alignment:
(c1) for the two unequal-length characteristic-parameter sequences, use a dynamic time warping algorithm to map the time axis of one nonlinearly onto that of the other, establishing a one-to-one matching;
(c2) during the alignment of the parameter sets, iteratively optimize a preset cumulative distortion function under a constrained search region to obtain the final time-matching function.
The parameter training and conversion modules take Gaussian random processes as their theoretical foundation and extend the basic framework with a mixture structure, so that the data can be modeled in clusters and accuracy improved. At the same time, thanks to the nonlinear mapping property of Gaussian random processes, the system can realize conversion between characteristic parameters with fairly complex relations. The whole procedure comprises two stages, a training stage and a conversion stage, whose operation steps are as follows.
The characteristic-parameter training proceeds as follows:
Join the aligned source and target speech-signal feature parameters into an augmented matrix and build a Gaussian random-process model with a mixture structure. Let the number of mixtures be N, with the weight coefficient of each component denoted r_i, i = 1, 2, 3, ..., N. Then, given the sets of input and output vectors, the output vector sequence is approximately a weighted combination of N Gaussian random processes, whose inputs are the given input vector sequence. The weight coefficients and the mean and variance parameters of each Gaussian random process are the unknowns to be estimated. Their joint posterior probability density is approximated by Markov chain Monte Carlo: first assume that the weight coefficients and the model parameters (the means and variances of the Gaussian processes) are mutually independent, then estimate both probability densities progressively by iteration. In each iteration, fix one group of unknowns and sample the other, approximating its probability distribution from a large number of samples. Finally, multiply the distribution of the weight coefficients by that of the model parameters to obtain the joint posterior probability function; marginalizing the joint density yields estimates of the distributions of the weight coefficients and of the model parameters respectively. At this point the mixed-Gaussian random-process model structure is determined.
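As a simplified, testable stand-in for the training stage, the sketch below fits a diagonal-covariance Gaussian mixture to the augmented (source, target) feature matrix with plain expectation-maximization; the Markov chain Monte Carlo refinement of the weight coefficients and model parameters described above is deliberately omitted, and the deterministic initialization is an assumption.

```python
import numpy as np

def train_joint_gmm(z, n_mix=4, n_iter=50):
    """EM training of a diagonal-covariance Gaussian mixture on the
    augmented (source||target) feature matrix z (frames x dims).
    Returns the mixture weights, means, and variances."""
    n, d = z.shape
    # Deterministic initialization: means spread across the data,
    # variances set to the overall data variance.
    means = z[np.linspace(0, n - 1, n_mix).astype(int)].astype(float).copy()
    var = np.tile(z.var(axis=0) + 1e-6, (n_mix, 1))
    w = np.full(n_mix, 1.0 / n_mix)
    for _ in range(n_iter):
        # E-step: responsibilities from log Gaussian densities.
        log_r = (np.log(w)[None, :]
                 - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)[None, :]
                 - 0.5 * np.sum((z[:, None, :] - means[None, :, :]) ** 2
                                / var[None, :, :], axis=2))
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances.
        nk = r.sum(axis=0) + 1e-10
        w = nk / n
        means = (r.T @ z) / nk[:, None]
        var = (r.T @ (z ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var
```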
The characteristic-parameter conversion proceeds as follows:
(d1) given the set of input observation vectors, use the structural parameters of the trained mixed-Gaussian random process to compute the membership-function value of the current speech frame, where the membership function is the ratio of normalized posterior weight coefficients;
(d2) from the membership values, determine to which Gaussian subcomponent the current speech belongs, then, within the subspace of each cluster, produce the corresponding output according to the definition of the Gaussian random process;
(d3) superpose the outputs of all components, weighted by the membership-function values, to obtain the mapped speech feature parameters.
The above is merely a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been disclosed above by way of a preferred embodiment, it is not thereby limited. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or to modify it into equivalent embodiments. Any simple amendment, equivalent change, or modification made to the above embodiment according to the technical essence of the invention, insofar as it does not depart from the content of the technical solution, still falls within the scope of the technical solution of the present invention.

Claims (6)

1. A rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum, characterized in that its steps comprise characteristic-parameter extraction and synthesis, characteristic-parameter time alignment, and characteristic-parameter training and conversion;
the characteristic-parameter extraction decomposes the original speech signal, and the characteristic-parameter synthesis is the inverse of the extraction;
the characteristic-parameter time alignment arranges and screens the characteristic parameters of the conversion-source and conversion-target speech to obtain parameter sets that are synchronous in time;
the characteristic-parameter training learns the mapping relations between the feature-parameter sets of the conversion source and the conversion target, yielding a mapping rule; the characteristic-parameter conversion applies this mapping rule to convert the source speech into speech of the target speaker.
2. The rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum according to claim 1, characterized in that the characteristic-parameter extraction proceeds as follows:
(a1) frame the speech signal into 20 ms frames and estimate the fundamental frequency by a correlation method;
(a2) from the fundamental frequency, decide whether the frame is unvoiced or voiced; if the frame is voiced, set a maximum voiced frequency that divides the spectrum into a dominant harmonic region and a random region; estimate the discrete harmonic amplitudes and phases with a least-squares algorithm; interpolate the discrete harmonic amplitudes to obtain the spectral envelope;
(a3) if the frame is unvoiced, analyze it with linear-prediction analysis to obtain the linear-prediction coefficients;
(a4) warp the frequency axis of the spectral envelope nonlinearly to obtain mel frequencies; preset 24 mel-frequency values and take them as the means of the components of a Gaussian mixture model; take the logarithm of the envelope amplitude axis, sample it at the Gaussian means, and save the sampled values as the weight coefficients; following auditory-perception theory, make each weight coefficient approximately inversely proportional to the variance of its Gaussian distribution, thereby determining the variances of all mixture components in turn.
3. The rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum according to claim 2, characterized in that the characteristic-parameter synthesis proceeds as follows:
(b1) apply amplitude correction to the weight-coefficient sequence; the correction scale is determined by, and is approximately proportional to, the maximum of the distribution;
(b2) interpolate the corrected weight-coefficient sequence into a spectral-envelope curve with frequency on the abscissa and amplitude on the ordinate;
(b3) sample the spectral envelope of the voiced signal at multiples of the fundamental frequency to obtain the discrete harmonic amplitudes;
(b4) use the discrete harmonic amplitudes and phases of the voiced signal as the amplitudes and phases of sinusoids and superpose them; use interpolation and phase compensation so that the reconstructed time-domain waveform is not distorted;
(b5) pass an arbitrary white-noise signal through an all-pole filter to obtain an approximate reconstruction of the unvoiced signal;
(b6) superpose the voiced and unvoiced signals to obtain the reconstructed speech signal.
4. The rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum according to claim 3, characterized in that the characteristic-parameter time alignment proceeds as follows:
(c1) for the two unequal-length characteristic-parameter sequences of the conversion-source and conversion-target speech signals, use a dynamic time warping algorithm to map the time axis of one nonlinearly onto that of the other, establishing a one-to-one matching;
(c2) during the alignment of the parameter sets, iteratively optimize a preset cumulative distortion function under a constrained search region to obtain the final time-matching function.
5. The rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum according to claim 4, characterized in that the characteristic-parameter training proceeds as follows:
join the aligned source and target speech-signal feature parameters into an augmented matrix, preset the number of mixtures N, and learn the Gaussian-mixture-model parameters iteratively by the expectation-maximization rule; the means, variances, and weight coefficients of this mixture model are the parameters to be estimated; approximate the joint posterior probability density of the weight coefficients and the model parameters (the means and variances of the Gaussian processes) by Markov chain Monte Carlo, namely first assume that the weight coefficients and the model parameters are mutually independent, then estimate both probability densities progressively by iteration; in each iteration, fix one group of unknowns and sample the other, approximating its probability distribution from a large number of samples; finally, multiply the distribution of the weight coefficients by that of the model parameters to obtain the joint posterior probability function; marginalizing the joint density yields estimates of the distributions of the weight coefficients and of the model parameters respectively, and the mixed-Gaussian random-process model structure is thereby determined.
6. The rapid voice conversion method based on Gaussian mixture modeling of the vocal tract spectrum according to claim 5, characterized in that the characteristic-parameter conversion proceeds as follows:
(d1) given the set of input observation vectors, use the structural parameters of the trained mixed-Gaussian random process to compute the posterior membership value of the current speech frame;
(d2) within the subspace of each cluster's mixture component, compute the conditional expectation of the output variable and take its mean as the conversion output;
(d3) superpose the outputs of all components, weighted by the posterior membership values, to obtain the mapped speech feature parameters.
CN201410742549.8A 2014-12-08 2014-12-08 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method Pending CN104392717A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410742549.8A CN104392717A (en) 2014-12-08 2014-12-08 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method


Publications (1)

Publication Number Publication Date
CN104392717A true CN104392717A (en) 2015-03-04

Family

ID=52610610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410742549.8A Pending CN104392717A (en) 2014-12-08 2014-12-08 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method

Country Status (1)

Country Link
CN (1) CN104392717A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090171657A1 (en) * 2007-12-28 2009-07-02 Nokia Corporation Hybrid Approach in Voice Conversion
CN101751921A (en) * 2009-12-16 2010-06-23 南京邮电大学 Real-time voice conversion method under conditions of minimal amount of training data
KR20100108843A (en) * 2009-03-30 2010-10-08 한국과학기술원 Method of voice conversion based on gaussian mixture model using kernel principal component analysis
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103531205A (en) * 2013-10-09 2014-01-22 常州工学院 Asymmetrical voice conversion method based on deep neural network feature mapping
CN104091592A (en) * 2014-07-02 2014-10-08 常州工学院 Voice conversion system based on hidden Gaussian random field


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105974416A (en) * 2016-07-26 2016-09-28 四川电子军工集团装备技术有限公司 Accumulation cross-correlation envelope alignment 8-core DSP on-chip parallel implementation method
CN105974416B (en) * 2016-07-26 2018-06-15 零八一电子集团有限公司 8-core DSP on-chip parallel implementation method for accumulated cross-correlation envelope alignment
CN108198566A (en) * 2018-01-24 2018-06-22 咪咕文化科技有限公司 Information processing method and device, electronic equipment and storage medium
CN109712634A (en) * 2018-12-24 2019-05-03 东北大学 Automatic voice conversion method
CN110070852A (en) * 2019-04-26 2019-07-30 平安科技(深圳)有限公司 Method, apparatus, device and storage medium for synthesizing Chinese speech
WO2020215551A1 (en) * 2019-04-26 2020-10-29 平安科技(深圳)有限公司 Chinese speech synthesizing method, apparatus and device, storage medium
CN110070852B (en) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 Method, device, equipment and storage medium for synthesizing Chinese voice
CN112116924A (en) * 2019-06-21 2020-12-22 株式会社日立制作所 Abnormal sound detection system, pseudo sound generation system, and pseudo sound generation method
CN112116924B (en) * 2019-06-21 2024-02-13 株式会社日立制作所 Abnormal sound detection system, pseudo sound generation system, and pseudo sound generation method
CN113066472A (en) * 2019-12-13 2021-07-02 科大讯飞股份有限公司 Synthetic speech processing method and related device
CN112652318A (en) * 2020-12-21 2021-04-13 北京捷通华声科技股份有限公司 Tone conversion method and device and electronic equipment
CN112652318B (en) * 2020-12-21 2024-03-29 北京捷通华声科技股份有限公司 Tone conversion method and device, and electronic equipment

Similar Documents

Publication Publication Date Title
CN104392717A (en) Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN106653056B (en) Fundamental frequency extraction model and training method based on LSTM recurrent neural network
CN110111803B (en) Transfer learning voice enhancement method based on self-attention multi-kernel maximum mean difference
CN104091592B (en) Voice conversion system based on hidden Gaussian random field
CN102664003B (en) Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
CN104392718B (en) Robust speech recognition method based on acoustic model array
CN104538028A (en) Continuous voice recognition method based on deep long and short term memory recurrent neural network
CN104464725B (en) Method and apparatus for singing imitation
CN104464744A (en) Clustered voice conversion method and system based on Gaussian mixture random process
CN109767778A (en) Voice conversion method fusing Bi-LSTM and WaveNet
JP2009042716A (en) Cyclic signal processing method, cyclic signal conversion method, cyclic signal processing apparatus, and cyclic signal analysis method
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN110189766B (en) Voice style transfer method based on neural network
CN113506562B (en) End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
CN110047501A (en) Many-to-many voice conversion method based on beta-VAE
CN102930863B (en) Voice conversion and reconstruction method based on simplified self-adaptive interpolation weighting spectrum model
CN109300484B (en) Audio alignment method and device, computer equipment and readable storage medium
CN107785030B (en) Voice conversion method
CN112086100B (en) Quantization error entropy based urban noise identification method of multilayer random neural network
CN103886859B (en) Voice conversion method based on one-to-many codebook mapping
Wu et al. Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion.
CN105206259A (en) Voice conversion method
CN112562702B (en) Voice super-resolution method based on cyclic frame sequence gating cyclic unit network
CN115862590A (en) Text-driven speech synthesis method based on feature pyramid
CN112634914B (en) Neural network vocoder training method based on short-time spectrum consistency

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150304