CN104464744A - Cluster voice transforming method and system based on mixture Gaussian random process

Publication number: CN104464744A
Application number: CN201410662811.8A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Inventors: 徐宁, 胡芳, 蒋爱民, 刘小峰
Applicant and current assignee: Changzhou Campus of Hohai University
Application filed 2014-11-19 by Changzhou Campus of Hohai University
Abstract

The invention relates to a clustered voice conversion method and system based on a mixture of Gaussian processes. The method comprises: (S100) a training stage, in which characteristic parameters are extracted from source and target speech data and used for training; and (S200) a conversion stage, in which characteristic parameters newly extracted from the source speech are transformed using the training results and synthesized into reconstructed speech data. The scheme suits large-data environments in which the data are highly correlated and overlapping. In the case of Chinese speech, the richly varying surface forms are built from a limited set of underlying phonetic units; by establishing a voice conversion method with a mixture structure, the speech data can be modeled cluster by cluster, so that big data are fully exploited and system performance is improved.

Description

Clustered voice conversion method and system based on a mixture of Gaussian processes
Technical field
The present invention relates to a method and system for processing speech data, and in particular to a voice conversion method and system.
Background art
The object of voice conversion is to change the personal characteristics of one speaker's voice so that it sounds like the voice of another, specified target speaker. Achieving this goal usually requires training on a large amount of data. In the big-data era, data are produced in large volume and are strongly correlated; making good use of such data is essential for the voice conversion task. As for Chinese speech, although its pronunciations are very rich on the surface, they are inherently composed of a limited set of initials, finals, and consonants, combined with a tone. The conversion process can therefore exploit this property through clustered modeling. On the other hand, mapping characteristic parameters between the source and target speakers precisely and efficiently requires a strong mapping algorithm; in general, a nonlinear algorithm should outperform a linear mapping algorithm.
Summary of the invention
The object of the invention is to provide a clustered voice conversion method. Taking the Gaussian process as its mathematical tool and exploiting the characteristics of speech data, it devises a novel Gaussian process structure with a mixture form and, on this basis, gives the corresponding training and conversion algorithms. In a large-data environment the method can effectively exploit the correlation between data and improve conversion accuracy. Moreover, being a nonlinear mapping technique, it adapts well to the modeling of complex data.
To solve the above technical problem, the invention provides a clustered voice conversion method comprising the following steps:
Step S100, the training stage: characteristic parameters are extracted from the source and target speech data respectively and then trained. Step S200, the conversion stage: characteristic parameters newly extracted from the source are transformed with the training results, and the result is synthesized into reconstructed speech data.
Further, in order to extract characteristic parameters from the source and target speech data respectively, step S100 comprises: step S110, speech analysis, in which the source and target speech data are each decomposed; step S120, parameter preprocessing, in which characteristic parameters are extracted from the decomposed source and target speech data and the two sets of parameters are aligned; and step S130, characteristic parameter training, in which the two sets of parameters are used to train the mixture of Gaussian processes.
Further, in step S110 the speech decomposition of the source and target speech data comprises: step S111, dividing the source or target speech signal into frames of fixed duration and estimating the fundamental frequency with the cross-correlation method, thereby obtaining the voiced and unvoiced signals; and step S112, setting a maximum voiced frequency for each voiced signal to divide the main energy regions of the harmonic and stochastic components, then estimating the discrete harmonic amplitudes and phases with a least-squares algorithm.
Further, to extract the voiced and unvoiced characteristic parameters, the extraction in step S120 comprises extracting the voiced and unvoiced characteristic parameters of the source and target speech data, yielding the characteristic parameters of the source or target speech data. Extracting the voiced characteristic parameters comprises: step S121, squaring the harmonic amplitudes obtained in step S112; step S122, using the one-to-one correspondence between the power spectral density function and the autocorrelation function to obtain a Toeplitz matrix equation in the voiced linear prediction coefficients, and solving it for those coefficients; and step S123, converting the voiced linear prediction coefficients into voiced line spectral frequency (LSF) coefficients, which are the voiced characteristic parameters. Extracting the unvoiced characteristic parameters comprises: analyzing each unvoiced signal with linear prediction analysis to obtain the unvoiced linear prediction coefficients, then converting them into unvoiced LSF coefficients, which are the unvoiced characteristic parameters.
Further, in step S120 the alignment of the two sets of characteristic parameters comprises: step S124, for the source and target parameters, using a dynamic time warping algorithm to map the time axis of one parameter sequence nonlinearly onto the time axis of the other, establishing a one-to-one match; and step S125, during alignment, iteratively optimizing a preset cumulative distortion function within a restricted search region to obtain the time-matching function.
Further, in step S130 the characteristic parameter training comprises: step S131, building a Gaussian process model with a mixture structure; step S132, approximately estimating the model parameters by the Markov chain Monte Carlo method, i.e. the joint posterior probability density function of the parameters and the membership indicator variables; and step S133, iteratively marginalizing this density to obtain separate estimates of the distribution of the membership indicator variables and of the model parameters, thereby determining the structural parameters of the mixture of Gaussian processes.
Further, the characteristic parameters of the source extracted again in step S200 are obtained by the same extraction method as for the source speech data in step S100, and transforming them with the training results comprises: step S210, computing the membership value of the current speech frame from the newly extracted source parameters and the trained structural parameters of the mixture of Gaussian processes; step S220, producing the corresponding output in the subspace of each cluster's mixture component according to the mixture of Gaussian processes; and step S230, superposing the outputs with the membership values as weights to obtain the mapped characteristic parameters.
Further, in step S200 synthesizing the reconstructed speech data comprises: using the discrete harmonic amplitudes and phases of the voiced signal in the parameters obtained in step S230 as the amplitudes and phases of superposed sinusoids, and superposing the result with the unvoiced signal generated by an all-pole filter, to obtain the reconstructed speech data.
In another aspect, the invention also provides a clustered voice conversion system to solve the same technical problem.
The clustered voice conversion system comprises: a training unit adapted to extract characteristic parameters from the source and target speech data respectively and then train on them; and a conversion-and-synthesis unit adapted to transform the newly extracted source parameters with the training results and synthesize the reconstructed speech data.
Further, the training unit comprises: a first speech analysis module that decomposes the source and target speech data respectively; a first parameter preprocessing module that extracts characteristic parameters from the decomposition results of the source and target speech and aligns them; and a characteristic parameter training module that trains the mixture of Gaussian processes on the characteristic parameters.
Further, the conversion-and-synthesis unit comprises: a second speech analysis module that decomposes the source speech data; a second feature extraction submodule, within a second parameter preprocessing module, that extracts characteristic parameters from the decomposition result of the source; and a conversion module, connected to the characteristic parameter training module and the second parameter preprocessing module, that transforms the extracted source parameters with the training results via the mixture of Gaussian processes.
Further, the first speech analysis module is adapted to decompose speech into voiced and unvoiced signals.
Further, in order to extract characteristic parameters from the source and target speech data respectively, the first parameter preprocessing module comprises: a first feature extraction submodule adapted to extract the characteristic parameters of the voiced and unvoiced signals, yielding the characteristic parameters of the source or target speech data; and a characteristic parameter alignment submodule adapted to map, by dynamic time warping, the time axis of one parameter sequence nonlinearly onto the time axis of the other for a one-to-one match, iteratively optimizing a preset cumulative distortion function within a restricted search region during alignment to obtain the time-matching function.
Further, the conversion-and-synthesis unit also comprises a speech synthesis module that superposes the voiced signal in the characteristic parameters reconstructed by the conversion module with the unvoiced signal, obtaining the reconstructed speech data.
Further, the speech synthesis module is adapted to use the discrete harmonic amplitudes and phases of the voiced signal as the amplitudes and phases of superposed sinusoids and to superpose the result with the unvoiced signal generated by an all-pole filter, obtaining the reconstructed speech data.
In a third aspect, the invention also provides a working method of a voice conversion system to solve the same technical problem.
The working method captures, through training, the mapping relationship between the characteristic parameter sets of the source and target speech data and uses it to obtain reconstructed speech data.
Further, capturing the mapping relationship comprises training with a Gaussian process model, which comprises the following steps: first build a Gaussian process model with a mixture structure; then approximately estimate its parameters by the Markov chain Monte Carlo method, i.e. the joint posterior probability density function of the parameters and the membership indicator variables; finally, iteratively marginalize this density to obtain separate estimates of the distribution of the membership indicator variables and of the model parameters, thereby determining the structural parameters of the mixture of Gaussian processes.
Further, capturing the mapping relationship also comprises conversion with the mixture of Gaussian processes, which comprises the following steps: first compute the membership value of the current speech frame from the newly extracted source parameters and the trained structural parameters; then produce the corresponding output in the subspace of each cluster's mixture component; finally, superpose the outputs with the membership values as weights to obtain the mapped characteristic parameters.
Further, the characteristic parameters are the respective characteristic parameters of the voiced and unvoiced signals.
Further, the discrete harmonic amplitudes and phases of the voiced signal are used as the amplitudes and phases of superposed sinusoids, which are superposed with the unvoiced signal generated by an all-pole filter to obtain the reconstructed speech data.
The beneficial effects of the invention are: (1) the scheme suits large-data environments, i.e. environments in which the data are strongly correlated and overlapping. For Chinese speech, beneath the richly varying surface forms the underlying phonetic units are limited in number; by establishing a voice conversion method with a mixture structure, the speech data can be modeled cluster by cluster, fully exploiting big data and improving system performance.
(2) The conversion algorithm has the advantage of nonlinear mapping and can simulate the complex data relationships found in real environments. By building the conversion method on Gaussian processes, it fully exploits their nonlinear mapping capability, which is especially suitable for signals as variable and complex as speech.
Brief description of the drawings
The invention is further described below in conjunction with the drawings and embodiments.
Fig. 1 shows the flow chart of the clustered voice conversion method;
Fig. 2 shows the flow chart of extracting characteristic parameters from the source and target speech data;
Fig. 3 shows the flow chart of speech decomposition, characteristic parameter extraction, and parameter alignment for the source and target speech data;
Fig. 4 shows the flow chart of characteristic parameter training;
Fig. 5 shows the flow chart of the conversion stage;
Fig. 6 shows the block diagram of the clustered voice conversion system.
Detailed description of the embodiments
The invention is explained in further detail below with reference to the accompanying drawings. The drawings are simplified schematic diagrams that illustrate the basic structure of the invention in outline only; they therefore show just the components relevant to the invention.
The voice conversion system described by the invention comprises a speech analysis module, first and second parameter preprocessing modules, a characteristic parameter training module, a conversion module, and a speech synthesis module. The speech analysis module decomposes the original speech signal, and the speech synthesis module recombines it. The first parameter preprocessing module arranges and screens the characteristic parameters of the source and target speech data to obtain temporally synchronized parameter sets. The characteristic parameter training module and the conversion module capture the mapping relationship between the two characteristic parameter sets of the source and target speech data, thereby obtaining the mapping rule.
Embodiment 1
Fig. 1 shows the flow chart of the clustered voice conversion method.
As shown in Fig. 1, the clustered voice conversion method of the invention comprises the following steps:
Step S100, the training stage: characteristic parameters are extracted from the source and target speech data respectively and then trained. Step S200, the conversion stage: characteristic parameters newly extracted from the source are transformed with the training results, and the result is synthesized into reconstructed speech data.
Fig. 2 shows the flow chart of extracting characteristic parameters from the source and target speech data.
As shown in Fig. 2, extracting characteristic parameters from the source and target speech data in step S100 comprises: step S110, speech analysis, in which the source and target speech data are each decomposed; step S120, parameter preprocessing, in which characteristic parameters are extracted from the decomposed source and target speech data and the two sets of parameters are aligned; and step S130, characteristic parameter training, in which the two sets of parameters are used to train the mixture of Gaussian processes.
Fig. 3 shows the flow chart of speech decomposition, characteristic parameter extraction, and parameter alignment for the source and target speech data.
In step S110 the speech decomposition of the source and target speech data comprises: step S111, dividing the source or target speech signal into frames of fixed duration and estimating the fundamental frequency with the cross-correlation method, thereby obtaining the voiced and unvoiced signals; and step S112, setting a maximum voiced frequency for each voiced signal to divide the main energy regions of the harmonic and stochastic components, then estimating the discrete harmonic amplitudes and phases with a least-squares algorithm.
The concrete speech analysis steps are as follows.
Step a1: divide the speech signal into frames of fixed duration, with a frame length of 20 ms and a frame shift of 10 ms. Within each frame, compute the autocorrelation function of the speech and approximate the pitch period by the first secondary peak of the autocorrelation function; the reciprocal of the pitch period is the fundamental frequency.
Step a2: from the fundamental frequency obtained in step a1 (zero for unvoiced, nonzero for voiced), decide whether the frame is unvoiced or voiced. If voiced, set a maximum voiced frequency for it, dividing the main energy regions of the harmonic and stochastic components. In the band below the maximum voiced frequency, model the signal as a superposition of several sinusoids and solve for their discrete amplitudes and phases with a constrained least-squares algorithm; the band above the maximum voiced frequency is left unprocessed. A sketch of these two steps is given below.
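The following is a minimal Python sketch of the framing and autocorrelation-based fundamental frequency estimation of steps a1 and a2. The frame sizes follow the 20 ms / 10 ms values above; the pitch search band and voicing threshold are illustrative assumptions, and the least-squares harmonic estimation of step a2 is omitted for brevity.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20.0, shift_ms=10.0):
    """Split a signal into fixed-duration frames (20 ms length, 10 ms shift)."""
    flen = int(fs * frame_ms / 1000)
    fshift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    return np.stack([x[i * fshift:i * fshift + flen] for i in range(n_frames)])

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0, voicing_threshold=0.3):
    """Autocorrelation pitch estimate for one frame; 0.0 means unvoiced."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0                      # silent frame -> treat as unvoiced
    ac = ac / ac[0]
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    # A weak secondary peak is taken as unvoiced (F0 = 0), as in step a2.
    return fs / lag if ac[lag] > voicing_threshold else 0.0

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    voiced = np.sin(2 * np.pi * 150 * t)   # synthetic voiced signal at 150 Hz
    f0_track = [estimate_f0(f, fs) for f in frame_signal(voiced, fs)]
    print(round(float(np.median(f0_track)), 1))   # ~149.5, close to 150 Hz
```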
The extraction of characteristic parameters in step S120 comprises extracting the voiced and unvoiced characteristic parameters of the source and target speech data, yielding the characteristic parameters of the source or target speech data.
The voiced characteristic parameters are extracted as follows:
Step S121, square the harmonic amplitudes obtained in step S112.
Step S122, using the one-to-one correspondence between the power spectral density function and the autocorrelation function, obtain a Toeplitz matrix equation in the voiced linear prediction coefficients and solve it for those coefficients.
Step S123, convert the voiced linear prediction coefficients into voiced line spectral frequency (LSF) coefficients; these are the voiced characteristic parameters.
The concrete implementation comprises the following steps (see the sketch after step c3).
Step c1: square the amplitudes of the discrete sinusoids obtained in step a2; these values are taken as an approximate discrete power spectrum.
Step c2: by the one-to-one correspondence between the power spectral density function and the autocorrelation function, inverse-transform the discrete power spectrum of step c1 to obtain an approximate autocorrelation function; then, following linear prediction analysis, solve the resulting Toeplitz equation for the linear prediction coefficients.
Step c3: form the symmetric and antisymmetric polynomials from the linear prediction coefficients; their roots on the unit circle of the Z-domain are the line spectral frequencies, which are the voiced characteristic parameters.
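A minimal Python sketch of steps c1 to c3 follows, assuming the harmonic amplitudes and their angular frequencies are available from step a2. The Levinson-Durbin recursion solves the Toeplitz equation of step c2, and the LSF conversion of step c3 finds the unit-circle roots of the sum and difference polynomials. The demo values (F0, sampling rate, prediction order) are illustrative.

```python
import numpy as np

def autocorr_from_harmonics(amplitudes, omegas, order):
    """Steps c1-c2: squared harmonic amplitudes act as a discrete power
    spectrum; its cosine transform gives autocorrelation lags r[0..order]."""
    power = np.asarray(amplitudes, dtype=float) ** 2          # step c1
    lags = np.arange(order + 1)
    # r[k] = sum_h P_h * cos(k * omega_h), up to a constant factor.
    return power @ np.cos(np.outer(omegas, lags))

def levinson_durbin(r, order):
    """Step c2: solve the Toeplitz normal equations for the LP coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a[1:i + 1] += k * a[i - 1::-1][:i]    # reflection update, sets a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_to_lsf(a):
    """Step c3: the unit-circle roots of the sum/difference polynomials of
    A(z) are the line spectral frequencies (returned sorted, in radians)."""
    P = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    Q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])

if __name__ == "__main__":
    # Hypothetical voiced frame: 4 harmonics of F0 = 200 Hz at fs = 8 kHz.
    fs, order = 8000.0, 4
    omegas = 2 * np.pi * 200.0 * np.arange(1, 5) / fs
    r = autocorr_from_harmonics([1.0, 0.5, 0.25, 0.125], omegas, order)
    lsf = lpc_to_lsf(levinson_durbin(r, order))
    print(np.round(lsf, 3))   # 4 line spectral frequencies in (0, pi)
```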
The unvoiced characteristic parameters are extracted in step S121': each unvoiced signal is analyzed with linear prediction analysis to obtain the unvoiced linear prediction coefficients, which are then converted into unvoiced line spectral frequency coefficients; these are the unvoiced characteristic parameters. Concretely, if a frame is unvoiced, it is analyzed with classical linear prediction analysis: an all-pole model is established and its coefficients are solved by least squares, yielding the linear prediction coefficients that serve as the unvoiced characteristic parameters.
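For the unvoiced branch, a short sketch under the same assumptions, reusing levinson_durbin from the voiced-branch sketch above: classical autocorrelation-method linear prediction applied directly to the unvoiced frame. The Hamming window is an illustrative choice.

```python
import numpy as np

def unvoiced_lpc(frame, order):
    """Step S121': classical LP analysis of a windowed unvoiced frame."""
    w = frame * np.hamming(len(frame))
    # Biased autocorrelation lags 0..order of the windowed frame.
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    # Reuses levinson_durbin from the voiced-branch sketch above.
    return levinson_durbin(r, order)
```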
In step S120 the alignment of the two sets of characteristic parameters comprises:
Step S124, for the source and target characteristic parameters, use a dynamic time warping algorithm to map the time axis of one parameter sequence nonlinearly onto the time axis of the other, establishing a one-to-one match.
Step S125, during alignment, iteratively optimize a preset cumulative distortion function within a restricted search region to obtain the time-matching function.
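A minimal Python sketch of steps S124 and S125 is given below; it accumulates a Euclidean frame distortion under the usual local continuity constraints and backtracks the optimal warping path. The patent's specific cumulative distortion function and search-region restriction are not spelled out, so plain DTW with the standard three-move constraint stands in for them.

```python
import numpy as np

def dtw_align(src, tgt):
    """Steps S124-S125: align two feature sequences (rows = frames) and
    return the matched (source_index, target_index) pairs."""
    n, m = len(src), len(tgt)
    # Frame-to-frame Euclidean distortion matrix.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    # Cumulative distortion with diagonal/vertical/horizontal moves.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j - 1],
                                               D[i - 1, j], D[i, j - 1])
    # Backtrack the optimal warping path (the time-matching function).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

# Usage: pairs = dtw_align(source_lsf, target_lsf); the pairs define the
# frame-matched parallel training set for the mixture of Gaussian processes.
```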
Fig. 4 shows the flow chart of characteristic parameter training.
As shown in Fig. 4, the characteristic parameter training in step S130 comprises:
Step S131, build a Gaussian process model with a mixture structure.
Step S132, approximately estimate the model parameters by the Markov chain Monte Carlo method, i.e. the joint posterior probability density function of the parameters and the membership indicator variables.
Step S133, iteratively marginalize the probability density function to obtain separate estimates of the distribution of the membership indicator variables and of the model parameters, thereby determining the structural parameters of the mixture of Gaussian processes.
Fig. 5 shows the flow chart of the conversion stage.
As shown in Fig. 5, the source characteristic parameters extracted again in step S200 are obtained by the same extraction method as for the source speech data in step S100, and transforming them with the training results comprises: step S210, computing the membership value of the current speech frame from the newly extracted source parameters and the trained structural parameters of the mixture of Gaussian processes; step S220, producing the corresponding output in the subspace of each cluster's mixture component according to the mixture of Gaussian processes; and step S230, superposing the outputs with the membership values as weights to obtain the mapped characteristic parameters.
Synthesizing the reconstructed speech data in step S200 comprises: using the discrete harmonic amplitudes and phases of the voiced signal in the characteristic parameters obtained in step S230 as the amplitudes and phases of superposed sinusoids, and superposing the result with the unvoiced signal generated by an all-pole filter, to obtain the reconstructed speech data. Optionally, interpolation and phase compensation are applied so that the reconstructed signal is free of time-domain waveform distortion.
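A minimal Python sketch of this synthesis step, assuming per-frame harmonic amplitudes and phases, a fundamental frequency, unvoiced linear prediction coefficients, and a noise gain are available; scipy's lfilter realizes the all-pole filter. Frame overlap-add, interpolation, and phase compensation are omitted.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(f0, amps, phases, lpc_a, noise_gain, fs, frame_len):
    """Reconstruct one frame: harmonic sinusoids plus all-pole-filtered noise."""
    t = np.arange(frame_len) / fs
    voiced = np.zeros(frame_len)
    if f0 > 0:
        # Superpose sinusoids at the converted harmonic amplitudes and phases.
        for h, (amp, phi) in enumerate(zip(amps, phases), start=1):
            voiced += amp * np.cos(2.0 * np.pi * h * f0 * t + phi)
    # Unvoiced part: white noise shaped by the all-pole filter 1/A(z).
    noise = np.random.default_rng(0).standard_normal(frame_len)
    unvoiced = noise_gain * lfilter([1.0], lpc_a, noise)
    return voiced + unvoiced   # superposition of the two components
```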
Both parameter training and conversion take the Gaussian process as their theoretical foundation and extend the basic framework with a mixture structure that models the data cluster by cluster, improving accuracy. Meanwhile, thanks to the nonlinear mapping property of Gaussian processes, the system can realize conversion between characteristic parameters whose mutual relationship is relatively complex. The whole procedure comprises two stages, parameter training and conversion; the concrete steps are as follows.
Parameter training steps:
Step e1: build the Gaussian process model with the mixture structure. Let the number of mixture components be N, and let the weight of each component be r_i, i = 1, 2, ..., N. Then, given the set of input and output vectors (the characteristic parameters extracted from the source), the output vector sequence is approximately equal to a weighted combination of N Gaussian processes whose input is the given input vector sequence. The weights and the mean and variance parameters of each Gaussian process are the unknown parameters to be estimated.
Step e2: approximately estimate the joint posterior probability density function of the weights and the model parameters (the means and variances of the Gaussian processes) by the Markov chain Monte Carlo method. First assume that the weights and the model parameters are mutually independent; then estimate both probability density functions progressively by iteration. In each iteration, fix one group of unknown variables and sample the other, approximating its probability distribution from a large number of samples. Finally, multiply the probability distribution function of the weights by that of the model parameters to obtain the joint posterior probability function.
Step e3: marginalize the joint probability density function to obtain separate estimates of the probability distribution of the weights and of the model parameters. The structure of the mixture of Gaussian processes is thereby determined. (A simplified sketch of steps e1 to e3 follows.)
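The following heavily simplified Python sketch illustrates the alternating scheme of steps e1 to e3: it fixes one group of unknowns while sampling the other, then estimates the marginals from the retained samples. For brevity, each component's Gaussian process regressor is replaced by a least-squares linear map with isotropic noise; the component count, priors, and iteration budget are illustrative assumptions, not the patent's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_mixture(X, Y, n_components=4, n_iter=200, noise_var=0.1):
    """Alternating sampler in the spirit of steps e1-e3 (simplified)."""
    n = len(X)
    z = rng.integers(n_components, size=n)               # membership indicators
    weights = np.full(n_components, 1.0 / n_components)  # mixture weights r_i
    # Least-squares linear maps stand in for the per-component GP regressors.
    W = [np.zeros((X.shape[1], Y.shape[1])) for _ in range(n_components)]
    w_samples = []
    for _ in range(n_iter):
        # Fix memberships, refit each component's map on its current members.
        for k in range(n_components):
            idx = np.flatnonzero(z == k)
            if len(idx) > X.shape[1]:
                W[k] = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
        # Fix the maps, sample every membership indicator from its posterior.
        for i in range(n):
            resid = np.array([np.sum((Y[i] - X[i] @ W[k]) ** 2)
                              for k in range(n_components)])
            logp = np.log(weights) - resid / (2.0 * noise_var)
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(n_components, p=p / p.sum())
        # Sample the weights from their Dirichlet posterior given the counts.
        counts = np.bincount(z, minlength=n_components)
        weights = rng.dirichlet(1.0 + counts)
        w_samples.append(weights.copy())
    # Step e3 analogue: marginalize by averaging the retained weight samples.
    return np.mean(w_samples[n_iter // 2:], axis=0), z, W

# Usage sketch: weights_hat, z_hat, maps = gibbs_mixture(X_src, Y_tgt_aligned)
```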
Conversion steps:
Step f1: given the set of input observation vectors, compute the membership value of the current speech frame from the trained structural parameters of the mixture of Gaussian processes; the so-called membership function is the normalized posterior weight coefficient.
Step f2: from the membership values, determine which Gaussian sub-component the current speech belongs to; then, in the subspace of each cluster, produce the corresponding output according to the definition of the Gaussian process.
Step f3: superpose the outputs of all components, with the membership values as weights, to finally obtain the mapped speech characteristic parameters. These are the voiced and unvoiced characteristic parameters of the source and target speech data. (A sketch of steps f1 to f3 follows.)
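A minimal Python sketch of steps f1 to f3, assuming one trained regressor per cluster exposing a predict method (for example a scikit-learn GaussianProcessRegressor fitted per cluster during training), plus trained weights and cluster centers. The Gaussian-kernel responsibility used for the membership values is an illustrative stand-in for the normalized posterior weight coefficients.

```python
import numpy as np

def convert_frame(x, gp_models, weights, cluster_means, bandwidth=1.0):
    """Map one source feature frame to the target space (steps f1-f3)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    # Step f1: membership values, here a weighted Gaussian-kernel
    # responsibility around each cluster mean (illustrative stand-in for
    # the normalized posterior weight coefficients).
    d2 = np.array([np.sum((x - m) ** 2) for m in cluster_means])
    post = weights * np.exp(-d2 / (2.0 * bandwidth ** 2))
    membership = post / post.sum()
    # Step f2: each cluster's regressor produces its own output.
    outputs = np.stack([gp.predict(x)[0] for gp in gp_models])
    # Step f3: superpose the outputs, weighted by the membership values.
    return membership @ outputs
```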
The reconstructed speech data are the reconstructed target speech data.
Embodiment 2
Fig. 6 shows the block diagram of the clustered voice conversion system.
A clustered voice conversion system built on embodiment 1 comprises:
a training unit, adapted to extract characteristic parameters from the source and target speech data respectively and then train on them; and
a conversion-and-synthesis unit, adapted to transform the newly extracted source characteristic parameters with the training results and synthesize the reconstructed speech data. See steps S100 and S200 in embodiment 1.
The training unit comprises: a first speech analysis module that decomposes the source and target speech data respectively; a first parameter preprocessing module that extracts characteristic parameters from the decomposition results of the source and target speech and aligns them; and a characteristic parameter training module that trains the mixture of Gaussian processes on the characteristic parameters. The first speech analysis module is adapted to decompose speech into voiced and unvoiced signals.
The first parameter preprocessing module comprises: a first feature extraction submodule, adapted to extract the characteristic parameters of the voiced and unvoiced signals, yielding the characteristic parameters of the source or target speech data; and a characteristic parameter alignment submodule, adapted to map, by dynamic time warping, the time axis of one parameter sequence nonlinearly onto the time axis of the other for a one-to-one match, iteratively optimizing a preset cumulative distortion function within a restricted search region during alignment to obtain the time-matching function.
For the implementation of the training unit, see the characteristic parameter extraction of step S100 in embodiment 1, i.e. step S110, speech analysis; step S120, parameter preprocessing; and step S130, characteristic parameter training.
The conversion-and-synthesis unit comprises:
a second speech analysis module that decomposes the source speech data; a second feature extraction submodule, within a second parameter preprocessing module, that extracts characteristic parameters from the decomposition result of the source; and a conversion module, connected to the characteristic parameter training module and the second parameter preprocessing module, that transforms the extracted source parameters with the training results via the mixture of Gaussian processes.
The second speech analysis module processes the source speech data by the same method as the first speech analysis module, and the first and second feature extraction submodules extract characteristic parameters from the decomposition result of the source by the same method.
The conversion-and-synthesis unit also comprises a speech synthesis module for superposing the voiced signal in the characteristic parameters reconstructed by the conversion module with the unvoiced signal, obtaining the reconstructed speech data.
The speech synthesis module is adapted to use the discrete harmonic amplitudes and phases of the voiced signal as the amplitudes and phases of superposed sinusoids and to superpose the result with the unvoiced signal generated by an all-pole filter, obtaining the reconstructed speech data.
For the operating steps of the conversion module and the speech synthesis module, see the description in embodiment 1 of step S200, where the source characteristic parameters are extracted by the same method as in step S100 and the newly extracted source parameters are transformed with the training results.
The functions and steps of the modules in this embodiment are otherwise identical to those in embodiment 1 and are not repeated here.
Embodiment 3
A working method of a voice conversion system built on embodiments 1 and 2 comprises: capturing, through training, the mapping relationship between the characteristic parameter sets of the source and target speech data, and using it to obtain reconstructed speech data.
Capturing the mapping relationship comprises training with a Gaussian process model, which comprises the following steps:
first, build a Gaussian process model with a mixture structure; then approximately estimate its parameters by the Markov chain Monte Carlo method, i.e. the joint posterior probability density function of the parameters and the membership indicator variables; finally, iteratively marginalize the probability density function to obtain separate estimates of the distribution of the membership indicator variables and of the model parameters, thereby determining the structural parameters of the mixture of Gaussian processes.
Capturing the mapping relationship also comprises conversion with the mixture of Gaussian processes, which comprises the following steps:
first, compute the membership value of the current speech frame from the newly extracted source characteristic parameters and the trained structural parameters of the mixture of Gaussian processes; then produce the corresponding output in the subspace of each cluster's mixture component; finally, superpose the outputs with the membership values as weights to obtain the mapped characteristic parameters.
The characteristic parameters are the respective characteristic parameters of the voiced and unvoiced signals.
The discrete harmonic amplitudes and phases of the voiced signal are used as the amplitudes and phases of superposed sinusoids, which are superposed with the unvoiced signal generated by an all-pole filter to obtain the reconstructed speech data.
Both the Gaussian process training method (parameter training for short) and the mixture-of-Gaussian-processes conversion method (conversion for short) take the Gaussian process as their theoretical foundation and extend the basic framework with a mixture structure that models the data cluster by cluster, improving accuracy. Thanks to the nonlinear mapping property of Gaussian processes, the system can realize conversion between characteristic parameters whose mutual relationship is relatively complex. The whole procedure comprises the two stages of parameter training and conversion; the concrete steps e1 to e3 (parameter training) and f1 to f3 (conversion) are identical to those described in embodiment 1 and are not repeated here.
The working method of the voice conversion system described in this embodiment also comprises: the method of decomposing the source and target speech data, the parameter preprocessing methods (characteristic parameter extraction and alignment), and the speech synthesis method.
The functions and steps involved in this embodiment are otherwise identical to those in embodiments 1 and 2 and are not repeated here.
Taking the above ideal embodiments of the invention as a guide, persons skilled in the art can, through the description above, make various changes and modifications without departing from the technical idea of the invention. The technical scope of the invention is not limited to the contents of the specification and must be determined according to the scope of the claims.

Claims (20)

1. A clustered voice conversion method, comprising the steps of:
step S100, the training stage, in which characteristic parameters are extracted from source and target speech data respectively and then trained; and
step S200, the conversion stage, in which characteristic parameters newly extracted from the source are transformed with the training results and synthesized into reconstructed speech data.
2. The clustered voice conversion method according to claim 1, characterized in that extracting characteristic parameters from the source and target speech data in step S100 comprises:
step S110, speech analysis, in which the source and target speech data are each decomposed;
step S120, parameter preprocessing, in which characteristic parameters are extracted from the decomposed source and target speech data and the two sets of parameters are aligned; and
step S130, characteristic parameter training, in which the two sets of parameters are used to train the mixture of Gaussian processes.
3. The clustered voice conversion method according to claim 2, characterized in that the speech decomposition of the source and target speech data in step S110 comprises:
step S111, dividing the source or target speech signal into frames of fixed duration and estimating the fundamental frequency with the cross-correlation method, thereby obtaining the voiced and unvoiced signals; and
step S112, setting a maximum voiced frequency for each voiced signal to divide the main energy regions of the harmonic and stochastic components, then estimating the discrete harmonic amplitudes and phases with a least-squares algorithm.
4. The clustered voice conversion method according to claim 3, characterized in that extracting characteristic parameters in step S120 comprises extracting the voiced and unvoiced characteristic parameters of the source and target speech data, yielding the characteristic parameters of the source or target speech data; wherein
extracting the voiced characteristic parameters comprises:
step S121, squaring the harmonic amplitudes obtained in step S112;
step S122, using the one-to-one correspondence between the power spectral density function and the autocorrelation function to obtain a Toeplitz matrix equation in the voiced linear prediction coefficients, and solving it for those coefficients;
step S123, converting the voiced linear prediction coefficients into voiced line spectral frequency coefficients, which are the voiced characteristic parameters; and
extracting the unvoiced characteristic parameters comprises: analyzing each unvoiced signal with linear prediction analysis to obtain the unvoiced linear prediction coefficients, then converting them into unvoiced line spectral frequency coefficients, which are the unvoiced characteristic parameters.
5. The clustered voice conversion method according to claim 4, characterized in that aligning the two sets of characteristic parameters in step S120 comprises:
step S124, for the source and target characteristic parameters, using a dynamic time warping algorithm to map the time axis of one parameter sequence nonlinearly onto the time axis of the other, establishing a one-to-one match;
step S125, during alignment, iteratively optimizing a preset cumulative distortion function within a restricted search region to obtain the time-matching function.
6. The clustered voice conversion method according to claim 5, characterized in that the characteristic parameter training in step S130 comprises:
step S131, building a Gaussian process model with a mixture structure;
step S132, approximately estimating the model parameters by the Markov chain Monte Carlo method, i.e. the joint posterior probability density function of the parameters and the membership indicator variables;
step S133, iteratively marginalizing the probability density function to obtain separate estimates of the distribution of the membership indicator variables and of the model parameters, thereby determining the structural parameters of the mixture of Gaussian processes.
7. The clustered voice conversion method according to claim 6, characterized in that the source characteristic parameters extracted again in step S200 are obtained by the same extraction method as for the source speech data in step S100, and transforming them with the training results comprises:
step S210, computing the membership value of the current speech frame from the newly extracted source parameters and the trained structural parameters of the mixture of Gaussian processes;
step S220, producing the corresponding output in the subspace of each cluster's mixture component according to the mixture of Gaussian processes;
step S230, superposing the outputs with the membership values as weights to obtain the mapped characteristic parameters.
8. The clustered voice conversion method according to claim 7, characterized in that synthesizing the reconstructed speech data in step S200 comprises:
using the discrete harmonic amplitudes and phases of the voiced signal in the characteristic parameters obtained in step S230 as the amplitudes and phases of superposed sinusoids, and superposing the result with the unvoiced signal generated by an all-pole filter, to obtain the reconstructed speech data.
9. A clustered voice conversion system, characterized by comprising:
a training unit adapted to extract characteristic parameters from the source and target speech data respectively and then train on them; and
a conversion-and-synthesis unit adapted to transform the newly extracted source characteristic parameters with the training results and synthesize the reconstructed speech data.
10. The clustered voice conversion system according to claim 9, characterized in that the training unit comprises:
a first speech analysis module that decomposes the source and target speech data respectively;
a first parameter preprocessing module that extracts characteristic parameters from the decomposition results of the source and target speech and aligns them; and
a characteristic parameter training module that trains the mixture of Gaussian processes on the characteristic parameters.
11. The clustered voice conversion system according to claim 10, characterized in that the conversion-and-synthesis unit comprises:
a second speech analysis module that decomposes the source speech data;
a second feature extraction submodule, within a second parameter preprocessing module, that extracts characteristic parameters from the decomposition result of the source; and
a conversion module, connected to the characteristic parameter training module and the second parameter preprocessing module, that transforms the extracted source parameters with the training results via the mixture of Gaussian processes.
12. The clustered voice conversion system according to claim 11, characterized in that the first speech analysis module is adapted to decompose speech into voiced and unvoiced signals.
13. The clustered voice conversion system according to claim 12, characterized in that the first parameter preprocessing module comprises:
a first feature extraction submodule adapted to extract the characteristic parameters of the voiced and unvoiced signals, yielding the characteristic parameters of the source or target speech data;
a characteristic parameter alignment submodule adapted to map, for the source and target characteristic parameters, the time axis of one parameter sequence nonlinearly onto the time axis of the other by dynamic time warping, establishing a one-to-one match, and, during alignment, to iteratively optimize a preset cumulative distortion function within a restricted search region to obtain the time-matching function.
14. The clustered voice conversion system according to claim 13, characterized in that the conversion-and-synthesis unit also comprises:
a speech synthesis module for superposing the voiced signal in the characteristic parameters reconstructed by the conversion module with the unvoiced signal, obtaining the reconstructed speech data.
15. The clustered voice conversion system according to claim 14, characterized in that the speech synthesis module is adapted to use the discrete harmonic amplitudes and phases of the voiced signal as the amplitudes and phases of superposed sinusoids and to superpose the result with the unvoiced signal generated by an all-pole filter, obtaining the reconstructed speech data.
16. A working method of a voice conversion system, characterized in that the mapping relationship between the characteristic parameter sets of source and target speech data is captured through training and used to obtain reconstructed speech data.
17. The working method according to claim 16, characterized in that capturing the mapping relationship comprises training with a Gaussian process model, which comprises the steps of:
first building a Gaussian process model with a mixture structure;
then approximately estimating the parameters of the Gaussian process model by the Markov chain Monte Carlo method, i.e. the joint posterior probability density function of the parameters and the membership indicator variables;
finally iteratively marginalizing the probability density function to obtain separate estimates of the distribution of the membership indicator variables and of the model parameters, thereby determining the structural parameters of the mixture of Gaussian processes.
18. The working method according to claim 17, characterized in that capturing the mapping relationship also comprises conversion with the mixture of Gaussian processes, which comprises the steps of:
first computing the membership value of the current speech frame from the newly extracted source characteristic parameters and the trained structural parameters of the mixture of Gaussian processes;
then producing the corresponding output in the subspace of each cluster's mixture component according to the mixture of Gaussian processes;
finally superposing the outputs with the membership values as weights to obtain the mapped characteristic parameters.
19. The working method according to claim 18, characterized in that the characteristic parameters are the respective characteristic parameters of the voiced and unvoiced signals.
20. The working method according to claim 19, characterized in that the discrete harmonic amplitudes and phases of the voiced signal are used as the amplitudes and phases of superposed sinusoids, which are superposed with the unvoiced signal generated by an all-pole filter to obtain the reconstructed speech data.
Priority application: CN201410662811.8A, filed 2014-11-19 (priority date 2014-11-19) — Cluster voice transforming method and system based on mixture Gaussian random process, status: Pending.
Publication: CN104464744A (en), published 2015-03-25.
Family ID: 52910689.

Patent Citations (7)

CN101751921A (2010-06-23): Real-time voice conversion method under conditions of minimal amount of training data
CN102568476A (2012-07-11): Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN102930863A (2013-02-13): Voice conversion and reconstruction method based on simplified self-adaptive interpolation weighting spectrum model
CN102968988A (2013-03-13): Robustness voice conversion method with insufficient training data
CN103531205A (2014-01-22): Asymmetrical voice conversion method based on deep neural network feature mapping
CN104091592A (2014-10-08): Voice conversion system based on hidden Gaussian random field
CN104123933A (2014-10-29): Self-adaptive non-parallel training based voice conversion method

Non-Patent Citations (1)

吴宪 et al., "Voice conversion method based on fuzzy vector quantization" (基于模糊矢量量化的语音转换方法), Informatization Research (《信息化研究》)

Cited By (9)

CN106205623A (2016-12-07): Sound converting method and device
CN106773692A (2017-05-31): Equipment operation control system and method based on Gaussian distribution
CN107610717A (2018-01-19): Many-to-one voice conversion method based on voice posterior probability
CN107610717B (2021-07-06): Many-to-one voice conversion method based on voice posterior probability (granted)
CN109147806A (2019-01-04): Voice quality enhancement method, device and system based on deep learning
CN109147806B (2021-11-12): Voice tone enhancement method, device and system based on deep learning (granted)
CN109887484A (2019-06-14): Voice recognition and speech synthesis method and device based on dual learning
CN109887484B (2023-08-04): Dual learning-based voice recognition and voice synthesis method and device (granted)
CN113576412A (2021-11-02): Difficult airway assessment method and device based on machine learning voice technology


Legal Events

C06 / PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 2015-03-25)