CN102664003A - Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM) - Google Patents

Info

Publication number: CN102664003A (granted as CN102664003B)
Application number: CN2012101218866A
Inventors: 解伟超, 张玲华, 吴丽芳
Current assignee: Nanjing University of Posts and Telecommunications
Application filed by: Nanjing University of Posts and Telecommunications
Legal status: Granted; Expired - Fee Related
Classifications: Telephonic Communication Services; Compression, Expansion, Code Conversion, And Decoders
Abstract

The invention discloses a residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model (HNM), belonging to the field of speech signal processing. The method comprises the steps of pre-processing, voiced/unvoiced decision, harmonic parameter extraction, vocal tract spectrum parameter computation, establishment of a vocal tract spectrum conversion rule, feature parameter conversion, residual excitation prediction, and speech synthesis with residual compensation. When the excitation signal is constructed, a moderate amount of the random residual signal produced by HNM analysis is linearly superposed on the residual of the voiced-frame harmonic signal extracted by HNM analysis to form the predicted excitation source signal. This effectively enhances the speaker's suprasegmental features carried by the excitation source and avoids the distortion introduced by the manual modification of the excitation signal in conventional methods. In the synthesis stage, a moderate amount of the residual of the target voiced-frame harmonic signal obtained by HNM analysis is superposed frame by frame onto the synthesized speech, so that the converted speech carries more of the target speaker's individuality and the speech quality is improved.

Description

Residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model
Technical field
The present invention relates to voice conversion techniques, and in particular to a residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model; it belongs to the field of speech signal processing.
Background technology
Voice conversion is a research branch of the speech signal processing field that has emerged in recent years. It builds on research in speaker recognition and speech synthesis, and both enriches and extends these two branches, yet does not belong entirely to either of them.
The goal of voice conversion is to change the personal characteristics in a source speaker's speech, while keeping the semantic information unchanged, so that the converted speech sounds like the target speaker. A voice conversion system operates in two phases: training and conversion. In the training phase, the system analyzes the source and target speakers' speech, extracts their parameters, and establishes conversion rules. In the conversion phase, features are first extracted from the source speech and then converted into target speech features according to the conversion rules obtained in the training phase.
The features of a speech signal fall into two classes: segmental and suprasegmental. Segmental features describe the timbre of the speech and mainly include the positions and bandwidths of the vocal tract formants, spectral tilt, fundamental frequency, and so on. Suprasegmental features describe the prosody and excitation-source information of the speech; their parameters mainly include dynamic characteristics such as phoneme duration, energy, pitch-period contour, and the variation of the spectral envelope.
The key issues in voice conversion are the extraction of speaker-specific features and the establishment of conversion rules, and two decades of development have produced a large body of results. Current research on speech feature parameters concentrates mainly on the segmental features of the speech signal, while the suprasegmental features of the excitation source are rarely addressed. The main current method for estimating the excitation source is residual prediction based on the linear prediction coding (LPC) model. However, when the residual signal obtained by linear prediction is used as the excitation, it carries little of the target speaker's individuality and has low energy, so the converted speech is of poor quality. (1. Suendermann D., Bonafonte A., Ney H., Hoege H., "A Study on Residual Prediction Techniques for Voice Conversion", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 13-16, 2005. 2. Percybrooks W. S., Moore E., "Voice conversion with linear prediction residual estimation", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4673-4676, March 2008.)
In addition, existing voice conversion systems also compute a companding ratio of the fundamental frequency from its mean value, or manually modify the excitation source signal by means such as duration insertion and cutting. However, the suprasegmental characteristics of the excitation source depend on the speaker's state at the time and are influenced by the speaker's environment. Manual modification of the excitation signal therefore cannot accurately describe the suprasegmental information of the excitation source, and it introduces distortion. (3. Xuejing Sun, "Voice quality conversion in TD-PSOLA speech synthesis", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II953-II956, 2000. 4. Wang Yuan-yuan, Yang Shun, "Speech synthesis based on PSOLA algorithm and modified pitch parameters", International Conference on Computational Problem-Solving (ICCP), pp. 296-299, 2010.)
Summary of the invention
The object of the present invention is to provide a voice conversion algorithm that, under parallel text, combines the characteristics of the speech signal with the speaker's personal characteristics. It focuses on the extraction and prediction of the suprasegmental information of the excitation source and, through improvement of the excitation source signal and compensation of the converted speech, strengthens the target speaker's individuality in the synthesized speech and improves the performance of the conversion system.
To achieve the above object, the present invention adopts the following technical scheme:
A residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model, with the following concrete steps:
Step 1: pre-processing and voiced/unvoiced decision. Pre-emphasize, frame, and window the source and target speech respectively; compute the short-time energy and average zero-crossing rate of each frame to complete the voiced/unvoiced decision.
Step 2: harmonic parameter extraction. Analyze the voiced frames of the source and target speech respectively with the harmonic plus noise model (HNM). First compute the fundamental frequency of each voiced frame; the HNM then decomposes the voiced frame into a harmonic signal and a wideband random signal. Compute the number of harmonics and extract the amplitude, phase, and frequency of each harmonic; unvoiced frames are regarded as random noise and kept unchanged.
Step 3: vocal tract spectrum parameter computation. Transform the harmonic amplitudes extracted from the voiced frames of the source and target speech: take the squared amplitudes as samples of the discrete power spectrum, obtain the autocorrelation coefficients via the inverse fast Fourier transform (IFFT), and perform LPC analysis with the Levinson-Durbin algorithm to obtain the line spectral frequency (LSF) parameters and corresponding residual signals of the source and target speech.
Step 4: establishment of the vocal tract spectrum conversion rule. After aligning the LSF parameters of the source and target speech with dynamic time warping (DTW), feed them into a Gaussian mixture model (GMM) for probabilistic modeling.
Step 5: feature parameter conversion. First analyze the speech to be converted with the HNM and, following steps 2 and 3, extract the LSF parameters and residual signal to be converted; then feed the LSF parameters to be converted into the GMM conversion rule established in step 4 for conversion.
Step 6: residual excitation prediction. Frame by frame, first find the target LSF parameter closest to the converted LSF parameter, then linearly superpose the residual signal corresponding to that target LSF parameter with the frame's random signal remaining after HNM analysis, and take the result as the residual excitation signal.
Step 7: speech synthesis and residual compensation. From the converted LSF parameters and the residual excitation signal obtained in steps 5 and 6, synthesize each frame of converted speech with the LPC synthesis model; then superpose on each synthesized frame a moderate amount of the corresponding target residual signal, and finally obtain the synthesized speech after overlap-add.
Compared with the prior art, the present invention has the following notable advantages: (1) when the excitation signal is constructed, a moderate amount of the random residual signal (the wideband random signal) produced by HNM analysis is linearly superposed on the residual of the voiced-frame harmonic signal extracted by HNM analysis to form the predicted excitation source signal; this effectively enhances the speaker's suprasegmental features carried by the excitation source while avoiding the distortion introduced by the manual modification of the excitation signal in conventional methods; (2) in the synthesis stage, a moderate amount of the residual of the target voiced-frame harmonic signal obtained by HNM analysis is superposed frame by frame onto the synthesized speech, so that the converted speech carries more of the target speaker's individuality and the speech quality is improved.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to the present invention;
Fig. 2 is a schematic diagram of feature parameter extraction and conversion rule establishment;
Fig. 3 is a schematic diagram of feature parameter conversion and residual excitation signal prediction based on the HNM model;
Fig. 4 is a schematic diagram of voiced-frame parameter conversion and speech synthesis for the i-th frame.
Embodiment
With reference to Fig. 1, the residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model comprises the following steps:
Step 1: in the training stage, first carry out pre-processing and the voiced/unvoiced decision: pre-emphasize, frame, and window the source and target speech respectively, compute the short-time energy and average zero-crossing rate of each frame, and complete the voiced/unvoiced decision. The detailed process is as follows:
(1) Pre-process the source and target speech signals respectively: the pre-emphasis factor is 0.96, the frame length is 20 ms with no overlap, and a Hamming window is then applied;
(2) Compute frame by frame the short-time energy E_i = Σ_{m=0}^{N−1} x_i²(m) and the short-time zero-crossing rate Z_i = (1/2) Σ_{m=0}^{N−1} |sgn[x_i(m)] − sgn[x_i(m−1)]|, where x_i(m) is the i-th windowed speech frame and N is the frame length; then apply the double-threshold method to make the voiced/unvoiced decision.
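The pre-processing and voicing decision above can be sketched in Python as follows. The energy and zero-crossing-rate formulas follow the patent; the threshold values in the usage below are hypothetical, since the patent does not specify them:

```python
import numpy as np

def preemphasis(x, alpha=0.96):
    """Pre-emphasis y[n] = x[n] - alpha*x[n-1], with the 0.96 factor from the patent."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len):
    """Split into non-overlapping frames (20 ms, zero overlap) and Hamming-window them."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.hamming(frame_len)

def short_time_energy(frames):
    """E_i = sum_m x_i(m)^2 per frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Z_i = 1/2 * sum_m |sgn[x_i(m)] - sgn[x_i(m-1)]| per frame."""
    return 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

def voiced_unvoiced(frames, e_thresh, z_thresh):
    """Double-threshold decision: voiced frames have high energy and low ZCR."""
    return (short_time_energy(frames) > e_thresh) & (zero_crossing_rate(frames) < z_thresh)
```

For example, at an 8 kHz sampling rate a 20 ms frame is 160 samples, and a low-frequency periodic segment is classified as voiced while low-level noise is not.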
Step 2: harmonic parameter extraction, as shown in Fig. 2. Analyze the voiced frames of the source and target speech respectively with the HNM model: first compute the fundamental frequency of each voiced frame, then decompose the voiced frame with the HNM model into a harmonic signal and a wideband random signal, compute the number of harmonics, and extract the amplitude, phase, and frequency of each harmonic; unvoiced frames are regarded as random noise and kept unchanged. The detailed process is as follows:
(1) Compute the fundamental frequency f_0 of the current frame of the source and target speech respectively with the normalized cross-correlation method;
(2) Analyze the source and target speech respectively. If the current frame s(n), where 1 ≤ n ≤ N and N is the frame length, is voiced, decompose it into a harmonic component s_h(n) and a random component e(n). First determine the number of harmonics L = ⌊F_s/(2f_0)⌋, where F_s is the sampling frequency. The objective function is C_l = argmin Σ_{n=−N/2}^{N/2} w²(n)[s(n) − s_h(n)]², where w(n) denotes a Hamming window. The complex amplitudes {C_l, l = −L, −L+1, …, L} are estimated under the least-squares criterion; the real amplitude of the l-th harmonic component can then be expressed as A_l = 2|C_l| = 2|C_{−l}|, and the real phase as φ_l = arg(C_l);
(3) Between two adjacent frames, interpolate {A_l} and {φ_l} so that both become time-varying, {A_l(n)} and {φ_l(n)}; likewise linearly interpolate the harmonic number L so that it becomes the time-varying {L(n)}. Suppose the two adjacent frames are frame k and frame k+1, with centers at samples n = kN and n = (k+1)N respectively. The amplitudes and the harmonic number are interpolated linearly, while the phase is interpolated with a cubic polynomial φ_l(n) = φ_l^(k) + ω_l^(k) n + α_l n² + β_l n³, where ω_l is the l-th harmonic angular frequency; in the standard (McAulay-Quatieri) form the polynomial interpolation coefficients are α_l = (3/N²)(φ_l^(k+1) − φ_l^(k) − ω_l^(k) N + 2πM) − (1/N)(ω_l^(k+1) − ω_l^(k)) and β_l = (−2/N³)(φ_l^(k+1) − φ_l^(k) − ω_l^(k) N + 2πM) + (1/N²)(ω_l^(k+1) − ω_l^(k)), with the integer M chosen to resolve the 2π phase ambiguity. The harmonic part of a frame signal can therefore be expressed as s_h(n) = Σ_{l=1}^{L(n)} A_l(n) cos(φ_l(n)), and the remaining random signal as e(n) = s(n) − s_h(n);
(4) If the current frame is unvoiced, it is regarded as random noise; since unvoiced speech carries little speaker information, the unvoiced signal is kept unchanged.
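The least-squares harmonic estimation of step (2) can be sketched as follows. The direct weighted linear-algebra solve and the harmonic-count rule L = ⌊F_s/(2f_0)⌋ are assumptions about the implementation, not the patent's exact procedure:

```python
import numpy as np

def harmonic_analysis(s, f0, fs):
    """Weighted least-squares estimate of the complex harmonic amplitudes
    C_l, l = -L..L, minimizing sum_n w(n)^2 [s(n) - s_h(n)]^2 over one frame."""
    N = len(s)
    L = int(fs // (2 * f0))                   # harmonics up to fs/2 (assumed rule)
    n = np.arange(-(N // 2), N - N // 2)      # analysis frame centered at n = 0
    w = np.hamming(N)
    # basis matrix E[n, l] = exp(j * 2*pi*l*f0/fs * n)
    E = np.exp(1j * 2 * np.pi * f0 / fs * np.outer(n, np.arange(-L, L + 1)))
    C, *_ = np.linalg.lstsq(w[:, None] * E, w * s, rcond=None)
    s_h = (E @ C).real                        # harmonic component s_h(n)
    return C, s_h, s - s_h                    # plus the residual e(n) = s - s_h

def amp_phase(C):
    """Real amplitudes A_l = 2|C_l| and phases of the positive harmonics."""
    L = len(C) // 2
    return 2 * np.abs(C[L + 1:]), np.angle(C[L + 1:])
```

On a frame that is exactly harmonic, the fit recovers the amplitude and phase and leaves a near-zero residual.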
Step 3: vocal tract spectrum parameter computation, as shown in Fig. 2. First transform the harmonic amplitudes extracted from the voiced frames of the source and target speech: take the squared amplitudes as samples of the discrete power spectrum, obtain the autocorrelation coefficients via the IFFT, and perform LPC analysis with the Levinson-Durbin algorithm to obtain the LSF parameters and corresponding residual signals of the source and target speech. The detailed process, computed frame by frame, is as follows:
(1) Compute the squares of the L discrete amplitude values A_l and take them as the samples P(ω_l) of the discrete power spectrum, where ω_l = 2πlf_0 is the l-th harmonic angular frequency;
(2) Apply the IFFT to P(ω_l) to obtain the autocorrelation coefficients R(n); obtain the P-th order LPC coefficients {a_j, j = 1, 2, …, P} via the Levinson-Durbin algorithm, and further convert them into LSF parameters;
(3) Construct the linear prediction inverse filter from the LPC coefficients; its transfer function is A(z) = 1 + Σ_{j=1}^{P} a_j z^{−j}, and passing the speech through A(z) yields the residual signal of this frame's LPC analysis.
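A minimal sketch of this amplitude-to-LPC pipeline follows. The FFT size, nearest-bin spectral sampling, noise floor, and model order are illustrative assumptions, and the LPC-to-LSF conversion is omitted:

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> LPC
    polynomial a = [1, a_1, ..., a_P] of the all-pole model."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def vocal_tract_lpc(amps, f0, fs, order=10, nfft=512):
    """Squared harmonic amplitudes as discrete power-spectrum samples P(w_l),
    IFFT to autocorrelation, then Levinson-Durbin."""
    amps = np.asarray(amps, dtype=float)
    bins = np.round(np.arange(1, len(amps) + 1) * f0 / fs * nfft).astype(int)
    P = np.zeros(nfft)
    P[bins] = amps ** 2        # P(w_l) = A_l^2 at w_l = 2*pi*l*f0
    P[-bins] = amps ** 2       # conjugate symmetry of a real signal's spectrum
    P += 1e-4 * P.max()        # small noise floor keeps R well conditioned
    R = np.fft.ifft(P).real    # autocorrelation coefficients R(n)
    return levinson_durbin(R, order)

def lpc_residual(x, a):
    """Residual through the inverse filter A(z) = 1 + sum_j a_j z^-j."""
    return lfilter(a, [1.0], x)
```

A sanity check: for an AR(1) autocorrelation the recursion returns the generating coefficient, the inverse filter exactly undoes the corresponding synthesis filter, and the model fitted to decaying harmonic amplitudes is stable (all poles inside the unit circle).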
Step 4: establish the vocal tract spectrum conversion rule, as shown in Fig. 2. After DTW alignment of the LSF parameters of the source and target speech, feed them into the GMM for probabilistic modeling. The detailed process is as follows:
(1) Time-align the LSF parameters extracted from the harmonics of the voiced frames of the source and target speech with DTW, and record the indices of the aligned LSF vectors returned by DTW;
(2) According to these indices, align the residual signals of the source voiced-frame harmonics with those of the target; likewise align the random signals remaining after HNM analysis of the source and target voiced frames;
(3) Combine the aligned source and target LSF parameters into joint parameter vectors, feed them into the GMM, and establish the vocal tract spectrum transfer function.
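The DTW alignment that produces the recorded index pairs can be sketched as follows (a plain full-matrix DTW with Euclidean frame cost; the subsequent GMM training on the joint vectors, e.g. by EM, is omitted here):

```python
import numpy as np

def dtw_align(X, Y):
    """Full-matrix DTW between LSF sequences X (Tx, d) and Y (Ty, d) with
    Euclidean frame cost; returns the aligned index pairs (the recorded
    'indices' used later to align residuals and random signals)."""
    Tx, Ty = len(X), len(Y)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path, (i, j) = [], (Tx, Ty)
    while i > 0 and j > 0:                    # backtrack the optimal path
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def joint_vectors(X, Y, path):
    """Stack aligned source/target LSFs into the joint vectors fed to the GMM."""
    return np.array([np.concatenate([X[i], Y[j]]) for i, j in path])
```

On two sequences that differ only by a repeated frame, the path starts at (0, 0), ends at the last pair of indices, and matches equal frames throughout.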
Step 5: feature parameter conversion, as shown in Fig. 3. First analyze the speech to be converted with the HNM and, following steps 2 and 3, extract the LSF parameters and residual signal to be converted; then feed the LSF parameters to be converted into the GMM conversion rule established in step 4. The detailed process is as follows:
(1) Pre-process and frame the speech signal to be converted as in the steps above, extract the harmonic parameters by HNM analysis, compute the vocal tract spectrum parameters, and convert them into LSF parameters;
(2) Convert each frame's LSF parameters with the trained GMM rule to obtain the converted LSF parameters.
Step 6: residual excitation prediction, as shown in Fig. 3. Frame by frame, first find the target LSF parameter closest to the converted LSF parameter, then linearly superpose the residual signal corresponding to that target LSF parameter with the frame's random signal remaining after HNM analysis, and take the result as the residual excitation signal. The detailed process is as follows:
(1) For each converted frame's LSF parameter, find the closest target LSF parameter, and determine the residual signal corresponding to that target LSF parameter together with the frame's remaining random signal from HNM analysis;
(2) Linearly superpose the target residual signal and the remaining random signal from HNM analysis as the residual excitation signal.
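Frame-level residual prediction reduces to a nearest-neighbor lookup plus a linear superposition; a sketch, assuming Euclidean distance between LSF vectors (the patent does not name the distance metric):

```python
import numpy as np

def predict_excitation(lsf_conv, target_lsfs, target_residuals, frame_noise):
    """Residual excitation prediction for one converted frame: find the
    nearest target LSF vector (Euclidean distance is an assumed metric),
    then linearly superpose its stored residual and the frame's own HNM
    random component."""
    k = int(np.argmin(np.linalg.norm(target_lsfs - lsf_conv, axis=1)))
    return target_residuals[k] + frame_noise, k
```

The second return value is the index of the matched target frame, which identifies which stored residual was used.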
Step 7: speech synthesis and residual compensation, as shown in Fig. 4. From the converted LSF parameters obtained in step 5 and the residual excitation signal obtained in step 6, synthesize each frame of converted speech with the LPC synthesis model; then superpose on each frame a moderate amount of the corresponding target residual signal, and finally obtain the synthesized speech after overlap-add. The detailed process is as follows:
(1) Convert the converted LSF parameters into LPC coefficients, build the synthesis filter frame by frame from the LPC coefficients, and pass the predicted residual excitation signal through this filter to obtain the converted speech;
(2) Superpose on each converted frame a moderate amount of the corresponding target residual signal. According to experimental experience the residual signal generally needs moderate amplification during compensation, to 2-5 times its original level; the final synthesized speech is then obtained after the frames are spliced together.
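The per-frame synthesis and residual compensation can be sketched as follows. The gain of 3 sits in the 2-5x range the patent suggests, and simple concatenation stands in for overlap-add since the framing in step 1 uses no overlap; the LSF-to-LPC conversion is assumed to have been done already:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc_a, excitation):
    """Pass the predicted excitation through the all-pole synthesis filter
    1/A(z), where lpc_a = [1, a_1, ..., a_P]."""
    return lfilter([1.0], lpc_a, excitation)

def compensate_and_splice(frames, target_residuals, gain=3.0):
    """Superpose the amplified target residual on every synthesized frame
    (the patent suggests a 2-5x amplification) and splice the frames."""
    return np.concatenate([f + gain * r for f, r in zip(frames, target_residuals)])
```

For a one-pole filter and an impulse excitation the synthesized frame is the filter's impulse response, and the compensation shifts each frame by the amplified residual.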

Claims (8)

1. A residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model, characterized by comprising the following steps:
Step 1: pre-processing and voiced/unvoiced decision: pre-emphasize, frame, and window the source and target speech respectively, compute the short-time energy and average zero-crossing rate of each frame, and complete the voiced/unvoiced decision;
Step 2: harmonic parameter extraction: analyze the voiced frames of the source and target speech respectively with the HNM model; first compute the fundamental frequency of each voiced frame, then decompose the voiced frame with the HNM model into a harmonic signal and a wideband random signal, compute the number of harmonics, and extract the amplitude, phase, and frequency of each harmonic; unvoiced frames are regarded as random noise and kept unchanged;
Step 3: vocal tract spectrum parameter computation: transform the harmonic amplitudes extracted from the voiced frames of the source and target speech; take the squared amplitudes as samples of the discrete power spectrum, obtain the autocorrelation coefficients via the IFFT, and perform LPC analysis with the Levinson-Durbin algorithm to obtain the LSF parameters and corresponding residual signals of the source and target speech;
Step 4: establishment of the vocal tract spectrum conversion rule: after DTW alignment of the LSF parameters of the source and target speech, feed them into the GMM for probabilistic modeling;
Step 5: feature parameter conversion: first analyze the speech to be converted with the HNM and, following steps 2 and 3, extract the LSF parameters and residual signal to be converted; then feed the LSF parameters to be converted into the GMM conversion rule established in step 4 for conversion;
Step 6: residual excitation prediction: frame by frame, first find the target LSF parameter closest to the converted LSF parameter, then linearly superpose the residual signal corresponding to that target LSF parameter with the frame's random signal remaining after HNM analysis, and take the result as the residual excitation signal;
Step 7: speech synthesis and residual compensation: from the converted LSF parameters and the residual excitation signal obtained in steps 5 and 6, synthesize each frame of converted speech with the LPC synthesis model; then superpose on each synthesized frame a moderate amount of the corresponding target residual signal, and finally obtain the synthesized speech after overlap-add.
2. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the detailed process of pre-processing and voiced/unvoiced decision is as follows:
Step 1: pre-process the source and target speech signals respectively, with a pre-emphasis factor of 0.96 and a frame length of 20 ms with no overlap, then apply a Hamming window;
Step 2: compute frame by frame the short-time energy E_i = Σ_{m=0}^{N−1} x_i²(m) and the short-time zero-crossing rate Z_i = (1/2) Σ_{m=0}^{N−1} |sgn[x_i(m)] − sgn[x_i(m−1)]|, where x_i(m) is the i-th windowed speech frame and N is the frame length; then apply the double-threshold method to make the voiced/unvoiced decision.
3. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the harmonic parameters are extracted as follows:
Step 1: compute the fundamental frequency f_0 of the current frame of the source and target speech respectively with the normalized cross-correlation method;
Step 2: analyze the source and target speech respectively; if the current frame s(n), where 1 ≤ n ≤ N and N is the frame length, is voiced, decompose it into a harmonic component s_h(n) and a random component e(n). First determine the number of harmonics L = ⌊F_s/(2f_0)⌋, where F_s is the sampling frequency; the objective function is C_l = argmin Σ_{n=−N/2}^{N/2} w²(n)[s(n) − s_h(n)]², where w(n) denotes a Hamming window. The complex amplitudes {C_l, l = −L, −L+1, …, L} are estimated under the least-squares criterion; the real amplitudes of the harmonic components are expressed as A_l = 2|C_l| = 2|C_{−l}|, and the real phases as φ_l = arg(C_l). Then interpolate {A_l} and {φ_l} between two adjacent frames into the time-varying {A_l(n)} and {φ_l(n)}, and likewise linearly interpolate the harmonic number L into the time-varying {L(n)}; the remaining random signal is then expressed as e(n) = s(n) − s_h(n), with s_h(n) = Σ_{l=1}^{L(n)} A_l(n) cos(φ_l(n));
Step 3: if the current frame is unvoiced, it is regarded as random noise; since unvoiced speech carries little speaker information, the unvoiced signal is kept unchanged.
4. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the vocal tract spectrum parameters are computed frame by frame as follows:
Step 1: compute the squares of the L discrete amplitude values A_l and take them as the samples P(ω_l) of the discrete power spectrum, where ω_l = 2πlf_0 is the l-th harmonic angular frequency;
Step 2: apply the IFFT to P(ω_l) to obtain the autocorrelation coefficients R(n); obtain the P-th order LPC coefficients {a_j, j = 1, 2, …, P} via the Levinson-Durbin algorithm, and further convert them into LSF parameters;
Step 3: construct the linear prediction inverse filter from the LPC coefficients; its transfer function is A(z) = 1 + Σ_{j=1}^{P} a_j z^{−j}, and passing the speech through A(z) yields the residual signal of this frame's LPC analysis.
5. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the vocal tract spectrum conversion rule is established as follows:
Step 1: time-align the LSF parameters extracted from the harmonics of the voiced frames of the source and target speech with DTW, and record the indices of the aligned LSF vectors returned by DTW;
Step 2: according to these indices, align the residual signals of the source voiced-frame harmonics with those of the target; likewise align the random signals remaining after HNM analysis of the source and target voiced frames;
Step 3: combine the aligned source and target LSF parameters into joint parameter vectors, feed them into the GMM, and establish the vocal tract spectrum transfer function.
6. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the feature parameters are converted as follows:
Step 1: pre-process and frame the speech signal to be converted, extract the harmonic parameters by HNM analysis, compute the vocal tract spectrum parameters, and convert them into LSF parameters;
Step 2: convert each frame's LSF parameters with the trained GMM rule to obtain the converted LSF parameters.
7. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the residual excitation is predicted as follows:
Step 1: for each converted frame's LSF parameter, find the closest target LSF parameter, and determine the residual signal corresponding to that target LSF parameter together with the frame's remaining random signal from HNM analysis;
Step 2: linearly superpose the target residual signal and the remaining random signal from HNM analysis as the residual excitation signal.
8. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the detailed process of speech synthesis and residual compensation is as follows:
In the first step, the converted LSF parameters are transformed into LPC coefficients, a filter is built frame by frame from the LPC coefficients, and the predicted residual excitation signal is passed through this filter to obtain the converted speech;
In the second step, the corresponding target residual signal is superimposed once more onto each frame of the converted speech signal, and the frames are spliced together to obtain the final synthesized speech.
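Assuming each frame's converted LSF vector has already been turned into LPC coefficients [1, a1, ..., ap] (an lsf2poly-style conversion, not shown here), the filtering and compensation steps might look like this sketch; `synthesize` is an illustrative name.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(lpc_per_frame, excitation_per_frame, target_residuals):
    """Frame-by-frame all-pole synthesis filtering, then residual compensation."""
    frames = []
    for a, e, r in zip(lpc_per_frame, excitation_per_frame, target_residuals):
        speech = lfilter([1.0], a, e)  # synthesis filter 1/A(z) driven by the excitation
        frames.append(speech + r)      # superimpose the target residual once more
    return np.concatenate(frames)      # splice the frames into the final speech
```

A real system would overlap-add windowed frames rather than butt-joining them; plain concatenation is used here only to mirror the claim's wording.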
CN2012101218866A 2012-04-24 2012-04-24 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM) Expired - Fee Related CN102664003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101218866A CN102664003B (en) 2012-04-24 2012-04-24 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)


Publications (2)

Publication Number Publication Date
CN102664003A true CN102664003A (en) 2012-09-12
CN102664003B CN102664003B (en) 2013-12-04

Family

ID=46773469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101218866A Expired - Fee Related CN102664003B (en) 2012-04-24 2012-04-24 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)

Country Status (1)

Country Link
CN (1) CN102664003B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004088633A1 (en) * 2003-03-27 2004-10-14 France Telecom Method for analyzing fundamental frequency information and voice conversion method and system implementing said analysis method
TW201001396A (en) * 2008-06-26 2010-01-01 Univ Nat Taiwan Science Tech Method for synthesizing speech
CN102063899A (en) * 2010-10-27 2011-05-18 Nanjing University of Posts and Telecommunications Method for voice conversion under non-parallel text condition


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WINSTON S. PERCYBROOKS et al.: "Voice Conversion with Linear Prediction Residual Estimation", IEEE International Conference on Acoustics, Speech and Signal Processing, 4 April 2008 (2008-04-04), pages 4673-4676 *
YI LIFU et al.: "A Chinese Speech Synthesis System Based on the HNM Algorithm", Proceedings of the 6th National Conference on Modern Phonetics (Part II), 20 October 2003 (2003-10-20), pages 528-533 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10347275B2 (en) 2013-09-09 2019-07-09 Huawei Technologies Co., Ltd. Unvoiced/voiced decision for speech processing
CN105359211A (en) * 2013-09-09 2016-02-24 华为技术有限公司 Unvoiced/voiced decision for speech processing
CN103489443A (en) * 2013-09-17 2014-01-01 湖南大学 Method and device for imitating sound
US10734003B2 (en) 2014-04-08 2020-08-04 Huawei Technologies Co., Ltd. Noise signal processing method, noise signal generation method, encoder, decoder, and encoding and decoding system
US9728195B2 (en) 2014-04-08 2017-08-08 Huawei Technologies Co., Ltd. Noise signal processing method, noise signal generation method, encoder, decoder, and encoding and decoding system
US10134406B2 (en) 2014-04-08 2018-11-20 Huawei Technologies Co., Ltd. Noise signal processing method, noise signal generation method, encoder, decoder, and encoding and decoding system
WO2015154397A1 (en) * 2014-04-08 2015-10-15 华为技术有限公司 Noise signal processing and generation method, encoder/decoder and encoding/decoding system
CN106486129A (en) * 2014-06-27 2017-03-08 Audio coding method and apparatus
US11133016B2 (en) 2014-06-27 2021-09-28 Huawei Technologies Co., Ltd. Audio coding method and apparatus
US10460741B2 (en) 2014-06-27 2019-10-29 Huawei Technologies Co., Ltd. Audio coding method and apparatus
CN106486129B (en) * 2014-06-27 2019-10-25 华为技术有限公司 A kind of audio coding method and device
CN106098073A (en) * 2016-05-23 2016-11-09 End-to-end speech encryption and decryption system based on spectrum mapping
CN106656882A (en) * 2016-11-29 2017-05-10 中国科学院声学研究所 Signal synthesizing method and system
CN106656882B (en) * 2016-11-29 2019-05-10 中国科学院声学研究所 A kind of signal synthesis method and system
CN107134277A (en) * 2017-06-15 2017-09-05 Voice activity detection method based on a GMM model
CN111418005A (en) * 2017-11-29 2020-07-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program
CN111418005B (en) * 2017-11-29 2023-08-11 雅马哈株式会社 Voice synthesis method, voice synthesis device and storage medium
CN108281150A (en) * 2018-01-29 2018-07-13 Voice breaking and voice change method based on a derivative glottal flow model
CN108510991A (en) * 2018-03-30 2018-09-07 Speaker recognition method using harmonic series
CN108766450B (en) * 2018-04-16 2023-02-17 Voice conversion method based on harmonic impulse decomposition
CN108766450A (en) * 2018-04-16 2018-11-06 Voice conversion method based on harmonic impulse decomposition
CN108899008A (en) * 2018-06-13 2018-11-27 Method and system for simulating noise interference in air voice communication
CN108899008B (en) * 2018-06-13 2023-04-18 中国人民解放军91977部队 Method and system for simulating interference of noise in air voice communication
CN109065068A (en) * 2018-08-17 2018-12-21 Audio processing method, device and storage medium
CN109003621A (en) * 2018-09-06 2018-12-14 Audio processing method, device and storage medium
CN109712634A (en) * 2018-12-24 2019-05-03 Automatic voice conversion method
CN110444192A (en) * 2019-08-15 2019-11-12 Intelligent voice robot based on voice technology
CN113241089A (en) * 2021-04-16 2021-08-10 维沃移动通信有限公司 Voice signal enhancement method and device and electronic equipment
CN113241089B (en) * 2021-04-16 2024-02-23 维沃移动通信有限公司 Voice signal enhancement method and device and electronic equipment

Also Published As

Publication number Publication date
CN102664003B (en) 2013-12-04

Similar Documents

Publication Publication Date Title
CN102664003B (en) Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
Dave Feature extraction methods LPC, PLP and MFCC in speech recognition
US20150025892A1 (en) Method and system for template-based personalized singing synthesis
Song et al. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems
CN1815552B (en) Spectrum modeling and speech enhancement method based on line spectral frequencies and their inter-order difference parameters
CN101527141B (en) Method of converting whispered voice into normal voice based on radial basis function neural network
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN102184731A (en) Method for converting emotional speech by combining rhythm parameters with tone parameters
CN103021418A (en) Voice conversion method facing to multi-time scale prosodic features
CN103714822B (en) Sub-band coding and decoding method and device based on SILK coder decoder
CN106782599A (en) Voice conversion method based on Gaussian process with output post-filtering
Erro et al. MFCC+ F0 extraction and waveform reconstruction using HNM: preliminary results in an HMM-based synthesizer
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
CN105654941A (en) Voice change method and device based on specific target person voice change ratio parameter
CN103854655B (en) A kind of low bit-rate speech coder and decoder
CN103886859B (en) Voice conversion method based on one-to-many codebook mapping
CN102231275B (en) Embedded speech synthesis method based on weighted mixed excitation
Raju et al. Application of prosody modification for speech recognition in different emotion conditions
Othmane et al. Enhancement of esophageal speech using voice conversion techniques
CN115862590A (en) Text-driven speech synthesis method based on characteristic pyramid
Xie et al. Pitch transformation in neural network based voice conversion
CN114913844A (en) Broadcast language identification method for pitch normalization reconstruction
Gentet et al. Neutral to lombard speech conversion with deep learning
Jung et al. Pitch alteration technique in speech synthesis system
Kawahara et al. Beyond bandlimited sampling of speech spectral envelope imposed by the harmonic structure of voiced sounds.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131204

Termination date: 20160424

CF01 Termination of patent right due to non-payment of annual fee