CN102664003A - Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM) - Google Patents

Info

Publication number: CN102664003A (granted as CN102664003B)
Application number: CN2012101218866A
Inventors: 解伟超, 张玲华, 吴丽芳
Current assignee: Nanjing University of Posts and Telecommunications
Application filed by: Nanjing University of Posts and Telecommunications
Legal status: Granted; Expired - Fee Related
Classifications: Telephonic Communication Services; Compression, Expansion, Code Conversion, And Decoders
Abstract

The invention discloses a residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model (HNM), belonging to the field of speech signal processing. The method comprises the steps of pre-processing, voiced/unvoiced decision, harmonic parameter extraction, vocal tract spectrum parameter computation, establishment of a vocal tract spectrum conversion rule, feature parameter conversion, residual excitation prediction, and speech synthesis with residual compensation. When the excitation signal is constructed, a moderate amount of the random residual signal produced by HNM analysis is linearly superposed on the residual of the voiced-frame harmonic signal extracted by HNM analysis to form the predicted excitation source signal. This effectively enhances the speaker's suprasegmental features carried by the excitation source and avoids the distortion introduced by the manual modification of the excitation signal in conventional methods. In the synthesis stage, a moderate amount of the residual of the target voiced-frame harmonic signal obtained by HNM analysis is superposed frame by frame onto the synthesized speech, so that the converted speech carries more of the target speaker's individuality and the speech quality is improved.

Description

Residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model
Technical field
The present invention relates to voice conversion techniques, and in particular to a residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model; it belongs to the field of speech signal processing.
Background technology
Voice conversion is a research branch of the speech signal processing field that has emerged in recent years. It builds on research in speaker recognition and speech synthesis, and both enriches and extends these two branches, yet does not belong entirely to either of them.
The goal of voice conversion is to change the personal characteristics in a source speaker's speech, while keeping the semantic information unchanged, so that the converted speech sounds like the target speaker. A voice conversion system operates in two phases: training and conversion. In the training phase, the system analyzes the source and target speakers' speech, extracts their parameters, and establishes conversion rules. In the conversion phase, features are first extracted from the source speech and then converted into target speech features according to the conversion rules obtained in the training phase.
The features of a speech signal fall into two classes: segmental and suprasegmental. Segmental features describe the timbre of the speech and mainly include the positions and bandwidths of the vocal tract formants, spectral tilt, fundamental frequency, and so on. Suprasegmental features describe the prosody and excitation-source information of the speech; their parameters mainly include dynamic characteristics such as phoneme duration, energy, pitch-period contour, and the variation of the spectral envelope.
The key issues in voice conversion are the extraction of speaker-specific features and the establishment of conversion rules, and two decades of development have produced a large body of results. Current research on speech feature parameters concentrates mainly on the segmental features of the speech signal, while the suprasegmental features of the excitation source are rarely addressed. The main current method for estimating the excitation source is residual prediction based on the linear prediction coding (LPC) model. However, when the residual signal obtained by linear prediction is used as the excitation, it carries little of the target speaker's individuality and has low energy, so the converted speech is of poor quality. (1. Suendermann D., Bonafonte A., Ney H., Hoege H., "A Study on Residual Prediction Techniques for Voice Conversion", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 13-16, 2005. 2. Percybrooks W. S., Moore E., "Voice conversion with linear prediction residual estimation", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4673-4676, March 2008.)
In addition, existing voice conversion systems also compute a companding ratio of the fundamental frequency from its mean value, or manually modify the excitation source signal by means such as duration insertion and cutting. However, the suprasegmental characteristics of the excitation source depend on the speaker's state at the time and are influenced by the speaker's environment. Manual modification of the excitation signal therefore cannot accurately describe the suprasegmental information of the excitation source, and it introduces distortion. (3. Xuejing Sun, "Voice quality conversion in TD-PSOLA speech synthesis", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II953-II956, 2000. 4. Wang Yuan-yuan, Yang Shun, "Speech synthesis based on PSOLA algorithm and modified pitch parameters", International Conference on Computational Problem-Solving (ICCP), pp. 296-299, 2010.)
Summary of the invention
The object of the present invention is to provide a voice conversion algorithm that, under parallel text, combines the characteristics of the speech signal with the speaker's personal characteristics. It focuses on the extraction and prediction of the suprasegmental information of the excitation source and, through improvement of the excitation source signal and compensation of the converted speech, strengthens the target speaker's individuality in the synthesized speech and improves the performance of the conversion system.
To achieve the above object, the present invention adopts the following technical scheme:
A residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model, with the following concrete steps:
Step 1: pre-processing and voiced/unvoiced decision. Pre-emphasize, frame, and window the source and target speech respectively; compute the short-time energy and average zero-crossing rate of each frame to complete the voiced/unvoiced decision.
Step 2: harmonic parameter extraction. Analyze the voiced frames of the source and target speech respectively with the harmonic plus noise model (HNM). First compute the fundamental frequency of each voiced frame; the HNM then decomposes the voiced frame into a harmonic signal and a wideband random signal. Compute the number of harmonics and extract the amplitude, phase, and frequency of each harmonic; unvoiced frames are regarded as random noise and kept unchanged.
Step 3: vocal tract spectrum parameter computation. Transform the harmonic amplitudes extracted from the voiced frames of the source and target speech: take the squared amplitudes as samples of the discrete power spectrum, obtain the autocorrelation coefficients via the inverse fast Fourier transform (IFFT), and perform LPC analysis with the Levinson-Durbin algorithm to obtain the line spectral frequency (LSF) parameters and corresponding residual signals of the source and target speech.
Step 4: establishment of the vocal tract spectrum conversion rule. After aligning the LSF parameters of the source and target speech with dynamic time warping (DTW), feed them into a Gaussian mixture model (GMM) for probabilistic modeling.
Step 5: feature parameter conversion. First analyze the speech to be converted with the HNM and, following steps 2 and 3, extract the LSF parameters and residual signal to be converted; then feed the LSF parameters to be converted into the GMM conversion rule established in step 4 for conversion.
Step 6: residual excitation prediction. Frame by frame, first find the target LSF parameter closest to the converted LSF parameter, then linearly superpose the residual signal corresponding to that target LSF parameter with the frame's random signal remaining after HNM analysis, and take the result as the residual excitation signal.
Step 7: speech synthesis and residual compensation. From the converted LSF parameters and the residual excitation signal obtained in steps 5 and 6, synthesize each frame of converted speech with the LPC synthesis model; then superpose on each synthesized frame a moderate amount of the corresponding target residual signal, and finally obtain the synthesized speech after overlap-add.
Compared with the prior art, the present invention has the following notable advantages: (1) when the excitation signal is constructed, a moderate amount of the random residual signal (the wideband random signal) produced by HNM analysis is linearly superposed on the residual of the voiced-frame harmonic signal extracted by HNM analysis to form the predicted excitation source signal; this effectively enhances the speaker's suprasegmental features carried by the excitation source while avoiding the distortion introduced by the manual modification of the excitation signal in conventional methods; (2) in the synthesis stage, a moderate amount of the residual of the target voiced-frame harmonic signal obtained by HNM analysis is superposed frame by frame onto the synthesized speech, so that the converted speech carries more of the target speaker's individuality and the speech quality is improved.
The present invention is described in further detail below in conjunction with the accompanying drawings.
Description of drawings
Fig. 1 is a schematic diagram of the residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to the present invention;
Fig. 2 is a schematic diagram of feature parameter extraction and conversion rule establishment;
Fig. 3 is a schematic diagram of feature parameter conversion and residual excitation signal prediction based on the HNM model;
Fig. 4 is a schematic diagram of voiced-frame parameter conversion and speech synthesis for the i-th frame.
Embodiment
With reference to Fig. 1, the residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model comprises the following steps:
Step 1: in the training stage, first carry out pre-processing and the voiced/unvoiced decision: pre-emphasize, frame, and window the source and target speech respectively, compute the short-time energy and average zero-crossing rate of each frame, and complete the voiced/unvoiced decision. The detailed process is as follows:
(1) Pre-process the source and target speech signals respectively: the pre-emphasis factor is 0.96, the frame length is 20 ms with no overlap, and a Hamming window is then applied;
(2) Compute frame by frame the short-time energy E_i = Σ_{m=0}^{N−1} x_i²(m) and the short-time zero-crossing rate Z_i = (1/2) Σ_{m=0}^{N−1} |sgn[x_i(m)] − sgn[x_i(m−1)]|, where x_i(m) is the i-th windowed speech frame and N is the frame length; then apply the double-threshold method to make the voiced/unvoiced decision.
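The pre-processing and voicing decision above can be sketched in Python as follows. The energy and zero-crossing-rate formulas follow the patent; the threshold values in the usage below are hypothetical, since the patent does not specify them:

```python
import numpy as np

def preemphasis(x, alpha=0.96):
    """Pre-emphasis y[n] = x[n] - alpha*x[n-1], with the 0.96 factor from the patent."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len):
    """Split into non-overlapping frames (20 ms, zero overlap) and Hamming-window them."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.hamming(frame_len)

def short_time_energy(frames):
    """E_i = sum_m x_i(m)^2 per frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Z_i = 1/2 * sum_m |sgn[x_i(m)] - sgn[x_i(m-1)]| per frame."""
    return 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

def voiced_unvoiced(frames, e_thresh, z_thresh):
    """Double-threshold decision: voiced frames have high energy and low ZCR."""
    return (short_time_energy(frames) > e_thresh) & (zero_crossing_rate(frames) < z_thresh)
```

For example, at an 8 kHz sampling rate a 20 ms frame is 160 samples, and a low-frequency periodic segment is classified as voiced while low-level noise is not.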
Step 2: harmonic parameter extraction, as shown in Fig. 2. Analyze the voiced frames of the source and target speech respectively with the HNM model: first compute the fundamental frequency of each voiced frame, then decompose the voiced frame with the HNM model into a harmonic signal and a wideband random signal, compute the number of harmonics, and extract the amplitude, phase, and frequency of each harmonic; unvoiced frames are regarded as random noise and kept unchanged. The detailed process is as follows:
(1) Compute the fundamental frequency f_0 of the current frame of the source and target speech respectively with the normalized cross-correlation method;
(2) Analyze the source and target speech respectively. If the current frame s(n), where 1 ≤ n ≤ N and N is the frame length, is voiced, decompose it into a harmonic component s_h(n) and a random component e(n). First determine the number of harmonics L = ⌊F_s/(2f_0)⌋, where F_s is the sampling frequency. The objective function is C_l = argmin Σ_{n=−N/2}^{N/2} w²(n)[s(n) − s_h(n)]², where w(n) denotes a Hamming window. The complex amplitudes {C_l, l = −L, −L+1, …, L} are estimated under the least-squares criterion; the real amplitude of the l-th harmonic component can then be expressed as A_l = 2|C_l| = 2|C_{−l}|, and the real phase as φ_l = arg(C_l);
(3) Between two adjacent frames, interpolate {A_l} and {φ_l} so that both become time-varying, {A_l(n)} and {φ_l(n)}; likewise linearly interpolate the harmonic number L so that it becomes the time-varying {L(n)}. Suppose the two adjacent frames are frame k and frame k+1, with centers at samples n = kN and n = (k+1)N respectively. The amplitudes and the harmonic number are interpolated linearly, while the phase is interpolated with a cubic polynomial φ_l(n) = φ_l^(k) + ω_l^(k) n + α_l n² + β_l n³, where ω_l is the l-th harmonic angular frequency; in the standard (McAulay-Quatieri) form the polynomial interpolation coefficients are α_l = (3/N²)(φ_l^(k+1) − φ_l^(k) − ω_l^(k) N + 2πM) − (1/N)(ω_l^(k+1) − ω_l^(k)) and β_l = (−2/N³)(φ_l^(k+1) − φ_l^(k) − ω_l^(k) N + 2πM) + (1/N²)(ω_l^(k+1) − ω_l^(k)), with the integer M chosen to resolve the 2π phase ambiguity. The harmonic part of a frame signal can therefore be expressed as s_h(n) = Σ_{l=1}^{L(n)} A_l(n) cos(φ_l(n)), and the remaining random signal as e(n) = s(n) − s_h(n);
(4) If the current frame is unvoiced, it is regarded as random noise; since unvoiced speech carries little speaker information, the unvoiced signal is kept unchanged.
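The least-squares harmonic estimation of step (2) can be sketched as follows. The direct weighted linear-algebra solve and the harmonic-count rule L = ⌊F_s/(2f_0)⌋ are assumptions about the implementation, not the patent's exact procedure:

```python
import numpy as np

def harmonic_analysis(s, f0, fs):
    """Weighted least-squares estimate of the complex harmonic amplitudes
    C_l, l = -L..L, minimizing sum_n w(n)^2 [s(n) - s_h(n)]^2 over one frame."""
    N = len(s)
    L = int(fs // (2 * f0))                   # harmonics up to fs/2 (assumed rule)
    n = np.arange(-(N // 2), N - N // 2)      # analysis frame centered at n = 0
    w = np.hamming(N)
    # basis matrix E[n, l] = exp(j * 2*pi*l*f0/fs * n)
    E = np.exp(1j * 2 * np.pi * f0 / fs * np.outer(n, np.arange(-L, L + 1)))
    C, *_ = np.linalg.lstsq(w[:, None] * E, w * s, rcond=None)
    s_h = (E @ C).real                        # harmonic component s_h(n)
    return C, s_h, s - s_h                    # plus the residual e(n) = s - s_h

def amp_phase(C):
    """Real amplitudes A_l = 2|C_l| and phases of the positive harmonics."""
    L = len(C) // 2
    return 2 * np.abs(C[L + 1:]), np.angle(C[L + 1:])
```

On a frame that is exactly harmonic, the fit recovers the amplitude and phase and leaves a near-zero residual.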
Step 3: vocal tract spectrum parameter computation, as shown in Fig. 2. First transform the harmonic amplitudes extracted from the voiced frames of the source and target speech: take the squared amplitudes as samples of the discrete power spectrum, obtain the autocorrelation coefficients via the IFFT, and perform LPC analysis with the Levinson-Durbin algorithm to obtain the LSF parameters and corresponding residual signals of the source and target speech. The detailed process, computed frame by frame, is as follows:
(1) Compute the squares of the L discrete amplitude values A_l and take them as the samples P(ω_l) of the discrete power spectrum, where ω_l = 2πlf_0 is the l-th harmonic angular frequency;
(2) Apply the IFFT to P(ω_l) to obtain the autocorrelation coefficients R(n); obtain the P-th order LPC coefficients {a_j, j = 1, 2, …, P} via the Levinson-Durbin algorithm, and further convert them into LSF parameters;
(3) Construct the linear prediction inverse filter from the LPC coefficients; its transfer function is A(z) = 1 + Σ_{j=1}^{P} a_j z^{−j}, and passing the speech through A(z) yields the residual signal of this frame's LPC analysis.
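A minimal sketch of this amplitude-to-LPC pipeline follows. The FFT size, nearest-bin spectral sampling, noise floor, and model order are illustrative assumptions, and the LPC-to-LSF conversion is omitted:

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> LPC
    polynomial a = [1, a_1, ..., a_P] of the all-pole model."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def vocal_tract_lpc(amps, f0, fs, order=10, nfft=512):
    """Squared harmonic amplitudes as discrete power-spectrum samples P(w_l),
    IFFT to autocorrelation, then Levinson-Durbin."""
    amps = np.asarray(amps, dtype=float)
    bins = np.round(np.arange(1, len(amps) + 1) * f0 / fs * nfft).astype(int)
    P = np.zeros(nfft)
    P[bins] = amps ** 2        # P(w_l) = A_l^2 at w_l = 2*pi*l*f0
    P[-bins] = amps ** 2       # conjugate symmetry of a real signal's spectrum
    P += 1e-4 * P.max()        # small noise floor keeps R well conditioned
    R = np.fft.ifft(P).real    # autocorrelation coefficients R(n)
    return levinson_durbin(R, order)

def lpc_residual(x, a):
    """Residual through the inverse filter A(z) = 1 + sum_j a_j z^-j."""
    return lfilter(a, [1.0], x)
```

A sanity check: for an AR(1) autocorrelation the recursion returns the generating coefficient, the inverse filter exactly undoes the corresponding synthesis filter, and the model fitted to decaying harmonic amplitudes is stable (all poles inside the unit circle).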
Step 4: establish the vocal tract spectrum conversion rule, as shown in Fig. 2. After DTW alignment of the LSF parameters of the source and target speech, feed them into the GMM for probabilistic modeling. The detailed process is as follows:
(1) Time-align the LSF parameters extracted from the harmonics of the voiced frames of the source and target speech with DTW, and record the indices of the aligned LSF vectors returned by DTW;
(2) According to these indices, align the residual signals of the source voiced-frame harmonics with those of the target; likewise align the random signals remaining after HNM analysis of the source and target voiced frames;
(3) Combine the aligned source and target LSF parameters into joint parameter vectors, feed them into the GMM, and establish the vocal tract spectrum transfer function.
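The DTW alignment that produces the recorded index pairs can be sketched as follows (a plain full-matrix DTW with Euclidean frame cost; the subsequent GMM training on the joint vectors, e.g. by EM, is omitted here):

```python
import numpy as np

def dtw_align(X, Y):
    """Full-matrix DTW between LSF sequences X (Tx, d) and Y (Ty, d) with
    Euclidean frame cost; returns the aligned index pairs (the recorded
    'indices' used later to align residuals and random signals)."""
    Tx, Ty = len(X), len(Y)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    path, (i, j) = [], (Tx, Ty)
    while i > 0 and j > 0:                    # backtrack the optimal path
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def joint_vectors(X, Y, path):
    """Stack aligned source/target LSFs into the joint vectors fed to the GMM."""
    return np.array([np.concatenate([X[i], Y[j]]) for i, j in path])
```

On two sequences that differ only by a repeated frame, the path starts at (0, 0), ends at the last pair of indices, and matches equal frames throughout.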
Step 5: feature parameter conversion, as shown in Fig. 3. First analyze the speech to be converted with the HNM and, following steps 2 and 3, extract the LSF parameters and residual signal to be converted; then feed the LSF parameters to be converted into the GMM conversion rule established in step 4. The detailed process is as follows:
(1) Pre-process and frame the speech signal to be converted as in the steps above, extract the harmonic parameters by HNM analysis, compute the vocal tract spectrum parameters, and convert them into LSF parameters;
(2) Convert each frame's LSF parameters with the trained GMM rule to obtain the converted LSF parameters.
Step 6: residual excitation prediction, as shown in Fig. 3. Frame by frame, first find the target LSF parameter closest to the converted LSF parameter, then linearly superpose the residual signal corresponding to that target LSF parameter with the frame's random signal remaining after HNM analysis, and take the result as the residual excitation signal. The detailed process is as follows:
(1) For each converted frame's LSF parameter, find the closest target LSF parameter, and determine the residual signal corresponding to that target LSF parameter together with the frame's remaining random signal from HNM analysis;
(2) Linearly superpose the target residual signal and the remaining random signal from HNM analysis as the residual excitation signal.
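Frame-level residual prediction reduces to a nearest-neighbor lookup plus a linear superposition; a sketch, assuming Euclidean distance between LSF vectors (the patent does not name the distance metric):

```python
import numpy as np

def predict_excitation(lsf_conv, target_lsfs, target_residuals, frame_noise):
    """Residual excitation prediction for one converted frame: find the
    nearest target LSF vector (Euclidean distance is an assumed metric),
    then linearly superpose its stored residual and the frame's own HNM
    random component."""
    k = int(np.argmin(np.linalg.norm(target_lsfs - lsf_conv, axis=1)))
    return target_residuals[k] + frame_noise, k
```

The second return value is the index of the matched target frame, which identifies which stored residual was used.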
Step 7: speech synthesis and residual compensation, as shown in Fig. 4. From the converted LSF parameters obtained in step 5 and the residual excitation signal obtained in step 6, synthesize each frame of converted speech with the LPC synthesis model; then superpose on each frame a moderate amount of the corresponding target residual signal, and finally obtain the synthesized speech after overlap-add. The detailed process is as follows:
(1) Convert the converted LSF parameters into LPC coefficients, build the synthesis filter frame by frame from the LPC coefficients, and pass the predicted residual excitation signal through this filter to obtain the converted speech;
(2) Superpose on each converted frame a moderate amount of the corresponding target residual signal. According to experimental experience the residual signal generally needs moderate amplification during compensation, to 2-5 times its original level; the final synthesized speech is then obtained after the frames are spliced together.
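The per-frame synthesis and residual compensation can be sketched as follows. The gain of 3 sits in the 2-5x range the patent suggests, and simple concatenation stands in for overlap-add since the framing in step 1 uses no overlap; the LSF-to-LPC conversion is assumed to have been done already:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc_a, excitation):
    """Pass the predicted excitation through the all-pole synthesis filter
    1/A(z), where lpc_a = [1, a_1, ..., a_P]."""
    return lfilter([1.0], lpc_a, excitation)

def compensate_and_splice(frames, target_residuals, gain=3.0):
    """Superpose the amplified target residual on every synthesized frame
    (the patent suggests a 2-5x amplification) and splice the frames."""
    return np.concatenate([f + gain * r for f, r in zip(frames, target_residuals)])
```

For a one-pole filter and an impulse excitation the synthesized frame is the filter's impulse response, and the compensation shifts each frame by the amplified residual.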

Claims (8)

1. A residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model, characterized by comprising the following steps:
Step 1: pre-processing and voiced/unvoiced decision: pre-emphasize, frame, and window the source and target speech respectively, compute the short-time energy and average zero-crossing rate of each frame, and complete the voiced/unvoiced decision;
Step 2: harmonic parameter extraction: analyze the voiced frames of the source and target speech respectively with the HNM model; first compute the fundamental frequency of each voiced frame, then decompose the voiced frame with the HNM model into a harmonic signal and a wideband random signal, compute the number of harmonics, and extract the amplitude, phase, and frequency of each harmonic; unvoiced frames are regarded as random noise and kept unchanged;
Step 3: vocal tract spectrum parameter computation: transform the harmonic amplitudes extracted from the voiced frames of the source and target speech; take the squared amplitudes as samples of the discrete power spectrum, obtain the autocorrelation coefficients via the IFFT, and perform LPC analysis with the Levinson-Durbin algorithm to obtain the LSF parameters and corresponding residual signals of the source and target speech;
Step 4: establishment of the vocal tract spectrum conversion rule: after DTW alignment of the LSF parameters of the source and target speech, feed them into the GMM for probabilistic modeling;
Step 5: feature parameter conversion: first analyze the speech to be converted with the HNM and, following steps 2 and 3, extract the LSF parameters and residual signal to be converted; then feed the LSF parameters to be converted into the GMM conversion rule established in step 4 for conversion;
Step 6: residual excitation prediction: frame by frame, first find the target LSF parameter closest to the converted LSF parameter, then linearly superpose the residual signal corresponding to that target LSF parameter with the frame's random signal remaining after HNM analysis, and take the result as the residual excitation signal;
Step 7: speech synthesis and residual compensation: from the converted LSF parameters and the residual excitation signal obtained in steps 5 and 6, synthesize each frame of converted speech with the LPC synthesis model; then superpose on each synthesized frame a moderate amount of the corresponding target residual signal, and finally obtain the synthesized speech after overlap-add.
2. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the detailed process of pre-processing and voiced/unvoiced decision is as follows:
Step 1: pre-process the source and target speech signals respectively, with a pre-emphasis factor of 0.96 and a frame length of 20 ms with no overlap, then apply a Hamming window;
Step 2: compute frame by frame the short-time energy E_i = Σ_{m=0}^{N−1} x_i²(m) and the short-time zero-crossing rate Z_i = (1/2) Σ_{m=0}^{N−1} |sgn[x_i(m)] − sgn[x_i(m−1)]|, where x_i(m) is the i-th windowed speech frame and N is the frame length; then apply the double-threshold method to make the voiced/unvoiced decision.
3. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the harmonic parameters are extracted as follows:
Step 1: compute the fundamental frequency f_0 of the current frame of the source and target speech respectively with the normalized cross-correlation method;
Step 2: analyze the source and target speech respectively; if the current frame s(n), where 1 ≤ n ≤ N and N is the frame length, is voiced, decompose it into a harmonic component s_h(n) and a random component e(n). First determine the number of harmonics L = ⌊F_s/(2f_0)⌋, where F_s is the sampling frequency; the objective function is C_l = argmin Σ_{n=−N/2}^{N/2} w²(n)[s(n) − s_h(n)]², where w(n) denotes a Hamming window. The complex amplitudes {C_l, l = −L, −L+1, …, L} are estimated under the least-squares criterion; the real amplitudes of the harmonic components are expressed as A_l = 2|C_l| = 2|C_{−l}|, and the real phases as φ_l = arg(C_l). Then interpolate {A_l} and {φ_l} between two adjacent frames into the time-varying {A_l(n)} and {φ_l(n)}, and likewise linearly interpolate the harmonic number L into the time-varying {L(n)}; the remaining random signal is then expressed as e(n) = s(n) − s_h(n), with s_h(n) = Σ_{l=1}^{L(n)} A_l(n) cos(φ_l(n));
Step 3: if the current frame is unvoiced, it is regarded as random noise; since unvoiced speech carries little speaker information, the unvoiced signal is kept unchanged.
4. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the vocal tract spectrum parameters are computed frame by frame as follows:
Step 1: compute the squares of the L discrete amplitude values A_l and take them as the samples P(ω_l) of the discrete power spectrum, where ω_l = 2πlf_0 is the l-th harmonic angular frequency;
Step 2: apply the IFFT to P(ω_l) to obtain the autocorrelation coefficients R(n); obtain the P-th order LPC coefficients {a_j, j = 1, 2, …, P} via the Levinson-Durbin algorithm, and further convert them into LSF parameters;
Step 3: construct the linear prediction inverse filter from the LPC coefficients; its transfer function is A(z) = 1 + Σ_{j=1}^{P} a_j z^{−j}, and passing the speech through A(z) yields the residual signal of this frame's LPC analysis.
5. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the vocal tract spectrum conversion rule is established as follows:
Step 1: time-align the LSF parameters extracted from the harmonics of the voiced frames of the source and target speech with DTW, and record the indices of the aligned LSF vectors returned by DTW;
Step 2: according to these indices, align the residual signals of the source voiced-frame harmonics with those of the target; likewise align the random signals remaining after HNM analysis of the source and target voiced frames;
Step 3: combine the aligned source and target LSF parameters into joint parameter vectors, feed them into the GMM, and establish the vocal tract spectrum transfer function.
6. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the feature parameters are converted as follows:
Step 1: pre-process and frame the speech signal to be converted, extract the harmonic parameters by HNM analysis, compute the vocal tract spectrum parameters, and convert them into LSF parameters;
Step 2: convert each frame's LSF parameters with the trained GMM rule to obtain the converted LSF parameters.
7. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the residual excitation is predicted as follows:
Step 1: for each converted frame's LSF parameter, find the closest target LSF parameter, and determine the residual signal corresponding to that target LSF parameter together with the frame's remaining random signal from HNM analysis;
Step 2: linearly superpose the target residual signal and the remaining random signal from HNM analysis as the residual excitation signal.
8. The residual excitation signal synthesis and voice conversion method based on the harmonic plus noise model according to claim 1, characterized in that the detailed process of speech synthesis and residual compensation is as follows:
In the first step, the converted LSF parameters are transformed into LPC coefficients, a filter is built frame by frame from the LPC coefficients, and the predicted residual excitation signal is passed through this filter to obtain the converted speech;
In the second step, the corresponding target residual signal is superimposed once more onto each frame of the converted speech signal, and the frames are spliced together to obtain the final synthesized speech.
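Assuming each frame's converted LSF vector has already been turned into LPC coefficients [1, a1, ..., ap] (an lsf2poly-style conversion, not shown here), the filtering and compensation steps might look like this sketch; `synthesize` is an illustrative name.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(lpc_per_frame, excitation_per_frame, target_residuals):
    """Frame-by-frame all-pole synthesis filtering, then residual compensation."""
    frames = []
    for a, e, r in zip(lpc_per_frame, excitation_per_frame, target_residuals):
        speech = lfilter([1.0], a, e)  # synthesis filter 1/A(z) driven by the excitation
        frames.append(speech + r)      # superimpose the target residual once more
    return np.concatenate(frames)      # splice the frames into the final speech
```

A real system would overlap-add windowed frames rather than butt-joining them; plain concatenation is used here only to mirror the claim's wording.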
CN2012101218866A 2012-04-24 2012-04-24 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM) Expired - Fee Related CN102664003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101218866A CN102664003B (en) 2012-04-24 2012-04-24 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)


Publications (2)

Publication Number Publication Date
CN102664003A true CN102664003A (en) 2012-09-12
CN102664003B CN102664003B (en) 2013-12-04

Family

ID=46773469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012101218866A Expired - Fee Related CN102664003B (en) 2012-04-24 2012-04-24 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)

Country Status (1)

Country Link
CN (1) CN102664003B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004088633A1 (en) * 2003-03-27 2004-10-14 France Telecom Method for analyzing fundamental frequency information and voice conversion method and system implementing said analysis method
TW201001396A (en) * 2008-06-26 2010-01-01 Univ Nat Taiwan Science Tech Method for synthesizing speech
CN102063899A (en) * 2010-10-27 2011-05-18 Nanjing University of Posts and Telecommunications Method for voice conversion under non-parallel text condition


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WINSTON S. PERCYBROOKS et al.: "Voice Conversion with Linear Prediction Residual Estimation", IEEE International Conference on Acoustics, Speech and Signal Processing, 4 April 2008 (2008-04-04), pages 4673-4676 *
YI LIFU et al.: "A Chinese Speech Synthesis System Based on the HNM Algorithm", Proceedings of the 6th National Conference on Modern Phonetics (Part II), 20 October 2003 (2003-10-20), pages 528-533 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10347275B2 (en) 2013-09-09 2019-07-09 Huawei Technologies Co., Ltd. Unvoiced/voiced decision for speech processing
CN105359211A (en) * 2013-09-09 2016-02-24 华为技术有限公司 Unvoiced/voiced decision for speech processing
CN103489443A (en) * 2013-09-17 2014-01-01 湖南大学 Method and device for imitating sound
US10734003B2 (en) 2014-04-08 2020-08-04 Huawei Technologies Co., Ltd. Noise signal processing method, noise signal generation method, encoder, decoder, and encoding and decoding system
US9728195B2 (en) 2014-04-08 2017-08-08 Huawei Technologies Co., Ltd. Noise signal processing method, noise signal generation method, encoder, decoder, and encoding and decoding system
US10134406B2 (en) 2014-04-08 2018-11-20 Huawei Technologies Co., Ltd. Noise signal processing method, noise signal generation method, encoder, decoder, and encoding and decoding system
WO2015154397A1 (en) * 2014-04-08 2015-10-15 华为技术有限公司 Noise signal processing and generation method, encoder/decoder and encoding/decoding system
CN106486129A (en) * 2014-06-27 2017-03-08 Audio coding method and apparatus
US11133016B2 (en) 2014-06-27 2021-09-28 Huawei Technologies Co., Ltd. Audio coding method and apparatus
US10460741B2 (en) 2014-06-27 2019-10-29 Huawei Technologies Co., Ltd. Audio coding method and apparatus
CN106486129B (en) * 2014-06-27 2019-10-25 华为技术有限公司 A kind of audio coding method and device
CN106098073A (en) * 2016-05-23 2016-11-09 End-to-end speech encryption and decryption system based on spectrum mapping
CN106656882A (en) * 2016-11-29 2017-05-10 中国科学院声学研究所 Signal synthesizing method and system
CN106656882B (en) * 2016-11-29 2019-05-10 中国科学院声学研究所 A kind of signal synthesis method and system
CN107134277A (en) * 2017-06-15 2017-09-05 Voice activity detection method based on a GMM model
CN111418005A (en) * 2017-11-29 2020-07-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program
CN111418005B (en) * 2017-11-29 2023-08-11 雅马哈株式会社 Voice synthesis method, voice synthesis device and storage medium
CN108281150A (en) * 2018-01-29 2018-07-13 Voice breaking and voice change method based on a derivative glottal flow model
CN108510991A (en) * 2018-03-30 2018-09-07 Speaker recognition method using harmonic series
CN108766450B (en) * 2018-04-16 2023-02-17 Voice conversion method based on harmonic impulse decomposition
CN108766450A (en) * 2018-04-16 2018-11-06 Voice conversion method based on harmonic impulse decomposition
CN108899008A (en) * 2018-06-13 2018-11-27 Method and system for simulating noise interference in air voice communication
CN108899008B (en) * 2018-06-13 2023-04-18 中国人民解放军91977部队 Method and system for simulating interference of noise in air voice communication
CN109065068A (en) * 2018-08-17 2018-12-21 Audio processing method, device and storage medium
CN109003621A (en) * 2018-09-06 2018-12-14 Audio processing method, device and storage medium
CN109712634A (en) * 2018-12-24 2019-05-03 Automatic voice conversion method
CN110444192A (en) * 2019-08-15 2019-11-12 Intelligent voice robot based on voice technology
CN113241089A (en) * 2021-04-16 2021-08-10 维沃移动通信有限公司 Voice signal enhancement method and device and electronic equipment
CN113241089B (en) * 2021-04-16 2024-02-23 维沃移动通信有限公司 Voice signal enhancement method and device and electronic equipment

Also Published As

Publication number Publication date
CN102664003B (en) 2013-12-04

Similar Documents

Publication Publication Date Title
CN102664003B (en) Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
Dave Feature extraction methods LPC, PLP and MFCC in speech recognition
US20150025892A1 (en) Method and system for template-based personalized singing synthesis
Song et al. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems
CN1815552B (en) Spectrum modeling and speech enhancement method based on line spectral frequencies and their inter-order difference parameters
CN101527141B (en) Method of converting whispered voice into normal voice based on radial basis function neural network
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN102184731A (en) Method for converting emotional speech by combining rhythm parameters with tone parameters
CN103021418A (en) Voice conversion method facing to multi-time scale prosodic features
CN103714822B (en) Sub-band coding and decoding method and device based on SILK coder decoder
CN106782599A (en) Voice conversion method based on Gaussian process with output post-filtering
Erro et al. MFCC+ F0 extraction and waveform reconstruction using HNM: preliminary results in an HMM-based synthesizer
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
CN105654941A (en) Voice change method and device based on specific target person voice change ratio parameter
CN103854655B (en) A kind of low bit-rate speech coder and decoder
CN103886859B (en) Voice conversion method based on one-to-many codebook mapping
CN102231275B (en) Embedded speech synthesis method based on weighted mixed excitation
Raju et al. Application of prosody modification for speech recognition in different emotion conditions
Othmane et al. Enhancement of esophageal speech using voice conversion techniques
CN115862590A (en) Text-driven speech synthesis method based on characteristic pyramid
Xie et al. Pitch transformation in neural network based voice conversion
CN114913844A (en) Broadcast language identification method for pitch normalization reconstruction
Gentet et al. Neutral to lombard speech conversion with deep learning
Jung et al. Pitch alteration technique in speech synthesis system
Kawahara et al. Beyond bandlimited sampling of speech spectral envelope imposed by the harmonic structure of voiced sounds.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131204

Termination date: 20160424

CF01 Termination of patent right due to non-payment of annual fee