CN102231275B - Embedded speech synthesis method based on weighted mixed excitation - Google Patents


Info

Publication number
CN102231275B
CN102231275B
Authority
CN
China
Prior art keywords
periodic
composition
synthetic
coefficient
excitation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011101454794A
Other languages
Chinese (zh)
Other versions
CN102231275A (en)
Inventor
王朝民 (Wang Chaomin)
那兴宇 (Na Xingyu)
谢湘 (Xie Xiang)
何娅玲 (He Yaling)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING YUYIN TIANXIA TECHNOLOGY CO LTD
Zhuhai Hi-tech Angel Venture Capital Co.,Ltd.
Original Assignee
BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd filed Critical BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd
Priority to CN2011101454794A priority Critical patent/CN102231275B/en
Publication of CN102231275A publication Critical patent/CN102231275A/en
Application granted granted Critical
Publication of CN102231275B publication Critical patent/CN102231275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an embedded speech synthesis method based on weighted mixed excitation, used by an embedded operating system to convert any received text into speech output. The method comprises the following steps: at the training end, extracting STRAIGHT (adaptive interpolation of weighted spectrum) spectral coefficients, the fundamental frequency, and aperiodic components from the speech signal; and at the synthesis end, constructing a mixed excitation from the fundamental frequency and the aperiodic components and producing synthetic speech through a conventional parametric synthesizer. By replacing the original binary excitation with the mixed excitation at the synthesis end, the method keeps computational cost low while improving the naturalness and quality of the synthetic speech, achieving an effect close to that of the STRAIGHT synthesizer.

Description

An embedded speech synthesis method based on weighted mixed excitation
Technical field
The present invention relates generally to an embedded speech synthesis method based on STRAIGHT coefficients, and in particular to terminal devices with limited storage and computational resources.
Background art
With the rapid development of the mobile Internet and the Internet of Things, embedded devices such as mobile phones and e-book readers have gradually become people's most direct means of daily information acquisition and processing, and speech is the most natural and direct means of interaction. The development of embedded speech synthesis technology is therefore the trend of the times and meets an urgent market demand.
The aim of speech synthesis is to faithfully reproduce the human voice, that is, to let a machine imitate characteristics such as human timbre, speaking style, and prosody. Traditional speech synthesis was built on concatenative methods over large-scale corpora; the technique is simple, the synthesized quality is high, and it was once widely adopted. However, the unit database required by this method is large; although its footprint can be reduced by clustering, coding, compression, and similar techniques, the sound quality suffers and flexibility declines. Statistical parametric synthesis based on large-scale corpora has therefore been widely studied in recent years. Its basic idea is to parameterize and statistically model a large raw speech corpus; at synthesis time, a model sequence is selected according to given rules, the parameter sequence of the target utterance is computed, and speech meeting the requirements is generated by a parametric synthesis method. Speech synthesized by statistical parametric modeling has high naturalness and intelligibility; the approach most widely studied and adopted at present is HMM-based speech synthesis. The choice of speech feature parameters largely determines the quality of the synthetic speech; the features generally include excitation-source parameters and vocal-tract spectral parameters. Vocal-tract spectral coefficients are usually extracted from the short-time Fourier transform (STFT) spectrum, and synthesis can be completed directly at the synthesis end by a conventional parametric synthesizer (such as a cepstral filter or a linear prediction filter) with good quality. The STRAIGHT speech analysis-synthesis algorithm proposed in recent years removes the periodicity present in the time and frequency domains of the STFT spectrum to obtain a smooth spectrum free of periodic disturbance, from which more natural, higher-quality speech can be synthesized. Simply substituting the STRAIGHT spectrum for the original FFT spectrum as the spectral feature already improves the quality and naturalness of the synthetic speech markedly, but using a plain binary excitation does not exploit the full advantage of the STRAIGHT algorithm: its aperiodic component is the key to synthesizing high-quality, highly natural speech, and is the main path to further gains in quality and naturalness.
What is needed, therefore, is a method that realizes a parametric speech synthesis system with a small computational footprint on an embedded platform, one that not only uses the STRAIGHT spectral features but also, by making proper use of the aperiodic component of the STRAIGHT algorithm, brings the quality of the synthetic speech close to that of STRAIGHT-synthesized speech.
Summary of the invention
The technical problem to be solved by this invention is to add the aperiodic component of STRAIGHT, in the form of a mixed excitation, to the excitation source of the synthesizer on a low-complexity basis, improving on the original binary excitation so that the generated speech has quality and naturalness closer to STRAIGHT-synthesized speech.
To achieve the above object, this document provides an embedded speech synthesis method based on weighted mixed excitation, used by an embedded operating system to convert any received text into speech output. At the synthesis end, the original binary excitation is replaced by a mixed excitation, which improves the naturalness and quality of the synthetic speech while keeping the computational cost low, approaching the effect of the STRAIGHT synthesizer. A speech synthesis system using the method is divided into the following two parts:
A. Training part: first extract the STRAIGHT spectrum, fundamental frequency, and aperiodic component from the speech signal; then extract vocal-tract spectral feature coefficients from the STRAIGHT spectrum and average the aperiodic component within 5 frequency bands; finally model and train the feature coefficients with HTS.
B. Synthesis part: after the feature coefficient sequences are computed from the models, the synthetic speech is obtained through the weighted mixed excitation built from the aperiodic component and a conventional parametric synthesizer.
In the embedded speech synthesis method based on STRAIGHT coefficients described above, the extraction of the feature coefficient sequences at the training end is divided into the following five steps:
A. Extract parameters from each speech signal in the training speech database: fundamental frequency, gain, STRAIGHT spectrum, and aperiodic component.
B. Extract the vocal-tract spectral feature coefficients from the obtained STRAIGHT spectrum.
C. Combine the gain with the vocal-tract spectral feature coefficients to form new vocal-tract spectral feature coefficients.
D. Divide the aperiodic component into five frequency bands (0~1 kHz, 1~2 kHz, 2~4 kHz, 4~6 kHz, and 6~8 kHz), then average the aperiodic component within each band, obtaining one aperiodic-component weight per band; use these 5 weights as part of the feature parameter sequence. The system adopts the 16 kHz sampling rate commonly used in embedded systems.
E. Use the fundamental frequency, the new vocal-tract spectral coefficients, and the band-wise aperiodic-component weights together as the feature parameter sequence for HMM model training.
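The feature assembly in step E can be sketched as follows. This is a hypothetical illustration: the LSP order (24), the use of raw f0 rather than log f0, and the vector layout are assumptions for the sketch, not taken from the patent.

```python
import numpy as np

def assemble_frame_features(f0, lsp, gain, ap_weights):
    """Concatenate one frame's parameters into a single observation
    vector for HMM training (hypothetical layout)."""
    lsp = np.array(lsp, dtype=float)
    lsp[0] = gain                        # gain replaces the 0th LSP dimension (step C)
    ap_weights = np.asarray(ap_weights, dtype=float)
    if ap_weights.shape != (5,):
        raise ValueError("expected one aperiodicity weight per band (5 bands)")
    return np.concatenate(([f0], lsp, ap_weights))

vec = assemble_frame_features(120.0, np.zeros(24), -1.5, np.full(5, 0.3))
print(vec.shape)  # (30,) = 1 (f0) + 24 (LSP with gain) + 5 (band weights)
```

In practice the HMM observation would also carry delta and delta-delta features, which are omitted here for brevity.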
In the embedded speech synthesis method based on STRAIGHT coefficients described above, the synthesis process at the synthesis end is divided into the following three steps:
A. Generate the fundamental frequency, vocal-tract spectral coefficient, and aperiodic-component weight sequences from the models by the parameter generation algorithm.
B. Generate the excitation source of the synthetic speech from the fundamental frequency and the aperiodic-component weight sequence, using the mixed excitation model.
C. Pass the excitation source and the vocal-tract spectral coefficient sequence through the conventional parametric synthesizer to obtain the synthetic speech.
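Step C amounts to frame-wise all-pole (LPC) filtering of the excitation. The following is a minimal sketch under assumed conventions (one coefficient row `[1, a1, ..., ap]` per frame, an 80-sample frame), not the patent's implementation:

```python
import numpy as np

def lpc_synthesize(excitation, lpc_frames, frame_len=80):
    """Direct-form all-pole synthesis: y[n] = x[n] - sum_k a[k] * y[n-k],
    selecting the LPC coefficient row of the frame that sample n falls in."""
    lpc_frames = np.asarray(lpc_frames, dtype=float)
    order = lpc_frames.shape[1] - 1
    y = np.zeros(len(excitation) + order)      # leading zeros = initial filter memory
    for n in range(len(excitation)):
        a = lpc_frames[min(n // frame_len, len(lpc_frames) - 1)]
        past = y[n + order - 1::-1][:order]    # y[n-1], y[n-2], ..., y[n-order]
        y[n + order] = excitation[n] - np.dot(a[1:], past)
    return y[order:]
```

Because the recursion carries state across frame boundaries, coefficient changes between frames do not introduce discontinuities; a production synthesizer would typically also interpolate coefficients between frames.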
The present invention is further described below in conjunction with the drawings and embodiments; the detailed description of each building block of the system will better explain the steps and processes of the invention.
Description of drawings
Figure 1 is a block diagram of the HMM-based speech synthesis system
Figure 2 is a schematic diagram of system feature parameter extraction
Figure 3 is a block diagram of the aperiodic-component weighted mixed excitation speech synthesizer
In the figures: 1. speech corpus database; 2. excitation-source parameter extraction; 3. HMM model training; 4. HMM model set; 5. parameter generation from HMM models; 6. text analysis; 7. excitation-source generation; 8. synthesis filtering; 9. vocal-tract spectral parameter extraction; 10. speech signal; 11. excitation-source parameters; 12. vocal-tract spectral parameters; 13. synthetic speech; 14. synthesis text; 15. training part; 16. synthesis part; 17. labeled text; 18. training-end feature parameter extraction; 19. speech signal data; 20. TANDEM-STRAIGHT analysis; 21. STRAIGHT spectrum; 22. LSP coefficients; 23. new LSP coefficients; 24. gain; 25. fundamental frequency; 26. aperiodic component; 27. averaging over 5 bands; 28. band-wise aperiodic-component weights; 29. lsp[0]; 27. lsp2lpc; 28. LPC filter; 29. synthesis-end parameter synthesis filtering; 30. synthesis-end parameter synthesis filtering; 31. lsp2lpc; 32. mixed excitation; 33. weighting; 34. aperiodic weights; 35. pulse train; 36. white noise.
Embodiment
As shown in Figure 1, in an embodiment of the invention the speech synthesis system is deployed on an embedded operating system and comprises a training end and a synthesis end. The model training part is used only offline, solely to generate the compact model library needed when the synthesis system runs; the synthesis part runs on the chip. Because the invention focuses on parameter extraction and synthesis, and text labeling, text analysis, modeling, training, and parameter generation are not its focus, the following highlights parameter extraction and reconstruction at the training end and the generation of the mixed excitation at the synthesis end. This embodiment selects LSP coefficients (22) as the vocal-tract spectral parameters and an LPC filter (28) as the synthesis filter; the speech data are sampled at 16 kHz.
Feature parameter extraction (18) at the training end:
Step 1: perform temporally stable power spectrum estimation on the training speech data (the TANDEM-STRAIGHT algorithm) to obtain the fundamental frequency (25), STRAIGHT spectrum (21), gain (24), and aperiodic component (26).
Step 2: extract LPC coefficients from the STRAIGHT spectrum (21) using a generalized cepstral analysis algorithm, converting the spectral coefficients with the concept of mel generalized cepstral analysis; then convert the resulting LPC coefficients into LSP coefficients (22).
Step 3: replace the 0th LSP dimension with the gain parameter to generate the new LSP vocal-tract spectral coefficients.
Step 4: obtain the aperiodic component (26) from TANDEM-STRAIGHT analysis (20), then divide it into five bands along the frequency axis; for 16 kHz speech the bands are 0~1000 Hz, 1000~2000 Hz, 2000~4000 Hz, 4000~6000 Hz, and 6000~8000 Hz. Average the aperiodic component within each band and use that value as the band's aperiodicity weight, so the aperiodic component of each speech frame is reduced to 5 coefficients.
Step 5: use the new LSP vocal-tract spectrum, the fundamental frequency (25), and the aperiodic-component (26) weights together as the feature parameters of the speech signal for HMM model training (3).
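The band averaging of Step 4 can be sketched as follows, assuming the per-frame aperiodicity is given as one value per FFT bin spanning 0 to 8000 Hz; the bin count and the half-open band edges are illustrative assumptions, not details from the patent.

```python
import numpy as np

# Five analysis bands for 16 kHz speech, as listed in Step 4.
BANDS_HZ = [(0, 1000), (1000, 2000), (2000, 4000), (4000, 6000), (6000, 8000)]

def band_aperiodicity_weights(ap_spectrum, fs=16000):
    """Average one frame's aperiodicity spectrum within each band,
    reducing it to the 5 weights used as HMM features."""
    ap_spectrum = np.asarray(ap_spectrum, dtype=float)
    freqs = np.linspace(0.0, fs / 2, len(ap_spectrum))   # bin center frequencies
    return np.array([ap_spectrum[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in BANDS_HZ])

w = band_aperiodicity_weights(np.full(513, 0.4))
print(w)  # a flat aperiodicity spectrum yields five equal weights of 0.4
```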
Generation of the mixed excitation at the synthesis end (see Fig. 3):
Step 1: the fundamental frequency (25) controls the generation of the pulse train (35) and the white Gaussian noise (36).
Step 2: the aperiodic-component (26) weights control the weighted mixing of the pulse train (35) and the white Gaussian noise (36) to obtain the mixed excitation (32).
Step 3: pass the mixed excitation (32) through the MLSA filter controlled by the vocal-tract parameters, then generate the final synthetic speech (13) waveform through the PSOLA filter.
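Steps 1 and 2 can be sketched as follows. For brevity this illustration mixes a unit-power pulse train and white Gaussian noise with a single frame-level weight; the patent applies the five band-wise weights, which would additionally require band-pass filtering each component before summing. The frame length and the noise-only fallback for unvoiced frames are assumptions for the sketch.

```python
import numpy as np

def mixed_excitation_frame(f0, ap_weight, frame_len=80, fs=16000):
    """One frame of weighted mixed excitation: pulse train (35) scaled by
    (1 - w) plus white Gaussian noise (36) scaled by w, where w is the
    frame's aperiodicity weight."""
    noise = np.random.randn(frame_len)
    if f0 <= 0:                                  # unvoiced frame: noise only
        return noise
    period = max(1, int(round(fs / f0)))         # pitch period in samples
    pulses = np.zeros(frame_len)
    pulses[::period] = np.sqrt(period)           # unit-power pulse train
    # weighted mix (32): more aperiodicity -> more noise, less pulse energy
    return (1.0 - ap_weight) * pulses + ap_weight * noise
```

With `ap_weight` near 0 the excitation is almost purely periodic (strongly voiced); near 1 it degenerates to noise, recovering the behavior of the original binary excitation at the two extremes.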
The above example is a preferred embodiment of the invention. The vocal-tract spectral parameters (12) could instead be MGC, with the MLSA filter as the corresponding synthesis filter, to equally good effect; but the MLSA filter demands more computing power than the LPC filter, so on embedded devices the LSP coefficients (22) are the better choice.
When the invention is used on an embedded device, all audio input and output can use the I/O interfaces provided by the device itself. The speech function can be enabled or disabled on the device at any time; when it is disabled, the device's original functions are unaffected.
The invention can be applied to all kinds of embedded terminal devices. Following the main design of the invention, those of ordinary skill in the art can produce many similar or equivalent applications. The protection of the invention shall therefore be determined by the scope of the claims.

Claims (2)

1. An embedded speech synthesis method based on weighted mixed excitation, used by an embedded operating system to convert any received text into speech output, in which the original binary excitation is replaced at the synthesis end by a mixed excitation, improving the naturalness and quality of the synthetic speech while keeping the computational cost low; the method comprising the following steps:
A. Training: first extracting the STRAIGHT spectrum, fundamental frequency, and aperiodic component from the speech signal; then extracting vocal-tract spectral feature coefficients from the STRAIGHT spectrum and averaging the aperiodic component within 5 frequency bands; then modeling and training the feature coefficients with HTS to obtain the models;
Said step A is divided into:
A1. extracting parameters from each speech signal in the training speech database: fundamental frequency, gain, STRAIGHT spectrum, and aperiodic component;
A2. extracting the vocal-tract spectral feature coefficients from the obtained STRAIGHT spectrum;
A3. combining the gain with the vocal-tract spectral feature coefficients into new vocal-tract spectral feature coefficients;
A4. dividing the aperiodic component into five frequency bands (0~1 kHz, 1~2 kHz, 2~4 kHz, 4~6 kHz, and 6~8 kHz), then averaging the aperiodic component within each band to obtain one aperiodic-component weight per band, and using these 5 weights as part of the feature parameter sequence; the system adopts the 16 kHz sampling rate commonly used in embedded systems;
A5. using the fundamental frequency, the new vocal-tract spectral coefficients, and the band-wise aperiodic-component weights together as the feature parameter sequence for HMM model training;
B. Synthesis: after the feature coefficient sequences are computed from said models, obtaining the synthetic speech through the weighted mixed excitation built from the aperiodic component and a conventional parametric synthesizer; said conventional parametric synthesizer is an MLSA filter and/or a PSOLA filter.
2. The embedded speech synthesis method based on weighted mixed excitation according to claim 1, characterized in that said step B is divided into:
B1. generating the fundamental frequency, vocal-tract spectral coefficient, and aperiodic-component weight sequences from said models by the parameter generation algorithm;
B2. generating the excitation source of the synthetic speech from the fundamental frequency and the aperiodic-component weight sequence, using the mixed excitation model;
B3. passing the excitation source and the vocal-tract spectral coefficient sequence through the conventional parametric synthesizer to obtain the synthetic speech.
CN2011101454794A 2011-06-01 2011-06-01 Embedded speech synthesis method based on weighted mixed excitation Active CN102231275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101454794A CN102231275B (en) 2011-06-01 2011-06-01 Embedded speech synthesis method based on weighted mixed excitation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101454794A CN102231275B (en) 2011-06-01 2011-06-01 Embedded speech synthesis method based on weighted mixed excitation

Publications (2)

Publication Number Publication Date
CN102231275A CN102231275A (en) 2011-11-02
CN102231275B true CN102231275B (en) 2013-10-16

Family

ID=44843835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101454794A Active CN102231275B (en) 2011-06-01 2011-06-01 Embedded speech synthesis method based on weighted mixed excitation

Country Status (1)

Country Link
CN (1) CN102231275B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104282300A (en) * 2013-07-05 2015-01-14 中国移动通信集团公司 Non-periodic component syllable model building and speech synthesizing method and device
EP3363015A4 (en) * 2015-10-06 2019-06-12 Interactive Intelligence Group, Inc. Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method
CN108184032B (en) * 2016-12-07 2020-02-21 中国移动通信有限公司研究院 Service method and device of customer service system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3515039B2 (en) * 2000-03-03 2004-04-05 沖電気工業株式会社 Pitch pattern control method in text-to-speech converter
CN1815552B (en) * 2006-02-28 2010-05-12 安徽中科大讯飞信息科技有限公司 Frequency spectrum modelling and voice reinforcing method based on line spectrum frequency and its interorder differential parameter
CN101950559A (en) * 2010-07-05 2011-01-19 李华东 Method for synthesizing continuous speech with large vocabulary and terminal equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
* Zhang Zhengjun (张正军), Yang Weiying (杨卫英), Chen Zan (陈赞). 基于STRAIGHT模型和人工神经网络的语音转换 (Voice conversion based on the STRAIGHT model and artificial neural networks). 电声技术 (Audio Engineering), 2010. *

Also Published As

Publication number Publication date
CN102231275A (en) 2011-11-02


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: ZHUHAI YUYIN TIANXIA TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: BEIJING YUYIN TIANXIA TECHNOLOGY CO., LTD.

Effective date: 20140708

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100085 HAIDIAN, BEIJING TO: 519000 ZHUHAI, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140708

Address after: A1013, Block A, Pioneering Building, Tsinghua Science Park (Zhuhai), No. 101 University Road, Tangjiawan Town, Zhuhai High-tech Zone, Guangdong Province 519000

Patentee after: Zhuhai Yu World Technology Co.,Ltd.

Address before: Room 915, No. 15 Information Road, Haidian District, Beijing 100085

Patentee before: BEIJING YUYIN TIANXIA TECHNOLOGY Co.,Ltd.

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20170105

Address after: 406, North Block, Yuanxing Technology Building, North Area of the Science and Technology Park, Nanshan District, Shenzhen, Guangdong 518057

Patentee after: SHENZHEN AVSNEST TECHNOLOGY CO.,LTD.

Address before: Room 915, Finance and Trade Building, No. 15 Information Road, Haidian District, Beijing 100085

Patentee before: BEIJING YUYIN TIANXIA TECHNOLOGY Co.,Ltd.

Effective date of registration: 20170105

Address after: Room 915, Finance and Trade Building, No. 15 Information Road, Haidian District, Beijing 100085

Patentee after: BEIJING YUYIN TIANXIA TECHNOLOGY Co.,Ltd.

Address before: A1013, Block A, Pioneering Building, Tsinghua Science Park (Zhuhai), No. 101 University Road, Tangjiawan Town, Zhuhai High-tech Zone, Guangdong Province 519000

Patentee before: Zhuhai Yu World Technology Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20181023

Address after: A1013, Block A, Pioneering Building, Tsinghua Science Park (Zhuhai), No. 101 University Road, Tangjiawan Town, Zhuhai High-tech Zone, Guangdong Province 519000

Patentee after: Zhuhai Yu World Technology Co.,Ltd.

Address before: 406, North Block, Yuanxing Technology Building, North Area of the Science and Technology Park, Nanshan District, Shenzhen, Guangdong 518057

Patentee before: SHENZHEN AVSNEST TECHNOLOGY CO.,LTD.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190104

Address after: Room 915, Finance and Trade Building, No. 15 Information Road, Haidian District, Beijing 100085

Co-patentee after: Zhuhai Hi-tech Angel Venture Capital Co.,Ltd.

Patentee after: BEIJING YUYIN TIANXIA TECHNOLOGY Co.,Ltd.

Address before: A1013, Block A, Pioneering Building, Tsinghua Science Park (Zhuhai), No. 101 University Road, Tangjiawan Town, Zhuhai High-tech Zone, Guangdong Province 519000

Patentee before: Zhuhai Yu World Technology Co.,Ltd.