CN102214463A - Imbedded voice synthesis method based on adaptive weighted spectrum interpolation coefficient - Google Patents

Imbedded voice synthesis method based on adaptive weighted spectrum interpolation coefficient

Info

Publication number
CN102214463A
Authority
CN
China
Prior art keywords
spectrum
coefficient
STRAIGHT
synthesis
vocal tract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110145478XA
Other languages
Chinese (zh)
Inventor
王朝民
那兴宇
谢湘
何娅玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd
Original Assignee
BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd filed Critical BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd
Priority to CN201110145478XA priority Critical patent/CN102214463A/en
Publication of CN102214463A publication Critical patent/CN102214463A/en
Pending legal-status Critical Current


Abstract

The invention discloses an embedded speech synthesis method based on adaptive weighted spectrum interpolation (STRAIGHT) coefficients, used in an embedded operating system to convert arbitrary received text into speech output. The method comprises the following steps: at the training end, extracting the pitch-adaptive weighted spectrum interpolation (STRAIGHT) spectrum from the speech signal, extracting vocal tract spectral feature coefficients from the STRAIGHT spectrum, and modeling and training the feature coefficients with HTS; at the synthesis end, after the feature coefficient sequence is generated from the models, producing the synthetic speech with a conventional parametric synthesizer. With this method, synthetic speech quality comparable to that of the STRAIGHT synthesizer is obtained, while the STRAIGHT synthesizer is replaced at the synthesis end by a conventional parametric synthesizer, greatly increasing synthesis speed and making embedded application feasible.

Description

An embedded speech synthesis method based on adaptive weighted spectrum interpolation coefficients
Technical field
The present invention relates generally to an embedded speech synthesis method based on adaptive weighted spectrum interpolation coefficients, especially for terminal devices with limited storage and computational resources.
Background technology
With the rapid growth of the mobile Internet and the Internet of Things, embedded devices such as mobile phones and e-book terminals are steadily becoming people's most direct channels for obtaining and processing daily information, and speech is the most direct and natural means of interaction. The development of embedded speech synthesis technology is therefore an inevitable trend with urgent market demand.
The aim of speech synthesis technology is to faithfully reproduce the human voice, that is, to let machines imitate characteristics such as human timbre, speaking style, and prosody. Traditional speech synthesis was built on concatenative methods over large-scale corpora; the approach is technically simple, yields high-quality output, and was once widely adopted. However, its unit database is large, and although the footprint can be reduced through clustering, coding, and compression, the sound quality suffers and flexibility declines. In recent years, statistical parametric synthesis based on large-scale corpora has therefore been widely studied. Its basic idea is to represent a large raw speech database parametrically and model it statistically; at synthesis time, models are selected according to specified rules to form a model sequence, the parameter sequence of the target utterance is computed, and speech that meets the requirements is generated by a parametric synthesis method. Speech synthesized by statistical parametric modeling has high naturalness and intelligibility, and HMM-based speech synthesis is currently the most widely studied and adopted variant. The choice of speech feature parameters determines the quality of the synthetic speech to a large degree; the features generally include excitation parameters and vocal tract spectrum parameters. Vocal tract spectral coefficients are usually extracted from the short-time Fourier transform (STFT) spectrum, and synthesis can then be completed directly by a conventional parametric synthesizer (such as a cepstral filter or a linear prediction filter) at the synthesis end with good quality. The STRAIGHT (adaptive weighted spectrum interpolation) speech analysis-synthesis algorithm proposed in recent years removes the time-domain and frequency-domain periodicity from the STFT spectrum to obtain a smooth spectrum free of periodic disturbance, and can synthesize more natural, higher-quality speech. However, the STRAIGHT synthesizer performs vocal tract filtering by spectral convolution, which is computationally very expensive; for terminal devices with limited computation and storage it cannot meet practical requirements.
Therefore, an improved method is needed that can realize a parametric speech synthesis system with low computational cost on an embedded platform, while approaching the quality of STRAIGHT-synthesized speech.
Summary of the invention
The technical problem to be solved by this invention is to provide an embedded speech synthesis method based on adaptive weighted spectrum interpolation coefficients, which realizes a parametric speech synthesis system with low computational cost on an embedded platform while approaching the quality of STRAIGHT-synthesized speech.
To achieve the above object, the present invention provides an embedded speech synthesis method based on adaptive weighted spectrum interpolation (STRAIGHT spectrum) coefficients, used in an embedded operating system to convert arbitrary received text into speech output. It obtains synthetic speech quality comparable to that of the STRAIGHT synthesizer, while replacing the STRAIGHT synthesizer at the synthesis end with a conventional parametric synthesizer, greatly increasing synthesis speed and making embedded application feasible. A speech synthesis system using this method is divided into the following two parts:
A. Training part: first extract the STRAIGHT spectrum from the speech signal, then extract vocal tract spectral feature coefficients from the STRAIGHT spectrum, and then model and train the feature coefficients with HTS;
B. Synthesis part: after the feature coefficient sequence is computed from the models, obtain the synthetic speech with a conventional parametric synthesizer.
In the above embedded speech synthesis method based on adaptive weighted spectrum interpolation coefficients, the vocal tract spectral feature coefficient extraction process at the training end is divided into the following four steps:
A. Extract parameters from the speech signals in the training speech database: fundamental frequency, gain, and STRAIGHT spectrum;
B. Extract vocal tract spectral feature coefficients from the resulting STRAIGHT spectrum;
C. Combine the gain with the vocal tract spectral feature coefficients into new vocal tract spectral feature coefficients;
D. Train HMM models on the fundamental frequency and the new vocal tract spectral coefficients together as the feature parameter sequence.
In the above embedded speech synthesis method based on adaptive weighted spectrum interpolation coefficients, the synthesizer's speech generation process at the synthesis end is divided into the following three steps:
A. Generate the fundamental frequency and vocal tract spectral coefficient sequences from the models by the parameter generation algorithm;
B. Generate the excitation of the synthetic speech from the fundamental frequency sequence;
C. Pass the excitation and the vocal tract spectral coefficient sequence through a conventional parametric synthesizer to obtain the synthetic speech.
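Step B above, generating the excitation from the fundamental frequency sequence, can be sketched as follows. This is a minimal illustration of the usual pulse-train/noise scheme, not the patent's exact implementation: the frame length, the energy normalization, and the rule "F0 = 0 marks an unvoiced frame" are my own assumptions.

```python
import numpy as np

def make_excitation(f0_per_frame, frame_len=80, fs=16000):
    """Build a frame-by-frame excitation signal from an F0 contour.

    f0_per_frame: F0 in Hz for each frame; 0 marks an unvoiced frame.
    frame_len: samples per frame (80 samples = 5 ms at 16 kHz).
    """
    rng = np.random.default_rng(0)
    out = np.zeros(len(f0_per_frame) * frame_len)
    next_pulse = 0.0  # running phase of the pulse train, in samples
    for i, f0 in enumerate(f0_per_frame):
        start = i * frame_len
        if f0 <= 0:   # unvoiced frame: white noise excitation
            out[start:start + frame_len] = rng.standard_normal(frame_len)
            next_pulse = start + frame_len  # restart pulse phase after noise
        else:         # voiced frame: pulses spaced by one pitch period
            period = fs / f0
            while next_pulse < start + frame_len:
                out[int(next_pulse)] = np.sqrt(period)  # energy-normalized pulse
                next_pulse += period
    return out
```

The pulse amplitude sqrt(period) keeps the average excitation energy per sample roughly constant across voiced and unvoiced regions, a common convention in parametric vocoders.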
An embedded speech synthesis system built according to this method can fully realize, on an embedded platform, a parametric speech synthesis system with relatively low computational cost, while approaching the quality of STRAIGHT-synthesized speech.
The present invention is further described below in conjunction with the drawings and embodiments; the detailed description of each component of the system will better explain the steps and processes of the invention.
Description of drawings
Figure 1 is a block diagram of the HMM-based speech synthesis system.
Figure 2 is a schematic of LSP coefficient extraction from the STRAIGHT spectrum.
Figure 3 is a schematic of speech synthesis by the parametric synthesizer.
In the figures: 1. speech corpus database; 2. excitation parameter extraction; 3. HMM model training; 4. HMM model set; 5. parameter generation from HMM models; 6. text analysis; 7. excitation generation; 8. synthesis filtering; 9. vocal tract spectrum parameter extraction; 10. speech signal; 11. excitation parameters; 12. vocal tract spectrum parameters; 13. synthetic speech; 14. synthesis text; 15. training part; 16. synthesis part; 17. labeled text; 18. training-end feature parameter extraction; 19. speech signal data; 20. TANDEM-STRAIGHT analysis; 21. STRAIGHT spectrum; 22. LSP parameters; 23. gain; 24. new LSP parameters; 25. lsp[0]; 26. fundamental frequency; 27. lsp2lpc; 28. LPC filter; 29. synthesis-end parameter synthesis filtering; 30. excitation.
Embodiment
As shown in Figure 1, in an embodiment of the invention the speech synthesis system is deployed on an embedded operating system and comprises a training end and a synthesis end. The model training part is used only offline, solely to generate the compressed model set required for the speech synthesis system to operate; the synthesis part (16) runs on the device. Because the invention focuses on parameter extraction and synthesis, while text labeling (17), text analysis (6), modeling, training, and parameter generation are not its focus, the description below highlights parameter extraction at the training end and the choice of synthesis filter for parameter reconstruction at the synthesis end. This embodiment selects LSP (line spectral pair) parameters as the vocal tract spectrum parameters (12) and an LPC filter (28) as the synthesis filter; the speech data is sampled at 16 kHz.
Feature parameter extraction at the training end (see Fig. 2):
Step 1: Perform time-domain stable power spectrum estimation (the TANDEM-STRAIGHT algorithm) on the training speech data to obtain the fundamental frequency, the STRAIGHT spectrum (21), and the gain (23).
Step 2: Use a generalized cepstral analysis algorithm to extract LPC coefficients from the STRAIGHT spectrum (21), in which the mel-generalized cepstrum is used to transform the spectral coefficients. The transfer function of the generalized cepstrum is:
$$H(z) \;=\; s_\gamma^{-1}\!\left(\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\,\tilde{z}^{-m}\right) \;=\;
\begin{cases}
\left(1+\gamma\displaystyle\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\,\tilde{z}^{-m}\right)^{1/\gamma}, & -1\le\gamma<0;\\[6pt]
\exp\displaystyle\sum_{m=0}^{M} c_{\alpha,\gamma}(m)\,\tilde{z}^{-m}, & \gamma=0;
\end{cases}$$
where $c_{\alpha,\gamma}(m)$ are the mel-generalized cepstral coefficients, $\alpha$ represents the frequency warping, and $\gamma$ controls the representation accuracy of zeros and poles.
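The inverse generalized logarithm $s_\gamma^{-1}$ in the formula can be stated directly in code. The sketch below is my own illustration (not code from the patent) of how $\gamma$ interpolates between the all-pole, linear-prediction-style form at $\gamma=-1$ and the exponential, cepstral form at $\gamma=0$:

```python
import numpy as np

def s_gamma_inv(w, gamma):
    """Inverse generalized logarithm of the mel-generalized cepstrum:
    (1 + gamma*w)**(1/gamma) for -1 <= gamma < 0, and exp(w) for gamma = 0."""
    if gamma == 0:
        return np.exp(w)
    return (1.0 + gamma * w) ** (1.0 / gamma)
```

For example, `s_gamma_inv(0.5, -1.0)` gives the all-pole value 1/(1 - 0.5) = 2, while for small |γ| the result approaches `exp(w)`, which is why γ = 0 reduces the representation to an ordinary (mel-)cepstrum.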
Step 3: Convert the resulting LPC coefficients into LSP coefficients (22).
Step 4: Replace the 0th dimension of the LSP parameters with the gain, generating the new LSP vocal tract spectral coefficients.
Step 5: Use the new LSP vocal tract spectral coefficients together with the fundamental frequency as the feature parameters of the speech signal for HMM model training (3).
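Steps 4 and 5 amount to a simple per-frame array operation: overwrite the 0th LSP dimension (25) with the gain (23) and stack the fundamental frequency alongside. A minimal numpy sketch, under assumed array shapes (frames × order) of my own choosing:

```python
import numpy as np

def build_feature_frames(lsp, gain, f0):
    """Combine per-frame parameters into the training feature matrix.

    lsp:  (n_frames, order) LSP coefficients from the STRAIGHT spectrum
    gain: (n_frames,) frame gains
    f0:   (n_frames,) fundamental frequency values

    Returns (n_frames, order + 1): the 0th LSP dimension is replaced
    by the gain (step 4), and F0 is appended so that both streams are
    trained together as one feature parameter sequence (step 5).
    """
    new_lsp = lsp.copy()               # keep the caller's array intact
    new_lsp[:, 0] = gain               # step 4: lsp[0] := gain
    return np.column_stack([new_lsp, f0])  # step 5: joint feature vector
```

In a real HTS setup the F0 stream is modeled separately (it is undefined in unvoiced frames), so this flat concatenation is only meant to show the dimension bookkeeping.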
Choice of synthesis filter at the synthesis end (see Fig. 3):
Step 1: Convert the LSP coefficient (22) sequence obtained by parameter generation into an LPC coefficient sequence.
Step 2: Obtain the excitation (30) signal from the fundamental frequency (26) sequence.
Step 3: Pass the excitation through the LPC filter (28) to obtain the synthetic speech (13).
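Steps 1 and 3 can be sketched end to end. The LSP-to-LPC conversion below follows the standard line-spectral-frequency reconstruction, in which the roots of the symmetric and antisymmetric polynomials P(z) and Q(z) interleave on the unit circle and A(z) = (P(z) + Q(z))/2; it is my own even-order sketch, not code from the patent, and the synthesis uses a plain all-pole filter 1/A(z):

```python
import numpy as np
from scipy.signal import lfilter

def lsf_to_lpc(lsf):
    """Convert M sorted line spectral frequencies (radians, 0 < w < pi,
    M even) back to LPC coefficients a[0..M] with a[0] = 1."""
    z = np.exp(1j * np.asarray(lsf))
    # Odd-indexed LSFs build Q(z), even-indexed build P(z) (1-based count);
    # each root comes with its complex conjugate so the result is real.
    q = np.poly(np.concatenate([z[0::2], z[0::2].conj()]))
    p = np.poly(np.concatenate([z[1::2], z[1::2].conj()]))
    q = np.convolve(q, [1.0, 1.0])    # Q(z) carries the extra root at z = -1
    p = np.convolve(p, [1.0, -1.0])   # P(z) carries the extra root at z = +1
    a = 0.5 * (p + q)                  # A(z) = (P(z) + Q(z)) / 2
    return a[:-1].real                 # drop the vanishing top coefficient

def synthesize(excitation, lsf):
    """Step 3: pass the excitation through the all-pole LPC filter 1/A(z)."""
    a = lsf_to_lpc(lsf)
    return lfilter([1.0], a, excitation)
```

A useful property of this parameterization is that any strictly increasing set of LSFs in (0, π) yields a minimum-phase A(z), so the synthesis filter is guaranteed stable even after HMM parameter generation perturbs the coefficients.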
The above example is a preferred embodiment of the invention. The vocal tract spectral coefficients (12) may instead be MGC coefficients, with the MLSA filter as the corresponding synthesis filter, with equally good results; but the MLSA filter demands more computing power than the LPC filter (28), so LSP coefficients (22) are the better choice on embedded devices.
When the invention is used on an embedded device, all audio input and output can use the I/O interfaces provided by the device itself. The speech function can be enabled or disabled on the device at any time; when it is disabled, the original functions of the device are unaffected.
The invention can be applied to various embedded terminal devices. Following the main design of the invention, those of ordinary skill in the art can produce many similar or equivalent applications. Therefore, the protection of the invention shall be defined by the scope of the claims.

Claims (3)

1. An embedded speech synthesis method based on adaptive weighted spectrum interpolation (STRAIGHT spectrum) coefficients, used in an embedded operating system to convert arbitrary received text into speech output, which obtains synthetic speech quality comparable to that of the STRAIGHT synthesizer while replacing the STRAIGHT synthesizer at the synthesis end with a conventional parametric synthesizer, greatly increasing synthesis speed and making embedded application feasible, wherein a speech synthesis system using the method is divided into the following two parts:
A. training part: first extract the STRAIGHT spectrum from the speech signal, then extract vocal tract spectral feature coefficients from the STRAIGHT spectrum, and then model and train the feature coefficients with HTS;
B. synthesis part: after the feature coefficient sequence is computed from the models, obtain the synthetic speech with a conventional parametric synthesizer.
2. The embedded speech synthesis method based on adaptive weighted spectrum interpolation coefficients according to claim 1, characterized in that in step A the vocal tract spectral feature coefficient extraction process at the training end is divided into the following four steps:
A. extract parameters from the speech signals in the training speech database: fundamental frequency, gain, and STRAIGHT spectrum;
B. extract vocal tract spectral feature coefficients from the resulting STRAIGHT spectrum;
C. combine the gain with the vocal tract spectral feature coefficients into new vocal tract spectral feature coefficients;
D. train HMM models on the fundamental frequency and the new vocal tract spectral coefficients together as the feature parameter sequence.
3. The embedded speech synthesis method based on adaptive weighted spectrum interpolation coefficients according to claim 1, characterized in that in step B the synthesis-end speech generation process is divided into the following three steps:
A. generate the fundamental frequency and vocal tract spectral coefficient sequences from the models by the parameter generation algorithm;
B. generate the excitation of the synthetic speech from the fundamental frequency sequence;
C. pass the excitation and the vocal tract spectral coefficient sequence through a conventional parametric synthesizer to obtain the synthetic speech.
CN201110145478XA 2011-06-01 2011-06-01 Imbedded voice synthesis method based on adaptive weighted spectrum interpolation coefficient Pending CN102214463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110145478XA CN102214463A (en) 2011-06-01 2011-06-01 Imbedded voice synthesis method based on adaptive weighted spectrum interpolation coefficient

Publications (1)

Publication Number Publication Date
CN102214463A true CN102214463A (en) 2011-10-12

Family

ID=44745744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110145478XA Pending CN102214463A (en) 2011-06-01 2011-06-01 Imbedded voice synthesis method based on adaptive weighted spectrum interpolation coefficient

Country Status (1)

Country Link
CN (1) CN102214463A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010021906A1 (en) * 2000-03-03 2001-09-13 Keiichi Chihara Intonation control method for text-to-speech conversion
CN1815552A (en) * 2006-02-28 2006-08-09 安徽中科大讯飞信息科技有限公司 Frequency spectrum modelling and voice reinforcing method based on line spectrum frequency and its interorder differential parameter
CN101950559A (en) * 2010-07-05 2011-01-19 李华东 Method for synthesizing continuous speech with large vocabulary and terminal equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LING Zhenhua; DAI Lirong; WANG Renhua; SHUANG Zhiwei; ZHOU Bin: "Wideband speech coding algorithm based on adaptive weighted spectral interpolation", Journal of Data Acquisition and Processing (《数据采集与处理》), March 2005. *
ZHANG Zhengjun; YANG Weiying; CHEN Zan: "Voice conversion based on the STRAIGHT model and artificial neural networks", Audio Engineering (《电声技术》), September 2010. *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20111012