CN102231275B - Embedded speech synthesis method based on weighted mixed excitation - Google Patents


Info

Publication number
CN102231275B
CN102231275B
Authority
CN
China
Prior art keywords
periodic
composition
synthetic
coefficient
excitation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011101454794A
Other languages
Chinese (zh)
Other versions
CN102231275A (en)
Inventor
王朝民 (Wang Chaomin)
那兴宇 (Na Xingyu)
谢湘 (Xie Xiang)
何娅玲 (He Yaling)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING YUYIN TIANXIA TECHNOLOGY CO LTD
Zhuhai Hi-tech Angel Venture Capital Co.,Ltd.
Original Assignee
BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd filed Critical BEIJING YUYIN TIANXIA TECHNOLOGY Co Ltd
Priority to CN2011101454794A priority Critical patent/CN102231275B/en
Publication of CN102231275A publication Critical patent/CN102231275A/en
Application granted granted Critical
Publication of CN102231275B publication Critical patent/CN102231275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an embedded speech synthesis method based on weighted mixed excitation, used by an embedded operating system to convert any received text into speech output. The method comprises the following steps: at the training end, extracting STRAIGHT (adaptive interpolation of weighted spectrum) spectral coefficients, the fundamental frequency, and aperiodic components from the speech signal; and at the synthesis end, constructing a mixed excitation from the fundamental frequency and the aperiodic components and producing synthetic speech through a conventional parametric synthesizer. By replacing the original binary excitation with the mixed excitation at the synthesis end, the method keeps computational cost low while improving the naturalness and quality of the synthetic speech, achieving an effect close to that of the STRAIGHT synthesizer.

Description

An embedded speech synthesis method based on weighted mixed excitation
Technical field
The present invention relates generally to an embedded speech synthesis method based on STRAIGHT coefficients, and in particular to terminal devices with limited storage and computational resources.
Background art
With the rapid development of the mobile Internet and the Internet of Things, embedded devices such as mobile phones and e-book readers have gradually become people's most direct means of daily information acquisition and processing, and speech is the most natural and direct means of interaction. The development of embedded speech synthesis technology is therefore the trend of the times and meets an urgent market demand.
The aim of speech synthesis is to faithfully reproduce the human voice, that is, to let a machine imitate characteristics such as human timbre, speaking style, and prosody. Traditional speech synthesis was built on concatenative methods over large-scale corpora; the technique is simple, the synthesized quality is high, and it was once widely adopted. However, the unit database required by this method is large; although its footprint can be reduced by clustering, coding, compression, and similar techniques, the sound quality suffers and flexibility declines. Statistical parametric synthesis based on large-scale corpora has therefore been widely studied in recent years. Its basic idea is to parameterize and statistically model a large raw speech corpus; at synthesis time, a model sequence is selected according to given rules, the parameter sequence of the target utterance is computed, and speech meeting the requirements is generated by a parametric synthesis method. Speech synthesized by statistical parametric modeling has high naturalness and intelligibility; the approach most widely studied and adopted at present is HMM-based speech synthesis. The choice of speech feature parameters largely determines the quality of the synthetic speech; the features generally include excitation-source parameters and vocal-tract spectral parameters. Vocal-tract spectral coefficients are usually extracted from the short-time Fourier transform (STFT) spectrum, and synthesis can be completed directly at the synthesis end by a conventional parametric synthesizer (such as a cepstral filter or a linear prediction filter) with good quality. The STRAIGHT speech analysis-synthesis algorithm proposed in recent years removes the periodicity present in the time and frequency domains of the STFT spectrum to obtain a smooth spectrum free of periodic disturbance, from which more natural, higher-quality speech can be synthesized. Simply substituting the STRAIGHT spectrum for the original FFT spectrum as the spectral feature already improves the quality and naturalness of the synthetic speech markedly, but using a plain binary excitation does not exploit the full advantage of the STRAIGHT algorithm: its aperiodic component is the key to synthesizing high-quality, highly natural speech, and is the main path to further gains in quality and naturalness.
What is needed, therefore, is a method that realizes a parametric speech synthesis system with a small computational footprint on an embedded platform, one that not only uses the STRAIGHT spectral features but also, by making proper use of the aperiodic component of the STRAIGHT algorithm, brings the quality of the synthetic speech close to that of STRAIGHT-synthesized speech.
Summary of the invention
The technical problem to be solved by this invention is to add the aperiodic component of STRAIGHT, in the form of a mixed excitation, to the excitation source of the synthesizer on a low-complexity basis, improving on the original binary excitation so that the generated speech has quality and naturalness closer to STRAIGHT-synthesized speech.
To achieve the above object, this document provides an embedded speech synthesis method based on weighted mixed excitation, used by an embedded operating system to convert any received text into speech output. At the synthesis end, the original binary excitation is replaced by a mixed excitation, which improves the naturalness and quality of the synthetic speech while keeping the computational cost low, approaching the effect of the STRAIGHT synthesizer. A speech synthesis system using the method is divided into the following two parts:
A. Training part: first extract the STRAIGHT spectrum, fundamental frequency, and aperiodic component from the speech signal; then extract vocal-tract spectral feature coefficients from the STRAIGHT spectrum and average the aperiodic component within 5 frequency bands; finally model and train the feature coefficients with HTS.
B. Synthesis part: after the feature coefficient sequences are computed from the models, the synthetic speech is obtained through the weighted mixed excitation built from the aperiodic component and a conventional parametric synthesizer.
In the embedded speech synthesis method based on STRAIGHT coefficients described above, the extraction of the feature coefficient sequences at the training end is divided into the following five steps:
A. Extract parameters from each speech signal in the training speech database: fundamental frequency, gain, STRAIGHT spectrum, and aperiodic component.
B. Extract the vocal-tract spectral feature coefficients from the obtained STRAIGHT spectrum.
C. Combine the gain with the vocal-tract spectral feature coefficients to form new vocal-tract spectral feature coefficients.
D. Divide the aperiodic component into five frequency bands (0~1 kHz, 1~2 kHz, 2~4 kHz, 4~6 kHz, and 6~8 kHz), then average the aperiodic component within each band, obtaining one aperiodic-component weight per band; use these 5 weights as part of the feature parameter sequence. The system adopts the 16 kHz sampling rate commonly used in embedded systems.
E. Use the fundamental frequency, the new vocal-tract spectral coefficients, and the band-wise aperiodic-component weights together as the feature parameter sequence for HMM model training.
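The feature assembly in step E can be sketched as follows. This is a hypothetical illustration: the LSP order (24), the use of raw f0 rather than log f0, and the vector layout are assumptions for the sketch, not taken from the patent.

```python
import numpy as np

def assemble_frame_features(f0, lsp, gain, ap_weights):
    """Concatenate one frame's parameters into a single observation
    vector for HMM training (hypothetical layout)."""
    lsp = np.array(lsp, dtype=float)
    lsp[0] = gain                        # gain replaces the 0th LSP dimension (step C)
    ap_weights = np.asarray(ap_weights, dtype=float)
    if ap_weights.shape != (5,):
        raise ValueError("expected one aperiodicity weight per band (5 bands)")
    return np.concatenate(([f0], lsp, ap_weights))

vec = assemble_frame_features(120.0, np.zeros(24), -1.5, np.full(5, 0.3))
print(vec.shape)  # (30,) = 1 (f0) + 24 (LSP with gain) + 5 (band weights)
```

In practice the HMM observation would also carry delta and delta-delta features, which are omitted here for brevity.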
In the embedded speech synthesis method based on STRAIGHT coefficients described above, the synthesis process at the synthesis end is divided into the following three steps:
A. Generate the fundamental frequency, vocal-tract spectral coefficient, and aperiodic-component weight sequences from the models by the parameter generation algorithm.
B. Generate the excitation source of the synthetic speech from the fundamental frequency and the aperiodic-component weight sequence, using the mixed excitation model.
C. Pass the excitation source and the vocal-tract spectral coefficient sequence through the conventional parametric synthesizer to obtain the synthetic speech.
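Step C amounts to frame-wise all-pole (LPC) filtering of the excitation. The following is a minimal sketch under assumed conventions (one coefficient row `[1, a1, ..., ap]` per frame, an 80-sample frame), not the patent's implementation:

```python
import numpy as np

def lpc_synthesize(excitation, lpc_frames, frame_len=80):
    """Direct-form all-pole synthesis: y[n] = x[n] - sum_k a[k] * y[n-k],
    selecting the LPC coefficient row of the frame that sample n falls in."""
    lpc_frames = np.asarray(lpc_frames, dtype=float)
    order = lpc_frames.shape[1] - 1
    y = np.zeros(len(excitation) + order)      # leading zeros = initial filter memory
    for n in range(len(excitation)):
        a = lpc_frames[min(n // frame_len, len(lpc_frames) - 1)]
        past = y[n + order - 1::-1][:order]    # y[n-1], y[n-2], ..., y[n-order]
        y[n + order] = excitation[n] - np.dot(a[1:], past)
    return y[order:]
```

Because the recursion carries state across frame boundaries, coefficient changes between frames do not introduce discontinuities; a production synthesizer would typically also interpolate coefficients between frames.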
The present invention is further described below in conjunction with the drawings and embodiments; the detailed description of each building block of the system will better explain the steps and processes of the invention.
Description of drawings
Figure 1 is a block diagram of the HMM-based speech synthesis system
Figure 2 is a schematic diagram of system feature parameter extraction
Figure 3 is a block diagram of the aperiodic-component weighted mixed excitation speech synthesizer
In the figures: 1. speech corpus database; 2. excitation-source parameter extraction; 3. HMM model training; 4. HMM model set; 5. parameter generation from HMM models; 6. text analysis; 7. excitation-source generation; 8. synthesis filtering; 9. vocal-tract spectral parameter extraction; 10. speech signal; 11. excitation-source parameters; 12. vocal-tract spectral parameters; 13. synthetic speech; 14. synthesis text; 15. training part; 16. synthesis part; 17. labeled text; 18. training-end feature parameter extraction; 19. speech signal data; 20. TANDEM-STRAIGHT analysis; 21. STRAIGHT spectrum; 22. LSP coefficients; 23. new LSP coefficients; 24. gain; 25. fundamental frequency; 26. aperiodic component; 27. averaging over 5 bands; 28. band-wise aperiodic-component weights; 29. lsp[0]; 27. lsp2lpc; 28. LPC filter; 29. synthesis-end parameter synthesis filtering; 30. synthesis-end parameter synthesis filtering; 31. lsp2lpc; 32. mixed excitation; 33. weighting; 34. aperiodic weights; 35. pulse train; 36. white noise.
Embodiment
As shown in Figure 1, in an embodiment of the invention the speech synthesis system is deployed on an embedded operating system and comprises a training end and a synthesis end. The model training part is used only offline, solely to generate the compact model library needed when the synthesis system runs; the synthesis part runs on the chip. Because the invention focuses on parameter extraction and synthesis, and text labeling, text analysis, modeling, training, and parameter generation are not its focus, the following highlights parameter extraction and reconstruction at the training end and the generation of the mixed excitation at the synthesis end. This embodiment selects LSP coefficients (22) as the vocal-tract spectral parameters and an LPC filter (28) as the synthesis filter; the speech data are sampled at 16 kHz.
Feature parameter extraction (18) at the training end:
Step 1: perform temporally stable power spectrum estimation on the training speech data (the TANDEM-STRAIGHT algorithm) to obtain the fundamental frequency (25), STRAIGHT spectrum (21), gain (24), and aperiodic component (26).
Step 2: extract LPC coefficients from the STRAIGHT spectrum (21) using a generalized cepstral analysis algorithm, converting the spectral coefficients with the concept of mel generalized cepstral analysis; then convert the resulting LPC coefficients into LSP coefficients (22).
Step 3: replace the 0th LSP dimension with the gain parameter to generate the new LSP vocal-tract spectral coefficients.
Step 4: obtain the aperiodic component (26) from TANDEM-STRAIGHT analysis (20), then divide it into five bands along the frequency axis; for 16 kHz speech the bands are 0~1000 Hz, 1000~2000 Hz, 2000~4000 Hz, 4000~6000 Hz, and 6000~8000 Hz. Average the aperiodic component within each band and use that value as the band's aperiodicity weight, so the aperiodic component of each speech frame is reduced to 5 coefficients.
Step 5: use the new LSP vocal-tract spectrum, the fundamental frequency (25), and the aperiodic-component (26) weights together as the feature parameters of the speech signal for HMM model training (3).
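The band averaging of Step 4 can be sketched as follows, assuming the per-frame aperiodicity is given as one value per FFT bin spanning 0 to 8000 Hz; the bin count and the half-open band edges are illustrative assumptions, not details from the patent.

```python
import numpy as np

# Five analysis bands for 16 kHz speech, as listed in Step 4.
BANDS_HZ = [(0, 1000), (1000, 2000), (2000, 4000), (4000, 6000), (6000, 8000)]

def band_aperiodicity_weights(ap_spectrum, fs=16000):
    """Average one frame's aperiodicity spectrum within each band,
    reducing it to the 5 weights used as HMM features."""
    ap_spectrum = np.asarray(ap_spectrum, dtype=float)
    freqs = np.linspace(0.0, fs / 2, len(ap_spectrum))   # bin center frequencies
    return np.array([ap_spectrum[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in BANDS_HZ])

w = band_aperiodicity_weights(np.full(513, 0.4))
print(w)  # a flat aperiodicity spectrum yields five equal weights of 0.4
```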
Generation of the mixed excitation at the synthesis end (see Fig. 3):
Step 1: the fundamental frequency (25) controls the generation of the pulse train (35) and the white Gaussian noise (36).
Step 2: the aperiodic-component (26) weights control the weighted mixing of the pulse train (35) and the white Gaussian noise (36) to obtain the mixed excitation (32).
Step 3: pass the mixed excitation (32) through the MLSA filter controlled by the vocal-tract parameters, then generate the final synthetic speech (13) waveform through the PSOLA filter.
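Steps 1 and 2 can be sketched as follows. For brevity this illustration mixes a unit-power pulse train and white Gaussian noise with a single frame-level weight; the patent applies the five band-wise weights, which would additionally require band-pass filtering each component before summing. The frame length and the noise-only fallback for unvoiced frames are assumptions for the sketch.

```python
import numpy as np

def mixed_excitation_frame(f0, ap_weight, frame_len=80, fs=16000):
    """One frame of weighted mixed excitation: pulse train (35) scaled by
    (1 - w) plus white Gaussian noise (36) scaled by w, where w is the
    frame's aperiodicity weight."""
    noise = np.random.randn(frame_len)
    if f0 <= 0:                                  # unvoiced frame: noise only
        return noise
    period = max(1, int(round(fs / f0)))         # pitch period in samples
    pulses = np.zeros(frame_len)
    pulses[::period] = np.sqrt(period)           # unit-power pulse train
    # weighted mix (32): more aperiodicity -> more noise, less pulse energy
    return (1.0 - ap_weight) * pulses + ap_weight * noise
```

With `ap_weight` near 0 the excitation is almost purely periodic (strongly voiced); near 1 it degenerates to noise, recovering the behavior of the original binary excitation at the two extremes.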
The above example is a preferred embodiment of the invention. The vocal-tract spectral parameters (12) could instead be MGC, with the MLSA filter as the corresponding synthesis filter, to equally good effect; but the MLSA filter demands more computing power than the LPC filter, so on embedded devices the LSP coefficients (22) are the better choice.
When the invention is used on an embedded device, all audio input and output can use the I/O interfaces provided by the device itself. The speech function can be enabled or disabled on the device at any time; when it is disabled, the device's original functions are unaffected.
The invention can be applied to all kinds of embedded terminal devices. Following the main design of the invention, those of ordinary skill in the art can produce many similar or equivalent applications. The protection of the invention shall therefore be determined by the scope of the claims.

Claims (2)

1. An embedded speech synthesis method based on weighted mixed excitation, used by an embedded operating system to convert any received text into speech output, in which the original binary excitation is replaced at the synthesis end by a mixed excitation, improving the naturalness and quality of the synthetic speech while keeping the computational cost low; the method comprising the following steps:
A. Training: first extracting the STRAIGHT spectrum, fundamental frequency, and aperiodic component from the speech signal; then extracting vocal-tract spectral feature coefficients from the STRAIGHT spectrum and averaging the aperiodic component within 5 frequency bands; then modeling and training the feature coefficients with HTS to obtain the models;
Said step A is divided into:
A1. extracting parameters from each speech signal in the training speech database: fundamental frequency, gain, STRAIGHT spectrum, and aperiodic component;
A2. extracting the vocal-tract spectral feature coefficients from the obtained STRAIGHT spectrum;
A3. combining the gain with the vocal-tract spectral feature coefficients into new vocal-tract spectral feature coefficients;
A4. dividing the aperiodic component into five frequency bands (0~1 kHz, 1~2 kHz, 2~4 kHz, 4~6 kHz, and 6~8 kHz), then averaging the aperiodic component within each band to obtain one aperiodic-component weight per band, and using these 5 weights as part of the feature parameter sequence; the system adopts the 16 kHz sampling rate commonly used in embedded systems;
A5. using the fundamental frequency, the new vocal-tract spectral coefficients, and the band-wise aperiodic-component weights together as the feature parameter sequence for HMM model training;
B. Synthesis: after the feature coefficient sequences are computed from said models, obtaining the synthetic speech through the weighted mixed excitation built from the aperiodic component and a conventional parametric synthesizer; said conventional parametric synthesizer is an MLSA filter and/or a PSOLA filter.
2. The embedded speech synthesis method based on weighted mixed excitation according to claim 1, characterized in that said step B is divided into:
B1. generating the fundamental frequency, vocal-tract spectral coefficient, and aperiodic-component weight sequences from said models by the parameter generation algorithm;
B2. generating the excitation source of the synthetic speech from the fundamental frequency and the aperiodic-component weight sequence, using the mixed excitation model;
B3. passing the excitation source and the vocal-tract spectral coefficient sequence through the conventional parametric synthesizer to obtain the synthetic speech.
CN2011101454794A 2011-06-01 2011-06-01 Embedded speech synthesis method based on weighted mixed excitation Active CN102231275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101454794A CN102231275B (en) 2011-06-01 2011-06-01 Embedded speech synthesis method based on weighted mixed excitation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101454794A CN102231275B (en) 2011-06-01 2011-06-01 Embedded speech synthesis method based on weighted mixed excitation

Publications (2)

Publication Number Publication Date
CN102231275A CN102231275A (en) 2011-11-02
CN102231275B true CN102231275B (en) 2013-10-16

Family

ID=44843835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101454794A Active CN102231275B (en) 2011-06-01 2011-06-01 Embedded speech synthesis method based on weighted mixed excitation

Country Status (1)

Country Link
CN (1) CN102231275B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104282300A (en) * 2013-07-05 2015-01-14 中国移动通信集团公司 Non-periodic component syllable model building and speech synthesizing method and device
EP3363015A4 (en) * 2015-10-06 2019-06-12 Interactive Intelligence Group, Inc. Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN105206259A (en) * 2015-11-03 2015-12-30 常州工学院 Voice conversion method
CN108184032B (en) * 2016-12-07 2020-02-21 中国移动通信有限公司研究院 Service method and device of customer service system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3515039B2 (en) * 2000-03-03 2004-04-05 沖電気工業株式会社 Pitch pattern control method in text-to-speech converter
CN1815552B (en) * 2006-02-28 2010-05-12 安徽中科大讯飞信息科技有限公司 Frequency spectrum modelling and voice reinforcing method based on line spectrum frequency and its interorder differential parameter
CN101950559A (en) * 2010-07-05 2011-01-19 李华东 Method for synthesizing continuous speech with large vocabulary and terminal equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
* Zhang Zhengjun (张正军), Yang Weiying (杨卫英), Chen Zan (陈赞). 基于STRAIGHT模型和人工神经网络的语音转换 (Voice conversion based on the STRAIGHT model and artificial neural networks). 电声技术 (Audio Engineering), 2010. *

Also Published As

Publication number Publication date
CN102231275A (en) 2011-11-02


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: ZHUHAI YUYIN TIANXIA TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: BEIJING YUYIN TIANXIA TECHNOLOGY CO., LTD.

Effective date: 20140708

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100085 HAIDIAN, BEIJING TO: 519000 ZHUHAI, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140708

Address after: A1013, Block A, Pioneering Building, Tsinghua Science Park (Zhuhai), No. 101 University Road, Tangjiawan Town, Zhuhai High-tech Zone, Guangdong Province 519000

Patentee after: Zhuhai Yu World Technology Co.,Ltd.

Address before: Room 915, No. 15 Information Road, Haidian District, Beijing 100085

Patentee before: BEIJING YUYIN TIANXIA TECHNOLOGY Co.,Ltd.

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20170105

Address after: 406, North Block, Yuanxing Technology Building, North Area of the Science and Technology Park, Nanshan District, Shenzhen, Guangdong 518057

Patentee after: SHENZHEN AVSNEST TECHNOLOGY CO.,LTD.

Address before: Room 915, Finance and Trade Building, No. 15 Information Road, Haidian District, Beijing 100085

Patentee before: BEIJING YUYIN TIANXIA TECHNOLOGY Co.,Ltd.

Effective date of registration: 20170105

Address after: Room 915, Finance and Trade Building, No. 15 Information Road, Haidian District, Beijing 100085

Patentee after: BEIJING YUYIN TIANXIA TECHNOLOGY Co.,Ltd.

Address before: A1013, Block A, Pioneering Building, Tsinghua Science Park (Zhuhai), No. 101 University Road, Tangjiawan Town, Zhuhai High-tech Zone, Guangdong Province 519000

Patentee before: Zhuhai Yu World Technology Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20181023

Address after: A1013, Block A, Pioneering Building, Tsinghua Science Park (Zhuhai), No. 101 University Road, Tangjiawan Town, Zhuhai High-tech Zone, Guangdong Province 519000

Patentee after: Zhuhai Yu World Technology Co.,Ltd.

Address before: 406, North Block, Yuanxing Technology Building, North Area of the Science and Technology Park, Nanshan District, Shenzhen, Guangdong 518057

Patentee before: SHENZHEN AVSNEST TECHNOLOGY CO.,LTD.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190104

Address after: Room 915, Finance and Trade Building, No. 15 Information Road, Haidian District, Beijing 100085

Co-patentee after: Zhuhai Hi-tech Angel Venture Capital Co.,Ltd.

Patentee after: BEIJING YUYIN TIANXIA TECHNOLOGY Co.,Ltd.

Address before: A1013, Block A, Pioneering Building, Tsinghua Science Park (Zhuhai), No. 101 University Road, Tangjiawan Town, Zhuhai High-tech Zone, Guangdong Province 519000

Patentee before: Zhuhai Yu World Technology Co.,Ltd.