CN103632663B - Method for front-end processing in HMM-based Mongolian speech synthesis - Google Patents

Method for front-end processing in HMM-based Mongolian speech synthesis

Info

Publication number
CN103632663B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310595871.8A
Other languages
Chinese (zh)
Other versions
CN103632663A (en)
Inventor
飞龙
高光来
赵建东
张学良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University
Priority to CN201310595871.8A
Publication of CN103632663A
Application granted
Publication of CN103632663B
Legal status: Active
Anticipated expiration

Abstract

A method for front-end processing in HMM-based Mongolian speech synthesis. The speech data in the speech corpus are first analyzed; a context attribute set and a question set for decision-tree clustering are drawn up; the HMMs are trained to obtain the training result; the corresponding clustered state models are obtained and arranged into a clustered state model sequence; the target acoustic parameter sequence is generated using the dynamic features of the parameters, and the synthesized speech is obtained with a STRAIGHT synthesizer. The front-end processing comprises: processing the special characters in the input Mongolian text; converting the Mongolian script to Latin transliteration; tagging Mongolian parts of speech; converting the Latin transliteration to phonemes with a G2P model; splitting syllables by rule; assembling the data in the format required for prosody prediction; performing prosody prediction; and producing the text format required for synthesis. The invention allows the duration and pitch-period parameters of the synthesized speech to be adjusted effectively while keeping the synthesized speech natural and fluent, and it reduces the cost of synthesis.

Description

Method for front-end processing in HMM-based Mongolian speech synthesis
Technical field
The invention belongs to the field of speech synthesis technology, and in particular relates to a method for front-end processing in HMM-based Mongolian speech synthesis.
Background art
Speech synthesis is a key technology for spoken communication between humans and machines. It draws on acoustics, linguistics, digital signal processing, computer science and other disciplines, and is a frontier technology in the field of information processing. The main problem it addresses is how to convert textual information into audible sound, that is, how to make a machine speak like a person.
The current mainstream approach is HMM-based speech synthesis. Compared with waveform concatenation based on a large corpus, it can build a new system automatically in a short time with essentially no manual intervention, and the whole training process is largely independent of the speaker, speaking style, emotion and similar factors. Many usable speech synthesis systems of this kind already exist for Chinese, English, Japanese and other languages. For Mongolian, earlier researchers studied waveform concatenation based on a large corpus. That work matters for the development of Mongolian speech synthesis, but research on HMM-based Mongolian speech synthesis is still in its infancy. Studying HMM-based Mongolian speech synthesis and building such a system is of great significance for education, transportation, communication and office automation in ethnic minority regions.
Prior art 1: a Mongolian speech synthesis method based on stems and affixes:
First, the text analysis module formats the input text, records the words and punctuation marks to be pronounced, and filters out characters that are not pronounced. Each word is then looked up in the stem table. If the base form of the word is found, the corresponding speech data are extracted. If it is not found, the word is segmented into stem and affixes to find the matching entries, and the speech data for the stem and affixes of the word are extracted. The prosodic features are then recorded, and the prosody is modified with the PSOLA algorithm. Finally, the selected speech segments are concatenated according to the splicing rules to produce the final synthesized speech;
Technical scheme text analysis model needs the problems such as the cutting of process text formatting and stem affixe.In systems, The most simply process when running into the symbol or English word that represent with English, it is simply that sequential read out each English alphabet, The most further considered. for each word obtained in text analyzing, first make a look up in stem table, if looked for To this word, then extract speech data, if can not find, carry out the cutting of stem affixe.Word in Mongol may only enter Cutting of row just can find the stem affixe of correspondence, there may come a time when to need the stem after cutting is carried out secondary cutting, or Carry out repeatedly cutting, so proposing the concept of multistage stem affixe sheer;
The prosodic features of phrases and sentences are represented by recording the prosodic features of each word in the text analysis module. Word stress is expressed by recording the changes in the word's duration, pitch and amplitude; for phrases, the changes in duration and stress of each word are recorded; for sentences, the changes in intonation (including stress, duration and fundamental frequency) are recorded. After text analysis, the prosodic variation of each word is known, and the system modifies the speech with TD-PSOLA and FD-PSOLA to obtain the modified speech segments;
Three algorithms are most commonly used to join speech synthesis units: diphone splicing, hard splicing and soft splicing. Hard splicing simply places two speech segments end to end (simple concatenation). In soft splicing the join is likewise located at the boundary between two segments, but the transition at the join is smoothed by introducing the transitional characteristics of natural speech, which may require some overlap between the two speech units. If only hard splicing is used, the synthesized speech sounds alternately fast and slow, jitters noticeably and lacks continuity. This scheme therefore combines hard splicing with soft splicing, and the system uses head-to-tail overlapping soft splicing for the transitions. In head-to-tail overlapping soft splicing, the waveform of the tail of the preceding speech unit is added to the waveform of the head of the following speech unit over a certain length. The choice of overlap length between the two segments is critical; experience shows that an overlap of about one eighth of the shorter of the two units to be joined works well. This keeps the speech between words coherent and smooth, keeps the pronunciation clear, and improves the naturalness of the synthesized speech.
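A minimal sketch of head-to-tail overlapping soft splicing follows. The overlap of one eighth of the shorter unit comes from the text above; the linear cross-fade weights and the function name `soft_splice` are assumptions added for illustration, since the source only says the overlapping waveforms are added.

```python
def soft_splice(a, b, ratio=1 / 8):
    """Join two lists of samples, overlapping the tail of `a` with the head of `b`."""
    # Overlap length: about one eighth of the shorter unit, as stated in the text.
    n = int(min(len(a), len(b)) * ratio)
    if n == 0:
        return list(a) + list(b)  # too short to overlap: plain hard splice
    out = list(a[:-n])
    for i in range(n):
        # Assumed linear fade: `a` fades out while `b` fades in across the overlap.
        w = (i + 1) / (n + 1)
        out.append((1 - w) * a[len(a) - n + i] + w * b[i])
    out.extend(b[n:])
    return out
```

The result is `n` samples shorter than a hard splice of the same two units, because the overlapped region is counted once.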
Shortcomings of the prior art: waveform concatenation based on a large database generally needs a very large speech corpus, which hinders its use on mobile or embedded devices. Second, the voice produced by waveform concatenation is rather uniform; changing characteristics such as the sex or age of the synthetic voice requires building a new speech corpus at considerable cost. Moreover, although many concatenation algorithms support prosody adjustment, the range over which pitch period and duration can be adjusted is limited, and larger adjustments markedly reduce the naturalness of the synthesized speech.
Summary of the invention
The purpose of the embodiments of the invention is to provide a method for front-end processing in HMM-based Mongolian speech synthesis, aiming to solve the high synthesis cost and the limited range of pitch-period and duration adjustment found in existing Mongolian speech synthesis methods.
The embodiments of the invention are realized as follows: a method of HMM-based Mongolian speech synthesis, comprising the following steps:
Step 1: analyze the speech data in the speech corpus and extract the corresponding speech parameters;
Step 2: according to the extracted speech parameters, divide the HMM observation vector into a spectrum part and a fundamental frequency part, and draw up the context attribute set and the question set for decision-tree clustering;
Step 3: train the HMMs through model initialization, initial/final HMM training, extended context-dependent model training, post-clustering model training and duration model training, finally obtaining a training result that comprises the clustered HMMs for the spectrum, fundamental frequency and duration parameters and their respective decision trees;
Step 4: convert the input text, through text analysis, into a sequence of context-dependent units, then use the trained decision trees to make a decision for each unit, obtain the corresponding clustered state models and form the clustered state model sequence;
Step 5: according to the parameter generation algorithm, use the dynamic features of the parameters to generate the target acoustic parameter sequence, and obtain the final synthesized speech with a STRAIGHT synthesizer.
Further, the HMM-based Mongolian speech synthesis method is divided into a training stage and a synthesis stage.
Further, in step 2, the spectrum part is modeled with continuous probability distribution HMMs, and the fundamental frequency part with multi-space probability distribution HMMs.
Another purpose of the embodiments of the invention is to provide a method for front-end processing in HMM-based Mongolian speech synthesis, comprising the following steps:
The first step: process the special characters in the input Mongolian text;
The second step: convert the Mongolian script to Latin transliteration by rule;
The third step: perform automatic Mongolian part-of-speech tagging with a method based on a history model;
The fourth step: convert the Latin transliteration to phonemes with a G2P model;
The fifth step: split syllables by rule, and count the syllables contained in each word;
The sixth step: combine the part-of-speech and syllable information obtained into data in the format required for prosody prediction;
The seventh step: perform prosody prediction with a method based on conditional random fields;
The eighth step: process the result into the text format required for synthesis.
Further, in the first step, the Mongolian special characters include digits and punctuation marks.
The method for front-end processing in HMM-based Mongolian speech synthesis provided by the invention covers building the corpus for the HMM-based speech synthesis system, selecting the context attributes relevant to Mongolian, determining the question set used for clustering, and the processing performed in the front end of the speech synthesis system. The front-end processing comprises handling Mongolian special symbols, Mongolian-to-Latin transliteration, Latin-transliteration-to-phoneme conversion, Mongolian syllable splitting, automatic Mongolian part-of-speech tagging and Mongolian prosody prediction. The invention allows parameters such as the duration and pitch period of the synthesized speech to be adjusted effectively while keeping the synthesized speech natural and fluent. Parameters can also be transformed with algorithms such as MLLR and MAP, so that voices of different timbre can be synthesized without building an additional speech corpus, which reduces the cost of synthesis.
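Rule-based syllable splitting and per-word syllable counting, one of the front-end steps listed above, might look like the sketch below. The source does not disclose its actual rules, so the rule used here (one vowel group per syllable, with a single preceding consonant opening the next syllable) and the five-letter vowel set are assumptions chosen only to show the shape of the step.

```python
import re

# Assumption: a simplified vowel inventory for lowercase Latin transliteration.
VOWEL_RUN = re.compile(r"[aeiou]+")


def split_syllables(word):
    """Split a transliterated word so that each syllable holds one vowel group."""
    starts = [m.start() for m in VOWEL_RUN.finditer(word)]
    # Cut one character before every vowel group except the first, so that a
    # single consonant is attached to the following vowel (assumed rule).
    cuts = [0] + [s - 1 for s in starts[1:]] + [len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]


def syllable_count(word):
    """Number of syllables in a word under the assumed rule."""
    return len(split_syllables(word))
```

Under this assumed rule, `split_syllables("mongol")` gives `["mon", "gol"]`, and the count is simply the length of that list.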
Brief description of the drawings
Fig. 1 is a flow chart of the HMM-based Mongolian speech synthesis method provided by an embodiment of the invention;
Fig. 2 is a flow chart of the method for front-end processing in HMM-based Mongolian speech synthesis provided by an embodiment of the invention.
Detailed description of the invention
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The application principle of the invention is further described below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the HMM-based Mongolian speech synthesis method of the embodiment of the invention comprises the following steps:
S101: analyze the speech data in the speech corpus and extract the corresponding speech parameters;
S102: according to the extracted speech parameters, divide the HMM observation vector into a spectrum part and a fundamental frequency part, and draw up the context attribute set and the question set for decision-tree clustering;
S103: train the HMMs through model initialization, initial/final HMM training, extended context-dependent model training, post-clustering model training and duration model training, finally obtaining a training result that comprises the clustered HMMs for the spectrum, fundamental frequency and duration parameters and their respective decision trees;
S104: convert the input text, through text analysis, into a sequence of context-dependent units, then use the trained decision trees to make a decision for each unit, obtain the corresponding clustered state models and form the clustered state model sequence;
S105: according to the parameter generation algorithm, use the dynamic features of the parameters to generate the target acoustic parameter sequence, and obtain the final synthesized speech with a STRAIGHT synthesizer.
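The decision made for each context-dependent unit in S104 amounts to walking a binary tree of yes/no questions until a leaf (a clustered state model) is reached. The sketch below illustrates only that walk; the `Node` class, the two example questions and the cluster names are hypothetical and do not reproduce the question set designed in the patent.

```python
class Node:
    """Internal nodes hold a yes/no question; leaves hold a cluster identifier."""

    def __init__(self, question=None, yes=None, no=None, cluster=None):
        self.question, self.yes, self.no, self.cluster = question, yes, no, cluster


def find_cluster(node, context):
    """Answer questions about `context` from the root down to a leaf."""
    while node.cluster is None:
        node = node.yes if node.question(context) else node.no
    return node.cluster


VOWELS = set("aeiou")  # hypothetical phone class used by the example question

# Hypothetical two-question tree, for illustration only.
tree = Node(
    question=lambda c: c["left"] in VOWELS,  # "Is the left phone a vowel?"
    yes=Node(cluster="state_A"),
    no=Node(
        question=lambda c: c["syllables_in_word"] > 2,  # "More than two syllables?"
        yes=Node(cluster="state_B"),
        no=Node(cluster="state_C"),
    ),
)
```

Because every context reaches some leaf, units that never occurred in the training data still receive a model, which is the practical reason for clustering with a question set.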
The specific steps of the HMM-based Mongolian speech synthesis method of the embodiment of the invention are as follows:
The method is divided into two stages: a training stage and a synthesis stage;
The training stage mainly comprises preprocessing and HMM training. In preprocessing, the speech data in the speech corpus are first analyzed and the corresponding speech parameters extracted. According to the extracted parameters, the HMM observation vector is divided into a spectrum part and a fundamental frequency part: the spectrum part is modeled with continuous probability distribution HMMs, and the fundamental frequency part with multi-space probability distribution HMMs. Another important task before model training is to design the context attribute set and the question set for decision-tree clustering, that is, to use prior knowledge to select context attributes that are relevant to Mongolian and have some influence on the acoustic parameters (spectrum, fundamental frequency and duration), and to design the corresponding question set for clustering the context-dependent models;
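The reason fundamental frequency gets a multi-space probability distribution is that it is a continuous value in voiced frames and undefined in unvoiced frames. A minimal sketch of the per-frame output probability for a two-space case follows; the single Gaussian for the voiced space, the parameter names and the example values are assumptions for illustration, not the configuration used in the patent.

```python
import math


def msd_log_likelihood(log_f0, w_voiced, mean, var):
    """Log-likelihood of one frame; `log_f0` is None for an unvoiced frame.

    Voiced frames live in a one-dimensional space with a Gaussian density;
    unvoiced frames live in a zero-dimensional space that carries only a weight.
    Requires 0 < w_voiced < 1 and var > 0.
    """
    if log_f0 is None:
        return math.log(1.0 - w_voiced)
    gauss = -0.5 * (math.log(2 * math.pi * var) + (log_f0 - mean) ** 2 / var)
    return math.log(w_voiced) + gauss
```

For example, with `w_voiced=0.8` an unvoiced frame scores `log(0.2)` regardless of the Gaussian parameters, so voicing and pitch value are modeled by the same state output distribution.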
After preprocessing comes the HMM training proper, whose steps are, in order: model initialization, initial/final HMM training, extended context-dependent model training, post-clustering model training and duration model training. The final training result comprises the clustered HMMs for the spectrum, fundamental frequency and duration parameters and their respective decision trees;
In the synthesis stage, the input text is first converted, through text analysis, into a sequence of context-dependent units. The trained decision trees are then used to make a decision for each unit, giving the corresponding clustered state models, which form the clustered state model sequence. Finally, according to the parameter generation algorithm, the dynamic features of the parameters are used to generate the target acoustic parameter sequence, and the final synthesized speech is obtained with a STRAIGHT synthesizer.
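The "dynamic features" used by parameter generation are time differences of the static parameters. The sketch below shows only the forward relation, computing a first-order delta for each frame of a one-dimensional trajectory; the window coefficients (-0.5, 0, 0.5) and the edge handling are common choices assumed here, and the actual generation step, which solves for the static trajectory that best matches both static and dynamic statistics, is not shown.

```python
def add_deltas(static):
    """Pair each static value with a first-order delta, 0.5 * (c[t+1] - c[t-1])."""
    n = len(static)
    out = []
    for t in range(n):
        prev = static[max(t - 1, 0)]      # repeat the first frame at the left edge
        nxt = static[min(t + 1, n - 1)]   # repeat the last frame at the right edge
        out.append((static[t], 0.5 * (nxt - prev)))
    return out
```

Because each delta ties a frame to its neighbours, a generated trajectory cannot jump between state means without paying a penalty in the dynamic-feature likelihood, which is what yields smooth parameter tracks.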
As in figure 2 it is shown, the method for the Mongol phonetic synthesis front-end processing based on HMM of the embodiment of the present invention include with Lower step:
S201: process the special characters in the input Mongolian text;
S202: convert the Mongolian script to Latin transliteration by rule;
S203: perform automatic Mongolian part-of-speech tagging with a method based on a history model;
S204: convert the Latin transliteration to phonemes with a G2P model;
S205: split syllables by rule, and count the syllables contained in each word;
S206: combine the part-of-speech and syllable information obtained into data in the format required for prosody prediction;
S207: perform prosody prediction with a method based on conditional random fields;
S208: process the result into the text format required for synthesis.
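The order in which S201 to S208 feed one another can be sketched as a single driver function. Everything named here is hypothetical: the `Token` fields and the callables passed in (`normalize`, `to_latin`, `tag_pos`, `g2p`, `syllabify`, `predict_prosody`, `to_labels`) stand in for components the patent describes but does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    latin: str = ""
    pos: str = ""
    phonemes: list = field(default_factory=list)
    syllables: list = field(default_factory=list)
    boundary: str = ""


def front_end(text, normalize, to_latin, tag_pos, g2p, syllabify,
              predict_prosody, to_labels):
    """Run the eight front-end steps in order using caller-supplied components."""
    tokens = [Token(w) for w in normalize(text).split()]        # S201
    for tok in tokens:
        tok.latin = to_latin(tok.text)                           # S202
    tags = tag_pos([tok.latin for tok in tokens])                # S203
    for tok, tag in zip(tokens, tags):
        tok.pos = tag
    for tok in tokens:
        tok.phonemes = g2p(tok.latin)                            # S204
        tok.syllables = syllabify(tok.latin)                     # S205
    rows = [(tok.latin, tok.pos, len(tok.syllables)) for tok in tokens]  # S206
    for tok, boundary in zip(tokens, predict_prosody(rows)):     # S207
        tok.boundary = boundary
    return to_labels(tokens)                                     # S208
```

Passing the components in as arguments keeps the sketch independent of any particular tagger, G2P model or CRF toolkit.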
The specific steps of the method for front-end processing in HMM-based Mongolian speech synthesis of the invention are:
One: process the special characters in the input Mongolian text (including digits, punctuation marks and similar characters);
Two: convert the Mongolian script to Latin transliteration by rule;
Three: perform automatic Mongolian part-of-speech tagging with a method based on a history model;
Four: convert the Latin transliteration to phonemes with a G2P model;
Five: split syllables by rule, and count the syllables contained in each word;
Six: combine the part-of-speech information from step three and the syllable information from step five into data in the format required for prosody prediction;
Seven: perform prosody prediction with a method based on conditional random fields;
Eight: process the result into the text format required for synthesis.
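Step six above, combining part of speech and syllable information into the input for the prosody predictor, can be illustrated with a column layout of the kind commonly fed to CRF toolkits: one word per line, one feature per tab-separated column, and a blank line closing the sentence. The exact columns and the function name `to_crf_rows` are assumptions; the source does not give its data format.

```python
def to_crf_rows(words, pos_tags, syllable_counts):
    """Build one sentence block: word, part of speech and syllable count per line."""
    if not (len(words) == len(pos_tags) == len(syllable_counts)):
        raise ValueError("all three inputs must have the same length")
    lines = [f"{w}\t{p}\t{n}" for w, p, n in zip(words, pos_tags, syllable_counts)]
    # The trailing blank line marks the end of the sentence (assumed convention).
    return "\n".join(lines) + "\n\n"
```

For training data, a further column holding the prosodic boundary label would normally be appended to each line; at prediction time that column is what the CRF fills in.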
The HMM-based Mongolian speech synthesis method of the invention allows parameters such as the duration and pitch period of the synthesized speech to be adjusted effectively while keeping the synthesized speech natural and fluent. Parameters can be transformed with algorithms such as MLLR and MAP, so that voices of different timbre can be synthesized; unlike waveform concatenation, this transformation does not require building an additional speech corpus and needs only a small amount of training data.
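As a rough illustration of why such adaptation needs little data: in MLLR the Gaussian means of many states share one affine transform, mu' = A mu + b, so only A and b have to be estimated from the new voice. The sketch below applies a given transform to one mean vector; estimating A and b from adaptation data is not shown, and the function name and example values are illustrative only.

```python
def mllr_transform_mean(mean, A, b):
    """Apply the affine MLLR mean transform mu' = A * mu + b to one mean vector."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


# Illustrative values only: scale the first dimension, shift both.
adapted = mllr_transform_mean([1.0, 2.0], [[2.0, 0.0], [0.0, 1.0]], [0.5, -1.0])
```

Here `adapted` is `[2.5, 1.0]`; applying the same A and b to every mean in a regression class shifts the whole voice at once.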
The above is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principle of the invention shall fall within the scope of protection of the invention.

Claims (2)

1. A method for front-end processing in HMM-based Mongolian speech synthesis, characterized in that the method comprises the following steps:
The first step: process the special characters in the input Mongolian text;
The second step: convert the Mongolian script to Latin transliteration by rule;
The third step: perform automatic Mongolian part-of-speech tagging with a method based on a history model;
The fourth step: convert the Latin transliteration to phonemes with a G2P model;
The fifth step: split syllables by rule, and count the syllables contained in each word;
The sixth step: combine the part-of-speech and syllable information obtained into data in the format required for prosody prediction;
The seventh step: perform prosody prediction with a method based on conditional random fields;
The eighth step: process the result into the text format required for synthesis.
2. The method for front-end processing in HMM-based Mongolian speech synthesis as claimed in claim 1, characterized in that, in the first step, the Mongolian special characters include digits and punctuation marks.
CN201310595871.8A 2013-11-25 2013-11-25 Method for front-end processing in HMM-based Mongolian speech synthesis Active CN103632663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310595871.8A CN103632663B (en) 2013-11-25 2013-11-25 Method for front-end processing in HMM-based Mongolian speech synthesis

Publications (2)

Publication Number Publication Date
CN103632663A (en) 2014-03-12
CN103632663B (en) 2016-08-17

Family

ID=50213641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310595871.8A Active CN103632663B (en) 2013-11-25 2013-11-25 Method for front-end processing in HMM-based Mongolian speech synthesis

Country Status (1)

Country Link
CN (1) CN103632663B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105427855A (en) * 2015-11-09 2016-03-23 上海语知义信息技术有限公司 Voice broadcast system and voice broadcast method of intelligent software
CN105679308A (en) * 2016-03-03 2016-06-15 百度在线网络技术(北京)有限公司 Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence
CN105957518B (en) * 2016-06-16 2019-05-31 内蒙古大学 A kind of method of Mongol large vocabulary continuous speech recognition
CN108334502A (en) * 2017-12-29 2018-07-27 内蒙古蒙科立蒙古文化股份有限公司 A kind of method for mutually conversing of tradition Mongolian and Cyrillic Mongolian
CN108766413B (en) * 2018-05-25 2020-09-25 北京云知声信息技术有限公司 Speech synthesis method and system
CN110349564B (en) * 2019-07-22 2021-09-24 思必驰科技股份有限公司 Cross-language voice recognition method and device
CN112562637B (en) * 2019-09-25 2024-02-06 北京中关村科金技术有限公司 Method, device and storage medium for splicing voice audios
CN114936555B (en) * 2022-05-24 2023-06-06 内蒙古自治区公安厅 Method and system for AI intelligent labeling of Mongolian

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178896A (en) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 Unit selection voice synthetic method based on acoustics statistical model
CN102063897A (en) * 2010-12-09 2011-05-18 北京宇音天下科技有限公司 Sound library compression for embedded type voice synthesis system and use method thereof
CN103226946A (en) * 2013-03-26 2013-07-31 中国科学技术大学 Voice synthesis method based on limited Boltzmann machine

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
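HMM-based trainable Chinese speech synthesis;吴义坚 et al.;《中文信息学报》;20060730;Vol. 20 (No. 4);p. 76 paragraphs 4-5, p. 77 paragraphs 1-3 and Fig. 1 *
Research on automatic Mongolian part-of-speech tagging based on a history model;赵建东 et al.;《中文信息学报》;20130930;Vol. 27 (No. 5);pp. 156-159, p. 165 *
Application of conditional random field models to prosodic structure prediction;董远 et al.;《北京邮电大学学报》;20091031;Vol. 32 (No. 5);pp. 36-40 *
Research on methods for Mongolian letter-to-phoneme conversion;飞龙 et al.;《计算机应用研究》;20130630;Vol. 30 (No. 6);pp. 1696-1700 *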

Also Published As

Publication number Publication date
CN103632663A (en) 2014-03-12

Similar Documents

Publication Publication Date Title
CN103632663B (en) Method for front-end processing in HMM-based Mongolian speech synthesis
CN101178896B (en) Unit selection voice synthetic method based on acoustics statistical model
WO2018153213A1 (en) Multi-language hybrid speech recognition method
US20170047060A1 (en) Text-to-speech method and multi-lingual speech synthesizer using the method
CN101156196A (en) Hybrid speech synthesizer, method and use
CN107103900A (en) A kind of across language emotional speech synthesizing method and system
CN104217713A (en) Tibetan-Chinese speech synthesis method and device
CN112352275A (en) Neural text-to-speech synthesis with multi-level textual information
CN110767213A (en) Rhythm prediction method and device
CN103165126A (en) Method for voice playing of mobile phone text short messages
CN116092472A (en) Speech synthesis method and synthesis system
Kayte et al. A Marathi Hidden-Markov Model Based Speech Synthesis System
TWI605350B (en) Text-to-speech method and multiplingual speech synthesizer using the method
Levy et al. The effect of pitch, intensity and pause duration in punctuation detection
CN114678001A (en) Speech synthesis method and speech synthesis device
Mukherjee et al. A bengali hmm based speech synthesis system
CN105895076B (en) A kind of phoneme synthesizing method and system
KR100669241B1 (en) System and method of synthesizing dialog-style speech using speech-act information
Balyan et al. Automatic phonetic segmentation of Hindi speech using hidden Markov model
CN116129868A (en) Method and system for generating structured photo
TW201937479A (en) Multilingual mixed speech recognition method
CN114708848A (en) Method and device for acquiring size of audio and video file
CN114822490A (en) Voice splicing method and voice splicing device
WO2022144851A1 (en) System and method of automated audio output
JP4964695B2 (en) Speech synthesis apparatus, speech synthesis method, and program

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160615

Address after: 010021 Hohhot West Road, the Inner Mongolia Autonomous Region, No. 235

Applicant after: Inner Mongolia University

Address before: 010021 Hohhot West Road, the Inner Mongolia Autonomous Region, No. 235

Applicant before: Fei Long

C14 Grant of patent or utility model
GR01 Patent grant