CN103632663B - Method for HMM-based Mongolian speech synthesis front-end processing - Google Patents
Method for HMM-based Mongolian speech synthesis front-end processing
- Publication number
- CN103632663B CN103632663B CN201310595871.8A CN201310595871A CN103632663B CN 103632663 B CN103632663 B CN 103632663B CN 201310595871 A CN201310595871 A CN 201310595871A CN 103632663 B CN103632663 B CN 103632663B
- Authority
- CN
- China
- Prior art keywords
- hmm
- mongolian
- mongol
- synthesis
- end processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
A method for HMM-based Mongolian speech synthesis front-end processing. First, the speech data in the corpus is analyzed; a context attribute set and a question set for decision-tree clustering are compiled; HMM training is carried out to obtain the trained HMMs; the corresponding clustered state models are obtained, forming a clustered state model sequence; the dynamic features of the parameters are used to generate the target acoustic parameter sequence, and the synthesized speech is obtained with the STRAIGHT vocoder. The front-end processing method is: process the special characters in the input Mongolian text; transliterate the Mongolian script to Latin; perform Mongolian part-of-speech tagging; convert the Latin transliteration to phonemes with a G2P model; split words into syllables by rule; assemble the data into the format required for prosody prediction; carry out prosody prediction; and produce the text format required for synthesis. The present invention can flexibly adjust the duration and pitch-period parameters of the synthesized speech, keeps the synthesized speech natural and fluent, and reduces the cost of synthesis.
Description
Technical field
The invention belongs to the field of speech synthesis technology, and in particular relates to a method for HMM-based Mongolian speech synthesis front-end processing.
Background art
Speech synthesis is a key technology for realizing human-machine speech communication. It involves multiple disciplines such as acoustics, linguistics, digital signal processing and computer science, and is a frontier technology of the information processing field. The main problem it solves is how to convert text information into audible acoustic information: in short, to make machines speak like people.
The mainstream approach today is HMM-based speech synthesis. Compared with waveform concatenation based on a large corpus, it can automatically construct a new system in a short time with essentially no manual intervention, and the whole training process is largely independent of the speaker and of pronunciation factors such as style and emotion. Many practical HMM-based synthesis systems already exist for Chinese, English, Japanese and other languages. For Mongolian speech synthesis, earlier researchers worked on waveform concatenation based on a large corpus. That work was significant for the development of Mongolian speech synthesis, but HMM-based Mongolian synthesis research is still in its infancy. Studying HMM-based Mongolian speech synthesis and building an HMM-based Mongolian speech synthesis system is of great importance to education, transportation, communication and office automation in minority areas.
Prior art 1: Mongolian speech synthesis based on stems and affixes.
First, a text analysis module formats the input text, records the words to be pronounced and the punctuation marks, and filters out unpronounced characters. Then, for each word, the stem table is searched first; if the word's base form is found there, the corresponding speech data is extracted. If it is not found in the stem table, stem-affix segmentation is performed to find the stem and affixes of the word, and the speech data corresponding to those stems and affixes is extracted. Next, prosodic features are recorded and prosody modification is carried out with the PSOLA algorithm. Finally, the chosen speech segments are concatenated according to splicing rules to produce the final synthesized speech.
The text analysis module of this scheme must handle problems such as text formatting and stem-affix segmentation. When the system encounters symbols or English words, it processes them in the simplest way: it reads out each English letter in sequence, with no further handling. For each word obtained in text analysis, the stem table is searched first; if the word is found, its speech data is extracted; if not, stem-affix segmentation is performed. A Mongolian word may have to be segmented before its corresponding stem and affixes can be found, and the stem obtained after segmentation sometimes needs a second or even repeated segmentation, which is why the concept of a multi-level stem-affix segmenter was proposed.
In the text analysis module, the prosodic features of phrases and sentences are represented by recording the prosodic features of each word: word stress is expressed through changes in duration, pitch and amplitude; within a phrase, the duration and stress of each word vary; and the sentence intonation varies as well (covering stress, duration and fundamental frequency). After text analysis, the prosodic variation of each word is obtained, the system modifies the speech with TD-PSOLA and FD-PSOLA, and the modified speech segments are produced.
Three algorithms are most commonly used to join synthesis units: diphone splicing, hard splicing, and soft splicing. Hard splicing simply puts two speech segments together (simple concatenation). In soft splicing the junction likewise lies at the boundary of two segments, but the transition at the junction is smoothed by introducing the transient characteristics of natural speech, which may require some overlap between the two speech units. If only hard splicing is used in concatenative synthesis, the synthesized speech sounds uneven in tempo, jitters heavily, and lacks continuity. This scheme therefore combines hard splicing with soft splicing during concatenation. The system uses a head-tail-overlap soft splicing method: the waveform of the tail of the preceding speech unit is added to the waveform of the head of the following unit over a certain length. The choice of the overlap length is crucial; experience shows that a length of about one eighth of the shorter of the two units to be superimposed works well. This keeps the synthesized speech coherent and smooth between words while keeping the pronunciation clear, improving the naturalness of the synthesis.
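The head-tail-overlap soft splicing described above amounts to a cross-fade over one eighth of the shorter unit. A minimal sketch (the linear fade window is an assumption; the document does not specify the weighting):

```python
def soft_splice(prev, nxt):
    """Head-tail-overlap soft splice of two speech units (lists of
    samples): cross-fade the tail of `prev` into the head of `nxt`
    over 1/8 of the shorter unit's length."""
    n = min(len(prev), len(nxt)) // 8        # overlap length (the 1/8 rule)
    if n == 0:                               # units too short: hard splice
        return list(prev) + list(nxt)
    out = list(prev[:-n])
    for i in range(n):                       # linear fade: prev 1 -> 0, nxt 0 -> 1
        w = 1.0 - i / max(n - 1, 1)
        out.append(prev[len(prev) - n + i] * w + nxt[i] * (1.0 - w))
    out.extend(nxt[n:])
    return out
```

Splicing two 16-sample units thus overlaps 2 samples, so the output is 30 samples long and the junction moves gradually from the first unit's level to the second's.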
Shortcomings of the prior art: waveform concatenation based on a large database generally requires a very large speech corpus, which hinders its use on mobile or embedded devices. Second, the voice produced by waveform concatenation is fixed: changing characteristics of the synthesized voice such as gender or age requires building a new corpus at great expense. Moreover, although many concatenation algorithms adjust prosody, the range over which pitch period and duration can be adjusted is limited; larger adjustments noticeably degrade the naturalness of the synthesized speech.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a method for HMM-based Mongolian speech synthesis front-end processing, aiming to solve two problems of existing Mongolian speech synthesis methods: high synthesis cost, and a limited range over which pitch period and duration can be adjusted.
The embodiments of the present invention are achieved as follows. A method for HMM-based Mongolian speech synthesis comprises the following steps:
Step 1: analyze the speech data in the corpus and extract the corresponding speech parameters;
Step 2: according to the extracted speech parameters, divide the observation vector of the HMM into a spectrum part and a fundamental-frequency part, and compile the context attribute set and the question set for decision-tree clustering;
Step 3: train the HMMs through model initialization, initial/final (phone) HMM training, extended context-dependent model training, clustered model training and duration model training, finally obtaining the clustered HMMs for spectrum, fundamental frequency and duration together with the corresponding decision trees;
Step 4: convert the input text into a context-dependent unit sequence through text analysis, then use the trained decision trees to make a decision for each unit, obtain the corresponding clustered state models, and form the clustered state model sequence;
Step 5: according to the parameter generation algorithm, use the dynamic features of the parameters to generate the target acoustic parameter sequence, and obtain the final synthesized speech with the STRAIGHT vocoder.
Further, the HMM-based Mongolian speech synthesis method is divided into a training stage and a synthesis stage.
Further, in Step 2, the spectrum parameters are modeled with continuous-probability-distribution HMMs, and the fundamental frequency with multi-space probability distribution (MSD) HMMs.
Another object of the embodiments of the present invention is to provide a method for HMM-based Mongolian speech synthesis front-end processing, comprising the following steps:
Step 1: process the special characters in the input Mongolian text;
Step 2: transliterate the Mongolian script to Latin by rule;
Step 3: perform automatic Mongolian part-of-speech tagging with a history-model-based method;
Step 4: convert the Latin transliteration to phonemes with a G2P model;
Step 5: split words into syllables by rule, and count the number of syllables in each word;
Step 6: combine the obtained part-of-speech and syllable information into data in the format required for prosody prediction;
Step 7: perform prosody prediction with a conditional-random-field-based method;
Step 8: process the result into the text format required for synthesis.
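The conditional random field of the seventh step labels each word with a prosodic-boundary tag, using features built from the part-of-speech and syllable information assembled in the sixth step. A minimal sketch of such a feature function (the feature names and window are illustrative, not the patent's template; a CRF toolkit such as CRF++ or sklearn-crfsuite would consume dicts like these):

```python
def crf_features(words, pos_tags, syllable_counts, i):
    """Feature dict for word i, built from the part-of-speech and
    syllable-count information produced by the earlier steps."""
    feats = {
        "pos": pos_tags[i],                  # current word's POS tag
        "syllables": syllable_counts[i],     # syllables in current word
        "word_len": len(words[i]),
        "is_first": i == 0,
        "is_last": i == len(words) - 1,
    }
    if i > 0:                                # left-context window
        feats["prev_pos"] = pos_tags[i - 1]
    if i + 1 < len(words):                   # right-context window
        feats["next_pos"] = pos_tags[i + 1]
    return feats
```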
Further, in Step 1, the Mongolian special characters include characters such as digits and punctuation marks.
The method for HMM-based Mongolian speech synthesis front-end processing provided by the present invention covers the construction of the corpus for the HMM-based speech synthesis system, the selection of context attributes relevant to Mongolian, the determination of the question set used for clustering, and the front-end processing of the synthesis system. During front-end processing, the present invention performs Mongolian special-symbol handling, Mongolian-to-Latin transliteration, Latin-transliteration-to-phoneme conversion, Mongolian syllable segmentation, automatic Mongolian part-of-speech tagging, and Mongolian prosody prediction. The present invention can flexibly adjust parameters such as the duration and pitch period of the synthesized speech while keeping it natural and fluent; the parameters can be transformed with algorithms such as MLLR and MAP, so that voices with different timbres can be synthesized without building an additional speech database, reducing the cost of synthesis.
Brief description of the drawings
Fig. 1 is a flowchart of the HMM-based Mongolian speech synthesis method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of the HMM-based Mongolian speech synthesis front-end processing method provided by an embodiment of the present invention.
Detailed description of the invention
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
The application principle of the present invention is further described below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the HMM-based Mongolian speech synthesis method of the embodiment of the present invention comprises the following steps:
S101: analyze the speech data in the corpus and extract the corresponding speech parameters;
S102: according to the extracted speech parameters, divide the observation vector of the HMM into a spectrum part and a fundamental-frequency part, and compile the context attribute set and the question set for decision-tree clustering;
S103: train the HMMs through model initialization, initial/final HMM training, extended context-dependent model training, clustered model training and duration model training, finally obtaining the clustered HMMs for spectrum, fundamental frequency and duration together with the corresponding decision trees;
S104: convert the input text into a context-dependent unit sequence through text analysis, then use the trained decision trees to make a decision for each unit, obtain the corresponding clustered state models, and form the clustered state model sequence;
S105: according to the parameter generation algorithm, use the dynamic features of the parameters to generate the target acoustic parameter sequence, and obtain the final synthesized speech with the STRAIGHT vocoder.
The concrete steps of the HMM-based Mongolian speech synthesis method of the embodiment of the present invention are as follows.
The method is divided into two stages: a training stage and a synthesis stage.
The training stage mainly includes preprocessing and HMM training. In the preprocessing stage, the speech data in the corpus is first analyzed and the corresponding speech parameters are extracted. According to the extracted speech parameters, the observation vector of the HMM is divided into a spectrum part and a fundamental-frequency part: the spectrum parameters are modeled with continuous-probability-distribution HMMs, and the fundamental frequency with multi-space probability distribution (MSD) HMMs. In addition, an important task before model training is to design the context attribute set and the question set for decision-tree clustering, that is, based on prior knowledge, to select the context attributes relevant to Mongolian that have a clear influence on the acoustic parameters (spectrum, fundamental frequency and duration), and to design the corresponding question set for clustering the context-dependent models.
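A question set for decision-tree clustering is a list of yes/no membership questions about the left, current and right phones of a context-dependent unit. The sketch below is illustrative: the phone groupings, including the front/back vowel classes motivated by Mongolian vowel harmony, are hypothetical examples, not the patent's actual set.

```python
# Illustrative question set. Each question asks whether the phone at a
# context position (L = left neighbour, C = current, R = right neighbour)
# belongs to a phone class; the classes below are hypothetical examples.
QUESTION_SET = {
    "C-Vowel":      {"a", "e", "i", "o", "u"},
    "C-VowelFront": {"e", "i"},        # front-vowel harmony class
    "C-VowelBack":  {"a", "o", "u"},   # back-vowel harmony class
    "L-Nasal":      {"m", "n", "ng"},
    "R-Silence":    {"sil", "pau"},
}

def answer(question, context):
    """Answer one yes/no clustering question about a context-dependent
    unit; `context` maps positions ('L', 'C', 'R') to phone labels."""
    position = question.split("-", 1)[0]
    return context.get(position) in QUESTION_SET[question]
```

During tree construction, each node greedily picks the question whose yes/no split most improves the likelihood of the models in that node.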
HMM training follows once preprocessing is complete. Its steps are, in order: model initialization, initial/final HMM training, extended context-dependent model training, post-clustering model training, and duration model training. The final training result includes the clustered HMMs for spectrum, fundamental frequency and duration together with the corresponding decision trees.
In the synthesis stage, the input text is first converted into a context-dependent unit sequence through text analysis; the trained decision trees are then used to make a decision for each unit, the corresponding clustered state models are obtained, and the clustered state model sequence is formed. Finally, according to the parameter generation algorithm, the dynamic features of the parameters are used to generate the target acoustic parameter sequence, and the final synthesized speech is obtained with the STRAIGHT vocoder.
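The "dynamic features" used in parameter generation are the delta and delta-delta coefficients appended to the static parameters. For reference, the standard maximum-likelihood parameter generation formulation from the HMM-synthesis literature (a sketch of the usual approach; the patent does not spell out its exact variant) is:

```latex
% Static parameters c_t are stacked with their dynamics into o = Wc:
o_t = \begin{bmatrix} c_t \\ \Delta c_t \\ \Delta^2 c_t \end{bmatrix},
\qquad
\Delta c_t = \tfrac{1}{2}\,(c_{t+1} - c_{t-1}),
\qquad
\Delta^2 c_t = c_{t+1} - 2\,c_t + c_{t-1}.

% With state-sequence means \mu and covariances \Sigma taken from the
% clustered HMMs, the static sequence maximizing the output likelihood is:
\hat{c} = \arg\max_{c}\; \mathcal{N}\!\left(Wc;\, \mu, \Sigma\right)
\;\Longleftrightarrow\;
\left(W^{\top}\Sigma^{-1}W\right)\hat{c} = W^{\top}\Sigma^{-1}\mu .
```

Solving this banded linear system yields smooth trajectories that respect both the static means and the delta constraints, which is what allows the STRAIGHT vocoder to produce continuous-sounding speech from per-state models.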
As shown in Fig. 2, the HMM-based Mongolian speech synthesis front-end processing method of the embodiment of the present invention comprises the following steps:
S201: process the special characters in the input Mongolian text;
S202: transliterate the Mongolian script to Latin by rule;
S203: perform automatic Mongolian part-of-speech tagging with a history-model-based method;
S204: convert the Latin transliteration to phonemes with a G2P model;
S205: split words into syllables by rule, and count the number of syllables in each word;
S206: combine the obtained part-of-speech and syllable information into data in the format required for prosody prediction;
S207: perform prosody prediction with a conditional-random-field-based method;
S208: process the result into the text format required for synthesis.
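The rule-based script-to-Latin conversion of S202 can be sketched as a letter-mapping table. The table below covers only a handful of traditional Mongolian script letters, and both the letter set and the transliterations are illustrative stand-ins (real transliteration also has to handle positional letter forms and context-dependent rules); it is not the patent's actual rule table.

```python
# Illustrative letter table (Unicode Mongolian block, U+1820..):
# a, e, i, o, u, n, b, g, m, l, s, t, d. Hypothetical subset only.
TRANSLIT = {
    "\u1820": "a", "\u1821": "e", "\u1822": "i", "\u1823": "o",
    "\u1824": "u", "\u1828": "n", "\u182A": "b", "\u182D": "g",
    "\u182E": "m", "\u182F": "l", "\u1830": "s", "\u1832": "t",
    "\u1833": "d",
}

def to_latin(word):
    """Transliterate letter by letter, keeping unknown characters as-is."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word)
```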
The concrete steps of the HMM-based Mongolian speech synthesis front-end processing method of the present invention are as follows:
Step 1: process the special characters in the input Mongolian text (including characters such as digits and punctuation);
Step 2: transliterate the Mongolian script to Latin by rule;
Step 3: perform automatic Mongolian part-of-speech tagging with a history-model-based method;
Step 4: convert the Latin transliteration to phonemes with a G2P model;
Step 5: split words into syllables by rule, and count the number of syllables in each word;
Step 6: combine the part-of-speech and syllable information obtained in Steps 3 and 5 into data in the format required for prosody prediction;
Step 7: perform prosody prediction with a conditional-random-field-based method;
Step 8: process the result into the text format required for synthesis.
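The rule-based syllable splitting of Step 5 can be sketched as a greedy scan over the Latin transliteration: a new syllable starts at a consonant that is immediately followed by a vowel. This is an illustrative rule, not the patent's actual rule set.

```python
VOWELS = set("aeiou")   # vowels of the Latin transliteration (illustrative)

def split_syllables(word):
    """Greedy rule-based syllabification of a Latin-transliterated word:
    a new syllable opens at a consonant immediately followed by a vowel,
    once the current syllable already contains a vowel."""
    syllables, cur = [], ""
    for i, ch in enumerate(word):
        opens_new = (
            cur != ""
            and ch not in VOWELS
            and i + 1 < len(word)
            and word[i + 1] in VOWELS
            and any(c in VOWELS for c in cur)
        )
        if opens_new:
            syllables.append(cur)
            cur = ch
        else:
            cur += ch
    if cur:
        syllables.append(cur)
    return syllables

def count_syllables(word):
    """Per-word syllable count, as required by Step 5."""
    return len(split_syllables(word))
```

For example, "mongol" splits into "mon" and "gol" under this rule, giving a syllable count of 2 for the prosody-prediction data of Step 6.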
The HMM-based Mongolian speech synthesis method of the present invention can flexibly adjust parameters such as the duration and pitch period of the synthesized speech while keeping it natural and fluent. The parameters can be transformed with algorithms such as MLLR and MAP, so that voices with different timbres can be synthesized; unlike waveform concatenation, this transformation requires no additional speech database and only a small amount of training data.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (2)
1. A method for HMM-based Mongolian speech synthesis front-end processing, characterized in that the method comprises the following steps:
Step 1: process the special characters in the input Mongolian text;
Step 2: transliterate the Mongolian script to Latin by rule;
Step 3: perform automatic Mongolian part-of-speech tagging with a history-model-based method;
Step 4: convert the Latin transliteration to phonemes with a G2P model;
Step 5: split words into syllables by rule, and count the number of syllables in each word;
Step 6: combine the obtained part-of-speech and syllable information into data in the format required for prosody prediction;
Step 7: perform prosody prediction with a conditional-random-field-based method;
Step 8: process the result into the text format required for synthesis.
2. the method for Mongol phonetic synthesis front-end processing based on HMM as claimed in claim 1, it is characterised in that
In step one, Mongolian spcial character includes numeral, punctuation marks used to enclose the title character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310595871.8A CN103632663B (en) | 2013-11-25 | 2013-11-25 | Method for HMM-based Mongolian speech synthesis front-end processing
Publications (2)
Publication Number | Publication Date |
---|---|
CN103632663A CN103632663A (en) | 2014-03-12 |
CN103632663B true CN103632663B (en) | 2016-08-17 |
Family
ID=50213641
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310595871.8A Active CN103632663B (en) | 2013-11-25 | 2013-11-25 | Method for HMM-based Mongolian speech synthesis front-end processing
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103632663B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105427855A (en) * | 2015-11-09 | 2016-03-23 | 上海语知义信息技术有限公司 | Voice broadcast system and voice broadcast method of intelligent software |
CN105679308A (en) * | 2016-03-03 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence |
CN105957518B (en) * | 2016-06-16 | 2019-05-31 | 内蒙古大学 | A kind of method of Mongol large vocabulary continuous speech recognition |
CN108334502A (en) * | 2017-12-29 | 2018-07-27 | 内蒙古蒙科立蒙古文化股份有限公司 | A kind of method for mutually conversing of tradition Mongolian and Cyrillic Mongolian |
CN108766413B (en) * | 2018-05-25 | 2020-09-25 | 北京云知声信息技术有限公司 | Speech synthesis method and system |
CN110349564B (en) * | 2019-07-22 | 2021-09-24 | 思必驰科技股份有限公司 | Cross-language voice recognition method and device |
CN112562637B (en) * | 2019-09-25 | 2024-02-06 | 北京中关村科金技术有限公司 | Method, device and storage medium for splicing voice audios |
CN114936555B (en) * | 2022-05-24 | 2023-06-06 | 内蒙古自治区公安厅 | Method and system for AI intelligent labeling of Mongolian |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178896A (en) * | 2007-12-06 | 2008-05-14 | 安徽科大讯飞信息科技股份有限公司 | Unit selection voice synthetic method based on acoustics statistical model |
CN102063897A (en) * | 2010-12-09 | 2011-05-18 | 北京宇音天下科技有限公司 | Sound library compression for embedded type voice synthesis system and use method thereof |
CN103226946A (en) * | 2013-03-26 | 2013-07-31 | 中国科学技术大学 | Voice synthesis method based on limited Boltzmann machine |
- 2013-11-25: application CN201310595871.8A filed; granted as patent CN103632663B (status: Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178896A (en) * | 2007-12-06 | 2008-05-14 | 安徽科大讯飞信息科技股份有限公司 | Unit selection voice synthetic method based on acoustics statistical model |
CN102063897A (en) * | 2010-12-09 | 2011-05-18 | 北京宇音天下科技有限公司 | Sound library compression for embedded type voice synthesis system and use method thereof |
CN103226946A (en) * | 2013-03-26 | 2013-07-31 | 中国科学技术大学 | Voice synthesis method based on limited Boltzmann machine |
Non-Patent Citations (4)
Title |
---|
HMM-based trainable Chinese speech synthesis; Wu Yijian et al.; Journal of Chinese Information Processing; 2006-07-30; Vol. 20, No. 4; p. 76 paragraphs 4-5, p. 77 paragraphs 1-3 and Fig. 1 *
Research on automatic part-of-speech tagging of Mongolian based on history models; Zhao Jiandong et al.; Journal of Chinese Information Processing; 2013-09-30; Vol. 27, No. 5; pp. 156-159, 165 *
Application of conditional random field models to prosodic structure prediction; Dong Yuan et al.; Journal of Beijing University of Posts and Telecommunications; 2009-10-31; Vol. 32, No. 5; pp. 36-40 *
Research on Mongolian letter-to-phoneme conversion methods; Fei Long et al.; Application Research of Computers; 2013-06-30; Vol. 30, No. 6; pp. 1696-1700 *
Also Published As
Publication number | Publication date |
---|---|
CN103632663A (en) | 2014-03-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103632663B (en) | Method for HMM-based Mongolian speech synthesis front-end processing | |
CN101178896B (en) | Unit selection voice synthetic method based on acoustics statistical model | |
WO2018153213A1 (en) | Multi-language hybrid speech recognition method | |
US20170047060A1 (en) | Text-to-speech method and multi-lingual speech synthesizer using the method | |
CN101156196A (en) | Hybrid speech synthesizer, method and use | |
CN107103900A (en) | A kind of across language emotional speech synthesizing method and system | |
CN104217713A (en) | Tibetan-Chinese speech synthesis method and device | |
CN112352275A (en) | Neural text-to-speech synthesis with multi-level textual information | |
CN110767213A (en) | Rhythm prediction method and device | |
CN103165126A (en) | Method for voice playing of mobile phone text short messages | |
CN116092472A (en) | Speech synthesis method and synthesis system | |
Kayte et al. | A Marathi Hidden-Markov Model Based Speech Synthesis System | |
TWI605350B (en) | Text-to-speech method and multiplingual speech synthesizer using the method | |
Levy et al. | The effect of pitch, intensity and pause duration in punctuation detection | |
CN114678001A (en) | Speech synthesis method and speech synthesis device | |
Mukherjee et al. | A bengali hmm based speech synthesis system | |
CN105895076B (en) | A kind of phoneme synthesizing method and system | |
KR100669241B1 (en) | System and method of synthesizing dialog-style speech using speech-act information | |
Balyan et al. | Automatic phonetic segmentation of Hindi speech using hidden Markov model | |
CN116129868A (en) | Method and system for generating structured photo | |
TW201937479A (en) | Multilingual mixed speech recognition method | |
CN114708848A (en) | Method and device for acquiring size of audio and video file | |
CN114822490A (en) | Voice splicing method and voice splicing device | |
WO2022144851A1 (en) | System and method of automated audio output | |
JP4964695B2 (en) | Speech synthesis apparatus, speech synthesis method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C41 | Transfer of patent application or patent right or utility model | ||
TA01 | Transfer of patent application right |
Effective date of registration: 2016-06-15 | Address after: No. 235 West Road, Hohhot, Inner Mongolia Autonomous Region, 010021 | Applicant after: Inner Mongolia University | Address before: No. 235 West Road, Hohhot, Inner Mongolia Autonomous Region, 010021 | Applicant before: Fei Long |
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |