CN103632663B - Method for HMM-based Mongolian speech synthesis front-end processing - Google Patents
Method for HMM-based Mongolian speech synthesis front-end processing
- Publication number
- CN103632663B CN103632663B CN201310595871.8A CN201310595871A CN103632663B CN 103632663 B CN103632663 B CN 103632663B CN 201310595871 A CN201310595871 A CN 201310595871A CN 103632663 B CN103632663 B CN 103632663B
- Authority
- CN
- China
- Prior art keywords
- hmm
- mongolian
- mongol
- synthesis
- end processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
A method for HMM-based Mongolian speech synthesis front-end processing. First, the speech data in the corpus is analyzed; a context attribute set and a question set for decision-tree clustering are compiled; HMM training is carried out to obtain the trained HMMs; the corresponding clustered state models are obtained, forming a clustered state model sequence; the dynamic features of the parameters are used to generate the target acoustic parameter sequence, and the synthesized speech is obtained with the STRAIGHT vocoder. The front-end processing method is: process the special characters in the input Mongolian text; transliterate the Mongolian script to Latin; perform Mongolian part-of-speech tagging; convert the Latin transliteration to phonemes with a G2P model; split words into syllables by rule; assemble the data into the format required for prosody prediction; carry out prosody prediction; and produce the text format required for synthesis. The present invention can flexibly adjust the duration and pitch-period parameters of the synthesized speech, keeps the synthesized speech natural and fluent, and reduces the cost of synthesis.
Description
Technical field
The invention belongs to the field of speech synthesis technology, and in particular relates to a method for HMM-based Mongolian speech synthesis front-end processing.
Background art
Speech synthesis is a key technology for realizing human-machine speech communication. It involves multiple disciplines such as acoustics, linguistics, digital signal processing and computer science, and is a frontier technology of the information processing field. The main problem it solves is how to convert text information into audible acoustic information: in short, to make machines speak like people.
The mainstream approach today is HMM-based speech synthesis. Compared with waveform concatenation based on a large corpus, it can automatically construct a new system in a short time with essentially no manual intervention, and the whole training process is largely independent of the speaker and of pronunciation factors such as style and emotion. Many practical HMM-based synthesis systems already exist for Chinese, English, Japanese and other languages. For Mongolian speech synthesis, earlier researchers worked on waveform concatenation based on a large corpus. That work was significant for the development of Mongolian speech synthesis, but HMM-based Mongolian synthesis research is still in its infancy. Studying HMM-based Mongolian speech synthesis and building an HMM-based Mongolian speech synthesis system is of great importance to education, transportation, communication and office automation in minority areas.
Prior art 1: Mongolian speech synthesis based on stems and affixes.
First, a text analysis module formats the input text, records the words to be pronounced and the punctuation marks, and filters out unpronounced characters. Then, for each word, the stem table is searched first; if the word's base form is found there, the corresponding speech data is extracted. If it is not found in the stem table, stem-affix segmentation is performed to find the stem and affixes of the word, and the speech data corresponding to those stems and affixes is extracted. Next, prosodic features are recorded and prosody modification is carried out with the PSOLA algorithm. Finally, the chosen speech segments are concatenated according to splicing rules to produce the final synthesized speech.
The text analysis module of this scheme must handle problems such as text formatting and stem-affix segmentation. When the system encounters symbols or English words, it processes them in the simplest way: it reads out each English letter in sequence, with no further handling. For each word obtained in text analysis, the stem table is searched first; if the word is found, its speech data is extracted; if not, stem-affix segmentation is performed. A Mongolian word may have to be segmented before its corresponding stem and affixes can be found, and the stem obtained after segmentation sometimes needs a second or even repeated segmentation, which is why the concept of a multi-level stem-affix segmenter was proposed.
In the text analysis module, the prosodic features of phrases and sentences are represented by recording the prosodic features of each word: word stress is expressed through changes in duration, pitch and amplitude; within a phrase, the duration and stress of each word vary; and the sentence intonation varies as well (covering stress, duration and fundamental frequency). After text analysis, the prosodic variation of each word is obtained, the system modifies the speech with TD-PSOLA and FD-PSOLA, and the modified speech segments are produced.
Three algorithms are most commonly used to join synthesis units: diphone splicing, hard splicing, and soft splicing. Hard splicing simply puts two speech segments together (simple concatenation). In soft splicing the junction likewise lies at the boundary of two segments, but the transition at the junction is smoothed by introducing the transient characteristics of natural speech, which may require some overlap between the two speech units. If only hard splicing is used in concatenative synthesis, the synthesized speech sounds uneven in tempo, jitters heavily, and lacks continuity. This scheme therefore combines hard splicing with soft splicing during concatenation. The system uses a head-tail-overlap soft splicing method: the waveform of the tail of the preceding speech unit is added to the waveform of the head of the following unit over a certain length. The choice of the overlap length is crucial; experience shows that a length of about one eighth of the shorter of the two units to be superimposed works well. This keeps the synthesized speech coherent and smooth between words while keeping the pronunciation clear, improving the naturalness of the synthesis.
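The head-tail-overlap soft splicing described above amounts to a cross-fade over one eighth of the shorter unit. A minimal sketch (the linear fade window is an assumption; the document does not specify the weighting):

```python
def soft_splice(prev, nxt):
    """Head-tail-overlap soft splice of two speech units (lists of
    samples): cross-fade the tail of `prev` into the head of `nxt`
    over 1/8 of the shorter unit's length."""
    n = min(len(prev), len(nxt)) // 8        # overlap length (the 1/8 rule)
    if n == 0:                               # units too short: hard splice
        return list(prev) + list(nxt)
    out = list(prev[:-n])
    for i in range(n):                       # linear fade: prev 1 -> 0, nxt 0 -> 1
        w = 1.0 - i / max(n - 1, 1)
        out.append(prev[len(prev) - n + i] * w + nxt[i] * (1.0 - w))
    out.extend(nxt[n:])
    return out
```

Splicing two 16-sample units thus overlaps 2 samples, so the output is 30 samples long and the junction moves gradually from the first unit's level to the second's.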
Shortcomings of the prior art: waveform concatenation based on a large database generally requires a very large speech corpus, which hinders its use on mobile or embedded devices. Second, the voice produced by waveform concatenation is fixed: changing characteristics of the synthesized voice such as gender or age requires building a new corpus at great expense. Moreover, although many concatenation algorithms adjust prosody, the range over which pitch period and duration can be adjusted is limited; larger adjustments noticeably degrade the naturalness of the synthesized speech.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a method for HMM-based Mongolian speech synthesis front-end processing, aiming to solve two problems of existing Mongolian speech synthesis methods: high synthesis cost, and a limited range over which pitch period and duration can be adjusted.
The embodiments of the present invention are achieved as follows. A method for HMM-based Mongolian speech synthesis comprises the following steps:
Step 1: analyze the speech data in the corpus and extract the corresponding speech parameters;
Step 2: according to the extracted speech parameters, divide the observation vector of the HMM into a spectrum part and a fundamental-frequency part, and compile the context attribute set and the question set for decision-tree clustering;
Step 3: train the HMMs through model initialization, initial/final (phone) HMM training, extended context-dependent model training, clustered model training and duration model training, finally obtaining the clustered HMMs for spectrum, fundamental frequency and duration together with the corresponding decision trees;
Step 4: convert the input text into a context-dependent unit sequence through text analysis, then use the trained decision trees to make a decision for each unit, obtain the corresponding clustered state models, and form the clustered state model sequence;
Step 5: according to the parameter generation algorithm, use the dynamic features of the parameters to generate the target acoustic parameter sequence, and obtain the final synthesized speech with the STRAIGHT vocoder.
Further, the HMM-based Mongolian speech synthesis method is divided into a training stage and a synthesis stage.
Further, in Step 2, the spectrum parameters are modeled with continuous-probability-distribution HMMs, and the fundamental frequency with multi-space probability distribution (MSD) HMMs.
Another object of the embodiments of the present invention is to provide a method for HMM-based Mongolian speech synthesis front-end processing, comprising the following steps:
Step 1: process the special characters in the input Mongolian text;
Step 2: transliterate the Mongolian script to Latin by rule;
Step 3: perform automatic Mongolian part-of-speech tagging with a history-model-based method;
Step 4: convert the Latin transliteration to phonemes with a G2P model;
Step 5: split words into syllables by rule, and count the number of syllables in each word;
Step 6: combine the obtained part-of-speech and syllable information into data in the format required for prosody prediction;
Step 7: perform prosody prediction with a conditional-random-field-based method;
Step 8: process the result into the text format required for synthesis.
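The conditional random field of the seventh step labels each word with a prosodic-boundary tag, using features built from the part-of-speech and syllable information assembled in the sixth step. A minimal sketch of such a feature function (the feature names and window are illustrative, not the patent's template; a CRF toolkit such as CRF++ or sklearn-crfsuite would consume dicts like these):

```python
def crf_features(words, pos_tags, syllable_counts, i):
    """Feature dict for word i, built from the part-of-speech and
    syllable-count information produced by the earlier steps."""
    feats = {
        "pos": pos_tags[i],                  # current word's POS tag
        "syllables": syllable_counts[i],     # syllables in current word
        "word_len": len(words[i]),
        "is_first": i == 0,
        "is_last": i == len(words) - 1,
    }
    if i > 0:                                # left-context window
        feats["prev_pos"] = pos_tags[i - 1]
    if i + 1 < len(words):                   # right-context window
        feats["next_pos"] = pos_tags[i + 1]
    return feats
```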
Further, in Step 1, the Mongolian special characters include characters such as digits and punctuation marks.
The method for HMM-based Mongolian speech synthesis front-end processing provided by the present invention covers the construction of the corpus for the HMM-based speech synthesis system, the selection of context attributes relevant to Mongolian, the determination of the question set used for clustering, and the front-end processing of the synthesis system. During front-end processing, the present invention performs Mongolian special-symbol handling, Mongolian-to-Latin transliteration, Latin-transliteration-to-phoneme conversion, Mongolian syllable segmentation, automatic Mongolian part-of-speech tagging, and Mongolian prosody prediction. The present invention can flexibly adjust parameters such as the duration and pitch period of the synthesized speech while keeping it natural and fluent; the parameters can be transformed with algorithms such as MLLR and MAP, so that voices with different timbres can be synthesized without building an additional speech database, reducing the cost of synthesis.
Brief description of the drawings
Fig. 1 is a flowchart of the HMM-based Mongolian speech synthesis method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of the HMM-based Mongolian speech synthesis front-end processing method provided by an embodiment of the present invention.
Detailed description of the invention
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
The application principle of the present invention is further described below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the HMM-based Mongolian speech synthesis method of the embodiment of the present invention comprises the following steps:
S101: analyze the speech data in the corpus and extract the corresponding speech parameters;
S102: according to the extracted speech parameters, divide the observation vector of the HMM into a spectrum part and a fundamental-frequency part, and compile the context attribute set and the question set for decision-tree clustering;
S103: train the HMMs through model initialization, initial/final HMM training, extended context-dependent model training, clustered model training and duration model training, finally obtaining the clustered HMMs for spectrum, fundamental frequency and duration together with the corresponding decision trees;
S104: convert the input text into a context-dependent unit sequence through text analysis, then use the trained decision trees to make a decision for each unit, obtain the corresponding clustered state models, and form the clustered state model sequence;
S105: according to the parameter generation algorithm, use the dynamic features of the parameters to generate the target acoustic parameter sequence, and obtain the final synthesized speech with the STRAIGHT vocoder.
The concrete steps of the HMM-based Mongolian speech synthesis method of the embodiment of the present invention are as follows.
The method is divided into two stages: a training stage and a synthesis stage.
The training stage mainly includes preprocessing and HMM training. In the preprocessing stage, the speech data in the corpus is first analyzed and the corresponding speech parameters are extracted. According to the extracted speech parameters, the observation vector of the HMM is divided into a spectrum part and a fundamental-frequency part: the spectrum parameters are modeled with continuous-probability-distribution HMMs, and the fundamental frequency with multi-space probability distribution (MSD) HMMs. In addition, an important task before model training is to design the context attribute set and the question set for decision-tree clustering, that is, based on prior knowledge, to select the context attributes relevant to Mongolian that have a clear influence on the acoustic parameters (spectrum, fundamental frequency and duration), and to design the corresponding question set for clustering the context-dependent models.
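A question set for decision-tree clustering is a list of yes/no membership questions about the left, current and right phones of a context-dependent unit. The sketch below is illustrative: the phone groupings, including the front/back vowel classes motivated by Mongolian vowel harmony, are hypothetical examples, not the patent's actual set.

```python
# Illustrative question set. Each question asks whether the phone at a
# context position (L = left neighbour, C = current, R = right neighbour)
# belongs to a phone class; the classes below are hypothetical examples.
QUESTION_SET = {
    "C-Vowel":      {"a", "e", "i", "o", "u"},
    "C-VowelFront": {"e", "i"},        # front-vowel harmony class
    "C-VowelBack":  {"a", "o", "u"},   # back-vowel harmony class
    "L-Nasal":      {"m", "n", "ng"},
    "R-Silence":    {"sil", "pau"},
}

def answer(question, context):
    """Answer one yes/no clustering question about a context-dependent
    unit; `context` maps positions ('L', 'C', 'R') to phone labels."""
    position = question.split("-", 1)[0]
    return context.get(position) in QUESTION_SET[question]
```

During tree construction, each node greedily picks the question whose yes/no split most improves the likelihood of the models in that node.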
HMM training follows once preprocessing is complete. Its steps are, in order: model initialization, initial/final HMM training, extended context-dependent model training, post-clustering model training, and duration model training. The final training result includes the clustered HMMs for spectrum, fundamental frequency and duration together with the corresponding decision trees.
In the synthesis stage, the input text is first converted into a context-dependent unit sequence through text analysis; the trained decision trees are then used to make a decision for each unit, the corresponding clustered state models are obtained, and the clustered state model sequence is formed. Finally, according to the parameter generation algorithm, the dynamic features of the parameters are used to generate the target acoustic parameter sequence, and the final synthesized speech is obtained with the STRAIGHT vocoder.
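The "dynamic features" used in parameter generation are the delta and delta-delta coefficients appended to the static parameters. For reference, the standard maximum-likelihood parameter generation formulation from the HMM-synthesis literature (a sketch of the usual approach; the patent does not spell out its exact variant) is:

```latex
% Static parameters c_t are stacked with their dynamics into o = Wc:
o_t = \begin{bmatrix} c_t \\ \Delta c_t \\ \Delta^2 c_t \end{bmatrix},
\qquad
\Delta c_t = \tfrac{1}{2}\,(c_{t+1} - c_{t-1}),
\qquad
\Delta^2 c_t = c_{t+1} - 2\,c_t + c_{t-1}.

% With state-sequence means \mu and covariances \Sigma taken from the
% clustered HMMs, the static sequence maximizing the output likelihood is:
\hat{c} = \arg\max_{c}\; \mathcal{N}\!\left(Wc;\, \mu, \Sigma\right)
\;\Longleftrightarrow\;
\left(W^{\top}\Sigma^{-1}W\right)\hat{c} = W^{\top}\Sigma^{-1}\mu .
```

Solving this banded linear system yields smooth trajectories that respect both the static means and the delta constraints, which is what allows the STRAIGHT vocoder to produce continuous-sounding speech from per-state models.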
As shown in Fig. 2, the HMM-based Mongolian speech synthesis front-end processing method of the embodiment of the present invention comprises the following steps:
S201: process the special characters in the input Mongolian text;
S202: transliterate the Mongolian script to Latin by rule;
S203: perform automatic Mongolian part-of-speech tagging with a history-model-based method;
S204: convert the Latin transliteration to phonemes with a G2P model;
S205: split words into syllables by rule, and count the number of syllables in each word;
S206: combine the obtained part-of-speech and syllable information into data in the format required for prosody prediction;
S207: perform prosody prediction with a conditional-random-field-based method;
S208: process the result into the text format required for synthesis.
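The rule-based script-to-Latin conversion of S202 can be sketched as a letter-mapping table. The table below covers only a handful of traditional Mongolian script letters, and both the letter set and the transliterations are illustrative stand-ins (real transliteration also has to handle positional letter forms and context-dependent rules); it is not the patent's actual rule table.

```python
# Illustrative letter table (Unicode Mongolian block, U+1820..):
# a, e, i, o, u, n, b, g, m, l, s, t, d. Hypothetical subset only.
TRANSLIT = {
    "\u1820": "a", "\u1821": "e", "\u1822": "i", "\u1823": "o",
    "\u1824": "u", "\u1828": "n", "\u182A": "b", "\u182D": "g",
    "\u182E": "m", "\u182F": "l", "\u1830": "s", "\u1832": "t",
    "\u1833": "d",
}

def to_latin(word):
    """Transliterate letter by letter, keeping unknown characters as-is."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word)
```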
The concrete steps of the HMM-based Mongolian speech synthesis front-end processing method of the present invention are as follows:
Step 1: process the special characters in the input Mongolian text (including characters such as digits and punctuation);
Step 2: transliterate the Mongolian script to Latin by rule;
Step 3: perform automatic Mongolian part-of-speech tagging with a history-model-based method;
Step 4: convert the Latin transliteration to phonemes with a G2P model;
Step 5: split words into syllables by rule, and count the number of syllables in each word;
Step 6: combine the part-of-speech and syllable information obtained in Steps 3 and 5 into data in the format required for prosody prediction;
Step 7: perform prosody prediction with a conditional-random-field-based method;
Step 8: process the result into the text format required for synthesis.
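The rule-based syllable splitting of Step 5 can be sketched as a greedy scan over the Latin transliteration: a new syllable starts at a consonant that is immediately followed by a vowel. This is an illustrative rule, not the patent's actual rule set.

```python
VOWELS = set("aeiou")   # vowels of the Latin transliteration (illustrative)

def split_syllables(word):
    """Greedy rule-based syllabification of a Latin-transliterated word:
    a new syllable opens at a consonant immediately followed by a vowel,
    once the current syllable already contains a vowel."""
    syllables, cur = [], ""
    for i, ch in enumerate(word):
        opens_new = (
            cur != ""
            and ch not in VOWELS
            and i + 1 < len(word)
            and word[i + 1] in VOWELS
            and any(c in VOWELS for c in cur)
        )
        if opens_new:
            syllables.append(cur)
            cur = ch
        else:
            cur += ch
    if cur:
        syllables.append(cur)
    return syllables

def count_syllables(word):
    """Per-word syllable count, as required by Step 5."""
    return len(split_syllables(word))
```

For example, "mongol" splits into "mon" and "gol" under this rule, giving a syllable count of 2 for the prosody-prediction data of Step 6.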
The HMM-based Mongolian speech synthesis method of the present invention can flexibly adjust parameters such as the duration and pitch period of the synthesized speech while keeping it natural and fluent. The parameters can be transformed with algorithms such as MLLR and MAP, so that voices with different timbres can be synthesized; unlike waveform concatenation, this transformation requires no additional speech database and only a small amount of training data.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (2)
1. A method for HMM-based Mongolian speech synthesis front-end processing, characterized in that the method comprises the following steps:
Step 1: process the special characters in the input Mongolian text;
Step 2: transliterate the Mongolian script to Latin by rule;
Step 3: perform automatic Mongolian part-of-speech tagging with a history-model-based method;
Step 4: convert the Latin transliteration to phonemes with a G2P model;
Step 5: split words into syllables by rule, and count the number of syllables in each word;
Step 6: combine the obtained part-of-speech and syllable information into data in the format required for prosody prediction;
Step 7: perform prosody prediction with a conditional-random-field-based method;
Step 8: process the result into the text format required for synthesis.
2. the method for Mongol phonetic synthesis front-end processing based on HMM as claimed in claim 1, it is characterised in that
In step one, Mongolian spcial character includes numeral, punctuation marks used to enclose the title character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310595871.8A CN103632663B (en) | 2013-11-25 | 2013-11-25 | Method for HMM-based Mongolian speech synthesis front-end processing
Publications (2)
Publication Number | Publication Date |
---|---|
CN103632663A CN103632663A (en) | 2014-03-12 |
CN103632663B true CN103632663B (en) | 2016-08-17 |
Family
ID=50213641
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310595871.8A Active CN103632663B (en) | 2013-11-25 | 2013-11-25 | Method for HMM-based Mongolian speech synthesis front-end processing
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103632663B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105427855A (en) * | 2015-11-09 | 2016-03-23 | 上海语知义信息技术有限公司 | Voice broadcast system and voice broadcast method of intelligent software |
CN105679308A (en) * | 2016-03-03 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence |
CN105957518B (en) * | 2016-06-16 | 2019-05-31 | 内蒙古大学 | A kind of method of Mongol large vocabulary continuous speech recognition |
CN108334502A (en) * | 2017-12-29 | 2018-07-27 | 内蒙古蒙科立蒙古文化股份有限公司 | A kind of method for mutually conversing of tradition Mongolian and Cyrillic Mongolian |
CN108766413B (en) * | 2018-05-25 | 2020-09-25 | 北京云知声信息技术有限公司 | Speech synthesis method and system |
CN110349564B (en) * | 2019-07-22 | 2021-09-24 | 思必驰科技股份有限公司 | Cross-language voice recognition method and device |
CN112562637B (en) * | 2019-09-25 | 2024-02-06 | 北京中关村科金技术有限公司 | Method, device and storage medium for splicing voice audios |
CN114936555B (en) * | 2022-05-24 | 2023-06-06 | 内蒙古自治区公安厅 | Method and system for AI intelligent labeling of Mongolian |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178896A (en) * | 2007-12-06 | 2008-05-14 | 安徽科大讯飞信息科技股份有限公司 | Unit selection voice synthetic method based on acoustics statistical model |
CN102063897A (en) * | 2010-12-09 | 2011-05-18 | 北京宇音天下科技有限公司 | Sound library compression for embedded type voice synthesis system and use method thereof |
CN103226946A (en) * | 2013-03-26 | 2013-07-31 | 中国科学技术大学 | Voice synthesis method based on limited Boltzmann machine |
- 2013-11-25: application CN201310595871.8A filed; granted as patent CN103632663B (status: Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178896A (en) * | 2007-12-06 | 2008-05-14 | 安徽科大讯飞信息科技股份有限公司 | Unit selection voice synthetic method based on acoustics statistical model |
CN102063897A (en) * | 2010-12-09 | 2011-05-18 | 北京宇音天下科技有限公司 | Sound library compression for embedded type voice synthesis system and use method thereof |
CN103226946A (en) * | 2013-03-26 | 2013-07-31 | 中国科学技术大学 | Voice synthesis method based on limited Boltzmann machine |
Non-Patent Citations (4)
Title |
---|
HMM-based trainable Chinese speech synthesis; Wu Yijian et al.; Journal of Chinese Information Processing; 2006-07-30; Vol. 20, No. 4; p. 76 paragraphs 4-5, p. 77 paragraphs 1-3 and Fig. 1 *
Research on automatic part-of-speech tagging of Mongolian based on history models; Zhao Jiandong et al.; Journal of Chinese Information Processing; 2013-09-30; Vol. 27, No. 5; pp. 156-159, 165 *
Application of conditional random field models to prosodic structure prediction; Dong Yuan et al.; Journal of Beijing University of Posts and Telecommunications; 2009-10-31; Vol. 32, No. 5; pp. 36-40 *
Research on Mongolian letter-to-phoneme conversion methods; Fei Long et al.; Application Research of Computers; 2013-06-30; Vol. 30, No. 6; pp. 1696-1700 *
Also Published As
Publication number | Publication date |
---|---|
CN103632663A (en) | 2014-03-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103632663B (en) | Method for HMM-based Mongolian speech synthesis front-end processing | |
CN101178896B (en) | Unit selection voice synthetic method based on acoustics statistical model | |
WO2018153213A1 (en) | Multi-language hybrid speech recognition method | |
US20170047060A1 (en) | Text-to-speech method and multi-lingual speech synthesizer using the method | |
CN101156196A (en) | Hybrid speech synthesizer, method and use | |
CN107103900A (en) | A kind of across language emotional speech synthesizing method and system | |
CN104217713A (en) | Tibetan-Chinese speech synthesis method and device | |
CN112352275A (en) | Neural text-to-speech synthesis with multi-level textual information | |
CN110767213A (en) | Rhythm prediction method and device | |
CN103165126A (en) | Method for voice playing of mobile phone text short messages | |
CN116092472A (en) | Speech synthesis method and synthesis system | |
Kayte et al. | A Marathi Hidden-Markov Model Based Speech Synthesis System | |
TWI605350B (en) | Text-to-speech method and multiplingual speech synthesizer using the method | |
Levy et al. | The effect of pitch, intensity and pause duration in punctuation detection | |
CN114678001A (en) | Speech synthesis method and speech synthesis device | |
Mukherjee et al. | A bengali hmm based speech synthesis system | |
CN105895076B (en) | A kind of phoneme synthesizing method and system | |
KR100669241B1 (en) | System and method of synthesizing dialog-style speech using speech-act information | |
Balyan et al. | Automatic phonetic segmentation of Hindi speech using hidden Markov model | |
CN116129868A (en) | Method and system for generating structured photo | |
TW201937479A (en) | Multilingual mixed speech recognition method | |
CN114708848A (en) | Method and device for acquiring size of audio and video file | |
CN114822490A (en) | Voice splicing method and voice splicing device | |
WO2022144851A1 (en) | System and method of automated audio output | |
JP4964695B2 (en) | Speech synthesis apparatus, speech synthesis method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C41 | Transfer of patent application or patent right or utility model | ||
TA01 | Transfer of patent application right |
Effective date of registration: 2016-06-15 | Address after: No. 235 West Road, Hohhot, Inner Mongolia Autonomous Region, 010021 | Applicant after: Inner Mongolia University | Address before: No. 235 West Road, Hohhot, Inner Mongolia Autonomous Region, 010021 | Applicant before: Fei Long |
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |