CN103632663A

CN103632663A - HMM-based method of Mongolian speech synthesis and front-end processing

Info

Publication number: CN103632663A
Application number: CN201310595871.8A
Authority: CN
Inventors: 飞龙; 高光来; 赵建东; 张学良
Original assignee: Individual
Current assignee: Inner Mongolia University
Priority date: 2013-11-25
Filing date: 2013-11-25
Publication date: 2014-03-12
Anticipated expiration: 2033-11-25
Also published as: CN103632663B

Abstract

The invention discloses an HMM-based method of Mongolian speech synthesis and front-end processing. The synthesis method comprises the following steps: analyzing the speech data in a speech corpus; defining a context attribute set and a question set for decision-tree clustering; performing HMM training and obtaining the training results; obtaining the corresponding clustered state models and forming a clustered state model sequence; generating a target acoustic parameter sequence using the dynamic features of the parameters, and obtaining synthesized speech through a STRAIGHT synthesizer. The front-end processing method comprises the following steps: processing special characters in the input Mongolian text; transliterating Mongolian to Latin; tagging the parts of speech of the Mongolian text; converting the Latin transliteration to phonemes with a G2P model; segmenting syllables by rule; assembling the data in the format required for prosody prediction; performing prosody prediction; and producing the text format required for synthesis. With this method, the duration and pitch period parameters of the synthesized speech can be adjusted freely, the naturalness and fluency of the synthesized speech are preserved, and the cost of synthesis is reduced.

Description

HMM-based method of Mongolian speech synthesis and front-end processing

Technical field

The invention belongs to the field of speech synthesis technology, and in particular relates to an HMM-based method of Mongolian speech synthesis and front-end processing.

Background art

Speech synthesis is a key technology for spoken communication between humans and machines. It draws on several disciplines, including acoustics, linguistics, digital signal processing, and computer science, and is a leading-edge technology in the field of information processing. The main problem it solves is how to convert text into audible sound, that is, how to make a machine speak like a person.

The current mainstream speech synthesis method is HMM-based speech synthesis. Compared with waveform concatenation based on a large corpus, it can build a new system automatically in a short time with essentially no manual intervention, and the whole training process is largely independent of factors such as speaker, speaking style, and emotion. Many practical speech synthesis systems already exist for Chinese, English, Japanese, and other languages. For Mongolian speech synthesis, earlier researchers have studied waveform concatenation based on a large corpus. That work is significant for the development of Mongolian speech synthesis, but research on HMM-based Mongolian speech synthesis is still in its infancy. Studying HMM-based Mongolian speech synthesis and building an HMM-based Mongolian speech synthesis system is of great importance for education, transportation, communication, and office automation in ethnic minority regions.

Prior art 1, a Mongolian speech synthesis method based on stems and affixes:

First, the text analysis module formats the input text, recording the words and punctuation marks to be pronounced and filtering out unpronounced characters. Each word is then looked up in the stem table. If the base form of the word is found, the corresponding speech data is extracted. If it is not found, the word is segmented into stem and affixes, and the speech data corresponding to the stem and affixes is extracted. Prosodic features are then recorded, and the PSOLA algorithm is used to modify the prosody. Finally, the selected speech segments are concatenated according to the splicing rules to produce the final synthesized speech;

In this technical scheme, the text analysis module has to handle text formatting and stem-affix segmentation, among other problems. When the system encounters symbols or words written in English, it processes them only in a simple way, reading out each English letter in turn without further consideration. Each word obtained by text analysis is first looked up in the stem table; if the word is found, its speech data is extracted, and if not, stem-affix segmentation is performed. A Mongolian word may need only one segmentation to find its stem and affixes, but sometimes the stem produced by a segmentation must be segmented a second time or several more times, which is why the concept of multi-level stem-affix segmentation was proposed;

The prosodic features of phrases and sentences are represented by recording the prosodic features of each word in the text analysis module. The stress of a word is expressed by recording the changes in its duration, pitch, and amplitude; within a phrase, the changes in the duration and stress of each word are recorded, as are the changes in sentence intonation (including stress, duration, and fundamental frequency). After text analysis, the prosodic variation of each word is obtained, the system applies the modifications using TD-PSOLA and FD-PSOLA, and the modified speech segments are finally obtained;

Three algorithms are most commonly used to splice speech synthesis units: diphone splicing, hard splicing, and soft splicing. Hard splicing simply places two speech segments one after the other (simple concatenation). In soft splicing, the splice point is likewise located at the boundary between two segments, but the transition at the splice point is smoothed by introducing the transitional characteristics of natural speech, which may require a certain overlap between the two speech units. In concatenative synthesis, if only hard splicing is used, the synthesized speech sometimes sounds fast and sometimes slow, jitters heavily, and lacks continuity. This technical scheme therefore combines hard splicing with soft splicing. The system uses a head-to-tail overlapping soft splicing method for the transition: the tail of the preceding speech unit and the head of the following speech unit are superimposed as waveforms over a certain length. The choice of this overlap length is critical. Experience shows that an overlap of about one eighth of the length of the shorter of the two speech units is appropriate. This keeps the speech between words coherent and smooth after synthesis while keeping the pronunciation clear, and improves the naturalness of the synthesized speech.
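The head-to-tail overlapping splice described for the prior art can be sketched in a few lines of Python. This is a minimal illustration, not the prior-art system's code: the function name `soft_splice` is hypothetical, and the linear cross-fade weights are an assumption, since the text only says that the two waveforms are superimposed over about one eighth of the shorter unit.

```python
def soft_splice(prev_unit, next_unit, overlap_ratio=1 / 8):
    """Join two waveform units, overlapping the tail of the first with the head of the second."""
    # Overlap length: about one eighth of the shorter unit, as stated in the text.
    overlap = int(min(len(prev_unit), len(next_unit)) * overlap_ratio)
    if overlap == 0:
        # Units too short to overlap: fall back to hard splicing (plain concatenation).
        return list(prev_unit) + list(next_unit)

    head = list(prev_unit[:-overlap])   # part of the first unit left untouched
    tail = list(next_unit[overlap:])    # part of the second unit left untouched
    offset = len(prev_unit) - overlap   # index where the overlap starts in prev_unit

    blended = []
    for i in range(overlap):
        # Assumed linear cross-fade: weight moves from the first unit to the second.
        w = (i + 1) / (overlap + 1)
        blended.append((1 - w) * prev_unit[offset + i] + w * next_unit[i])
    return head + blended + tail


# Two 16-sample units overlap by 2 samples, so the result has 30 samples.
joined = soft_splice([1.0] * 16, [0.0] * 16)
```

With `overlap_ratio=0` the function degenerates to hard splicing, which makes the difference between the two methods easy to compare on the same pair of units.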

Disadvantages of the prior art: waveform concatenation based on a large database generally requires a very large speech database, which hinders its use on mobile or embedded devices. Second, the voice synthesized by waveform concatenation is rather uniform; changing characteristics of the synthesized voice such as gender or age requires building a new speech database, which demands a large investment. Finally, although many splicing algorithms with prosody adjustment exist, the range over which pitch period and duration can be adjusted is still limited, and the naturalness of the synthesized speech drops noticeably when the adjustment is large.

Summary of the invention

The object of the embodiments of the present invention is to provide an HMM-based method of Mongolian speech synthesis and front-end processing, intended to solve the problems of existing Mongolian speech synthesis methods: high synthesis cost and a limited range of adjustment for pitch period and duration.

The embodiments of the present invention are realized as follows: an HMM-based method of Mongolian speech synthesis, comprising the following steps:

Step 1, analyze the speech data in the speech database and extract the corresponding speech parameters;

Step 2, according to the extracted speech parameters, divide the HMM observation vector into two parts, spectrum and fundamental frequency, and define the context attribute set and the question set used for decision-tree clustering;

Step 3, perform HMM training in the following order: model initialization, initial/final HMM training, extended context-dependent model training, training of the clustered models, and duration model training; the training result finally obtained comprises the clustered HMMs for the spectrum, fundamental frequency, and duration parameters together with their respective decision trees;

Step 4, convert the input text into a sequence of context-dependent units through text analysis, then use the trained decision trees to make a decision for each unit, obtain the corresponding clustered state models, and form a clustered state model sequence;

Step 5, according to the parameter generation algorithm, use the dynamic features of the parameters to generate the target acoustic parameter sequence, and obtain the final synthesized speech through a STRAIGHT synthesizer.

Further, the HMM-based Mongolian speech synthesis method is divided into a training stage and a synthesis stage.

Further, in step 2, the spectral parameter part is modeled with a continuous probability distribution HMM, and the fundamental frequency part is modeled with a multi-space probability distribution HMM.
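The reason for modeling the fundamental frequency with a multi-space probability distribution is that F0 is a continuous value in voiced frames and undefined in unvoiced frames. The following Python sketch shows the output probability of a single state under the simplest two-space arrangement: a one-dimensional Gaussian for voiced frames and a zero-dimensional space for unvoiced frames. The function name, the single-Gaussian voiced space, and the use of `None` for unvoiced frames are assumptions made for illustration; the source does not give these details.

```python
import math


def msd_f0_likelihood(obs, voiced_weight, mean, var):
    """Output probability of one F0 observation under a two-space MSD state.

    obs is None for an unvoiced frame, or a (log-)F0 value for a voiced frame.
    voiced_weight is the weight of the voiced space; the unvoiced space gets the rest.
    """
    if obs is None:
        # Zero-dimensional space: only the space weight contributes.
        return 1.0 - voiced_weight
    # One-dimensional space: space weight times a Gaussian density.
    gauss = math.exp(-0.5 * (obs - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return voiced_weight * gauss


# A state that is voiced 80% of the time, with illustrative log-F0 statistics.
p_unvoiced = msd_f0_likelihood(None, voiced_weight=0.8, mean=5.0, var=0.04)
p_voiced = msd_f0_likelihood(5.1, voiced_weight=0.8, mean=5.0, var=0.04)
```

A single model can therefore score both voiced and unvoiced frames, which avoids interpolating artificial F0 values through unvoiced regions.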

Another object of the embodiments of the present invention is to provide a front-end processing method for HMM-based Mongolian speech synthesis, comprising the following steps:

The first step, process the special characters in the input Mongolian text;

The second step, transliterate the Mongolian text to Latin by rule;

The third step, tag the parts of speech of the Mongolian text automatically, using a history-based model;

The fourth step, convert the Latin transliteration to phonemes with a G2P model;

The fifth step, segment the syllables by rule and count the syllables in each word;

The sixth step, combine the resulting part-of-speech and syllable information into data in the format required for prosody prediction;

The seventh step, perform prosody prediction with a method based on conditional random fields;

The eighth step, process the result to obtain the text format required for synthesis.

Further, in the first step, the special characters processed in the Mongolian text include digits and punctuation marks.
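The rule-based Mongolian-to-Latin transliteration of the second step can be pictured as a table lookup over the input characters. The Python sketch below is illustrative only: the source does not list its transliteration rules, so the table `LATIN_TABLE`, the function `transliterate`, and the choice of Latin symbols are hypothetical, and only five Unicode Mongolian vowel letters are mapped.

```python
# Hypothetical rule table: Unicode Mongolian letters -> Latin symbols.
# The actual rules are not given in the source; these entries are placeholders.
LATIN_TABLE = {
    "\u1820": "a",  # MONGOLIAN LETTER A
    "\u1821": "e",  # MONGOLIAN LETTER E
    "\u1822": "i",  # MONGOLIAN LETTER I
    "\u1823": "o",  # MONGOLIAN LETTER O
    "\u1824": "u",  # MONGOLIAN LETTER U
}


def transliterate(text, table=LATIN_TABLE, unknown="?"):
    """Map each character through the rule table.

    Whitespace is kept as-is; characters without a rule are replaced by `unknown`
    so that gaps in the table are visible in the output.
    """
    out = []
    for ch in text:
        if ch.isspace():
            out.append(ch)
        else:
            out.append(table.get(ch, unknown))
    return "".join(out)


latin = transliterate("\u1820\u1822 \u1824")
```

A real rule set would also have to handle positional letter forms and context-dependent readings, which a flat character table cannot express.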

The HMM-based method of Mongolian speech synthesis and front-end processing provided by the present invention covers building the corpus for the HMM-based speech synthesis system, selecting the context attributes relevant to Mongolian, determining the question set used for clustering, and processing at the front end of the speech synthesis system. In front-end processing, the present invention performs Mongolian special-symbol processing, Mongolian-to-Latin transliteration, Latin-to-phoneme conversion, Mongolian syllable segmentation, automatic Mongolian part-of-speech tagging, and Mongolian prosody prediction. The present invention can freely adjust parameters of the synthesized speech such as duration and pitch period while keeping the synthesized speech natural and fluent. By transforming the parameters with algorithms such as MLLR and MAP, it can synthesize voices of different timbres without building an additional speech database, which reduces the cost of synthesis.

Brief description of the drawings

Fig. 1 is a flow chart of the HMM-based Mongolian speech synthesis method provided by an embodiment of the present invention;

Fig. 2 is a flow chart of the front-end processing method for HMM-based Mongolian speech synthesis provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

The application principle of the present invention is further described below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the HMM-based Mongolian speech synthesis method of the embodiment of the present invention comprises the following steps:

S101: analyze the speech data in the speech database and extract the corresponding speech parameters;

S102: according to the extracted speech parameters, divide the HMM observation vector into two parts, spectrum and fundamental frequency, and define the context attribute set and the question set used for decision-tree clustering;

S103: perform HMM training in the following order: model initialization, initial/final HMM training, extended context-dependent model training, training of the clustered models, and duration model training; the training result finally obtained comprises the clustered HMMs for the spectrum, fundamental frequency, and duration parameters together with their respective decision trees;

S104: convert the input text into a sequence of context-dependent units through text analysis, then use the trained decision trees to make a decision for each unit, obtain the corresponding clustered state models, and form a clustered state model sequence;

S105: according to the parameter generation algorithm, use the dynamic features of the parameters to generate the target acoustic parameter sequence, and obtain the final synthesized speech through a STRAIGHT synthesizer.
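The decision step in S104 amounts to walking a binary tree of yes/no context questions until a leaf is reached. The Python sketch below illustrates this under stated assumptions: the `Node` class, the `lookup` function, the example question, and the leaf names such as `mgc_s2_7` are all hypothetical, and a real system keeps separate trees per stream and per HMM state.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """A tree node: internal nodes hold a question, leaves hold a clustered state model."""
    question: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    state_model: Optional[str] = None  # set only on leaves (hypothetical model name)


def lookup(root, unit):
    """Answer questions about `unit` from the root down and return the leaf's model."""
    node = root
    while node.state_model is None:
        node = node.yes if node.question(unit) else node.no
    return node.state_model


# Illustrative one-question tree and a two-unit context-dependent sequence.
tree = Node(
    question=lambda u: u["phone"] in {"a", "e"},
    yes=Node(state_model="mgc_s2_7"),
    no=Node(state_model="mgc_s2_12"),
)
units = [{"phone": "a"}, {"phone": "b"}]
state_sequence = [lookup(tree, u) for u in units]
```

Because every unit ends at some leaf, contexts that never occurred in the training data still receive a model, which is the practical benefit of tree-based clustering.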

The specific steps of the HMM-based Mongolian speech synthesis method of the embodiment of the present invention are as follows:

The method is divided into two stages: a training stage and a synthesis stage;

The training stage mainly comprises preprocessing and HMM training. In the preprocessing stage, the speech data in the speech database is first analyzed and the corresponding speech parameters are extracted. According to the extracted speech parameters, the HMM observation vector is divided into two parts, spectrum and fundamental frequency: the spectral parameter part is modeled with a continuous probability distribution HMM, and the fundamental frequency part with a multi-space probability distribution HMM. Another important task before model training is to design the context attribute set and the question set used for decision-tree clustering, that is, to select, on the basis of prior knowledge, context attributes that are relevant to Mongolian and have some influence on the acoustic parameters (spectrum, fundamental frequency, and duration), and to design the corresponding question set for clustering the context-dependent models;
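A context attribute set and its question set can be pictured as a record of attributes per unit plus a list of named yes/no predicates over that record. The Python sketch below is only an illustration: the attribute names, the question names, and the vowel inventory are hypothetical placeholders, since the source does not enumerate the attributes or questions it selected for Mongolian.

```python
# Placeholder vowel inventory; not the actual Mongolian phoneme set.
VOWELS = {"a", "e", "i", "o", "u"}

# Hypothetical context attributes for one phoneme-level unit.
unit = {
    "phone": "a",
    "left_phone": "b",
    "right_phone": "g",
    "syllables_in_word": 2,
    "pos": "NOUN",
}

# Each question is a named yes/no predicate over a unit's context attributes.
QUESTIONS = {
    "C-Vowel": lambda u: u["phone"] in VOWELS,
    "L-Vowel": lambda u: u["left_phone"] in VOWELS,
    "R-Vowel": lambda u: u["right_phone"] in VOWELS,
    "Word-Syllables<=2": lambda u: u["syllables_in_word"] <= 2,
    "POS-Noun": lambda u: u["pos"] == "NOUN",
}


def answer_questions(u):
    """Evaluate every question in the set for one unit."""
    return {name: ask(u) for name, ask in QUESTIONS.items()}


answers = answer_questions(unit)
```

During clustering, each candidate split of a tree node is one of these questions, so the question set bounds the distinctions the clustered models are able to make.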

After preprocessing comes the HMM training proper, whose steps are, in order: model initialization, initial/final HMM training, extended context-dependent model training, training of the clustered models, and duration model training. The training result finally obtained comprises the clustered HMMs for the spectrum, fundamental frequency, and duration parameters together with their respective decision trees;

In the synthesis stage, the input text is first converted into a sequence of context-dependent units through text analysis. The trained decision trees are then used to make a decision for each unit, the corresponding clustered state models are obtained, and a clustered state model sequence is formed. Finally, according to the parameter generation algorithm, the dynamic features of the parameters are used to generate the target acoustic parameter sequence, and the final synthesized speech is obtained through a STRAIGHT synthesizer.
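How dynamic features shape the generated trajectory can be shown for a one-dimensional parameter. The Python sketch below follows the maximum-likelihood parameter generation approach commonly used in HMM-based synthesis: with per-frame means and variances for the static and delta features, it solves (WᵀPW)c = WᵀPμ for the static trajectory c, where P is the diagonal matrix of inverse variances. The source does not name its specific algorithm, so this choice, the delta window 0.5·(c[t+1] − c[t−1]) with edge replication, and all function names are assumptions.

```python
def build_window_matrix(num_frames):
    """Rows alternate static and delta windows, so W @ c stacks [c_t, delta_t] per frame."""
    rows = []
    for t in range(num_frames):
        static = [0.0] * num_frames
        static[t] = 1.0
        delta = [0.0] * num_frames
        # Assumed delta window 0.5 * (c[t+1] - c[t-1]), replicating the edge frames.
        delta[min(t + 1, num_frames - 1)] += 0.5
        delta[max(t - 1, 0)] -= 0.5
        rows.extend([static, delta])
    return rows


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        acc = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - acc) / m[i][i]
    return x


def generate_trajectory(means, variances):
    """means and variances hold one (static, delta) pair per frame."""
    n = len(means)
    w = build_window_matrix(n)
    mu = [value for pair in means for value in pair]
    prec = [1.0 / value for pair in variances for value in pair]
    rows = range(2 * n)
    lhs = [[sum(w[r][i] * prec[r] * w[r][j] for r in rows) for j in range(n)]
           for i in range(n)]
    rhs = [sum(w[r][i] * prec[r] * mu[r] for r in rows) for i in range(n)]
    return solve(lhs, rhs)


# A step in the static means with zero delta means yields a smoothed step.
trajectory = generate_trajectory([(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0)],
                                 [(1.0, 1.0)] * 4)
```

Without the delta rows the solution would simply repeat the static means, producing a step at every state boundary; the delta constraints are what smooth the transition.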

As shown in Fig. 2, the front-end processing method for HMM-based Mongolian speech synthesis of the embodiment of the present invention comprises the following steps:

S201: process the special characters in the input Mongolian text;

S202: transliterate the Mongolian text to Latin by rule;

S203: tag the parts of speech of the Mongolian text automatically, using a history-based model;

S204: convert the Latin transliteration to phonemes with a G2P model;

S205: segment the syllables by rule and count the syllables in each word;

S206: combine the resulting part-of-speech and syllable information into data in the format required for prosody prediction;

S207: perform prosody prediction with a method based on conditional random fields;

S208: process the result to obtain the text format required for synthesis.
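Rule-based syllable segmentation and counting, as in S205, can be sketched as follows. The source does not state its segmentation rules, so the rule used here is a deliberately simple assumption: every maximal run of vowels is one syllable nucleus, and the last consonant before a nucleus opens the next syllable. The vowel set, the function name `split_syllables`, and the example strings are hypothetical.

```python
# Placeholder vowel symbols for the Latin transliteration (assumption).
VOWELS = set("aeiou")


def split_syllables(word):
    """Split a Latin-transliterated word into syllables with one assumed rule."""
    # Locate nuclei: maximal runs of vowels, stored as (start, end) index pairs.
    nuclei = []
    i = 0
    while i < len(word):
        if word[i] in VOWELS:
            j = i
            while j < len(word) and word[j] in VOWELS:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    if len(nuclei) <= 1:
        return [word]

    # Between two nuclei there is at least one consonant; the last one
    # starts the following syllable, any others close the preceding one.
    cuts = [start - 1 for start, _ in nuclei[1:]]
    pieces = []
    previous = 0
    for cut in cuts:
        pieces.append(word[previous:cut])
        previous = cut
    pieces.append(word[previous:])
    return pieces


syllables = split_syllables("bagsi")
syllable_count = len(syllables)
```

The per-word count is the value passed on to the prosody prediction data in S206.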

The specific steps of the front-end processing method for HMM-based Mongolian speech synthesis of the present invention are as follows:

One: process the special characters in the input Mongolian text (including characters such as digits and punctuation marks);

Two: transliterate the Mongolian text to Latin by rule;

Three: tag the parts of speech of the Mongolian text automatically, using a history-based model;

Four: convert the Latin transliteration to phonemes with a G2P model;

Five: segment the syllables by rule and count the syllables in each word;

Six: combine the part-of-speech information obtained in step three and the syllable information obtained in step five into data in the format required for prosody prediction;

Seven: perform prosody prediction with a method based on conditional random fields;

Eight: process the result to obtain the text format required for synthesis.
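Step six merges the outputs of steps three and five into one record per word. A minimal Python sketch of that merge is given below, assuming a layout of one word per line with tab-separated columns, which is a common input layout for CRF toolkits. The source does not specify the actual format, so the column order, the function name `to_prosody_rows`, and the tag names are hypothetical.

```python
def to_prosody_rows(words, pos_tags, syllable_counts):
    """Combine per-word part-of-speech tags and syllable counts into feature rows.

    Assumed layout: word <TAB> part of speech <TAB> syllable count, one word per row.
    """
    if not (len(words) == len(pos_tags) == len(syllable_counts)):
        raise ValueError("all three sequences must have one entry per word")
    return [
        "\t".join([word, tag, str(count)])
        for word, tag, count in zip(words, pos_tags, syllable_counts)
    ]


# Illustrative two-word sentence with made-up tags.
rows = to_prosody_rows(["sain", "bagsi"], ["ADJ", "NOUN"], [1, 2])
sentence_block = "\n".join(rows)
```

At prediction time the conditional random field would assign a prosodic label to each row, and step eight would then convert the labeled rows into the synthesizer's input format.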

The HMM-based Mongolian speech synthesis method of the present invention can freely adjust parameters of the synthesized speech such as duration and pitch period while keeping the synthesized speech natural and fluent. The parameters can be transformed by algorithms such as MLLR and MAP to synthesize voices of different timbres. Unlike waveform concatenation, this transformation does not require building an additional speech database; only a small amount of training data is needed.
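The reason MLLR needs so little data is that it does not re-estimate each Gaussian separately; it estimates a shared affine transform and applies it to the means of many Gaussians, μ' = Aμ + b. The Python sketch below shows only the application of such a transform. Estimating A and b from adaptation data is omitted, and the function name and the numeric values are illustrative assumptions.

```python
def mllr_adapt_mean(mean, a, b):
    """Return the adapted mean a @ mean + b for one Gaussian."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + bias
        for row, bias in zip(a, b)
    ]


# One transform shared by several Gaussians (illustrative numbers).
A = [[2.0, 0.0],
     [0.0, 0.5]]
B = [1.0, -1.0]
original_means = [[1.0, 2.0], [0.0, 4.0]]
adapted_means = [mllr_adapt_mean(mean, A, B) for mean in original_means]
```

Because A and b are tied across Gaussians, even models that never appear in the adaptation data are moved toward the target voice.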

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An HMM-based method of Mongolian speech synthesis, characterized in that the method comprises the following steps:

2. The HMM-based method of Mongolian speech synthesis according to claim 1, characterized in that the method is divided into a training stage and a synthesis stage.

3. The HMM-based method of Mongolian speech synthesis according to claim 1, characterized in that, in step 2, the spectral parameter part is modeled with a continuous probability distribution HMM, and the fundamental frequency part is modeled with a multi-space probability distribution HMM.

4. A front-end processing method for HMM-based Mongolian speech synthesis, characterized in that the method comprises the following steps:

The second step, transliterate the Mongolian text to Latin by rule;

The eighth step, process the result to obtain the text format required for synthesis.

5. The front-end processing method for HMM-based Mongolian speech synthesis according to claim 4, characterized in that, in the first step, the special characters processed in the Mongolian text include digits and punctuation marks.