CN105957518B

CN105957518B - A method for Mongolian large-vocabulary continuous speech recognition

Info

Publication number: CN105957518B
Application number: CN201610440618.9A
Authority: CN
Inventors: 飞龙; 高光来; 张红伟
Original assignee: Inner Mongolia University
Current assignee: Inner Mongolia University
Priority date: 2016-06-16
Filing date: 2016-06-16
Publication date: 2019-05-31
Anticipated expiration: 2036-06-16
Also published as: CN105957518A

Abstract

The invention discloses a method for Mongolian large-vocabulary continuous speech recognition, consisting of a preprocessing stage, a preparation stage, a training stage, a decoding stage, and a synthesis-conversion stage. The preprocessing stage segments the text training corpus and builds the pronunciation dictionary. The preparation stage extracts acoustic features from the input speech signal. The training stage trains the acoustic model with a whole-word pronunciation dictionary and trains the language model on the segmented training text. The decoding stage uses the acoustic model, the language model, and the pronunciation dictionary to recognize the input acoustic features as text. The synthesis-conversion stage uses rules to correct case-suffix errors made during decoding, merges stems with case suffixes, and finally outputs a sentence composed of Mongolian words. The method addresses three problems of prior-art speech recognition systems: they cannot cover the very large Mongolian vocabulary, an oversized vocabulary makes recognition take too long, and the language model suffers from data sparsity.

Description

A method for Mongolian large-vocabulary continuous speech recognition

Technical field

The invention belongs to the technical field of speech recognition and relates to a method for Mongolian large-vocabulary continuous speech recognition.

Background art

Speech recognition is a key technology for spoken communication between humans and machines. It draws on acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in information processing; its central problem is how to convert the received acoustic information into text. Depending on the task, speech recognition can be divided into several types, such as speaker recognition, keyword spotting, and continuous speech recognition. It has been applied successfully in industry, household appliances, communications, automotive electronics, medical care, home services, consumer electronics, and other fields, with very good results.

In research and practical applications, the languages recognized are still mainly the most widely used ones, such as English and Chinese; for languages with a smaller range of use or fewer speakers, speech recognition research is still at an early stage. Mongolian is one such language. Research on Mongolian speech recognition is of great significance for education, transportation, communications, and office automation in China's ethnic minority regions, and it also offers new ideas and methods for speech recognition research on languages of other countries that are likewise agglutinative.

According to "Research on Mongolian Speech Keyword Detection Technology" (Feilong, China Doctoral Dissertations Full-text Database, Information Science and Technology volume, November 2013), building a speech recognition system is divided into three stages, as shown in Fig. 1. The first stage is the preparation stage (or front-end processing stage), whose main role is to extract acoustic features from the input speech signal. The second stage is the training stage, whose main role is to train the acoustic model and the language model used for decoding. The third stage is the decoding stage, in which the acoustic model and language model trained in the second stage are used to recognize the input acoustic features as text.

Acoustic feature extraction processes and compresses the information in the speech signal: the signal is analyzed, the information relevant to speech recognition is retained, and redundant information irrelevant to it is removed. Commonly used acoustic features include linear prediction cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), and filter-bank (Fbank) features. Because the discriminative power and adaptability of these features fall short of what is expected, methods such as linear discriminant analysis (LDA) and feature-space maximum likelihood linear regression (fMLLR) are often applied during training to enhance them.
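As a rough illustration of the first steps of such a front end, the sketch below cuts a signal into overlapping frames and maps frequencies to the mel scale. The 25 ms window, the 10 ms shift, and all function names are assumptions chosen for illustration; the patent does not specify them, and a full MFCC or Fbank pipeline has further steps (windowing, FFT, filter bank, DCT) that are omitted here.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def frame_signal(samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Cut a sample list into overlapping frames.

    frame_ms and shift_ms are assumed, conventional values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, shift)]


def log_energy(frame):
    """Log energy of one frame; the small constant avoids log(0)."""
    return math.log(sum(x * x for x in frame) + 1e-10)


# One second of a synthetic 440 Hz tone sampled at 16 kHz.
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = frame_signal(signal, 16000)
energies = [log_energy(f) for f in frames]
```

Each frame would then be turned into one feature vector, so the recognizer sees a sequence of vectors rather than raw samples.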

During training, a GMM-HMM (Gaussian mixture model-hidden Markov model) is usually trained first; a DNN (deep neural network) is then trained to replace the GMM, yielding a DNN-HMM (deep neural network-hidden Markov model). For the language model, an N-gram language model or an RNN-based language model is generally trained.

For decoding, the acoustic model, language model, and pronunciation dictionary are composed into a recognition network. The network is a directed acyclic graph; the Viterbi algorithm finds its optimal path (the path with the highest probability), and this path is the best text that the recognition system produces for the speech signal. In practice, the language model is usually given a weight and a word-length penalty score is set, in order to find the best balance between the language model and the acoustic model.
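The best-path search described above can be sketched as Viterbi-style dynamic programming over a toy word lattice. This is a minimal sketch under stated assumptions, not the decoder of the invention: the edge format, the scores, the units "tara" and "-i", and the names `best_path`, `lm_weight`, and `word_penalty` are all hypothetical, and node numbers are assumed to be in topological order (every edge goes from a smaller to a larger node number).

```python
import math


def best_path(edges, n_nodes, lm_weight=1.0, word_penalty=0.0):
    """Return (words, score) of the highest-scoring path from node 0 to the last node.

    Each edge is (src, dst, word, acoustic_logp, lm_logp) with src < dst.
    """
    best = [-math.inf] * n_nodes
    back = [None] * n_nodes
    best[0] = 0.0
    # Sorting by source node visits edges in topological order.
    for src, dst, word, am, lm in sorted(edges, key=lambda e: e[0]):
        if best[src] == -math.inf:
            continue  # source node is unreachable
        score = best[src] + am + lm_weight * lm + word_penalty
        if score > best[dst]:
            best[dst] = score
            back[dst] = (src, word)
    # Trace back from the final node to recover the word sequence.
    node, words = n_nodes - 1, []
    while node != 0:
        if back[node] is None:
            raise ValueError("final node is not reachable from node 0")
        node, word = back[node]
        words.append(word)
    return words[::-1], best[n_nodes - 1]


# Toy lattice with invented log-scores.
EDGES = [
    (0, 1, "tere", -1.0, -0.5),
    (0, 1, "tara", -1.2, -2.0),
    (1, 2, "-yi", -0.8, -0.3),
    (1, 2, "-i", -0.7, -1.5),
]

with_lm = best_path(EDGES, 3, lm_weight=1.0)
without_lm = best_path(EDGES, 3, lm_weight=0.0)
```

In this toy lattice, the language model weight decides between the two suffix hypotheses: with the weight at 1.0 the path ends in "-yi", and with the weight at 0.0 the slightly better acoustic score of "-i" wins.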

Mongolian contains more than a million words, and new vocabulary is introduced continually. In a real environment, it is impossible to include all Mongolian words in the pronunciation dictionary, and the collected text corpus cannot cover all Mongolian words either; many words are missing or rare, which causes data sparsity when the language model is trained. Moreover, as the number of words in the pronunciation dictionary grows, the computation required during recognition increases, and recognition time stretches to a degree users cannot tolerate.

Summary of the invention

To address the above problems, the invention provides a method for Mongolian large-vocabulary continuous speech recognition. It targets three shortcomings of the prior art: speech recognition systems cannot cover the very large Mongolian vocabulary, an oversized vocabulary makes recognition take too long, and the language model in the system suffers from data sparsity.

The technical solution adopted by the invention is a method for Mongolian large-vocabulary continuous speech recognition consisting of a preprocessing stage, a preparation stage, a training stage, a decoding stage, and a synthesis-conversion stage.

The preprocessing stage segments the words in the language model training text into non-verb stems, case suffixes, and verbs, and builds a pronunciation dictionary based on non-verb stems, case suffixes, and verbs.

The preparation stage extracts acoustic features from the input speech signal.

The training stage builds the acoustic model using the pronunciation dictionary based on Mongolian whole words, and builds the language model using the pronunciation dictionary based on non-verb stems, case suffixes, and verbs.

The decoding stage builds a recognition network from the acoustic model, the language model, and the pronunciation dictionary based on non-verb stems, case suffixes, and verbs, and recognizes the input acoustic features as text.

The synthesis-conversion stage uses rules to correct case-suffix errors made during decoding, merges stems with case suffixes, and finally outputs a sentence composed of Mongolian words.

The invention is further characterized as follows. The preprocessing stage is carried out according to the following steps: before model training, the Mongolian words in the language model training text are converted to their Latin form; the converted words are then segmented into the corresponding non-verb stems, case suffixes, and verbs, and these are stored in the pronunciation dictionary based on non-verb stems, case suffixes, and verbs.

Further, the pronunciation dictionaries are used according to the following steps: two pronunciation dictionaries are built. One stores Mongolian whole words and their pronunciations and is used to train the acoustic model. The other stores non-verb stems, case suffixes, and verbs together with their pronunciations; when this dictionary is built, all possible pronunciations of each case suffix are added to it, and it is used for decoding with the acoustic model.

Further, the synthesis-conversion stage is carried out according to the following steps:

Step 1: use rules to correct case-suffix errors in the decoded text.

Step 2: merge stems and case suffixes into the corresponding Latin-form words, and at the same time use a conditional random field model to predict punctuation for the recognized sentence and add the predicted punctuation to it.

Step 3: using the correspondence between Latin-form words and Mongolian words, convert the merged Latin-form words into actual Mongolian words; the sentence composed of Mongolian words is the final output.

The invention has the following advantages:

(1) By recognizing non-verb stems, case suffixes, and verbs, the Mongolian speech recognition system based on these units can recognize the majority of Mongolian words.

(2) The Mongolian speech recognition system based on non-verb stems, case suffixes, and verbs reduces the number of entries in the pronunciation dictionary, which greatly reduces the computation needed for recognition and keeps recognition time within an acceptable range.

(3) The Mongolian speech recognition system based on non-verb stems, case suffixes, and verbs addresses the data sparsity of the language model, so that system performance improves greatly.

Brief description of the drawings

Fig. 1 is a framework diagram of a prior-art speech recognition system.

Fig. 2 is a schematic diagram of the agglutinative word-formation pattern of Mongolian.

Fig. 3 is a framework diagram of the speech recognition system of the invention.

Fig. 4 shows an example of segmenting a Mongolian sentence in the preprocessing stage of the invention.

Fig. 5 is a comparison table of part of the contents of the two pronunciation dictionaries of the invention.

Fig. 6 shows the selection rules used by the invention for rule-based correction of some ending suffixes.

Fig. 7 shows an example of the synthesis-conversion stage of the invention.

Detailed description of the embodiments

Principle of segmentation-based Mongolian recognition:

Mongolian is a typical agglutinative language: words are formed mainly by concatenating roots and affixes, as shown in Fig. 2. In this concatenation, attaching derivational and inflectional (form-building) suffixes to a root modifies the actual meaning, whereas the ending suffix attached afterwards carries only grammatical meaning and always sits at the end of the word. Ending suffixes do not belong to the stem suffixes; they include nominal case suffixes, possessive suffixes, finite-verb (tense, person) suffixes, and converb suffixes. For participle suffixes, the suffix can be regarded as an ending suffix when the participle serves as the predicate of the main clause, but as a stem suffix when the participle is used as a nominal (especially when a case suffix is attached after it). In general, derivational suffixes come first, inflectional suffixes next, and the ending suffix last. A word may contain more than one derivational or inflectional suffix, but generally only one ending suffix (when a reflexive-possessive suffix is attached, a Mongolian word can have two ending suffixes).

The root, derivational suffixes, and inflectional suffixes together make up the stem. Taking stems and ending suffixes as the basis of Mongolian word formation, different stems and different ending suffixes combine into the majority of Mongolian words. In a speech recognition system, training and recognizing words can therefore be converted into training and recognizing stems and ending suffixes. However, a scheme based simply on stems and ending suffixes has the following problems. First, when a Mongolian verb is split into stem and ending suffix, vowels may be dropped or inserted, so segmentation accuracy is hard to guarantee. Second, regarding pronunciation, when a verb stem takes different verb ending suffixes, the pronunciation of both the stem and the suffix can undergo vowel and consonant changes, insertions, and deletions. Adding the pronunciations of all verb stems and verb ending suffixes to the pronunciation dictionary is therefore infeasible, which poses a major challenge for building the dictionary.

By contrast, the ending suffixes attached to non-verb stems are case suffixes, whose pronunciation is relatively independent of the stem. Attaching different case suffixes does not affect the pronunciation of the stem, so the pronunciation of non-verb stems is stable, and only the different pronunciations of the case suffixes need to be added to the pronunciation dictionary.

We therefore separate verbs out and use non-verb stems, verbs, and case suffixes together as the recognition units; accordingly, the recognition system is referred to in this text as a speech recognition system based on non-verb stems, case suffixes, and verbs.

Building the Mongolian speech recognition system based on stems and ending suffixes:

The Mongolian speech recognition system based on non-verb stems, case suffixes, and verbs consists of a preprocessing stage, a preparation stage, a training stage, a decoding stage, and a synthesis-conversion stage. The preprocessing stage converts the speech transcription text and the language model training text to Latin form, segments the Mongolian words in the converted language model training text, and builds the pronunciation dictionary based on non-verb stems, case suffixes, and verbs. The preparation stage extracts acoustic features from the input speech signal. The training stage trains the acoustic model with the whole-word pronunciation dictionary and trains the language model on the segmented training text. The decoding stage uses the acoustic model, the language model, and the pronunciation dictionary based on non-verb stems, case suffixes, and verbs to recognize the input acoustic features as text. The preparation, training, and decoding stages are language-independent; the invention mainly adapts the pronunciation dictionary and adds the preprocessing and synthesis-conversion stages. Mongolian letters take different shapes at different positions in a word, and some letters share the same shape but have different sounds, which makes it harder to study recognition performance when building a Mongolian speech recognition system. For this reason, in the preprocessing stage the Mongolian words in the pronunciation dictionary, in the transcriptions of the speech corpus, and in the language model training text are all transcribed into Latin form, and the added synthesis-conversion stage produces the actual Mongolian sentence. The framework is shown in Fig. 3.

Preprocessing for language model training:

For the language model training set, the words must be segmented into the corresponding non-verb stems, case suffixes, and verbs. In written Mongolian, a case suffix is written after the Mongolian narrow no-break space. This space is one third the width of a full-width character, slightly narrower than an ordinary space, and is represented by "-" in the Latin form. As shown in Fig. 4, in the language model training corpus converted to Latin form, the "-" character makes it easy to split non-verb stems from case suffixes; the segmented training text is then used to train the language model.
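The segmentation step described above can be sketched as a split on the "-" marker. This is a minimal sketch, assuming the text is already in Latin form and that each case suffix keeps its leading "-" as a separate unit; the function names are hypothetical, and the sample sentence simply strings together words quoted in this document and is not meant to be meaningful Mongolian.

```python
def segment_word(word):
    """Split one Latin-form word into its stem and "-"-prefixed case suffixes."""
    parts = word.split("-")
    units = [parts[0]] if parts[0] else []
    # Keep the "-" on each suffix so it stays distinguishable from a stem.
    units += ["-" + p for p in parts[1:] if p]
    return units


def segment_sentence(sentence):
    """Segment every whitespace-separated word of a Latin-form sentence."""
    units = []
    for word in sentence.split():
        units.extend(segment_word(word))
    return units


segmented = segment_sentence("tere-yi sagvjv tarihi-ban")
```

Verbs and bare stems such as "sagvjv" contain no "-" and pass through unchanged, which matches the choice of units in the text.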

Training the language model on the segmented text lets it match the pronunciation dictionary of non-verb stems, case suffixes, and verbs well during decoding. The decoding result therefore takes the form of non-verb stems, case suffixes, and verbs. Non-verb stems and case suffixes can be combined into a very large number of Mongolian words, while the total number of non-verb stems, case suffixes, and common verbs stays within a few tens of thousands. This addresses both the data sparsity of the language model during training and the problem of recognizing the very large Mongolian vocabulary.
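The vocabulary argument can be made concrete with a toy count. The stems and suffixes below are taken from examples in this document, but pairing every stem with every suffix is an assumption made purely for counting; real Mongolian restricts which combinations are valid.

```python
from itertools import product

stems = ["tere", "tarihi", "elqin"]
case_suffixes = ["-yi", "-ban", "-dv"]

# Whole-word vocabulary: every bare stem plus every stem+suffix combination.
whole_words = set(stems) | {s + c for s, c in product(stems, case_suffixes)}

# Unit vocabulary: stems and suffixes are stored once each.
units = set(stems) | set(case_suffixes)

whole_word_count = len(whole_words)  # grows like stems * (suffixes + 1)
unit_count = len(units)              # grows like stems + suffixes
```

With 3 stems and 3 suffixes the whole-word vocabulary already has 12 entries against 6 units; the gap widens multiplicatively as either list grows, which is why each unit is seen far more often in the training text than each whole word.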

Changes to and use of the pronunciation dictionaries:

Unlike earlier Mongolian speech recognition systems, the invention uses two pronunciation dictionaries. One is the traditional dictionary that stores Mongolian whole words and their pronunciations; the other stores non-verb stems, case suffixes, and verbs together with their pronunciations, and every pronunciation variant of a case suffix must be listed in it. Fig. 5 compares part of the contents of the two dictionaries. A word stored in the whole-word dictionary appears in the dictionary based on non-verb stems, case suffixes, and verbs in one of two ways. The first is unchanged: verbs, and words of other parts of speech whose whole word is just a stem, are represented identically in both dictionaries. In Fig. 5, "sagvjv" and "qasidahv" are verbs and "elqin" consists of a stem only, so their stored forms are the same in both dictionaries. The second applies to non-verb words made up of a stem and a case suffix: in the dictionary based on non-verb stems, case suffixes, and verbs, such a word is split and the stem and case suffix are stored separately. In Fig. 5, "tarihi-ban" and "tere-yi" are each composed of a stem and a case suffix, so the stems "tarihi" and "tere" and the case suffixes "-ban" and "-yi" are stored separately. The whole-word pronunciation dictionary is used when training the acoustic model, so that acoustic model training represents the phonemes of each training sentence more accurately. Otherwise, because a case suffix has several pronunciations and the first one would be chosen by default for the training sentence, many training sentences would be converted to the wrong phonemes. The pronunciation dictionary based on non-verb stems, case suffixes, and verbs is used during decoding.
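The two dictionaries can be pictured as two mappings from entries to lists of pronunciation variants. The sketch below is illustrative only: the "phones" are placeholders (one symbol per Latin letter), the second variants of "-ban" and "-yi" are invented to show the multi-pronunciation case, and all names are hypothetical. The contents of Fig. 5 are not reproduced.

```python
from itertools import product


def placeholder_phones(latin):
    """Stand-in pronunciation: one symbol per Latin letter (NOT real phonemes)."""
    return [ch for ch in latin if ch != "-"]


# Dictionary 1: whole words, used for acoustic model training.
WHOLE_WORD_LEXICON = {
    "sagvjv": [placeholder_phones("sagvjv")],
    "elqin": [placeholder_phones("elqin")],
    "tarihi-ban": [placeholder_phones("tarihi-ban")],
    "tere-yi": [placeholder_phones("tere-yi")],
}

# Dictionary 2: non-verb stems, case suffixes, and verbs, used for decoding.
UNIT_LEXICON = {
    "sagvjv": [placeholder_phones("sagvjv")],  # verb: stored unchanged
    "elqin": [placeholder_phones("elqin")],    # stem-only word: unchanged
    "tarihi": [placeholder_phones("tarihi")],
    "tere": [placeholder_phones("tere")],
    "-ban": [placeholder_phones("-ban"), ["b", "e", "n"]],  # 2nd variant invented
    "-yi": [placeholder_phones("-yi"), ["i"]],              # 2nd variant invented
}


def unit_pronunciations(units, lexicon=UNIT_LEXICON):
    """All phone sequences obtained by picking one variant per unit."""
    choices = [lexicon[u] for u in units]
    return [sum(combo, []) for combo in product(*choices)]


variants = unit_pronunciations(["tarihi", "-ban"])
```

Here `variants` holds two candidate pronunciations for "tarihi-ban", while the whole-word dictionary pins down the single one that applies, which is the reason given in the text for training the acoustic model on whole words.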

Decoding with the pronunciation dictionary based on non-verb stems, case suffixes, and verbs performs as well on in-vocabulary words as the whole-word dictionary, and it pairs better with the language model trained on segmented text. Using non-verb stems, case suffixes, and verbs as units makes it possible to recognize the very large Mongolian vocabulary; it also reduces the number of entries in the pronunciation dictionary, shortens recognition time, and thus addresses the excessive recognition time of existing Mongolian speech recognition.

Synthesis-conversion stage:

During experiments we found that some of the errors in the decoded results follow general patterns, concentrated mainly in the decoding of Mongolian case suffixes. These errors can therefore be corrected with a number of Mongolian grammatical rules. Fig. 6 shows how the choice among the case suffixes "-dv", "-du", "-tv", and "-tu" is decided. When the stem is a masculine word: if the stem does not end in a vowel or in "n", "N", "l", or "m", the case suffix "-tv" is chosen; if the stem ends in a vowel or "n", "-dv" is chosen. Conversely, when the stem is not a masculine word: if the stem does not end in a vowel or in "n", "N", "l", or "m", the case suffix "-tu" is chosen; if the stem ends in a vowel or in "n", "N", "l", or "m", "-du" is chosen.
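The selection rule above can be written as a small decision function. This is a sketch under several assumptions that are not in the source: the vowel inventory of the Latin transliteration is guessed as "aeiouv", whether a stem is a masculine word is passed in by the caller rather than inferred, and the masculine branch is treated as the exact complement of the "-tv" condition (so stems ending in "N", "l", or "m" also take "-dv"). The stems "gar" and "gal" in the usage lines are made-up test strings.

```python
# Assumed vowel letters of the Latin transliteration (hypothetical).
VOWELS = set("aeiouv")
# Stem-final consonants named in the rule.
NAMED_FINALS = {"n", "N", "l", "m"}
# The four suffixes the rule chooses among.
RULE_SUFFIXES = {"-dv", "-du", "-tv", "-tu"}


def choose_suffix(stem, is_masculine):
    """Pick among -dv/-du/-tv/-tu from the stem ending and word gender."""
    if not stem:
        raise ValueError("stem must not be empty")
    last = stem[-1]
    ends_in_vowel_or_named = last in VOWELS or last in NAMED_FINALS
    if is_masculine:
        return "-dv" if ends_in_vowel_or_named else "-tv"
    return "-du" if ends_in_vowel_or_named else "-tu"


def correct_suffix(stem, decoded_suffix, is_masculine):
    """Replace a decoded suffix only if it is one of the four covered by the rule."""
    if decoded_suffix in RULE_SUFFIXES:
        return choose_suffix(stem, is_masculine)
    return decoded_suffix


fixed = correct_suffix("tere", "-tv", is_masculine=False)
untouched = correct_suffix("tere", "-yi", is_masculine=False)
```

Suffixes outside the four covered by the rule are passed through unchanged, so the correction can be applied to every stem and suffix pair in the decoded output.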

In the synthesis-conversion stage, therefore, case-suffix errors from decoding are first corrected by rule; stems and case suffixes are then merged into the corresponding Latin-form words, and a conditional random field is used to segment the recognized Mongolian text into sentences and add punctuation. Finally, using the correspondence between Latin-form words and Mongolian words, the result is converted into actual Mongolian words; the sentence composed of Mongolian words is the final output.
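The merge and conversion steps can be sketched as follows. This is a minimal sketch with hypothetical names: suffix units are assumed to start with "-", the Latin-to-Mongolian table is supplied by the caller with placeholder values here (the real correspondence table is not given in this text), and punctuation prediction with a conditional random field is left out.

```python
def merge_units(units):
    """Attach each "-"-prefixed case suffix to the word before it."""
    words = []
    for unit in units:
        if unit.startswith("-") and words:
            words[-1] += unit
        else:
            words.append(unit)
    return words


def to_mongolian(latin_words, table):
    """Look up each Latin-form word; unknown words are left in Latin form."""
    return [table.get(word, word) for word in latin_words]


decoded_units = ["tere", "-yi", "sagvjv", "tarihi", "-ban"]
latin_words = merge_units(decoded_units)

# Placeholder table: the values stand in for Mongolian-script strings.
PLACEHOLDER_TABLE = {"tere-yi": "<MN:tere-yi>", "sagvjv": "<MN:sagvjv>"}
output_words = to_mongolian(latin_words, PLACEHOLDER_TABLE)
```

Leaving unknown words in Latin form is a design choice of this sketch so that a gap in the table stays visible in the output instead of silently dropping a word.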

By correcting misrecognized case suffixes with Mongolian grammatical rules, the synthesis-conversion stage can further improve recognition accuracy, and it lets the recognition result be displayed in Mongolian script. This partly addresses the inability of the acoustic model and language model to fully distinguish similar case suffixes, and it also solves the problem of displaying Mongolian. Fig. 7 gives a complete example of the synthesis-conversion stage. The first sentence in the figure is the initial recognition result. The second sentence is the result after rule-based correction; the case suffixes in bold are the correct ones obtained by the rules. The third sentence is the result after merging and punctuation prediction. The fourth sentence is the result after conversion to Mongolian script.

Claims

1. A method for Mongolian large-vocabulary continuous speech recognition, characterized in that it consists of a preprocessing stage, a preparation stage, a training stage, a decoding stage, and a synthesis-conversion stage;

the preprocessing stage segments the words in the language model training text into non-verb stems, case suffixes, and verbs, and builds a pronunciation dictionary based on non-verb stems, case suffixes, and verbs;

the preparation stage extracts acoustic features from the input speech signal;

the training stage builds the acoustic model using the pronunciation dictionary based on Mongolian whole words, and builds the language model using the pronunciation dictionary based on non-verb stems, case suffixes, and verbs;

the decoding stage builds a recognition network from the acoustic model, the language model, and the pronunciation dictionary based on non-verb stems, case suffixes, and verbs, and recognizes the input acoustic features as text.

2. The method for Mongolian large-vocabulary continuous speech recognition according to claim 1, characterized in that the preprocessing stage is carried out according to the following steps: before model training, the Mongolian words in the language model training text are converted to their Latin form; the converted words are then segmented into the corresponding non-verb stems, case suffixes, and verbs, and these are stored in the pronunciation dictionary based on non-verb stems, case suffixes, and verbs.

3. The method for Mongolian large-vocabulary continuous speech recognition according to claim 1, characterized in that the pronunciation dictionaries are used according to the following steps: two pronunciation dictionaries are built; one stores Mongolian whole words and their pronunciations and is used to train the acoustic model; the other stores non-verb stems, case suffixes, and verbs together with their pronunciations, all possible pronunciations of each case suffix being added to it when it is built, and is used for decoding with the acoustic model.

4. The method for Mongolian large-vocabulary continuous speech recognition according to claim 1, characterized in that the synthesis-conversion stage is carried out according to the following steps:

Step 2: merge stems and case suffixes into the corresponding Latin-form words, and at the same time use a conditional random field model to predict punctuation for the recognized sentence and add the predicted punctuation to it;

Step 3: using the correspondence between Latin-form words and Mongolian words, convert the merged Latin-form words into actual Mongolian words; the sentence composed of Mongolian words is the final output.