CN108182938A - Training method of a DNN-based Mongolian acoustic model - Google Patents

Training method of a DNN-based Mongolian acoustic model Download PDF

Info

Publication number
CN108182938A
CN108182938A
Authority
CN
China
Prior art keywords
dnn
mongol
hmm
acoustic
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711390467.1A
Other languages
Chinese (zh)
Other versions
CN108182938B (en)
Inventor
马志强
杨双涛
李图雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN201711390467.1A priority Critical patent/CN108182938B/en
Publication of CN108182938A publication Critical patent/CN108182938A/en
Application granted granted Critical
Publication of CN108182938B publication Critical patent/CN108182938B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/148 Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The present invention provides a training method for a DNN-based Mongolian acoustic model. A deep neural network (DNN) replaces the Gaussian mixture model (GMM) to estimate the posterior probabilities of the Mongolian acoustic states, a DNN-HMM acoustic model is built, and the training method of this model is disclosed. The invention effectively reduces both word and sentence recognition error rates and improves model performance.

Description

Training method of a DNN-based Mongolian acoustic model
Technical field
The invention belongs to the field of Mongolian speech recognition, and in particular relates to a training method for a DNN-based Mongolian acoustic model.
Background technology
A typical large-vocabulary continuous speech recognition (LVCSR) system consists of feature extraction, an acoustic model, a language model, and a decoder; the acoustic model is the core component of the recognition system. The GMM-HMM acoustic model, built from a GMM (Gaussian mixture model) and an HMM (hidden Markov model), was once the most widely used acoustic model in LVCSR systems.
In the GMM-HMM model, the GMM performs probabilistic modeling of the speech feature vectors, and the EM (expectation-maximization) algorithm maximizes the probability of generating the observed speech features; when the number of Gaussian components is large enough, the GMM can adequately fit the probability distribution of the acoustic features, while the HMM generates the temporal state sequence of the speech from the observation states fitted by the GMM. However, when GMM probabilities are used to describe the distribution of speech data, the GMM is essentially a shallow model, and its fit of the acoustic feature distribution relies on an independence assumption between features, so it cannot fully describe the state-space distribution of the acoustic features. Moreover, the feature dimension used in GMM modeling is typically only a few dozen, which cannot fully capture the correlations between acoustic features, so the model's expressive power is limited.
Invention content
Research on building acoustic models from neural networks and HMMs began in the 1980s, but because computing power was insufficient at the time and training data was scarce, such models performed worse than GMM-HMM. In 2010, Li Deng of Microsoft Research Asia and Hinton's group proposed the CD-DBN-HMM (context-dependent deep belief network - HMM) hybrid acoustic model framework for large-scale continuous speech recognition tasks and carried out related experiments. The results showed that, compared with the GMM-HMM acoustic model, the CD-DBN-HMM acoustic model improved the recognition accuracy of the speech recognition system by about 30%; the proposal of the CD-DBN-HMM hybrid framework thoroughly reshaped the original acoustic modeling framework of speech recognition. Compared with the traditional Gaussian mixture model, a deep neural network is a deep model: it can better represent complex nonlinear functions, better capture the correlations between speech feature vectors, and more easily achieve good modeling results. Based on these advances, the present invention proposes the construction and use of a DNN-based Mongolian acoustic model, in order to better accomplish the Mongolian acoustic modeling task.
The technical scheme of the present invention is as follows:
1. model construction:
The DNN deep neural network replaces the GMM Gaussian mixture model to estimate the posterior probabilities of the Mongolian acoustic states. Given a Mongolian acoustic feature sequence, the DNN model first estimates the probability that the current feature belongs to each HMM state; the HMM then describes the dynamic changes of the Mongolian speech signal and captures the temporal state information of the Mongolian speech.
In the Mongolian acoustic model, the training of the DNN is divided into two stages: pre-training and fine-tuning.
The DNN pre-training uses a layer-wise unsupervised training algorithm, which is a generative training algorithm. Layer-wise unsupervised pre-training trains each layer of the DNN separately: only one layer is trained at a time, while the parameters of the other layers keep their original initialization. During training, the error between each layer's input and output is reduced as much as possible, so that each layer's parameters are optimal for that layer. The output of each trained layer is then used as the input data of the next layer; this input has a much smaller error than data passed directly through an untrained multi-layer network, and layer-wise unsupervised pre-training keeps the input-output error between all layers relatively small.
Layer-wise unsupervised pre-training thus yields good initial parameters for the neural network. Mongolian labeled data (i.e., reference states) are then used for supervised fine-tuning with the BP algorithm (error back-propagation), finally producing a DNN deep neural network model usable for acoustic state classification.
2. model uses:
After the DNN has been pre-trained and fine-tuned, the DNN-HMM acoustic model can be used to recognize Mongolian speech data. The specific procedure is as follows:
Step 1: From the input Mongolian acoustic feature vector, compute the outputs of the first L layers of the DNN deep neural network.
Step 2: Use the softmax classification layer (layer L) to compute the posterior probability of the current feature with respect to all acoustic states, i.e., the probability that the current feature belongs to each Mongolian acoustic state.
Step 3: Following Bayes' rule, divide each state's posterior probability by its own prior probability to obtain the scaled likelihood of each state.
Step 4: Decode with the Viterbi algorithm to obtain the optimal path.
The prior probability of a hidden state is obtained by computing the ratio of the total number of frames aligned to that state to the total number of acoustic feature frames.
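The prior computation and the posterior-to-likelihood conversion of Step 3 can be sketched in a few lines of NumPy (function names are illustrative, not from the patent):

```python
import numpy as np

def state_priors(alignments, num_states):
    """Estimate the prior P(state) as the fraction of all training frames
    that the forced alignment assigns to each state."""
    counts = np.zeros(num_states)
    for ali in alignments:                  # each ali: one state id per frame
        for s in ali:
            counts[s] += 1
    return counts / counts.sum()

def scaled_likelihood(posteriors, priors):
    """Step 3: divide each state's DNN posterior P(state|o) by its prior
    P(state); by Bayes' rule this is proportional to P(o|state)."""
    return posteriors / priors

# two toy alignments, 7 frames total: state 0 gets 2, state 1 gets 2, state 2 gets 3
priors = state_priors([[0, 0, 1], [1, 2, 2, 2]], num_states=3)
```

The scaled likelihoods replace the GMM emission probabilities during Viterbi decoding.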
3. Model training process
The labeled data used in the DNN fine-tuning stage are obtained by forced alignment with a GMM-HMM acoustic model, and the model parameters are updated with stochastic gradient descent. Therefore, before training the DNN-HMM, a sufficiently good GMM-HMM Mongolian acoustic model must first be trained; the labeled data required for DNN-HMM Mongolian acoustic model training are then generated from this GMM-HMM model with the Viterbi algorithm.
Since fine-tuning the DNN model requires Mongolian labels aligned to the Mongolian speech frames, and the quality of the labels directly affects the performance of the DNN model, in practice the trained GMM-HMM Mongolian acoustic model is used to force-align the speech features to states. The training process of the DNN-HMM acoustic model is therefore: first train the GMM-HMM Mongolian acoustic model and obtain aligned Mongolian speech feature data; then train and fine-tune the deep neural network (DNN) on the aligned speech feature data; finally, retrain the hidden Markov model (HMM) according to the resulting Mongolian speech observation states.
Description of the drawings
Fig. 1 shows the DNN-HMM Mongolian acoustic model.
Fig. 2 shows the DNN pre-training procedure.
Fig. 3 shows the experimental comparison against the GMM-HMM acoustic model.
Fig. 4 shows the influence of the dropout technique and the number of hidden layers on the over-fitting distance of the DNN-HMM model.
Embodiment
To describe the technical content of the present invention more clearly, it is further described below with reference to specific embodiments.
1. Model construction:
The DNN deep neural network replaces the GMM Gaussian mixture model to estimate the posterior probabilities of the Mongolian acoustic states. Given a Mongolian acoustic feature sequence, the DNN model first estimates the probability that the current feature belongs to each HMM state; the HMM then describes the dynamic changes of the Mongolian speech signal and captures the temporal state information of the Mongolian speech.
The structure of the DNN-HMM Mongolian acoustic model is shown in Fig. 1. In the DNN-HMM Mongolian acoustic model, the DNN is built by stacking hidden layers from bottom to top. Here S denotes the hidden states of the HMM, A denotes the state-transition probability matrix, L denotes the number of layers of the DNN (there are L-1 hidden layers; L0 is the input layer and LL is the output layer, so the DNN has L+1 layers in total), and W denotes the connection matrices between layers. Before the DNN-HMM Mongolian acoustic model can model the Mongolian speech recognition process, the DNN must be trained; once DNN training is complete, the modeling of the Mongolian acoustic model proceeds as in the GMM-HMM model.
In the Mongolian acoustic model, the training of the DNN is divided into two stages: pre-training and fine-tuning.
The DNN pre-training (shown in Fig. 2) uses a layer-wise unsupervised training algorithm, which is a generative training algorithm. Layer-wise unsupervised pre-training trains each layer of the DNN separately: only one layer is trained at a time, while the parameters of the other layers keep their original initialization. During training, the error between each layer's input and output is reduced as much as possible, so that each layer's parameters are optimal for that layer. The output of each trained layer is then used as the input data of the next layer; this input has a much smaller error than data passed directly through an untrained multi-layer network, and layer-wise unsupervised pre-training keeps the input-output error between all layers relatively small. The pre-training algorithm is given as Algorithm 1.
Layer-wise unsupervised pre-training thus yields good initial parameters for the neural network. Mongolian labeled data (i.e., reference states) are then used for supervised fine-tuning with the BP algorithm (error back-propagation), finally producing a DNN deep neural network model usable for acoustic state classification. The supervised fine-tuning algorithm is implemented with stochastic gradient descent and is given as Algorithm 2.
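Algorithms 1 and 2 themselves are not reproduced in this text. As an illustration of the layer-by-layer idea only, here is a minimal autoencoder-style stand-in: each layer is trained in isolation to reconstruct its own input, then frozen, and its output feeds the next layer. This uses a linear reconstruction criterion as an assumption and is not the patent's actual generative pre-training algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(X, hidden_dim, lr=0.01, epochs=200):
    """Train ONE layer (encoder) so that a throwaway decoder can reconstruct
    the layer's input; only this layer's parameters move."""
    n, d = X.shape
    W_enc = rng.normal(0, 0.1, (d, hidden_dim))
    W_dec = rng.normal(0, 0.1, (hidden_dim, d))
    for _ in range(epochs):
        H = X @ W_enc                 # this layer's activations
        R = H @ W_dec                 # reconstruction of the layer input
        E = R - X                     # per-layer input/output error to minimize
        W_dec -= lr * H.T @ E / n
        W_enc -= lr * X.T @ (E @ W_dec.T) / n
    return W_enc, X @ W_enc           # frozen weights + data for the next layer

X = rng.normal(size=(100, 8))         # toy "acoustic features"
stack, inp = [], X
for width in (6, 4):                  # stack hidden layers from bottom to top
    W, inp = pretrain_layer(inp, width)
    stack.append(W)                   # initialization for supervised fine-tuning
```

The resulting `stack` would serve as the initial DNN parameters before the supervised BP fine-tuning of Algorithm 2.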
2. Use of the model:
Step 1: From the input Mongolian acoustic feature vector, compute the outputs of the first L layers of the DNN deep neural network, i.e.:
v^α = f(z^α) = f(W^α v^(α-1) + b^α), 0 < α < L  (1)
where z^α = W^α v^(α-1) + b^α denotes the excitation (pre-activation) vector of layer α, v^α denotes the activation vector, W^α denotes the weight matrix, b^α denotes the bias vector, and N_α denotes the number of neurons in layer α. v^0 denotes the input features of the network; in the DNN-HMM acoustic model, the input features are the acoustic feature vectors, and N_0 = D is the dimension of the input acoustic feature vector. f(·) denotes the activation function applied to the excitation vector.
Step 2: Use the softmax classification layer (layer L) to compute the posterior probability of the current feature with respect to all acoustic states, i.e., the probability that the current feature belongs to each Mongolian acoustic state:
v_i^L = P_dnn(i | O) = softmax_i(z^L) = exp(z_i^L) / Σ_j exp(z_j^L)  (2)
In formula (2), i ∈ {1, 2, ..., C}, where C is the number of hidden states of the acoustic model, z_i^L denotes the input of the i-th unit of the softmax layer, and v_i^L denotes the output of the i-th unit of the softmax classification layer, i.e., the posterior probability of the input acoustic feature vector O with respect to the i-th hidden state of the acoustic model.
Step 3: Following Bayes' rule, divide each state's posterior probability by its own prior probability to obtain the scaled likelihood of each state.
Step 4: Decode with the Viterbi algorithm to obtain the optimal path.
The prior probability of a hidden state is obtained by computing the ratio of the total number of frames aligned to that state to the total number of acoustic feature frames.
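Equations (1) and (2) transcribe directly into NumPy. A minimal sketch of Steps 1 and 2 with random placeholder weights (the one-hidden-layer network and sigmoid activation here are illustrative choices, not the patent's configuration):

```python
import numpy as np

rng = np.random.default_rng(1)

def dnn_posteriors(v0, weights, biases):
    """Forward pass v^a = f(W^a v^(a-1) + b^a) through the hidden layers (eq. 1),
    then a softmax output layer giving P(state | O) (eq. 2)."""
    v = v0
    for W, b in zip(weights[:-1], biases[:-1]):
        v = 1.0 / (1.0 + np.exp(-(W @ v + b)))   # sigmoid activation f(.)
    z = weights[-1] @ v + biases[-1]             # excitation of the softmax layer
    e = np.exp(z - z.max())                      # subtract max for numerical stability
    return e / e.sum()

D, H, C = 39, 16, 27                             # feature dim, hidden width, state count
Ws = [rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (C, H))]
bs = [np.zeros(H), np.zeros(C)]
p = dnn_posteriors(rng.normal(size=D), Ws, bs)   # posteriors over the C acoustic states
```

The vector `p` sums to one over the acoustic states; dividing it elementwise by the state priors gives the scaled likelihoods of Step 3.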
3. Model training process:
Step 1: Train the GMM-HMM Mongolian acoustic model, obtaining an optimal GMM-HMM Mongolian speech recognition system, denoted gmm-hmm.
Step 2: Parse gmm-hmm with the Viterbi decoding algorithm, labeling each senone in the gmm-hmm Mongolian acoustic model to obtain senone_ids.
Step 3: Using the gmm-hmm Mongolian acoustic model, map the triphone acoustic states to the corresponding senone_ids.
Step 4: Initialize the DNN-HMM Mongolian acoustic model from the gmm-hmm Mongolian acoustic model, mainly the HMM (hidden Markov model) parameter part, obtaining the dnn-hmm1 model.
Step 5: Pre-train the DNN deep neural network on the Mongolian acoustic feature files, obtaining ptdnn.
Step 6: Using the gmm-hmm Mongolian acoustic model, force-align the Mongolian acoustic feature data at the state level; the alignment result is align-raw.
Step 7: Convert the physical states in align-raw to senone_ids, obtaining the frame-level aligned training data align-frame.
Step 8: Perform supervised fine-tuning of the ptdnn deep neural network using the alignment data align-frame, obtaining the network model dnn.
Step 9: According to the maximum-likelihood criterion, re-estimate the transition probabilities of the HMM in dnn-hmm1 using dnn; the resulting network model is denoted dnn-hmm2.
Step 10: If the test-set recognition accuracy of dnn and dnn-hmm2 no longer improves, training ends. Otherwise, use dnn-hmm2 to re-align the training data at the state level, then return to Step 7.
In this training process, an optimal GMM-HMM Mongolian speech recognition system is trained first (Step 1) to serve the supervised fine-tuning of the DNN. The GMM-HMM Mongolian acoustic model is trained unsupervised with the expectation-maximization algorithm, avoiding any requirement for labeled data. The deep neural network is then pre-trained on the Mongolian acoustic features (Step 5). In the second stage of deep neural network training (the supervised fine-tuning stage), the trained GMM-HMM Mongolian acoustic model force-aligns the speech features to states (Step 6), producing the labeled data, which are finally used for supervised fine-tuning of the DNN deep neural network (Step 8). After DNN training completes, the next step of the flow (Step 10) is determined by the recognition results of the DNN-HMM on the test set.
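The control flow of Steps 7-10 is an alignment/fine-tune loop that stops when test accuracy plateaus. A toy sketch of just that loop, with the heavy lifting stubbed out as caller-supplied functions (all names hypothetical):

```python
def train_dnn_hmm(realign, fine_tune, evaluate, max_iters=5):
    """Outer loop of Steps 7-10: fine-tune on the current alignments,
    evaluate on the test set, and re-align with the improved model,
    stopping as soon as accuracy no longer improves."""
    best = -1.0
    align = realign(None)                    # initial alignment from gmm-hmm
    for _ in range(max_iters):
        model = fine_tune(align)             # Step 8: supervised BP fine-tuning
        acc = evaluate(model)                # Step 10: test-set accuracy
        if acc <= best:
            break                            # no improvement: training ends
        best = acc
        align = realign(model)               # Step 10 back to Step 7: re-align
    return best

# toy run: accuracy improves 0.70 -> 0.75, then plateaus, so the loop stops
accs = iter([0.70, 0.75, 0.75])
best = train_dnn_hmm(realign=lambda m: "align",
                     fine_tune=lambda a: "dnn",
                     evaluate=lambda m: next(accs))
# best == 0.75
```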
4. Experiments and results:
4.1 To verify the validity of the proposed DNN-HMM Mongolian acoustic model, the following experiments were designed:
(1) Extract MFCC acoustic features and carry out modeling experiments with GMM-HMM and DNN-HMM Mongolian acoustic models, observing the influence of different acoustic modeling units on acoustic model performance and comparing the influence of different acoustic model types on the speech recognition system.
(2) Build DNN-HMM triphone Mongolian acoustic models with deep network structures of different depths, and experimentally study the influence of the number of layers on the Mongolian acoustic model and on over-fitting.
4.2 Experimental setup:
The Mongolian speech recognition corpus consists of 310 Mongolian teaching utterances containing 2291 Mongolian words in total, and is named the IMUT310 corpus. The corpus has three parts: audio files, pronunciation annotations, and the corresponding Mongolian text. In the experiments, the IMUT310 corpus was divided into a training set of 287 sentences and a test set of 23 sentences. The experiments were carried out on the Kaldi platform; the specific Kaldi experimental environment configuration is shown in Table 1.
1 experimental situation of table
In the experiments, the Mongolian acoustic features are represented by 39-dimensional MFCC features, where the first 13 dimensions consist of 12 cepstral coefficients and 1 energy coefficient, and the remaining two 13-dimensional blocks are the first-order and second-order differences of the first 13 dimensions. When extracting the Mongolian MFCC features, the frame window length is 25 ms and the frame shift is 10 ms. Features were extracted separately for the training and test sets; the speech data produced 119960 MFCC feature frames in total, of which 112535 came from the training data and 7425 from the test data. GMM-HMM acoustic model training uses the full 39-dimensional Mongolian MFCC features. The monophone DNN-HMM experiments use 13-dimensional Mongolian MFCC features (without the first- and second-order differences), while the triphone DNN-HMM experiments use the 39-dimensional MFCC features.
During DNN training, feature extraction uses context splicing: 5 frames before and 5 frames after the current frame represent its context. Consequently, the monophone DNN has 143 input nodes (13 × (5 + 1 + 5)) and the triphone DNN has 429 input nodes (39 × (5 + 1 + 5)). The number of DNN output nodes depends on the number of observable Mongolian phones; according to the corpus annotation standard, there are 27 output nodes. The number of nodes per hidden layer is set to 1024, the number of fine-tuning iterations to 60, the initial learning rate to 0.015, and the final learning rate to 0.002.
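The context splicing described above is a simple windowing operation; a sketch (the edge-padding by repeating the first/last frame is an assumption, since the patent does not state how boundaries are handled):

```python
import numpy as np

def splice(frames, left=5, right=5):
    """Concatenate each frame with its `left` preceding and `right` following
    neighbours, padding the edges by repeating the first/last frame."""
    T, D = frames.shape
    padded = np.vstack([np.repeat(frames[:1], left, axis=0),
                        frames,
                        np.repeat(frames[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].ravel() for t in range(T)])

mfcc = np.zeros((100, 39))   # 100 frames of 39-dim MFCCs (triphone setup)
x = splice(mfcc)
# x.shape == (100, 429), matching the 39 * (5 + 1 + 5) input nodes
```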
4.3 experiments and result:
Four acoustic modeling units were tested: monophone GMM-HMM, triphone GMM-HMM, monophone DNN-HMM, and triphone DNN-HMM. The experimental data are given in Table 2, and the comparison is shown in Fig. 3.
2 GMM-HMM of table and DNN-HMM Mongol acoustic model experimental datas
Fig. 3(a) shows that, relative to the monophone GMM-HMM Mongolian acoustic model, the monophone DNN-HMM Mongolian acoustic model reduces the word error rate by 8.84% on the training set and by 11.14% on the test set; for the triphone models, the triphone DNN-HMM Mongolian acoustic model reduces the word error rate by 1.33% on the training set and by 7.5% on the test set relative to the triphone GMM-HMM Mongolian acoustic model. Fig. 3(b) shows that the monophone model reduces the sentence error rate by 32.43% on the training set and by 17.88% on the test set; for the triphone models, the triphone DNN-HMM Mongolian acoustic model reduces the sentence error rate by 19.3% on the training set and by 13.63% on the test set relative to the triphone GMM-HMM Mongolian acoustic model.
The above analysis shows that the monophone DNN-HMM Mongolian acoustic model is clearly better than the monophone GMM-HMM Mongolian acoustic model, and that for the triphone models, the triphone DNN-HMM Mongolian acoustic model achieves a higher recognition rate than the triphone GMM-HMM Mongolian acoustic model.
The DNN-HMM Mongolian acoustic model thus effectively reduces both word and sentence recognition error rates and improves model performance.
To study the influence of the number of hidden layers and of the dropout technique on the DNN-HMM triphone Mongolian acoustic model, experiments were conducted starting from a four-hidden-layer triphone DNN-HMM Mongolian acoustic model with dropout, comparing the number of hidden layers and the use of dropout. The experimental data are given in Table 3.
Dropout is tested on 3 three-tone DNN-HMM acoustic models of table
To quantify the degree of over-fitting, we define the over-fitting distance of a model. In speech recognition, over-fitting is usually judged from the recognition rates on the training and test sets: when the recognition rate is very high on the training set but very low on the test set, the model is severely over-fitted. We use the absolute difference between the model's evaluation metric on the training set and its evaluation metric on the test set to represent the degree of over-fitting, defined as:
Over-fitting distance of a model = |evaluation metric on the training set − evaluation metric on the test set|
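The definition is a one-liner; applied to illustrative numbers (the 90%/70% figures below are hypothetical, not taken from Table 3):

```python
def overfit_distance(train_metric, test_metric):
    """Over-fitting distance: absolute gap between the same evaluation
    metric measured on the training set and on the test set."""
    return abs(train_metric - test_metric)

# e.g. 90% training-set accuracy vs 70% test-set accuracy
d = overfit_distance(90.0, 70.0)
# d == 20.0 percentage points
```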
The dark portion of Fig. 4 shows that, for the DNN-HMM Mongolian acoustic models trained without dropout, increasing the number of hidden layers from 4 to 7 raises the word-level over-fitting distance from 21.17% to 54.81%, and the sentence-level over-fitting distance from 35.32% to 80.72%. Thus, as the number of hidden layers grows, the model's over-fitting distance grows as well; the growing over-fitting distance shows that the Mongolian acoustic model built with the DNN over-fits severely, and the performance of the DNN-HMM becomes worse and worse.
In Fig. 4, comparing the two shades shows that with dropout, as the number of hidden layers increases from 4 to 7, the word-level over-fitting distances are 21.43%, 21.91%, 24.07%, and 25.48%, whereas without dropout they are 21.17%, 21.91%, 42.38%, and 54.81%. The over-fitting distances with dropout are therefore smaller than those without dropout, and the same holds for the sentence-level over-fitting distances. Adding dropout thus effectively mitigates the over-fitting caused by increasing the number of hidden layers, thereby improving the recognition performance of the model.
In this description, the invention has been described with reference to specific embodiments. It is clear, however, that various modifications and alterations can still be made without departing from the spirit and scope of the invention. The description and drawings should therefore be regarded as illustrative rather than restrictive.

Claims (3)

1. A training method for a DNN-based Mongolian acoustic model, characterized in that: a GMM-HMM Mongolian acoustic model is first trained to obtain aligned Mongolian speech feature data; a deep neural network (DNN) is then trained and fine-tuned on the aligned speech feature data; finally, the hidden Markov model (HMM) is retrained according to the resulting Mongolian speech observation states.
2. The training method for a DNN-based Mongolian acoustic model according to claim 1, characterized in that the specific steps of the training method are:
Step 1: Train the GMM-HMM Mongolian acoustic model, obtaining an optimal GMM-HMM Mongolian speech recognition system, denoted gmm-hmm.
Step 2: Parse gmm-hmm with the Viterbi decoding algorithm, labeling each senone in the gmm-hmm Mongolian acoustic model to obtain senone_ids.
Step 3: Using the gmm-hmm Mongolian acoustic model, map the triphone acoustic states to the corresponding senone_ids.
Step 4: Initialize the DNN-HMM Mongolian acoustic model from the gmm-hmm Mongolian acoustic model, mainly the HMM (hidden Markov model) parameter part, obtaining the dnn-hmm1 model.
Step 5: Pre-train the DNN deep neural network on the Mongolian acoustic feature files, obtaining ptdnn.
Step 6: Using the gmm-hmm Mongolian acoustic model, force-align the Mongolian acoustic feature data at the state level; the alignment result is align-raw.
Step 7: Convert the physical states in align-raw to senone_ids, obtaining the frame-level aligned training data align-frame.
Step 8: Perform supervised fine-tuning of the ptdnn deep neural network using the alignment data align-frame, obtaining the network model dnn.
Step 9: According to the maximum-likelihood criterion, re-estimate the transition probabilities of the HMM in dnn-hmm1 using dnn; the resulting network model is denoted dnn-hmm2.
Step 10: If the test-set recognition accuracy of dnn and dnn-hmm2 no longer improves, training ends; otherwise, use dnn-hmm2 to re-align the training data at the state level and return to Step 7.
3. The training method for a DNN-based Mongolian acoustic model according to claim 1 or 2, characterized in that: the dropout technique is added in the training of the DNN-HMM Mongolian acoustic model to avoid over-fitting.
CN201711390467.1A 2017-12-21 2017-12-21 Training method of a DNN-based Mongolian acoustic model Active CN108182938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711390467.1A CN108182938B (en) 2017-12-21 2017-12-21 Training method of a DNN-based Mongolian acoustic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711390467.1A CN108182938B (en) 2017-12-21 2017-12-21 Training method of a DNN-based Mongolian acoustic model

Publications (2)

Publication Number Publication Date
CN108182938A true CN108182938A (en) 2018-06-19
CN108182938B CN108182938B (en) 2019-03-19

Family

ID=62546662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711390467.1A Active CN108182938B (en) 2017-12-21 2017-12-21 Training method of a DNN-based Mongolian acoustic model

Country Status (1)

Country Link
CN (1) CN108182938B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326282A (en) * 2018-10-10 2019-02-12 内蒙古工业大学 A small-scale-corpus DNN-HMM acoustic training structure
CN111696522A (en) * 2020-05-12 2020-09-22 天津大学 Tibetan language voice recognition method based on HMM and DNN

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105960672A (en) * 2014-09-09 2016-09-21 微软技术许可有限责任公司 Variable-component deep neural network for robust speech recognition
CN105957518A (en) * 2016-06-16 2016-09-21 内蒙古大学 Mongolian large vocabulary continuous speech recognition method
CN106205603A (en) * 2016-08-29 2016-12-07 北京语言大学 A tone evaluation method
CN106991999A (en) * 2017-03-29 2017-07-28 北京小米移动软件有限公司 Audio recognition method and device
CN107293288A (en) * 2017-06-09 2017-10-24 清华大学 An acoustic model modeling method based on a residual long short-term memory recurrent neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105960672A (en) * 2014-09-09 2016-09-21 Microsoft Technology Licensing LLC Variable-component deep neural network for robust speech recognition
CN105957518A (en) * 2016-06-16 2016-09-21 Inner Mongolia University Mongolian large-vocabulary continuous speech recognition method
CN106205603A (en) * 2016-08-29 2016-12-07 Beijing Language and Culture University A tone evaluation method
CN106991999A (en) * 2017-03-29 2017-07-28 Beijing Xiaomi Mobile Software Co., Ltd. Speech recognition method and device
CN107293288A (en) * 2017-06-09 2017-10-24 Tsinghua University Acoustic model modeling method based on residual long short-term memory recurrent neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HONGWEI ZHANG ET AL.: "Comparison on Neural Network based acoustic model in Mongolian speech recognition", 2016 International Conference on Asian Language Processing (IALP) *
China Computer Federation: "Report on Advances in Computer Science and Technology (2014-2015 Edition)", 30 April 2016, China Science and Technology Press *
Zhang Hongwei: "Research on Acoustic Models for a Mongolian Speech Recognition System Based on Deep Neural Networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326282A (en) * 2018-10-10 2019-02-12 Inner Mongolia University of Technology A DNN-HMM acoustic training architecture for small-scale corpora
CN111696522A (en) * 2020-05-12 2020-09-22 Tianjin University Tibetan speech recognition method based on HMM and DNN
CN111696522B (en) * 2020-05-12 2024-02-23 Tianjin University Tibetan speech recognition method based on HMM and DNN

Also Published As

Publication number Publication date
CN108182938B (en) 2019-03-19

Similar Documents

Publication Publication Date Title
CN104575490B (en) Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
Shan et al. Component fusion: Learning replaceable language model component for end-to-end speech recognition system
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN108109615A (en) Construction and application method for a DNN-based Mongolian acoustic model
CN108172218B (en) Voice modeling method and device
Agarwalla et al. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech
JP7070894B2 (en) Time series information learning system, method and neural network model
CN104538028A (en) Continuous speech recognition method based on deep long short-term memory recurrent neural networks
CN108962223A (en) Voice gender recognition method, device and medium based on deep learning
CN104217721B (en) Voice conversion method under asymmetric corpus conditions based on speaker model alignment
Hashimoto et al. Trajectory training considering global variance for speech synthesis based on neural networks
CN106340297A (en) Speech recognition method and system based on cloud computing and confidence calculation
CN108364634A (en) Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
KR101664815B1 (en) Method for creating a speech model
Guo et al. Deep neural network based i-vector mapping for speaker verification using short utterances
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN111091809B (en) Regional accent recognition method and device based on deep feature fusion
CN108182938B (en) Training method for a DNN-based Mongolian acoustic model
WO2022148176A1 (en) Method, device, and computer program product for english pronunciation assessment
Shah et al. Unsupervised Vocal Tract Length Warped Posterior Features for Non-Parallel Voice Conversion.
Hai et al. Cross-lingual phone mapping for large vocabulary speech recognition of under-resourced languages
Gómez et al. Improvements on automatic speech segmentation at the phonetic level
Razavi et al. An HMM-based formalism for automatic subword unit derivation and pronunciation generation
CN114333762B (en) Expressiveness-based speech synthesis method, system, electronic device and storage medium
CN107492373B (en) Tone recognition method based on feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant