CN108182938A - Training method for a Mongolian acoustic model based on DNN - Google Patents
Training method for a Mongolian acoustic model based on DNN
- Publication number
- CN108182938A CN108182938A CN201711390467.1A CN201711390467A CN108182938A CN 108182938 A CN108182938 A CN 108182938A CN 201711390467 A CN201711390467 A CN 201711390467A CN 108182938 A CN108182938 A CN 108182938A
- Authority
- CN
- China
- Prior art keywords
- dnn
- mongol
- hmm
- acoustic
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/148—Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Abstract
The present invention provides a training method for a Mongolian acoustic model based on a deep neural network (DNN). The Gaussian mixture model (GMM) is replaced with a DNN, which estimates the posterior probabilities of the Mongolian acoustic states, and a DNN-HMM acoustic model is constructed; the training method of this model is disclosed. The invention effectively reduces both the word error rate and the sentence error rate, improving model performance.
Description
Technical field
The invention belongs to the field of Mongolian speech recognition, and in particular relates to a training method for a Mongolian acoustic model based on DNN.
Background technology
A typical Large Vocabulary Continuous Speech Recognition (LVCSR) system consists of feature extraction, an acoustic model, a language model, and a decoder. The acoustic model is the core component of the speech recognition system. The GMM-HMM acoustic model, built from a GMM (Gaussian mixture model) and an HMM (hidden Markov model), was once the most widely used acoustic model in LVCSR systems.
In a GMM-HMM model, the GMM performs probabilistic modeling of the speech feature vectors, and the EM (expectation-maximization) algorithm maximizes the probability of generating the observed speech features. When the number of Gaussian mixture components is large enough, the GMM can adequately fit the probability distribution of the acoustic features; the HMM then models the temporal state sequence of the speech from the observation states fitted by the GMM.
However, when a GMM is used to describe the distribution of the speech data, it is essentially a shallow model, and fitting the acoustic feature distribution relies on an independence assumption between features, so the state-space distribution of the acoustic features cannot be fully described. Moreover, the feature dimensionality used in GMM modeling is usually only a few dozen, which cannot fully capture the correlations between acoustic features; the expressive power of the model is therefore limited.
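The GMM observation model described above can be illustrated numerically. The following is a minimal sketch (not the patent's implementation; all values are made up) of evaluating the log-likelihood of acoustic feature frames under a diagonal-covariance GMM, the quantity that EM training maximizes:

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Log-likelihood of each frame under a diagonal-covariance GMM.

    frames:    (T, D) acoustic feature frames
    weights:   (M,)   mixture weights, summing to 1
    means:     (M, D) component means
    variances: (M, D) diagonal covariances
    """
    T, D = frames.shape
    diff = frames[:, None, :] - means[None, :, :]            # (T, M, D)
    # log N(o | mu_m, Sigma_m) for every frame/component pair, shape (T, M)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_comp = log_norm[None, :] - 0.5 * (diff**2 / variances[None, :, :]).sum(axis=2)
    # log sum_m w_m N(o | ...) via log-sum-exp for numerical stability
    log_wcomp = np.log(weights)[None, :] + log_comp
    mx = log_wcomp.max(axis=1, keepdims=True)
    return (mx + np.log(np.exp(log_wcomp - mx).sum(axis=1, keepdims=True))).ravel()

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 3))
ll = gmm_log_likelihood(frames,
                        weights=np.array([0.6, 0.4]),
                        means=np.zeros((2, 3)),
                        variances=np.ones((2, 3)))
```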
Summary of the invention
Research on building acoustic models from neural networks combined with HMMs began in the 1980s; however, owing to the insufficient computing power and lack of adequate training data at the time, such models performed worse than GMM-HMM. In 2010, Li Deng of Microsoft Research Asia and Hinton's group proposed the CD-DBN-HMM hybrid acoustic model framework (CD-DBN: context-dependent deep belief network) for large-scale continuous speech recognition tasks and carried out related experiments. The results showed that, compared with GMM-HMM acoustic models, CD-DBN-HMM acoustic models improved recognition accuracy by about 30%; the proposal of the CD-DBN-HMM framework thoroughly reformed the original acoustic modeling framework of speech recognition. Compared with the traditional Gaussian mixture model, a deep neural network is a deep model: it can better represent complex nonlinear functions and better capture the correlations between speech feature vectors, and thus more easily achieves a good modeling effect. Based on the above results, the present invention proposes the construction and application of a Mongolian acoustic model based on a DNN, in order to better accomplish the Mongolian acoustic modeling task.
The technical scheme of the present invention is as follows:
1. Model construction:
The GMM (Gaussian mixture model) is replaced with a DNN (deep neural network), which estimates the posterior probabilities of the Mongolian acoustic states. Given a Mongolian acoustic feature sequence, the DNN first estimates the probability that the current feature belongs to each HMM state; the HMM then describes the dynamic evolution of the Mongolian speech signal, capturing the temporal state information of the Mongolian speech.
In the Mongolian acoustic model, the training of the DNN is divided into two stages: pre-training and fine-tuning.
The pre-training of the DNN employs a layer-wise unsupervised training algorithm, which is a generative training algorithm. Layer-wise unsupervised pre-training trains each layer of the DNN separately: only one layer is trained at a time, while the parameters of the other layers are held at their initial values. During training, the error between the input and output of each layer is reduced as far as possible, ensuring that the parameters of each layer are optimal for that layer. Next, the output of the trained layer is used as the input to the next layer; this input therefore carries much less error than data passed directly through an untrained multilayer network. Layer-wise unsupervised pre-training thus keeps the input-output error between every pair of adjacent layers relatively small.
Layer-wise unsupervised pre-training yields good initial parameters for the neural network. Supervised fine-tuning is then carried out with the Mongolian labeled data (i.e., the state labels) using the BP (error back-propagation) algorithm, finally producing a DNN model that can be used for acoustic state classification.
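The layer-wise procedure above can be sketched in code. This is a minimal illustration only, not the patent's Algorithm 1: each layer is trained as a tied-weight autoencoder on the (frozen) output of the layer below, and all sizes and data are made up:

```python
import numpy as np

def train_layer_as_autoencoder(x, hidden, epochs=200, lr=0.1, seed=0):
    """Train one layer to reconstruct its own input (tied weights), as in
    layer-wise unsupervised pre-training: only this layer's parameters
    move; earlier layers are already frozen."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(x.shape[1], hidden))
    b = np.zeros(hidden)
    c = np.zeros(x.shape[1])
    for _ in range(epochs):
        h = np.tanh(x @ W + b)                 # encode
        r = h @ W.T + c                        # decode with tied weights
        err = r - x                            # reconstruction error to shrink
        dh = (err @ W) * (1 - h**2)            # back-prop through tanh
        gW = x.T @ dh + err.T @ h              # W is used in encode and decode
        W -= lr * gW / len(x)
        b -= lr * dh.mean(axis=0)
        c -= lr * err.mean(axis=0)
    mse = np.mean((np.tanh(x @ W + b) @ W.T + c - x) ** 2)
    return W, b, mse

rng = np.random.default_rng(1)
feats = rng.normal(size=(64, 13))              # stand-in acoustic features
params = []
layer_in = feats
for width in (8, 8):                           # pre-train two hidden layers in turn
    W, b, mse = train_layer_as_autoencoder(layer_in, width)
    params.append((W, b))
    layer_in = np.tanh(layer_in @ W + b)       # frozen output feeds the next layer
```

After this loop, `params` holds the initial weights that supervised BP fine-tuning would start from.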
2. Model use:
After the DNN has been pre-trained and fine-tuned, the DNN-HMM acoustic model can be used to recognize Mongolian speech data. The specific procedure is as follows:
Step 1: Given the input Mongolian acoustic feature vector, compute the outputs of the first L layers of the DNN.
Step 2: Use the softmax classification layer (layer L) to compute the posterior probability of the current feature over all acoustic states, i.e., the probability that the current feature belongs to each Mongolian acoustic state.
Step 3: By Bayes' rule, divide the posterior probability of each state by its prior probability to obtain the scaled likelihood of each state.
Step 4: Decode with the Viterbi algorithm to obtain the optimal path.
The prior probability of each hidden state is obtained simply by dividing the total number of frames aligned to that state by the total number of acoustic feature frames.
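Steps 2 through 4 can be sketched as follows. The DNN forward pass is omitted and the posteriors, priors, and transition matrix are toy values (assumptions for illustration only); the sketch shows how posteriors are turned into scaled log-likelihoods and decoded with Viterbi:

```python
import numpy as np

def viterbi(log_likes, log_trans, log_init):
    """Most likely state path given per-frame scaled log-likelihoods
    (step 4). log_likes: (T, S); log_trans: (S, S); log_init: (S,)."""
    T, S = log_likes.shape
    delta = log_init + log_likes[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_likes[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Steps 2-3: DNN posteriors divided by state priors give scaled likelihoods.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.2, 0.7]])           # toy softmax outputs, (T=4, S=3)
priors = np.array([0.5, 0.3, 0.2])                 # frame-count ratios per state
scaled = np.log(posteriors) - np.log(priors)       # log p(o|s) up to a constant
trans = np.log(np.array([[0.6, 0.3, 0.1],          # toy transition matrix
                         [0.1, 0.6, 0.3],
                         [0.1, 0.1, 0.8]]))
init = np.log(np.array([0.8, 0.1, 0.1]))
best_path = viterbi(scaled, trans, init)
```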
3. Model training process
The labeled data used in the DNN fine-tuning stage is obtained by forced alignment with a GMM-HMM acoustic model, and the model parameters are updated by stochastic gradient descent. Therefore, before training the DNN-HMM, a sufficiently good GMM-HMM Mongolian acoustic model must first be trained; the labeled data required for DNN-HMM Mongolian acoustic model training is then generated from the GMM-HMM Mongolian acoustic model by the Viterbi algorithm.
Since fine-tuning the DNN requires Mongolian labeled data aligned at the speech-frame level, and the quality of this labeled data strongly affects the performance of the DNN, in practice we use the trained GMM-HMM Mongolian acoustic model to force-align the speech features to states. The training process of the DNN-HMM acoustic model is therefore: first train the GMM-HMM Mongolian acoustic model and obtain aligned Mongolian speech feature data; then train and fine-tune the deep neural network (DNN) on the aligned speech feature data; finally, train the hidden Markov model (HMM) again from the resulting Mongolian speech observation states.
Description of the drawings
Fig. 1 shows the DNN-HMM Mongolian acoustic model.
Fig. 2 shows the DNN pre-training procedure.
Fig. 3 shows the experimental comparison against the GMM-HMM acoustic model.
Fig. 4 shows the influence of the dropout technique and the number of hidden layers on the over-fitting distance of the DNN-HMM model.
Embodiment
In order to describe the technical content of the present invention more clearly, it is further described below with reference to specific embodiments.
1. Model construction:
The GMM (Gaussian mixture model) is replaced with a DNN (deep neural network), which estimates the posterior probabilities of the Mongolian acoustic states. Given a Mongolian acoustic feature sequence, the DNN first estimates the probability that the current feature belongs to each HMM state; the HMM then describes the dynamic evolution of the Mongolian speech signal, capturing the temporal state information of the Mongolian speech.
The structure of the DNN-HMM Mongolian acoustic model is shown in Fig. 1. In the DNN-HMM Mongolian acoustic model, the DNN is built by stacking hidden layers from the bottom up. Here S denotes the hidden states of the HMM, A denotes the state-transition probability matrix, L denotes the number of layers of the DNN (with L-1 hidden layers, L0 the input layer, and LL the output layer, so the DNN comprises L+1 layers in total), and W denotes the connection matrices between layers. Before the DNN-HMM Mongolian acoustic model can model the Mongolian speech recognition process, the DNN must be trained. Once the DNN is trained, the modeling process for the Mongolian acoustic model is the same as for a GMM-HMM model.
In the Mongolian acoustic model, the training of the DNN is divided into two stages: pre-training and fine-tuning.
The pre-training of the DNN (shown in Fig. 2) employs a layer-wise unsupervised training algorithm, which is a generative training algorithm. Layer-wise unsupervised pre-training trains each layer of the DNN separately: only one layer is trained at a time, while the parameters of the other layers are held at their initial values. During training, the error between the input and output of each layer is reduced as far as possible, ensuring that the parameters of each layer are optimal for that layer. Next, the output of the trained layer is used as the input to the next layer; this input therefore carries much less error than data passed directly through an untrained multilayer network. Layer-wise unsupervised pre-training thus keeps the input-output error between every pair of adjacent layers relatively small. The pre-training procedure is given in Algorithm 1.
Layer-wise unsupervised pre-training yields good initial parameters for the neural network. Supervised fine-tuning is then carried out with the Mongolian labeled data (i.e., the state labels) using the BP (error back-propagation) algorithm, finally producing a DNN model that can be used for acoustic state classification. The supervised fine-tuning is implemented with stochastic gradient descent; see Algorithm 2.
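The supervised fine-tuning stage can be sketched as follows. This is a minimal illustration, not the patent's Algorithm 2: a single-hidden-layer network with softmax cross-entropy loss trained by mini-batch stochastic gradient descent, with made-up sizes and synthetic, learnable labels standing in for the aligned state labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in the paper the input would be 143/429-dim spliced MFCCs
# and the targets 27 states; smaller sizes keep the sketch quick.
X = rng.normal(size=(200, 20))
y = X[:, :5].argmax(axis=1)                    # synthetic but learnable labels

# "Pre-trained" initial parameters (random here) for hidden + softmax layers.
W1 = rng.normal(scale=0.1, size=(20, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 5));  b2 = np.zeros(5)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    z = h @ W2 + b2
    z = z - z.max(axis=1, keepdims=True)       # numerically stable softmax
    p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
    return h, p

def xent(p, t):
    return -np.mean(np.log(p[np.arange(len(t)), t]))

losses, lr = [], 0.2
for epoch in range(60):                        # mini-batch SGD fine-tuning
    for i in range(0, len(X), 32):
        xb, tb = X[i:i+32], y[i:i+32]
        h, p = forward(xb)
        g = p.copy(); g[np.arange(len(tb)), tb] -= 1; g /= len(tb)  # dL/dz
        gW2 = h.T @ g; gb2 = g.sum(axis=0)
        dh = (g @ W2.T) * (1 - h**2)           # back-propagate through tanh
        gW1 = xb.T @ dh; gb1 = dh.sum(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1
    losses.append(xent(forward(X)[1], y))
```

The training cross-entropy recorded in `losses` should fall steadily as the parameters move away from their pre-trained initialization.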
2. Model use:
Step 1: Given the input Mongolian acoustic feature vector, compute the outputs of the first L layers of the DNN, i.e.:

    v^α = f(z^α) = f(W^α v^(α-1) + b^α),  0 ≤ α < L    (1)

where z^α = W^α v^(α-1) + b^α ∈ R^(N_α) denotes the excitation vector, v^α ∈ R^(N_α) the activation vector, W^α ∈ R^(N_α × N_(α-1)) the weight matrix, b^α ∈ R^(N_α) the bias vector, and N_α the number of neural nodes in layer α. v^0 ∈ R^D denotes the input features of the network; in the DNN-HMM acoustic model the input features are the acoustic feature vectors, and N_0 = D is the dimensionality of the input acoustic feature vector. f(·) denotes the activation function, which maps the excitation vector to the activation vector.
Step 2: Use the softmax classification layer (layer L) to compute the posterior probability of the current feature over all acoustic states, i.e., the probability that the current feature belongs to each Mongolian acoustic state:

    v_i = P_dnn(i | O) = softmax(x_i)    (2)

In formula (2), i ∈ {1, 2, …, C}, where C is the number of hidden states of the acoustic model, x_i is the input of the i-th neural unit of the softmax layer, and v_i is the output of the i-th neural unit of the softmax classification layer, i.e., the posterior probability of the input acoustic feature vector O with respect to the i-th hidden state of the acoustic model.
Step 3: By Bayes' rule, divide the posterior probability of each state by its prior probability to obtain the scaled likelihood of each state.
Step 4: Decode with the Viterbi algorithm to obtain the optimal path.
The prior probability of each hidden state is obtained simply by dividing the total number of frames aligned to that state by the total number of acoustic feature frames.
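Equations (1) and (2) translate directly into code. The sketch below assumes a sigmoid activation for f(·) and uses made-up layer sizes; it is an illustration of the forward pass, not the patent's implementation:

```python
import numpy as np

def dnn_posteriors(o, weights, biases):
    """Equations (1)-(2): hidden layers v^a = f(W^a v^(a-1) + b^a) for
    0 <= a < L with f taken as the sigmoid, then a softmax output layer
    giving P_dnn(i | o) over the C acoustic states."""
    v = o                                        # v^0: input acoustic features
    for W, b in zip(weights[:-1], biases[:-1]):  # the L hidden/input layers
        v = 1.0 / (1.0 + np.exp(-(W @ v + b)))   # f: sigmoid activation
    x = weights[-1] @ v + biases[-1]             # softmax-layer excitation x_i
    x = x - x.max()                              # numerical stability
    e = np.exp(x)
    return e / e.sum()                           # v_i = softmax(x_i)

rng = np.random.default_rng(0)
D, H, C = 39, 16, 6                              # 39-dim MFCC input, C states (toy C)
weights = [rng.normal(scale=0.1, size=(H, D)),
           rng.normal(scale=0.1, size=(H, H)),
           rng.normal(scale=0.1, size=(C, H))]
biases = [np.zeros(H), np.zeros(H), np.zeros(C)]
post = dnn_posteriors(rng.normal(size=D), weights, biases)
```

As required of a posterior distribution, the output is positive and sums to one; dividing it element-wise by the state priors yields the scaled likelihoods of step 3.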
3. Model training process:
Step 1: Train the GMM-HMM Mongolian acoustic model to obtain an optimal GMM-HMM Mongolian speech recognition system, denoted gmm-hmm.
Step 2: Parse gmm-hmm with the Viterbi decoding algorithm, labeling each senone in the gmm-hmm Mongolian acoustic model to obtain its senone_id.
Step 3: Using the gmm-hmm Mongolian acoustic model, map each acoustic state (tri-phone) to the corresponding senone_id.
Step 4: Initialize the DNN-HMM Mongolian acoustic model from the gmm-hmm Mongolian acoustic model (mainly the HMM hidden Markov model parameter part) to obtain the dnn-hmm1 model.
Step 5: Pre-train the DNN with the Mongolian acoustic feature files to obtain ptdnn.
Step 6: Using the gmm-hmm Mongolian acoustic model, force-align the Mongolian acoustic feature data at the state level; the alignment result is align-raw.
Step 7: Convert the physical states of align-raw into senone_ids to obtain the frame-level-aligned training data align-frame.
Step 8: Perform supervised fine-tuning of the ptdnn deep neural network with the alignment data align-data to obtain the network model dnn.
Step 9: According to the maximum likelihood algorithm, re-estimate the transition probabilities of the HMM in dnn-hmm1 using dnn; the resulting network model is denoted dnn-hmm2.
Step 10: If the recognition accuracy on the test set does not improve under dnn and dnn-hmm2, training ends. Otherwise, use dnn-hmm2 to re-align the training data at the state level, then go back to step 7.
During training, an optimal GMM-HMM Mongolian speech recognition system is trained first (step 1), in order to serve the supervised fine-tuning of the DNN. The GMM-HMM Mongolian acoustic model is trained unsupervised with the expectation-maximization algorithm, avoiding any requirement for labeled data. The deep neural network is then pre-trained with the Mongolian acoustic features (step 5). In the second stage of DNN training (the supervised fine-tuning stage), the trained GMM-HMM Mongolian acoustic model is used to force-align the speech features to states (step 6), producing the labeled data; this labeled data is then used for the supervised fine-tuning of the DNN (step 8). After DNN training is complete, the next step of the procedure is determined by the recognition result of the DNN-HMM on the test set (step 10).
4. Experiments and results:
4.1 To verify the validity of the proposed DNN-HMM Mongolian acoustic model, the following experiments were designed:
(1) Extract MFCC acoustic features and carry out GMM-HMM and DNN-HMM Mongolian acoustic modeling experiments, observing how different acoustic modeling units affect the performance of the acoustic model, and comparing the influence of the different types of acoustic model on the speech recognition system.
(2) Build tri-phone DNN-HMM Mongolian acoustic models with deep network structures of different depths, and experimentally study the influence of the number of layers on the Mongolian acoustic model and on over-fitting.
4.2 Experimental setup:
The Mongolian speech recognition corpus consists of 310 Mongolian teaching utterances containing 2,291 Mongolian words in total, and is named the IMUT310 corpus. The corpus has three parts: the audio files, the pronunciation annotations, and the corresponding Mongolian text. In the experiments, the IMUT310 corpus was divided into a training set of 287 utterances and a test set of 23 utterances. The experiments were carried out on the Kaldi platform; the specific Kaldi experimental environment is listed in Table 1.
Table 1: Experimental environment
In the experiments, the Mongolian acoustic features are 39-dimensional MFCC features. The first 13 dimensions consist of 12 cepstral features and 1 energy coefficient; the following two groups of 13 dimensions are the first-order and second-order differences of the first 13 dimensions. When extracting the Mongolian MFCC features, the frame window length is 25 ms and the frame shift is 10 ms. Feature extraction was performed separately on the training and test sets; the speech data yielded 119,960 MFCC feature frames in total, of which 112,535 come from the training data and 7,425 from the test data. For GMM-HMM acoustic model training, the full 39-dimensional Mongolian MFCC features are used. In the mono-phone DNN-HMM experiments, the Mongolian MFCC features are 13-dimensional (excluding the first- and second-order difference features); in the tri-phone DNN-HMM experiments, the Mongolian MFCC features are 39-dimensional.
For DNN training, the features are combined with context: 5 frames before and 5 frames after the current frame are taken to represent its context. Accordingly, the number of input nodes of the mono-phone DNN is 143 (13 × (5 + 1 + 5)) and that of the tri-phone DNN is 429 (39 × (5 + 1 + 5)). The number of output nodes of the DNN corresponds to the number of observable Mongolian phonemes; according to the annotation standard of the corpus, there are 27 output nodes. The number of nodes per hidden layer is set to 1024, the number of fine-tuning epochs to 60, the initial learning rate to 0.015, and the final learning rate to 0.002.
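The context splicing above (5 + 1 + 5 frames) can be sketched as follows; how the boundaries of an utterance are padded is not specified in the text, so the edge-replication below is an assumption:

```python
import numpy as np

def splice(frames, left=5, right=5):
    """Concatenate each frame with `left` preceding and `right` following
    context frames, replicating the first/last frame at the boundaries
    (boundary handling is an assumption, not stated in the text)."""
    T, D = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], left, axis=0),
                             frames,
                             np.repeat(frames[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].ravel() for t in range(T)])

mono = splice(np.zeros((100, 13)))   # mono-phone setup: 13-dim MFCC per frame
tri = splice(np.zeros((100, 39)))    # tri-phone setup: 39-dim MFCC per frame
```

The spliced dimensionalities recover the input-node counts quoted above: 13 × 11 = 143 and 39 × 11 = 429.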
4.3 Experiments and results:
Four experimental units were tested: mono-phone GMM-HMM, tri-phone GMM-HMM, mono-phone DNN-HMM, and tri-phone DNN-HMM. The experimental data are shown in Table 2, and the comparison is shown in Fig. 3.
Table 2: GMM-HMM and DNN-HMM Mongolian acoustic model experimental data
From Fig. 3(a) it can be seen that, relative to the mono-phone GMM-HMM Mongolian acoustic model, the mono-phone DNN-HMM Mongolian acoustic model reduces the word error rate on the training set by 8.84% and on the test set by 11.14%; for the tri-phone models, the tri-phone DNN-HMM Mongolian acoustic model reduces the word error rate on the training set by 1.33% and on the test set by 7.5% relative to the tri-phone GMM-HMM Mongolian acoustic model. Fig. 3(b) shows that the mono-phone DNN-HMM model reduces the sentence error rate on the training set by 32.43% and on the test set by 17.88%; for the tri-phone models, the tri-phone DNN-HMM Mongolian acoustic model reduces the sentence error rate on the training set by 19.3% and on the test set by 13.63% relative to the tri-phone GMM-HMM Mongolian acoustic model.
From the above analysis: the mono-phone DNN-HMM Mongolian acoustic model is clearly better than the mono-phone GMM-HMM Mongolian acoustic model, and for the tri-phone models, the tri-phone DNN-HMM Mongolian acoustic model achieves a higher recognition rate than the tri-phone GMM-HMM Mongolian acoustic model.
The DNN-HMM Mongolian acoustic model can effectively reduce the word error rate and the sentence error rate, improving model performance.
To study the influence of the number of hidden layers and of the dropout technique on the tri-phone DNN-HMM Mongolian acoustic model, comparative experiments on the number of hidden layers and on the dropout technique were carried out, starting from a four-hidden-layer tri-phone DNN-HMM Mongolian acoustic model with dropout. The experimental data are shown in Table 3.
Table 3: Dropout experiments on the tri-phone DNN-HMM acoustic model
To express the degree of over-fitting, we define the over-fitting distance of a model. In speech recognition, over-fitting is usually judged from the recognition rates on the training set and the test set: when the recognition rate on the training set is very high while the recognition rate on the test set is very low, the model suffers from severe over-fitting. We use the absolute difference between the model's evaluation metric on the training set and its evaluation metric on the test set to express the degree of over-fitting, and define:

    over-fitting distance of a model = | evaluation metric on the training set − evaluation metric on the test set |
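The definition above is a one-line computation. In the sketch below, the two recognition rates are hypothetical values chosen only so that the gap matches the 21.17% distance reported later; the paired train/test rates themselves are not given in the text:

```python
def overfitting_distance(train_metric, test_metric):
    """Absolute gap between the same evaluation metric measured on the
    training set and on the test set (both in percent)."""
    return abs(train_metric - test_metric)

# Hypothetical example: a 95.0% word recognition rate on the training set
# and 73.83% on the test set give an over-fitting distance of 21.17 points.
gap = overfitting_distance(95.0, 73.83)
```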
From the dark portion of Fig. 4 it can be seen that, in the DNN-HMM Mongolian acoustic model trained without dropout, when the number of hidden layers increases from 4 to 7, the over-fitting distance for word recognition rises from 21.17% to 54.81%, and the over-fitting distance for sentence recognition rises from 35.32% to 80.72%. Thus, as the number of hidden layers increases, the over-fitting distance of the model grows; this growing over-fitting distance shows that the Mongolian acoustic model built with the DNN over-fits severely, and the performance of the DNN-HMM becomes worse and worse.
In Fig. 4, comparing the two shades of color, it can be seen that with dropout, when the number of hidden layers increases from 4 to 7, the over-fitting distances for word recognition are 21.43%, 21.91%, 24.07%, and 25.48%, respectively; without dropout, the over-fitting distances for word recognition are 21.17%, 21.91%, 42.38%, and 54.81%, respectively. It follows that the over-fitting distance with dropout is smaller than the over-fitting distance without dropout, and the same holds for the over-fitting distance of sentence recognition. After adding the dropout technique, the over-fitting caused by increasing the number of hidden layers is therefore effectively alleviated, improving the recognition performance of the model.
In this description, the present invention has been described with reference to specific embodiments thereof. It is nevertheless evident that various modifications and changes may be made without departing from the spirit and scope of the invention. Accordingly, the description and drawings are to be regarded as illustrative rather than restrictive.
Claims (3)
1. A training method for a Mongolian acoustic model based on DNN, characterized in that: a GMM-HMM Mongolian acoustic model is first trained and aligned Mongolian speech feature data is obtained; a deep neural network (DNN) is then trained and fine-tuned on the aligned speech feature data; finally, the hidden Markov model (HMM) is trained again from the resulting Mongolian speech observation states.
2. The training method for a Mongolian acoustic model based on DNN according to claim 1, characterized in that the specific steps of the training method are:
Step 1: Train the GMM-HMM Mongolian acoustic model to obtain an optimal GMM-HMM Mongolian speech recognition system, denoted gmm-hmm.
Step 2: Parse gmm-hmm with the Viterbi decoding algorithm, labeling each senone in the gmm-hmm Mongolian acoustic model to obtain its senone_id.
Step 3: Using the gmm-hmm Mongolian acoustic model, map each acoustic state (tri-phone) to the corresponding senone_id.
Step 4: Initialize the DNN-HMM Mongolian acoustic model from the gmm-hmm Mongolian acoustic model (mainly the HMM hidden Markov model parameter part) to obtain the dnn-hmm1 model.
Step 5: Pre-train the DNN with the Mongolian acoustic feature files to obtain ptdnn.
Step 6: Using the gmm-hmm Mongolian acoustic model, force-align the Mongolian acoustic feature data at the state level; the alignment result is align-raw.
Step 7: Convert the physical states of align-raw into senone_ids to obtain the frame-level-aligned training data align-frame.
Step 8: Perform supervised fine-tuning of the ptdnn deep neural network with the alignment data align-data to obtain the network model dnn.
Step 9: According to the maximum likelihood algorithm, re-estimate the transition probabilities of the HMM in dnn-hmm1 using dnn; the resulting network model is denoted dnn-hmm2.
Step 10: If the recognition accuracy on the test set does not improve under dnn and dnn-hmm2, training ends. Otherwise, use dnn-hmm2 to re-align the training data at the state level, then go back to step 7.
3. The training method for a Mongolian acoustic model based on DNN according to claim 1 or 2, characterized in that: the dropout technique is added during DNN-HMM Mongolian acoustic model training to avoid over-fitting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711390467.1A CN108182938B (en) | 2017-12-21 | 2017-12-21 | A kind of training method of the Mongol acoustic model based on DNN |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711390467.1A CN108182938B (en) | 2017-12-21 | 2017-12-21 | A kind of training method of the Mongol acoustic model based on DNN |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108182938A true CN108182938A (en) | 2018-06-19 |
CN108182938B CN108182938B (en) | 2019-03-19 |
Family
ID=62546662
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711390467.1A Active CN108182938B (en) | 2017-12-21 | 2017-12-21 | A kind of training method of the Mongol acoustic model based on DNN |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108182938B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326282A (en) * | 2018-10-10 | 2019-02-12 | 内蒙古工业大学 | A kind of small-scale corpus DNN-HMM acoustics training structure |
CN111696522A (en) * | 2020-05-12 | 2020-09-22 | 天津大学 | Tibetan language voice recognition method based on HMM and DNN |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105960672A (en) * | 2014-09-09 | 2016-09-21 | 微软技术许可有限责任公司 | Variable-component deep neural network for robust speech recognition |
CN105957518A (en) * | 2016-06-16 | 2016-09-21 | 内蒙古大学 | Mongolian large vocabulary continuous speech recognition method |
CN106205603A (en) * | 2016-08-29 | 2016-12-07 | 北京语言大学 | A kind of tone appraisal procedure |
CN106991999A (en) * | 2017-03-29 | 2017-07-28 | 北京小米移动软件有限公司 | Audio recognition method and device |
CN107293288A (en) * | 2017-06-09 | 2017-10-24 | 清华大学 | A kind of residual error shot and long term remembers the acoustic model modeling method of Recognition with Recurrent Neural Network |
-
2017
- 2017-12-21 CN CN201711390467.1A patent/CN108182938B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105960672A (en) * | 2014-09-09 | 2016-09-21 | 微软技术许可有限责任公司 | Variable-component deep neural network for robust speech recognition |
CN105957518A (en) * | 2016-06-16 | 2016-09-21 | 内蒙古大学 | Mongolian large vocabulary continuous speech recognition method |
CN106205603A (en) * | 2016-08-29 | 2016-12-07 | 北京语言大学 | A kind of tone appraisal procedure |
CN106991999A (en) * | 2017-03-29 | 2017-07-28 | 北京小米移动软件有限公司 | Audio recognition method and device |
CN107293288A (en) * | 2017-06-09 | 2017-10-24 | 清华大学 | A kind of residual error shot and long term remembers the acoustic model modeling method of Recognition with Recurrent Neural Network |
Non-Patent Citations (3)
Title |
---|
HONGWEI ZHANG ET AL.: "Comparison on Neural Network based acoustic model in Mongolian speech recognition", 《2016 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP)》 * |
中国计算机学会: "《计算机科学技术学科发展报告 2014-2015版》", 30 April 2016, 中国科学技术出版社 * |
张红伟: "基于深度神经网络的蒙古语语音识别系统声学模型的研究", 《中国优秀硕士学位论文全文数据库,信息科技辑》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326282A (en) * | 2018-10-10 | 2019-02-12 | 内蒙古工业大学 | A kind of small-scale corpus DNN-HMM acoustics training structure |
CN111696522A (en) * | 2020-05-12 | 2020-09-22 | 天津大学 | Tibetan language voice recognition method based on HMM and DNN |
CN111696522B (en) * | 2020-05-12 | 2024-02-23 | 天津大学 | Tibetan language voice recognition method based on HMM and DNN |
Also Published As
Publication number | Publication date |
---|---|
CN108182938B (en) | 2019-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104575490B (en) | Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm | |
Shan et al. | Component fusion: Learning replaceable language model component for end-to-end speech recognition system | |
CN105741832B (en) | Spoken language evaluation method and system based on deep learning | |
CN108109615A (en) | A kind of construction and application method of the Mongol acoustic model based on DNN | |
CN108172218B (en) | Voice modeling method and device | |
Agarwalla et al. | Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech | |
JP7070894B2 (en) | Time series information learning system, method and neural network model | |
CN104538028A (en) | Continuous voice recognition method based on deep long and short term memory recurrent neural network | |
CN108962223A (en) | A kind of voice gender identification method, equipment and medium based on deep learning | |
CN104217721B (en) | Based on the phonetics transfer method under the conditions of the asymmetric sound bank that speaker model aligns | |
Hashimoto et al. | Trajectory training considering global variance for speech synthesis based on neural networks | |
CN106340297A (en) | Speech recognition method and system based on cloud computing and confidence calculation | |
CN108364634A (en) | Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm | |
KR101664815B1 (en) | Method for creating a speech model | |
Guo et al. | Deep neural network based i-vector mapping for speaker verification using short utterances | |
CN111599339B (en) | Speech splicing synthesis method, system, equipment and medium with high naturalness | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
CN108182938B (en) | A kind of training method of the Mongol acoustic model based on DNN | |
WO2022148176A1 (en) | Method, device, and computer program product for english pronunciation assessment | |
Shah et al. | Unsupervised Vocal Tract Length Warped Posterior Features for Non-Parallel Voice Conversion. | |
Hai et al. | Cross-lingual phone mapping for large vocabulary speech recognition of under-resourced languages | |
Gómez et al. | Improvements on automatic speech segmentation at the phonetic level | |
Razavi et al. | An HMM-based formalism for automatic subword unit derivation and pronunciation generation | |
CN114333762B (en) | Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium | |
CN107492373B (en) | Tone recognition method based on feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |