CN108182938A - Training method of a DNN-based Mongolian acoustic model - Google Patents

Training method of a DNN-based Mongolian acoustic model Download PDF

Info

Publication number
CN108182938A
CN108182938A
Authority
CN
China
Prior art keywords
dnn
mongol
hmm
acoustic
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711390467.1A
Other languages
Chinese (zh)
Other versions
CN108182938B (en)
Inventor
马志强
杨双涛
李图雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN201711390467.1A priority Critical patent/CN108182938B/en
Publication of CN108182938A publication Critical patent/CN108182938A/en
Application granted granted Critical
Publication of CN108182938B publication Critical patent/CN108182938B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/148 Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The present invention provides a training method for a DNN-based Mongolian acoustic model. A deep neural network (DNN) replaces the Gaussian mixture model (GMM) to estimate the posterior probabilities of the Mongolian acoustic states, a DNN-HMM acoustic model is built, and the training method of this model is disclosed. The invention effectively reduces both word and sentence recognition error rates and improves model performance.

Description

Training method of a DNN-based Mongolian acoustic model
Technical field
The invention belongs to the field of Mongolian speech recognition, and in particular relates to a training method for a DNN-based Mongolian acoustic model.
Background technology
A typical large-vocabulary continuous speech recognition (LVCSR) system consists of feature extraction, an acoustic model, a language model, and a decoder; the acoustic model is the core component of the recognition system. The GMM-HMM acoustic model, built from a GMM (Gaussian mixture model) and an HMM (hidden Markov model), was once the most widely used acoustic model in LVCSR systems.
In the GMM-HMM model, the GMM performs probabilistic modeling of the speech feature vectors, and the EM (expectation-maximization) algorithm maximizes the probability of generating the observed speech features; when the number of Gaussian components is large enough, the GMM can adequately fit the probability distribution of the acoustic features, while the HMM generates the temporal state sequence of the speech from the observation states fitted by the GMM. However, when GMM probabilities are used to describe the distribution of speech data, the GMM is essentially a shallow model, and its fit of the acoustic feature distribution relies on an independence assumption between features, so it cannot fully describe the state-space distribution of the acoustic features. Moreover, the feature dimension used in GMM modeling is typically only a few dozen, which cannot fully capture the correlations between acoustic features, so the model's expressive power is limited.
Invention content
Research on building acoustic models from neural networks and HMMs began in the 1980s, but because computing power was insufficient at the time and training data was scarce, such models performed worse than GMM-HMM. In 2010, Li Deng of Microsoft Research Asia and Hinton's group proposed the CD-DBN-HMM (context-dependent deep belief network - HMM) hybrid acoustic model framework for large-scale continuous speech recognition tasks and carried out related experiments. The results showed that, compared with the GMM-HMM acoustic model, the CD-DBN-HMM acoustic model improved the recognition accuracy of the speech recognition system by about 30%; the proposal of the CD-DBN-HMM hybrid framework thoroughly reshaped the original acoustic modeling framework of speech recognition. Compared with the traditional Gaussian mixture model, a deep neural network is a deep model: it can better represent complex nonlinear functions, better capture the correlations between speech feature vectors, and more easily achieve good modeling results. Based on these advances, the present invention proposes the construction and use of a DNN-based Mongolian acoustic model, in order to better accomplish the Mongolian acoustic modeling task.
The technical scheme of the present invention is as follows:
1. model construction:
The DNN deep neural network replaces the GMM Gaussian mixture model to estimate the posterior probabilities of the Mongolian acoustic states. Given a Mongolian acoustic feature sequence, the DNN model first estimates the probability that the current feature belongs to each HMM state; the HMM then describes the dynamic changes of the Mongolian speech signal and captures the temporal state information of the Mongolian speech.
In the Mongolian acoustic model, the training of the DNN is divided into two stages: pre-training and fine-tuning.
The DNN pre-training uses a layer-wise unsupervised training algorithm, which is a generative training algorithm. Layer-wise unsupervised pre-training trains each layer of the DNN separately: only one layer is trained at a time, while the parameters of the other layers keep their original initialization. During training, the error between each layer's input and output is reduced as much as possible, so that each layer's parameters are optimal for that layer. The output of each trained layer is then used as the input data of the next layer; this input has a much smaller error than data passed directly through an untrained multi-layer network, and layer-wise unsupervised pre-training keeps the input-output error between all layers relatively small.
Layer-wise unsupervised pre-training thus yields good initial parameters for the neural network. Mongolian labeled data (i.e., reference states) are then used for supervised fine-tuning with the BP algorithm (error back-propagation), finally producing a DNN deep neural network model usable for acoustic state classification.
2. model uses:
After the DNN has been pre-trained and fine-tuned, the DNN-HMM acoustic model can be used to recognize Mongolian speech data. The specific procedure is as follows:
Step 1: From the input Mongolian acoustic feature vector, compute the outputs of the first L layers of the DNN deep neural network.
Step 2: Use the softmax classification layer (layer L) to compute the posterior probability of the current feature with respect to all acoustic states, i.e., the probability that the current feature belongs to each Mongolian acoustic state.
Step 3: Following Bayes' rule, divide each state's posterior probability by its own prior probability to obtain the scaled likelihood of each state.
Step 4: Decode with the Viterbi algorithm to obtain the optimal path.
The prior probability of a hidden state is obtained by computing the ratio of the total number of frames aligned to that state to the total number of acoustic feature frames.
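The prior computation and the posterior-to-likelihood conversion of Step 3 can be sketched in a few lines of NumPy (function names are illustrative, not from the patent):

```python
import numpy as np

def state_priors(alignments, num_states):
    """Estimate the prior P(state) as the fraction of all training frames
    that the forced alignment assigns to each state."""
    counts = np.zeros(num_states)
    for ali in alignments:                  # each ali: one state id per frame
        for s in ali:
            counts[s] += 1
    return counts / counts.sum()

def scaled_likelihood(posteriors, priors):
    """Step 3: divide each state's DNN posterior P(state|o) by its prior
    P(state); by Bayes' rule this is proportional to P(o|state)."""
    return posteriors / priors

# two toy alignments, 7 frames total: state 0 gets 2, state 1 gets 2, state 2 gets 3
priors = state_priors([[0, 0, 1], [1, 2, 2, 2]], num_states=3)
```

The scaled likelihoods replace the GMM emission probabilities during Viterbi decoding.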
3. Model training process
The labeled data used in the DNN fine-tuning stage are obtained by forced alignment with a GMM-HMM acoustic model, and the model parameters are updated with stochastic gradient descent. Therefore, before training the DNN-HMM, a sufficiently good GMM-HMM Mongolian acoustic model must first be trained; the labeled data required for DNN-HMM Mongolian acoustic model training are then generated from this GMM-HMM model with the Viterbi algorithm.
Since fine-tuning the DNN model requires Mongolian labels aligned to the Mongolian speech frames, and the quality of the labels directly affects the performance of the DNN model, in practice the trained GMM-HMM Mongolian acoustic model is used to force-align the speech features to states. The training process of the DNN-HMM acoustic model is therefore: first train the GMM-HMM Mongolian acoustic model and obtain aligned Mongolian speech feature data; then train and fine-tune the deep neural network (DNN) on the aligned speech feature data; finally, retrain the hidden Markov model (HMM) according to the resulting Mongolian speech observation states.
Description of the drawings
Fig. 1 shows the DNN-HMM Mongolian acoustic model.
Fig. 2 shows the DNN pre-training procedure.
Fig. 3 shows the experimental comparison against the GMM-HMM acoustic model.
Fig. 4 shows the influence of the dropout technique and the number of hidden layers on the over-fitting distance of the DNN-HMM model.
Embodiment
To describe the technical content of the present invention more clearly, it is further described below with reference to specific embodiments.
1. Model construction:
The DNN deep neural network replaces the GMM Gaussian mixture model to estimate the posterior probabilities of the Mongolian acoustic states. Given a Mongolian acoustic feature sequence, the DNN model first estimates the probability that the current feature belongs to each HMM state; the HMM then describes the dynamic changes of the Mongolian speech signal and captures the temporal state information of the Mongolian speech.
The structure of the DNN-HMM Mongolian acoustic model is shown in Fig. 1. In the DNN-HMM Mongolian acoustic model, the DNN is built by stacking hidden layers from bottom to top. Here S denotes the hidden states of the HMM, A denotes the state-transition probability matrix, L denotes the number of layers of the DNN (there are L-1 hidden layers; L0 is the input layer and LL is the output layer, so the DNN has L+1 layers in total), and W denotes the connection matrices between layers. Before the DNN-HMM Mongolian acoustic model can model the Mongolian speech recognition process, the DNN must be trained; once DNN training is complete, the modeling of the Mongolian acoustic model proceeds as in the GMM-HMM model.
In the Mongolian acoustic model, the training of the DNN is divided into two stages: pre-training and fine-tuning.
The DNN pre-training (shown in Fig. 2) uses a layer-wise unsupervised training algorithm, which is a generative training algorithm. Layer-wise unsupervised pre-training trains each layer of the DNN separately: only one layer is trained at a time, while the parameters of the other layers keep their original initialization. During training, the error between each layer's input and output is reduced as much as possible, so that each layer's parameters are optimal for that layer. The output of each trained layer is then used as the input data of the next layer; this input has a much smaller error than data passed directly through an untrained multi-layer network, and layer-wise unsupervised pre-training keeps the input-output error between all layers relatively small. The pre-training algorithm is given as Algorithm 1.
Layer-wise unsupervised pre-training thus yields good initial parameters for the neural network. Mongolian labeled data (i.e., reference states) are then used for supervised fine-tuning with the BP algorithm (error back-propagation), finally producing a DNN deep neural network model usable for acoustic state classification. The supervised fine-tuning algorithm is implemented with stochastic gradient descent and is given as Algorithm 2.
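Algorithms 1 and 2 themselves are not reproduced in this text. As an illustration of the layer-by-layer idea only, here is a minimal autoencoder-style stand-in: each layer is trained in isolation to reconstruct its own input, then frozen, and its output feeds the next layer. This uses a linear reconstruction criterion as an assumption and is not the patent's actual generative pre-training algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(X, hidden_dim, lr=0.01, epochs=200):
    """Train ONE layer (encoder) so that a throwaway decoder can reconstruct
    the layer's input; only this layer's parameters move."""
    n, d = X.shape
    W_enc = rng.normal(0, 0.1, (d, hidden_dim))
    W_dec = rng.normal(0, 0.1, (hidden_dim, d))
    for _ in range(epochs):
        H = X @ W_enc                 # this layer's activations
        R = H @ W_dec                 # reconstruction of the layer input
        E = R - X                     # per-layer input/output error to minimize
        W_dec -= lr * H.T @ E / n
        W_enc -= lr * X.T @ (E @ W_dec.T) / n
    return W_enc, X @ W_enc           # frozen weights + data for the next layer

X = rng.normal(size=(100, 8))         # toy "acoustic features"
stack, inp = [], X
for width in (6, 4):                  # stack hidden layers from bottom to top
    W, inp = pretrain_layer(inp, width)
    stack.append(W)                   # initialization for supervised fine-tuning
```

The resulting `stack` would serve as the initial DNN parameters before the supervised BP fine-tuning of Algorithm 2.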
2. Use of the model:
Step 1: From the input Mongolian acoustic feature vector, compute the outputs of the first L layers of the DNN deep neural network, i.e.:
v^α = f(z^α) = f(W^α v^(α-1) + b^α), 0 < α < L  (1)
where z^α = W^α v^(α-1) + b^α denotes the excitation (pre-activation) vector of layer α, v^α denotes the activation vector, W^α denotes the weight matrix, b^α denotes the bias vector, and N_α denotes the number of neurons in layer α. v^0 denotes the input features of the network; in the DNN-HMM acoustic model, the input features are the acoustic feature vectors, and N_0 = D is the dimension of the input acoustic feature vector. f(·) denotes the activation function applied to the excitation vector.
Step 2: Use the softmax classification layer (layer L) to compute the posterior probability of the current feature with respect to all acoustic states, i.e., the probability that the current feature belongs to each Mongolian acoustic state:
v_i^L = P_dnn(i | O) = softmax_i(z^L) = exp(z_i^L) / Σ_j exp(z_j^L)  (2)
In formula (2), i ∈ {1, 2, ..., C}, where C is the number of hidden states of the acoustic model, z_i^L denotes the input of the i-th unit of the softmax layer, and v_i^L denotes the output of the i-th unit of the softmax classification layer, i.e., the posterior probability of the input acoustic feature vector O with respect to the i-th hidden state of the acoustic model.
Step 3: Following Bayes' rule, divide each state's posterior probability by its own prior probability to obtain the scaled likelihood of each state.
Step 4: Decode with the Viterbi algorithm to obtain the optimal path.
The prior probability of a hidden state is obtained by computing the ratio of the total number of frames aligned to that state to the total number of acoustic feature frames.
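Equations (1) and (2) transcribe directly into NumPy. A minimal sketch of Steps 1 and 2 with random placeholder weights (the one-hidden-layer network and sigmoid activation here are illustrative choices, not the patent's configuration):

```python
import numpy as np

rng = np.random.default_rng(1)

def dnn_posteriors(v0, weights, biases):
    """Forward pass v^a = f(W^a v^(a-1) + b^a) through the hidden layers (eq. 1),
    then a softmax output layer giving P(state | O) (eq. 2)."""
    v = v0
    for W, b in zip(weights[:-1], biases[:-1]):
        v = 1.0 / (1.0 + np.exp(-(W @ v + b)))   # sigmoid activation f(.)
    z = weights[-1] @ v + biases[-1]             # excitation of the softmax layer
    e = np.exp(z - z.max())                      # subtract max for numerical stability
    return e / e.sum()

D, H, C = 39, 16, 27                             # feature dim, hidden width, state count
Ws = [rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (C, H))]
bs = [np.zeros(H), np.zeros(C)]
p = dnn_posteriors(rng.normal(size=D), Ws, bs)   # posteriors over the C acoustic states
```

The vector `p` sums to one over the acoustic states; dividing it elementwise by the state priors gives the scaled likelihoods of Step 3.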
3. Model training process:
Step 1: Train the GMM-HMM Mongolian acoustic model, obtaining an optimal GMM-HMM Mongolian speech recognition system, denoted gmm-hmm.
Step 2: Parse gmm-hmm with the Viterbi decoding algorithm, labeling each senone in the gmm-hmm Mongolian acoustic model to obtain senone_ids.
Step 3: Using the gmm-hmm Mongolian acoustic model, map the triphone acoustic states to the corresponding senone_ids.
Step 4: Initialize the DNN-HMM Mongolian acoustic model from the gmm-hmm Mongolian acoustic model, mainly the HMM (hidden Markov model) parameter part, obtaining the dnn-hmm1 model.
Step 5: Pre-train the DNN deep neural network on the Mongolian acoustic feature files, obtaining ptdnn.
Step 6: Using the gmm-hmm Mongolian acoustic model, force-align the Mongolian acoustic feature data at the state level; the alignment result is align-raw.
Step 7: Convert the physical states in align-raw to senone_ids, obtaining the frame-level aligned training data align-frame.
Step 8: Perform supervised fine-tuning of the ptdnn deep neural network using the alignment data align-frame, obtaining the network model dnn.
Step 9: According to the maximum-likelihood criterion, re-estimate the transition probabilities of the HMM in dnn-hmm1 using dnn; the resulting network model is denoted dnn-hmm2.
Step 10: If the test-set recognition accuracy of dnn and dnn-hmm2 no longer improves, training ends. Otherwise, use dnn-hmm2 to re-align the training data at the state level, then return to Step 7.
In this training process, an optimal GMM-HMM Mongolian speech recognition system is trained first (Step 1) to serve the supervised fine-tuning of the DNN. The GMM-HMM Mongolian acoustic model is trained unsupervised with the expectation-maximization algorithm, avoiding any requirement for labeled data. The deep neural network is then pre-trained on the Mongolian acoustic features (Step 5). In the second stage of deep neural network training (the supervised fine-tuning stage), the trained GMM-HMM Mongolian acoustic model force-aligns the speech features to states (Step 6), producing the labeled data, which are finally used for supervised fine-tuning of the DNN deep neural network (Step 8). After DNN training completes, the next step of the flow (Step 10) is determined by the recognition results of the DNN-HMM on the test set.
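The control flow of Steps 7-10 is an alignment/fine-tune loop that stops when test accuracy plateaus. A toy sketch of just that loop, with the heavy lifting stubbed out as caller-supplied functions (all names hypothetical):

```python
def train_dnn_hmm(realign, fine_tune, evaluate, max_iters=5):
    """Outer loop of Steps 7-10: fine-tune on the current alignments,
    evaluate on the test set, and re-align with the improved model,
    stopping as soon as accuracy no longer improves."""
    best = -1.0
    align = realign(None)                    # initial alignment from gmm-hmm
    for _ in range(max_iters):
        model = fine_tune(align)             # Step 8: supervised BP fine-tuning
        acc = evaluate(model)                # Step 10: test-set accuracy
        if acc <= best:
            break                            # no improvement: training ends
        best = acc
        align = realign(model)               # Step 10 back to Step 7: re-align
    return best

# toy run: accuracy improves 0.70 -> 0.75, then plateaus, so the loop stops
accs = iter([0.70, 0.75, 0.75])
best = train_dnn_hmm(realign=lambda m: "align",
                     fine_tune=lambda a: "dnn",
                     evaluate=lambda m: next(accs))
# best == 0.75
```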
4. Experiments and results:
4.1 To verify the validity of the proposed DNN-HMM Mongolian acoustic model, the following experiments were designed:
(1) Extract MFCC acoustic features and carry out modeling experiments with GMM-HMM and DNN-HMM Mongolian acoustic models, observing the influence of different acoustic modeling units on acoustic model performance and comparing the influence of different acoustic model types on the speech recognition system.
(2) Build DNN-HMM triphone Mongolian acoustic models with deep network structures of different depths, and experimentally study the influence of the number of layers on the Mongolian acoustic model and on over-fitting.
4.2 Experimental setup:
The Mongolian speech recognition corpus consists of 310 Mongolian teaching utterances containing 2291 Mongolian words in total, and is named the IMUT310 corpus. The corpus has three parts: audio files, pronunciation annotations, and the corresponding Mongolian text. In the experiments, the IMUT310 corpus was divided into a training set of 287 sentences and a test set of 23 sentences. The experiments were carried out on the Kaldi platform; the specific Kaldi experimental environment configuration is shown in Table 1.
1 experimental situation of table
In the experiments, the Mongolian acoustic features are represented by 39-dimensional MFCC features, where the first 13 dimensions consist of 12 cepstral coefficients and 1 energy coefficient, and the remaining two 13-dimensional blocks are the first-order and second-order differences of the first 13 dimensions. When extracting the Mongolian MFCC features, the frame window length is 25 ms and the frame shift is 10 ms. Features were extracted separately for the training and test sets; the speech data produced 119960 MFCC feature frames in total, of which 112535 came from the training data and 7425 from the test data. GMM-HMM acoustic model training uses the full 39-dimensional Mongolian MFCC features. The monophone DNN-HMM experiments use 13-dimensional Mongolian MFCC features (without the first- and second-order differences), while the triphone DNN-HMM experiments use the 39-dimensional MFCC features.
During DNN training, feature extraction uses context splicing: 5 frames before and 5 frames after the current frame represent its context. Consequently, the monophone DNN has 143 input nodes (13 × (5 + 1 + 5)) and the triphone DNN has 429 input nodes (39 × (5 + 1 + 5)). The number of DNN output nodes depends on the number of observable Mongolian phones; according to the corpus annotation standard, there are 27 output nodes. The number of nodes per hidden layer is set to 1024, the number of fine-tuning iterations to 60, the initial learning rate to 0.015, and the final learning rate to 0.002.
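The context splicing described above is a simple windowing operation; a sketch (the edge-padding by repeating the first/last frame is an assumption, since the patent does not state how boundaries are handled):

```python
import numpy as np

def splice(frames, left=5, right=5):
    """Concatenate each frame with its `left` preceding and `right` following
    neighbours, padding the edges by repeating the first/last frame."""
    T, D = frames.shape
    padded = np.vstack([np.repeat(frames[:1], left, axis=0),
                        frames,
                        np.repeat(frames[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].ravel() for t in range(T)])

mfcc = np.zeros((100, 39))   # 100 frames of 39-dim MFCCs (triphone setup)
x = splice(mfcc)
# x.shape == (100, 429), matching the 39 * (5 + 1 + 5) input nodes
```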
4.3 experiments and result:
Four acoustic modeling units were tested: monophone GMM-HMM, triphone GMM-HMM, monophone DNN-HMM, and triphone DNN-HMM. The experimental data are given in Table 2, and the comparison is shown in Fig. 3.
2 GMM-HMM of table and DNN-HMM Mongol acoustic model experimental datas
Fig. 3(a) shows that, relative to the monophone GMM-HMM Mongolian acoustic model, the monophone DNN-HMM Mongolian acoustic model reduces the word error rate by 8.84% on the training set and by 11.14% on the test set; for the triphone models, the triphone DNN-HMM Mongolian acoustic model reduces the word error rate by 1.33% on the training set and by 7.5% on the test set relative to the triphone GMM-HMM Mongolian acoustic model. Fig. 3(b) shows that the monophone model reduces the sentence error rate by 32.43% on the training set and by 17.88% on the test set; for the triphone models, the triphone DNN-HMM Mongolian acoustic model reduces the sentence error rate by 19.3% on the training set and by 13.63% on the test set relative to the triphone GMM-HMM Mongolian acoustic model.
The above analysis shows that the monophone DNN-HMM Mongolian acoustic model is clearly better than the monophone GMM-HMM Mongolian acoustic model, and that for the triphone models, the triphone DNN-HMM Mongolian acoustic model achieves a higher recognition rate than the triphone GMM-HMM Mongolian acoustic model.
The DNN-HMM Mongolian acoustic model thus effectively reduces both word and sentence recognition error rates and improves model performance.
To study the influence of the number of hidden layers and of the dropout technique on the DNN-HMM triphone Mongolian acoustic model, experiments were conducted starting from a four-hidden-layer triphone DNN-HMM Mongolian acoustic model with dropout, comparing the number of hidden layers and the use of dropout. The experimental data are given in Table 3.
Dropout is tested on 3 three-tone DNN-HMM acoustic models of table
To quantify the degree of over-fitting, we define the over-fitting distance of a model. In speech recognition, over-fitting is usually judged from the recognition rates on the training and test sets: when the recognition rate is very high on the training set but very low on the test set, the model is severely over-fitted. We use the absolute difference between the model's evaluation metric on the training set and its evaluation metric on the test set to represent the degree of over-fitting, defined as:
Over-fitting distance of a model = |evaluation metric on the training set − evaluation metric on the test set|
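The definition is a one-liner; applied to illustrative numbers (the 90%/70% figures below are hypothetical, not taken from Table 3):

```python
def overfit_distance(train_metric, test_metric):
    """Over-fitting distance: absolute gap between the same evaluation
    metric measured on the training set and on the test set."""
    return abs(train_metric - test_metric)

# e.g. 90% training-set accuracy vs 70% test-set accuracy
d = overfit_distance(90.0, 70.0)
# d == 20.0 percentage points
```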
The dark portion of Fig. 4 shows that, for the DNN-HMM Mongolian acoustic models trained without dropout, increasing the number of hidden layers from 4 to 7 raises the word-level over-fitting distance from 21.17% to 54.81%, and the sentence-level over-fitting distance from 35.32% to 80.72%. Thus, as the number of hidden layers grows, the model's over-fitting distance grows as well; the growing over-fitting distance shows that the Mongolian acoustic model built with the DNN over-fits severely, and the performance of the DNN-HMM becomes worse and worse.
In Fig. 4, comparing the two shades shows that with dropout, as the number of hidden layers increases from 4 to 7, the word-level over-fitting distances are 21.43%, 21.91%, 24.07%, and 25.48%, whereas without dropout they are 21.17%, 21.91%, 42.38%, and 54.81%. The over-fitting distances with dropout are therefore smaller than those without dropout, and the same holds for the sentence-level over-fitting distances. Adding dropout thus effectively mitigates the over-fitting caused by increasing the number of hidden layers, thereby improving the recognition performance of the model.
In this description, the invention has been described with reference to specific embodiments. It is clear, however, that various modifications and alterations can still be made without departing from the spirit and scope of the invention. The description and drawings should therefore be regarded as illustrative rather than restrictive.

Claims (3)

1. A training method for a DNN-based Mongolian acoustic model, characterized in that: a GMM-HMM Mongolian acoustic model is first trained to obtain aligned Mongolian speech feature data; a deep neural network (DNN) is then trained and fine-tuned on the aligned speech feature data; finally, the hidden Markov model (HMM) is retrained according to the resulting Mongolian speech observation states.
2. The training method for a DNN-based Mongolian acoustic model according to claim 1, characterized in that the specific steps of the training method are:
Step 1: Train the GMM-HMM Mongolian acoustic model, obtaining an optimal GMM-HMM Mongolian speech recognition system, denoted gmm-hmm.
Step 2: Parse gmm-hmm with the Viterbi decoding algorithm, labeling each senone in the gmm-hmm Mongolian acoustic model to obtain senone_ids.
Step 3: Using the gmm-hmm Mongolian acoustic model, map the triphone acoustic states to the corresponding senone_ids.
Step 4: Initialize the DNN-HMM Mongolian acoustic model from the gmm-hmm Mongolian acoustic model, mainly the HMM (hidden Markov model) parameter part, obtaining the dnn-hmm1 model.
Step 5: Pre-train the DNN deep neural network on the Mongolian acoustic feature files, obtaining ptdnn.
Step 6: Using the gmm-hmm Mongolian acoustic model, force-align the Mongolian acoustic feature data at the state level; the alignment result is align-raw.
Step 7: Convert the physical states in align-raw to senone_ids, obtaining the frame-level aligned training data align-frame.
Step 8: Perform supervised fine-tuning of the ptdnn deep neural network using the alignment data align-frame, obtaining the network model dnn.
Step 9: According to the maximum-likelihood criterion, re-estimate the transition probabilities of the HMM in dnn-hmm1 using dnn; the resulting network model is denoted dnn-hmm2.
Step 10: If the test-set recognition accuracy of dnn and dnn-hmm2 no longer improves, training ends; otherwise, use dnn-hmm2 to re-align the training data at the state level and return to Step 7.
3. The training method for a DNN-based Mongolian acoustic model according to claim 1 or 2, characterized in that: the dropout technique is added in the training of the DNN-HMM Mongolian acoustic model to avoid over-fitting.
CN201711390467.1A 2017-12-21 2017-12-21 Training method of a DNN-based Mongolian acoustic model Active CN108182938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711390467.1A CN108182938B (en) 2017-12-21 2017-12-21 Training method of a DNN-based Mongolian acoustic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711390467.1A CN108182938B (en) 2017-12-21 2017-12-21 Training method of a DNN-based Mongolian acoustic model

Publications (2)

Publication Number Publication Date
CN108182938A true CN108182938A (en) 2018-06-19
CN108182938B CN108182938B (en) 2019-03-19

Family

ID=62546662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711390467.1A Active CN108182938B (en) 2017-12-21 2017-12-21 Training method of a DNN-based Mongolian acoustic model

Country Status (1)

Country Link
CN (1) CN108182938B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326282A (en) * 2018-10-10 2019-02-12 内蒙古工业大学 A small-scale-corpus DNN-HMM acoustic training structure
CN111696522A (en) * 2020-05-12 2020-09-22 天津大学 Tibetan language voice recognition method based on HMM and DNN

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105960672A (en) * 2014-09-09 2016-09-21 微软技术许可有限责任公司 Variable-component deep neural network for robust speech recognition
CN105957518A (en) * 2016-06-16 2016-09-21 内蒙古大学 Mongolian large vocabulary continuous speech recognition method
CN106205603A (en) * 2016-08-29 2016-12-07 北京语言大学 A tone evaluation method
CN106991999A (en) * 2017-03-29 2017-07-28 北京小米移动软件有限公司 Audio recognition method and device
CN107293288A (en) * 2017-06-09 2017-10-24 清华大学 An acoustic model modeling method based on a residual long short-term memory recurrent neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105960672A (en) * 2014-09-09 2016-09-21 Microsoft Technology Licensing LLC Variable-component deep neural network for robust speech recognition
CN105957518A (en) * 2016-06-16 2016-09-21 Inner Mongolia University Mongolian large-vocabulary continuous speech recognition method
CN106205603A (en) * 2016-08-29 2016-12-07 Beijing Language and Culture University A tone evaluation method
CN106991999A (en) * 2017-03-29 2017-07-28 Beijing Xiaomi Mobile Software Co., Ltd. Speech recognition method and device
CN107293288A (en) * 2017-06-09 2017-10-24 Tsinghua University Acoustic model modeling method based on residual long short-term memory recurrent neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HONGWEI ZHANG ET AL.: "Comparison on Neural Network based acoustic model in Mongolian speech recognition", 2016 International Conference on Asian Language Processing (IALP) *
China Computer Federation: "Report on Advances in Computer Science and Technology (2014-2015 Edition)", 30 April 2016, China Science and Technology Press *
Zhang Hongwei: "Research on Acoustic Models for a Mongolian Speech Recognition System Based on Deep Neural Networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326282A (en) * 2018-10-10 2019-02-12 Inner Mongolia University of Technology A DNN-HMM acoustic training architecture for small-scale corpora
CN111696522A (en) * 2020-05-12 2020-09-22 Tianjin University Tibetan speech recognition method based on HMM and DNN
CN111696522B (en) * 2020-05-12 2024-02-23 Tianjin University Tibetan speech recognition method based on HMM and DNN

Also Published As

Publication number Publication date
CN108182938B (en) 2019-03-19

Similar Documents

Publication Publication Date Title
CN104575490B (en) Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
Shan et al. Component fusion: Learning replaceable language model component for end-to-end speech recognition system
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN108109615A (en) Construction and application method for a DNN-based Mongolian acoustic model
CN108172218B (en) Voice modeling method and device
Agarwalla et al. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech
JP7070894B2 (en) Time series information learning system, method and neural network model
CN104538028A (en) Continuous speech recognition method based on deep long short-term memory recurrent neural networks
CN108962223A (en) Voice gender recognition method, device and medium based on deep learning
CN104217721B (en) Voice conversion method under asymmetric corpus conditions based on speaker model alignment
Hashimoto et al. Trajectory training considering global variance for speech synthesis based on neural networks
CN106340297A (en) Speech recognition method and system based on cloud computing and confidence calculation
CN108364634A (en) Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
KR101664815B1 (en) Method for creating a speech model
Guo et al. Deep neural network based i-vector mapping for speaker verification using short utterances
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN111091809B (en) Regional accent recognition method and device based on deep feature fusion
CN108182938B (en) Training method for a DNN-based Mongolian acoustic model
WO2022148176A1 (en) Method, device, and computer program product for english pronunciation assessment
Shah et al. Unsupervised Vocal Tract Length Warped Posterior Features for Non-Parallel Voice Conversion.
Hai et al. Cross-lingual phone mapping for large vocabulary speech recognition of under-resourced languages
Gómez et al. Improvements on automatic speech segmentation at the phonetic level
Razavi et al. An HMM-based formalism for automatic subword unit derivation and pronunciation generation
CN114333762B (en) Expressiveness-based speech synthesis method, system, electronic device and storage medium
CN107492373B (en) Tone recognition method based on feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant