CN103117060A - Modeling approach and modeling system of acoustic model used in speech recognition - Google Patents
Abstract
The invention relates to a modeling method and modeling system for an acoustic model used in speech recognition. The modeling method comprises the steps of: S1, training an initial model whose modeling unit is the tri-phone state obtained by phoneme-decision-tree clustering, the model providing the state transition probabilities; S2, force-aligning the phonetic features of the training data to tri-phone states with the initial model to obtain frame-level state information; S3, pre-training a deep neural network to obtain initial weights for each hidden layer; S4, training the initialized network with the error back-propagation algorithm on the obtained frame-level state information and updating the weights. The method uses context-dependent tri-phone states as modeling units, builds the model on a deep neural network, initializes the weights of each hidden layer with the restricted Boltzmann machine algorithm, and subsequently updates the weights with error back-propagation. This effectively mitigates the risk of the network falling into a local extremum, to which it is prone without pre-training, and substantially improves the modeling accuracy of the acoustic model.
Description
Technical field
The present invention relates to the field of speech recognition, and in particular to a modeling method and modeling system for an acoustic model used in speech recognition.
Background technology
The mainstream framework for speech recognition today is based on statistical models. A typical speech recognition system, shown in Figure 1, comprises a voice acquisition and front-end processing module, a feature extraction module, an acoustic model module, a language model module, and a decoder module. The basic procedure is as follows: speech collected by the acquisition device passes through front-end processing and then feature extraction; the extracted feature sequence, such as MFCC or PLP features, is scored by the acoustic model to obtain observation probabilities, which are fed to the decoder together with language model probabilities to find the most likely text sequence. The acoustic model is conventionally built on the hidden Markov framework, with a Gaussian mixture model (GMM) modeling the probability distribution of the phonetic features. The GMM makes some inappropriate assumptions about the phonetic features and their distribution, for example that adjacent feature frames are linearly independent and that the observation probability follows a mixture-of-Gaussians distribution. Moreover, GMM parameter training maximizes the likelihood of the observed features as its objective, while decoding uses the maximum a posteriori criterion, so the two are inconsistent as probability models. The modeling accuracy of the traditional acoustic model is therefore limited, and recognition performance suffers.
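As background for the GMM-based observation model described above, the following is a minimal sketch, not taken from the patent, of how a mixture-of-Gaussians model scores one MFCC feature frame. The function name and the diagonal-covariance simplification are illustrative assumptions:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM.

    x: (D,) feature vector, e.g. one MFCC frame
    weights: (M,) mixture weights summing to 1
    means: (M, D) component means
    variances: (M, D) per-dimension (diagonal) variances
    """
    D = x.shape[0]
    # Per-component log Gaussian density with diagonal covariance.
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1))
    log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo
    # Log-sum-exp over mixture components for numerical stability.
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))
```

In a full HMM-GMM system one such mixture is attached to every clustered state, and these per-frame scores serve as the observation probabilities fed to the decoder.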
Summary of the invention
To address the above problems, embodiments of the present invention propose a modeling method and a modeling system for an acoustic model used in speech recognition.
In a first aspect, an embodiment of the present invention proposes a modeling method for an acoustic model used in speech recognition. The method comprises: training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model obtaining the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm; force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain frame-level state information for the phonetic features; pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and training the initialized deep neural network with the error back-propagation algorithm based on the tri-phone states of the phonetic features of the training data, updating the weights of each hidden layer.
Preferably, force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
Preferably, pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
In a second aspect, an embodiment of the present invention proposes a modeling system for an acoustic model used in speech recognition, comprising: a first module for training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model obtaining the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm; a second module for force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain frame-level state information for the phonetic features; a third module for pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and a fourth module for training the deep neural network with the error back-propagation algorithm based on the tri-phone states of the phonetic features of the training data and updating the weights of each hidden layer.
Preferably, the second module force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: the second module, based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
Preferably, the third module pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: the third module training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
Embodiments of the present invention use the tri-phone state as the modeling unit, build the model on a deep neural network, initialize the weights of each hidden layer with the restricted Boltzmann machine algorithm, and subsequently update the weights with the back-propagation algorithm. This effectively mitigates the risk of the network falling into a local extremum, to which it is prone without pre-training, and further improves the modeling accuracy of the acoustic model.
Description of drawings
The present invention is described in further detail below in conjunction with the drawings and specific embodiments.
Fig. 1 is a schematic diagram of an existing speech recognition system;
Fig. 2 is a block diagram of the context-dependent deep neural network speech recognition system of an embodiment of the present invention;
Fig. 3 is a schematic diagram of the modeling method for an acoustic model used in speech recognition according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the modeling system for an acoustic model used in speech recognition according to an embodiment of the present invention.
Embodiment
The technical solution of the embodiments of the present invention is described in further detail below with reference to the drawings and examples.
Since the Gaussian mixture model must make incorrect assumptions about the phonetic features and their probability distribution, embodiments of the present invention use a context-dependent deep neural network in place of the GMM for acoustic modeling. The deep neural network comprises multiple hidden layers, and its modeling unit is the context-dependent tri-phone state obtained after phoneme-decision-tree clustering. The basic block diagram of the whole system is shown in Figure 2.
During deep neural network training the minimum cross-entropy criterion is used as the objective function. Because the network has multiple hidden layers, its error surface has many local extrema, so the network easily falls into a local extremum and converges prematurely during training. To address this, the neural computing field has proposed initializing the weight parameters by pre-training the network and then training the network parameters with the traditional error back-propagation algorithm. The pre-training algorithm uses the restricted Boltzmann machine (RBM), a bipartite graphical model comprising one visible layer and one hidden layer, with no connections between units within the same layer and dense connections between units of different layers. The model defines the joint distribution of the visible- and hidden-layer variables through an energy function, as follows:

p(v, h) = exp(−E(v, h)) / Z

where Z is the normalizing partition function.
Here v is the visible-layer variable, h is the hidden-layer variable, E(v, h) is the energy function, and p(v, h) is their joint probability. Training maximizes the likelihood p(v) of the observed features, and the weight update formula is as follows:
Δw_ij = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model

w_ij(t+1) = w_ij(t) + Δw_ij

where w_ij is the connection weight between visible unit i and hidden unit j, t is the iteration number, and ⟨·⟩ denotes the average of the quantity in brackets.
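The model expectation ⟨v_i h_j⟩_model in the update above is intractable to compute exactly; in practice it is approximated with contrastive divergence (CD-1), i.e. a single Gibbs sampling step. The following is a minimal numpy sketch of one such update for a binary RBM; biases are omitted and all names are illustrative, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, v0, lr=0.1):
    """One CD-1 step approximating Δw_ij = <v_i h_j>_data − <v_i h_j>_model.

    W:  (D, H) weight matrix of a binary RBM (biases omitted for brevity)
    v0: (N, D) mini-batch of binary visible vectors
    """
    # Positive phase: hidden probabilities driven by the data.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step gives an approximate model sample.
    pv1 = sigmoid(h0 @ W.T)
    ph1 = sigmoid(pv1 @ W)
    # <v h>_data − <v h>_model, averaged over the batch.
    grad = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    return W + lr * grad
```

Running this update repeatedly over mini-batches of feature frames drives the RBM toward higher data likelihood, which is what "training to convergence" refers to below.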
By training restricted Boltzmann machines layer by layer and using their parameters to initialize the deep neural network, the initial weights fall at a reasonable starting point in weight space, which alleviates to some extent the risk of the network falling into a local extremum during training. At the same time, the tri-phone states obtained after phoneme-decision-tree clustering serve as the teacher signal of the neural network; since they encode the contextual relationships of the phonemes, the acoustic model becomes finer-grained and more accurate.
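The layer-by-layer procedure described above can be sketched as follows. This is an illustrative toy implementation under the same binary-RBM simplifications (no biases, fixed epoch count rather than a convergence test), not the patent's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(v, n_hidden, epochs=5, lr=0.1):
    """Minimal CD-1 training loop for one binary RBM (biases omitted)."""
    W = rng.normal(0.0, 0.01, size=(v.shape[1], n_hidden))
    for _ in range(epochs):
        ph0 = sigmoid(v @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T)
        ph1 = sigmoid(pv1 @ W)
        W += lr * (v.T @ ph0 - pv1.T @ ph1) / v.shape[0]
    return W

def pretrain_stack(data, hidden_sizes):
    """Greedy layer-wise pre-training: each RBM's hidden activations become
    the visible data for the next RBM; the learned weight matrices then
    initialise the corresponding hidden layers of the deep network."""
    weights, v = [], data
    for h in hidden_sizes:
        W = train_rbm(v, h)
        weights.append(W)
        v = sigmoid(v @ W)  # propagate upward to train the next layer
    return weights
```

The returned list of weight matrices plays the role of the "parameters for initializing the weights of each hidden layer" in the method's step 3.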
Fig. 3 is a schematic diagram of the modeling method for an acoustic model used in speech recognition according to an embodiment of the present invention. The method comprises: Step 1, building the initial model. Specifically, a hidden Markov model-Gaussian mixture (HMM-GMM) model is trained with the training data; the modeling unit of the HMM-GMM model is the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, and the HMM-GMM model obtains the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm.
Step 2, obtaining frame-level state information. Specifically, the tri-phone states of the phonetic features of the training data are force-aligned based on the HMM-GMM model to obtain the frame-level state information of the phonetic features. Preferably, this specifically comprises: based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
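The most-probable-state correspondence described above is typically computed with a Viterbi-style forced alignment. The following toy sketch, illustrative and not taken from the patent, aligns T frames to a fixed left-to-right state sequence and returns the frame-level state labels that serve as training targets; it assumes at least as many frames as states:

```python
import numpy as np

def force_align(frame_loglik, n_states):
    """Toy forced alignment: the monotonic, non-skipping path through
    states 0..n_states-1 that maximises the summed frame log-likelihood.

    frame_loglik: (T, n_states) log-likelihood of each frame under each state
    Returns a length-T array of state labels (frame-level state info).
    """
    T = frame_loglik.shape[0]
    NEG = -1e30
    dp = np.full((T, n_states), NEG)      # best score ending at (frame, state)
    back = np.zeros((T, n_states), dtype=int)
    dp[0, 0] = frame_loglik[0, 0]         # must start in the first state
    for t in range(1, T):
        for s in range(n_states):
            # Either stay in s, or advance from s-1 (left-to-right topology).
            stay = dp[t - 1, s]
            move = dp[t - 1, s - 1] if s > 0 else NEG
            if stay >= move:
                dp[t, s], back[t, s] = stay + frame_loglik[t, s], s
            else:
                dp[t, s], back[t, s] = move + frame_loglik[t, s], s - 1
    # Backtrace from the final state (alignment must end in the last state).
    path = np.zeros(T, dtype=int)
    path[-1] = n_states - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

In a real system the per-frame scores come from the trained HMM-GMM model and the state sequence is the clustered tri-phone state chain of the utterance's transcript.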
Step 3, initializing the weights of each hidden layer of the deep neural network. Specifically, the deep neural network serving as the acoustic model is pre-trained to obtain parameters for initializing the weights of each hidden layer of the deep network.
Step 4, updating the weights of each hidden layer of the deep neural network. Specifically, the deep neural network is trained with the error back-propagation algorithm based on the tri-phone states of the phonetic features of the training data, and the weights of each hidden layer are updated.
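Step 4's supervised fine-tuning minimizes the cross-entropy between the network's softmax output and the frame-level state labels. Below is a minimal one-hidden-layer sketch of a single back-propagation update; it is illustrative only (a real deep network has several hidden layers plus bias terms), and all names are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def backprop_step(W1, W2, x, labels, lr=0.1):
    """One error-back-propagation update on a 1-hidden-layer softmax net
    trained with the cross-entropy criterion on frame-level state labels.

    x: (N, D) feature frames;  labels: (N,) integer state ids
    W1: (D, H) hidden weights; W2: (H, K) output weights (biases omitted)
    """
    # Forward pass.
    h = np.tanh(x @ W1)
    p = softmax(h @ W2)
    # Cross-entropy gradient at the softmax output: p - one_hot(labels).
    delta2 = p.copy()
    delta2[np.arange(len(labels)), labels] -= 1.0
    delta2 /= len(labels)
    # Back-propagate through the hidden layer (tanh' = 1 - h^2).
    delta1 = (delta2 @ W2.T) * (1.0 - h ** 2)
    # Gradient descent on both weight matrices.
    return W1 - lr * (x.T @ delta1), W2 - lr * (h.T @ delta2)
```

In the method described here, W1 (and the other hidden-layer weights of a deeper network) would start from the RBM pre-training of step 3 rather than from random values.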
Preferably, pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
Note that the hidden Markov-Gaussian mixture HMM-GMM model may also be written as the hidden Markov/Gaussian mixture HMM/GMM model.
The pre-training in step 3 can be regarded as unsupervised training, while the training in step 4 can be regarded as supervised training.
In addition, the pre-training of step 3 and step 2 can be carried out simultaneously.
When the model is used as the acoustic model for speech recognition, the posterior probabilities that the deep neural network generates for the phonetic features are converted into likelihoods through Bayes' formula and fed to the decoder; the text sequence obtained after decoding is the recognized speech content. The recognition quality can be assessed from the difference between the recognized content and the actual original speech; this in turn assesses the performance of the deep neural network as the acoustic model in the speech recognition system. Where necessary, the network can be retrained, and the state transition probabilities in the HMM-GMM model can even be redesigned.
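The Bayes conversion mentioned above is usually implemented in the log domain: since p(x|s) = p(s|x)·p(x)/p(s) and p(x) is the same for every state within a frame, dividing each posterior by its state prior yields a "scaled likelihood" that is sufficient for decoding. A minimal sketch, with illustrative names:

```python
import numpy as np

def posterior_to_loglik(log_posterior, log_prior):
    """Convert DNN state posteriors p(s|x) into scaled log-likelihoods
    for decoding, via Bayes' rule: p(x|s) ∝ p(s|x) / p(s).

    log_posterior: (T, S) per-frame log p(s|x) from the network
    log_prior:     (S,)   log p(s), e.g. estimated from alignment counts
    """
    # p(x) is constant per frame, so it shifts every path's score equally
    # and can be dropped for Viterbi decoding.
    return log_posterior - log_prior
```

The state priors are typically estimated by counting how often each state appears in the forced alignments of step 2.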
Fig. 4 is a schematic diagram of the modeling system for an acoustic model used in speech recognition according to an embodiment of the present invention. The modeling system comprises: a first module for training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model obtaining the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm; a second module for force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain frame-level state information for the phonetic features; a third module for pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and a fourth module for training the deep neural network with the error back-propagation algorithm based on the tri-phone states of the phonetic features of the training data and updating the weights of each hidden layer.
Preferably, the second module force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: the second module, based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
Preferably, the third module pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: the third module training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
Embodiments of the present invention use a deep neural network in place of the Gaussian mixture model for acoustic modeling. The modeling exploits tri-phone states with context-dependent characteristics and, unlike the GMM, requires no ad hoc assumptions about the phonetic features and their distribution: the network directly outputs the posterior probabilities of the phonetic features. The tri-phone state fully accounts for the contextual dependence of language, making the modeling unit more fine-grained, and the multiple hidden layers resemble the human speech perception system, which aids the extraction of high-order feature information. Embodiments of the present invention initialize the weights of each hidden layer of the network with the restricted Boltzmann machine algorithm and subsequently update them with the back-propagation algorithm, effectively mitigating the risk of falling into a local extremum during pre-training and further improving the modeling accuracy of the acoustic model.
Those skilled in the art will further appreciate that the exemplary modules and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms by function. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
It should be noted that these are only preferred embodiments of the present invention and do not limit its scope of practice; technicians with the relevant professional knowledge can implement the present invention through the above embodiments. Accordingly, any variation, modification, or improvement made within the spirit and principles of the present invention is covered by the scope of the claims. That is, the above embodiments only illustrate, and do not restrict, the technical solution of the present invention; although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or equivalently replaced without departing from its spirit and scope.
Claims (6)
1. A modeling method for an acoustic model used in speech recognition, characterized in that the method comprises:
training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model being trained through the expectation-maximization (EM) algorithm while obtaining the state transition probabilities of the tri-phone states;
force-aligning the phonetic features of the training data based on the HMM-GMM model to obtain frame-level tri-phone state information for the phonetic features;
pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and
training the deep neural network with the error back-propagation algorithm based on the frame-level state information of the phonetic features of the training data, updating the weights of each hidden layer.
2. The modeling method of claim 1, characterized in that force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
3. The modeling method of claim 1, characterized in that pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
4. A modeling system for an acoustic model used in speech recognition, characterized in that the modeling system comprises:
a first module for training a hidden Markov model-Gaussian mixture (HMM-GMM) model with training data, the modeling unit of the HMM-GMM model being the tri-phone state obtained after the phonetic features of the training data are clustered with a phoneme decision tree, the HMM-GMM model obtaining the state transition probabilities of the tri-phone states through the expectation-maximization (EM) algorithm;
a second module for force-aligning the phonetic features of the training data based on the HMM-GMM model to obtain frame-level tri-phone state information for the phonetic features;
a third module for pre-training a deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer of the deep network; and
a fourth module for training the deep neural network with the error back-propagation algorithm based on the frame-level state information of the phonetic features of the training data, updating the weights of each hidden layer.
5. The modeling system of claim 4, characterized in that the second module force-aligning the tri-phone states of the phonetic features of the training data based on the HMM-GMM model to obtain the frame-level state information specifically comprises: the second module, based on the HMM-GMM model, associating each frame of the phonetic features of the training data with its most probable tri-phone state, thereby obtaining the frame-level state information.
6. The modeling system of claim 4, characterized in that the third module pre-training the deep neural network serving as the acoustic model to obtain parameters for initializing the weights of each hidden layer specifically comprises: the third module training restricted Boltzmann machines layer by layer to convergence on the training data, and initializing the weights of each hidden layer of the deep network with the parameters obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310020010.7A CN103117060B (en) | 2013-01-18 | 2013-01-18 | Modeling method and modeling system for an acoustic model for speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103117060A true CN103117060A (en) | 2013-05-22 |
CN103117060B CN103117060B (en) | 2015-10-28 |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103345656A (en) * | 2013-07-17 | 2013-10-09 | 中国科学院自动化研究所 | Method and device for data identification based on multitask deep neural network |
CN103514879A (en) * | 2013-09-18 | 2014-01-15 | 广东欧珀移动通信有限公司 | Local voice recognition method based on BP neural network |
CN103680496A (en) * | 2013-12-19 | 2014-03-26 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN103839211A (en) * | 2014-03-23 | 2014-06-04 | 合肥新涛信息科技有限公司 | Medical history transferring system based on voice recognition |
CN103839546A (en) * | 2014-03-26 | 2014-06-04 | 合肥新涛信息科技有限公司 | Voice recognition system based on Yangze river and Huai river language family |
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
CN104347066A (en) * | 2013-08-09 | 2015-02-11 | 盛乐信息技术(上海)有限公司 | Deep neural network-based baby cry identification method and system |
CN104376842A (en) * | 2013-08-12 | 2015-02-25 | 清华大学 | Neural network language model training method and device and voice recognition method |
CN104575497A (en) * | 2013-10-28 | 2015-04-29 | 中国科学院声学研究所 | Method for building acoustic model and speech decoding method based on acoustic model |
CN105229676A (en) * | 2013-05-23 | 2016-01-06 | 国立研究开发法人情报通信研究机构 | The learning device of the learning method of deep-neural-network and learning device and category independently sub-network |
CN105654955A (en) * | 2016-03-18 | 2016-06-08 | 华为技术有限公司 | Voice recognition method and device |
CN105745700A (en) * | 2013-11-27 | 2016-07-06 | 国立研究开发法人情报通信研究机构 | Statistical-acoustic-model adaptation method, acoustic-model learning method suitable for statistical-acoustic-model adaptation, storage medium in which parameters for building deep neural network are stored, and computer program for adapting statistical acoustic model |
CN105761720A (en) * | 2016-04-19 | 2016-07-13 | 北京地平线机器人技术研发有限公司 | Interaction system based on voice attribute classification, and method thereof |
CN105874530A (en) * | 2013-10-30 | 2016-08-17 | 格林伊登美国控股有限责任公司 | Predicting recognition quality of a phrase in automatic speech recognition systems |
CN105960672A (en) * | 2014-09-09 | 2016-09-21 | 微软技术许可有限责任公司 | Variable-component deep neural network for robust speech recognition |
CN106023995A (en) * | 2015-08-20 | 2016-10-12 | 漳州凯邦电子有限公司 | Voice recognition method and wearable voice control device using the method |
CN106157953A (en) * | 2015-04-16 | 2016-11-23 | 科大讯飞股份有限公司 | continuous speech recognition method and system |
CN106297773A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | A kind of neutral net acoustic training model method |
CN106611599A (en) * | 2015-10-21 | 2017-05-03 | 展讯通信(上海)有限公司 | Voice recognition method and device based on artificial neural network and electronic equipment |
CN106652999A (en) * | 2015-10-29 | 2017-05-10 | 三星Sds株式会社 | System and method for voice recognition |
CN106782504A (en) * | 2016-12-29 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN106782511A (en) * | 2016-12-22 | 2017-05-31 | 太原理工大学 | Amendment linear depth autoencoder network audio recognition method |
CN106816147A (en) * | 2017-01-25 | 2017-06-09 | 上海交通大学 | Speech recognition system based on binary neural network acoustic model |
WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
CN107112006A (en) * | 2014-10-02 | 2017-08-29 | 微软技术许可有限责任公司 | Neural-network-based speech processing |
CN107680582A (en) * | 2017-07-28 | 2018-02-09 | 平安科技(深圳)有限公司 | Acoustic model training method, speech recognition method, apparatus, device and medium |
CN108111335A (en) * | 2017-12-04 | 2018-06-01 | 华中科技大学 | Method and system for scheduling and chaining virtual network functions |
CN108109615A (en) * | 2017-12-21 | 2018-06-01 | 内蒙古工业大学 | Construction and application method of a DNN-based Mongolian acoustic model |
CN108346423A (en) * | 2017-01-23 | 2018-07-31 | 北京搜狗科技发展有限公司 | Processing method and apparatus for a speech synthesis model |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | Voice endpoint detection method and speech recognition method |
CN108648747A (en) * | 2018-03-21 | 2018-10-12 | 清华大学 | Language recognition system |
CN109215637A (en) * | 2017-06-30 | 2019-01-15 | 三星Sds株式会社 | Speech recognition method |
CN109326277A (en) * | 2018-12-05 | 2019-02-12 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN109545201A (en) * | 2018-12-15 | 2019-03-29 | 中国人民解放军战略支援部队信息工程大学 | Construction method of an acoustic model based on deep mixed factor analysis |
CN109741735A (en) * | 2017-10-30 | 2019-05-10 | 阿里巴巴集团控股有限公司 | Modeling method, and acoustic model acquisition method and device |
CN109975762A (en) * | 2017-12-28 | 2019-07-05 | 中国科学院声学研究所 | Underwater sound source localization method |
CN110070855A (en) * | 2018-01-23 | 2019-07-30 | 中国科学院声学研究所 | Speech recognition system and method based on a transfer neural network acoustic model |
US10452995B2 (en) | 2015-06-29 | 2019-10-22 | Microsoft Technology Licensing, Llc | Machine learning classification on hardware accelerators with stacked memory |
CN110459216A (en) * | 2019-08-14 | 2019-11-15 | 桂林电子科技大学 | Restaurant card-swiping device with speech recognition and method of use |
US10540588B2 (en) | 2015-06-29 | 2020-01-21 | Microsoft Technology Licensing, Llc | Deep neural network processing on hardware accelerators with stacked memory |
US10606651B2 (en) | 2015-04-17 | 2020-03-31 | Microsoft Technology Licensing, Llc | Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit |
CN112259089A (en) * | 2019-07-04 | 2021-01-22 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN113450786A (en) * | 2020-03-25 | 2021-09-28 | 阿里巴巴集团控股有限公司 | Network model obtaining method, information processing method, device and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5317673A (en) * | 1992-06-22 | 1994-05-31 | Sri International | Method and apparatus for context-dependent estimation of multiple probability distributions of phonetic classes with multilayer perceptrons in a speech recognition system |
CN1427368A (en) * | 2001-12-19 | 2003-07-02 | 中国科学院自动化研究所 | Speaker-independent speech recognition method for palmtop computers |
CN1588536A (en) * | 2004-09-29 | 2005-03-02 | 上海交通大学 | State structure adjustment method in speech recognition |
CN101740024A (en) * | 2008-11-19 | 2010-06-16 | 中国科学院自动化研究所 | Automatic evaluation method based on generalized spoken-language fluency |
CN102411931A (en) * | 2010-09-15 | 2012-04-11 | 微软公司 | Deep belief network for large vocabulary continuous speech recognition |
CN102693723A (en) * | 2012-04-01 | 2012-09-26 | 北京安慧音通科技有限责任公司 | Method and device for recognizing speaker-independent isolated word based on subspace |
- 2013-01-18 CN CN201310020010.7A patent/CN103117060B/en not_active Expired - Fee Related
Non-Patent Citations (4)
Title |
---|
Ibrahim M. M. El-Emary, Mohamed Fezari and Hamza Attoui: "Hidden Markov model/Gaussian mixture models (HMM/GMM) based voice command system: a way to improve the control of remotely operated robot arm TR45", Scientific Research and Essays * |
Poonam Bansal, Anuj Kant, Sumit Kumar, Akash Sharda, Shitij Gupt: "Improved hybrid model of HMM/GMM for speech recognition", International Conference "Intelligent Information and Engineering Systems" INFOS 2008, Varna, Bulgaria, June-July 2008 * |
Ni Chongjia, Liu Wenju, Xu Bo: "Research on prosody-dependent Chinese speech recognition systems", Application Research of Computers * |
Huang Hao, Li Binghu, Wushour Silamu: "Decision-tree-based acoustic context modeling method in discriminative model combination", Acta Automatica Sinica * |
Cited By (69)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105229676A (en) * | 2013-05-23 | 2016-01-06 | 国立研究开发法人情报通信研究机构 | Learning method and learning apparatus for deep neural networks, and category-independent sub-network learning apparatus |
US9691020B2 (en) | 2013-05-23 | 2017-06-27 | National Institute Of Information And Communications Technology | Deep neural network learning method and apparatus, and category-independent sub-network learning apparatus |
CN105229676B (en) * | 2013-05-23 | 2018-11-23 | 国立研究开发法人情报通信研究机构 | The learning method and learning device of deep-neural-network |
CN103345656B (en) * | 2013-07-17 | 2016-01-20 | 中国科学院自动化研究所 | Data identification method and device based on multitask deep neural network |
CN103345656A (en) * | 2013-07-17 | 2013-10-09 | 中国科学院自动化研究所 | Method and device for data identification based on multitask deep neural network |
CN104347066A (en) * | 2013-08-09 | 2015-02-11 | 盛乐信息技术(上海)有限公司 | Deep neural network-based baby cry identification method and system |
CN104347066B (en) * | 2013-08-09 | 2019-11-12 | 上海掌门科技有限公司 | Deep-neural-network-based baby cry recognition method and system |
CN104376842A (en) * | 2013-08-12 | 2015-02-25 | 清华大学 | Neural network language model training method and device and voice recognition method |
CN103514879A (en) * | 2013-09-18 | 2014-01-15 | 广东欧珀移动通信有限公司 | Local voice recognition method based on BP neural network |
CN104575497A (en) * | 2013-10-28 | 2015-04-29 | 中国科学院声学研究所 | Method for building acoustic model and speech decoding method based on acoustic model |
CN104575497B (en) * | 2013-10-28 | 2017-10-03 | 中国科学院声学研究所 | Acoustic model establishing method and speech decoding method based on the model |
US10319366B2 (en) | 2013-10-30 | 2019-06-11 | Genesys Telecommunications Laboratories, Inc. | Predicting recognition quality of a phrase in automatic speech recognition systems |
CN105874530A (en) * | 2013-10-30 | 2016-08-17 | 格林伊登美国控股有限责任公司 | Predicting recognition quality of a phrase in automatic speech recognition systems |
CN105874530B (en) * | 2013-10-30 | 2020-03-03 | 格林伊登美国控股有限责任公司 | Predicting phrase recognition quality in an automatic speech recognition system |
CN105745700A (en) * | 2013-11-27 | 2016-07-06 | 国立研究开发法人情报通信研究机构 | Statistical-acoustic-model adaptation method, acoustic-model learning method suitable for statistical-acoustic-model adaptation, storage medium in which parameters for building deep neural network are stored, and computer program for adapting statistical acoustic model |
CN105745700B (en) * | 2013-11-27 | 2019-11-01 | 国立研究开发法人情报通信研究机构 | Adaptation method and learning method for statistical acoustic models, and recording medium |
CN103680496B (en) * | 2013-12-19 | 2016-08-10 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN103680496A (en) * | 2013-12-19 | 2014-03-26 | 百度在线网络技术(北京)有限公司 | Deep-neural-network-based acoustic model training method, hosts and system |
CN103839211A (en) * | 2014-03-23 | 2014-06-04 | 合肥新涛信息科技有限公司 | Medical history transferring system based on voice recognition |
CN103839546A (en) * | 2014-03-26 | 2014-06-04 | 合肥新涛信息科技有限公司 | Voice recognition system based on Yangtze river and Huai river language family |
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
CN105960672A (en) * | 2014-09-09 | 2016-09-21 | 微软技术许可有限责任公司 | Variable-component deep neural network for robust speech recognition |
CN105960672B (en) * | 2014-09-09 | 2019-11-26 | 微软技术许可有限责任公司 | Variable-component deep neural network for robust speech recognition |
CN107112006A (en) * | 2014-10-02 | 2017-08-29 | 微软技术许可有限责任公司 | Neural-network-based speech processing |
CN106157953B (en) * | 2015-04-16 | 2020-02-07 | 科大讯飞股份有限公司 | Continuous speech recognition method and system |
CN106157953A (en) * | 2015-04-16 | 2016-11-23 | 科大讯飞股份有限公司 | Continuous speech recognition method and system |
US10606651B2 (en) | 2015-04-17 | 2020-03-31 | Microsoft Technology Licensing, Llc | Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit |
CN106297773B (en) * | 2015-05-29 | 2019-11-19 | 中国科学院声学研究所 | Neural network acoustic model training method |
CN106297773A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | Neural network acoustic model training method |
US10452995B2 (en) | 2015-06-29 | 2019-10-22 | Microsoft Technology Licensing, Llc | Machine learning classification on hardware accelerators with stacked memory |
US10540588B2 (en) | 2015-06-29 | 2020-01-21 | Microsoft Technology Licensing, Llc | Deep neural network processing on hardware accelerators with stacked memory |
CN106023995A (en) * | 2015-08-20 | 2016-10-12 | 漳州凯邦电子有限公司 | Voice recognition method and wearable voice control device using the method |
CN106611599A (en) * | 2015-10-21 | 2017-05-03 | 展讯通信(上海)有限公司 | Voice recognition method and device based on artificial neural network and electronic equipment |
CN106652999A (en) * | 2015-10-29 | 2017-05-10 | 三星Sds株式会社 | System and method for voice recognition |
CN106940998A (en) * | 2015-12-31 | 2017-07-11 | 阿里巴巴集团控股有限公司 | Method and device for executing a setting operation |
WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
CN105654955A (en) * | 2016-03-18 | 2016-06-08 | 华为技术有限公司 | Voice recognition method and device |
CN105654955B (en) * | 2016-03-18 | 2019-11-12 | 华为技术有限公司 | Speech recognition method and device |
CN105761720B (en) * | 2016-04-19 | 2020-01-07 | 北京地平线机器人技术研发有限公司 | Interactive system and method based on voice attribute classification |
CN105761720A (en) * | 2016-04-19 | 2016-07-13 | 北京地平线机器人技术研发有限公司 | Interaction system based on voice attribute classification, and method thereof |
CN106782511A (en) * | 2016-12-22 | 2017-05-31 | 太原理工大学 | Rectified-linear deep autoencoder network speech recognition method |
CN106782504A (en) * | 2016-12-29 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device |
CN108346423A (en) * | 2017-01-23 | 2018-07-31 | 北京搜狗科技发展有限公司 | Processing method and apparatus for a speech synthesis model |
CN106816147A (en) * | 2017-01-25 | 2017-06-09 | 上海交通大学 | Speech recognition system based on binary neural network acoustic model |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | Voice endpoint detection method and speech recognition method |
CN109215637A (en) * | 2017-06-30 | 2019-01-15 | 三星Sds株式会社 | Speech recognition method |
CN109215637B (en) * | 2017-06-30 | 2023-09-01 | 三星Sds株式会社 | Speech recognition method |
WO2019019252A1 (en) * | 2017-07-28 | 2019-01-31 | 平安科技(深圳)有限公司 | Acoustic model training method, speech recognition method and apparatus, device and medium |
CN107680582B (en) * | 2017-07-28 | 2021-03-26 | 平安科技(深圳)有限公司 | Acoustic model training method, voice recognition method, device, equipment and medium |
CN107680582A (en) * | 2017-07-28 | 2018-02-09 | 平安科技(深圳)有限公司 | Acoustic model training method, speech recognition method, apparatus, device and medium |
US11030998B2 (en) | 2017-07-28 | 2021-06-08 | Ping An Technology (Shenzhen) Co., Ltd. | Acoustic model training method, speech recognition method, apparatus, device and medium |
CN109741735A (en) * | 2017-10-30 | 2019-05-10 | 阿里巴巴集团控股有限公司 | Modeling method, and acoustic model acquisition method and device |
CN109741735B (en) * | 2017-10-30 | 2023-09-01 | 阿里巴巴集团控股有限公司 | Modeling method, acoustic model acquisition method and acoustic model acquisition device |
CN108111335B (en) * | 2017-12-04 | 2019-07-23 | 华中科技大学 | Method and system for scheduling and chaining virtual network functions |
CN108111335A (en) * | 2017-12-04 | 2018-06-01 | 华中科技大学 | Method and system for scheduling and chaining virtual network functions |
CN108109615A (en) * | 2017-12-21 | 2018-06-01 | 内蒙古工业大学 | Construction and application method of a DNN-based Mongolian acoustic model |
CN109975762A (en) * | 2017-12-28 | 2019-07-05 | 中国科学院声学研究所 | Underwater sound source localization method |
CN109975762B (en) * | 2017-12-28 | 2021-05-18 | 中国科学院声学研究所 | Underwater sound source positioning method |
CN110070855B (en) * | 2018-01-23 | 2021-07-23 | 中国科学院声学研究所 | Voice recognition system and method based on migrating neural network acoustic model |
CN110070855A (en) * | 2018-01-23 | 2019-07-30 | 中国科学院声学研究所 | Speech recognition system and method based on a transfer neural network acoustic model |
CN108648747B (en) * | 2018-03-21 | 2020-06-02 | 清华大学 | Language identification system |
CN108648747A (en) * | 2018-03-21 | 2018-10-12 | 清华大学 | Language recognition system |
CN109326277A (en) * | 2018-12-05 | 2019-02-12 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN109326277B (en) * | 2018-12-05 | 2022-02-08 | 四川长虹电器股份有限公司 | Semi-supervised phoneme forced alignment model establishing method and system |
CN109545201B (en) * | 2018-12-15 | 2023-06-06 | 中国人民解放军战略支援部队信息工程大学 | Construction method of acoustic model based on deep mixing factor analysis |
CN109545201A (en) * | 2018-12-15 | 2019-03-29 | 中国人民解放军战略支援部队信息工程大学 | Construction method of an acoustic model based on deep mixed factor analysis |
CN112259089A (en) * | 2019-07-04 | 2021-01-22 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN110459216A (en) * | 2019-08-14 | 2019-11-15 | 桂林电子科技大学 | Restaurant card-swiping device with speech recognition and method of use |
CN113450786A (en) * | 2020-03-25 | 2021-09-28 | 阿里巴巴集团控股有限公司 | Network model obtaining method, information processing method, device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN103117060B (en) | 2015-10-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103117060A (en) | Modeling approach and modeling system of acoustic model used in speech recognition | |
CN112509564B (en) | End-to-end speech recognition method based on connectionist temporal classification and self-attention mechanism | |
CN108172218B (en) | Voice modeling method and device | |
KR101415534B1 (en) | Multi-stage speech recognition apparatus and method | |
CN104681036B (en) | Language audio detection system and method | |
CN103400577B (en) | Acoustic model establishing method and device for multilingual speech recognition | |
CN103065620B (en) | Method for receiving user text input on a mobile phone or webpage and synthesizing it into personalized speech in real time | |
US20160284347A1 (en) | Processing audio waveforms | |
US10714076B2 (en) | Initialization of CTC speech recognition with standard HMM | |
CN108281137A (en) | Universal speech wake-up recognition method and system under a whole-phoneme framework | |
CN104036774A (en) | Method and system for recognizing Tibetan dialects | |
WO2019019252A1 (en) | Acoustic model training method, speech recognition method and apparatus, device and medium | |
CN104575497B (en) | Acoustic model establishing method and speech decoding method based on the model | |
CN109887484A (en) | Speech recognition and speech synthesis method and device based on paired-associate learning | |
CN107767861A (en) | Voice wake-up method, system and intelligent terminal | |
CN107093422B (en) | Voice recognition method and voice recognition system | |
CN111445898B (en) | Language identification method and device, electronic equipment and storage medium | |
CN102810311B (en) | Speaker estimation method and speaker estimation equipment | |
CN111833845A (en) | Multi-language speech recognition model training method, device, equipment and storage medium | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
WO2017177484A1 (en) | Voice recognition-based decoding method and device | |
CN102945673A (en) | Continuous speech recognition method with dynamically variable speech command range | |
Kapralova et al. | A big data approach to acoustic model training corpus selection | |
Ferrer et al. | Spoken language recognition based on senone posteriors. | |
CN102521402B (en) | Text filtering system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20151028 |