CN113421555B - BN-SGMM-HMM-based low-resource voice recognition method - Google Patents

BN-SGMM-HMM-based low-resource voice recognition method

Info

Publication number
CN113421555B
CN113421555B (Application CN202110897247.8A)
Authority
CN
China
Prior art keywords
training
model
hmm
sgmm
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110897247.8A
Other languages
Chinese (zh)
Other versions
CN113421555A (en)
Inventor
Zhao Hongliang (赵宏亮)
Lei Jie (雷杰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University filed Critical Liaoning University
Priority to CN202110897247.8A priority Critical patent/CN113421555B/en
Publication of CN113421555A publication Critical patent/CN113421555A/en
Application granted granted Critical
Publication of CN113421555B publication Critical patent/CN113421555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

A low-resource voice recognition method based on BN-SGMM-HMM combines bottleneck features trained by a neural network with a subspace Gaussian mixture model to form a baseline system under low-resource conditions, yielding a BN-SGMM-HMM acoustic model, and ports the model to a Raspberry Pi to complete the voice recognition task.

Description

BN-SGMM-HMM-based low-resource voice recognition method
Technical Field
The invention relates to the field of voice recognition, in particular to a low-resource voice recognition method based on BN-SGMM-HMM.
Background
The existing speech recognition generally adopts the following methods:
Method 1: traditional MFCC features are used as input and a GMM-HMM is used as the acoustic model. Because the GMM-HMM is a shallow model, a system trained this way cannot exploit multi-layer structure and back-propagation the way a neural network can, so its recognition rate is usually lower than that of acoustic models refined with a neural network; in addition, the GMM-HMM is not suited to application scenarios where the corpus is scarce (i.e. low-resource conditions).
Method 2: the traditional DNN-HMM model involves a huge amount of computation because of the deep, complex structure of the DNN. Training with a bottleneck network reduces this cost, because the layers after the bottleneck layer are removed when the bottleneck features are extracted, so the nonlinear operations behind the bottleneck layer are no longer needed. Compared with traditional MFCC (Mel-frequency cepstral coefficient) features, the extracted bottleneck features benefit from cross-entropy training and inherit advantages of DNN features such as modeling the long-term correlation of speech and giving a compact representation, and the recognition rate of a baseline system trained on bottleneck features is more favorable than that of the DNN-HMM.
Disclosure of Invention
The invention aims to solve the above technical problems by providing a low-resource voice recognition method based on BN-SGMM-HMM, which trains the model on an open-source Chinese corpus and finally realizes it on a Raspberry Pi hardware platform with a good recognition effect.
In order to achieve the above purpose, the invention adopts the following technical scheme: the low-resource voice recognition method based on BN-SGMM-HMM is characterized by comprising the following steps:
1) Preprocessing and feature extraction of training data: the original database is configured and partitioned into sets, and feature extraction is then carried out to obtain MFCC features;
2) Creating a single-phoneme acoustic model:
3) Creating a triphone acoustic model: obtaining FMLLR characteristics;
4) Training a neural network: the FMLLR features are used as input features of a bottleneck neural network; after the network is trained, the layers after the bottleneck layer are removed, and the bottleneck layer is used as the output layer to extract the cross-entropy-trained bottleneck features;
5) Training of BN-SGMM-HMM baseline system: taking bottleneck characteristics extracted from the neural network as input characteristics of an SGMM-HMM acoustic model, and finally forming a BN-SGMM-HMM baseline system;
6) Hardware implementation: Kaldi is compiled on a virtual machine, and the compiled files are stored on the Raspberry Pi; the environment variables of the current terminal are updated; finally, it is confirmed whether the Raspberry Pi cross-compilation environment has been configured;
7) The trained acoustic model file, language model word-network file and dictionary file are transplanted to the Raspberry Pi; speech is input and decoded by Kaldi's built-in decoder, and the recognized text is finally output to the terminal.
In the step 1), the specific method comprises the following steps:
1.1 Preparing an original corpus, and setting a path of the corpus in a training script;
1.2 Executing a data preparation script, dividing the data into a training set, a test set and a development set, and generating the mapping relation between speaker number and speech, the gender of the speaker, and related information of the original speech files;
1.3 After the related information is generated, preparing a dictionary and a corresponding phoneme model, and finishing the data preparation;
1.4 Feature extraction is carried out on the speech signals; the extraction covers the training set, development set and test set, and the executed scripts are steps/make_mfcc.sh and compute_cmvn_stats.sh;
1.5 In make_mfcc.sh, pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first-order and second-order difference computation to extract dynamic features are applied to transform the original speech into feature vectors;
1.6 After the features are obtained, the compute_cmvn_stats.sh file is executed to apply cepstral mean and variance normalization to the acoustic features, completing the feature extraction part.
In the step 2), the specific method comprises the following steps:
2.1 Using the previously extracted MFCC features to initialize the GMM models for the monophones;
2.2 The model training is iterated by adopting an E-M algorithm, and data alignment is carried out;
2.3 The alignment model obtained by the last training is iterated again until the model converges.
In the step 3), the specific method is:
3.1 Training is performed on the basis of the aligned monophone models; at the same time the corpus is processed to generate a language dictionary file, a speech path file, a speech-to-speaker mapping file and a phoneme file for each speech segment; the triphone models are pooled for similarity clustering and cutting, so that triphones with similar pronunciation are clustered into one model and share parameters; the triphone model is then trained by the same method used for the monophone model;
3.2 Feature transformation is performed, including linear discriminant analysis, maximum likelihood linear transformation and speaker adaptive training; the speaker adaptive training is based on FMLLR, and the obtained FMLLR features are used for the subsequent neural network training;
In the step 4), the specific method comprises the following steps:
4.1 Before the neural network is formally trained, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form a DBN model; in Kaldi, neural network pre-training is accomplished by running pretrain_dbn.sh; during pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm combined with mini-batch stochastic gradient descent, the mini-batch size is 256, the momentum factor is set to 0.9, and no weight decay is used; the first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the subsequent RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden-layer nodes in the experiments is 1024;
4.2 During fine-tuning, the back-propagation algorithm is adopted with an initial learning rate of 0.008 and the Sigmoid activation function; during training the learning rate is kept unchanged for the first 10 epochs, and from the 11th epoch it is halved in each training epoch, for at most 30 iterations in total; training is stopped when the difference in learning rate between two iterations is smaller than 0.001, and if this condition is not met after 30 iterations training is forcibly stopped; the neural network consists of 1 input layer, 5 hidden layers and 1 output layer, 7 layers in total, where the input layer has 440 nodes, the hidden layers have 1024 nodes each except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system;
4.3 After the network training is finished, the bottleneck features are extracted: the bottleneck (BN) layer is configured in the training script and the number of bottleneck-layer nodes bn_dim is set, then the make_bn_features.sh script is called; when the features have been extracted, the network behind the bottleneck layer is removed and the resulting bottleneck features are passed through CMVN; the extracted bottleneck features are used to build the SGMM-HMM baseline system.
In the step 5), the specific method is:
5.1 When training the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying, the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM is then trained with the E-M algorithm to adjust its parameters; to complete the initialized SGMM model, two final E-M training steps are performed: the first step produces a Viterbi state alignment from the GMM-HMM baseline, and the second step uses the Viterbi alignment to obtain the SGMM model; finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model.
The invention has the beneficial effects that:
1. The speech recognition method adopts a BN-SGMM-HMM model whose parameter scale is smaller than that of traditional speech recognition methods, so it can run on embedded speech devices with limited hardware resources.
2. The method is suitable for low-resource speech recognition conditions and achieves a good recognition effect even with a limited training corpus.
3. The method has low development cost: it uses an open-source Chinese corpus, saving the cost of purchasing a corpus or a speech recognition API, and the acoustic model is built on Kaldi, greatly reducing the development period and difficulty for developers.
4. The hardware implementation cost is low: the open-source hardware Raspberry Pi 4B is adopted, which, compared with other development boards and speech chips, reduces the development difficulty and period and saves chip tape-out costs.
Drawings
Fig. 1: MFCC feature extraction flow chart.
Fig. 2: a feature transformation flow chart.
Fig. 3: DBN training flow block diagram.
Fig. 4: Back propagation algorithm block diagram.
Fig. 5: bottleneck feature extraction graph.
Fig. 6: a flow chart of training of subspace gaussian mixture models.
Fig. 7: BN-SGMM-HMM training patterns.
Fig. 8: cross-compiling a flow chart.
Fig. 9: an online speech recognition system block diagram.
Fig. 10: an overall block diagram of a hardware implementation.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention.
A low-resource voice recognition method based on BN-SGMM-HMM comprises the following steps:
Early preparation of training data:
1. An original corpus is prepared, and the path of the corpus is set in the training script.
2. The data preparation script is executed, dividing the data into a training set (train), a test set (test) and a development set (dev), and generating related information such as the mapping between speaker number and speech, the gender of each speaker, and the original speech files.
3. After the related information is generated, the dictionary and the corresponding phoneme model are prepared. At this point the data preparation is complete.
4. Feature extraction is performed on the speech signals; the extraction covers the training set, development set and test set, and the executed scripts are steps/make_mfcc.sh and compute_cmvn_stats.sh.
5. In make_mfcc.sh, pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first-order and second-order difference computation for extracting dynamic features are performed. The purpose of the script is to convert the original speech into feature vectors, which not only extracts the key information of the original speech waveform but also reduces the amount of computation and removes redundant information.
6. After the features are obtained, compute_cmvn_stats.sh is executed, which applies cepstral mean and variance normalization (CMVN) to the acoustic features. The purpose of normalization is to make the input acoustic features conform to a normal distribution, thereby reducing the effect of noise on the speech. This completes the feature extraction part. The MFCC feature extraction flow is shown in fig. 1, and a simplified sketch of the computation is given below.
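The following Python sketch illustrates, under simplifying assumptions, the front-end computation described in steps 5 and 6: framing, Hamming windowing, FFT, Mel filtering, log compression, a DCT to cepstral coefficients, first- and second-order differences, and per-utterance CMVN. It is not the Kaldi code invoked by make_mfcc.sh and compute_cmvn_stats.sh; function and parameter names such as mel_filterbank and num_ceps are illustrative assumptions.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(num_filters, nfft, sample_rate):
    # Triangular Mel filterbank (illustrative, not Kaldi's exact variant)
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    hz = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, num_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((nfft + 1) * hz / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        if center > left:
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    return fbank

def mfcc_with_cmvn(signal, sample_rate=16000, frame_len=400, frame_shift=160,
                   nfft=512, num_filters=23, num_ceps=13):
    # Pre-emphasis (assumes the utterance is at least one frame long)
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing
    num_frames = 1 + (len(emph) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(num_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # FFT power spectrum -> Mel filtering -> log energy -> DCT cepstra
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft
    logmel = np.log(power @ mel_filterbank(num_filters, nfft, sample_rate).T + 1e-10)
    ceps = dct(logmel, type=2, axis=1, norm='ortho')[:, :num_ceps]
    # First- and second-order differences (dynamic features)
    delta = np.gradient(ceps, axis=0)
    feats = np.hstack([ceps, delta, np.gradient(delta, axis=0)])
    # Per-utterance cepstral mean and variance normalization (CMVN)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

In the actual system these operations are performed by the Kaldi scripts named above; the sketch only mirrors the order of operations.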
Single-phoneme model creation of acoustic model:
7. The previously extracted MFCC features are used to initialize the GMM models for the monophones.
8. Model training is iterated with the E-M algorithm, and the data are aligned.
9. The alignment model obtained from the previous training pass is iterated again until the model converges; a simplified sketch of one E-M update is given below.
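The sketch below illustrates one E-M update of a diagonal-covariance GMM on the frames currently aligned to a single HMM state, which is the core operation repeated during monophone training and realignment. It is a schematic re-implementation under simplifying assumptions, not Kaldi's estimation code; the class name DiagGMM is an assumption.

import numpy as np

class DiagGMM:
    # Diagonal-covariance GMM for one HMM state (illustrative)
    def __init__(self, weights, means, variances):
        self.w, self.mu, self.var = weights, means, variances   # shapes (K,), (K, D), (K, D)

    def log_likes(self, frames):
        # Per-component log(w_k * N(x; mu_k, diag(var_k))) for frames of shape (T, D)
        diff = frames[:, None, :] - self.mu[None, :, :]
        ll = -0.5 * (np.log(2.0 * np.pi * self.var)[None] + diff ** 2 / self.var[None]).sum(-1)
        return ll + np.log(self.w)[None]

    def em_step(self, frames, var_floor=1e-3):
        # E-step: posterior responsibility of each Gaussian component per frame
        ll = self.log_likes(frames)
        ll -= ll.max(axis=1, keepdims=True)
        gamma = np.exp(ll)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the accumulated statistics
        occ = gamma.sum(axis=0)
        self.w = occ / occ.sum()
        self.mu = (gamma.T @ frames) / occ[:, None]
        second = (gamma.T @ frames ** 2) / occ[:, None]
        self.var = np.maximum(second - self.mu ** 2, var_floor)

In practice, alignment assigns each frame to a state via Viterbi decoding before this per-state update is applied; alignment and re-estimation alternate until convergence.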
Three-phoneme model creation of acoustic model:
10. Because the actual pronunciation of a phoneme is influenced by its neighboring phonemes, especially through co-articulation within a word or sentence, a context-dependent acoustic model is introduced; among context-dependent models the best-performing is the triphone model, which is composed of <left neighboring phoneme + center phoneme + right neighboring phoneme>.
11. Triphone training is performed on the basis of the aligned monophone models and also requires the training data and documents such as the language dictionary. Meanwhile, to counter the explosion of model parameters caused by introducing triphones, the triphone models are pooled for similarity clustering and cutting, triphones with similar pronunciation are clustered into one model and share parameters, and the scale of the model parameters is effectively reduced. The process of training the triphone model after this is very similar to that of training the monophone model.
12. Meanwhile, to improve the recognition rate of the triphone model, feature transformation is carried out, consisting mainly of linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT) and speaker adaptive training (SAT). The speaker adaptive training is based on feature-space maximum likelihood linear regression (FMLLR), and the obtained FMLLR features are used for the subsequent neural network training. The feature transformation flow is shown in fig. 2, and a simplified LDA sketch is given below.
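As one illustration of the feature-transformation stage, the sketch below estimates an LDA projection from spliced, state-labeled frames using within-class and between-class scatter matrices. This is textbook LDA, not Kaldi's LDA+MLLT pipeline, and the output dimensionality of 40 is only an assumed example.

import numpy as np

def estimate_lda(feats, labels, out_dim=40):
    # feats: (N, D) spliced feature vectors; labels: (N,) aligned state ids
    mean_all = feats.mean(axis=0)
    d = feats.shape[1]
    within = np.zeros((d, d))
    between = np.zeros((d, d))
    for c in np.unique(labels):
        x = feats[labels == c]
        mu = x.mean(axis=0)
        within += (x - mu).T @ (x - mu)                  # within-class scatter
        diff = (mu - mean_all)[:, None]
        between += len(x) * (diff @ diff.T)              # between-class scatter
    # Leading eigenvectors of inv(S_w) @ S_b give the projection rows
    evals, evecs = np.linalg.eig(np.linalg.solve(within + 1e-6 * np.eye(d), between))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:out_dim]].T              # (out_dim, D) projection matrix

# Usage (hypothetical): lda = estimate_lda(spliced_feats, state_labels); projected = spliced_feats @ lda.T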
Training of neural networks
13. Before the neural network is formally trained, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form the DBN model. In Kaldi, neural network pre-training is accomplished by running pretrain_dbn.sh. During pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm combined with mini-batch stochastic gradient descent; the mini-batch size is 256, the momentum factor is set to 0.9, and no weight decay is used. The first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the subsequent RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08, 25 iterations, and 1024 hidden nodes in the experiments.
14. During fine-tuning, the back-propagation algorithm is adopted with an initial learning rate of 0.008 and the Sigmoid activation function. During training the learning rate is kept unchanged for the first 10 epochs; from the 11th epoch it is halved in each training epoch, for at most 30 iterations in total. Training is stopped when the difference in learning rate between two iterations is less than 0.001; if this condition is not met after 30 iterations, training is forcibly stopped. In this work the neural network consists of 1 input layer, 5 hidden layers and 1 output layer, 7 layers in total; the input layer has 440 nodes, the hidden layers have 1024 nodes each except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system. The DBN training flow diagram is shown in fig. 3 and the back-propagation algorithm diagram is shown in fig. 4; a CD-1 pre-training sketch is given below.
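The sketch below shows one contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM with the hyperparameters quoted above (1024 hidden nodes, learning rate 0.08, momentum 0.9, no weight decay, mini-batches of 256 frames). It is a schematic of the pre-training step, not the Kaldi nnet1 code; the class name BernoulliRBM is an assumption and is unrelated to any library class of the same name.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BernoulliRBM:
    # Minimal Bernoulli-Bernoulli RBM trained with CD-1 (illustrative only)
    def __init__(self, n_visible, n_hidden=1024, lr=0.08, momentum=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.bv = np.zeros(n_visible)
        self.bh = np.zeros(n_hidden)
        self.lr, self.momentum, self.rng = lr, momentum, rng
        self.vel = [np.zeros_like(p) for p in (self.W, self.bv, self.bh)]

    def cd1_update(self, v0):
        # Positive phase: hidden probabilities and a binary sample given the data
        h0_prob = sigmoid(v0 @ self.W + self.bh)
        h0 = (self.rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one Gibbs step of reconstruction
        v1_prob = sigmoid(h0 @ self.W.T + self.bv)
        h1_prob = sigmoid(v1_prob @ self.W + self.bh)
        n = v0.shape[0]
        grads = [(v0.T @ h0_prob - v1_prob.T @ h1_prob) / n,
                 (v0 - v1_prob).mean(axis=0),
                 (h0_prob - h1_prob).mean(axis=0)]
        # Momentum SGD update; no weight decay, matching the recipe above
        for i, (param, g) in enumerate(zip((self.W, self.bv, self.bh), grads)):
            self.vel[i] = self.momentum * self.vel[i] + self.lr * g
            param += self.vel[i]

# Usage (assumed): iterate cd1_update over mini-batches of 256 frames for the 25 iterations
# quoted above; the first layer would instead be a Gaussian-Bernoulli RBM with lr=0.005.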
15. After the network training is finished, the bottleneck features are extracted: the bottleneck (BN) layer is configured in the training script, the number of bottleneck-layer nodes bn_dim is set, and the make_bn_features.sh script is called; when the features have been extracted, the network behind the bottleneck layer is removed and the resulting bottleneck features are passed through CMVN. These features are later used to build the SGMM-HMM baseline system. Bottleneck feature extraction is shown in fig. 5, and a truncated-forward-pass sketch is given below.
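The following sketch conveys the idea of bottleneck feature extraction: run the trained network forward only up to the 40-node bottleneck layer (the fourth hidden layer in the structure above), never evaluate the layers behind it, and apply CMVN to the result. Variable names such as weights and biases are assumptions, and taking the bottleneck output before the nonlinearity is likewise an assumption about a recipe detail; this is a sketch, not make_bn_features.sh.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_bn_features(frames, weights, biases, bn_layer=4):
    # frames: (T, 440) spliced FMLLR input; weights/biases: per-layer parameter lists.
    # Only layers 1..bn_layer are evaluated; the layers after the bottleneck
    # (including the 2016-node softmax output) are never used.
    h = frames
    for i in range(bn_layer):
        h = h @ weights[i] + biases[i]
        if i < bn_layer - 1:
            h = sigmoid(h)           # hidden nonlinearities before the bottleneck
    bn = h                           # (T, 40) bottleneck activations (assumed extraction point)
    # CMVN before the features are fed to the SGMM-HMM system
    return (bn - bn.mean(axis=0)) / (bn.std(axis=0) + 1e-10)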
Training of BN-SGMM-HMM baseline system:
16. When the SGMM-HMM model is trained, an HMM model must first be trained to obtain the GMM-HMM state tying; the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM is then trained with the E-M algorithm to adjust its parameters. To complete the initialized SGMM model, two final E-M training steps are performed: the first step produces a Viterbi state alignment from the GMM-HMM baseline, and the second step uses the Viterbi alignment to obtain the SGMM model. Finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model. Each of the two steps is an iterative process, as shown in fig. 6.
17. When training the SGMM, a Gaussian clustering algorithm is first applied to the Gaussian mixtures of the acoustic model to generate the 400-component UBM. The E-M algorithm is then run on the generated UBM to obtain the SGMM model, and finally discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model, as shown in fig. 7. A sketch of the UBM clustering step is given below.
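The sketch below shows one plausible way to obtain a 400-component UBM from the diagonal Gaussians pooled over all tied states of the baseline acoustic model: weighted k-means over the component means followed by moment matching. The procedure and the function name cluster_to_ubm are illustrative assumptions; Kaldi's UBM initialization uses its own clustering.

import numpy as np

def cluster_to_ubm(weights, means, variances, num_ubm=400, iters=10, seed=0):
    # weights: (N,), means: (N, D), variances: (N, D), pooled from all tied states
    rng = np.random.default_rng(seed)
    centers = means[rng.choice(len(means), num_ubm, replace=False)].copy()
    for _ in range(iters):
        # Assign every source Gaussian to its nearest UBM center (by mean distance)
        d2 = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for k in range(num_ubm):
            sel = assign == k
            if sel.any():
                w = weights[sel][:, None]
                centers[k] = (w * means[sel]).sum(0) / w.sum()
    # Moment-match each cluster of Gaussians into a single diagonal Gaussian
    ubm_w, ubm_mu, ubm_var = np.zeros(num_ubm), centers.copy(), np.ones_like(centers)
    for k in range(num_ubm):
        sel = assign == k
        if not sel.any():
            continue
        w = weights[sel][:, None]
        ubm_w[k] = w.sum()
        mu = (w * means[sel]).sum(0) / w.sum()
        second = (w * (variances[sel] + means[sel] ** 2)).sum(0) / w.sum()
        ubm_mu[k], ubm_var[k] = mu, np.maximum(second - mu ** 2, 1e-3)
    return ubm_w / ubm_w.sum(), ubm_mu, ubm_var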
The hardware implementation:
18. Since Kaldi is generally developed on the PC side, it normally targets Intel's x86-64 instruction set, whereas the Raspberry Pi uses the ARM instruction set; if Kaldi were simply copied onto the flash memory card and compiled directly on the embedded platform, many dependency packages and compiled files would be missing. To keep the compilation process smooth, the Kaldi compilation is therefore performed on a virtual machine, and the final compiled files are stored on the Raspberry Pi.
19. In this patent, the PC virtual machine runs an Ubuntu 18.04 system and uses the official Raspberry Pi arm-linux-gnueabihf (x64 host) cross-compilation package. After the whole cross-compilation package has been moved into Ubuntu 18.04, the environment variables pointing to the cross-compilation package are added through the .bashrc file in Linux; once this is done, the environment variables of the current terminal are updated. Whether the Raspberry Pi cross-compilation environment has been configured is confirmed with the arm-linux-gnueabihf-gcc -v command. The cross-compilation flow diagram is shown in fig. 8; a small check script is sketched below.
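As a convenience, the environment check described above can be scripted. The sketch below is a hypothetical helper, not part of the patent's toolchain, that verifies the cross-compiler is reachable on the PATH by running arm-linux-gnueabihf-gcc -v.

import shutil
import subprocess

def check_cross_toolchain(gcc="arm-linux-gnueabihf-gcc"):
    # Returns True if the Raspberry Pi cross-compiler is installed and runnable
    if shutil.which(gcc) is None:
        print(f"{gcc} not found on PATH; check the environment variables set in .bashrc")
        return False
    result = subprocess.run([gcc, "-v"], capture_output=True, text=True)
    lines = result.stderr.strip().splitlines()      # 'gcc -v' reports its configuration on stderr
    if lines:
        print(lines[-1])                            # typically the 'gcc version ...' line
    return result.returncode == 0

if __name__ == "__main__":
    check_cross_toolchain()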
20. The trained acoustic model file, language model word-network file and dictionary file are transplanted to the Raspberry Pi 4B; these correspond to the Kaldi files 'final.mdl', 'HCLG.fst' and 'words.txt'. Speech is then input and decoded with Kaldi's built-in decoder, and the recognized text is finally output to the terminal. A block diagram of the online speech recognition system is shown in fig. 9, and an overall block diagram of the hardware implementation is shown in fig. 10.
Example 1:
The test set comprises 10 speakers in total; five sentences are provided, and each sentence is spoken 10 times. The numbers of inserted, deleted and substituted words in each sentence are counted, and the WER of each sentence is computed. The sentences tested are shown in Table 1.
Table 1: Test sentence content table
The statistical results are shown in Table 2: the WER of the speech recognition system remains essentially below 20%, with an average of about 14.92%. A sketch of a standard WER computation follows Table 2.
Table 2: Test sentence result statistics table

Claims (4)

1. The low-resource voice recognition method based on BN-SGMM-HMM is characterized by comprising the following steps:
1) Preprocessing and feature extraction of training data: the original database is configured and partitioned into sets, and feature extraction is then carried out to obtain MFCC features;
1.1 Preparing an original corpus, and setting a path of the corpus in a training script;
1.2 Executing a data preparation script, dividing the data into a training set, a test set and a development set, and generating the mapping relation between speaker number and speech, the gender of the speaker, and related information of the original speech files;
1.3 After the related information is generated, preparing a dictionary and a corresponding phoneme model, and finishing the data preparation;
1.4 Feature extraction is carried out on the speech signals; the extraction covers the training set, development set and test set, and the executed scripts are steps/make_mfcc.sh and compute_cmvn_stats.sh;
1.5 In make_mfcc.sh, pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first-order and second-order difference computation to extract dynamic features are applied to transform the original speech into feature vectors;
1.6 After the features are obtained, the compute_cmvn_stats.sh file is executed to apply cepstral mean and variance normalization to the acoustic features, completing the feature extraction part;
2) Creating a single-phoneme acoustic model:
2.1 Using the previously extracted MFCC features to initialize the GMM models for the monophones;
2.2 The model training is iterated by adopting an E-M algorithm, and data alignment is carried out;
2.3 Performing iteration on the alignment model obtained by the last training until the model converges;
3) Creating a triphone acoustic model: obtaining FMLLR characteristics;
4) Training a neural network: the FMLLR features are used as input features of a bottleneck neural network; after the network is trained, the layers after the bottleneck layer are removed, and the bottleneck layer is used as the output layer to extract the cross-entropy-trained bottleneck features;
5) Training of BN-SGMM-HMM: taking bottleneck characteristics extracted from the neural network as input characteristics of an SGMM-HMM acoustic model, and finally forming BN-SGMM-HMM;
6) Hardware implementation: Kaldi is compiled on a virtual machine, and the compiled files are stored on the Raspberry Pi; the environment variables of the current terminal are updated; finally, it is confirmed whether the Raspberry Pi cross-compilation environment has been configured;
7) The trained acoustic model file, language model word-network file and dictionary file are transplanted to the Raspberry Pi; speech is input and decoded by Kaldi's built-in decoder, and the recognized text is finally output to the terminal.
2. The BN-SGMM-HMM based low-resource speech recognition method according to claim 1, wherein in the step 3), the specific method is:
3.1 Training is performed on the basis of the aligned monophone models; at the same time the corpus is processed to generate a language dictionary file, a speech path file, a speech-to-speaker mapping file and a phoneme file for each speech segment; the triphone models are pooled for similarity clustering and cutting, so that triphones with similar pronunciation are clustered into one model and share parameters; the triphone model is then trained by the same method used for the monophone model;
3.2 Feature transformation is performed, including linear discriminant analysis, maximum likelihood linear transformation and speaker adaptive training; the speaker adaptive training is based on FMLLR, and the obtained FMLLR features are used for the subsequent neural network training.
3. The BN-SGMM-HMM based low-resource speech recognition method according to claim 1, wherein in the step 4), the specific method is:
4.1 Before the neural network is formally trained, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form a DBN model; in Kaldi, neural network pre-training is accomplished by running pretrain_dbn.sh; during pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm combined with mini-batch stochastic gradient descent, the mini-batch size is 256, the momentum factor is set to 0.9, and no weight decay is used; the first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the subsequent RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden-layer nodes in the experiments is 1024;
4.2 During fine-tuning, the back-propagation algorithm is adopted with an initial learning rate of 0.008 and the Sigmoid activation function; during training the learning rate is kept unchanged for the first 10 epochs, and from the 11th epoch it is halved in each training epoch, for at most 30 iterations in total; training is stopped when the difference in learning rate between two iterations is smaller than 0.001, and if this condition is not met after 30 iterations training is forcibly stopped; the neural network consists of 1 input layer, 5 hidden layers and 1 output layer, 7 layers in total, where the input layer has 440 nodes, the hidden layers have 1024 nodes each except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system;
4.3 After the network training is finished, the bottleneck features are extracted: the bottleneck (BN) layer is configured in the training script and the number of bottleneck-layer nodes bn_dim is set, then the make_bn_features.sh script is called; when the features have been extracted, the network behind the bottleneck layer is removed and the resulting bottleneck features are passed through CMVN; the extracted bottleneck features are used to build the SGMM-HMM baseline system.
4. The BN-SGMM-HMM based low-resource speech recognition method according to claim 1, wherein in the step 5), the specific method is:
5.1 When training the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying, the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM is then trained with the E-M algorithm to adjust its parameters; to complete the initialized SGMM model, two final E-M training steps are performed: the first step produces a Viterbi state alignment from the GMM-HMM baseline, and the second step uses the Viterbi alignment to obtain the SGMM model; finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model.
CN202110897247.8A 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method Active CN113421555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110897247.8A CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110897247.8A CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Publications (2)

Publication Number Publication Date
CN113421555A CN113421555A (en) 2021-09-21
CN113421555B (en) 2024-04-12

Family

ID=77718952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110897247.8A Active CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Country Status (1)

Country Link
CN (1) CN113421555B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2736042A1 (en) * 2012-11-23 2014-05-28 Samsung Electronics Co., Ltd Apparatus and method for constructing multilingual acoustic model and computer readable recording medium for storing program for performing the method
WO2015085197A1 (en) * 2013-12-05 2015-06-11 Nuance Communications, Inc. Method and apparatus for speech recognition using neural networks with speaker adaptation
CN109192199A (en) * 2018-06-30 2019-01-11 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of combination bottleneck characteristic acoustic model
CN109545201A (en) * 2018-12-15 2019-03-29 中国人民解放军战略支援部队信息工程大学 The construction method of acoustic model based on the analysis of deep layer hybrid cytokine

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Improved bottleneck features using pretrained deep neural networks; Yu D, Seltzer M L.; INTERSPEECH; full text *
Speech recognition system based on bottleneck features and SGMM models under low data resource conditions; Wu Weilan; Cai Meng; Tian Yao; Yang Xiaohao; Chen Zhenfeng; Liu Jia; Xia Shanhong; Journal of University of Chinese Academy of Sciences (No. 01); full text *
Research on the application of Kaldi-based AI speech recognition in embedded systems; Peng Yanzi; Bai Jie; Cao Bingyao; Song Yingxiong; Industrial Control Computer (No. 09); full text *
Speech recognition method with hybrid language models based on MTL-DNN system fusion; Fan Zhengguang; Qu Dan; Li Hua; Zhang Wenlin; Journal of Data Acquisition and Processing (No. 05); full text *

Also Published As

Publication number Publication date
CN113421555A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
Cooper et al. Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings
Xue et al. Fast adaptation of deep neural network based on discriminant codes for speech recognition
Fan et al. Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
Chauhan et al. Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large input database
US10629185B2 (en) Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for building deep neural network, and computer program for adapting statistical acoustic model
Reynolds et al. The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition
CN101777347B (en) Model complementary Chinese accent identification method and system
Garcia-Romero et al. Stacked Long-Term TDNN for Spoken Language Recognition.
Vadwala et al. Survey paper on different speech recognition algorithm: challenges and techniques
Liu et al. Improving unsupervised style transfer in end-to-end speech synthesis with end-to-end speech recognition
Wang et al. One-shot voice conversion using star-gan
Dumpala et al. Improved speaker recognition system for stressed speech using deep neural networks
Huang et al. Linear networks based speaker adaptation for speech synthesis
Maghsoodi et al. Speaker recognition with random digit strings using uncertainty normalized HMM-based i-vectors
Li et al. Multi-task learning of structured output layer bidirectional LSTMs for speech synthesis
Du et al. Noise-robust voice conversion with domain adversarial training
Shahamiri et al. An investigation towards speaker identification using a single-sound-frame
Hu et al. Fusion of global statistical and segmental spectral features for speech emotion recognition.
CN113421555B (en) BN-SGMM-HMM-based low-resource voice recognition method
Pan et al. Online speaker adaptation for LVCSR based on attention mechanism
CN111933121B (en) Acoustic model training method and device
Matassoni et al. DNN adaptation for recognition of children speech through automatic utterance selection
Dey et al. Content normalization for text-dependent speaker verification
Su Combining speech and speaker recognition: A joint modeling approach
Yamagishi et al. Improved average-voice-based speech synthesis using gender-mixed modeling and a parameter generation algorithm considering GV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant