CN113421555A - BN-SGMM-HMM-based low-resource speech recognition method


Info

Publication number
CN113421555A
CN113421555A (application CN202110897247.8A; granted publication CN113421555B)
Authority
CN
China
Prior art keywords
training
model
hmm
sgmm
layer
Prior art date
Legal status
Granted
Application number
CN202110897247.8A
Other languages
Chinese (zh)
Other versions
CN113421555B (en)
Inventor
赵宏亮
雷杰
Current Assignee
Liaoning University
Original Assignee
Liaoning University
Priority date
Filing date
Publication date
Application filed by Liaoning University filed Critical Liaoning University
Priority to CN202110897247.8A
Publication of CN113421555A
Application granted
Publication of CN113421555B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques where the extracted parameters are the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

A BN-SGMM-HMM-based low-resource speech recognition method. Under low-resource conditions, bottleneck features trained by a neural network are combined with a subspace Gaussian mixture model to form a baseline system, yielding a BN-SGMM-HMM acoustic model, which is then ported to a Raspberry Pi to perform the speech recognition task. Compared with a traditional speech recognition model, the model achieves a markedly higher recognition rate, has a smaller parameter scale than a traditional speech recognition system, is inexpensive to port to open-source hardware, and can be used without a network connection.

Description

BN-SGMM-HMM-based low-resource speech recognition method
Technical Field
The invention relates to the field of speech recognition, and in particular to a low-resource speech recognition method based on BN-SGMM-HMM.
Background
Existing speech recognition systems generally adopt one of the following methods:
Method 1: traditional MFCC features are used as input and a GMM-HMM serves as the acoustic model. Because the GMM-HMM is a shallow model, a speech model trained this way cannot perform the multi-layer, back-propagation style of training that a neural network can, so its recognition rate is usually lower than that of an acoustic model that incorporates a neural network; moreover, the GMM-HMM is not suited to corpus-starved (i.e., low-resource) application scenarios.
Method 2: in the traditional DNN-HMM model, the multi-layer structure of the DNN makes the amount of computation huge. Training with a bottleneck network reduces this cost, because the layers behind the bottleneck layer are removed when the bottleneck features are extracted, so the nonlinear operations behind the bottleneck layer are no longer needed. Compared with traditional MFCC features, the bottleneck features obtained through cross-entropy training inherit the advantages of DNN features, such as capturing the long-term correlations of speech and providing a compact representation, and a baseline system trained on bottleneck features achieves a better recognition rate than a plain DNN-HMM.
Disclosure of Invention
The invention aims to solve the technical problem of providing a BN-SGMM-HMM-based low-resource speech recognition method, which is trained on an open-source Chinese corpus and finally implemented, with good recognition performance, on a Raspberry Pi hardware platform.
In order to achieve this purpose, the invention adopts the following technical scheme: a BN-SGMM-HMM-based low-resource speech recognition method, characterized by comprising the following steps:
1) preprocessing and feature extraction of training data: the original database is configured and divided into sets, and feature extraction is then performed to obtain MFCC features;
2) creating a monophone acoustic model;
3) creating a triphone acoustic model and obtaining fMLLR features;
4) training a neural network: the fMLLR features are used as the input features of a bottleneck neural network; after the network is trained, the layers behind the bottleneck layer are removed and the bottleneck layer serves as the output layer, so that the bottleneck features are finally extracted after cross-entropy training;
5) training the BN-SGMM-HMM baseline system: the bottleneck features extracted by the neural network are used as the input features of the SGMM-HMM acoustic model, finally forming the BN-SGMM-HMM baseline system;
6) hardware implementation: Kaldi is compiled on a virtual machine and the compiled files are finally stored on a Raspberry Pi; the environment variables held by the current terminal are updated; finally, it is confirmed that the Raspberry Pi cross-compilation environment is configured;
7) the trained acoustic model file, the language model word-network file, and the dictionary file are ported to the Raspberry Pi; speech is input, decoded by Kaldi's own decoder, and the recognized text is finally output to the terminal.
In the step 1), the specific method is as follows:
1.1) prepare the original corpus and set the path of the corpus in the training script;
1.2) execute the data preparation script, divide the data into a training set, a test set, and a development set, and generate the mapping between speaker IDs and utterances, the speakers' gender, and related information about the original audio files;
1.3) after the related information is generated, begin preparing the dictionary and the corresponding phoneme models until the data preparation is finished;
1.4) perform feature extraction on the speech signals over the training, development, and test sets; the executed scripts are steps/make_mfcc.sh and steps/compute_cmvn_stats.sh;
1.5) in make_mfcc.sh, the original speech is converted into feature vectors through pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first- and second-order difference computation for extracting dynamic features;
1.6) after the features are obtained, execute the compute_cmvn_stats.sh file and normalize the acoustic features by cepstral mean and variance normalization, completing the feature extraction stage.
In the step 2), the specific method is as follows:
2.1) the previously extracted MFCC features are used to initialize the GMM model for monophones;
2.2) the EM algorithm is used to iterate the model training and align the data;
2.3) the alignment model obtained from the previous training round is iterated until the model converges.
In the step 3), the specific method is as follows:
3.1) training is performed on the basis of the aligned monophone models; at the same time the corpus is processed to generate a pronunciation dictionary file, an audio-path file, an utterance-to-speaker mapping file, and a phoneme file for each utterance; the triphone models are pooled for similarity-based clustering and pruning, triphone models with similar pronunciations are clustered into one model, and their parameters are shared; the triphone model is then trained using the same method as the monophone model;
3.2) feature transformation is carried out, including linear discriminant analysis, maximum likelihood linear transformation, and speaker adaptive training; the speaker adaptive training is based on fMLLR, and the resulting fMLLR features are used for the next step, neural network training;
In the step 4), the specific method is as follows:
4.1) before the formal training of the neural network, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form a DBN model; in Kaldi, the neural network pre-training is completed by executing pretrain_dbn.sh; during pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm and mini-batch stochastic gradient descent, the size of each mini-batch is 256, the momentum factor is set to 0.9, and no weight decay is used; the first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the remaining RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden nodes in the experiments is 1024;
4.2) the fine-tuning stage uses the back-propagation algorithm with an initial learning rate of 0.008 and the Sigmoid activation function; during training, the learning rate is kept unchanged for the first 10 epochs; from the 11th epoch on, the learning rate is halved in each training epoch, for 30 iterations in total, and training stops when the difference in learning rate between two iterations is less than 0.001; if this condition is still not met after 30 iterations, training is stopped forcibly; the neural network has 7 layers, namely 1 input layer, 5 hidden layers, and 1 output layer, where the input layer has 440 nodes, each hidden layer has 1024 nodes except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system;
4.3) after the network training is finished, the bottleneck features are extracted: a BN layer is configured during training and the bottleneck-layer node count bn_dim is set; the make_bn_feats.sh script is then called, the network behind the bottleneck layer is removed after the features are extracted, and the resulting bottleneck features are processed with CMVN.
In the step 5), the specific method is as follows:
5.1) to train the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying; the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM model is then trained with the EM algorithm to optimize its parameters; to complete the initialization of the SGMM model, two final EM training steps remain: first, the GMM-HMM baseline is aligned at the state level by the Viterbi algorithm; second, the SGMM model is obtained from the Viterbi alignment; finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model.
The beneficial effects of the invention are as follows:
1. The speech recognition method adopts the BN-SGMM-HMM model, whose parameter scale is smaller than that of traditional speech recognition methods, so the model can run on embedded speech devices with limited hardware resources.
2. The speech recognition method is suited to low-resource conditions and can achieve a good recognition result even when the training corpus is limited.
3. The method has a low development cost: adopting an open-source Chinese speech corpus saves the cost of purchasing a corpus and a speech recognition API, and building the acoustic model on Kaldi greatly shortens the development cycle and lowers the difficulty for developers.
4. The hardware implementation cost is low: the open-source Raspberry Pi 4B is used, which, compared with other development boards and dedicated speech chips, reduces the development difficulty and cycle and saves tape-out costs.
Drawings
FIG. 1: MFCC feature extraction flow chart.
FIG. 2: Feature transformation flow chart.
FIG. 3: DBN training flow block diagram.
FIG. 4: Back-propagation algorithm block diagram.
FIG. 5: Bottleneck feature extraction diagram.
FIG. 6: Subspace Gaussian mixture model training flow chart.
FIG. 7: BN-SGMM-HMM training diagram.
FIG. 8: Cross-compilation flow chart.
FIG. 9: Block diagram of the online speech recognition system.
FIG. 10: Block diagram of the hardware implementation.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings; obviously, the described embodiments are only some, and not all, of the embodiments of the invention.
A low-resource speech recognition method based on BN-SGMM-HMM comprises the following steps:
early-stage preparation of training data:
1. An original corpus is prepared and the path of the corpus is set in the training script.
2. Execute the data preparation script, divide the data into a training set (train), a test set (test), and a development set (dev), and generate the mapping between speaker IDs and utterances, the speakers' gender, the original audio files, and other related information.
3. After the related information is generated, prepare the dictionary and the corresponding phoneme models. The data is now ready for the feature extraction stage.
4. Perform feature extraction on the speech signals over the training, development, and test sets; the executed scripts are steps/make_mfcc.sh and steps/compute_cmvn_stats.sh.
5. In make_mfcc.sh, pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first- and second-order difference computation for extracting dynamic features are applied; the script converts the original speech into feature vectors, which retain the key information of the original waveform, reduce the amount of computation, and remove redundant information.
6. After the features are obtained, execute the compute_cmvn_stats.sh file; CMVN performs cepstral mean and variance normalization on the acoustic features. The purpose of the normalization is to make the input acoustic features conform to a normal distribution, thereby reducing the influence of noise on the speech. This completes the feature extraction. The MFCC feature extraction flow is shown in FIG. 1, and a command-line sketch is given below.
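As an illustration, this stage in a stock Kaldi recipe can be sketched as follows; the directory names (data/train, exp/make_mfcc, mfcc) are assumptions for illustration, not values taken from the patent.

```bash
# Sketch of feature extraction over the three sets; directory names are assumed.
for set in train dev test; do
  # Convert raw audio to MFCC vectors (pre-emphasis, framing, windowing,
  # FFT, Mel filtering, and log energy happen inside the script).
  steps/make_mfcc.sh --nj 4 --cmd "run.pl" data/$set exp/make_mfcc/$set mfcc
  # Per-speaker cepstral mean and variance normalization (CMVN) statistics.
  steps/compute_cmvn_stats.sh data/$set exp/make_mfcc/$set mfcc
done
```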
Monophone model creation for the acoustic model:
7. The previously extracted MFCC features are used to initialize the GMM model for monophones.
8. The EM algorithm is used to iterate the model training, and the data is aligned.
9. The alignment model obtained from the previous training round is iterated until the model converges; a minimal command sketch follows.
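A minimal sketch of this monophone stage with standard Kaldi scripts, under the same assumed directory layout:

```bash
# Flat-start monophone training; the EM iterations and periodic
# realignment described above are handled inside the script.
steps/train_mono.sh --nj 4 --cmd "run.pl" data/train data/lang exp/mono
# Align the training data with the converged monophone model so the
# triphone stage can start from these alignments.
steps/align_si.sh --nj 4 --cmd "run.pl" data/train data/lang exp/mono exp/mono_ali
```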
Triphone model creation for the acoustic model:
10. Since the actual pronunciation of a phoneme is affected by adjacent and similar phonemes, especially in co-articulated words or sentences, a context-dependent acoustic model is introduced; among context-dependent models, the most effective is the triphone model, constructed as <left adjacent phoneme + center phoneme + right adjacent phoneme>.
11. Triphone training must start from the aligned monophone models and also requires the training data and documents such as the pronunciation dictionary. Meanwhile, to counter the explosion in model parameters caused by introducing the triphone model, the triphone models are pooled for similarity-based clustering and pruning; triphone models with similar pronunciations are clustered into one model and share parameters, effectively reducing the scale of the model parameters. After that, the process of training the triphone model is very similar to that of training the monophone model.
12. Meanwhile, to improve the recognition rate of the triphone model, feature transformations are applied, mainly linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and speaker adaptive training (SAT). SAT is based on feature-space maximum likelihood linear regression (fMLLR), and the resulting fMLLR features are used for the next step, neural network training. The feature transformation flow is shown in FIG. 2, and a command sketch follows.
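The triphone and feature-transformation stages can be sketched with standard Kaldi scripts as below; the leaf/Gaussian counts (2000/10000, 2500/15000) and directory names are illustrative assumptions, not values from the patent.

```bash
# Context-dependent triphone training on the monophone alignments,
# with tree-based state clustering to share parameters.
steps/train_deltas.sh --cmd "run.pl" 2000 10000 data/train data/lang exp/mono_ali exp/tri1
steps/align_si.sh --nj 4 --cmd "run.pl" data/train data/lang exp/tri1 exp/tri1_ali
# LDA + MLLT feature transforms on top of the triphone system.
steps/train_lda_mllt.sh --cmd "run.pl" 2500 15000 data/train data/lang exp/tri1_ali exp/tri2
steps/align_si.sh --nj 4 --cmd "run.pl" data/train data/lang exp/tri2 exp/tri2_ali
# Speaker adaptive training (SAT), which estimates per-speaker fMLLR transforms.
steps/train_sat.sh --cmd "run.pl" 2500 15000 data/train data/lang exp/tri2_ali exp/tri3
# fMLLR alignments; the fMLLR features feed the bottleneck network next.
steps/align_fmllr.sh --nj 4 --cmd "run.pl" data/train data/lang exp/tri3 exp/tri3_ali
```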
Training of the neural network:
13. Before the formal training of the neural network, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form the DBN model. In Kaldi, the neural network pre-training is completed by executing pretrain_dbn.sh. During pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm and mini-batch stochastic gradient descent; the size of each mini-batch is 256, the momentum factor is set to 0.9, and no weight decay is used. The first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the remaining RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden nodes in the experiments is 1024.
14. The fine-tuning stage uses the back-propagation algorithm with an initial learning rate of 0.008 and the Sigmoid activation function. During training, the learning rate is kept unchanged for the first 10 epochs; from the 11th epoch on, the learning rate is halved in each training epoch, for 30 iterations in total. Training stops when the difference in learning rate between two iterations is less than 0.001; if this condition is still not met after 30 iterations, training is stopped forcibly. Here, the neural network has 7 layers, namely 1 input layer, 5 hidden layers, and 1 output layer; the input layer has 440 nodes, each hidden layer has 1024 nodes except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system. The DBN training flow is shown in FIG. 3 and the back-propagation algorithm in FIG. 4.
15. After the network training is finished, the bottleneck features are extracted. Specifically, a BN layer is configured during training and the bottleneck-layer node count bn_dim is set; the make_bn_feats.sh script is then called, the network behind the bottleneck layer is removed after the features are extracted, and the obtained bottleneck features are processed with CMVN. These features are used later to build the SGMM-HMM baseline system. The bottleneck feature extraction is shown in FIG. 5, and a command sketch follows.
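A sketch of this stage with Kaldi's nnet1 scripts is given below; the hyperparameters mirror the description where stated, while the directory names and exact option spellings (--dbn, --bn-dim, and so on, which vary across Kaldi versions) are assumptions.

```bash
# RBM pre-training: stack 5 RBMs of 1024 units into a DBN
# (first RBM Gaussian-Bernoulli, the rest Bernoulli-Bernoulli).
steps/nnet/pretrain_dbn.sh --nn-depth 5 --hid-dim 1024 \
  data-fmllr/train exp/dnn_pretrain
# Cross-entropy fine-tuning with a 40-node bottleneck layer and
# an initial learning rate of 0.008, initialized from the DBN.
steps/nnet/train.sh --hid-layers 5 --hid-dim 1024 --bn-dim 40 --learn-rate 0.008 \
  --dbn exp/dnn_pretrain/5.dbn \
  data-fmllr/train data-fmllr/dev data/lang exp/tri3_ali exp/tri3_ali_dev exp/dnn_bn
# Strip the layers behind the bottleneck and dump the BN features (then CMVN).
steps/nnet/make_bn_feats.sh --nj 4 data-bn/train data-fmllr/train \
  exp/dnn_bn exp/make_bn/train bn_feats
```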
Training of the BN-SGMM-HMM baseline system:
16. To train the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying; the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM model is then trained with the EM algorithm to optimize its parameters. To complete the initialization of the SGMM model, two final EM training steps remain: first, the GMM-HMM baseline is aligned at the state level by the Viterbi algorithm; second, the SGMM model is obtained from the Viterbi alignment. Finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model. Each of the above two steps is an iterative process, as shown in FIG. 6.
17. When the SGMM is trained, a Gaussian mixture clustering algorithm is first applied to the Gaussian components in the acoustic model to generate a 400-component UBM. The SGMM model is obtained by running the EM algorithm on the generated UBM, and finally discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model, as shown in FIG. 7; a command sketch follows.
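This stage can be sketched with Kaldi's SGMM2 scripts as below; the leaf/substate counts (5000/8000) and directory names are illustrative assumptions, while the 400-component UBM matches the description.

```bash
# Train the 400-Gaussian UBM by clustering the Gaussians of the baseline.
steps/train_ubm.sh --cmd "run.pl" 400 data-bn/train data/lang exp/tri3_ali exp/ubm
# EM training of the SGMM on top of the UBM.
steps/train_sgmm2.sh --cmd "run.pl" 5000 8000 data-bn/train data/lang \
  exp/tri3_ali exp/ubm/final.ubm exp/sgmm2
# Viterbi re-alignment with the SGMM, then MMI discriminative training.
steps/align_sgmm2.sh --nj 4 --cmd "run.pl" data-bn/train data/lang exp/sgmm2 exp/sgmm2_ali
steps/make_denlats_sgmm2.sh --nj 4 data-bn/train data/lang exp/sgmm2_ali exp/sgmm2_denlats
steps/train_mmi_sgmm2.sh data-bn/train data/lang exp/sgmm2_ali exp/sgmm2_denlats exp/sgmm2_mmi
```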
Hardware implementation:
18. Since Kaldi is generally developed on the PC side, it has always targeted Intel's x86-64 instruction set, whereas the Raspberry Pi uses the ARM instruction set; if Kaldi were simply copied onto a flash card and compiled directly on the embedded platform, many dependency packages and compiled files would be missing. To keep the compilation process smooth, Kaldi is compiled on a virtual machine and the compiled files are finally stored on the Raspberry Pi.
19. In this patent, the PC virtual machine runs a 64-bit Ubuntu 18.04 system with the arm-linux-gnueabihf-raspbian-x64 cross-compilation toolchain provided officially for the Raspberry Pi. After the whole toolchain is moved into Ubuntu 18.04, its environment variables are introduced through the .bashrc file in Linux, after which the variables held by the current terminal are updated. Whether the Raspberry Pi cross-compilation environment is configured is then confirmed with the arm-linux-gnueabihf-gcc -v command. The cross-compilation flow diagram is shown in FIG. 8; a shell sketch follows.
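A minimal shell sketch of this environment setup; the toolchain install path ($HOME/tools/...) is an assumed location, not specified in the patent.

```bash
# Add the cross toolchain to PATH via .bashrc (install path is assumed).
echo 'export PATH=$HOME/tools/arm-linux-gnueabihf-raspbian-x64/bin:$PATH' >> ~/.bashrc
# Refresh the variables held by the current terminal.
source ~/.bashrc
# Confirm the cross-compilation environment: prints the GCC version if configured.
arm-linux-gnueabihf-gcc -v
```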
20. The trained acoustic model file, the language model word-network file, and the dictionary file, which in Kaldi correspond to the three files final.mdl, HCLG.fst, and words.txt, are ported to a Raspberry Pi 4B; speech is input and decoded by Kaldi's own decoder, and the recognized text is finally output to the terminal. A block diagram of the online speech recognition system is shown in FIG. 9, and an overall block diagram of the hardware implementation in FIG. 10.
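For illustration, a batch decode with the ported SGMM model might be sketched as below; the graph and decode directory paths are assumptions, and the graph directory is where HCLG.fst and words.txt live.

```bash
# Decode the test set with the SGMM2 decoder; --transform-dir supplies
# the fMLLR transforms estimated by the SAT system.
steps/decode_sgmm2.sh --nj 1 --cmd "run.pl" --transform-dir exp/tri3/decode_test \
  exp/tri3/graph data-bn/test exp/sgmm2_mmi/decode_test
```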
Example 1:
The numbers of inserted, deleted, and substituted characters in each utterance are counted, and the WER of each sentence is computed. The utterances used in this test are shown in Table 1.
Table 1: contents of the test utterances
(Table 1 is provided as an image in the original document.)
As shown in Table 2, the WER of the speech recognition system remained essentially below 20%, with an average of about 14.92%.
Table 2: statistics of the test results
(Table 2 is provided as an image in the original document.)
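As a sketch, the per-utterance insertion, deletion, and substitution counts and the WER can be tallied with Kaldi's scoring tool; the hypothesis-file path is an assumption for the decoder's one-best output.

```bash
# Compare reference transcripts against hypotheses; both files map
# utterance-id -> transcript. Prints WER plus ins/del/sub counts.
compute-wer --text --mode=present \
  ark:data/test/text ark:exp/sgmm2_mmi/decode_test/hyp.txt
```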

Claims (6)

1. A BN-SGMM-HMM-based low-resource speech recognition method, characterized by comprising the following steps:
1) preprocessing and feature extraction of training data: the original database is configured and divided into sets, and feature extraction is then performed to obtain MFCC features;
2) creating a monophone acoustic model;
3) creating a triphone acoustic model and obtaining fMLLR features;
4) training a neural network: the fMLLR features are used as the input features of a bottleneck neural network; after the network is trained, the layers behind the bottleneck layer are removed and the bottleneck layer serves as the output layer, so that the bottleneck features are finally extracted after cross-entropy training;
5) training the BN-SGMM-HMM baseline system: the bottleneck features extracted by the neural network are used as the input features of the SGMM-HMM acoustic model, finally forming the BN-SGMM-HMM baseline system;
6) hardware implementation: Kaldi is compiled on a virtual machine and the compiled files are finally stored on a Raspberry Pi; the environment variables held by the current terminal are updated; finally, it is confirmed that the Raspberry Pi cross-compilation environment is configured;
7) the trained acoustic model file, the language model word-network file, and the dictionary file are ported to the Raspberry Pi; speech is input, decoded by Kaldi's own decoder, and the recognized text is finally output to the terminal.
2. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 1), the specific method is:
1.1) prepare the original corpus and set the path of the corpus in the training script;
1.2) execute the data preparation script, divide the data into a training set, a test set, and a development set, and generate the mapping between speaker IDs and utterances, the speakers' gender, and related information about the original audio files;
1.3) after the related information is generated, begin preparing the dictionary and the corresponding phoneme models until the data preparation is finished;
1.4) perform feature extraction on the speech signals over the training, development, and test sets; the executed scripts are steps/make_mfcc.sh and steps/compute_cmvn_stats.sh;
1.5) in make_mfcc.sh, the original speech is converted into feature vectors through pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first- and second-order difference computation for extracting dynamic features;
1.6) after the features are obtained, execute the compute_cmvn_stats.sh file and normalize the acoustic features by cepstral mean and variance normalization, completing the feature extraction stage.
3. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 2), the specific method is:
2.1) the previously extracted MFCC features are used to initialize the GMM model for monophones;
2.2) the EM algorithm is used to iterate the model training and align the data;
2.3) the alignment model obtained from the previous training round is iterated until the model converges.
4. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 3), the specific method is:
3.1) training is performed on the basis of the aligned monophone models; at the same time the corpus is processed to generate a pronunciation dictionary file, an audio-path file, an utterance-to-speaker mapping file, and a phoneme file for each utterance; the triphone models are pooled for similarity-based clustering and pruning, triphone models with similar pronunciations are clustered into one model, and their parameters are shared; the triphone model is then trained using the same method as the monophone model;
3.2) feature transformation is carried out, including linear discriminant analysis, maximum likelihood linear transformation, and speaker adaptive training; the speaker adaptive training is based on fMLLR, and the resulting fMLLR features are used for the next step, neural network training.
5. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 4), the specific method is:
4.1) before the formal training of the neural network, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form a DBN model; in Kaldi, the neural network pre-training is completed by executing pretrain_dbn.sh; during pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm and mini-batch stochastic gradient descent, the size of each mini-batch is 256, the momentum factor is set to 0.9, and no weight decay is used; the first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the remaining RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden nodes in the experiments is 1024;
4.2) the fine-tuning stage uses the back-propagation algorithm with an initial learning rate of 0.008 and the Sigmoid activation function; during training, the learning rate is kept unchanged for the first 10 epochs; from the 11th epoch on, the learning rate is halved in each training epoch, for 30 iterations in total, and training stops when the difference in learning rate between two iterations is less than 0.001; if this condition is still not met after 30 iterations, training is stopped forcibly; the neural network has 7 layers, namely 1 input layer, 5 hidden layers, and 1 output layer, where the input layer has 440 nodes, each hidden layer has 1024 nodes except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system;
4.3) after the network training is finished, the bottleneck features are extracted: a BN layer is configured during training and the bottleneck-layer node count bn_dim is set; the make_bn_feats.sh script is then called, the network behind the bottleneck layer is removed after the features are extracted, and the resulting bottleneck features are processed with CMVN.
6. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 5), the specific method is:
5.1) to train the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying; the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM model is then trained with the EM algorithm to optimize its parameters; to complete the initialization of the SGMM model, two final EM training steps remain: first, the GMM-HMM baseline is aligned at the state level by the Viterbi algorithm; second, the SGMM model is obtained from the Viterbi alignment; finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model.
CN202110897247.8A 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method Active CN113421555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110897247.8A CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110897247.8A CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Publications (2)

Publication Number Publication Date
CN113421555A 2021-09-21
CN113421555B CN113421555B (en) 2024-04-12

Family

ID=77718952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110897247.8A Active CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Country Status (1)

Country Link
CN (1) CN113421555B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2736042A1 (en) * 2012-11-23 2014-05-28 Samsung Electronics Co., Ltd Apparatus and method for constructing multilingual acoustic model and computer readable recording medium for storing program for performing the method
US9721561B2 (en) * 2013-12-05 2017-08-01 Nuance Communications, Inc. Method and apparatus for speech recognition using neural networks with speaker adaptation
CN109192199A (en) * 2018-06-30 2019-01-11 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of combination bottleneck characteristic acoustic model
CN109545201B (en) * 2018-12-15 2023-06-06 中国人民解放军战略支援部队信息工程大学 Construction method of acoustic model based on deep mixing factor analysis

Also Published As

Publication number Publication date
CN113421555B (en) 2024-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant