CN113421555A - BN-SGMM-HMM-based low-resource speech recognition method


Info

Publication number
CN113421555A
CN113421555A (application CN202110897247.8A; granted publication CN113421555B)
Authority
CN
China
Prior art keywords
training
model
hmm
sgmm
layer
Prior art date
Legal status
Granted
Application number
CN202110897247.8A
Other languages
Chinese (zh)
Other versions
CN113421555B (en)
Inventor
赵宏亮
雷杰
Current Assignee
Liaoning University
Original Assignee
Liaoning University
Priority date
Filing date
Publication date
Application filed by Liaoning University filed Critical Liaoning University
Priority to CN202110897247.8A
Publication of CN113421555A
Application granted
Publication of CN113421555B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques where the extracted parameters are the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

A BN-SGMM-HMM-based low-resource speech recognition method. Under low-resource conditions, bottleneck features trained by a neural network are combined with a subspace Gaussian mixture model to form a baseline system, yielding a BN-SGMM-HMM acoustic model, which is then ported to a Raspberry Pi to perform the speech recognition task. Compared with a traditional speech recognition model, the model achieves a markedly higher recognition rate, has a smaller parameter scale than a traditional speech recognition system, is inexpensive to port to open-source hardware, and can be used without a network connection.

Description

BN-SGMM-HMM-based low-resource speech recognition method
Technical Field
The invention relates to the field of speech recognition, and in particular to a low-resource speech recognition method based on BN-SGMM-HMM.
Background
Existing speech recognition systems generally adopt one of the following methods:
Method 1: traditional MFCC features are used as input and a GMM-HMM serves as the acoustic model. Because the GMM-HMM is a shallow model, a speech model trained this way cannot perform the multi-layer, back-propagation style of training that a neural network can, so its recognition rate is usually lower than that of an acoustic model that incorporates a neural network; moreover, the GMM-HMM is not suited to corpus-starved (i.e., low-resource) application scenarios.
Method 2: in the traditional DNN-HMM model, the multi-layer structure of the DNN makes the amount of computation huge. Training with a bottleneck network reduces this cost, because the layers behind the bottleneck layer are removed when the bottleneck features are extracted, so the nonlinear operations behind the bottleneck layer are no longer needed. Compared with traditional MFCC features, the bottleneck features obtained through cross-entropy training inherit the advantages of DNN features, such as capturing the long-term correlations of speech and providing a compact representation, and a baseline system trained on bottleneck features achieves a better recognition rate than a plain DNN-HMM.
Disclosure of Invention
The invention aims to solve the technical problem of providing a BN-SGMM-HMM-based low-resource speech recognition method, which is trained on an open-source Chinese corpus and finally implemented, with good recognition performance, on a Raspberry Pi hardware platform.
In order to achieve this purpose, the invention adopts the following technical scheme: a BN-SGMM-HMM-based low-resource speech recognition method, characterized by comprising the following steps:
1) preprocessing and feature extraction of training data: the original database is configured and divided into sets, and feature extraction is then performed to obtain MFCC features;
2) creating a monophone acoustic model;
3) creating a triphone acoustic model and obtaining fMLLR features;
4) training a neural network: the fMLLR features are used as the input features of a bottleneck neural network; after the network is trained, the layers behind the bottleneck layer are removed and the bottleneck layer serves as the output layer, so that the bottleneck features are finally extracted after cross-entropy training;
5) training the BN-SGMM-HMM baseline system: the bottleneck features extracted by the neural network are used as the input features of the SGMM-HMM acoustic model, finally forming the BN-SGMM-HMM baseline system;
6) hardware implementation: Kaldi is compiled on a virtual machine and the compiled files are finally stored on a Raspberry Pi; the environment variables held by the current terminal are updated; finally, it is confirmed that the Raspberry Pi cross-compilation environment is configured;
7) the trained acoustic model file, the language model word-network file, and the dictionary file are ported to the Raspberry Pi; speech is input, decoded by Kaldi's own decoder, and the recognized text is finally output to the terminal.
In the step 1), the specific method is as follows:
1.1) prepare the original corpus and set the path of the corpus in the training script;
1.2) execute the data preparation script, divide the data into a training set, a test set, and a development set, and generate the mapping between speaker IDs and utterances, the speakers' gender, and related information about the original audio files;
1.3) after the related information is generated, begin preparing the dictionary and the corresponding phoneme models until the data preparation is finished;
1.4) perform feature extraction on the speech signals over the training, development, and test sets; the executed scripts are steps/make_mfcc.sh and steps/compute_cmvn_stats.sh;
1.5) in make_mfcc.sh, the original speech is converted into feature vectors through pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first- and second-order difference computation for extracting dynamic features;
1.6) after the features are obtained, execute the compute_cmvn_stats.sh file and normalize the acoustic features by cepstral mean and variance normalization, completing the feature extraction stage.
In the step 2), the specific method is as follows:
2.1) the previously extracted MFCC features are used to initialize the GMM model for monophones;
2.2) the EM algorithm is used to iterate the model training and align the data;
2.3) the alignment model obtained from the previous training round is iterated until the model converges.
In the step 3), the specific method is as follows:
3.1) training is performed on the basis of the aligned monophone models; at the same time the corpus is processed to generate a pronunciation dictionary file, an audio-path file, an utterance-to-speaker mapping file, and a phoneme file for each utterance; the triphone models are pooled for similarity-based clustering and pruning, triphone models with similar pronunciations are clustered into one model, and their parameters are shared; the triphone model is then trained using the same method as the monophone model;
3.2) feature transformation is carried out, including linear discriminant analysis, maximum likelihood linear transformation, and speaker adaptive training; the speaker adaptive training is based on fMLLR, and the resulting fMLLR features are used for the next step, neural network training;
In the step 4), the specific method is as follows:
4.1) before the formal training of the neural network, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form a DBN model; in Kaldi, the neural network pre-training is completed by executing pretrain_dbn.sh; during pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm and mini-batch stochastic gradient descent, the size of each mini-batch is 256, the momentum factor is set to 0.9, and no weight decay is used; the first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the remaining RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden nodes in the experiments is 1024;
4.2) the fine-tuning stage uses the back-propagation algorithm with an initial learning rate of 0.008 and the Sigmoid activation function; during training, the learning rate is kept unchanged for the first 10 epochs; from the 11th epoch on, the learning rate is halved in each training epoch, for 30 iterations in total, and training stops when the difference in learning rate between two iterations is less than 0.001; if this condition is still not met after 30 iterations, training is stopped forcibly; the neural network has 7 layers, namely 1 input layer, 5 hidden layers, and 1 output layer, where the input layer has 440 nodes, each hidden layer has 1024 nodes except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system;
4.3) after the network training is finished, the bottleneck features are extracted: a BN layer is configured during training and the bottleneck-layer node count bn_dim is set; the make_bn_feats.sh script is then called, the network behind the bottleneck layer is removed after the features are extracted, and the resulting bottleneck features are processed with CMVN.
In the step 5), the specific method is as follows:
5.1) to train the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying; the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM model is then trained with the EM algorithm to optimize its parameters; to complete the initialization of the SGMM model, two final EM training steps remain: first, the GMM-HMM baseline is aligned at the state level by the Viterbi algorithm; second, the SGMM model is obtained from the Viterbi alignment; finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model.
The beneficial effects of the invention are as follows:
1. The speech recognition method adopts the BN-SGMM-HMM model, whose parameter scale is smaller than that of traditional speech recognition methods, so the model can run on embedded speech devices with limited hardware resources.
2. The speech recognition method is suited to low-resource conditions and can achieve a good recognition result even when the training corpus is limited.
3. The method has a low development cost: adopting an open-source Chinese speech corpus saves the cost of purchasing a corpus and a speech recognition API, and building the acoustic model on Kaldi greatly shortens the development cycle and lowers the difficulty for developers.
4. The hardware implementation cost is low: the open-source Raspberry Pi 4B is used, which, compared with other development boards and dedicated speech chips, reduces the development difficulty and cycle and saves tape-out costs.
Drawings
FIG. 1: MFCC feature extraction flow chart.
FIG. 2: Feature transformation flow chart.
FIG. 3: DBN training flow block diagram.
FIG. 4: Back-propagation algorithm block diagram.
FIG. 5: Bottleneck feature extraction diagram.
FIG. 6: Subspace Gaussian mixture model training flow chart.
FIG. 7: BN-SGMM-HMM training diagram.
FIG. 8: Cross-compilation flow chart.
FIG. 9: Block diagram of the online speech recognition system.
FIG. 10: Block diagram of the hardware implementation.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings; obviously, the described embodiments are only some, and not all, of the embodiments of the invention.
A low-resource speech recognition method based on BN-SGMM-HMM comprises the following steps:
early-stage preparation of training data:
1. An original corpus is prepared and the path of the corpus is set in the training script.
2. Execute the data preparation script, divide the data into a training set (train), a test set (test), and a development set (dev), and generate the mapping between speaker IDs and utterances, the speakers' gender, the original audio files, and other related information.
3. After the related information is generated, prepare the dictionary and the corresponding phoneme models. The data is now ready for the feature extraction stage.
4. Perform feature extraction on the speech signals over the training, development, and test sets; the executed scripts are steps/make_mfcc.sh and steps/compute_cmvn_stats.sh.
5. In make_mfcc.sh, pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first- and second-order difference computation for extracting dynamic features are applied; the script converts the original speech into feature vectors, which retain the key information of the original waveform, reduce the amount of computation, and remove redundant information.
6. After the features are obtained, execute the compute_cmvn_stats.sh file; CMVN performs cepstral mean and variance normalization on the acoustic features. The purpose of the normalization is to make the input acoustic features conform to a normal distribution, thereby reducing the influence of noise on the speech. This completes the feature extraction. The MFCC feature extraction flow is shown in FIG. 1, and a command-line sketch is given below.
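As an illustration, this stage in a stock Kaldi recipe can be sketched as follows; the directory names (data/train, exp/make_mfcc, mfcc) are assumptions for illustration, not values taken from the patent.

```bash
# Sketch of feature extraction over the three sets; directory names are assumed.
for set in train dev test; do
  # Convert raw audio to MFCC vectors (pre-emphasis, framing, windowing,
  # FFT, Mel filtering, and log energy happen inside the script).
  steps/make_mfcc.sh --nj 4 --cmd "run.pl" data/$set exp/make_mfcc/$set mfcc
  # Per-speaker cepstral mean and variance normalization (CMVN) statistics.
  steps/compute_cmvn_stats.sh data/$set exp/make_mfcc/$set mfcc
done
```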
Monophone model creation for the acoustic model:
7. The previously extracted MFCC features are used to initialize the GMM model for monophones.
8. The EM algorithm is used to iterate the model training, and the data is aligned.
9. The alignment model obtained from the previous training round is iterated until the model converges; a minimal command sketch follows.
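A minimal sketch of this monophone stage with standard Kaldi scripts, under the same assumed directory layout:

```bash
# Flat-start monophone training; the EM iterations and periodic
# realignment described above are handled inside the script.
steps/train_mono.sh --nj 4 --cmd "run.pl" data/train data/lang exp/mono
# Align the training data with the converged monophone model so the
# triphone stage can start from these alignments.
steps/align_si.sh --nj 4 --cmd "run.pl" data/train data/lang exp/mono exp/mono_ali
```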
Triphone model creation for the acoustic model:
10. Since the actual pronunciation of a phoneme is affected by adjacent and similar phonemes, especially in co-articulated words or sentences, a context-dependent acoustic model is introduced; among context-dependent models, the most effective is the triphone model, constructed as <left adjacent phoneme + center phoneme + right adjacent phoneme>.
11. Triphone training must start from the aligned monophone models and also requires the training data and documents such as the pronunciation dictionary. Meanwhile, to counter the explosion in model parameters caused by introducing the triphone model, the triphone models are pooled for similarity-based clustering and pruning; triphone models with similar pronunciations are clustered into one model and share parameters, effectively reducing the scale of the model parameters. After that, the process of training the triphone model is very similar to that of training the monophone model.
12. Meanwhile, to improve the recognition rate of the triphone model, feature transformations are applied, mainly linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and speaker adaptive training (SAT). SAT is based on feature-space maximum likelihood linear regression (fMLLR), and the resulting fMLLR features are used for the next step, neural network training. The feature transformation flow is shown in FIG. 2, and a command sketch follows.
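The triphone and feature-transformation stages can be sketched with standard Kaldi scripts as below; the leaf/Gaussian counts (2000/10000, 2500/15000) and directory names are illustrative assumptions, not values from the patent.

```bash
# Context-dependent triphone training on the monophone alignments,
# with tree-based state clustering to share parameters.
steps/train_deltas.sh --cmd "run.pl" 2000 10000 data/train data/lang exp/mono_ali exp/tri1
steps/align_si.sh --nj 4 --cmd "run.pl" data/train data/lang exp/tri1 exp/tri1_ali
# LDA + MLLT feature transforms on top of the triphone system.
steps/train_lda_mllt.sh --cmd "run.pl" 2500 15000 data/train data/lang exp/tri1_ali exp/tri2
steps/align_si.sh --nj 4 --cmd "run.pl" data/train data/lang exp/tri2 exp/tri2_ali
# Speaker adaptive training (SAT), which estimates per-speaker fMLLR transforms.
steps/train_sat.sh --cmd "run.pl" 2500 15000 data/train data/lang exp/tri2_ali exp/tri3
# fMLLR alignments; the fMLLR features feed the bottleneck network next.
steps/align_fmllr.sh --nj 4 --cmd "run.pl" data/train data/lang exp/tri3 exp/tri3_ali
```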
Training of the neural network:
13. Before the formal training of the neural network, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form the DBN model. In Kaldi, the neural network pre-training is completed by executing pretrain_dbn.sh. During pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm and mini-batch stochastic gradient descent; the size of each mini-batch is 256, the momentum factor is set to 0.9, and no weight decay is used. The first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the remaining RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden nodes in the experiments is 1024.
14. The fine-tuning stage uses the back-propagation algorithm with an initial learning rate of 0.008 and the Sigmoid activation function. During training, the learning rate is kept unchanged for the first 10 epochs; from the 11th epoch on, the learning rate is halved in each training epoch, for 30 iterations in total. Training stops when the difference in learning rate between two iterations is less than 0.001; if this condition is still not met after 30 iterations, training is stopped forcibly. Here, the neural network has 7 layers, namely 1 input layer, 5 hidden layers, and 1 output layer; the input layer has 440 nodes, each hidden layer has 1024 nodes except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system. The DBN training flow is shown in FIG. 3 and the back-propagation algorithm in FIG. 4.
15. After the network training is finished, the bottleneck features are extracted. Specifically, a BN layer is configured during training and the bottleneck-layer node count bn_dim is set; the make_bn_feats.sh script is then called, the network behind the bottleneck layer is removed after the features are extracted, and the obtained bottleneck features are processed with CMVN. These features are used later to build the SGMM-HMM baseline system. The bottleneck feature extraction is shown in FIG. 5, and a command sketch follows.
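A sketch of this stage with Kaldi's nnet1 scripts is given below; the hyperparameters mirror the description where stated, while the directory names and exact option spellings (--dbn, --bn-dim, and so on, which vary across Kaldi versions) are assumptions.

```bash
# RBM pre-training: stack 5 RBMs of 1024 units into a DBN
# (first RBM Gaussian-Bernoulli, the rest Bernoulli-Bernoulli).
steps/nnet/pretrain_dbn.sh --nn-depth 5 --hid-dim 1024 \
  data-fmllr/train exp/dnn_pretrain
# Cross-entropy fine-tuning with a 40-node bottleneck layer and
# an initial learning rate of 0.008, initialized from the DBN.
steps/nnet/train.sh --hid-layers 5 --hid-dim 1024 --bn-dim 40 --learn-rate 0.008 \
  --dbn exp/dnn_pretrain/5.dbn \
  data-fmllr/train data-fmllr/dev data/lang exp/tri3_ali exp/tri3_ali_dev exp/dnn_bn
# Strip the layers behind the bottleneck and dump the BN features (then CMVN).
steps/nnet/make_bn_feats.sh --nj 4 data-bn/train data-fmllr/train \
  exp/dnn_bn exp/make_bn/train bn_feats
```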
Training of the BN-SGMM-HMM baseline system:
16. To train the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying; the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM model is then trained with the EM algorithm to optimize its parameters. To complete the initialization of the SGMM model, two final EM training steps remain: first, the GMM-HMM baseline is aligned at the state level by the Viterbi algorithm; second, the SGMM model is obtained from the Viterbi alignment. Finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model. Each of the above two steps is an iterative process, as shown in FIG. 6.
17. When the SGMM is trained, a Gaussian mixture clustering algorithm is first applied to the Gaussian components in the acoustic model to generate a 400-component UBM. The SGMM model is obtained by running the EM algorithm on the generated UBM, and finally discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model, as shown in FIG. 7; a command sketch follows.
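This stage can be sketched with Kaldi's SGMM2 scripts as below; the leaf/substate counts (5000/8000) and directory names are illustrative assumptions, while the 400-component UBM matches the description.

```bash
# Train the 400-Gaussian UBM by clustering the Gaussians of the baseline.
steps/train_ubm.sh --cmd "run.pl" 400 data-bn/train data/lang exp/tri3_ali exp/ubm
# EM training of the SGMM on top of the UBM.
steps/train_sgmm2.sh --cmd "run.pl" 5000 8000 data-bn/train data/lang \
  exp/tri3_ali exp/ubm/final.ubm exp/sgmm2
# Viterbi re-alignment with the SGMM, then MMI discriminative training.
steps/align_sgmm2.sh --nj 4 --cmd "run.pl" data-bn/train data/lang exp/sgmm2 exp/sgmm2_ali
steps/make_denlats_sgmm2.sh --nj 4 data-bn/train data/lang exp/sgmm2_ali exp/sgmm2_denlats
steps/train_mmi_sgmm2.sh data-bn/train data/lang exp/sgmm2_ali exp/sgmm2_denlats exp/sgmm2_mmi
```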
Hardware implementation:
18. Since Kaldi is generally developed on the PC side, it has always targeted Intel's x86-64 instruction set, whereas the Raspberry Pi uses the ARM instruction set; if Kaldi were simply copied onto a flash card and compiled directly on the embedded platform, many dependency packages and compiled files would be missing. To keep the compilation process smooth, Kaldi is compiled on a virtual machine and the compiled files are finally stored on the Raspberry Pi.
19. In this patent, the PC virtual machine runs a 64-bit Ubuntu 18.04 system with the arm-linux-gnueabihf-raspbian-x64 cross-compilation toolchain provided officially for the Raspberry Pi. After the whole toolchain is moved into Ubuntu 18.04, its environment variables are introduced through the .bashrc file in Linux, after which the variables held by the current terminal are updated. Whether the Raspberry Pi cross-compilation environment is configured is then confirmed with the arm-linux-gnueabihf-gcc -v command. The cross-compilation flow diagram is shown in FIG. 8; a shell sketch follows.
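A minimal shell sketch of this environment setup; the toolchain install path ($HOME/tools/...) is an assumed location, not specified in the patent.

```bash
# Add the cross toolchain to PATH via .bashrc (install path is assumed).
echo 'export PATH=$HOME/tools/arm-linux-gnueabihf-raspbian-x64/bin:$PATH' >> ~/.bashrc
# Refresh the variables held by the current terminal.
source ~/.bashrc
# Confirm the cross-compilation environment: prints the GCC version if configured.
arm-linux-gnueabihf-gcc -v
```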
20. The trained acoustic model file, the language model word-network file, and the dictionary file, which in Kaldi correspond to the three files final.mdl, HCLG.fst, and words.txt, are ported to a Raspberry Pi 4B; speech is input and decoded by Kaldi's own decoder, and the recognized text is finally output to the terminal. A block diagram of the online speech recognition system is shown in FIG. 9, and an overall block diagram of the hardware implementation in FIG. 10.
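For illustration, a batch decode with the ported SGMM model might be sketched as below; the graph and decode directory paths are assumptions, and the graph directory is where HCLG.fst and words.txt live.

```bash
# Decode the test set with the SGMM2 decoder; --transform-dir supplies
# the fMLLR transforms estimated by the SAT system.
steps/decode_sgmm2.sh --nj 1 --cmd "run.pl" --transform-dir exp/tri3/decode_test \
  exp/tri3/graph data-bn/test exp/sgmm2_mmi/decode_test
```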
Example 1:
The numbers of inserted, deleted, and substituted characters in each utterance are counted, and the WER of each sentence is computed. The utterances used in this test are shown in Table 1.
Table 1: contents of the test utterances
(Table 1 is provided as an image in the original document.)
As shown in Table 2, the WER of the speech recognition system remained essentially below 20%, with an average of about 14.92%.
Table 2: statistics of the test results
(Table 2 is provided as an image in the original document.)
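As a sketch, the per-utterance insertion, deletion, and substitution counts and the WER can be tallied with Kaldi's scoring tool; the hypothesis-file path is an assumption for the decoder's one-best output.

```bash
# Compare reference transcripts against hypotheses; both files map
# utterance-id -> transcript. Prints WER plus ins/del/sub counts.
compute-wer --text --mode=present \
  ark:data/test/text ark:exp/sgmm2_mmi/decode_test/hyp.txt
```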

Claims (6)

1. A BN-SGMM-HMM-based low-resource speech recognition method, characterized by comprising the following steps:
1) preprocessing and feature extraction of training data: the original database is configured and divided into sets, and feature extraction is then performed to obtain MFCC features;
2) creating a monophone acoustic model;
3) creating a triphone acoustic model and obtaining fMLLR features;
4) training a neural network: the fMLLR features are used as the input features of a bottleneck neural network; after the network is trained, the layers behind the bottleneck layer are removed and the bottleneck layer serves as the output layer, so that the bottleneck features are finally extracted after cross-entropy training;
5) training the BN-SGMM-HMM baseline system: the bottleneck features extracted by the neural network are used as the input features of the SGMM-HMM acoustic model, finally forming the BN-SGMM-HMM baseline system;
6) hardware implementation: Kaldi is compiled on a virtual machine and the compiled files are finally stored on a Raspberry Pi; the environment variables held by the current terminal are updated; finally, it is confirmed that the Raspberry Pi cross-compilation environment is configured;
7) the trained acoustic model file, the language model word-network file, and the dictionary file are ported to the Raspberry Pi; speech is input, decoded by Kaldi's own decoder, and the recognized text is finally output to the terminal.
2. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 1), the specific method is:
1.1) prepare the original corpus and set the path of the corpus in the training script;
1.2) execute the data preparation script, divide the data into a training set, a test set, and a development set, and generate the mapping between speaker IDs and utterances, the speakers' gender, and related information about the original audio files;
1.3) after the related information is generated, begin preparing the dictionary and the corresponding phoneme models until the data preparation is finished;
1.4) perform feature extraction on the speech signals over the training, development, and test sets; the executed scripts are steps/make_mfcc.sh and steps/compute_cmvn_stats.sh;
1.5) in make_mfcc.sh, the original speech is converted into feature vectors through pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first- and second-order difference computation for extracting dynamic features;
1.6) after the features are obtained, execute the compute_cmvn_stats.sh file and normalize the acoustic features by cepstral mean and variance normalization, completing the feature extraction stage.
3. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 2), the specific method is:
2.1) the previously extracted MFCC features are used to initialize the GMM model for monophones;
2.2) the EM algorithm is used to iterate the model training and align the data;
2.3) the alignment model obtained from the previous training round is iterated until the model converges.
4. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 3), the specific method is:
3.1) training is performed on the basis of the aligned monophone models; at the same time the corpus is processed to generate a pronunciation dictionary file, an audio-path file, an utterance-to-speaker mapping file, and a phoneme file for each utterance; the triphone models are pooled for similarity-based clustering and pruning, triphone models with similar pronunciations are clustered into one model, and their parameters are shared; the triphone model is then trained using the same method as the monophone model;
3.2) feature transformation is carried out, including linear discriminant analysis, maximum likelihood linear transformation, and speaker adaptive training; the speaker adaptive training is based on fMLLR, and the resulting fMLLR features are used for the next step, neural network training.
5. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 4), the specific method is:
4.1) before the formal training of the neural network, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form a DBN model; in Kaldi, the neural network pre-training is completed by executing pretrain_dbn.sh; during pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm and mini-batch stochastic gradient descent, the size of each mini-batch is 256, the momentum factor is set to 0.9, and no weight decay is used; the first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the remaining RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden nodes in the experiments is 1024;
4.2) the fine-tuning stage uses the back-propagation algorithm with an initial learning rate of 0.008 and the Sigmoid activation function; during training, the learning rate is kept unchanged for the first 10 epochs; from the 11th epoch on, the learning rate is halved in each training epoch, for 30 iterations in total, and training stops when the difference in learning rate between two iterations is less than 0.001; if this condition is still not met after 30 iterations, training is stopped forcibly; the neural network has 7 layers, namely 1 input layer, 5 hidden layers, and 1 output layer, where the input layer has 440 nodes, each hidden layer has 1024 nodes except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system;
4.3) after the network training is finished, the bottleneck features are extracted: a BN layer is configured during training and the bottleneck-layer node count bn_dim is set; the make_bn_feats.sh script is then called, the network behind the bottleneck layer is removed after the features are extracted, and the resulting bottleneck features are processed with CMVN.
6. The BN-SGMM-HMM-based low-resource speech recognition method of claim 1, wherein in the step 5), the specific method is:
5.1) to train the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying; the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM model is then trained with the EM algorithm to optimize its parameters; to complete the initialization of the SGMM model, two final EM training steps remain: first, the GMM-HMM baseline is aligned at the state level by the Viterbi algorithm; second, the SGMM model is obtained from the Viterbi alignment; finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model.
CN202110897247.8A 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method Active CN113421555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110897247.8A CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110897247.8A CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Publications (2)

Publication Number Publication Date
CN113421555A 2021-09-21
CN113421555B CN113421555B (en) 2024-04-12

Family

ID=77718952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110897247.8A Active CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Country Status (1)

Country Link
CN (1) CN113421555B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2736042A1 (en) * 2012-11-23 2014-05-28 Samsung Electronics Co., Ltd Apparatus and method for constructing multilingual acoustic model and computer readable recording medium for storing program for performing the method
US9721561B2 (en) * 2013-12-05 2017-08-01 Nuance Communications, Inc. Method and apparatus for speech recognition using neural networks with speaker adaptation
CN109192199A (en) * 2018-06-30 2019-01-11 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of combination bottleneck characteristic acoustic model
CN109545201B (en) * 2018-12-15 2023-06-06 中国人民解放军战略支援部队信息工程大学 Construction method of acoustic model based on deep mixing factor analysis

Also Published As

Publication number Publication date
CN113421555B (en) 2024-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant