CN113421555B - BN-SGMM-HMM-based low-resource voice recognition method - Google Patents

BN-SGMM-HMM-based low-resource voice recognition method

Info

Publication number
CN113421555B
CN113421555B (Application CN202110897247.8A)
Authority
CN
China
Prior art keywords
training
model
hmm
sgmm
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110897247.8A
Other languages
Chinese (zh)
Other versions
CN113421555A (en)
Inventor
Zhao Hongliang (赵宏亮)
Lei Jie (雷杰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University filed Critical Liaoning University
Priority to CN202110897247.8A priority Critical patent/CN113421555B/en
Publication of CN113421555A publication Critical patent/CN113421555A/en
Application granted granted Critical
Publication of CN113421555B publication Critical patent/CN113421555B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

A low-resource voice recognition method based on BN-SGMM-HMM combines bottleneck features trained by a neural network with a subspace Gaussian mixture model to form a baseline system under low-resource conditions, yielding a BN-SGMM-HMM acoustic model, and ports the model to a Raspberry Pi to complete the voice recognition task.

Description

BN-SGMM-HMM-based low-resource voice recognition method
Technical Field
The invention relates to the field of voice recognition, in particular to a low-resource voice recognition method based on BN-SGMM-HMM.
Background
The existing speech recognition generally adopts the following methods:
Method 1: traditional MFCC features are used as input and a GMM-HMM is used as the acoustic model. Because the GMM-HMM is a shallow model, a system trained this way cannot exploit multi-layer structure and back-propagation the way a neural network can, so its recognition rate is usually lower than that of acoustic models refined with a neural network; in addition, the GMM-HMM is not suited to application scenarios where the corpus is scarce (i.e. low-resource conditions).
Method 2: the traditional DNN-HMM model involves a huge amount of computation because of the deep, complex structure of the DNN. Training with a bottleneck network reduces this cost, because the layers after the bottleneck layer are removed when the bottleneck features are extracted, so the nonlinear operations behind the bottleneck layer are no longer needed. Compared with traditional MFCC (Mel-frequency cepstral coefficient) features, the extracted bottleneck features benefit from cross-entropy training and inherit advantages of DNN features such as modeling the long-term correlation of speech and giving a compact representation, and the recognition rate of a baseline system trained on bottleneck features is more favorable than that of the DNN-HMM.
Disclosure of Invention
The invention aims to solve the above technical problems by providing a low-resource voice recognition method based on BN-SGMM-HMM, which trains the model on an open-source Chinese corpus and finally realizes it on a Raspberry Pi hardware platform with a good recognition effect.
In order to achieve the above purpose, the invention adopts the following technical scheme: the low-resource voice recognition method based on BN-SGMM-HMM is characterized by comprising the following steps:
1) Preprocessing and feature extraction of training data: the original database is configured and partitioned into sets, and feature extraction is then carried out to obtain MFCC features;
2) Creating a single-phoneme acoustic model:
3) Creating a triphone acoustic model: obtaining FMLLR characteristics;
4) Training a neural network: the FMLLR features are used as input features of a bottleneck neural network; after the network is trained, the layers after the bottleneck layer are removed, and the bottleneck layer is used as the output layer to extract the cross-entropy-trained bottleneck features;
5) Training of BN-SGMM-HMM baseline system: taking bottleneck characteristics extracted from the neural network as input characteristics of an SGMM-HMM acoustic model, and finally forming a BN-SGMM-HMM baseline system;
6) Hardware implementation: Kaldi is compiled on a virtual machine, and the compiled files are stored on the Raspberry Pi; the environment variables of the current terminal are updated; finally, it is confirmed whether the Raspberry Pi cross-compilation environment has been configured;
7) The trained acoustic model file, language model word-network file and dictionary file are transplanted to the Raspberry Pi; speech is input and decoded by Kaldi's built-in decoder, and the recognized text is finally output to the terminal.
In the step 1), the specific method comprises the following steps:
1.1 Preparing an original corpus, and setting a path of the corpus in a training script;
1.2 Executing a data preparation script, dividing the data into a training set, a test set and a development set, and generating the mapping relation between speaker number and speech, the gender of the speaker, and related information of the original speech files;
1.3 After the related information is generated, preparing a dictionary and a corresponding phoneme model, and finishing the data preparation;
1.4 Feature extraction is carried out on the speech signals; the extraction covers the training set, development set and test set, and the executed scripts are steps/make_mfcc.sh and compute_cmvn_stats.sh;
1.5 In make_mfcc.sh, pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first-order and second-order difference computation to extract dynamic features are applied to transform the original speech into feature vectors;
1.6 After the features are obtained, the compute_cmvn_stats.sh file is executed to apply cepstral mean and variance normalization to the acoustic features, completing the feature extraction part.
In the step 2), the specific method comprises the following steps:
2.1 Using the previously extracted MFCC features to initialize the GMM models for the monophones;
2.2 The model training is iterated by adopting an E-M algorithm, and data alignment is carried out;
2.3 The alignment model obtained by the last training is iterated again until the model converges.
In the step 3), the specific method is:
3.1 Training is performed on the basis of the aligned monophone models; at the same time the corpus is processed to generate a language dictionary file, a speech path file, a speech-to-speaker mapping file and a phoneme file for each speech segment; the triphone models are pooled for similarity clustering and cutting, so that triphones with similar pronunciation are clustered into one model and share parameters; the triphone model is then trained by the same method used for the monophone model;
3.2 Feature transformation is performed, including linear discriminant analysis, maximum likelihood linear transformation and speaker adaptive training; the speaker adaptive training is based on FMLLR, and the obtained FMLLR features are used for the subsequent neural network training;
In the step 4), the specific method comprises the following steps:
4.1 Before the neural network is formally trained, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form a DBN model; in Kaldi, neural network pre-training is accomplished by running pretrain_dbn.sh; during pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm combined with mini-batch stochastic gradient descent, the mini-batch size is 256, the momentum factor is set to 0.9, and no weight decay is used; the first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the subsequent RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden-layer nodes in the experiments is 1024;
4.2 During fine-tuning, the back-propagation algorithm is adopted with an initial learning rate of 0.008 and the Sigmoid activation function; during training the learning rate is kept unchanged for the first 10 epochs, and from the 11th epoch it is halved in each training epoch, for at most 30 iterations in total; training is stopped when the difference in learning rate between two iterations is smaller than 0.001, and if this condition is not met after 30 iterations training is forcibly stopped; the neural network consists of 1 input layer, 5 hidden layers and 1 output layer, 7 layers in total, where the input layer has 440 nodes, the hidden layers have 1024 nodes each except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system;
4.3 After the network training is finished, the bottleneck features are extracted: the bottleneck (BN) layer is configured in the training script and the number of bottleneck-layer nodes bn_dim is set, then the make_bn_features.sh script is called; when the features have been extracted, the network behind the bottleneck layer is removed and the resulting bottleneck features are passed through CMVN; the extracted bottleneck features are used to build the SGMM-HMM baseline system.
In the step 5), the specific method is:
5.1 When training the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying, the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM is then trained with the E-M algorithm to adjust its parameters; to complete the initialized SGMM model, two final E-M training steps are performed: the first step produces a Viterbi state alignment from the GMM-HMM baseline, and the second step uses the Viterbi alignment to obtain the SGMM model; finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model.
The invention has the beneficial effects that:
1. The speech recognition method adopts a BN-SGMM-HMM model whose parameter scale is smaller than that of traditional speech recognition methods, so it can run on embedded speech devices with limited hardware resources.
2. The method is suitable for low-resource speech recognition conditions and achieves a good recognition effect even with a limited training corpus.
3. The method has low development cost: it uses an open-source Chinese corpus, saving the cost of purchasing a corpus or a speech recognition API, and the acoustic model is built on Kaldi, greatly reducing the development period and difficulty for developers.
4. The hardware implementation cost is low: the open-source hardware Raspberry Pi 4B is adopted, which, compared with other development boards and speech chips, reduces the development difficulty and period and saves chip tape-out costs.
Drawings
Fig. 1: MFCC feature extraction flow chart.
Fig. 2: a feature transformation flow chart.
Fig. 3: DBN training flow block diagram.
Fig. 4: Back propagation algorithm block diagram.
Fig. 5: bottleneck feature extraction graph.
Fig. 6: a flow chart of training of subspace gaussian mixture models.
Fig. 7: BN-SGMM-HMM training patterns.
Fig. 8: cross-compiling a flow chart.
Fig. 9: an online speech recognition system block diagram.
Fig. 10: an overall block diagram of a hardware implementation.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention.
A low-resource voice recognition method based on BN-SGMM-HMM comprises the following steps:
Early preparation of training data:
1. An original corpus is prepared, and the path of the corpus is set in the training script.
2. The data preparation script is executed, dividing the data into a training set (train), a test set (test) and a development set (dev), and generating related information such as the mapping between speaker number and speech, the gender of each speaker, and the original speech files.
3. After the related information is generated, the dictionary and the corresponding phoneme model are prepared. At this point the data preparation is complete.
4. Feature extraction is performed on the speech signals; the extraction covers the training set, development set and test set, and the executed scripts are steps/make_mfcc.sh and compute_cmvn_stats.sh.
5. In make_mfcc.sh, pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first-order and second-order difference computation for extracting dynamic features are performed. The purpose of the script is to convert the original speech into feature vectors, which not only extracts the key information of the original speech waveform but also reduces the amount of computation and removes redundant information.
6. After the features are obtained, compute_cmvn_stats.sh is executed, which applies cepstral mean and variance normalization (CMVN) to the acoustic features. The purpose of normalization is to make the input acoustic features conform to a normal distribution, thereby reducing the effect of noise on the speech. This completes the feature extraction part. The MFCC feature extraction flow is shown in fig. 1, and a simplified sketch of the computation is given below.
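The following Python sketch illustrates, under simplifying assumptions, the front-end computation described in steps 5 and 6: framing, Hamming windowing, FFT, Mel filtering, log compression, a DCT to cepstral coefficients, first- and second-order differences, and per-utterance CMVN. It is not the Kaldi code invoked by make_mfcc.sh and compute_cmvn_stats.sh; function and parameter names such as mel_filterbank and num_ceps are illustrative assumptions.

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(num_filters, nfft, sample_rate):
    # Triangular Mel filterbank (illustrative, not Kaldi's exact variant)
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    hz = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, num_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((nfft + 1) * hz / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        if center > left:
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    return fbank

def mfcc_with_cmvn(signal, sample_rate=16000, frame_len=400, frame_shift=160,
                   nfft=512, num_filters=23, num_ceps=13):
    # Pre-emphasis (assumes the utterance is at least one frame long)
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing
    num_frames = 1 + (len(emph) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(num_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # FFT power spectrum -> Mel filtering -> log energy -> DCT cepstra
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft
    logmel = np.log(power @ mel_filterbank(num_filters, nfft, sample_rate).T + 1e-10)
    ceps = dct(logmel, type=2, axis=1, norm='ortho')[:, :num_ceps]
    # First- and second-order differences (dynamic features)
    delta = np.gradient(ceps, axis=0)
    feats = np.hstack([ceps, delta, np.gradient(delta, axis=0)])
    # Per-utterance cepstral mean and variance normalization (CMVN)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)

In the actual system these operations are performed by the Kaldi scripts named above; the sketch only mirrors the order of operations.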
Single-phoneme model creation of acoustic model:
7. The previously extracted MFCC features are used to initialize the GMM models for the monophones.
8. Model training is iterated with the E-M algorithm, and the data are aligned.
9. The alignment model obtained from the previous training pass is iterated again until the model converges; a simplified sketch of one E-M update is given below.
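The sketch below illustrates one E-M update of a diagonal-covariance GMM on the frames currently aligned to a single HMM state, which is the core operation repeated during monophone training and realignment. It is a schematic re-implementation under simplifying assumptions, not Kaldi's estimation code; the class name DiagGMM is an assumption.

import numpy as np

class DiagGMM:
    # Diagonal-covariance GMM for one HMM state (illustrative)
    def __init__(self, weights, means, variances):
        self.w, self.mu, self.var = weights, means, variances   # shapes (K,), (K, D), (K, D)

    def log_likes(self, frames):
        # Per-component log(w_k * N(x; mu_k, diag(var_k))) for frames of shape (T, D)
        diff = frames[:, None, :] - self.mu[None, :, :]
        ll = -0.5 * (np.log(2.0 * np.pi * self.var)[None] + diff ** 2 / self.var[None]).sum(-1)
        return ll + np.log(self.w)[None]

    def em_step(self, frames, var_floor=1e-3):
        # E-step: posterior responsibility of each Gaussian component per frame
        ll = self.log_likes(frames)
        ll -= ll.max(axis=1, keepdims=True)
        gamma = np.exp(ll)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the accumulated statistics
        occ = gamma.sum(axis=0)
        self.w = occ / occ.sum()
        self.mu = (gamma.T @ frames) / occ[:, None]
        second = (gamma.T @ frames ** 2) / occ[:, None]
        self.var = np.maximum(second - self.mu ** 2, var_floor)

In practice, alignment assigns each frame to a state via Viterbi decoding before this per-state update is applied; alignment and re-estimation alternate until convergence.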
Three-phoneme model creation of acoustic model:
10. Because the actual pronunciation of a phoneme is influenced by its neighboring phonemes, especially through co-articulation within a word or sentence, a context-dependent acoustic model is introduced; among context-dependent models the best-performing is the triphone model, which is composed of <left neighboring phoneme + center phoneme + right neighboring phoneme>.
11. Triphone training is performed on the basis of the aligned monophone models and also requires the training data and documents such as the language dictionary. Meanwhile, to counter the explosion of model parameters caused by introducing triphones, the triphone models are pooled for similarity clustering and cutting, triphones with similar pronunciation are clustered into one model and share parameters, and the scale of the model parameters is effectively reduced. The process of training the triphone model after this is very similar to that of training the monophone model.
12. Meanwhile, to improve the recognition rate of the triphone model, feature transformation is carried out, consisting mainly of linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT) and speaker adaptive training (SAT). The speaker adaptive training is based on feature-space maximum likelihood linear regression (FMLLR), and the obtained FMLLR features are used for the subsequent neural network training. The feature transformation flow is shown in fig. 2, and a simplified LDA sketch is given below.
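As one illustration of the feature-transformation stage, the sketch below estimates an LDA projection from spliced, state-labeled frames using within-class and between-class scatter matrices. This is textbook LDA, not Kaldi's LDA+MLLT pipeline, and the output dimensionality of 40 is only an assumed example.

import numpy as np

def estimate_lda(feats, labels, out_dim=40):
    # feats: (N, D) spliced feature vectors; labels: (N,) aligned state ids
    mean_all = feats.mean(axis=0)
    d = feats.shape[1]
    within = np.zeros((d, d))
    between = np.zeros((d, d))
    for c in np.unique(labels):
        x = feats[labels == c]
        mu = x.mean(axis=0)
        within += (x - mu).T @ (x - mu)                  # within-class scatter
        diff = (mu - mean_all)[:, None]
        between += len(x) * (diff @ diff.T)              # between-class scatter
    # Leading eigenvectors of inv(S_w) @ S_b give the projection rows
    evals, evecs = np.linalg.eig(np.linalg.solve(within + 1e-6 * np.eye(d), between))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:out_dim]].T              # (out_dim, D) projection matrix

# Usage (hypothetical): lda = estimate_lda(spliced_feats, state_labels); projected = spliced_feats @ lda.T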
Training of neural networks
13. Before the neural network is formally trained, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form the DBN model. In Kaldi, neural network pre-training is accomplished by running pretrain_dbn.sh. During pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm combined with mini-batch stochastic gradient descent; the mini-batch size is 256, the momentum factor is set to 0.9, and no weight decay is used. The first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the subsequent RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08, 25 iterations, and 1024 hidden nodes in the experiments.
14. During fine-tuning, the back-propagation algorithm is adopted with an initial learning rate of 0.008 and the Sigmoid activation function. During training the learning rate is kept unchanged for the first 10 epochs; from the 11th epoch it is halved in each training epoch, for at most 30 iterations in total. Training is stopped when the difference in learning rate between two iterations is less than 0.001; if this condition is not met after 30 iterations, training is forcibly stopped. In this work the neural network consists of 1 input layer, 5 hidden layers and 1 output layer, 7 layers in total; the input layer has 440 nodes, the hidden layers have 1024 nodes each except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system. The DBN training flow diagram is shown in fig. 3 and the back-propagation algorithm diagram is shown in fig. 4; a CD-1 pre-training sketch is given below.
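The sketch below shows one contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM with the hyperparameters quoted above (1024 hidden nodes, learning rate 0.08, momentum 0.9, no weight decay, mini-batches of 256 frames). It is a schematic of the pre-training step, not the Kaldi nnet1 code; the class name BernoulliRBM is an assumption and is unrelated to any library class of the same name.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BernoulliRBM:
    # Minimal Bernoulli-Bernoulli RBM trained with CD-1 (illustrative only)
    def __init__(self, n_visible, n_hidden=1024, lr=0.08, momentum=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.bv = np.zeros(n_visible)
        self.bh = np.zeros(n_hidden)
        self.lr, self.momentum, self.rng = lr, momentum, rng
        self.vel = [np.zeros_like(p) for p in (self.W, self.bv, self.bh)]

    def cd1_update(self, v0):
        # Positive phase: hidden probabilities and a binary sample given the data
        h0_prob = sigmoid(v0 @ self.W + self.bh)
        h0 = (self.rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one Gibbs step of reconstruction
        v1_prob = sigmoid(h0 @ self.W.T + self.bv)
        h1_prob = sigmoid(v1_prob @ self.W + self.bh)
        n = v0.shape[0]
        grads = [(v0.T @ h0_prob - v1_prob.T @ h1_prob) / n,
                 (v0 - v1_prob).mean(axis=0),
                 (h0_prob - h1_prob).mean(axis=0)]
        # Momentum SGD update; no weight decay, matching the recipe above
        for i, (param, g) in enumerate(zip((self.W, self.bv, self.bh), grads)):
            self.vel[i] = self.momentum * self.vel[i] + self.lr * g
            param += self.vel[i]

# Usage (assumed): iterate cd1_update over mini-batches of 256 frames for the 25 iterations
# quoted above; the first layer would instead be a Gaussian-Bernoulli RBM with lr=0.005.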
15. After the network training is finished, the bottleneck features are extracted: the bottleneck (BN) layer is configured in the training script, the number of bottleneck-layer nodes bn_dim is set, and the make_bn_features.sh script is called; when the features have been extracted, the network behind the bottleneck layer is removed and the resulting bottleneck features are passed through CMVN. These features are later used to build the SGMM-HMM baseline system. Bottleneck feature extraction is shown in fig. 5, and a truncated-forward-pass sketch is given below.
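The following sketch conveys the idea of bottleneck feature extraction: run the trained network forward only up to the 40-node bottleneck layer (the fourth hidden layer in the structure above), never evaluate the layers behind it, and apply CMVN to the result. Variable names such as weights and biases are assumptions, and taking the bottleneck output before the nonlinearity is likewise an assumption about a recipe detail; this is a sketch, not make_bn_features.sh.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_bn_features(frames, weights, biases, bn_layer=4):
    # frames: (T, 440) spliced FMLLR input; weights/biases: per-layer parameter lists.
    # Only layers 1..bn_layer are evaluated; the layers after the bottleneck
    # (including the 2016-node softmax output) are never used.
    h = frames
    for i in range(bn_layer):
        h = h @ weights[i] + biases[i]
        if i < bn_layer - 1:
            h = sigmoid(h)           # hidden nonlinearities before the bottleneck
    bn = h                           # (T, 40) bottleneck activations (assumed extraction point)
    # CMVN before the features are fed to the SGMM-HMM system
    return (bn - bn.mean(axis=0)) / (bn.std(axis=0) + 1e-10)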
Training of BN-SGMM-HMM baseline system:
16. When the SGMM-HMM model is trained, an HMM model must first be trained to obtain the GMM-HMM state tying; the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM is then trained with the E-M algorithm to adjust its parameters. To complete the initialized SGMM model, two final E-M training steps are performed: the first step produces a Viterbi state alignment from the GMM-HMM baseline, and the second step uses the Viterbi alignment to obtain the SGMM model. Finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model. Each of the two steps is an iterative process, as shown in fig. 6.
17. When training the SGMM, a Gaussian clustering algorithm is first applied to the Gaussian mixtures of the acoustic model to generate the 400-component UBM. The E-M algorithm is then run on the generated UBM to obtain the SGMM model, and finally discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model, as shown in fig. 7. A sketch of the UBM clustering step is given below.
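The sketch below shows one plausible way to obtain a 400-component UBM from the diagonal Gaussians pooled over all tied states of the baseline acoustic model: weighted k-means over the component means followed by moment matching. The procedure and the function name cluster_to_ubm are illustrative assumptions; Kaldi's UBM initialization uses its own clustering.

import numpy as np

def cluster_to_ubm(weights, means, variances, num_ubm=400, iters=10, seed=0):
    # weights: (N,), means: (N, D), variances: (N, D), pooled from all tied states
    rng = np.random.default_rng(seed)
    centers = means[rng.choice(len(means), num_ubm, replace=False)].copy()
    for _ in range(iters):
        # Assign every source Gaussian to its nearest UBM center (by mean distance)
        d2 = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for k in range(num_ubm):
            sel = assign == k
            if sel.any():
                w = weights[sel][:, None]
                centers[k] = (w * means[sel]).sum(0) / w.sum()
    # Moment-match each cluster of Gaussians into a single diagonal Gaussian
    ubm_w, ubm_mu, ubm_var = np.zeros(num_ubm), centers.copy(), np.ones_like(centers)
    for k in range(num_ubm):
        sel = assign == k
        if not sel.any():
            continue
        w = weights[sel][:, None]
        ubm_w[k] = w.sum()
        mu = (w * means[sel]).sum(0) / w.sum()
        second = (w * (variances[sel] + means[sel] ** 2)).sum(0) / w.sum()
        ubm_mu[k], ubm_var[k] = mu, np.maximum(second - mu ** 2, 1e-3)
    return ubm_w / ubm_w.sum(), ubm_mu, ubm_var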
The hardware implementation:
18. Since Kaldi is generally developed on the PC side, it normally targets Intel's x86-64 instruction set, whereas the Raspberry Pi uses the ARM instruction set; if Kaldi were simply copied onto the flash memory card and compiled directly on the embedded platform, many dependency packages and compiled files would be missing. To keep the compilation process smooth, the Kaldi compilation is therefore performed on a virtual machine, and the final compiled files are stored on the Raspberry Pi.
19. In this patent, the PC virtual machine runs an Ubuntu 18.04 system and uses the official Raspberry Pi arm-linux-gnueabihf (x64 host) cross-compilation package. After the whole cross-compilation package has been moved into Ubuntu 18.04, the environment variables pointing to the cross-compilation package are added through the .bashrc file in Linux; once this is done, the environment variables of the current terminal are updated. Whether the Raspberry Pi cross-compilation environment has been configured is confirmed with the arm-linux-gnueabihf-gcc -v command. The cross-compilation flow diagram is shown in fig. 8; a small check script is sketched below.
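As a convenience, the environment check described above can be scripted. The sketch below is a hypothetical helper, not part of the patent's toolchain, that verifies the cross-compiler is reachable on the PATH by running arm-linux-gnueabihf-gcc -v.

import shutil
import subprocess

def check_cross_toolchain(gcc="arm-linux-gnueabihf-gcc"):
    # Returns True if the Raspberry Pi cross-compiler is installed and runnable
    if shutil.which(gcc) is None:
        print(f"{gcc} not found on PATH; check the environment variables set in .bashrc")
        return False
    result = subprocess.run([gcc, "-v"], capture_output=True, text=True)
    lines = result.stderr.strip().splitlines()      # 'gcc -v' reports its configuration on stderr
    if lines:
        print(lines[-1])                            # typically the 'gcc version ...' line
    return result.returncode == 0

if __name__ == "__main__":
    check_cross_toolchain()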
20. The trained acoustic model file, language model word-network file and dictionary file are transplanted to the Raspberry Pi 4B; these correspond to the Kaldi files 'final.mdl', 'HCLG.fst' and 'words.txt'. Speech is then input and decoded with Kaldi's built-in decoder, and the recognized text is finally output to the terminal. A block diagram of the online speech recognition system is shown in fig. 9, and an overall block diagram of the hardware implementation is shown in fig. 10.
Example 1:
The test set comprises 10 speakers in total; five sentences are provided, and each sentence is spoken 10 times. The numbers of inserted, deleted and substituted words in each sentence are counted, and the WER of each sentence is computed. The sentences tested are shown in Table 1.
Table 1: Test sentence content table
The statistical results are shown in Table 2: the WER of the speech recognition system remains essentially below 20%, with an average of about 14.92%. A sketch of a standard WER computation follows Table 2.
Table 2: Test sentence result statistics table

Claims (4)

1. The low-resource voice recognition method based on BN-SGMM-HMM is characterized by comprising the following steps:
1) Preprocessing and feature extraction of training data: the original database is configured and partitioned into sets, and feature extraction is then carried out to obtain MFCC features;
1.1 Preparing an original corpus, and setting a path of the corpus in a training script;
1.2 Executing a data preparation script, dividing the data into a training set, a test set and a development set, and generating the mapping relation between speaker number and speech, the gender of the speaker, and related information of the original speech files;
1.3 After the related information is generated, preparing a dictionary and a corresponding phoneme model, and finishing the data preparation;
1.4 Feature extraction is carried out on the speech signals; the extraction covers the training set, development set and test set, and the executed scripts are steps/make_mfcc.sh and compute_cmvn_stats.sh;
1.5 In make_mfcc.sh, pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering, log energy, and first-order and second-order difference computation to extract dynamic features are applied to transform the original speech into feature vectors;
1.6 After the features are obtained, the compute_cmvn_stats.sh file is executed to apply cepstral mean and variance normalization to the acoustic features, completing the feature extraction part;
2) Creating a single-phoneme acoustic model:
2.1 Using the previously extracted MFCC features to initialize the GMM models for the monophones;
2.2 The model training is iterated by adopting an E-M algorithm, and data alignment is carried out;
2.3 Performing iteration on the alignment model obtained by the last training until the model converges;
3) Creating a triphone acoustic model: obtaining FMLLR characteristics;
4) Training a neural network: the FMLLR features are used as input features of a bottleneck neural network; after the network is trained, the layers after the bottleneck layer are removed, and the bottleneck layer is used as the output layer to extract the cross-entropy-trained bottleneck features;
5) Training of BN-SGMM-HMM: taking bottleneck characteristics extracted from the neural network as input characteristics of an SGMM-HMM acoustic model, and finally forming BN-SGMM-HMM;
6) Hardware implementation: Kaldi is compiled on a virtual machine, and the compiled files are stored on the Raspberry Pi; the environment variables of the current terminal are updated; finally, it is confirmed whether the Raspberry Pi cross-compilation environment has been configured;
7) The trained acoustic model file, language model word-network file and dictionary file are transplanted to the Raspberry Pi; speech is input and decoded by Kaldi's built-in decoder, and the recognized text is finally output to the terminal.
2. The BN-SGMM-HMM based low-resource speech recognition method according to claim 1, wherein in the step 3), the specific method is:
3.1 Training is performed on the basis of the aligned monophone models; at the same time the corpus is processed to generate a language dictionary file, a speech path file, a speech-to-speaker mapping file and a phoneme file for each speech segment; the triphone models are pooled for similarity clustering and cutting, so that triphones with similar pronunciation are clustered into one model and share parameters; the triphone model is then trained by the same method used for the monophone model;
3.2 Feature transformation is performed, including linear discriminant analysis, maximum likelihood linear transformation and speaker adaptive training; the speaker adaptive training is based on FMLLR, and the obtained FMLLR features are used for the subsequent neural network training.
3. The BN-SGMM-HMM based low-resource speech recognition method according to claim 1, wherein in the step 4), the specific method is:
4.1 Before the neural network is formally trained, unsupervised training of restricted Boltzmann machines (RBMs) is used for layer-by-layer initialization, and the trained RBMs are stacked to form a DBN model; in Kaldi, neural network pre-training is accomplished by running pretrain_dbn.sh; during pre-training, all RBMs are trained with the contrastive divergence (CD) algorithm combined with mini-batch stochastic gradient descent, the mini-batch size is 256, the momentum factor is set to 0.9, and no weight decay is used; the first RBM uses Gaussian-Bernoulli units with a learning rate of 0.005 and 50 iterations; the subsequent RBMs use Bernoulli-Bernoulli units with a learning rate of 0.08 and 25 iterations, and the number of hidden-layer nodes in the experiments is 1024;
4.2 During fine-tuning, the back-propagation algorithm is adopted with an initial learning rate of 0.008 and the Sigmoid activation function; during training the learning rate is kept unchanged for the first 10 epochs, and from the 11th epoch it is halved in each training epoch, for at most 30 iterations in total; training is stopped when the difference in learning rate between two iterations is smaller than 0.001, and if this condition is not met after 30 iterations training is forcibly stopped; the neural network consists of 1 input layer, 5 hidden layers and 1 output layer, 7 layers in total, where the input layer has 440 nodes, the hidden layers have 1024 nodes each except the fourth hidden layer, which has 40 nodes, and the output layer has 2016 nodes, corresponding to the number of clustered triphone states in the GMM-HMM baseline system;
4.3 After the network training is finished, the bottleneck features are extracted: the bottleneck (BN) layer is configured in the training script and the number of bottleneck-layer nodes bn_dim is set, then the make_bn_features.sh script is called; when the features have been extracted, the network behind the bottleneck layer is removed and the resulting bottleneck features are passed through CMVN; the extracted bottleneck features are used to build the SGMM-HMM baseline system.
4. The BN-SGMM-HMM based low-resource speech recognition method according to claim 1, wherein in the step 5), the specific method is:
5.1 When training the SGMM-HMM model, an HMM model is first trained to obtain the GMM-HMM state tying, the Gaussians of the acoustic model are clustered to generate a 400-component UBM, and the UBM is then trained with the E-M algorithm to adjust its parameters; to complete the initialized SGMM model, two final E-M training steps are performed: the first step produces a Viterbi state alignment from the GMM-HMM baseline, and the second step uses the Viterbi alignment to obtain the SGMM model; finally, discriminative training based on the MMI criterion yields the final BN-SGMM-HMM model.
CN202110897247.8A 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method Active CN113421555B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110897247.8A CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110897247.8A CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Publications (2)

Publication Number Publication Date
CN113421555A CN113421555A (en) 2021-09-21
CN113421555B (en) 2024-04-12

Family

ID=77718952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110897247.8A Active CN113421555B (en) 2021-08-05 2021-08-05 BN-SGMM-HMM-based low-resource voice recognition method

Country Status (1)

Country Link
CN (1) CN113421555B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2736042A1 (en) * 2012-11-23 2014-05-28 Samsung Electronics Co., Ltd Apparatus and method for constructing multilingual acoustic model and computer readable recording medium for storing program for performing the method
WO2015085197A1 (en) * 2013-12-05 2015-06-11 Nuance Communications, Inc. Method and apparatus for speech recognition using neural networks with speaker adaptation
CN109192199A (en) * 2018-06-30 2019-01-11 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of combination bottleneck characteristic acoustic model
CN109545201A (en) * 2018-12-15 2019-03-29 中国人民解放军战略支援部队信息工程大学 The construction method of acoustic model based on the analysis of deep layer hybrid cytokine

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Improved bottleneck features using pretrained deep neural networks; Yu D, Seltzer M L.; INTERSPEECH; full text *
Speech recognition system based on bottleneck features and SGMM models under low data resource conditions; Wu Weilan; Cai Meng; Tian Yao; Yang Xiaohao; Chen Zhenfeng; Liu Jia; Xia Shanhong; Journal of University of Chinese Academy of Sciences (No. 01); full text *
Research on the application of Kaldi-based AI speech recognition in embedded systems; Peng Yanzi; Bai Jie; Cao Bingyao; Song Yingxiong; Industrial Control Computer (No. 09); full text *
Speech recognition method with hybrid language models based on MTL-DNN system fusion; Fan Zhengguang; Qu Dan; Li Hua; Zhang Wenlin; Journal of Data Acquisition and Processing (No. 05); full text *

Also Published As

Publication number Publication date
CN113421555A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
Cooper et al. Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings
Xue et al. Fast adaptation of deep neural network based on discriminant codes for speech recognition
Fan et al. Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
Chauhan et al. Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large input database
US10629185B2 (en) Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for building deep neural network, and computer program for adapting statistical acoustic model
Reynolds et al. The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition
CN101777347B (en) Model complementary Chinese accent identification method and system
Garcia-Romero et al. Stacked Long-Term TDNN for Spoken Language Recognition.
Vadwala et al. Survey paper on different speech recognition algorithm: challenges and techniques
Liu et al. Improving unsupervised style transfer in end-to-end speech synthesis with end-to-end speech recognition
Wang et al. One-shot voice conversion using star-gan
Dumpala et al. Improved speaker recognition system for stressed speech using deep neural networks
Huang et al. Linear networks based speaker adaptation for speech synthesis
Maghsoodi et al. Speaker recognition with random digit strings using uncertainty normalized HMM-based i-vectors
Li et al. Multi-task learning of structured output layer bidirectional LSTMs for speech synthesis
Du et al. Noise-robust voice conversion with domain adversarial training
Shahamiri et al. An investigation towards speaker identification using a single-sound-frame
Hu et al. Fusion of global statistical and segmental spectral features for speech emotion recognition.
CN113421555B (en) BN-SGMM-HMM-based low-resource voice recognition method
Pan et al. Online speaker adaptation for LVCSR based on attention mechanism
CN111933121B (en) Acoustic model training method and device
Matassoni et al. DNN adaptation for recognition of children speech through automatic utterance selection
Dey et al. Content normalization for text-dependent speaker verification
Su Combining speech and speaker recognition: A joint modeling approach
Yamagishi et al. Improved average-voice-based speech synthesis using gender-mixed modeling and a parameter generation algorithm considering GV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant