CN112951206A - Spoken language recognition method for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network - Google Patents

Spoken language recognition method for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network

Info

Publication number
CN112951206A
CN112951206A (application CN202110183564.3A; granted as CN112951206B)
Authority
CN
China
Prior art keywords
Tibetan
model
language
data set
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110183564.3A
Other languages
Chinese (zh)
Other versions
CN112951206B (en)
Inventor
Wei Jianguo (魏建国)
He Ming (何铭)
Xu Junhai (徐君海)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110183564.3A
Publication of CN112951206A
Application granted
Publication of CN112951206B
Legal status: Active (granted)


Classifications

    • GPHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/005 — Language recognition
    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/142 — Hidden Markov Models [HMMs]
    • G10L15/144 — Training of HMMs
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/26 — Speech-to-text systems
    • G10L25/24 — Extracted parameters being the cepstrum
    • G10L25/69 — Specially adapted for evaluating synthetic or decoded voice signals
    • G10L2015/025 — Phonemes, fenemes or fenones being the recognition units


Abstract

The invention relates to the technical fields of deep learning, signal processing, speech recognition, feature extraction, phonetics and the like, and aims to improve the overall performance of a spoken-language recognition model for the Ü-Tsang dialect of Tibetan in spoken-language application scenarios. The method is mainly applied to spoken-language recognition of the Ü-Tsang dialect of Tibetan.

Description

Spoken language recognition method for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network
Technical Field
The invention relates to the technical fields of deep learning, signal processing, speech recognition, feature extraction, phonetics and the like. It combines data augmentation and deep neural network techniques, and trains and tunes the main acoustic and language models specifically for spoken-language application scenarios of the Ü-Tsang dialect of Tibetan, thereby building a more effective speech recognition system for spoken Ü-Tsang Tibetan.
Background
In recent years, artificial intelligence has become a frontier and research hotspot of the technology industry, and various AI techniques have gradually been deployed and entered people's daily lives; speech recognition is one of the most important of these technical fields. Speech recognition technology allows a computer to hear human speech and convert it into the corresponding text. Its development is changing the way humans interact with computers and brings convenience to everyday life.
Speech recognition has achieved good results for resource-rich languages such as Mandarin Chinese and English. However, its development still lags for languages such as Tibetan and Uyghur and for local dialects. The reason is that these languages have comparatively few speakers, so the corpus resources needed for technical research are hard to obtain and expensive, and building a speech recognition system often requires specialist knowledge of the language and its phonetics, so researchers working on speech recognition for these languages are scarce. In existing research on Tibetan speech recognition, because acoustic and text data are scarce, acoustic and language models trained directly on Tibetan acoustic data perform poorly. Some technical schemes first train a base model with acoustic data of other languages, such as Chinese or English, and then tune the network parameters with Tibetan acoustic data, which can improve model performance. However, the pronunciation characteristics of the source languages differ from those of Tibetan, so the results obtained by such schemes still leave room for improvement.
Research on Tibetan speech recognition is of great significance for improving the living conditions of residents in Tibetan areas, advancing Tibetan-language informatization, and promoting cultural exchange among ethnic groups. In view of the current scarcity of Tibetan-language resources and the relative lag in Tibetan speech recognition technology, the invention provides a technical scheme for building a spoken-language recognition model for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a novel model-building scheme for a Tibetan speech recognition system that targets spoken-language application scenarios of Tibetan and improves the overall performance of the spoken Tibetan recognition model. In the proposed spoken-language recognition method for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network, an audio data set mixing three Tibetan dialects is adopted, and the original audio data set is expanded by speed perturbation, noise addition and reverberation. Based on the chain model of the open-source speech recognition toolkit kaldi, a deep time-delay neural network is trained on the expanded data set as a general Tibetan acoustic model, and the Ü-Tsang-dialect portion of the audio data is then used to train the acoustic model a second time, yielding a deep time-delay neural network acoustic model for the Ü-Tsang dialect. Based on the existing limited text resources, N-gram language models for two different content domains are trained separately from spoken-language Tibetan texts and news Tibetan texts, interpolated at a ratio of 1:1, and the size of the interpolated model is controlled with pruning to obtain a Tibetan N-gram language model. The deep time-delay neural network acoustic model, the N-gram language model and a pronunciation dictionary are combined into a decoder. The two language models for the spoken and news content domains are then re-interpolated at a ratio of 8:2 to obtain a new language model biased toward the spoken domain, which re-scores the intermediate decoding results so that the overall language model better matches the grammatical habits of spoken language. The trained time-delay neural network acoustic model, the N-gram language model and the re-scoring language model are combined into one system to obtain the final spoken-domain speech recognition model for the Ü-Tsang dialect of Tibetan.
The method comprises the following specific steps:
step one, preparing a Tibetan language audio data set, and expanding the Tibetan language audio data set by using an augmentation technology;
step two, extracting acoustic features: Mel-frequency cepstral coefficient (MFCC) features plus pitch features carrying pitch-period information are used. Specifically, when training the Gaussian mixture model (GMM) acoustic model, 13-dimensional low-resolution MFCC features and 3-dimensional pitch features are extracted; when training the time-delay neural network (TDNN) acoustic model, 40-dimensional high-resolution MFCC features plus 3-dimensional pitch features are used, and in addition a 100-dimensional i-vector (Identity Vector) feature is used;
step three, training an acoustic model, wherein the operation process comprises the following two aspects:
firstly, training a hidden Markov model-Gaussian mixture model (HMM-GMM) acoustic model, using the extracted 13-dimensional MFCC features plus pitch features for 16 dimensions in total, and using kaldi to train monophone and triphone GMM acoustic models;
secondly, aligning training data by using a trained triphone GMM model, and providing phoneme-level alignment information for later TDNN model training;
step four, constructing a Tibetan general acoustic model by using a deep time delay neural network structure;
step five, performing a second round of training on the basis of the trained acoustic model using the Ü-Tsang-dialect portion of the acoustic data set to obtain the final acoustic model for the Ü-Tsang dialect of Tibetan;
step six, training a language model of Tibetan, specifically training an N-gram Tibetan language model by using collected Tibetan text data, wherein N is 5, namely 5-gram, and the operation steps comprise the following two aspects:
firstly, dividing the existing Tibetan texts by content domain into a spoken-language class and a news-information class, training two corresponding 5-gram Tibetan language models on the two text portions, then interpolating the two language models at a ratio of 1:1 and pruning appropriately, with a perplexity (ppl) threshold controlling the model size, finally obtaining the 1:1 Tibetan language model;
secondly, in order to make the trained Tibetan language model better suited to spoken-language scenarios, interpolating again on the basis of the two 5-gram Tibetan language models trained on the spoken and news texts, but with the ratio adjusted to spoken : news = 8:2, generating a new 8:2 Tibetan language model biased toward spoken-language application scenarios;
combining the Tibetan dialect pronunciation dictionary with the trained Tibetan dialect acoustic model, the spoken language model and the Tibetan language 5-gram language model interpolated according to the ratio of 1:1 to generate a voice recognition decoder, and decoding a frame of acoustic feature sequence extracted from the audio file to be recognized;
and step eight, re-scoring the intermediate result by using the new Tibetan 5-gram language model obtained by interpolation of the two language models of the spoken language and the news information according to the ratio of 8:2 to obtain the finally identified Tibetan text sequence.
The operation process of step one is divided into the following aspects:
first, acoustic training data are prepared. The Tibetan audio data set used is divided into two parts: one part is a small audio data set of the Ü-Tsang dialect of Tibetan, about 36 hours long; the other part is an audio data set containing all three Tibetan dialects, namely Ü-Tsang, Kham and Amdo, about 200 hours long. When the two original data sets are merged, the 36-hour data set is kept and duplicated once (doubled);
secondly, dividing the resulting overall acoustic data set into a training data set and a test data set;
thirdly, applying speed perturbation to the training set of the audio data set: on the basis of the original audio at 1.0x speaking rate, a tool provided by the kaldi speech recognition toolkit applies 0.9x and 1.1x speed perturbation, expanding the data set to 3 times its original size; volume perturbation is also added to the audio;
fourthly, applying noise and reverberation processing to the training set: on the speed-perturbed audio data set, noise and reverberation are added to each audio file, and the noisy, reverberant audio is combined with the original clean audio to form a mixed audio data set.
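The 0.9x/1.0x/1.1x speed-perturbation step above, which triples the training set, can be sketched as follows. This is a minimal illustration using linear-interpolation resampling; the actual kaldi tooling performs the perturbation with sox, so the function here is hypothetical.

```python
import numpy as np

def speed_perturb(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform to simulate a speaking-rate change.

    factor > 1.0 speeds the audio up (fewer samples); factor < 1.0
    slows it down. This linear-interpolation version is only a sketch
    of what sox's `speed` effect does in the kaldi recipe.
    """
    n_out = int(round(len(samples) / factor))
    # positions in the original signal to sample at
    positions = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(positions, np.arange(len(samples)), samples)

# One utterance becomes three (0.9x, the 1.0x original, and 1.1x),
# tripling the training set as described in step one.
utt = np.random.randn(16000)          # 1 s of audio at 16 kHz
augmented = [speed_perturb(utt, f) for f in (0.9, 1.0, 1.1)]
```

Volume perturbation is simply an extra random gain applied to each resulting file.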
Step four specifically comprises the following two aspects:
firstly, designing the network structure of the deep neural network acoustic model: a time-delay neural network (TDNN) structure with 16 hidden layers of 625 neurons each, using the chain model in the kaldi speech recognition toolkit;
secondly, training the TDNN acoustic model on the augmented Tibetan acoustic training data set, combined with the phoneme-level alignment information obtained by aligning the training acoustic data with the previous triphone GMM model tri5, so that the network learns the mapping from frame-level acoustic features to the corresponding phoneme probabilities.
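A single time-delay layer of the kind stacked sixteen deep in step four can be sketched as a splice-plus-affine operation over neighbouring frames. This is an illustrative numpy sketch, not kaldi's chain-model implementation; the symmetric context, clamped edge handling and random weights are assumptions.

```python
import numpy as np

def tdnn_layer(frames, weight, bias, context=(-1, 0, 1)):
    """One time-delay layer: splice neighbouring frames, then affine + ReLU.

    frames: (T, d_in) matrix of frame-level features.
    weight: (d_out, d_in * len(context)) — 625 output units in the patent.
    Edge frames are handled by clamping indices (an assumption; kaldi's
    chain models handle temporal context via layer-wise subsampling).
    """
    T, d = frames.shape
    spliced = np.concatenate(
        [frames[np.clip(np.arange(T) + c, 0, T - 1)] for c in context], axis=1)
    return np.maximum(spliced @ weight.T + bias, 0.0)   # ReLU

rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 143))       # 40 MFCC + 3 pitch + 100 i-vector
w = rng.standard_normal((625, 143 * 3)) * 0.01
h = tdnn_layer(feats, w, np.zeros(625))
# stacking 16 such layers yields the deep TDNN described in step four
```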
The invention has the characteristics and beneficial effects that:
the invention relates to a novel Tibetan language and Tibetan dialect speech recognition model technical scheme, which has the following advantages compared with the prior Tibetan language recognition model technical scheme:
firstly, when a Time Delay Neural Network (TDNN) is trained as an acoustic model, a data amplification scheme of 'speed disturbance + noise and reverberation' is adopted on the basis of existing acoustic data, and volume disturbance is added, so that the training data volume is expanded, the diversity of data is increased, overfitting of acoustic model training can be prevented to a certain extent, a larger-scale neural network can be established for classification, and the identification precision and generalization performance of the whole identification system model are increased.
Secondly, a two-stage training scheme is adopted to train the deep TDNN acoustic model. In the first stage, a data set mixing the three dialects is used; because the differences among the three Tibetan dialects are smaller than those between Tibetan and other languages such as Chinese or English, training on this larger data set can already optimize the network parameters to a relatively good level. In the second stage, starting from the parameters trained in the first stage, the model is optimized in a targeted way using the Ü-Tsang-dialect portion of the training data, further improving recognition of the Ü-Tsang dialect so that the final model recognizes it better. Compared with a model trained only on the small-scale Ü-Tsang data, this scheme achieves better final performance.
Thirdly, to address the shortage of language-model training text in the target spoken-language domain, the invention interpolates language models from different domains, builds a smaller general language model by model pruning to take part in decoder construction, and then uses language-model re-scoring: a larger language model biased toward the spoken-language domain re-scores the decoding results and re-selects the decoding path, improving language-model performance and the recognition accuracy of the whole system. Compared with using a large spoken-domain Tibetan language model in the decoder from the start, this scheme loses little performance while saving time and space; the loss can be compensated by re-scoring with the spoken-domain language model. It also makes it convenient to optimize the recognition model toward a particular domain, greatly enhancing the domain extensibility of the Tibetan recognition system.
Description of the drawings:
FIG. 1 is the framework of the speech recognition system for the Ü-Tsang dialect of Tibetan.
Fig. 2 illustrates a TDNN acoustic model used in the present embodiment.
FIG. 3 is a Tibetan language model training scheme.
Fig. 4 training scheme of TDNN acoustic model.
Detailed Description
The invention mainly targets spoken-language application scenarios of the Ü-Tsang dialect of Tibetan. It adopts a deep time-delay neural network combined with a hidden Markov model, builds the acoustic model on the chain model of the kaldi speech recognition toolkit, trains an N-gram language model from the limited Tibetan text resources, and improves the overall performance of the spoken Ü-Tsang recognition model with techniques such as cross-domain language-model interpolation and language-model re-scoring. The innovation of the invention lies in two aspects. For acoustic modeling, given that the Ü-Tsang speech data set is small, the recognition model adopts an audio data set mixing three Tibetan dialects, expands the original audio data set by speed perturbation, noise addition and reverberation, and, based on the kaldi chain model, trains a deep time-delay neural network on the expanded data set as the first-stage general Tibetan acoustic model; the Ü-Tsang-dialect portion of the audio data is then used to train the acoustic model a second time, obtaining an acoustic model that performs better on the Ü-Tsang dialect.
For the language model, the invention interpolates language models from different domains. Based on the existing limited text resources, N-gram language models for two content domains (daily spoken content and news content) are trained separately from spoken-language Tibetan texts and news Tibetan texts and interpolated at a ratio of 1:1, with pruning controlling the model size, to obtain a small, general language model; this model is combined with the acoustic model, the pronunciation dictionary and other components to form a decoder. A larger language model biased toward the spoken domain, obtained by interpolating at a ratio of 8:2, is then used for re-scoring, so that the overall language model better matches the grammatical habits of spoken language, further improving language-model performance. The trained time-delay neural network acoustic model, the N-gram language model and the re-scoring module are combined into one system to obtain the final spoken-domain speech recognition model for Tibetan. Decoding tests of this system on the Ü-Tsang test set achieve better performance than traditional methods.
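The two interpolation ratios described above (1:1 for the decoding language model, 8:2 for the re-scoring model) amount to a linear mixture of n-gram probabilities. A toy unigram sketch follows; in practice a toolkit such as SRILM performs the interpolation over full 5-gram tables, and the vocabulary here is made up.

```python
def interpolate(p_spoken: dict, p_news: dict, lam: float) -> dict:
    """Linearly interpolate two n-gram probability tables.

    lam is the weight on the spoken-language model: 0.5 reproduces the
    1:1 decoding LM, 0.8 the 8:2 re-scoring LM biased toward spoken text.
    """
    vocab = set(p_spoken) | set(p_news)
    return {w: lam * p_spoken.get(w, 0.0) + (1 - lam) * p_news.get(w, 0.0)
            for w in vocab}

spoken = {"hello": 0.6, "news": 0.1, "today": 0.3}
news   = {"hello": 0.1, "news": 0.6, "today": 0.3}
decode_lm  = interpolate(spoken, news, 0.5)   # 1:1, used in the decoder
rescore_lm = interpolate(spoken, news, 0.8)   # 8:2, used for re-scoring
```

Because a convex combination of probability distributions is itself a distribution, both interpolated models remain properly normalized.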
The technical scheme of the deep time-delay neural network-based spoken language recognition model for the Ü-Tsang dialect of Tibetan mainly comprises the following 8 steps:
step one, preparing a Tibetan language audio data set and expanding the Tibetan language audio data set by using an augmentation technology. The operation process is mainly divided into the following aspects:
First, acoustic training data are prepared. The Tibetan audio data set used by the invention is divided into two parts: one part is a small audio data set of the Ü-Tsang dialect of Tibetan, about 36 hours long; the other part is an audio data set containing the three Tibetan dialects Ü-Tsang, Kham and Amdo, about 200 hours long. Since the 36-hour data set is small and its content domain differs from that of the 200-hour data set, to prevent neural network training from being biased too far toward the 200-hour content domain, the 36-hour data set is duplicated once (doubled) when the two original data sets are merged.
Secondly, the whole acoustic data set obtained above is divided into a training data set and a testing data set.
Third, a training set of audio data sets is subjected to velocity perturbation. On the basis of 1.0 time of speech speed of the original audio, the tool provided by the kaldi voice recognition tool box is used for carrying out 0.9 and 1.1 time speed disturbance, and the data set is expanded to 3 times of the original data set. In addition, volume disturbances are added to the audio.
Fourth, a training set of audio data sets is subjected to noise plus reverberation processing. And on the audio data set after the speed disturbance, adding noise and reverberation to each audio file, and combining the audio after the noise and the reverberation are added with the original clean audio to form a mixed audio data set.
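The train/test division in the second aspect above must keep the two sets disjoint; as detailed later, whole speakers are held out so that no test speaker appears in training. A minimal sketch, in which the speaker IDs and utterance structure are illustrative:

```python
def split_by_speaker(utterances, test_speakers):
    """Hold out all utterances of the chosen speakers as the test set,
    so the training and test sets never share a speaker (the patent
    holds out one male and one female Ü-Tsang speaker)."""
    train = [u for u in utterances if u["speaker"] not in test_speakers]
    test  = [u for u in utterances if u["speaker"] in test_speakers]
    return train, test

# toy corpus: three speakers with three utterances each
utts = [{"speaker": s, "utt_id": f"{s}_{i}"}
        for s in ("f01", "m01", "f02") for i in range(3)]
train, test = split_by_speaker(utts, {"f01", "m01"})
```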
And step two, extracting acoustic features. Since the Ü-Tsang dialect is a tonal language, MFCC features plus pitch features are used here. When training the GMM acoustic model, 13-dimensional low-resolution MFCC features and 3-dimensional pitch features are extracted; when training the TDNN acoustic model, 40-dimensional high-resolution MFCC features plus 3-dimensional pitch features are used. In addition, a 100-dimensional i-vector feature is also used.
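The feature dimensionalities of step two can be checked with a small sketch: 13 MFCC + 3 pitch = 16 dimensions for the GMM stage, and 40 MFCC + 3 pitch + 100 i-vector = 143 dimensions per frame for the TDNN stage. Appending the utterance-level i-vector to every frame is an assumption about how the features are combined (it is how kaldi's nnet3 recipes commonly feed i-vectors in).

```python
import numpy as np

def assemble_features(mfcc, pitch, ivector=None):
    """Concatenate per-frame MFCC and pitch features; optionally append
    the utterance-level i-vector to every frame."""
    feats = np.hstack([mfcc, pitch])
    if ivector is not None:
        feats = np.hstack([feats, np.tile(ivector, (len(mfcc), 1))])
    return feats

T = 200                                              # frames in the utterance
gmm_feats  = assemble_features(np.zeros((T, 13)), np.zeros((T, 3)))
tdnn_feats = assemble_features(np.zeros((T, 40)), np.zeros((T, 3)),
                               ivector=np.zeros(100))
```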
And step three, training an acoustic model. The operation process comprises the following two aspects:
first, the HMM-GMM acoustic model is trained. The technical scheme of the Tibetan language identification system provided by the invention uses the extracted 13-dimensional MFCC features and pitch features, and has 16 dimensions. The GMM acoustic models of the monophonic (monophone) and triphone (triphone) models were trained using kaldi.
Secondly, aligning training data by using the trained triphone GMM model, and providing phoneme-level alignment information for later TDNN model training.
And step four, constructing a Tibetan general acoustic model using a deep time-delay neural network structure. The operation comprises the following two aspects:
first, a network structure of a deep neural network acoustic model is designed. The acoustic model of the invention adopts a Time Delay Neural Network (TDNN) structure, the number of hidden layers is 16, each layer has 625 neurons, and a chain model in a kaldi voice recognition toolbox is adopted.
Secondly, a TDNN acoustic model is trained using the augmented Tibetan acoustic training data sets (the two data sets covering the three Tibetan dialects Ü-Tsang, Amdo and Kham), combined with the phoneme-level alignment information obtained by aligning the training acoustic data with the previous triphone GMM model (denoted the tri5 model), so that the network learns the mapping from frame-level acoustic features to the corresponding phoneme probabilities.
And step five, a second round of training is performed on the trained acoustic model using the Ü-Tsang-dialect portion of the acoustic data set. This training keeps the original acoustic model structure, namely a 16-layer TDNN with 625 nodes per layer. Because the initial model has already been trained on data from the three Tibetan dialects, the result is better than that of a model randomly initialized from scratch; accordingly, the learning rate is appropriately reduced in this round of training.
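The two-stage scheme of steps four and five — train on the mixed-dialect set, then fine-tune on the Ü-Tsang subset with a reduced learning rate — can be illustrated on a toy least-squares problem. The real system optimizes kaldi's chain objective; everything below (the objective, the data, the learning rates) is a stand-in to show the pretrain-then-fine-tune pattern.

```python
import numpy as np

def train(weights, data, lr, steps=100):
    """Toy gradient-descent loop on a least-squares objective, standing
    in for TDNN training."""
    X, y = data
    for _ in range(steps):
        grad = X.T @ (X @ weights - y) / len(y)
        weights = weights - lr * grad
    return weights

rng = np.random.default_rng(1)
X_all, w_true = rng.standard_normal((500, 5)), np.arange(5.0)
y_all = X_all @ w_true
X_uts, y_uts = X_all[:80], y_all[:80]        # small "Ü-Tsang-only" subset

# Stage 1: train on the full mixed-dialect set at the normal rate.
w = train(np.zeros(5), (X_all, y_all), lr=0.1)
# Stage 2: fine-tune on the Ü-Tsang subset with a reduced rate, so the
# pretrained parameters are only nudged rather than overwritten.
w = train(w, (X_uts, y_uts), lr=0.01)
```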
And step six, training a language model of the Tibetan language. The invention trains the N-gram Tibetan language model by using the collected Tibetan text data. The value of N in the model is 5, namely 5-gram. The operation steps comprise the following two aspects:
First, the existing Tibetan texts are divided by content domain into a spoken-language class and a news-information class. Two corresponding 5-gram Tibetan language models are then trained on the two text portions. The two language models are then interpolated at a 1:1 ratio and appropriately pruned, with a perplexity (ppl) threshold controlling the model size.
Secondly, in order to make the trained Tibetan language model better applicable to spoken-language scenarios, interpolation is performed again on the basis of the two language models trained on the spoken and news texts, but with the ratio adjusted to spoken : news = 8:2, generating a new Tibetan language model biased toward spoken-language application scenarios.
And step seven, the trained Ü-Tsang acoustic model, the Tibetan language model and the Ü-Tsang pronunciation dictionary are combined to generate a speech recognition decoder, which decodes the frame-level acoustic feature sequence extracted from the audio file to be recognized.
And step eight, using the language model which is interpolated according to the ratio of 8:2 of the spoken language type and the news information type, and performing language model re-scoring on the intermediate result obtained by decoding to obtain the finally identified Tibetan text sequence.
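Step eight replaces the decoder's language-model scores with those of the 8:2 spoken-biased model when ranking hypotheses. An n-best re-scoring sketch follows; the real system re-scores the decoder's intermediate lattice, and the scores, weights and vocabulary here are made up.

```python
import math

def rescore(nbest, new_lm, lm_weight=1.0):
    """Re-rank decoder hypotheses with a second language model.

    nbest: list of (words, acoustic_logprob, old_lm_logprob) triples.
    The old LM score is discarded and replaced by the new (spoken-biased)
    model's score, as in step eight.
    """
    def total(hyp):
        words, am, _old_lm = hyp
        lm = sum(math.log(new_lm.get(w, 1e-10)) for w in words)
        return am + lm_weight * lm
    return max(nbest, key=total)[0]

# toy unigram LM biased toward conversational words
spoken_lm = {"hi": 0.5, "there": 0.4, "hear": 0.05, "hairy": 0.05}
nbest = [(["hi", "hairy"], -10.0, -3.0),   # best under the old LM
         (["hi", "there"], -10.5, -6.0)]
best = rescore(nbest, spoken_lm)           # spoken-biased LM flips the ranking
```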
The main contents of the technical scheme for constructing the spoken-language speech recognition system for the Ü-Tsang dialect of Tibetan are as above.
The construction of the deep time-delay neural network-based spoken language recognition model for the Ü-Tsang dialect of Tibetan is mainly based on a Linux experimental environment and the kaldi speech recognition toolkit, and some steps require a GPU to accelerate computation. The specific implementation comprises the following steps:
step one, a Tibetan language audio data set is prepared and then expanded by using an augmentation technology. The operation process is mainly divided into the following aspects:
First, acoustic training data are prepared. The Tibetan audio data set used in the model training experiment is divided into two parts: one part is a small audio data set of the Ü-Tsang dialect of Tibetan, about 36 hours long; the other part is an audio data set containing the three Tibetan dialects Ü-Tsang, Kham and Amdo, about 200 hours long. Since the 36-hour data set is small and its content domain differs from that of the 200-hour data set, to prevent neural network training from being biased too far toward the 200-hour content domain, the 36-hour data set is duplicated once (doubled) when the two original data sets are merged. Care should be taken when preparing experimental data to remove data that are non-standard, poorly recorded, or whose audio does not match the transcribed text.
Second, divide the whole acoustic data set into a training set and a test set, which must not overlap. The test set consists of Ü-Tsang dialect sentences: all utterances of two randomly selected Ü-Tsang speakers (one male, one female), about 2000 short sentences in total. Note that because the GMM acoustic models trained later use only the training set before data augmentation, a list of the pre-augmentation training set should be kept here.
Third, apply speed perturbation to the training-set audio. Starting from the original audio at 1.0x speed, the tools provided by the Kaldi speech recognition toolkit are used to create 0.9x and 1.1x versions, expanding the data set to 3 times its original size. Volume perturbation is also applied to the audio files.
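Conceptually, speed perturbation resamples the waveform so it plays back faster or slower; Kaldi implements this with its speed-perturbation scripts, and the sketch below (simple linear-interpolation resampling) is only an illustration of the idea, not the toolkit's implementation.

```python
import math

def speed_perturb(samples, factor):
    """Resample so playback is `factor` times faster (0.9x and 1.1x in
    the patent); linear interpolation stands in for Kaldi's resampler."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out

def volume_perturb(samples, gain):
    """Scale the waveform amplitude by a constant gain."""
    return [s * gain for s in samples]

# one second of a 440 Hz tone at 16 kHz
wave = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
slow = speed_perturb(wave, 0.9)   # longer than the original
fast = speed_perturb(wave, 1.1)   # shorter than the original
loud = volume_perturb(wave, 2.0)
```

Together with the unchanged 1.0x copy, the 0.9x and 1.1x versions triple the data set, matching the 3x expansion described above.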
Fourth, apply noise and reverberation to the training-set audio. On the speed-perturbed data set, noise and reverberation are added to each wav file, and the noisy and reverberant audio is combined with the original clean audio to form a mixed data set. The noise comes from a fixed noise audio library and may be either foreground or background noise; the reverberation comes from a database of impulse responses simulated for rooms of different sizes. The noise and reverberation added to each clean Tibetan audio file are assigned at random.
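The two operations can be sketched as follows: noise is mixed in at a target signal-to-noise ratio, and reverberation is a convolution with a room impulse response (RIR). This is a toy illustration under assumed signals; the actual corpus uses a fixed noise library and simulated RIR database as stated above.

```python
import math

def add_noise(clean, noise, snr_db):
    """Mix noise into clean speech at a target SNR in dB."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(s * s for s in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    # tile the noise if it is shorter than the speech
    return [c + scale * noise[i % len(noise)] for i, c in enumerate(clean)]

def add_reverb(clean, rir):
    """Convolve the signal with a room impulse response (direct-form FIR)."""
    out = [0.0] * (len(clean) + len(rir) - 1)
    for i, c in enumerate(clean):
        for j, h in enumerate(rir):
            out[i + j] += c * h
    return out

clean = [math.sin(0.1 * t) for t in range(200)]
noise = [0.3, -0.2, 0.1, -0.4] * 50
noisy = add_noise(clean, noise, snr_db=10)
reverbed = add_reverb(clean, rir=[1.0, 0.0, 0.3, 0.1])
```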
Step two: extract acoustic features from the audio data set. Since the Ü-Tsang dialect is a tonal language, MFCC features plus pitch features are used. When training the GMM acoustic model, 13-dimensional low-resolution MFCC features plus 3-dimensional pitch features are extracted; when training the TDNN acoustic model, 40-dimensional high-resolution MFCC features plus 3-dimensional pitch features are used. In addition, a 100-dimensional i-vector feature is used.
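The per-frame input to the TDNN can be pictured as a concatenation: 40 MFCC dimensions plus 3 pitch dimensions per frame, with the utterance-level 100-dimensional i-vector appended to every frame. The sketch below only demonstrates the resulting dimensionality with zero-filled placeholders; real features come from Kaldi's feature extraction.

```python
def splice_features(mfcc, pitch, ivector):
    """Per frame: hi-res MFCC (40-dim) + pitch (3-dim), with the
    utterance-level i-vector (100-dim) appended to every frame."""
    assert len(mfcc) == len(pitch)
    return [m + p + ivector for m, p in zip(mfcc, pitch)]

frames = 5
mfcc = [[0.0] * 40 for _ in range(frames)]   # placeholder values
pitch = [[0.0] * 3 for _ in range(frames)]
ivec = [0.0] * 100
feats = splice_features(mfcc, pitch, ivec)
# each frame is 40 + 3 + 100 = 143-dimensional
```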
Step three: train the acoustic model. The operation comprises the following two aspects:
First, train the HMM-GMM acoustic models. The Tibetan recognition system of the invention uses the extracted 13-dimensional MFCC features plus 3-dimensional pitch features (16 dimensions in total) to train monophone and triphone GMM acoustic models. The triphone models comprise: the tri1 model (adding first- and second-order difference features), the tri2 model (using LDA and MLLT), the tri3 model with speaker adaptation (LDA + MLLT + SAT), the tri4 model (a larger SAT model), and the tri5 model (a larger GMM trained with Kaldi's quick-training script). The data used here is the clean data set, without speed perturbation or added noise and reverberation.
Second, align the training data with the trained triphone GMM model (the tri5 model) to provide phoneme-level alignment information for the later TDNN model training.
Step four: construct a general Tibetan acoustic model with a deep time-delay neural network structure. The operation comprises two aspects:
First, design the network structure of the deep neural network acoustic model. The acoustic model of the invention adopts a time-delay neural network (TDNN) structure with 16 hidden layers of 625 neurons each, trained as a chain model in the Kaldi speech recognition toolkit.
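The defining idea of a TDNN layer is temporal splicing: each layer sees a window of frames from the layer below, so deeper layers cover wider temporal context. The sketch below illustrates one such layer with tiny dimensions; the patent's actual model (16 layers of 625 units, chain objective) is trained in Kaldi, not like this.

```python
def tdnn_layer(inputs, context, weight, bias):
    """One TDNN layer: splice frames at the given context offsets,
    then apply an affine transform followed by ReLU."""
    T = len(inputs)
    out = []
    for t in range(T):
        spliced = []
        for c in context:
            idx = min(max(t + c, 0), T - 1)  # clamp at utterance edges
            spliced += inputs[idx]
        act = [sum(w * x for w, x in zip(row, spliced)) + b
               for row, b in zip(weight, bias)]
        out.append([max(0.0, a) for a in act])  # ReLU
    return out

in_dim, out_dim = 4, 3
context = [-1, 0, 1]           # previous, current, next frame
W = [[0.1] * (in_dim * len(context)) for _ in range(out_dim)]
b = [0.0] * out_dim
inputs = [[1.0] * in_dim for _ in range(6)]  # 6 frames of toy features
hidden = tdnn_layer(inputs, context, W, b)
```

Stacking such layers (with wider offsets higher up) gives the network the long temporal span that makes TDNNs effective acoustic models.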
Second, train the TDNN acoustic model on the augmented Tibetan acoustic training data (the two data sets covering the three dialects: Ü-Tsang, Amdo, and Kham), combined with the phoneme-level alignment information obtained from the triphone GMM model tri5 on the training acoustic data, so that the network learns the mapping from frame-level acoustic features to the corresponding phoneme probabilities.
Step five: perform a second round of training on the trained acoustic model using the Ü-Tsang dialect portion of the acoustic data. This training keeps the original acoustic model structure, i.e. a 16-layer TDNN with 625 nodes per layer. The initial model is the one previously trained on the data of all three Tibetan dialects, so the result is better than a model initialized randomly from scratch. The learning rate should be reduced appropriately during this training.
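The transfer-learning idea of step five — start from the three-dialect model and continue training on Ü-Tsang data with a smaller learning rate — can be illustrated with a toy one-parameter least-squares example. Everything below is illustrative; the real training updates TDNN weights inside Kaldi.

```python
def sgd_finetune(w, data, lr, epochs):
    """Toy 1-D fine-tune: starting from a pretrained weight w, take
    small SGD steps (reduced lr) toward the new-domain optimum."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

w_pretrained = 1.0                      # stands in for the 3-dialect model
target = [(1.0, 1.2), (2.0, 2.4)]       # Ü-Tsang-only "data" (optimum w = 1.2)
w_ft = sgd_finetune(w_pretrained, target, lr=0.01, epochs=200)
# w_ft ends near 1.2: close to the pretrained start, adapted to the new data
```

A small learning rate keeps the adapted model near the well-trained starting point instead of destroying what was learned from the 200-hour multi-dialect set.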
Step six: train the Tibetan language model. The collected Tibetan text data is used to train an N-gram language model (with N = 5, i.e. a 5-gram model). The operation comprises the following two aspects:
First, divide the existing Tibetan text into spoken-language and news-information categories by content domain. Then use the SRILM language-model training tool to train a corresponding 5-gram Tibetan language model on each of the two text sets. Next, use SRILM's ngram tool to interpolate the two language models at a 1:1 ratio, pruning appropriately to control the model size (keeping the model's arpa file at about 1 GB).
Second, to make the trained Tibetan language model better suited to spoken-language scenarios, the two language models trained on the spoken-language and news-information texts are interpolated again, with the ratio adjusted to spoken-language : news-information = 8:2, generating a new Tibetan language model.
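Linear interpolation of language models computes P(w | h) = λ·P_spoken(w | h) + (1−λ)·P_news(w | h); SRILM's `ngram -mix-lm ... -lambda` performs this at the n-gram level. The sketch below demonstrates the arithmetic on unigram tables with made-up probabilities, for both the 1:1 decoding model (λ = 0.5) and the 8:2 rescoring model (λ = 0.8).

```python
def interpolate_lm(lm_a, lm_b, lam):
    """Linear interpolation: P(w) = lam * P_a(w) + (1 - lam) * P_b(w)."""
    vocab = set(lm_a) | set(lm_b)
    return {w: lam * lm_a.get(w, 0.0) + (1 - lam) * lm_b.get(w, 0.0)
            for w in vocab}

# hypothetical unigram distributions for the two content domains
spoken = {"hello": 0.6, "news": 0.1, "you": 0.3}
news = {"hello": 0.2, "news": 0.7, "report": 0.1}

lm_decode = interpolate_lm(spoken, news, 0.5)    # 1:1, used for decoding
lm_rescore = interpolate_lm(spoken, news, 0.8)   # 8:2, used for rescoring
# the 8:2 model assigns more mass to colloquial words than the 1:1 model
```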
Step seven: combine the trained Ü-Tsang dialect acoustic model, the Tibetan language model (1:1 interpolation), the Ü-Tsang dialect pronunciation dictionary, and related resources into a speech recognition decoder, and decode the frame-level acoustic feature sequence extracted from the audio file to be recognized.
Step eight: use the language model interpolated at an 8:2 ratio of spoken-language to news-information text to rescore the intermediate decoding result, reselecting the best path according to the new scores to obtain the final recognized Tibetan text sequence.
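Rescoring can be pictured on an n-best list: each hypothesis keeps its acoustic score, its decode-time LM score is replaced by the score from the spoken-biased 8:2 model, and the best-scoring path is chosen again. The hypotheses and scores below are entirely hypothetical; the real system rescores Kaldi lattices rather than a 3-best list.

```python
def rescore_nbest(nbest, new_lm_score, lm_weight=1.0):
    """Re-rank hypotheses with a new LM: acoustic log-score plus the
    weighted new LM log-score; return the best hypothesis text."""
    rescored = [(text, am + lm_weight * new_lm_score(text))
                for text, am, _old_lm in nbest]
    return max(rescored, key=lambda h: h[1])[0]

# hypothetical 3-best list: (text, acoustic log-score, old LM log-score)
nbest = [("hyp colloquial", -10.0, -4.0),
         ("hyp newsy", -9.5, -3.0),
         ("hyp wrong", -12.0, -2.0)]
# scores from the hypothetical 8:2 spoken-biased LM
spoken_bias = {"hyp colloquial": -1.0, "hyp newsy": -3.5, "hyp wrong": -5.0}

best = rescore_nbest(nbest, lambda t: spoken_bias[t])
# "hyp colloquial": -11.0 beats "hyp newsy": -13.0 and "hyp wrong": -17.0
```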
The above is a specific implementation of the model construction for the spoken-language speech recognition system for the Ü-Tsang dialect of Tibetan. The above description covers only preferred embodiments of the present invention and is not to be construed as limiting it; any modifications, equivalents, improvements, and the like that fall within the spirit and principle of the present invention are intended to be included within its scope.

Claims (4)

1. A spoken-language recognition method for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network, characterized in that: an audio data set mixing three Tibetan dialects is adopted, and the original audio data set is expanded by speed perturbation, noise addition, and reverberation; based on the chain model of the open-source speech recognition toolkit Kaldi, a deep time-delay neural network is trained on the expanded data set as a general Tibetan acoustic model, and the Ü-Tsang dialect portion of the audio data is used to train this acoustic model a second time, obtaining a deep time-delay neural network acoustic model for the Ü-Tsang dialect; based on the existing limited text resources, N-gram language models for two different content domains are trained on spoken-language Tibetan text and news-information Tibetan text respectively, interpolated at a 1:1 ratio, and pruned to control the size of the interpolated model, obtaining a Tibetan N-gram language model; the deep time-delay neural network acoustic model, the N-gram language model, and the pronunciation dictionary are combined into a decoder; the two language models for the spoken-language and news-information content domains are then re-interpolated at an 8:2 ratio to obtain a new language model biased toward the spoken-language domain, which rescores the intermediate decoding result so that the overall language model better matches the grammatical habits of spoken language; the trained time-delay neural network acoustic model, the N-gram language model, and the rescoring language model are combined into one system, yielding the final spoken-domain speech recognition model for the Ü-Tsang dialect of Tibetan.
2. The spoken-language recognition method for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network as claimed in claim 1, characterized by comprising the following steps:
step one, preparing a Tibetan audio data set and expanding it using data augmentation techniques;
step two, extracting acoustic features: Mel-frequency cepstral coefficient (MFCC) features plus pitch features are used; specifically, 13-dimensional low-resolution MFCC features and 3-dimensional pitch features are extracted when training the Gaussian mixture model (GMM) acoustic model; 40-dimensional high-resolution MFCC features plus 3-dimensional pitch features are used when training the time-delay neural network (TDNN) acoustic model, and a 100-dimensional i-vector (identity vector) feature is also used;
step three, training an acoustic model, wherein the operation process comprises the following two aspects:
firstly, training a hidden Markov model-Gaussian mixture model (HMM-GMM) acoustic model, using the extracted 13-dimensional MFCC features plus 3-dimensional pitch features (16 dimensions in total) and Kaldi to train monophone and triphone GMM acoustic models;
secondly, aligning training data by using a trained triphone GMM model, and providing phoneme-level alignment information for later TDNN model training;
step four, constructing a Tibetan general acoustic model by using a deep time delay neural network structure;
step five, performing a second round of training on the trained acoustic model using the Ü-Tsang dialect portion of the acoustic data to obtain the final acoustic model for the Ü-Tsang dialect of Tibetan;
step six, training the Tibetan language model, specifically training an N-gram Tibetan language model on the collected Tibetan text data, where N = 5 (i.e. 5-gram), comprising the following two aspects:
firstly, dividing the existing Tibetan text by content domain into spoken-language and news-information categories, training a corresponding 5-gram Tibetan language model on each of the two text sets, then interpolating the two language models at a 1:1 ratio and pruning appropriately with a perplexity (ppl) threshold to control the model size, finally obtaining the 1:1 Tibetan language model;
secondly, to make the trained Tibetan language model better suited to spoken-language scenarios, interpolating again the two 5-gram Tibetan language models trained on the spoken-language and news-information texts, with the ratio adjusted to spoken-language : news-information = 8:2, generating a new 8:2 Tibetan language model biased toward spoken-language application scenarios;
step seven, combining the Ü-Tsang dialect pronunciation dictionary, the trained Ü-Tsang dialect acoustic model, and the 1:1-interpolated Tibetan 5-gram language model to generate a speech recognition decoder, and decoding the frame-level acoustic feature sequence extracted from the audio file to be recognized;
and step eight, rescoring the intermediate result with the new Tibetan 5-gram language model obtained by interpolating the spoken-language and news-information language models at an 8:2 ratio, to obtain the final recognized Tibetan text sequence.
3. The spoken-language recognition method for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network as claimed in claim 2, characterized in that the operation of step one is divided into the following aspects:
first, preparing the acoustic training data: the Tibetan audio data set used is divided into two parts, one being a small audio data set of the Ü-Tsang dialect about 36 hours long, the other being an audio data set covering three Tibetan dialects (Ü-Tsang, Kham, and Amdo) about 200 hours long; when the two original data sets are merged, the 36-hour data set is kept and duplicated once;
secondly, dividing the obtained whole acoustic data set into a training data set and a testing data set;
thirdly, performing speed perturbation on the training set of the audio data set: on the basis of the original audio at 1.0x speed, using tools provided by the Kaldi speech recognition toolkit to perform 0.9x and 1.1x perturbation, expanding the data set to 3 times its original size, and adding volume perturbation to the audio;
fourthly, performing noise addition and reverberation processing on the training set: on the speed-perturbed audio data set, adding noise and reverberation to each audio file, and combining the noisy and reverberant audio with the original clean audio to form a mixed audio data set.
4. The spoken-language recognition method for the Ü-Tsang dialect of Tibetan based on a deep time-delay neural network as claimed in claim 2, characterized in that step four specifically comprises 2 steps:
firstly, designing the network structure of the deep neural network acoustic model: a time-delay neural network (TDNN) structure with 16 hidden layers of 625 neurons each, using the chain model in the Kaldi speech recognition toolkit;
secondly, training the TDNN acoustic model on the previously augmented Tibetan acoustic training data, combined with the phoneme-level alignment information obtained from the triphone GMM model tri5 on the training acoustic data, so that the network learns the mapping from frame-level acoustic features to the corresponding phoneme probabilities.
CN202110183564.3A 2021-02-08 2021-02-08 Tibetan Tibet dialect spoken language identification method based on deep time delay neural network Active CN112951206B (en)


Publications (2)

Publication Number Publication Date
CN112951206A true CN112951206A (en) 2021-06-11
CN112951206B CN112951206B (en) 2023-03-17





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant