CN109147774B - Improved time-delay neural network acoustic model - Google Patents

Info

Publication number
CN109147774B
Authority
CN
China
Prior art keywords
tdnn
acoustic model
neural network
attention module
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811090966.3A
Other languages
Chinese (zh)
Other versions
CN109147774A (en)
Inventor
陈凯斌
张伟彬
徐向民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201811090966.3A
Publication of CN109147774A
Application granted
Publication of CN109147774B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Complex Calculations (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention belongs to the technical field of speech recognition and relates to an improved time-delay neural network (TDNN) acoustic model, obtained by the following steps: building a basic TDNN network; adding an attention module between two adjacent hidden layers to obtain an improved TDNN network; and training the improved TDNN network to obtain the final acoustic model. The attention module consists of an affine transformation and a weighting function. It takes the output of the previous hidden layer as input, extracts a weight for each input feature, and applies the extracted weights to the original input features to obtain the weighted features. Taking into account the modeling capability of the model, its ability to extract context information, and its size, the invention weights the hidden-layer features of the neural network at multiple layers and thereby models the relative importance of inter-layer features effectively and explicitly. This improves the performance of the TDNN acoustic model and, in turn, the overall performance of the speech recognition system.

Description

Improved time-delay neural network acoustic model
Technical Field
The invention belongs to the technical field of speech recognition and relates to a time-delay neural network acoustic model.
Background
The first speech recognition system appeared in the 1950s. By the first decade of the 21st century, the mainstream approach had gradually evolved from template matching to statistical modeling. The most classic and influential method in the field so far combines the Hidden Markov Model (HMM) with the Gaussian Mixture Model (GMM): the HMM models the speech signal dynamically and describes the transitions between pronunciation states over time, while the GMM fits the feature distribution of each pronunciation state. Because this method makes good use of the short-time stationarity of the speech signal, it was the core technology of acoustic modeling in speech recognition for decades.
Since 2009, deep learning has been introduced into acoustic model training for speech recognition. Compared with the HMM-GMM technology described above, deep-learning acoustic models reduce the recognition error rate by 30% or more on many real-world large-vocabulary speech recognition tasks. Deep neural networks (DNNs) have stronger nonlinear expressive power, so using them in place of the Gaussian mixture model to fit the feature distribution of the pronunciation states yields a better acoustic model.
A complete speech recognition system may consist of front-end processing, an acoustic model, a language model, and a decoder; the acoustic model has a significant impact on overall performance. Improving the acoustic model requires considering, on the one hand, its modeling capability and its ability to extract context information and, on the other hand, its computational complexity and size, because these affect the decoding speed and the hardware resources the system requires.
With the development of deep learning, the time-delay neural network (TDNN) has been applied to acoustic modeling with good results. However, the TDNN does not explicitly model the relative importance of inter-layer features, so the performance of TDNN-based acoustic models can still be improved in this respect.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides an improved time-delay neural network (TDNN) acoustic model. Taking into account the modeling capability of the model, its ability to extract context information, and its size, the invention weights the hidden-layer features of the neural network at multiple layers, thereby modeling the relative importance of inter-layer features effectively and explicitly and improving both the acoustic model and the overall speech recognition system.
An improved time-delay neural network (TDNN) acoustic model, comprising:
a. building a basic TDNN network, which comprises an input layer, multiple hidden layers, and an output layer, with same-layer parameter sharing and frame-skipping sampling applied to the hidden layers;
b. adding an attention module between two adjacent hidden layers to obtain an improved TDNN network;
c. training the improved TDNN network to obtain the final acoustic model.
Preferably, the attention module in step b is expressed mathematically as follows:
y = a(x) ⊙ x
a(x) = nonL(w·x)
where x is the input of the attention module, i.e. the output of the previous hidden layer; y is the output of the attention module, i.e. the input of the next hidden layer; ⊙ denotes element-by-element multiplication; w is the parameter matrix the module uses for the affine transformation; and nonL is the weighting function.
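The weighting just described, in which the weights a(x) = nonL(w·x) multiply the input element by element, can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation: the layer width `dim`, the random values, the square shape of `w` (chosen so that a(x) has the same shape as x), and the use of a sigmoid as the example weighting function are all assumptions made for the sketch. No bias term is included because the formula above shows none.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, used here only as an example weighting function."""
    return 1.0 / (1.0 + np.exp(-z))

def attention_module(x, w, non_l):
    """Return x weighted element by element with a(x) = non_l(w @ x)."""
    a = non_l(w @ x)   # one weight per input feature
    return a * x       # element-by-element multiplication

rng = np.random.default_rng(0)
dim = 4                                # hypothetical hidden-layer width
x = rng.standard_normal(dim)           # output of the previous hidden layer
w = rng.standard_normal((dim, dim))    # square, so a(x) matches the shape of x
y = attention_module(x, w, sigmoid)    # input to the next hidden layer
print(y.shape)
```

Passing the weighting function as an argument keeps the sketch independent of any particular choice of nonL.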
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The affine-transformation parameters in the attention module use the same same-layer sharing mechanism as the TDNN, and the affine transformation is synchronized with the frame skipping of the original TDNN. For the model as a whole, the increase in parameters is small and has little effect on model size.
2. The weighting function in the attention module models the relative importance of inter-layer features effectively and explicitly, which strengthens the modeling capability of the model.
3. The invention jointly considers the size and computational complexity of the model and the TDNN's effective modeling of context information. Compared with the original TDNN, the resulting acoustic model achieves a certain performance improvement on each data set.
Drawings
FIG. 1 is a schematic structural diagram of an attention module of the present invention;
FIG. 2 is a diagram of an improved TDNN architecture in accordance with an embodiment of the present invention;
FIG. 3 is a graph of feature weighting effects in one embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
An improved time-delay neural network (TDNN) acoustic model adds a dedicated module (also called an attention layer or attention module) between several hidden layers of the TDNN. The module weights the original input features, and the weighted features are passed to the next hidden layer.
The attention module consists of an affine transformation and a weighting function. It takes the output of the previous hidden layer as input, extracts a weight for each input feature, and applies the extracted weights to the original input features by element-by-element multiplication to obtain the weighted features. The attention module combines well with the TDNN, improving the TDNN acoustic model without introducing many additional parameters or much extra computation, which in turn improves speech recognition accuracy.
In this embodiment, the improved time-delay neural network (TDNN) acoustic model, as shown in FIGS. 1 to 3, may specifically be obtained through the following steps:
S1. Extract Mel-frequency cepstral coefficients (MFCCs) from the relevant speech data set as acoustic features, and train progressively better HMM-GMM models to provide relatively accurate frame-level labels for training the time-delay neural network.
The models are trained in the following order: a monophone model, a triphone model with linear discriminant analysis and maximum likelihood linear regression, and a triphone model with speaker adaptation. The purpose is to force-align the training data so as to provide relatively accurate frame-level labels (i.e. the triphone pronunciation state, or hidden Markov model state, corresponding to each speech frame) for the subsequent training of the time-delay neural network.
S2. Build a basic TDNN network.
The basic TDNN network comprises an input layer, multiple hidden layers, and an output layer; same-layer parameter sharing and frame-skipping sampling are applied to the hidden layers.
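One way to picture a hidden layer with same-layer parameter sharing and frame-skipping sampling is the NumPy sketch below. It rests on assumptions not stated in the text: frame skipping is interpreted as splicing only selected context offsets (here t-2 and t+2) and optionally evaluating every `stride`-th time step, ReLU is used as the activation, and all sizes and the helper name `tdnn_layer` are hypothetical.

```python
import numpy as np

def tdnn_layer(frames, weight, bias, offsets, stride=1):
    """Apply one TDNN hidden layer.

    frames: (T, D_in) input sequence.
    weight: (D_out, len(offsets) * D_in), reused at every time step
            (same-layer parameter sharing).
    offsets: context offsets spliced at each time step.
    stride: evaluate only every `stride`-th valid time step.
    """
    start = max(0, -min(offsets))
    stop = frames.shape[0] - max(0, max(offsets))
    outputs = []
    for t in range(start, stop, stride):
        spliced = np.concatenate([frames[t + o] for o in offsets])
        outputs.append(np.maximum(weight @ spliced + bias, 0.0))  # ReLU
    return np.stack(outputs)

rng = np.random.default_rng(0)
num_frames, d_in, d_out = 10, 5, 6          # hypothetical sizes
offsets = (-2, 2)                           # skip the frames in between
frames = rng.standard_normal((num_frames, d_in))
weight = rng.standard_normal((d_out, len(offsets) * d_in))
bias = np.zeros(d_out)

dense = tdnn_layer(frames, weight, bias, offsets)              # every valid step
skipped = tdnn_layer(frames, weight, bias, offsets, stride=2)  # every second step
print(dense.shape, skipped.shape)
```

Because the same `weight` is applied at every time step, the parameter count does not grow with the length of the utterance, and skipping steps only reduces how often the layer is evaluated.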
S3. On the basis of the TDNN network of S2, add an attention module between two adjacent hidden layers to obtain the improved TDNN network.
In the hierarchy of a deep neural network, each layer can be regarded as a feature mapping of its input, and each node in the layer represents one feature. The features within a layer differ in importance: some are key features, while others are irrelevant. The attention module added by the invention, shown in FIG. 1, explicitly computes the importance of each feature before a layer's features are passed to the next layer, and feeds the next layer with features weighted by that importance. The attention module is expressed mathematically as follows:
y = a(x) ⊙ x
a(x) = nonL(w·x)
where x is the input of the attention module, i.e. the output of the previous hidden layer; y is the output of the attention module, i.e. the input of the next hidden layer; ⊙ denotes element-by-element multiplication; w is the parameter matrix the module uses for the affine transformation; and nonL is the weighting function.
The attention module has the following features:
1) the parameter matrix required by the affine transformation in the attention module uses the same same-layer sharing mechanism as the basic TDNN network;
2) the affine transformation in the attention module uses the same frame-skipping sampling mechanism as the basic TDNN network;
3) the weighting function in the attention module may be any of a variety of functions for obtaining the weights;
4) the weighted features, which reflect the relative importance among features, are obtained by element-by-element multiplication.
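Features 1) and 2) can be illustrated with a short NumPy sketch. It assumes that parameter sharing means one matrix `w` is reused for every frame of a layer, and that frame skipping means the module only needs to run on the frames the next layer will actually read. The offsets, sizes, the `attend` helper, and the use of tanh as the weighting function are hypothetical choices for the illustration.

```python
import numpy as np

def attend(hidden, w, weight_fn):
    """Weight each frame of `hidden` (T, D) using the same matrix w."""
    return weight_fn(hidden @ w.T) * hidden   # row t uses w @ hidden[t]

rng = np.random.default_rng(0)
num_frames, dim = 12, 6                       # hypothetical sizes
hidden = rng.standard_normal((num_frames, dim))
w = rng.standard_normal((dim, dim))           # one matrix for the whole layer

# Frames the next layer reads if it splices offsets (-2, 2) at steps 2, 5, 8.
offsets = (-2, 2)
out_times = range(2, num_frames - 2, 3)
needed = sorted({t + o for t in out_times for o in offsets})

weighted_needed = attend(hidden[needed], w, np.tanh)   # only the needed frames
weighted_all = attend(hidden, w, np.tanh)              # every frame
print(needed, w.size)
```

In this sketch the module adds `dim * dim` parameters per insertion point regardless of utterance length, and running it only on `needed` gives the same values for those frames as running it everywhere.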
The weighting function may be an activation function commonly used in deep learning, or a combination or simple variant of such functions, for example identity, sigmoid, tanh, relu, log-sigmoid, relu + renorm, softmax, or log-softmax.
In this embodiment, log-softmax is used as the weighting function. Softmax amplifies the differences between features when the weights are extracted, and its normalization means that each weight depends on the relationships among all features rather than on a single feature in isolation. In addition, taking the logarithm resolves the numerical problems that softmax itself causes here, so that the relative importance among features is expressed effectively. Log-softmax is therefore the preferred weighting function in this embodiment.
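Several of the candidate weighting functions can be written in a few lines of NumPy, as in the sketch below. The numerically stable formulations (subtracting the maximum before exponentiating, and using `np.logaddexp` for log-sigmoid) are standard practice rather than something the text specifies, and relu + renorm is left out because the text does not define the renormalization step. The example vector `z` is arbitrary.

```python
import numpy as np

def log_softmax(z):
    """Log of softmax along the last axis, computed without overflow."""
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

WEIGHTING_FUNCTIONS = {
    "identity": lambda z: z,
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(z, 0.0),
    "log-sigmoid": lambda z: -np.logaddexp(0.0, -z),
    "softmax": lambda z: np.exp(log_softmax(z)),
    "log-softmax": log_softmax,
}

z = np.array([2.0, 0.5, -1.0, 0.0])   # arbitrary example of w·x
for name, fn in WEIGHTING_FUNCTIONS.items():
    print(f"{name:12s}", np.round(fn(z), 3))
```

Softmax weights are positive and sum to one, so each weight depends on all features at once. Log-softmax weights are never positive and differ from the raw scores by a constant shared across the vector.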
S4. Train the improved TDNN network to obtain the final acoustic model.
The improved TDNN network architecture is shown in FIG. 2. In this embodiment, training proceeds as follows: the improved TDNN network is initialized; its input is the speech features (MFCCs) extracted in step S1, spliced with several frames of context; its output is the triphone pronunciation state corresponding to the current input; and the network is trained by gradient descent to obtain the final TDNN acoustic model for speech recognition.
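The training loop can be pictured with a deliberately reduced sketch. Here a single softmax output layer stands in for the whole network, the features and frame-level state labels are random placeholders, and the cross-entropy loss, learning rate, and sizes are assumptions; backpropagation through the TDNN and attention layers is omitted.

```python
import numpy as np

def log_softmax(z):
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    """Mean negative log-probability of the labelled state per frame."""
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
num_frames, feat_dim, num_states = 64, 8, 5       # hypothetical sizes
feats = rng.standard_normal((num_frames, feat_dim))        # placeholder features
labels = rng.integers(0, num_states, size=num_frames)      # placeholder alignments

weights = np.zeros((feat_dim, num_states))        # output layer only
learning_rate = 0.1
losses = []
for step in range(20):
    logits = feats @ weights
    losses.append(cross_entropy(logits, labels))
    probs = np.exp(log_softmax(logits))
    probs[np.arange(num_frames), labels] -= 1.0   # d(loss)/d(logits), unscaled
    grad = feats.T @ probs / num_frames
    weights -= learning_rate * grad               # gradient descent step

print(losses[0], losses[-1])
```

In a full system the labels would come from the forced alignment of step S1 and the gradient would flow through every hidden layer and attention module.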
The effect of the feature weighting introduced by the attention module is shown in FIG. 3, which shows, for a given input, the difference that weighting makes at one layer of the TDNN network; only the first 45 dimensions of the layer are shown.
Further, the language model and the final TDNN acoustic model are combined into a decoding graph, and a decoding algorithm (token passing) is used to decode the test-set speech data for performance evaluation.
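The idea behind token passing can be shown on a toy state graph. The sketch below is not the decoder used in the embodiment: a real decoding graph combines the language model, lexicon, and HMM topology and is searched with pruning. The three-state left-to-right graph, the per-frame scores, and the function name `token_passing` are made up for illustration.

```python
import math

def token_passing(log_obs, log_trans, start_state=0):
    """Return (best_score, best_path) over a small state graph.

    log_obs: list of per-frame lists of log acoustic scores, one per state.
    log_trans: dict mapping state -> list of (next_state, log_probability).
    Each state keeps only its best token (score plus state history).
    """
    tokens = {start_state: (log_obs[0][start_state], [start_state])}
    for t in range(1, len(log_obs)):
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, log_p in log_trans.get(state, []):
                cand = score + log_p + log_obs[t][nxt]
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        tokens = new_tokens
    return max(tokens.values(), key=lambda tok: tok[0])

half = math.log(0.5)
log_trans = {0: [(0, half), (1, half)],   # left-to-right toy topology
             1: [(1, half), (2, half)],
             2: [(2, half)]}
obs = [[0.9, 0.05, 0.05],                 # made-up per-frame state scores
       [0.8, 0.1, 0.1],
       [0.1, 0.8, 0.1],
       [0.1, 0.1, 0.8]]
log_obs = [[math.log(p) for p in frame] for frame in obs]

best_score, best_path = token_passing(log_obs, log_trans)
print(best_path, round(best_score, 3))
```

In a real system the per-frame scores would be derived from the acoustic model outputs rather than the hand-written table used here.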
Those skilled in the art can readily apply the technical method of the invention to other network structures, and such applications should be regarded as variations of the invention. The above embodiments are preferred embodiments of the invention, but the invention is not limited to them. Any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the invention is an equivalent and falls within the scope of protection of the invention.

Claims (6)

1. An improved time-delay neural network acoustic model that explicitly models the relative importance of inter-layer features by weighting the hidden-layer features of the neural network at multiple layers, comprising:
a. extracting Mel-frequency cepstral coefficients from the relevant speech data set as acoustic features, and training progressively better HMM-GMM models;
b. building a basic TDNN network, which comprises an input layer, multiple hidden layers, and an output layer, with same-layer parameter sharing and frame-skipping sampling applied to the hidden layers;
c. adding an attention module between two adjacent hidden layers to obtain an improved TDNN network;
d. training the improved TDNN network to obtain the final acoustic model;
the mathematical formula of the attention module in step c is as follows:
y = a(x) ⊙ x
a(x) = nonL(w·x)
where x is the input of the attention module, i.e. the output of the previous hidden layer; y is the output of the attention module, i.e. the input of the next hidden layer; ⊙ denotes element-by-element multiplication; w is the parameter matrix the module uses for the affine transformation; and nonL is the weighting function.
2. The time-delay neural network acoustic model of claim 1, wherein the parameter matrix required for the affine transformation in the attention module uses the same same-layer sharing mechanism as the basic TDNN network.
3. The time-delay neural network acoustic model of claim 2, wherein the affine transformation in the attention module uses the same frame-skipping sampling mechanism as the basic TDNN network.
4. The time-delay neural network acoustic model of claim 2, wherein the weighting function in the attention module may be any of a plurality of functions for obtaining the weights.
5. The time-delay neural network acoustic model of claim 4, wherein the weighting function in the attention module is a log-softmax function.
6. The time-delay neural network acoustic model of any one of claims 1-5, wherein the improved TDNN network is trained by gradient descent.
CN201811090966.3A 2018-09-19 2018-09-19 Improved time-delay neural network acoustic model Active CN109147774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811090966.3A CN109147774B (en) 2018-09-19 2018-09-19 Improved time-delay neural network acoustic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811090966.3A CN109147774B (en) 2018-09-19 2018-09-19 Improved time-delay neural network acoustic model

Publications (2)

Publication Number Publication Date
CN109147774A CN109147774A (en) 2019-01-04
CN109147774B true CN109147774B (en) 2021-07-20

Family

ID=64814874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811090966.3A Active CN109147774B (en) 2018-09-19 2018-09-19 Improved time-delay neural network acoustic model

Country Status (1)

Country Link
CN (1) CN109147774B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801635A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of vocal print feature extracting method and device based on attention mechanism
CN110689906A (en) * 2019-11-05 2020-01-14 江苏网进科技股份有限公司 Law enforcement detection method and system based on voice processing technology
CN114664292B (en) * 2020-12-22 2023-08-01 马上消费金融股份有限公司 Model training method, speech recognition method, device, equipment and readable storage medium
CN112735388B (en) * 2020-12-28 2021-11-09 马上消费金融股份有限公司 Network model training method, voice recognition processing method and related equipment
CN113270104B (en) * 2021-07-19 2021-10-15 深圳市思特克电子技术开发有限公司 Artificial intelligence processing method and system for voice
CN114360517B (en) * 2021-12-17 2023-04-18 天翼爱音乐文化科技有限公司 Audio processing method and device in complex environment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105741838A (en) * 2016-01-20 2016-07-06 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device
CN108022587A (en) * 2017-12-15 2018-05-11 深圳市声扬科技有限公司 Audio recognition method, device, computer equipment and storage medium
CN108269568A (en) * 2017-01-03 2018-07-10 中国科学院声学研究所 A kind of acoustic training model method based on CTC
CN108269569A (en) * 2017-01-04 2018-07-10 三星电子株式会社 Audio recognition method and equipment
CN108492273A (en) * 2018-03-28 2018-09-04 深圳市唯特视科技有限公司 A kind of image generating method based on from attention model
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree

Also Published As

Publication number Publication date
CN109147774A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109147774B (en) Improved time-delay neural network acoustic model
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
Tu et al. Speech enhancement based on teacher–student deep learning using improved speech presence probability for noise-robust speech recognition
Zhang et al. Deep belief networks based voice activity detection
CN107633842A (en) Audio recognition method, device, computer equipment and storage medium
CN111429889A (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN107331384A (en) Audio recognition method, device, computer equipment and storage medium
CN111276131A (en) Multi-class acoustic feature integration method and system based on deep neural network
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN107093422B (en) Voice recognition method and voice recognition system
CN113506562B (en) End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
CN110853630B (en) Lightweight speech recognition method facing edge calculation
Bhosale et al. End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios.
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
Garg et al. Survey on acoustic modeling and feature extraction for speech recognition
Kannadaguli et al. A comparison of Bayesian and HMM based approaches in machine learning for emotion detection in native Kannada speaker
Bi et al. Deep feed-forward sequential memory networks for speech synthesis
CN115249479A (en) BRNN-based power grid dispatching complex speech recognition method, system and terminal
Sharma et al. Automatic speech recognition systems: challenges and recent implementation trends
Kadyan et al. Training augmentation with TANDEM acoustic modelling in Punjabi adult speech recognition system
Kannadaguli et al. Comparison of hidden markov model and artificial neural network based machine learning techniques using DDMFCC vectors for emotion recognition in Kannada
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN114579724A (en) Seamless connection method and system for virtual human under various scenes
Yuan et al. Vector quantization codebook design method for speech recognition based on genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant