CN109637526A - Adaptation method for a DNN acoustic model based on speaker-identity features - Google Patents
- Publication number: CN109637526A (application CN201910016412.7A)
- Authority: CN (China)
- Prior art keywords: feature, dnn, model, training, acoustic model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/16 — Speech recognition; speech classification or search using artificial neural networks
- G10L15/02 — Speech recognition; feature extraction; selection of recognition unit
- G10L15/144 — Speech recognition; speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]; training of HMMs
- G10L25/24 — Speech or voice analysis techniques in which the extracted parameters are the cepstrum
Abstract
The invention discloses an adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features. It addresses three problems of adaptive training: susceptibility to over-fitting, weak speaker-identity characterization, and low robustness. The implementation comprises: extracting speaker-identity features, using MFCC features as the input of a speaker-independent DNN; building a GMM-HMM speech recognition system; building a DNN-HMM baseline system whose DNN acoustic model has multiple hidden layers; and performing layer-by-layer identity-feature adaptive training of the DNN acoustic model, obtaining a DNN acoustic model adapted to a specific speaker. During identity-feature extraction, SVD is applied to the weight matrix of the last hidden layer of the DNN model and the decomposed features replace the original features. The invention makes full use of a small amount of per-speaker data to adjust the model parameters and raise recognition accuracy for the specific speaker. Complexity is low and recognition performance improves markedly. It is applicable to speech-recognition-related intelligent systems in communications, medical, in-vehicle and similar settings.
Description
Technical field
The invention belongs to the field of communication technology and mainly relates to speaker-identity (i-vector) feature extraction; specifically, it is an adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features, for speaker-independent speech recognition.
Background technique
In recent years, deep neural networks (DNNs) have achieved great success in speech recognition. In speech acoustic modeling, the hidden Markov model-deep neural network (HMM-DNN) system has better acoustic discrimination than the traditional hidden Markov model-Gaussian mixture model (HMM-GMM) system and greatly improves recognition performance, making the DNN the mainstream acoustic model.
However, both systems still suffer from a mismatch between the speakers in the training data and the target speaker's voice: they assume that training data and test data follow the same distribution. This assumption does not hold in real life, mainly because training data is limited and cannot cover all application scenarios; the resulting mismatch between training and test conditions degrades recognition accuracy.
Speaker adaptation techniques are usually used to resolve the mismatch between the model and the test speaker. Speaker adaptation divides into model-domain adaptation and feature-domain adaptation. Model-domain speaker adaptation of a DNN acoustic model directly adjusts the model parameters using adaptation data from the target speaker; but because a DNN has many hidden layers and a huge number of parameters, it easily over-fits the small amount of adaptation data.
Feature-domain adaptation of a DNN acoustic model adds a linear transform layer to a trained speaker-independent (SI) model and, in the adaptation stage, adjusts only that linear transform layer for each speaker. This reduces over-fitting to some extent, but since only one layer is adapted, recognition accuracy does not improve.
Many research institutions have studied deep neural network adaptation extensively. Among these methods, speaker adaptation based on speaker-identity i-vector features has received significant attention. Its basic idea is to concatenate the acquired identity i-vector with the original input features and train the DNN acoustic model on the result. The method is simple, easy to combine with other adaptation methods, and equally applicable under noisy conditions. It uses low-dimensional, speaker-independent MFCCs as input features, which have decent characterization ability and some robustness; but MFCCs are low-level features with limited characterization ability and insufficient robustness in harsh environments, and they are easily affected by the environment, which degrades the adaptation ability of the extracted identity i-vector.
Singular value decomposition (SVD) has been applied to decompose a certain layer of the network used for identity i-vector extraction. This overcomes the drop in frame-classification accuracy and somewhat improves recognition performance, but identity characterization remains poor, because the improved i-vector is only simply concatenated with the original input features at the input layer; the small amount of speaker information cannot be fully exploited, and system recognition performance suffers.
Some research groups select bottleneck features, whose characterization ability is stronger, to replace MFCCs when obtaining the identity i-vector. Bottleneck features are better than low-level MFCCs in characterization ability and robustness, but extracting them requires introducing a bottleneck layer into the DNN structure, which lowers the DNN's frame-classification accuracy.
In conclusion current Research on adaptive method has taken biggish success, but it is few due to not making full use of
The self-adapting data of amount, and the complicated network structure, computation complexity is high, and stability is also not good enough.
The parameter iteration updating decision of the acoustic model of adaptive approach proposed by the present invention, computation complexity is low, identification system
The adaptive ability of system is strong.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a low-complexity, better-performing adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features.
The invention is an adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features, characterized by comprising the following steps:
1) Extract the speaker-identity i-vector of the specific speaker. Train a DNN model on the MFCC features of non-specific speakers; decompose the weight matrix of the last hidden layer of that DNN with singular value decomposition (SVD); retrain the DNN using the decomposed features in place of the original MFCCs, obtaining a DNN model for extracting low-dimensional features; extract the low-dimensional features of non-specific speakers with this DNN model, then train and align these features with a universal background model (UBM), obtaining the speaker-identity (i-vector) feature of the non-specific speakers, represented as a single vector. To extract the identity feature of a specific speaker, substitute that speaker for the non-specific speakers in the above procedure, realizing identity i-vector extraction for the specific speaker;
2) Build a GMM-HMM speech recognition system, modeling the traditional acoustic model, the Gaussian mixture model (GMM). The implementation steps include:
2a) Extract 13-dimensional low-dimensional features from the training data in the corpus using the mel-frequency cepstral coefficient (MFCC) method, then compute the first- and second-order differences of each dimension, obtaining 39-dimensional MFCC features;
2b) Pre-process the 39-dimensional MFCCs with cepstral mean and variance normalization (CMVN), obtaining variance-normalized features;
2c) Extend the variance-normalized features left and right frame by frame to obtain features in a very high-dimensional space; project them into a low-dimensional subspace by linear discriminant analysis (LDA) to obtain low-dimensional features, then apply a maximum likelihood linear transform (MLLT), obtaining de-correlated features under the maximum-likelihood criterion;
2d) Apply feature-space maximum likelihood linear regression (fMLLR) to the de-correlated features, obtaining features represented with codebook mean vectors, called fMLLR features;
2e) Fit the probability distribution of the speech data with a linear combination of k diagonal-covariance Gaussian density functions, obtaining the Gaussian mixture model (GMM). Use the fMLLR features as the GMM's input features and train the weight of each Gaussian component with the maximum mutual information (MMI) criterion, obtaining an HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR.
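Steps 2a)–2b), appending first- and second-order differences and applying CMVN, can be sketched as follows. The frame count and the random stand-in for real MFCC frames are hypothetical; the delta window N=2 is a common convention, not stated by the source:

```python
import numpy as np

def delta(feat, N=2):
    """Standard delta (difference) features over frames (rows = frames)."""
    T = feat.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    d = np.zeros_like(feat)
    for t in range(T):
        d[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                   for n in range(1, N + 1)) / denom
    return d

def cmvn(feat):
    """Cepstral mean and variance normalization per dimension."""
    return (feat - feat.mean(0)) / (feat.std(0) + 1e-8)

mfcc = np.random.default_rng(0).standard_normal((100, 13))   # stand-in 13-dim MFCCs
feat39 = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])  # 13 + 13 + 13 = 39 dims
feat_norm = cmvn(feat39)                                     # zero mean, unit variance
```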
3) Construct a speaker-independent DNN-HMM baseline speech recognition system whose DNN acoustic model has multiple hidden layers. In the trained GMM-HMM recognition system, force-align the training data to obtain the true label corresponding to each speech frame, for supervised DNN acoustic-model training. After extending each dimension of the extracted fMLLR features by several frames on each side, use them as the DNN acoustic model's input; perform initialization training with the corpus training-set data and cross-validation-set data, completing the DNN acoustic model's construction and discriminative training. This yields a speaker-independent DNN-HMM baseline speech recognition system with a multi-hidden-layer DNN acoustic model.
4) Perform layer-by-layer identity-feature adaptation of the DNN acoustic model. In the speaker-independent DNN-HMM baseline system, adaptively train the DNN acoustic model using the speaker-discriminative identity i-vector; specifically, add adaptation data to each hidden layer of the DNN acoustic model layer by layer and train. The adaptation data is the extracted identity i-vector of the specific speaker. In the adaptation stage, train the adaptation weights and the ordinary weights with the cross-entropy criterion, obtaining a DNN acoustic model adapted to the specific speaker.
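The layer-by-layer injection in 4) amounts to giving every hidden layer a second input path for the speaker i-vector through its own adaptation weight V. A minimal forward-pass sketch, with hypothetical layer sizes and random weights standing in for trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adapted_forward(x, ivec, layers):
    """Forward pass in which each hidden layer receives the previous layer's
    output through its ordinary weight W and the speaker i-vector through
    its adaptation weight V:  z_i = W_i h_{i-1} + V_i s + b_i."""
    h = x
    for W, V, b in layers:
        h = sigmoid(W @ h + V @ ivec + b)
    return h

rng = np.random.default_rng(0)
dim_in, dim_h, dim_iv, n_layers = 440, 1024, 100, 5   # hypothetical sizes
layers = []
for i in range(n_layers):
    d = dim_in if i == 0 else dim_h
    layers.append((0.01 * rng.standard_normal((dim_h, d)),      # ordinary weight W
                   0.01 * rng.standard_normal((dim_h, dim_iv)), # adaptation weight V
                   np.zeros(dim_h)))

x = rng.standard_normal(dim_in)     # stand-in for a spliced fMLLR frame
ivec = rng.standard_normal(dim_iv)  # stand-in for the speaker i-vector
out = adapted_forward(x, ivec, layers)
```

Setting some of the V matrices to zero recovers partial injection (1 to 5 adapted layers), which is the comparison the invention runs in Figs. 4 and 5.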
The present invention uses singular value decomposition in extracting the i-vector that characterizes speaker identity, optimizing i-vector performance so that the final system's recognition performance is further improved.
Compared with the prior art, the present invention has the following advantages:
1. The i-vector extracted with singular value decomposition (SVD) has better characterization ability and robustness than a traditional i-vector extracted from MFCC input. Compared with extracting the i-vector from bottleneck features, it avoids introducing a bottleneck layer into the DNN structure, overcomes the drop in the DNN's frame-classification accuracy, and further improves system recognition performance.
2. The extracted i-vector is introduced into each hidden layer of the DNN acoustic model in turn. Using the cross-entropy criterion, with the cross entropy between the predicted and actual probability distributions as the objective function, each speech frame is trained discriminatively, learning the weights of the adaptation layers shared by all speakers. The adaptation-layer weights are updated the same way as ordinary DNN weights, by error back-propagation and gradient descent. Computational complexity is low, training is simple, and adaptation performance is better.
3. One to five adaptation layers are added to the DNN acoustic model, which makes full use of the small amount of adaptation data and realizes global adaptation: the target speaker's i-vector information is mapped layer by layer into the DNN model's features, removing speaker-specific information from the features while retaining semantic information and reducing inter-speaker differences. The gain in system recognition performance is significant.
Brief description of the drawings
Fig. 1 is the DNN acoustic-model adaptive-training structure diagram of the method of the present invention;
Fig. 2 is the GMM-HMM recognition system framework;
Fig. 3 is the DNN-HMM baseline system framework;
Fig. 4 shows test results for different numbers of adaptation layers on the TIMIT corpus;
Fig. 5 shows test results for different numbers of adaptation layers on the Switchboard corpus.
Specific embodiments
The present invention is described in detail below with reference to the drawings and embodiments.
Embodiment 1
In recent years, speaker adaptation has received increasing attention. Adaptation methods divide into model-domain adaptation and feature-domain adaptation. Adaptation techniques are mature in hidden Markov model-Gaussian mixture (HMM-GMM) systems but are difficult to apply directly to hidden Markov model-deep neural network (HMM-DNN) systems. Many research institutions have studied deep neural network adaptation extensively; among these methods, i-vector-based speaker adaptation is highly favored. But current adaptation research does not make full use of the small amount of adaptation data, the network structure is complex, computational complexity is high, and stability is lacking. The present invention studies i-vector adaptation based on deep neural networks (DNNs) and proposes an adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features; referring to Fig. 1, it comprises the following steps:
1) Extract the speaker-identity i-vector of the specific speaker: train a DNN model on the MFCC features of non-specific speakers; decompose the weight matrix of the last hidden layer of that DNN with singular value decomposition (SVD); retrain the DNN using the decomposed features in place of the original MFCCs, obtaining a DNN model for extracting low-dimensional features; extract the low-dimensional features of non-specific speakers with this DNN model, then train and align these features with a universal background model (UBM), obtaining the speaker-identity (i-vector) feature of the non-specific speakers as a single vector. To extract the identity feature of a specific speaker, substitute that speaker for the non-specific speakers in the above procedure, realizing identity i-vector extraction for the specific speaker.
2) Build a GMM-HMM speech recognition system, modeling the traditional acoustic model, the Gaussian mixture model (GMM). The implementation steps include:
2a) Extract 13-dimensional low-dimensional features from the training data in the corpus using the mel-frequency cepstral coefficient (MFCC) method, compute the first- and second-order differences of each dimension and splice them, finally obtaining 39-dimensional MFCC features. This example uses data from an open-source corpus.
2b) Pre-process the 39-dimensional MFCCs with cepstral mean and variance normalization (CMVN), obtaining variance-normalized features.
2c) Extend the variance-normalized features left and right frame by frame to obtain features in a very high-dimensional space; project them into a low-dimensional subspace by linear discriminant analysis (LDA), then apply a maximum likelihood linear transform (MLLT), obtaining de-correlated features under the maximum-likelihood criterion. In this example the frame-wise left-right extension is used because each speech frame is a short-time stationary signal of roughly 15 milliseconds; considering the correlation between neighboring frames and the current frame, context extension is needed.
2d) Apply feature-space maximum likelihood linear regression (fMLLR) to the de-correlated features, obtaining features represented with codebook mean vectors, called fMLLR features.
2e) Fit the probability distribution of the speech data with a linear combination of k diagonal-covariance Gaussian density functions, obtaining the Gaussian mixture model (GMM). Use the fMLLR features as the GMM's input features (the Gaussian mixture model is also called the acoustic model); the present invention trains the weight of each Gaussian component with the maximum mutual information (MMI) criterion, obtaining the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR.
3) Construct a speaker-independent DNN-HMM baseline speech recognition system whose DNN acoustic model has multiple hidden layers: in the trained GMM-HMM recognition system, force-align the training data to obtain the true label corresponding to each speech frame, for supervised DNN acoustic-model training; after extending each dimension of the extracted fMLLR features by several frames on each side, use them as the DNN acoustic model's input; perform initialization training with the corpus training-set data and cross-validation-set data, completing the DNN acoustic model's construction and discriminative training; this yields a speaker-independent DNN-HMM baseline speech recognition system with a multi-hidden-layer DNN acoustic model.
4) Perform layer-by-layer identity-feature adaptation of the DNN acoustic model: in the speaker-independent DNN-HMM baseline system, adaptively train the DNN acoustic model using the speaker-discriminative identity i-vector; specifically, add adaptation data to each hidden layer of the DNN acoustic model layer by layer and train. The adaptation data is the extracted identity i-vector of the specific speaker. In the adaptation stage, train the adaptation weights and the ordinary weights with the cross-entropy criterion, obtaining a DNN acoustic model adapted to the specific speaker.
The present invention addresses the prior art's susceptibility to over-fitting in adaptive training, poor i-vector characterization ability, and low robustness, and puts forward an overall scheme for solving these problems.
The idea of the invention is as follows. First, in a trained hidden Markov model-Gaussian mixture model (HMM-GMM) recognition system, force-align the training data of the corpus on that system, obtaining the labels for DNN acoustic-model training; meanwhile initialize the parameters of the DNN acoustic model with pre-training and fine-tuning, obtaining a neural network with 5 hidden layers, i.e. a hidden Markov model-deep neural network (HMM-DNN) baseline recognition system. Then extract the i-vector of each speaker's utterances using the universal background model (UBM) and the total variability matrix T trained on the whole training set of the corpus. Traditional i-vector extraction usually uses MFCCs as input features; unlike the traditional extraction method, the present invention trains and extracts the i-vector feature using singular value decomposition (SVD). Finally, the SVD-based i-vector and the traditionally extracted i-vector are each added to multiple layers of the DNN acoustic model for training, and experimental comparison yields the DNN acoustic model with the highest stability and the best recognition performance.
The present invention can be used in speech-recognition-related fields such as intelligent domestic robots, communications, in-vehicle systems, medical care, home services and voice customer service.
Embodiment 2
The adaptation method of the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiment 1. The identity i-vector extraction described in step 1) of the invention includes the following steps:
1a) Use the 39-dimensional low-dimensional MFCC features, including their first- and second-order features, extracted from the test-set speech data of an open-source corpus to train a DNN model for non-specific-speaker feature extraction;
1b) Decompose the last hidden layer's weight matrix of the DNN model trained in step 1a) using singular value decomposition (SVD), and replace the original weight matrix with the result.
1c) Train the DNN model with the back-propagation (BP) algorithm and gradient descent, then extract the low-dimensional features of non-specific speakers with the trained DNN model.
1d) Extract features from the speech data of all speakers in the training and test sets with the trained DNN model, then train and align with a universal background model (UBM) and extract the i-vector of each specific speaker's utterance.
The present invention applies singular value decomposition in the DNN acoustic model used for non-specific-speaker MFCC feature extraction; specifically, the last hidden layer's weight matrix of that DNN model is decomposed, and then the low-dimensional features of non-specific speakers are extracted. This resolves the over-fitting problem in training while improving the recognition system's robustness and the i-vector's characterization ability.
Embodiment 3
The adaptation method of the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiments 1-2. The speaker-identity vector (i-vector) extraction described in steps 1c) and 1d) uses the expression:
M = m + Tx + e
where M is the GMM mean supervector of the specific speaker, m is the mean supervector of the UBM, T is a total variability space, x is the extracted speaker-identity i-vector, and e is the residual noise term.
In this example the universal background model (UBM) is readily obtained from the corpus training data, and the total variability matrix T is obtained by the expectation-maximization (EM) algorithm. The identity i-vector extracted in this way has good speaker discrimination, represents inter-speaker differences, has low dimensionality, and has the advantages of few parameters and fast adaptation during adaptive training.
The present invention takes the ambient noise of speech recognition into account during identity-feature extraction, removing environmental factors that harm recognition and reducing speech distortion, so that the speech recognition system still performs well in noisy environments; the recognition system's robustness, stability and noise immunity are improved.
Embodiment 4
The adaptation method of the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiments 1-3. The layer-by-layer identity-feature adaptation of the DNN acoustic model described in step 4) specifically comprises the following steps:
4a) The identity i-vector training and extraction use singular value decomposition (SVD): the last hidden layer's weight matrix of the DNN model used to extract non-specific speakers' low-dimensional features is decomposed and replaces the original weight matrix, after which the low-dimensional features of non-specific speakers are extracted.
4b) Extract the i-vector of each utterance in the training and test sets with the universal background model (UBM) and the total variability matrix T, obtaining an identity i-vector of a certain dimensionality that characterizes identity information.
4c) Obtain the best-performing DNN acoustic model through updates of the adaptation weight V: add the SVD-based i-vector together with the original MFCC speech features to 1 to 5 layers of the DNN acoustic model, iteratively update and optimize the adaptation weight V and the ordinary weight W by minimizing the target loss function, and obtain the best-performing DNN acoustic model by comparison.
4d) Parameters are updated by the error back-propagation (BP) algorithm, exploiting the simple derivative of the sigmoid activation function; the objective function's gradient with respect to the parameters can be expressed through the error, so the adaptation parameter V is updated the same way as the ordinary weight parameter W. At the start of training, the learning rate α is a small positive number less than 1 that determines the speed of parameter updates; the initial learning rate is set somewhat larger and is reduced when the system stops improving or improves only slowly, avoiding local optima. Computational complexity is low, updates are iterative and fast, adaptation is quick, recognition performance is greatly improved, and system stability is high.
The present invention adds the specific speaker's identity i-vector as auxiliary information to the hidden layers of the non-specific-speaker DNN acoustic model in turn for adaptive training, removing speaker-specific information from the features while retaining semantic information. The adaptive acoustic model trained this way improves the target speaker's recognition accuracy even when adaptation data is limited, and also improves the system's recognition performance.
The present invention solves the poor characterization ability and insufficient robustness of the existing low-dimensional MFCC features, and avoids the frame-classification accuracy drop brought by substituting bottleneck features, whose characterization ability and robustness exceed MFCCs'.
Embodiment 5
The adaptation method of the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiments 1-4. The update of the adaptation weight V described in step 4c) is expressed as:
V_i^(t+1) = V_i^(t) - α · ∂C/∂V_i,  with δ_i^t = ∂C/∂z_i^t,  i = 1, …, P
where C is the cross-entropy loss function, z_i is the un-activated (pre-activation) quantity of layer i, α is the learning rate, δ_i^t is the error of layer i at time t, and P is the number of hidden layers in the DNN model.
In the acoustic model adapted to the specific speaker proposed by the present invention, parameters are updated by the error back-propagation (BP) algorithm, exploiting the sigmoid activation's simple derivative; the objective function expresses the parameter gradient through the error, so the adaptation parameter V follows the same update method as the ordinary weight parameter W. Computational complexity is low, updates are iterative and fast, adaptation is quick, recognition performance is greatly improved, and system stability is high.
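The claim that V and W share one update rule can be checked on a toy softmax output layer: the softmax-plus-cross-entropy error δ = p − y drives both gradients, and plain gradient descent reduces the loss. Layer sizes, the label, and the learning rate below are illustrative, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = rng.standard_normal(40)          # hidden activation (stand-in)
ivec = rng.standard_normal(100)      # speaker i-vector (stand-in)
W = 0.01 * rng.standard_normal((10, 40))    # ordinary weight
V = 0.01 * rng.standard_normal((10, 100))   # adaptation weight
y = 3                                # toy class label
alpha = 0.01                         # learning rate

def loss(W, V):
    p = softmax(W @ h + V @ ivec)
    return -np.log(p[y])             # cross-entropy for a one-hot target

l0 = loss(W, V)
for _ in range(50):
    p = softmax(W @ h + V @ ivec)
    err = p.copy()
    err[y] -= 1.0                    # delta = dC/dz for softmax + cross-entropy
    W -= alpha * np.outer(err, h)    # identical update rule for both:
    V -= alpha * np.outer(err, ivec) # gradient = error times the layer's input
l1 = loss(W, V)
```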
Speaker adaptation plays a key role in speech recognition: it uses a small amount of per-speaker data to transform features or adjust model parameters so as to raise the recognition accuracy for a specific speaker. In speech acoustic modeling the DNN has become the mainstream acoustic model, but over-fitting persists and small amounts of adaptation data cannot be fully exploited, leaving the recognition system vulnerable to the environment, low in stability and poor in recognition performance. For these existing problems, the present invention proposes an adaptation method for a DNN acoustic model based on speaker-identity features: the acquired i-vector, which characterizes inter-speaker differences, is added layer by layer to the DNN acoustic model's hidden layers, adjusting the model parameters and optimizing the network model. A low-complexity, best-recognition-performance DNN acoustic model is obtained, so the system's recognition performance improves markedly.
A more concrete example is given below to further describe the present invention.
Embodiment 6
The adaptation method of the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiments 1-5.
Referring to Fig. 1, the DNN-based speaker adaptation method proposed by the present invention provides the network structure for adaptive learning of the DNN acoustic model. The implementation involves training and extracting the i-vector that characterizes speaker identity, and introducing 1 to 5 adaptation layers into the DNN acoustic model for training. A modeling study is carried out on the small dataset TIMIT, with the following specific steps:
1) The i-vector feature extractor is trained and features are extracted using the training data in the TIMIT corpus. Specific steps include:
1a) All voice data of the 462 training-set speakers are used to train one DNN model for speaker-independent feature extraction, which extracts low-dimensional speaker-independent features.
1b) Singular value decomposition (SVD) is applied to the weight matrix of the last hidden layer of the DNN model obtained in step 1a), and the decomposition replaces the original weights for DNN model retraining.
1c) The DNN model trained in step 1b) is applied to extract the low-dimensional features of the specific speaker.
1d) Using the low-dimensional features, a 100-dimensional i-vector is extracted for every utterance of every speaker through the universal background model (UBM) and the total variability matrix T. Training the UBM is a parameter-estimation process: with a large amount of background speaker data, expectation-maximization estimation under the maximum-likelihood criterion yields one speaker-independent, channel-independent Gaussian mixture model whose distribution functions have the same dimensionality as the acoustic features.
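Training the UBM described in step 1d) amounts to maximum-likelihood EM estimation of a diagonal-covariance GMM on features pooled across many background speakers. A minimal sketch, with scikit-learn's GaussianMixture standing in for the UBM trainer and random placeholder data in place of the DNN low-dimensional features (component count and dimensions here are illustrative, not the patent's):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder stand-in data: frame-level low-dimensional features pooled
# across many background speakers (the patent uses DNN-extracted features).
rng = np.random.default_rng(0)
frames = rng.normal(size=(2000, 10))  # 2000 frames, 10-dim features

# A UBM is a speaker- and channel-independent GMM estimated by EM under the
# maximum-likelihood criterion; diagonal covariances keep updates cheap.
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      max_iter=50, random_state=0).fit(frames)

print(ubm.weights_.shape, ubm.means_.shape)  # (8,) (8, 10)
```

In a real i-vector pipeline the UBM posteriors would then be used to accumulate the Baum-Welch statistics from which the total variability matrix T is estimated.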
2) The GMM-HMM speech recognition system is built. Specific steps include:
2a) MFCC features are extracted from the TIMIT training-set data: the 13-dimensional mel-frequency cepstral coefficients (MFCC) and their first- and second-order differences, 39-dimensional MFCC features in total.
2b) For all voice data of each speaker in the training set, the features in 2a) are pre-processed with cepstral mean and variance normalization (CMVN), and the triphone HMM-GMM system is trained.
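Steps 2a) and 2b) can be sketched as follows. The delta (difference) computation and per-speaker CMVN are standard; the base MFCC matrix here is random placeholder data, since a real pipeline would compute it from audio:

```python
import numpy as np

def deltas(feats, N=2):
    """First-order delta features via the standard regression formula
    over +/-N neighboring frames (applying it twice gives delta-deltas)."""
    T, D = feats.shape
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
               for n in range(1, N + 1)) / denom

def cmvn(feats):
    """Cepstral mean and variance normalization over one speaker's frames."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

# Placeholder 13-dim MFCC matrix for one utterance (100 frames).
mfcc = np.random.default_rng(0).normal(size=(100, 13))
d1 = deltas(mfcc)
d2 = deltas(d1)
full = cmvn(np.hstack([mfcc, d1, d2]))  # 13 + 13 + 13 = 39 dims
print(full.shape)  # (100, 39)
```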
2c) The features processed in 2b) are spliced with 4 frames of left and right context; the resulting very high-dimensional feature space is reduced to a low-dimensional space by a linear discriminant analysis (LDA) transform, yielding 40-dimensional low-dimensional features. The effect of LDA is to minimize within-class differences between speakers and maximize between-class differences. A maximum likelihood linear transform (MLLT) is then applied, obtaining decorrelated features better matched to the acoustic model.
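A hedged sketch of the LDA reduction in step 2c), with scikit-learn's LinearDiscriminantAnalysis standing in for the estimation (in a real system the classes would be tied HMM states and MLLT would follow); the data, labels, and class count are illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder spliced features: 39-dim frames spliced with +/-4 context
# frames give 9 * 39 = 351-dim super-vectors, each with a class label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 351))
y = rng.integers(0, 50, size=500)  # 50 hypothetical classes

# LDA seeks a projection that minimizes within-class scatter and maximizes
# between-class scatter; keep 40 discriminant dimensions as in step 2c).
lda = LinearDiscriminantAnalysis(n_components=40).fit(X, y)
low_dim = lda.transform(X)
print(low_dim.shape)  # (500, 40)
```

Note that LDA can keep at most (number of classes - 1) dimensions, so 40 components require at least 41 classes.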
2d) Finally, feature-space maximum likelihood linear regression (fMLLR) is applied to normalize the decorrelated features of step 2c), giving features represented by codebook mean vectors, called fMLLR features; these features match the Gaussian mixture model well.
2e) A Gaussian mixture model (GMM) is obtained as the weighted sum of 2154 Gaussian probability distribution functions; each Gaussian uses a diagonal covariance to reduce the complexity of GMM parameter updates. The GMM fits the probability distribution of the training-set voice data in the TIMIT corpus. With the fMLLR features as the input features of the GMM, the mixture model serves as the acoustic model; the weight of each Gaussian component is trained with the maximum mutual information (MMI) criterion, yielding the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR.
3) The DNN-HMM baseline system is built. Specific steps include:
3a) Using the HMM-GMM recognition system trained in 2), forced alignment is performed on the TIMIT training-set data, giving the true label of each speech frame (i.e., the context-dependent HMM state label). These alignments serve as the baseline DNN training labels for supervised DNN acoustic model training.
3b) The 40-dimensional fMLLR features generated in 2d) are spliced with 5 frames of left and right context, so the DNN acoustic model input layer has 440 nodes; there are 5 hidden layers, each with 1024 nodes, and the activation function is the sigmoid. The softmax output layer has 2492 nodes, the number of context-dependent HMM states. The baseline DNN is trained with pre-training followed by fine-tuning. When learning the transform weights W and bias vectors b, the DNN is trained with the error back-propagation (BP) algorithm, with the cross entropy between the predicted probability distribution and the actual probability distribution as the objective function; the mini-batch size of stochastic gradient descent is set to 256, the momentum factor is 0.9, and the initial learning rate is 0.05, halved whenever the development-set frame classification accuracy decreases, until convergence or until it falls below a set threshold. The final DNN acoustic model structure is 440-1024-1024-1024-1024-1024-2492; the training set and cross-validation set account for 95% and 5% of the training data respectively. The baseline system word error rate (WER) with fMLLR features is 16.3%.
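The learning-rate policy in step 3b) (start at 0.05 and halve whenever development-set frame accuracy stops improving) can be sketched as follows; the stopping floor `min_lr` is a hypothetical stand-in for the patent's unspecified threshold:

```python
def newbob_schedule(dev_accuracies, lr0=0.05, min_lr=1e-4):
    """Sketch of the halving schedule: start from lr0 and halve the learning
    rate whenever dev-set frame accuracy fails to improve on the best so far,
    stopping once the rate drops below a floor (treated as convergence)."""
    lr, best, lrs = lr0, float("-inf"), []
    for acc in dev_accuracies:
        if acc <= best:      # accuracy stalled or decreased -> halve
            lr /= 2.0
        best = max(best, acc)
        lrs.append(lr)
        if lr < min_lr:      # treat as converged
            break
    return lrs

# Hypothetical per-epoch dev-set frame accuracies:
print(newbob_schedule([0.50, 0.55, 0.54, 0.56, 0.56]))
# -> [0.05, 0.05, 0.025, 0.025, 0.0125]
```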
4) Layer-by-layer personal identity feature adaptation is performed on the DNN acoustic model. Specific steps include:
4a) The 100-dimensional i-vectors characterizing speaker identity obtained in 1) are added in turn to layers 1 to 5 of the DNN acoustic model.
4b) The input layer of the DNN acoustic model with speaker adaptation has 440 nodes; each subsequent hidden layer has 1124 input nodes, the activation function is the sigmoid, and the output layer has 2492 nodes. This system is likewise trained with pre-training and fine-tuning. The adaptive weights V are trained by the same method as the ordinary weights W: the error back-propagation (BP) algorithm, with the cross entropy between the predicted and actual probability distributions as the objective function; the mini-batch size of stochastic gradient descent is set to 256, the momentum factor is 0.9, and the initial learning rate is 0.05, halved whenever the development-set frame classification accuracy decreases, until convergence or until it falls below a set threshold. The final DNN acoustic model structure is 440-1124-1024/1124-1024/1124-1024/1124-1024/1124-2492; the training set and cross-validation set account for 95% and 5% of the training data respectively.
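The adaptation in step 4) gives each hidden layer a pre-activation of the form W·h + V·v + b, where v is the 100-dimensional speaker i-vector and V is the adaptive weight trained like W; equivalently, concatenating v to a 1024-dimensional layer input yields the 1124-node layer inputs of the final structure. A minimal forward-pass sketch with random weights and illustrative sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_layer(h_prev, ivec, W, V, b):
    """One adapted hidden layer: the ordinary weight W acts on the previous
    layer's output while the adaptive weight V injects the speaker i-vector,
    so the pre-activation is W h + V v + b."""
    return sigmoid(W @ h_prev + V @ ivec + b)

rng = np.random.default_rng(0)
h_prev = rng.normal(size=440)               # spliced fMLLR input (440-dim)
ivec = rng.normal(size=100)                 # speaker identity i-vector
W = rng.normal(scale=0.01, size=(1024, 440))
V = rng.normal(scale=0.01, size=(1024, 100))  # trained like W, via BP
b = np.zeros(1024)

h1 = adaptive_layer(h_prev, ivec, W, V, b)
print(h1.shape)  # (1024,)
```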
The present invention learns the weights with cross entropy as the objective function, combined with the error back-propagation algorithm and gradient descent. During training, the ordinary weights W and the adaptive weights V are trained by the same method; the computation is simple, the complexity low, and fast speaker adaptation is realized. In the adaptation stage, the extracted i-vector features are not simply concatenated with the DNN input features; instead, the i-vector features are added to the DNN model layer by layer and each layer is adaptively adjusted, improving system recognition accuracy.
5) The test set of the TIMIT corpus, containing data from 24 specific speakers, is used to test the adaptive DNN acoustic model trained in step 4); the evaluation index is the word error rate (WER) of automatic speech recognition. The SVD-based i-vector extraction and the original i-vector extraction method are tested separately, with the test results shown in Fig. 4, the test results for different numbers of adaptation layers on the TIMIT corpus. The horizontal axis is the number of adaptation layers added to the DNN acoustic model and the vertical axis is the word error rate. There are two curves: the upper curve is the experimental result of the traditional adaptive method, and the lower curve is the result of the proposed adaptive method based on the specific speaker's identity i-vector features. Comparing the two curves, the figure shows that when two layers of adaptive transformation are introduced into the DNN acoustic model, the recognition performance of the present invention is best, with the recognition error rate reduced by 12.27%.
This example was tested on the open-source corpus TIMIT. The experimental results show that the proposed adaptive training method for the DNN acoustic model yields a considerably lower recognition error rate than the traditional adaptive method; when the first two layers of the DNN acoustic model are made adaptive, the resulting adaptive DNN acoustic model is optimal, the recognition accuracy of the system is highest, and the anti-interference ability is best.
Embodiment 7
The speaker adaptation method based on a deep neural network (DNN) is the same as in Embodiments 1-6. Referring to Fig. 1, a network structure for adaptive learning of the DNN acoustic model is provided; the realization involves training and extracting the i-vector features characterizing speaker identity and introducing 1 to 5 adaptation layers into the DNN acoustic model for training. Modeling research is carried out on the large-scale data set Switchboard, with the following specific steps:
1) The i-vector feature extractor is trained and features are extracted using the training data in the Switchboard corpus. Specific steps include:
1a) MFCC features of non-specific speakers are extracted from the 309 hours of training-set voice data covering 1540 different speakers, and a DNN model is trained for speaker-independent feature extraction.
1b) Singular value decomposition (SVD) is applied to the weight matrix of the last hidden layer of the DNN model obtained in step 1a), and the decomposition replaces the original weight matrix for retraining.
1c) The DNN trained in step 1b) is applied to extract low-dimensional features.
1d) Using the low-dimensional features, a 100-dimensional i-vector feature is extracted for every utterance of every speaker through the universal background model (UBM) and the total variability matrix T.
2) The GMM-HMM speech recognition system is built and the Gaussian mixture model (GMM) is modeled. Specific steps include:
2a) MFCC features are extracted from the Switchboard training-set voice data: the 13-dimensional mel-frequency cepstral coefficients (MFCC) and their first- and second-order differences, 39-dimensional MFCC features in total.
2b) For all voice data of each speaker in the training set, the features in 2a) are pre-processed with cepstral mean and variance normalization (CMVN), giving noise-robust features for training the triphone HMM-GMM system.
2c) The features processed in 2b) are spliced with 4 frames of left and right context; the high-dimensional feature space is reduced to a low-dimensional subspace by the linear discriminant analysis (LDA) transform, yielding 40-dimensional low-dimensional features, and a maximum likelihood linear transform (MLLT) is applied to obtain features better matched to the acoustic model.
2d) Finally, feature-space maximum likelihood linear regression (fMLLR) is applied to normalize the features of step 2c); the resulting features, represented by codebook mean vectors, are called fMLLR features.
2e) The probability distribution of the Switchboard training-set voice data is fitted with a Gaussian mixture model obtained as the linear combination of 4712 diagonal-covariance Gaussian distribution functions. With the fMLLR features as the input features of the GMM, the mixture model serves as the acoustic model; the weight of each Gaussian component is trained with the maximum mutual information (MMI) criterion, yielding the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR.
3) The DNN-HMM baseline system is built. Specific steps include:
3a) Using the HMM-GMM recognition system trained in 2), forced alignment is performed on the training data, giving the true label of each speech frame (i.e., the context-dependent HMM state label); these serve as the baseline DNN training labels.
3b) The 40-dimensional fMLLR features generated in 2d) are spliced with 5 frames of left and right context, so the DNN input layer has 440 nodes; there are 5 hidden layers, each with 1024 nodes, and the activation function is the sigmoid. The softmax output layer has 8991 nodes, the number of context-dependent HMM states. The baseline DNN is trained with pre-training followed by fine-tuning. When learning the transform weights W and bias vectors b, the DNN is trained with the error back-propagation (BP) algorithm, with the cross entropy between the predicted and actual probability distributions as the objective function; the mini-batch size of stochastic gradient descent is set to 1024, the momentum factor is 0.9, and the initial learning rate is 0.05, halved whenever the development-set frame classification accuracy decreases, until convergence or until it falls below a set threshold. The final DNN acoustic model structure is 440-2048-2048-2048-2048-2048-8991; the training set and cross-validation set account for 95% and 5% of the training data respectively. The baseline system word error rate (WER) with fMLLR features is 10.83%.
4) Layer-by-layer personal identity feature adaptation is performed on the DNN acoustic model. Specific steps include:
4a) The 100-dimensional i-vector features characterizing specific-speaker identity obtained in 1) are added in turn to hidden layers 1 to 5 of the DNN acoustic model.
4b) The input layer of the DNN with speaker adaptation has 440 nodes; each subsequent hidden layer has 1124 input nodes, the activation function is the sigmoid, and the output layer has 8991 nodes. This system is likewise trained with pre-training and fine-tuning. The adaptive weights V are trained by the same method as the ordinary weights W: the error back-propagation (BP) algorithm, with the cross entropy between the predicted and actual probability distributions as the objective function; the mini-batch size of stochastic gradient descent is set to 1024, the momentum factor is 0.9, and the initial learning rate is 0.05, halved whenever the development-set frame classification accuracy decreases, until convergence or until it falls below a set threshold. The final DNN acoustic model structure is 440-2148-2048/2148-2048/2148-2048/2148-2048/2148-8991; the training set and cross-validation set account for 95% and 5% of the training data respectively.
5) The 1831 test utterances of the NIST2000 data set are used to test the adaptive system model trained in 4); the evaluation index is the word error rate (WER) of automatic speech recognition. The SVD-based i-vector feature extraction method and the traditional i-vector extraction method are tested separately, with the test results shown in Fig. 5, the test results for different numbers of adaptation layers on the Switchboard corpus. In Fig. 5 the horizontal axis is the number of adaptation layers added to the DNN acoustic model and the vertical axis is the word error rate; there are two test curves, the upper curve being the experimental result of the traditional adaptive method and the lower curve the result of the proposed adaptive method based on the specific speaker's identity i-vector features. Comparing the two curves, the figure shows that when two layers of adaptive transformation are introduced into the DNN acoustic model, the recognition performance of the present invention is best, with the recognition error rate reduced by 11.3%.
This example was tested on the open-source corpus Switchboard. The experimental results show that the proposed adaptive training method for the DNN acoustic model yields a considerably lower recognition error rate than the traditional adaptive method; when the first two layers of the DNN acoustic model are made adaptive, the resulting adaptive DNN acoustic model is optimal, the recognition accuracy of the system is highest, and the stability is best.
In the present invention, SVD is used in extracting the low-dimensional features that serve as the input for i-vector extraction, so that the extracted i-vectors have stronger characterization ability and remain robust in noisy environments. Adaptive training at each hidden layer of the DNN acoustic model makes full use of the small amount of adaptation data; the target speaker's i-vector information is mapped layer by layer into the features of the DNN acoustic model, reducing the influence of inter-speaker differences and further improving system performance.
The DNN acoustic model adaptive method based on personal identity features disclosed by the present invention solves the technical problems that over-fitting easily occurs during adaptive training, that i-vector characterization ability is poor, and that system robustness is low. The specific implementation comprises: extracting personal identity features, using MFCC features as the input of the speaker-independent DNN model; building a GMM-HMM speech recognition system; building a DNN-HMM baseline system with a multi-hidden-layer DNN acoustic model; and performing layer-by-layer personal identity feature adaptive training on the DNN acoustic model to obtain a DNN acoustic model with adaptive ability for the target speaker. In personal identity feature extraction, SVD is used to decompose the weight matrix of the last hidden layer of the DNN model used for feature extraction, replacing the original weights; the i-vector features are added layer by layer to the hidden layers of the DNN acoustic model for adaptive training. A small amount of per-speaker data is used to transform features or adjust model parameters, improving the recognition accuracy of the target speaker. The present invention has low complexity and good recognition performance, with the recognition performance markedly improved. It can be applied to speech-recognition-related intelligent robots, communications, in-vehicle voice systems, medical care, voice customer service, etc.
Claims (6)
1. An adaptive method for a DNN acoustic model based on personal identity features, characterized by comprising the following steps:
1) extracting the personal identity features of the specific speaker: training one DNN model with the MFCC features of non-specific speakers; decomposing the weights of the last hidden layer of the DNN model with singular value decomposition; replacing the original MFCC features with the decomposed features to retrain the DNN model, obtaining one DNN model for low-dimensional feature extraction; after the low-dimensional features of non-specific speakers are extracted with this DNN model, training and aligning the low-dimensional features with a universal background model to obtain the personal identity features of non-specific speakers, represented by one vector; when the personal identity features of a specific speaker are to be extracted, substituting the specific speaker for the non-specific speakers in the above operations, realizing personal identity feature extraction for the specific speaker;
2) building the GMM-HMM speech recognition system: modeling the traditional acoustic model, the Gaussian mixture model (GMM), with specific implementation steps including:
2a) extracting 13-dimensional low-dimensional features from the training data in the corpus with the mel-frequency cepstral coefficient method, and taking the first- and second-order differences of each dimension to obtain 39-dimensional MFCC features;
2b) pre-processing the 39-dimensional MFCC features with cepstral mean and variance normalization to obtain mean-and-variance-normalized features;
2c) splicing the normalized features with left and right context frames to obtain very high-dimensional features; reducing the very high-dimensional feature space to a low-dimensional subspace by a linear discriminant analysis transform to obtain low-dimensional features; and applying the maximum likelihood linear transform to obtain decorrelated features based on the maximum-likelihood criterion;
2d) applying feature-space maximum likelihood linear regression to the decorrelated features to obtain features represented by codebook mean vectors, called fMLLR features;
2e) fitting the probability distribution of the voice data with the linear combination of k diagonal-covariance Gaussian distribution functions to obtain the Gaussian mixture model (GMM); using the fMLLR features as the input features of the GMM and training the weight of each Gaussian component in the mixture model with the maximum mutual information criterion, obtaining the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR;
3) constructing the non-specific-speaker DNN-HMM speech recognition baseline system with a multi-hidden-layer DNN acoustic model: in the trained GMM-HMM recognition system, performing forced alignment on the training data to obtain the true label of each speech frame for supervised DNN acoustic model training; using each dimension of the extracted fMLLR features, spliced with left and right context frames, as the input of the DNN acoustic model; carrying out initialization training with the training-set data and cross-validation-set data of the corpus, completing the modeling of the discriminatively trained DNN acoustic model and obtaining a non-specific-speaker DNN-HMM speech recognition baseline system with a multi-hidden-layer DNN acoustic model;
4) performing layer-by-layer personal identity feature adaptation on the DNN acoustic model: in the non-specific-speaker DNN-HMM speech recognition baseline system, carrying out adaptive training of the DNN acoustic model with the personal identity features that discriminate the specific speaker; specifically, adding adaptation data layer by layer to each hidden layer of the DNN acoustic model for training, the adaptation data being the extracted personal identity features of the specific speaker; in the adaptation stage, training the adaptive weights V and the ordinary weights W with the cross-entropy criterion, obtaining a DNN acoustic model with adaptive ability for the specific speaker.
2. The adaptive method for a DNN acoustic model based on personal identity features according to claim 1, characterized in that extracting the personal identity features of the specific speaker described in step 1) comprises the following steps:
1a) extracting the 39-dimensional low-dimensional MFCC features, including their first- and second-order differences, from the voice data in the corpus, and training one DNN model for non-specific-speaker low-dimensional feature extraction;
1b) decomposing the last hidden-layer weight matrix of the trained non-specific-speaker feature-extraction DNN model with singular value decomposition (SVD), and replacing the original weight matrix with the decomposition;
1c) carrying out DNN model training with the back-propagation algorithm and gradient descent, then extracting the low-dimensional features of non-specific speakers with the trained DNN model;
1d) applying the trained DNN model to extract low-dimensional features from the voice data of all speakers in the training set and test set, then training and aligning with the universal background model to extract the i-vector feature of each specific speaker's utterance.
3. The adaptive method for a DNN acoustic model based on personal identity features according to claim 1 or 2, characterized in that the extraction of the identity vector i-vector characterizing the specific speaker described in steps 1c) and 1d) has the expression:
M = m + Tx + e
where M denotes the GMM mean super-vector of the specific speaker, m denotes the mean super-vector of the UBM, T denotes the total variability space, x denotes the extracted i-vector feature characterizing the personal identity, and e denotes the residual noise term.
4. The adaptive method for a DNN acoustic model based on personal identity features according to claim 1, characterized in that building the DNN-HMM baseline system described in step 3) comprises the specific implementation steps:
3a) in the trained HMM-GMM recognition system, performing forced alignment on the training data to obtain the true label of each speech frame, i.e., the context-dependent HMM state label, for supervised DNN acoustic model training;
3b) using each dimension of the extracted 40-dimensional fMLLR features, spliced with left and right context frames, as the input of the DNN acoustic model; the number of hidden-layer nodes depends on the size of the training set, the activation function is the sigmoid, the output layer uses the softmax function with classification ability, and the number of output-layer nodes is the number of context-dependent HMM states;
3c) DNN acoustic model training: with the training-set data and cross-validation-set data of the same corpus, using pre-training and fine-tuning to initialize the DNN acoustic model once, completing the modeling of the discriminatively trained DNN acoustic model; the training set and cross-validation set account for 95% and 5% of the training data respectively;
3d) after training is completed, finally obtaining the non-specific-speaker DNN-HMM speech recognition baseline system with a multi-hidden-layer DNN acoustic model; the system has an acoustic model, a language model and a decoder, and is used for non-specific-speaker speech recognition.
5. The adaptive method for a DNN acoustic model based on personal identity features according to claim 1, characterized in that the layer-by-layer personal identity feature adaptation of the DNN acoustic model described in step 4) comprises the following specific steps:
4a) training and extraction of the personal identity i-vector features: applying singular value decomposition (SVD) to the weight matrix of the last hidden layer of the low-dimensional-feature DNN model used for extracting non-specific speakers' features, replacing the original weight matrix, and extracting the low-dimensional features of non-specific speakers;
4b) extracting the i-vector feature vector for each utterance of the training set and test set with the universal background model (UBM) and the total variability matrix T, obtaining a personal identity i-vector feature of a certain dimension that can characterize identity information;
4c) obtaining the best-performing DNN acoustic model through the update of the adaptive weights V: adding the SVD-based i-vector features and the original MFCC speech features in turn to layers 1 to 5 of the DNN acoustic model; iteratively updating the parameters, the adaptive weights V and the ordinary weights W, by minimizing the target loss function to optimize the model; and obtaining the best-performing DNN acoustic model by comparison.
6. The adaptive method for a DNN acoustic model based on personal identity features according to claim 1, characterized in that the update of the adaptive weights V described in step 4c) has the expression:
V_i^(t+1) = V_i^(t) - α · (∂C/∂z_i) · (∂z_i/∂V_i) = V_i^(t) - α · δ_i^(t) · (∂z_i/∂V_i), i = 1, 2, ..., P
where C denotes the cross-entropy loss function, z_i denotes the pre-activation of the i-th layer, α denotes the learning rate, δ_i^(t) denotes the error of the i-th layer at iteration t, and P denotes the number of hidden layers in the DNN acoustic model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910016412.7A CN109637526A (en) | 2019-01-08 | 2019-01-08 | The adaptive approach of DNN acoustic model based on personal identification feature |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109637526A true CN109637526A (en) | 2019-04-16 |
Family
ID=66060156
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110349573A (en) * | 2019-07-04 | 2019-10-18 | 广州云从信息科技有限公司 | A kind of audio recognition method, device, machine readable media and equipment |
CN110648654A (en) * | 2019-10-09 | 2020-01-03 | 国家电网有限公司客户服务中心 | Speech recognition enhancement method and device introducing language vectors |
CN111243579A (en) * | 2020-01-19 | 2020-06-05 | 清华大学 | Time domain single-channel multi-speaker voice recognition method and system |
CN111261146A (en) * | 2020-01-16 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Speech recognition and model training method, device and computer readable storage medium |
CN112599121A (en) * | 2020-12-03 | 2021-04-02 | 天津大学 | Speaker self-adaption method based on auxiliary data regularization |
CN112908305A (en) * | 2021-01-30 | 2021-06-04 | 云知声智能科技股份有限公司 | Method and equipment for improving accuracy of voice recognition |
CN112992126A (en) * | 2021-04-22 | 2021-06-18 | 北京远鉴信息技术有限公司 | Voice authenticity verification method and device, electronic equipment and readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105745700A (en) * | 2013-11-27 | 2016-07-06 | 国立研究开发法人情报通信研究机构 | Statistical-acoustic-model adaptation method, acoustic-model learning method suitable for statistical-acoustic-model adaptation, storage medium in which parameters for building deep neural network are stored, and computer program for adapting statistical acoustic model |
CN105895104A (en) * | 2014-05-04 | 2016-08-24 | 讯飞智元信息科技有限公司 | Adaptive speaker identification method and system |
CN107146601A (en) * | 2017-04-07 | 2017-09-08 | 南京邮电大学 | A kind of rear end i vector Enhancement Methods for Speaker Recognition System |
CN109119072A (en) * | 2018-09-28 | 2019-01-01 | 中国民航大学 | Civil aviaton's land sky call acoustic model construction method based on DNN-HMM |
Non-Patent Citations (2)
Title |
---|
LIANG Yulong et al.: "Research on speaker-aware training method based on improved i-vector", Computer Engineering * |
JIN Chao et al.: "Research on speaker adaptation of neural network acoustic models in speech recognition", Computer Applications and Software * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110349573A (en) * | 2019-07-04 | 2019-10-18 | 广州云从信息科技有限公司 | Speech recognition method, apparatus, machine-readable medium and device |
CN110648654A (en) * | 2019-10-09 | 2020-01-03 | 国家电网有限公司客户服务中心 | Speech recognition enhancement method and device introducing language vectors |
CN111261146A (en) * | 2020-01-16 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Speech recognition and model training method, device and computer readable storage medium |
CN111243579A (en) * | 2020-01-19 | 2020-06-05 | 清华大学 | Time domain single-channel multi-speaker voice recognition method and system |
CN112599121A (en) * | 2020-12-03 | 2021-04-02 | 天津大学 | Speaker adaptation method based on auxiliary-data regularization |
CN112908305A (en) * | 2021-01-30 | 2021-06-04 | 云知声智能科技股份有限公司 | Method and equipment for improving accuracy of voice recognition |
CN112992126A (en) * | 2021-04-22 | 2021-06-18 | 北京远鉴信息技术有限公司 | Voice authenticity verification method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109637526A (en) | Adaptive method for a DNN acoustic model based on speaker identity features | |
CN108806667A (en) | Neural-network-based method for synchronous recognition of speech and emotion | |
CN110289003A (en) | Voiceprint recognition method, model training method and server | |
CN109767759A (en) | End-to-end speech recognition method based on a modified CLDNN structure | |
CN107680582A (en) | Acoustic model training method, speech recognition method, apparatus, device and medium | |
CN105139864B (en) | Speech recognition method and device | |
CN108564940A (en) | Speech recognition method, server and computer-readable storage medium | |
CN108281137A (en) | Universal voice wake-up recognition method and system under an all-phoneme framework | |
CN109119072A (en) | DNN-HMM-based acoustic model construction method for civil aviation air-ground communication | |
CN108172218A (en) | Voice modeling method and device | |
CN107492382A (en) | Voiceprint extraction method and device based on neural networks | |
CN108109613A (en) | Audio training and recognition method for an intelligent dialogue voice platform, and electronic device | |
CN110321418A (en) | Deep-learning-based method for domain and intent recognition and slot filling | |
CN105575394A (en) | Voiceprint recognition method based on hybrid modeling with a total variability space and deep learning | |
CN109887484A (en) | Speech recognition and speech synthesis method and device based on dual learning | |
CN110459225A (en) | Speaker recognition system based on CNN fused features | |
CN109637545A (en) | Voiceprint recognition method based on one-dimensional convolution and an asymmetric bidirectional long short-term memory network | |
CN108986798B (en) | Voice data processing method, device and equipment | |
CN109313892A (en) | Robust language identification method and system | |
CN108962247B (en) | Multi-dimensional voice information recognition system and method based on a progressive neural network | |
CN109754790A (en) | Speech recognition system and method based on a hybrid acoustic model | |
CN108922515A (en) | Speech model training method, speech recognition method, apparatus, device and medium | |
CN110223714A (en) | Voice-based emotion recognition method | |
CN109192199A (en) | Data processing method combining a bottleneck-feature acoustic model | |
CN108694949A (en) | Speaker recognition method and device based on reordered supervectors and a residual network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 2019-04-16 |