CN109637526A - Adaptation method for a DNN acoustic model based on speaker-identity features


Info

Publication number
CN109637526A
CN109637526A (application CN201910016412.7A)
Authority
CN
China
Prior art keywords
feature
dnn
model
training
acoustic model
Legal status (an assumption, not a legal conclusion): Pending
Application number
CN201910016412.7A
Other languages
Chinese (zh)
Inventor
李颖
闫贝贝
郭旭东
Current Assignee (the listed assignees may be inaccurate)
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN201910016412.7A
Publication of CN109637526A


Classifications

    • G10L 15/16: Speech recognition; speech classification or search using artificial neural networks
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/144: Speech recognition using statistical models (Hidden Markov Models, HMMs); training of HMMs
    • G10L 25/24: Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum

Abstract

The invention discloses an adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features. It addresses the problems that over-fitting easily occurs in adaptive training, that the identity-characterization ability is poor, and that the robustness is low. The implementation comprises: extracting speaker-identity features, using MFCC features as the input of the speaker-independent DNN model; building a GMM-HMM speech recognition system; building a DNN-HMM baseline system whose DNN acoustic model has multiple hidden layers; and carrying out layer-by-layer identity-feature adaptive training of the DNN acoustic model to obtain a DNN acoustic model with adaptation ability for a specific speaker. In the identity-feature extraction, the weight matrix of the last hidden layer of the DNN model is decomposed with SVD and the decomposition replaces the original features. The invention makes full use of a small amount of speaker data to adjust the model parameters and improve the recognition accuracy for the specific speaker. The complexity is low and the recognition performance is clearly improved. It can be used in intelligent systems related to speech recognition, such as communication, medical, and in-vehicle systems.

Description

Adaptation method for a DNN acoustic model based on speaker-identity features
Technical field
The invention belongs to the field of communication technology and mainly relates to speaker-identity i-vector feature extraction, specifically an adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features, for speaker-independent speech recognition.
Background art
In recent years, deep neural networks (DNN) have achieved great success in speech recognition. In acoustic modeling, the hidden Markov model-deep neural network (HMM-DNN) system has better acoustic discrimination than the traditional hidden Markov model-Gaussian mixture model (HMM-GMM) system and greatly improves speech recognition performance, so the DNN has become the mainstream acoustic model.
However, a problem remains in both systems: the speech of the speakers in the training data and the speech of the target speaker are mismatched, i.e., training data and test data are assumed to obey the same distribution. This assumption does not hold in real life, mainly because the training data is limited and cannot cover all application scenarios; the training and test conditions therefore do not match and the recognition accuracy of the system declines.
Speaker adaptation techniques are usually used to solve the mismatch between the model and the test speaker. Speaker adaptation is divided into model-domain adaptation and feature-domain adaptation. Model-domain speaker adaptation of a DNN acoustic model directly adjusts the model parameters with the adaptation data of the target speaker. However, because the DNN model has many hidden layers and a huge number of parameters, it easily over-fits the small amount of adaptation data.
Feature-domain adaptation of a DNN acoustic model adds a linear transformation layer to the trained speaker-independent (SI) model; in the adaptation stage only this linear transformation layer is adjusted for different speakers. This reduces over-fitting to some extent, but because only one layer is adapted, the recognition accuracy does not improve much.
Many research institutions have studied the adaptation of deep neural networks. Among these methods, speaker adaptation based on speaker-identity i-vector features has received significant attention. Its basic idea is to concatenate the obtained i-vector feature with the original input feature and then train the DNN acoustic model. The method is simple, easy to combine with other adaptation methods, and also applicable under noisy conditions. It uses low-dimensional, speaker-independent MFCCs as input features; these features have fairly good characterization ability and certain robustness, but they are low-level features with limited characterization ability, and their robustness degrades in harsh environments, so the adaptation ability of the extracted i-vector features is affected by the environment.
Singular value decomposition (SVD) has also been applied to a certain layer of the network used for i-vector extraction. This overcomes the drop in frame classification accuracy and somewhat improves recognition performance, but the identity characterization is still not good: the improved i-vector is only simply concatenated with the original input feature at the input layer, so the small amount of speaker information cannot be fully used and the recognition performance of the system is limited.
Some research institutions use bottleneck features, which have stronger characterization ability, instead of MFCC features to obtain the i-vector. Bottleneck features are better than low-level MFCC features in characterization ability and robustness, but extracting them requires introducing a bottleneck layer into the DNN structure, which reduces the frame classification accuracy of the DNN.
In conclusion, current adaptation research has achieved considerable success, but the small amount of adaptation data is not fully used, the network structures are complicated, the computational complexity is high, and the stability is not good enough.
The adaptation method proposed by the invention updates the acoustic model parameters iteratively, the computational complexity is low, and the adaptation ability of the recognition system is strong.
Summary of the invention
In view of the deficiencies of the prior art, the invention proposes an adaptation method for a DNN acoustic model based on speaker-identity features, with low complexity and good recognition performance.
The invention is an adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features, characterized by comprising the following steps:
1) Extract the speaker-identity i-vector feature of the specific speaker. Train a DNN model with the MFCC features of speaker-independent data; decompose the weights of the last hidden layer of this DNN model with singular value decomposition (SVD); retrain the DNN model with the decomposed features in place of the original MFCC features to obtain a DNN model for extracting low-dimensional features; extract the low-dimensional features of the speaker-independent data with this DNN model, then train and align these low-dimensional features with a universal background model (UBM) to obtain the speaker-identity (i-vector) feature, represented by one vector. When the identity feature of a specific speaker is to be extracted, the specific speaker's data replaces the speaker-independent data in the above operations, realizing i-vector feature extraction for the specific speaker;
2) Build a GMM-HMM speech recognition system; model the traditional acoustic model, the Gaussian mixture model GMM. The specific implementation steps include:
2a) Extract 13-dimensional low-dimensional features from the training data of the corpus using the mel-frequency cepstral coefficient (MFCC) method, and compute first-order and second-order differences for each dimension to obtain 39-dimensional MFCC features;
2b) Pre-process the 39-dimensional MFCC features with cepstral mean and variance normalization (CMVN) to obtain variance-normalized features;
2c) Extend the variance-normalized features to the left and right frame by frame to obtain features in a very high-dimensional space, reduce them to a low-dimensional subspace by linear discriminant analysis (LDA) to obtain low-dimensional features, and apply a maximum likelihood linear transform (MLLT) to obtain decorrelated features based on the maximum-likelihood criterion;
2d) Apply feature-space maximum likelihood linear regression (fMLLR) to the decorrelated features to obtain features represented by codebook mean vectors, called fMLLR features;
2e) Fit the probability distribution of the speech data with a linear combination of k diagonal-covariance Gaussian distribution functions to obtain the Gaussian mixture model GMM. Use the fMLLR features as the input features of the GMM, train the weight of each Gaussian component with the maximum mutual information criterion (MMI), and obtain the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR.
3) Construct a speaker-independent DNN-HMM speech recognition baseline system whose DNN acoustic model has multiple hidden layers. In the trained GMM-HMM recognition system, force-align the training data to obtain the true label corresponding to each speech frame, for supervised DNN acoustic model training. Expand each dimension of the extracted fMLLR features with left and right context frames as the input of the DNN acoustic model; perform initialization training with the training-set data and cross-validation-set data of the corpus, and complete the modeling of the DNN acoustic model with discriminative training, obtaining a speaker-independent DNN-HMM speech recognition baseline system with a multi-hidden-layer DNN acoustic model.
4) Perform layer-by-layer identity-feature adaptation of the DNN acoustic model. In the speaker-independent DNN-HMM baseline system, carry out adaptive training of the DNN acoustic model with the speaker-identity i-vector features, which are discriminative for the specific speaker; specifically, adaptation data are added to each hidden layer of the DNN acoustic model in turn for training, the adaptation data being the extracted i-vector features of the specific speaker. In the adaptation stage, the adaptive weights and the ordinary weights are trained with the cross-entropy criterion, obtaining a DNN acoustic model with adaptation ability for the specific speaker.
The invention uses singular value decomposition for the extraction of the i-vector feature characterizing speaker identity and optimizes the i-vector performance, so that the recognition performance of the final system is further improved.
Compared with the prior art, the invention has the following advantages:
1. The i-vector features extracted with singular value decomposition (SVD) have better characterization ability and robustness than traditional i-vectors extracted from MFCC input features. Compared with extracting i-vectors from bottleneck features, no additional bottleneck layer is introduced into the DNN structure, which avoids the drop in DNN frame classification accuracy and further improves the recognition performance of the system.
2. The extracted i-vector features are introduced into each hidden layer of the DNN acoustic model in turn. Using the cross-entropy criterion, i.e., the cross entropy between the predicted probability distribution and the actual probability distribution as the objective function (written out after this list), every speech frame is trained discriminatively to learn the weights of the adaptation layers shared by all speakers. The adaptation-layer weights are updated in the same way as ordinary DNN weights, with the error back-propagation algorithm and gradient descent. The computational complexity is low, the training procedure is simple, and the adaptation performance is better.
3. The invention adds 1 to 5 adaptation layers to the DNN acoustic model, which can fully use the small amount of adaptation data to achieve global adaptation; the target speaker's i-vector information can be mapped layer by layer into the features of the DNN model, removing speaker-specific information from the features while retaining semantic information and reducing the differences between speakers, so the improvement in recognition performance is significant.
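For reference, the frame-level cross-entropy objective mentioned in advantage 2 can be written in the standard form below (a minimal formulation in our own notation, not copied from the patent):

C = -\sum_{t=1}^{T}\sum_{k=1}^{K} y_{t,k}\,\log \hat{y}_{t,k}

where y_{t,k} is the forced-alignment (one-hot) target for frame t and context-dependent HMM state k, and \hat{y}_{t,k} is the corresponding softmax output of the DNN acoustic model.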
Brief description of the drawings
Fig. 1 is the structure diagram of adaptive training of the DNN acoustic model in the method of the invention;
Fig. 2 is the framework of the GMM-HMM recognition system;
Fig. 3 is the framework of the DNN-HMM baseline system;
Fig. 4 shows the test results for different numbers of adaptation layers on the TIMIT corpus;
Fig. 5 shows the test results for different numbers of adaptation layers on the Switchboard corpus.
Specific embodiments
The invention is described in detail below with reference to the accompanying drawings and embodiments.
Embodiment 1
In recent years, speaker adaptation techniques have received more and more attention. Adaptation methods are divided into model-domain adaptation and feature-domain adaptation. Adaptation techniques are mature in hidden Markov-Gaussian mixture (HMM-GMM) systems, but are difficult to apply directly to hidden Markov-deep neural network (HMM-DNN) systems. Many research institutions have studied the adaptation of deep neural networks; among these methods, speaker adaptation based on i-vectors is particularly favored. However, current adaptation methods do not make full use of the small amount of adaptation data, the network structures are complicated, the computational complexity is high, and the stability is not good enough. The invention studies the i-vector adaptation method based on deep neural networks (DNN) and proposes an adaptation method for a DNN acoustic model based on speaker-identity (i-vector) features. Referring to Fig. 1, it comprises the following steps:
1) Extract the speaker-identity i-vector feature of the specific speaker: train a DNN model with the MFCC features of speaker-independent data; decompose the weights of the last hidden layer of this DNN model with singular value decomposition (SVD); retrain the DNN model with the decomposed features in place of the original MFCC features to obtain a DNN model for extracting low-dimensional features; extract the low-dimensional features of the speaker-independent data with this DNN model, then train and align these low-dimensional features with a universal background model (UBM) to obtain the speaker-identity (i-vector) feature, represented by one vector. When the identity feature of a specific speaker is to be extracted, the specific speaker's data replaces the speaker-independent data in the above operations, realizing i-vector feature extraction for the specific speaker.
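Below is a minimal NumPy sketch of the SVD step in step 1), assuming the last hidden-layer weights are available as a plain matrix; the rank k, the matrix sizes, and the variable names are illustrative assumptions, not values given in the patent.

```python
import numpy as np

def svd_decompose_last_layer(W: np.ndarray, k: int):
    """Decompose the last hidden-layer weight matrix W (out_dim x in_dim)
    into two low-rank factors W ~= A @ B, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # out_dim x k, absorbs the singular values
    B = Vt[:k, :]          # k x in_dim
    return A, B

# Illustrative usage: replace the original weight matrix by the two factors;
# the k-dimensional output of B then serves as the low-dimensional feature
# used for i-vector extraction, as described in the text above.
W = np.random.randn(1024, 1024)
A, B = svd_decompose_last_layer(W, k=100)
print(A.shape, B.shape)  # (1024, 100) (100, 1024)
```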
2) Build a GMM-HMM speech recognition system: model the traditional acoustic model, the Gaussian mixture model GMM. The specific implementation steps include:
2a) Extract 13-dimensional low-dimensional features from the training data of the corpus using the mel-frequency cepstral coefficient (MFCC) method, compute first-order and second-order differences for each dimension and splice them, finally obtaining 39-dimensional MFCC features (a sketch of this step follows after step 2e). This example uses data from an open-source corpus.
2b) Pre-process the 39-dimensional MFCC features with cepstral mean and variance normalization (CMVN) to obtain variance-normalized features.
2c) Extend the variance-normalized features to the left and right frame by frame to obtain features in a very high-dimensional space, reduce them to a low-dimensional subspace by linear discriminant analysis (LDA) to obtain low-dimensional features, and apply a maximum likelihood linear transform (MLLT) to obtain decorrelated features based on the maximum-likelihood criterion. In this example the variance-normalized features are extended to the left and right frame by frame because each speech frame is a short-time stationary signal of roughly 15 milliseconds; considering the correlation between the current frame and its neighboring frames, forward and backward extension is needed.
2d) Apply feature-space maximum likelihood linear regression (fMLLR) to the decorrelated features to obtain features represented by codebook mean vectors, called fMLLR features.
2e) Fit the probability distribution of the speech data with a linear combination of k diagonal-covariance Gaussian distribution functions to obtain the Gaussian mixture model GMM. Use the fMLLR features as the input features of the GMM (the Gaussian mixture model is also called the acoustic model). The invention trains the weight of each Gaussian component with the maximum mutual information criterion (MMI) and obtains the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR.
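A minimal sketch of steps 2a)-2b), assuming the librosa toolkit for MFCC computation (the patent does not name a toolkit); the file name and sample rate are illustrative.

```python
import numpy as np
import librosa

def mfcc_39_cmvn(wav_path: str, sr: int = 16000) -> np.ndarray:
    """13-dim MFCCs plus first- and second-order deltas (39 dims per frame),
    followed by per-utterance cepstral mean and variance normalization (CMVN)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T)
    d1 = librosa.feature.delta(mfcc, order=1)            # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)            # second-order difference
    feats = np.vstack([mfcc, d1, d2]).T                  # (T, 39)
    # CMVN: zero mean, unit variance per dimension
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

feats = mfcc_39_cmvn("utt0001.wav")   # illustrative file name
print(feats.shape)                    # (num_frames, 39)
```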
3) Construct a speaker-independent DNN-HMM speech recognition baseline system whose DNN acoustic model has multiple hidden layers: in the trained GMM-HMM recognition system, force-align the training data to obtain the true label corresponding to each speech frame, for supervised DNN acoustic model training; expand each dimension of the extracted fMLLR features with left and right context frames as the input of the DNN acoustic model; perform initialization training with the training-set data and cross-validation-set data of the corpus, and complete the modeling of the DNN acoustic model with discriminative training, obtaining a speaker-independent DNN-HMM speech recognition baseline system with a multi-hidden-layer DNN acoustic model.
4) Perform layer-by-layer identity-feature adaptation of the DNN acoustic model: in the speaker-independent DNN-HMM baseline system, carry out adaptive training of the DNN acoustic model with the speaker-identity i-vector features, which are discriminative for the specific speaker; specifically, adaptation data are added to each hidden layer of the DNN acoustic model in turn for training, the adaptation data being the extracted i-vector features of the specific speaker. In the adaptation stage, the adaptive weights and the ordinary weights are trained with the cross-entropy criterion, obtaining a DNN acoustic model with adaptation ability for the specific speaker.
The invention addresses the problems that adaptive training in the prior art is prone to over-fitting, that the i-vector characterization ability is poor, and that the robustness is low, and puts forward an overall scheme to solve these problems.
The idea of the invention is as follows. First, in the trained hidden Markov-Gaussian mixture model (HMM-GMM) recognition system, the training data of the corpus are force-aligned on this system to obtain the data labels for DNN acoustic model training. Meanwhile, the parameters of the DNN acoustic model are initialized with pre-training and fine-tuning, obtaining a neural network recognition baseline system containing 5 hidden layers, i.e., a hidden Markov-deep neural network (HMM-DNN) baseline system. Then, the i-vector of each speaker's utterance is extracted with the universal background model (UBM) and the total variability matrix T trained on the entire training set of the corpus. Traditional i-vector extraction often uses MFCCs as the input feature; different from the traditional extraction method, the invention trains and extracts the i-vector features using singular value decomposition (SVD). Finally, the SVD-based i-vector features and the traditionally extracted i-vector features are each added to multiple layers of the DNN acoustic model for training, and the DNN acoustic model with the highest stability and the best recognition performance is obtained by experimental comparison.
The invention can be used in fields related to speech recognition such as intelligent home robots, communication, in-vehicle systems, medical care, home services, and voice customer service.
Embodiment 2
The adaptation method for the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiment 1. The extraction of the speaker-identity i-vector feature described in step 1 of the invention includes the following steps:
1a) Use the 39-dimensional low-dimensional MFCC features, including first- and second-order features, extracted from the speech data of the test set of an open-source corpus to train a DNN model for speaker-independent feature extraction;
1b) Decompose the weight matrix of the last hidden layer of the speaker-independent feature-extraction DNN model trained in step 1a) with singular value decomposition (SVD), and replace the original weight matrix with the decomposition;
1c) Train the DNN model with the back-propagation (BP) algorithm and gradient descent, then extract the low-dimensional features of the speaker-independent data with the trained DNN model;
1d) Extract features from the speech data of all speakers in the training and test sets with the trained DNN model, then train and align with the universal background model (UBM), and extract the i-vector feature for each utterance of the specific speaker.
The invention applies singular value decomposition in the DNN model used for speaker-independent MFCC feature extraction, specifically by decomposing the weight matrix of the last hidden layer of that DNN model and then extracting the low-dimensional features of the speaker-independent data. This solves the over-fitting problem in training and at the same time improves the robustness of the recognition system and the characterization ability of the i-vector feature.
Embodiment 3
The adaptation method for the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiments 1-2. The extraction of the speaker-identity vector (i-vector) described in steps 1c) and 1d) is expressed as:
M = m + Tx + e
where M denotes the GMM mean supervector of the specific speaker, m denotes the mean supervector of the UBM, T denotes a total variability space, x denotes the extracted i-vector feature characterizing speaker identity, and e denotes a residual noise term.
In this example, the universal background model (UBM) is easily obtained from the training data of the corpus, and the total variability matrix T is obtained by the expectation-maximization (EM) algorithm. The speaker-identity i-vector features extracted in this way have good speaker discrimination and represent the differences between speakers; they have low dimensionality, require few parameters in adaptive training, and have the advantage of fast adaptation.
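The following is a minimal NumPy sketch of the standard i-vector point estimate implied by the model M = m + Tx + e, computed from zeroth- and first-order Baum-Welch statistics against the UBM; this is the common textbook formula and is not code taken from the patent, and all variable names are illustrative.

```python
import numpy as np

def extract_ivector(feats, ubm_means, ubm_vars, ubm_weights, T):
    """feats: (num_frames, D) low-dimensional features of one utterance.
    ubm_means/ubm_vars: (C, D) diagonal-covariance UBM parameters, ubm_weights: (C,).
    T: (C*D, R) total variability matrix. Returns the R-dim i-vector x."""
    C, D = ubm_means.shape
    # Frame posteriors under the diagonal-covariance UBM
    log_p = -0.5 * (((feats[:, None, :] - ubm_means) ** 2) / ubm_vars
                    + np.log(2 * np.pi * ubm_vars)).sum(-1) + np.log(ubm_weights)
    post = np.exp(log_p - log_p.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)                   # (num_frames, C)
    N = post.sum(0)                                      # zeroth-order statistics (C,)
    F = post.T @ feats - N[:, None] * ubm_means          # centered first-order statistics (C, D)
    Sigma_inv = (1.0 / ubm_vars).reshape(-1)             # (C*D,)
    NT = np.repeat(N, D)                                 # per-dimension occupancy (C*D,)
    # Posterior mean of x: (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F
    L = np.eye(T.shape[1]) + T.T @ (T * (NT * Sigma_inv)[:, None])
    return np.linalg.solve(L, T.T @ (Sigma_inv * F.reshape(-1)))
```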
The invention takes the ambient noise of speech recognition into account in the identity-feature extraction process, removes environmental factors that are harmful to speech recognition, and reduces the degree of speech distortion, so that the speech recognition system still obtains good recognition performance in noisy environments, improving the robustness, stability, and noise resistance of the recognition system.
Embodiment 4
The adaptation method for the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiments 1-3. The layer-by-layer identity-feature adaptation of the DNN acoustic model described in step 4) specifically includes the following steps:
4a) The speaker-identity i-vector features are trained and extracted using singular value decomposition (SVD): the weight matrix of the last hidden layer of the DNN model used for extracting the low-dimensional speaker-independent features is decomposed and replaces the original weight matrix, and the low-dimensional speaker-independent features are extracted;
4b) An i-vector feature vector is extracted for each utterance of the training and test sets with the universal background model (UBM) and the total variability matrix T, i.e., a speaker-identity i-vector feature of a certain dimension characterizing the identity information is extracted;
4c) The best-performing DNN acoustic model is obtained through updates of the adaptive weight V: the SVD-based i-vector features and the original MFCC speech features are each added to layers 1 to 5 of the DNN acoustic model, the adaptive weight V and the ordinary weight W are updated iteratively by minimizing the target loss function to optimize the model, and the best-performing DNN acoustic model is obtained by comparison;
4d) The parameters are updated with the error back-propagation (BP) algorithm, benefiting from the simple derivative of the sigmoid activation function; the gradient of the objective function with respect to the parameters can be expressed with the errors, so the adaptive parameter V and the ordinary weight parameter W are updated in the same way. At the start of training the learning rate α is a small positive number less than 1 and determines the speed of the parameter update; the initial learning rate is set somewhat larger, and when the system no longer improves or improves only slowly the learning rate is reduced, to avoid falling into a local optimum. The computational complexity is low, the update is iterative, the adaptation is fast, the recognition performance is greatly improved, and the system stability is high.
The invention adds the specific speaker's identity vector (i-vector) auxiliary information to the hidden layers of the speaker-independent DNN acoustic model in turn for adaptive training, removing speaker-specific information from the features while retaining semantic information. The adapted acoustic model trained in this way improves the recognition accuracy of the target speaker even when the adaptation data is limited, and also improves the recognition performance of the system.
The invention solves the problems that the existing low-dimensional MFCC feature has poor characterization ability and insufficient robustness, and that replacing it with the bottleneck feature, whose characterization ability and robustness are better than those of MFCC, brings a drop in frame classification accuracy.
Embodiment 5
The adaptation method for the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiments 1-4. The update of the adaptive weight V described in step 4c) is expressed as:
V_(t+1)^(i) = V_t^(i) - α ∂C/∂V^(i) = V_t^(i) - α δ_t^(i) x^T,  i = 1, ..., P
where C denotes the cross-entropy loss function, z^(i) denotes the pre-activation (un-activated amount) of the i-th layer, α denotes the learning rate, δ_t^(i) = ∂C/∂z^(i) denotes the error of the i-th layer at time t, x denotes the i-vector input of the adaptation layer, and P denotes the number of hidden layers in the DNN model.
The adaptive acoustic model for the specific speaker proposed by the invention updates its parameters with the error back-propagation (BP) algorithm, benefiting from the simple derivative of the sigmoid activation function; the gradient of the objective function with respect to the parameters is expressed with the errors, so the adaptive parameter V is updated in the same way as the ordinary weight parameter W. The computational complexity is low, the update is iterative, the adaptation is fast, the recognition performance is greatly improved, and the system stability is high.
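A minimal NumPy sketch of one gradient-descent step for the adaptation weight V under the assumptions above (a sigmoid hidden layer fed by both the previous layer output h_prev and the i-vector x); the function and variable names are illustrative and not from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adapt_layer_step(W, V, b, h_prev, x, delta_next, W_next, alpha=0.05):
    """One BP/gradient-descent update of the ordinary weight W and the
    adaptive weight V of a hidden layer z = W h_prev + V x + b, a = sigmoid(z).
    delta_next: error dC/dz of the layer above; W_next: its ordinary weight."""
    z = W @ h_prev + V @ x + b
    a = sigmoid(z)
    # Error of this layer: back-propagate through W_next and the sigmoid derivative
    delta = (W_next.T @ delta_next) * a * (1.0 - a)
    # Gradient descent: V and W are updated in exactly the same way
    W -= alpha * np.outer(delta, h_prev)
    V -= alpha * np.outer(delta, x)   # V update mirrors the expression above
    b -= alpha * delta
    return W, V, b, delta
```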
Speaker adaptation plays a key role in speech recognition: it uses a small amount of speaker data to transform the features or adjust the model parameters so as to improve the recognition accuracy of the specific speaker. In acoustic modeling the DNN has become the mainstream acoustic model, but over-fitting still occurs and the small amount of adaptation data cannot be fully used, so the recognition system is easily affected by the environment, its stability is low, and its recognition performance is poor. For these problems, the invention proposes an adaptation method for the DNN acoustic model based on speaker-identity features: the obtained i-vector features characterizing the differences between speakers are added to the hidden layers of the DNN acoustic model layer by layer, the model parameters are adjusted, and the network model is optimized, obtaining a DNN acoustic model with low complexity and the best recognition performance, so that the recognition performance of the system is clearly improved.
Embodiment 6, an example with more specific data, is given below to further describe the invention.
The adaptation method for the DNN acoustic model based on speaker-identity (i-vector) features is the same as in Embodiments 1-5.
The speaker adaptation method based on the deep neural network (DNN) proposed by the invention is shown in Fig. 1, which gives the network structure for adaptive learning of the DNN acoustic model; the implementation involves the training and extraction of i-vector features characterizing speaker identity and the introduction of 1 to 5 adaptation layers into the DNN acoustic model for training. Modeling research is carried out on the small data set TIMIT, specifically including the following steps:
1) Train the i-vector feature extractor and extract the features using the training data of the TIMIT corpus. The specific steps include:
1a) Use all speech data of the 462 speakers of the training set to train a DNN model for speaker-independent feature extraction, used to extract the low-dimensional speaker-independent features;
1b) Decompose the weight matrix of the last hidden layer of the DNN model obtained in step 1a) with singular value decomposition (SVD), and retrain the DNN model with the decomposition in place of the original weights;
1c) Extract the low-dimensional features of the specific speaker with the DNN model trained in step 1b);
1d) With these low-dimensional features, extract a 100-dimensional i-vector feature for each utterance of every speaker through the universal background model (UBM) and the total variability matrix T. The UBM training is a parameter-estimation process: with a large amount of background speaker data, a speaker-independent and channel-independent Gaussian mixture model is estimated by expectation maximization under the maximum-likelihood criterion; the dimensionality of its distribution functions is consistent with the dimensionality of the acoustic features.
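A minimal sketch of the UBM training described in 1d), assuming scikit-learn's diagonal-covariance GaussianMixture as a stand-in for the EM estimation (the patent does not prescribe a toolkit); the number of components is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats: np.ndarray, n_components: int = 512) -> GaussianMixture:
    """Estimate a speaker- and channel-independent diagonal-covariance GMM (the UBM)
    by EM over pooled background-speaker features of shape (num_frames, feat_dim)."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=50)
    ubm.fit(background_feats)
    return ubm

# Illustrative usage: pool the low-dimensional features of all training speakers,
# then reuse ubm.means_, ubm.covariances_, ubm.weights_ for i-vector extraction.
feats = np.random.randn(100000, 40)
ubm = train_ubm(feats)
```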
2) Build the GMM-HMM speech recognition system. The specific steps include:
2a) Perform MFCC feature extraction on the TIMIT training-set data, here using 13-dimensional mel-frequency cepstral coefficients (MFCC) with their first- and second-order differences, 39-dimensional MFCC features in total;
2b) For all speech data of each speaker in the training set, pre-process the features from 2a) with cepstral mean and variance normalization (CMVN) and train a triphone HMM-GMM system;
2c) Extend the features processed in 2b) by 4 frames on the left and right, reduce the very high-dimensional spliced features to a low-dimensional space by linear discriminant analysis (LDA) to obtain 40-dimensional low-dimensional features (the role of LDA is to minimize the within-class speaker differences and maximize the between-class speaker differences), and apply a maximum likelihood linear transform (MLLT) to obtain decorrelated features that better match the acoustic model (a sketch of the splicing and LDA step follows after step 2e);
2d) Finally, normalize the decorrelated features from step 2c) with feature-space maximum likelihood linear regression (fMLLR), obtaining features represented by codebook mean vectors, called fMLLR features; these features match the Gaussian mixture model GMM well;
2e) Obtain the Gaussian mixture model GMM as a weighted sum of 2154 Gaussian probability distribution functions, each Gaussian using a diagonal covariance to reduce the complexity of the GMM parameter update; the GMM fits the probability distribution of the speech data of the TIMIT training set. Use the fMLLR features as the input features of the GMM (the Gaussian mixture model is also called the acoustic model), train the weight of each Gaussian component with the maximum mutual information criterion (MMI), and obtain the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR.
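A minimal sketch of the frame splicing and LDA dimensionality reduction of step 2c), assuming scikit-learn's LinearDiscriminantAnalysis with frame-level state labels from the forced alignment (the patent does not prescribe a toolkit); the dimensions follow the text (4 frames of context on each side, 40 output dimensions) and the labels here are random placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def splice(feats: np.ndarray, context: int = 4) -> np.ndarray:
    """Stack each frame with `context` left and right neighbours (edge frames repeated),
    turning (T, D) features into (T, (2*context+1)*D) spliced features."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

def lda_reduce(spliced: np.ndarray, state_labels: np.ndarray, dim: int = 40) -> np.ndarray:
    """Project spliced features to `dim` dimensions, minimizing within-class scatter
    and maximizing between-class scatter, as described for LDA above."""
    lda = LinearDiscriminantAnalysis(n_components=dim)
    return lda.fit_transform(spliced, state_labels)

feats = np.random.randn(500, 39)               # 39-dim CMVN MFCCs of one utterance
labels = np.random.randint(0, 48, size=500)    # frame-level labels (small illustrative set)
reduced = lda_reduce(splice(feats), labels)
print(reduced.shape)                           # (500, 40)
```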
3) Build the DNN-HMM baseline system. The specific steps include:
3a) With the HMM-GMM recognition system trained in 2), force-align the training-set data of the TIMIT corpus on this system to obtain the true label corresponding to each speech frame (the context-dependent HMM state label), i.e., the baseline DNN training data labels, used for supervised DNN acoustic model training;
3b) Expand the 40-dimensional fMLLR features generated in 2d) by 5 frames on the left and right; the DNN acoustic model input layer has 440 nodes, there are 5 hidden layers, each hidden layer contains 1024 nodes, and the activation function is the sigmoid function. The softmax layer has 2492 nodes, the number of context-dependent HMM states. The baseline DNN is trained with pre-training and fine-tuning. When learning the transform weights W and the bias vectors b, the DNN is trained with the error back-propagation (BP) algorithm, with the cross entropy between the predicted probability distribution and the actual probability distribution as the objective function; the mini-batch size of stochastic gradient descent is set to 256, the momentum factor is 0.9, and the initial learning rate is 0.05 and is halved when the development-set frame classification accuracy decreases, until convergence or until the learning rate falls below a set threshold. The final DNN acoustic model structure is 440-1024-1024-1024-1024-1024-2492; the training set and the cross-validation set account for 95% and 5% of the training data respectively. The baseline system word error rate (WER) with fMLLR features is 16.3%.
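A minimal PyTorch sketch of the baseline DNN acoustic model and optimizer settings described in 3b) (440-1024x5-2492, sigmoid hidden layers, SGD with momentum 0.9 and initial learning rate 0.05, cross-entropy loss); this is an illustrative reconstruction rather than code published with the patent, and the pre-training/fine-tuning stages and data loading are omitted.

```python
import torch
import torch.nn as nn

class BaselineDNN(nn.Module):
    """440 spliced fMLLR inputs -> 5 sigmoid hidden layers of 1024 units -> 2492 senones."""
    def __init__(self, in_dim=440, hidden=1024, n_layers=5, n_states=2492):
        super().__init__()
        layers, prev = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(prev, hidden), nn.Sigmoid()]
            prev = hidden
        layers.append(nn.Linear(prev, n_states))   # softmax is folded into the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = BaselineDNN()
criterion = nn.CrossEntropyLoss()                  # cross entropy vs. aligned state labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

# One illustrative mini-batch step (mini-batch size 256 as in the text)
feats = torch.randn(256, 440)
labels = torch.randint(0, 2492, (256,))
loss = criterion(model(feats), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# The learning rate is halved when the dev-set frame accuracy stops improving,
# e.g. via torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5).
```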
4) Layer-by-layer identity-feature adaptation of the DNN acoustic model. The specific steps include:
4a) Add the 100-dimensional i-vectors characterizing the speaker obtained in 1) to layers 1 to 5 of the DNN acoustic model respectively;
4b) The input layer of the speaker-adapted DNN acoustic model has 440 nodes, the hidden-layer input dimensionality becomes 1124 where the i-vector is appended, and the activation function is the sigmoid function. The output layer has 2492 nodes. The system is likewise trained with pre-training and fine-tuning. The adaptive weight V is trained in the same way as the ordinary weight W, with the error back-propagation (BP) algorithm, with the cross entropy between the predicted probability distribution and the actual probability distribution as the objective function; the mini-batch size of stochastic gradient descent is set to 256, the momentum factor is 0.9, and the initial learning rate is 0.05 and is halved when the development-set frame classification accuracy decreases, until convergence or until the learning rate falls below a set threshold. The final DNN acoustic model structure is 440-1124-1024/1124-1024/1124-1024/1124-1024/1124-2492; the training set and the cross-validation set account for 95% and 5% of the training data respectively.
The invention learns the weights with the cross entropy as the objective function, combined with the error back-propagation algorithm and gradient descent. In the training process the ordinary weight W and the adaptive weight V are trained in the same way, the computation is simple and the complexity is low, realizing fast speaker adaptation. In the adaptation stage the extracted i-vector feature is not simply concatenated with the DNN input feature; instead, the i-vector feature is added to the DNN model layer by layer and each layer is adapted, which improves the recognition accuracy of the system; a sketch of such an adaptation layer is given below.
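A minimal PyTorch sketch of a hidden layer with an additional adaptive weight V for the 100-dimensional i-vector, matching the 1024+100=1124-dimensional hidden-layer input described in 4b); module and variable names are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class AdaptedHiddenLayer(nn.Module):
    """Hidden layer a = sigmoid(W h_prev + V i_vec + b): W is the ordinary weight
    inherited from the baseline, V is the speaker-adaptation weight for the i-vector."""
    def __init__(self, prev_dim=1024, hidden=1024, ivec_dim=100):
        super().__init__()
        self.W = nn.Linear(prev_dim, hidden)               # ordinary weight W and bias b
        self.V = nn.Linear(ivec_dim, hidden, bias=False)   # adaptive weight V

    def forward(self, h_prev, i_vec):
        return torch.sigmoid(self.W(h_prev) + self.V(i_vec))

# Illustrative forward pass: both W and V receive gradients from the same
# cross-entropy loss, so they are updated by the same BP/SGD rule.
layer = AdaptedHiddenLayer()
h_prev = torch.randn(256, 1024)   # output of the previous layer
i_vec = torch.randn(256, 100)     # 100-dim speaker i-vector, repeated per frame
h = layer(h_prev, i_vec)
print(h.shape)                    # torch.Size([256, 1024])
```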
5) Test the adapted DNN acoustic model trained in step 4) with the TIMIT test set, which contains data from 24 specific speakers; the evaluation index is the word error rate (WER) of automatic speech recognition. The SVD-based i-vector extraction and the original i-vector extraction method are tested separately. The test results are shown in Fig. 4, the test results for different numbers of adaptation layers on the TIMIT corpus: the horizontal axis is the number of adaptation layers added to the DNN acoustic model and the vertical axis is the word error rate. There are two curves; the upper curve is the experimental result of the traditional adaptation method, and the lower curve is the experimental result of the proposed adaptation method based on the specific speaker's i-vector feature. Comparing the two curves, the figure shows that the recognition performance of the invention is best when two adaptation layers are introduced into the DNN acoustic model, with the recognition error rate reduced by 12.27%.
This example is tested on the open-source TIMIT corpus. The experimental results show that the proposed adaptive training of the DNN acoustic model achieves a large reduction of the recognition error rate compared with the traditional adaptation method; when the first two layers of the DNN acoustic model are adapted, the resulting DNN acoustic model with adaptation ability is optimal, the recognition accuracy of the system is highest, and the anti-interference ability is best.
Embodiment 7
The speaker adaptation method based on the deep neural network (DNN) is the same as in Embodiments 1-6. Referring to Fig. 1, the network structure for adaptive learning of the DNN acoustic model is given; the implementation involves the training and extraction of i-vector features characterizing speaker identity and the introduction of 1 to 5 adaptation layers into the DNN acoustic model for training. Modeling research is carried out on the large-scale data set Switchboard, specifically including the following steps:
1) Train the i-vector feature extractor and extract the features using the training data of the Switchboard corpus. The specific steps include:
1a) Extract the speaker-independent MFCC features from the 309-hour training set containing 1540 different speakers, and train a DNN model for speaker-independent feature extraction;
1b) Decompose the weight matrix of the last hidden layer of the DNN model obtained in step 1a) with singular value decomposition (SVD), and train with the decomposition in place of the original weight matrix;
1c) Extract the low-dimensional features with the DNN model trained in step 1b);
1d) With these low-dimensional features, extract a 100-dimensional i-vector feature for each utterance of every speaker through the universal background model (UBM) and the total variability matrix T.
2) Build the GMM-HMM speech recognition system and model the Gaussian mixture model GMM. The specific steps include:
2a) Perform MFCC feature extraction on the speech data of the Switchboard training set, here using 13-dimensional mel-frequency cepstral coefficients (MFCC) with their first- and second-order differences, 39-dimensional MFCC features in total;
2b) For all speech data of each speaker in the training set, pre-process the features from 2a) with cepstral mean and variance normalization (CMVN), obtaining features with better noise robustness, used to train a triphone HMM-GMM system;
2c) Extend the features processed in 2b) by 4 frames on the left and right, reduce the high-dimensional features to a low-dimensional subspace by linear discriminant analysis (LDA) to obtain 40-dimensional low-dimensional features, and apply a maximum likelihood linear transform (MLLT) to obtain features that better match the acoustic model;
2d) Finally, normalize the features from 2c) that better match the acoustic model with feature-space maximum likelihood linear regression (fMLLR); these features, represented by codebook mean vectors, are called fMLLR features;
2e) Fit the probability distribution of the speech data of the Switchboard training set with a Gaussian mixture model obtained as the linear combination of 4712 diagonal-covariance Gaussian distribution functions. Use the fMLLR features as the input features of the GMM (the Gaussian mixture model is also called the acoustic model), train the weight of each Gaussian component with the maximum mutual information criterion (MMI), and obtain the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR.
3) Build the DNN-HMM baseline system. The specific steps include:
3a) With the HMM-GMM recognition system trained in 2), force-align the training data on this system to obtain the true label corresponding to each speech frame (the context-dependent HMM state label), i.e., the baseline DNN training data labels;
3b) Expand the 40-dimensional fMLLR features generated in 2d) by 5 frames on the left and right; the DNN input layer has 440 nodes, there are 5 hidden layers, each hidden layer contains 2048 nodes, and the activation function is the sigmoid function. The softmax layer has 8991 nodes, the number of context-dependent HMM states. The baseline DNN is trained with pre-training and fine-tuning. When learning the transform weights W and the bias vectors b, the DNN is trained with the error back-propagation (BP) algorithm, with the cross entropy between the predicted probability distribution and the actual probability distribution as the objective function; the mini-batch size of stochastic gradient descent is set to 1024, the momentum factor is 0.9, and the initial learning rate is 0.05 and is halved when the development-set frame classification accuracy decreases, until convergence or until the learning rate falls below a set threshold. The final DNN acoustic model structure is 440-2048-2048-2048-2048-2048-8991; the training set and the cross-validation set account for 95% and 5% of the training data respectively. The baseline system word error rate (WER) with fMLLR features is 10.83%.
4) Layer-by-layer identity-feature adaptation of the DNN acoustic model. The specific steps include:
4a) Add the 100-dimensional i-vector features characterizing the specific speaker's identity obtained in 1) to hidden layers 1 to 5 of the DNN acoustic model respectively;
4b) The input layer of the speaker-adapted DNN has 440 nodes, the hidden-layer input dimensionality becomes 2148 where the i-vector is appended, and the activation function is the sigmoid function. The output layer has 8991 nodes. The system is likewise trained with pre-training and fine-tuning. The adaptive weight V is trained in the same way as the ordinary weight W, with the error back-propagation (BP) algorithm, with the cross entropy between the predicted probability distribution and the actual probability distribution as the objective function; the mini-batch size of stochastic gradient descent is set to 1024, the momentum factor is 0.9, and the initial learning rate is 0.05 and is halved when the development-set frame classification accuracy decreases, until convergence or until the learning rate falls below a set threshold. The final DNN acoustic model structure is 440-2148-2048/2148-2048/2148-2048/2148-2048/2148-8991; the training set and the cross-validation set account for 95% and 5% of the training data respectively.
5) Test the trained adapted models from 4) with the NIST2000 data set containing 1831 test utterances; the evaluation index is the word error rate (WER) of automatic speech recognition. The SVD-based i-vector extraction method and the traditional i-vector extraction method are tested separately. The test results are shown in Fig. 5, the test results for different numbers of adaptation layers on the Switchboard corpus. In Fig. 5 the horizontal axis is the number of adaptation layers added to the DNN acoustic model and the vertical axis is the word error rate; there are two test curves, the upper curve being the experimental result of the traditional adaptation method and the lower curve being the experimental result of the proposed adaptation method based on the specific speaker's i-vector feature. Comparing the two curves, the figure shows that the recognition performance of the invention is best when two adaptation layers are introduced into the DNN acoustic model, with the recognition error rate reduced by 11.3%.
This example is tested on the open-source Switchboard corpus. The experimental results show that the proposed adaptive training of the DNN acoustic model achieves a large reduction of the recognition error rate compared with the traditional adaptation method; when the first two layers of the DNN acoustic model are adapted, the resulting DNN acoustic model with adaptation ability is optimal, the recognition accuracy of the system is highest, and the stability is best.
In the invention, the low-dimensional features extracted with SVD are used as the input features for i-vector extraction, so that the extracted i-vector has stronger characterization ability and its robustness is unaffected in noisy environments. Adaptive training at each hidden layer of the DNN acoustic model can make full use of the small amount of adaptation data; the target speaker's i-vector information is mapped layer by layer into the features of the DNN acoustic model, reducing the influence of differences between speakers and further improving system performance.
The DNN acoustic model adaptation method based on speaker-identity features disclosed by the invention solves the technical problems that over-fitting easily occurs in adaptive training, that the i-vector characterization ability is poor, and that the system robustness is low. The implementation comprises: extracting speaker-identity features, using MFCC features as the input of the speaker-independent DNN model; building a GMM-HMM speech recognition system; building a DNN-HMM baseline system whose DNN acoustic model has multiple hidden layers; and carrying out layer-by-layer identity-feature adaptive training of the DNN acoustic model to obtain a DNN acoustic model with adaptation ability for the target speaker. In the identity-feature extraction, the weight matrix of the last hidden layer of the feature-extraction DNN model is decomposed with SVD and the decomposition replaces the original features; the i-vector features are added layer by layer to the hidden layers of the DNN acoustic model for adaptive training. A small amount of speaker data is used to transform the features or adjust the model parameters to improve the recognition accuracy of the target speaker. The complexity of the invention is low and the recognition performance is clearly improved. It can be used in intelligent robots related to speech recognition, communication, in-vehicle speech systems, medical care, voice customer service, and the like.

Claims (6)

1. An adaptation method for a DNN acoustic model based on speaker-identity features, characterized by comprising the following steps:
1) Extract the speaker-identity feature of the specific speaker; train a DNN model with the MFCC features of speaker-independent data; decompose the weights of the last hidden layer of this DNN model with singular value decomposition; retrain the DNN model with the decomposed features in place of the original MFCC features to obtain a DNN model for extracting low-dimensional features; extract the low-dimensional features of the speaker-independent data with this DNN model, then train and align these low-dimensional features with a universal background model to obtain the speaker-identity feature, represented by one vector; when the identity feature of a specific speaker is to be extracted, the specific speaker's data replaces the speaker-independent data in the above operations, realizing the identity-feature extraction for the specific speaker;
2) Build a GMM-HMM speech recognition system; model the traditional acoustic model, the Gaussian mixture model GMM; the specific implementation steps include:
2a) Extract 13-dimensional low-dimensional features from the training data of the corpus using the mel-frequency cepstral coefficient method, and compute first-order and second-order differences for each dimension to obtain 39-dimensional MFCC features;
2b) Pre-process the 39-dimensional MFCC features with cepstral mean and variance normalization to obtain variance-normalized features;
2c) Extend the variance-normalized features to the left and right frame by frame to obtain features in a very high-dimensional space, reduce them to a low-dimensional subspace by a linear discriminant analysis transform to obtain low-dimensional features, and apply a maximum likelihood linear transform to obtain decorrelated features based on the maximum-likelihood criterion;
2d) Apply feature-space maximum likelihood linear regression to the decorrelated features to obtain features represented by codebook mean vectors, called fMLLR features;
2e) Fit the probability distribution of the speech data with a linear combination of k diagonal-covariance Gaussian distribution functions to obtain the Gaussian mixture model GMM; use the fMLLR features as the input features of the GMM, train the weight of each Gaussian component with the maximum mutual information criterion, and obtain the HMM-GMM speech recognition system processed by LDA+MLLT+fMLLR;
3) Construct a speaker-independent DNN-HMM speech recognition baseline system whose DNN acoustic model has multiple hidden layers: in the trained GMM-HMM recognition system, force-align the training data to obtain the true label corresponding to each speech frame, for supervised DNN acoustic model training; expand each dimension of the extracted fMLLR features with left and right context frames as the input of the DNN acoustic model; perform initialization training with the training-set data and cross-validation-set data of the corpus, and complete the modeling of the DNN acoustic model with discriminative training, obtaining a speaker-independent DNN-HMM speech recognition baseline system with a multi-hidden-layer DNN acoustic model;
4) Perform layer-by-layer identity-feature adaptation of the DNN acoustic model; in the speaker-independent DNN-HMM baseline system, carry out adaptive training of the DNN acoustic model with the speaker-identity features, which are discriminative for the specific speaker; specifically, adaptation data are added to each hidden layer of the DNN acoustic model in turn for training, the adaptation data being the extracted identity features of the specific speaker; in the adaptation stage, the adaptive weights and the ordinary weights are trained with the cross-entropy criterion, obtaining a DNN acoustic model with adaptation ability for the specific speaker.
2. The adaptation method for the DNN acoustic model based on speaker-identity features according to claim 1, characterized in that the extraction of the speaker-identity feature of the specific speaker described in step 1 includes the following steps:
1a) Use the 39-dimensional low-dimensional MFCC features, including first- and second-order features, extracted from the speech data of the test set of the corpus to train a DNN model for speaker-independent low-dimensional feature extraction;
1b) Decompose the weight matrix of the last hidden layer of the trained speaker-independent feature-extraction DNN model with singular value decomposition (SVD), and replace the original weight matrix with the decomposition;
1c) Train the DNN model with the back-propagation algorithm and gradient descent, then extract the low-dimensional features of the speaker-independent data with the trained DNN model;
1d) Extract low-dimensional features from the speech data of all speakers in the training and test sets with the trained DNN model, then train and align with the universal background model, and extract the i-vector feature for each utterance of the specific speaker.
3. the adaptive approach of the DNN acoustic model according to claim 1 or claim 2 based on personal identification feature, feature exist In the extraction of characterization speaker dependent's identity vector i-vector described in (1c) and (1d), expression formula in step are as follows:
M=m+Tx+e
wherein M denotes the GMM mean supervector of the specific speaker, m denotes the mean supervector of the UBM, T denotes the total variability space, x denotes the extracted i-vector feature characterizing the speaker identity, and e denotes a residual noise term.
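For reference, the i-vector literature typically estimates x as the posterior mean of the latent variable given the utterance statistics; this estimator is not recited in the claims, but a common form is:

$$
\hat{x}(u) \;=\; \bigl(I + T^{\top}\Sigma^{-1}N(u)\,T\bigr)^{-1}\,T^{\top}\Sigma^{-1}\tilde{F}(u)
$$

where $N(u)$ collects the zeroth-order Baum-Welch statistics of utterance $u$ over the UBM components, $\tilde{F}(u)$ is the supervector of centred first-order statistics, and $\Sigma$ is the block-diagonal covariance supermatrix of the UBM.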
4. The adaptation method of the DNN acoustic model based on personal identification features according to claim 1, characterized in that building the DNN-HMM baseline system in step 3) comprises the following steps:
3a) performing forced alignment of the training data in the trained GMM-HMM recognition system to obtain the true label of each speech frame, i.e. the context-dependent HMM state label, for supervised training of the DNN acoustic model;
3b) splicing each dimension of the extracted 40-dimensional fMLLR features with left and right context frames as the input of the DNN acoustic model, wherein the number of hidden-layer nodes depends on the size of the training set, the activation function is the sigmoid function, the output layer uses the softmax function for classification, and the number of output-layer nodes equals the number of context-dependent HMM states;
3c) training the DNN acoustic model: using the training set data and cross-validation set data of the same corpus, initializing the DNN acoustic model once by pre-training and fine-tuning to complete the modeling of the discriminatively trained DNN acoustic model, the training set and the cross-validation set accounting for 95% and 5% of the training data respectively;
3d) after training is completed, obtaining a speaker-independent DNN-HMM baseline speech recognition system with a multi-hidden-layer DNN acoustic model, the system comprising an acoustic model, a language model and a decoder, and being used for speaker-independent speech recognition.
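A minimal sketch of steps 3b) and 3c) follows: 40-dimensional fMLLR frames are spliced with left/right context and fed to a sigmoid DNN with a softmax output over context-dependent HMM states. The context width, layer sizes and state count are illustrative assumptions, and the pre-training/fine-tuning schedule of step 3c) is not shown.

```python
# Minimal sketch: frame splicing plus a sigmoid DNN trained with cross-entropy
# against forced-alignment state labels (random placeholders here).
import numpy as np
import torch
import torch.nn as nn

def splice(frames: np.ndarray, context: int = 5) -> np.ndarray:
    """Stack each frame with `context` frames on each side (edge-padded)."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

def build_dnn(in_dim: int, hidden: int, n_layers: int, n_states: int) -> nn.Sequential:
    layers, prev = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(prev, hidden), nn.Sigmoid()]
        prev = hidden
    layers += [nn.Linear(prev, n_states)]   # softmax is applied inside the loss
    return nn.Sequential(*layers)

if __name__ == "__main__":
    feats = np.random.randn(100, 40).astype(np.float32)   # 40-dim fMLLR frames
    x = torch.from_numpy(splice(feats, context=5))        # (100, 440)
    dnn = build_dnn(in_dim=440, hidden=1024, n_layers=5, n_states=3000)
    logits = dnn(x)
    loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3000, (100,)))
    print(logits.shape, float(loss))
```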
5. The adaptation method of the DNN acoustic model based on personal identification features according to claim 1, characterized in that the layer-wise adaptation of the DNN acoustic model with personal identification features in step 4) comprises the following steps: 4a) training and extracting the identity i-vector features: applying singular value decomposition (SVD) to decompose the weight matrix of the last hidden layer of the DNN model used for extracting speaker-independent low-dimensional features, replacing the original weight matrix, and extracting the speaker-independent low-dimensional features;
4b) extracting an i-vector feature vector for each sentence of the training set and the test set with the universal background model (UBM) and the total variability matrix T, i.e. extracting a personal identification i-vector feature of a given dimension that characterizes the speaker's identity information;
4c) obtaining the best-performing DNN acoustic model through the update of the adaptive weight V: adding the SVD-based i-vector features and the original MFCC speech features separately to hidden layers 1 to 5 of the DNN acoustic model, iteratively updating the adaptive weight V and the ordinary weight W by minimizing the target loss function to optimize the model, and selecting the best-performing DNN acoustic model by comparison.
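The sketch below illustrates one plausible wiring for steps 4b)-4c): a hidden layer whose pre-activation combines the ordinary weight W applied to the acoustic input with an adaptive weight V applied to the speaker i-vector. The exact way V enters each hidden layer is an assumption; only the joint cross-entropy training of V and W follows the claims.

```python
# Minimal sketch: one i-vector-adapted hidden layer, a = sigmoid(W h + V i + b).
import torch
import torch.nn as nn

class IVectorAdaptedLayer(nn.Module):
    def __init__(self, in_dim: int, ivec_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)                 # ordinary weight W (with bias)
        self.V = nn.Linear(ivec_dim, out_dim, bias=False)   # adaptive weight V
        self.act = nn.Sigmoid()

    def forward(self, h: torch.Tensor, ivec: torch.Tensor) -> torch.Tensor:
        return self.act(self.W(h) + self.V(ivec))

if __name__ == "__main__":
    layer = IVectorAdaptedLayer(in_dim=440, ivec_dim=100, out_dim=1024)
    h = torch.randn(32, 440)      # spliced acoustic features
    ivec = torch.randn(32, 100)   # per-speaker i-vector, repeated for each frame
    out = layer(h, ivec)
    print(out.shape)              # torch.Size([32, 1024])
```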
6. The adaptation method of the DNN acoustic model based on personal identification features according to claim 1, characterized in that the update of the adaptive weight V in step 4c) is expressed as follows:
wherein C denotes the cross-entropy loss function, $z^{i}$ denotes the pre-activation (un-activated quantity) of the i-th layer, α denotes the learning rate, $e^{i}_{t}$ denotes the error of the i-th layer at time t, and P denotes the number of hidden layers in the DNN acoustic model.
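The published text omits the update formula itself; a standard stochastic-gradient reconstruction consistent with the symbols listed above, with $x_{\mathrm{iv}}$ denoting the i-vector input to the i-th layer (an assumed symbol), would be:

$$
V^{i}_{t+1} \;=\; V^{i}_{t} \;-\; \alpha\,\frac{\partial C}{\partial V^{i}_{t}},
\qquad
\frac{\partial C}{\partial V^{i}_{t}} \;=\; e^{i}_{t}\,x_{\mathrm{iv}}^{\top},
\qquad
e^{i}_{t} \;=\; \frac{\partial C}{\partial z^{i}},
\qquad i = 1,\dots,P .
$$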
CN201910016412.7A 2019-01-08 2019-01-08 The adaptive approach of DNN acoustic model based on personal identification feature Pending CN109637526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910016412.7A CN109637526A (en) 2019-01-08 2019-01-08 The adaptive approach of DNN acoustic model based on personal identification feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910016412.7A CN109637526A (en) 2019-01-08 2019-01-08 The adaptive approach of DNN acoustic model based on personal identification feature

Publications (1)

Publication Number Publication Date
CN109637526A true CN109637526A (en) 2019-04-16

Family

ID=66060156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910016412.7A Pending CN109637526A (en) 2019-01-08 2019-01-08 The adaptive approach of DNN acoustic model based on personal identification feature

Country Status (1)

Country Link
CN (1) CN109637526A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349573A (en) * 2019-07-04 2019-10-18 广州云从信息科技有限公司 A kind of audio recognition method, device, machine readable media and equipment
CN110648654A (en) * 2019-10-09 2020-01-03 国家电网有限公司客户服务中心 Speech recognition enhancement method and device introducing language vectors
CN111243579A (en) * 2020-01-19 2020-06-05 清华大学 Time domain single-channel multi-speaker voice recognition method and system
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN112599121A (en) * 2020-12-03 2021-04-02 天津大学 Speaker self-adaption method based on auxiliary data regularization
CN112908305A (en) * 2021-01-30 2021-06-04 云知声智能科技股份有限公司 Method and equipment for improving accuracy of voice recognition
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105745700A (en) * 2013-11-27 2016-07-06 国立研究开发法人情报通信研究机构 Statistical-acoustic-model adaptation method, acoustic-model learning method suitable for statistical-acoustic-model adaptation, storage medium in which parameters for building deep neural network are stored, and computer program for adapting statistical acoustic model
CN105895104A (en) * 2014-05-04 2016-08-24 讯飞智元信息科技有限公司 Adaptive speaker identification method and system
CN107146601A (en) * 2017-04-07 2017-09-08 南京邮电大学 A kind of rear end i vector Enhancement Methods for Speaker Recognition System
CN109119072A (en) * 2018-09-28 2019-01-01 中国民航大学 Civil aviaton's land sky call acoustic model construction method based on DNN-HMM

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIANG, YULONG et al.: "Research on speaker-aware training methods based on improved i-vector", Computer Engineering *
JIN, CHAO et al.: "Research on speaker adaptation of neural network acoustic models in speech recognition", Computer Applications and Software *

Similar Documents

Publication Publication Date Title
CN109637526A (en) The adaptive approach of DNN acoustic model based on personal identification feature
CN108806667A (en) The method for synchronously recognizing of voice and mood based on neural network
CN110289003A (en) A kind of method of Application on Voiceprint Recognition, the method for model training and server
CN109767759A (en) End-to-end speech recognition methods based on modified CLDNN structure
CN107680582A (en) Acoustic training model method, audio recognition method, device, equipment and medium
CN105139864B (en) Audio recognition method and device
CN108564940A (en) Audio recognition method, server and computer readable storage medium
CN108281137A (en) A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN109119072A (en) Civil aviaton's land sky call acoustic model construction method based on DNN-HMM
CN108172218A (en) A kind of pronunciation modeling method and device
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN108109613A (en) For the audio training of Intelligent dialogue voice platform and recognition methods and electronic equipment
CN110321418A (en) A kind of field based on deep learning, intention assessment and slot fill method
CN105575394A (en) Voiceprint identification method based on global change space and deep learning hybrid modeling
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN110459225A (en) A kind of speaker identification system based on CNN fusion feature
CN109637545A (en) Based on one-dimensional convolution asymmetric double to the method for recognizing sound-groove of long memory network in short-term
CN108986798B (en) Processing method, device and the equipment of voice data
CN109313892A (en) Steady language identification method and system
CN108962247B (en) Multi-dimensional voice information recognition system and method based on progressive neural network
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN108922515A (en) Speech model training method, audio recognition method, device, equipment and medium
CN110223714A (en) A kind of voice-based Emotion identification method
CN109192199A (en) A kind of data processing method of combination bottleneck characteristic acoustic model
CN108694949A (en) Method for distinguishing speek person and its device based on reorder super vector and residual error network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190416
