CN110211574A - Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism - Google Patents

Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism Download PDF

Info

Publication number
CN110211574A
Authority
CN
China
Prior art keywords
network
layer
output
attention
rbm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910477034.2A
Other languages
Chinese (zh)
Other versions
CN110211574B (en)
Inventor
韩纪庆
唐海桃
郑铁然
郑贵滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201910477034.2A priority Critical patent/CN110211574B/en
Publication of CN110211574A publication Critical patent/CN110211574A/en
Application granted granted Critical
Publication of CN110211574B publication Critical patent/CN110211574B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks

Abstract

A speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism, belonging to the field of modeling methods. Traditional attention models suffer from poor recognition performance and a single attention scale. In the proposed method, a deep belief network is used at the front end to extract bottleneck features, which increases the robustness of the model; the back end then uses a multi-scale multi-head attention model built from convolution kernels of different scales to model speech units at the phoneme, syllable, and word levels separately, computing the recurrent-neural-network hidden-state sequence and output sequence step by step. The decoding network attached to each attention head computes the output element at each position, and finally all output sequences are integrated into a single new output sequence. The invention improves the recognition accuracy of speech recognition systems.

Description

Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism
Technical field
The present invention relates to a training method in the technical field of speech recognition, and in particular to an improved method that increases model robustness by extracting bottleneck features and builds a multi-scale multi-head model to model speech units at the phoneme, syllable, and word levels, thereby improving recognition performance.
Background technique
Speech is one of the most universal and common signals in human society and an important channel through which people express, communicate, and spread information. In the current era of information explosion, massive volumes of speech data are generated continuously on the internet and over telephone channels. To identify, classify, and retrieve speech signals at scale, the demand for automatic speech recognition (Automatic Speech Recognition, ASR) has become increasingly urgent. Compared with speech recognition systems based on the traditional hidden Markov model (Hidden Markov Model, HMM), end-to-end systems fuse the acoustic model, pronunciation dictionary, and language model into a single neural network, which makes the model more concise and allows its parameters to be optimized directly. End-to-end models are also more general: they reduce the dependence on expert linguistic knowledge and lower the difficulty of building a recognition system. Their training data requires no alignment information, so annotated speech can be used for training directly. The attention model drops the output independence assumption and quantifies the alignment between phonemes and features through attention coefficients, which gives it a degree of interpretability. This patent develops research on end-to-end speech recognition around the attention model, focusing on the long training cycles and insufficient recognition performance of current attention-based systems. Because the attention model applies a large number of recurrent-neural-network units internally, its parameters are large in scale and its internal structure is complex, which leads to low training efficiency; although this can be partly mitigated by adding hardware computing resources, it remains a hard problem at the algorithmic level. Because the attention model completely abandons phonetic prior knowledge, it lacks effective initialization parameters and its parameters converge slowly, which is a major reason for the increased training time; how to integrate the model with phonetic features is therefore significant work. The performance gap between attention models and conventional models has several causes: insufficient training due to limited corpus resources, inaccurate alignment between features and phonemes (graphemes), and poor robustness to noise. In view of these problems, this model needs further study and a reasonable solution.
Summary of the invention
The purpose of the present invention is to solve the problems that existing traditional attention models have poor recognition performance and a single attention scale, and to propose a speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism.
A speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism, the method comprising the following steps:
Step 1: Use the input sample FBank speech feature vectors X=(x1,x2,...,xT) to perform unsupervised training of the RBM networks in the DBN, obtaining the first three connection weight matrices W1, W2, W3 that initialize the coding network; these three weight matrices, together with one randomly initialized output-layer weight matrix W4, form the DBN-based bottleneck feature extraction network at the front end of the coding network. RBM denotes a restricted Boltzmann machine (Restricted Boltzmann Machine); DBN denotes a deep belief network (Deep Belief Network); FBank denotes filter-bank features. In the sample FBank speech feature vector X=(x1,x2,...,xT), x1 is the FBank feature of the first frame, x2 that of the second frame, and xT that of frame T; these frame features together constitute the speech feature vector. FBank features are mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) without the discrete cosine transform step.
Step 2: Feed the sampled FBank speech feature vectors X=(x1,x2,...,xT) into the four-layer bottleneck feature extraction network of Step 1 and obtain the network output v by forward computation. Then, from the network output v and the target output of the training data X, perform supervised training of the bottleneck feature extraction network with the BP algorithm: compute the gradient of each layer's weights W_l and biases b_l in the four-layer network, then update the weights and biases iteratively. The supervised training of Step 2 makes the computation faster and more accurate.
Step 3: Remove the last output layer of the four-layer bottleneck feature extraction network of Step 2, retaining only the first three layers as the front end of the coding network, and use the bottleneck feature sequence (v1,v2,...,vn) output by the front end as the input of an RNN. This RNN is the back end of the coding network; it consists of two groups of multi-layer GRU-based RNNs, a forward network and a backward network.
The input FBank speech feature vector X thus passes through the front-end bottleneck feature extraction network to produce the bottleneck feature sequence (v1,v2,...,vn), and the encoder back end, a multi-layer GRU-based RNN, maps this sequence to the high-level feature sequence (h1,h2,...,hT).
The coding network is composed of the bottleneck feature extraction network of Step 1 and the recurrent neural network of Step 3; the bottleneck feature extraction network is called the front end of the coding network. GRU stands for Gated Recurrent Unit; RNN stands for Recurrent Neural Network.
Step 4: Using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h1,h2,...,hT) output by the coding network back end into convolution kernels F_i of different scales, modeling speech units at the phoneme, syllable, and word levels respectively to obtain the corresponding target vectors ct_o, which facilitate back-end classification. Each model's ct_o then serves as the input of the decoding RNN of the corresponding head in the multi-scale multi-head attention network, which computes the RNN hidden-state sequence (s1,s2,...,so) step by step and outputs the phoneme sequence (y1,y2,...,yo).
Step 5: Using the decoding RNN attached to each head of the multi-scale multi-head attention network, compute the output sequences (y1,y2,...,yo) and splice them into a new output sequence (y1,y2,...,yo)_i; then integrate the new output sequences into the final output sequence (Y1,Y2,...,Yo) with a DNN. Here, the output y_o of the decoding network of each head is generated from the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1.
The invention has the following benefits:
The present invention provides a speech recognition method based on bottleneck features and a multi-scale multi-head attention mechanism. The method uses a DBN as the front end of the coding network to extract bottleneck features, which increases the robustness of the speech recognition system, while the decoding network uses a multi-scale multi-head attention model built from convolution kernels of different scales, so that speech units at different levels (phoneme, syllable, word) can each be modeled, improving the recognition accuracy of the system.
The multi-scale multi-head attention model designed by this method outperforms conventional models in phoneme sequence recognition and has stronger temporal modeling and discrimination ability. The designed training model improves recognition accuracy and reduces the system's misclassification and rejection rates.
Detailed description of the invention
Fig. 1 is the multi-scale multi-head attention model combined with the bottleneck feature extraction network;
Fig. 2 is the DBN-based bottleneck feature extraction network.
Specific embodiment
Specific embodiment 1:
The speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism of this embodiment comprises the following steps:
Step 1: Use the input sample FBank speech feature vectors X=(x1,x2,...,xT) to perform unsupervised training of the RBM networks in the DBN, obtaining the first three connection weight matrices W1, W2, W3 that initialize the coding network; these three weight matrices, together with one randomly initialized output-layer weight matrix W4, form the DBN-based bottleneck feature extraction network at the front end of the coding network. RBM denotes a restricted Boltzmann machine (Restricted Boltzmann Machine); DBN denotes a deep belief network (Deep Belief Network); FBank denotes filter-bank features. In the sample FBank speech feature vector X=(x1,x2,...,xT), x1 is the FBank feature of the first frame, x2 that of the second frame, and xT that of frame T; these frame features together constitute the speech feature vector. FBank features are mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) without the discrete cosine transform step.
Step 2: Feed the sampled FBank speech feature vectors X=(x1,x2,...,xT) into the four-layer bottleneck feature extraction network of Step 1 and obtain the network output v by forward computation. Then, from the network output v and the target output of the training data X, perform supervised training of the bottleneck feature extraction network with the BP algorithm: compute the gradient of each layer's weights W_l and biases b_l in the four-layer network, then update the weights and biases iteratively. The supervised training of Step 2 makes the computation faster and more accurate.
Step 3: Remove the last output layer of the four-layer bottleneck feature extraction network of Step 2, retaining only the first three layers as the front end of the coding network, and use the bottleneck feature sequence (v1,v2,...,vn) output by the front end as the input of an RNN. This RNN is the back end of the coding network; it consists of two groups of multi-layer GRU-based RNNs, a forward network and a backward network.
The input FBank speech feature vector X thus passes through the front-end bottleneck feature extraction network to produce the bottleneck feature sequence (v1,v2,...,vn), and the encoder back end, a multi-layer GRU-based RNN, maps this sequence to the high-level feature sequence (h1,h2,...,hT).
The coding network is composed of the bottleneck feature extraction network of Step 1 and the recurrent neural network of Step 3; the bottleneck feature extraction network is called the front end of the coding network. GRU stands for Gated Recurrent Unit; RNN stands for Recurrent Neural Network.
Step 4: Using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h1,h2,...,hT) output by the coding network back end into convolution kernels F_i of different scales, modeling speech units at the phoneme, syllable, and word levels respectively to obtain the corresponding target vectors ct_o, which facilitate back-end classification. Each model's ct_o then serves as the input of the decoding RNN of the corresponding head in the multi-scale multi-head attention network, which computes the RNN hidden-state sequence (s1,s2,...,so) step by step and outputs the phoneme sequence (y1,y2,...,yo).
Step 5: Using the decoding RNN attached to each head of the multi-scale multi-head attention network, compute the output sequences (y1,y2,...,yo) and splice them into a new output sequence (y1,y2,...,yo)_i; then integrate the new output sequences into the final output sequence (Y1,Y2,...,Yo) with a DNN. Here, the output y_o of the decoding network of each head is generated from the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1.
Specific embodiment 2:
Unlike Specific Embodiment 1, in Step 1 of this embodiment the input speech feature vector X=(x1,x2,...,xT) uses 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, for a total of 123 dimensions. The extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the training-set normalization parameters are then used to normalize the development-set and test-set features.
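The 123-dimensional feature layout described above (41 static dimensions plus first- and second-order differences) and the training-set-only normalization can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; the standard delta-regression formula and the function names `add_deltas` and `cmvn` are assumptions.

```python
import numpy as np

def add_deltas(feats, N=2):
    """Append first- and second-order differences (standard delta regression).

    feats: (T, D) static features, e.g. 40 FBank dims + 1 energy = 41 dims.
    Returns (T, 3*D): statics, deltas, delta-deltas -> 123 dims for D=41.
    """
    T = len(feats)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    delta = sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
                for n in range(1, N + 1)) / denom
    # second-order differences: deltas of the deltas
    padded_d = np.pad(delta, ((N, N), (0, 0)), mode="edge")
    ddelta = sum(n * (padded_d[N + n:T + N + n] - padded_d[N - n:T + N - n])
                 for n in range(1, N + 1)) / denom
    return np.concatenate([feats, delta, ddelta], axis=1)

def cmvn(train, *others):
    """Normalize with mean/std estimated on the training set only, then
    apply the same statistics to the development and test sets."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0) + 1e-8
    return tuple((x - mu) / sigma for x in (train,) + others)
```

After `cmvn`, each training-set component has zero mean and unit variance, matching the standard-normal requirement stated above.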
Specific embodiment 3:
Unlike Specific Embodiment 1 or 2, in Steps 1 and 2 of this embodiment the RBM networks in the DBN undergo an unsupervised training process, where RBM training combines unsupervised pre-training (pretraining) with the supervised back-propagation gradient training algorithm. The input of the RBM network is the FBank speech features; the output layer of the network is a softmax layer, in which each output unit corresponds to the posterior probability of a tied triphone state. Between the input and output layers there are three hidden layers; the second hidden layer is the bottleneck layer, whose number of units is smaller than that of the other hidden layers.
Specific embodiment 4:
Unlike Specific Embodiment 1 or 2, in Step 1 of this embodiment the process of unsupervised training of the RBM network with the input FBank speech features is as follows:
Step 1-1: Add one layer that outputs phoneme-state classification labels on top of the 4-layer DBN, yielding a 5-layer bottleneck feature extraction RBM network; the speech features serve as the input of this 5-layer network. The 4-layer DBN is built by stacking 4 single-layer RBMs: the first is a Gauss-Bernoulli RBM and the remaining three are Bernoulli-Bernoulli RBMs.
Step 1-2: In an RBM there are no connections among the neurons within the visible layer or within the hidden layer, so the units within a layer are mutually independent given the other layer. The posterior probabilities p(v|h) and p(h|v) can therefore be computed easily. For a Bernoulli-Bernoulli RBM the posterior p(h|v) is derived from the energy function as follows:
E(v,h) = -a'v - b'h - v'Wh,  p(h|v) = prod_i p(h_i|v)   (1)
where E(v,h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*}v denotes the product of the i-th row of W with v. Formula (1) shows that, given the visible layer, the components of the hidden-layer state are mutually independent.
Step 1-3: Since each component h_i of the Bernoulli-Bernoulli RBM hidden state takes only the values 0 and 1, we have:
p(h_i = 1 | v) = sigma(W_{i,*}v + b_i)   (2)
where sigma(.) denotes the sigmoid function.
The vector of hidden-layer states can then be written as:
P (h=1 | v)=σ (Wv+b) (3)
By symmetry, given the hidden-layer state, the visible-layer state of a single-layer Bernoulli-Bernoulli RBM is:
P (v=1 | h)=σ (WTh+a) (4)
Step 1-4: For a Gauss-Bernoulli RBM, whose visible layer follows a Gaussian distribution, the posterior p(h=1|v) is computed as in formula (3), while p(v|h) is computed as:
P (v | h)=N (v;WTh+a,I) (5)
where I denotes the identity matrix.
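The three conditionals of formulas (3)-(5) can be sketched directly in NumPy. This is a minimal illustration under the conventions above (W maps visible to hidden); the function names are illustrative, not from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # formula (3): P(h=1|v) = sigma(W v + b); W has shape (hidden, visible)
    return sigmoid(W @ v + b)

def p_v_given_h_bernoulli(h, W, a):
    # formula (4): P(v=1|h) = sigma(W^T h + a), Bernoulli-Bernoulli RBM
    return sigmoid(W.T @ h + a)

def sample_v_given_h_gaussian(h, W, a, rng):
    # formula (5): v | h ~ N(W^T h + a, I), Gauss-Bernoulli RBM
    mean = W.T @ h + a
    return mean + rng.standard_normal(mean.shape)
```

Because the units within a layer are conditionally independent, each conditional factorizes per component and a single matrix-vector product suffices.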
Step 1-5: The stacking process of the DBN for encoding is as follows:
First, train a Gauss-Bernoulli RBM, because the first-layer input speech features are real-valued and match a Gaussian distribution. Then use the activation probabilities of its hidden units as the visible-layer input of the next Bernoulli-Bernoulli RBM; the hidden state of the second-layer RBM in turn serves as the input of the third-layer RBM, and so on for all subsequent layers.
Given a training data set V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood with gradient descent:
L(theta) = -sum_{v in V} log p(v; theta)   (6)
During training, the general form of the derivative of the negative log-likelihood is:
-d log p(v)/d theta = <dE(v,h)/d theta>_data - <dE(v,h)/d theta>_model   (7)
where theta = {a, b, W} determines the distribution of the RBM's visible vector v, and <xi>_data and <xi>_model denote the expected value of the random variable xi estimated from the sample data and from the model, respectively. For the weights of the connection matrix in a single-layer RBM:
d log p(v)/d W_ij = <v_i h_j>_data - <v_i h_j>_model   (8)
Taking a Bernoulli-Bernoulli RBM as an example, the expectation <v_i h_j>_data is the frequency with which the visible unit v_i and the hidden unit h_j are simultaneously 1 in the training data, while <v_i h_j>_model is the expectation computed under the RBM's generative model distribution.
Step 1-6: Train the DBN:
First, initialize the first-layer RBM from the input sample data with the contrastive divergence (CD) algorithm. In the CD algorithm the model expectation is estimated with samples obtained by Gibbs sampling (Gibbs Sample); the Gibbs sampling process in the CD algorithm is as follows:
1) Gibbs sampling initializes the visible layer with a data sample, giving the visible state v;
2) compute the posterior p(h|v) with formula (3) to obtain the hidden state h;
Next, depending on the RBM type, reconstruct the visible state v from the sample using formula (4) or formula (5), respectively. Then fix the parameters of the first-layer RBM, compute its hidden state from the sample data, and use it as the visible state of the second-layer RBM; at this point the second-layer RBM is trained with the CD algorithm in the same way;
Proceeding in this manner, the weight parameters connecting the successive layers of the DBN are obtained one by one;
Finally, after training, only the weight matrices connecting the layers are retained, as the initialization parameters of the DBN transfer matrices.
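The greedy layer-wise procedure above can be sketched with a minimal CD-1 update in NumPy. This is an illustrative sketch only: it uses Bernoulli-Bernoulli updates for every layer (the patent's Gaussian first layer is omitted for brevity), and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr=0.1):
    """One CD-1 step on a minibatch V of shape (M, n_visible)."""
    ph0 = sigmoid(V @ W.T + b)                        # up-pass, formula (3)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # Gibbs: sample hidden state
    pv1 = sigmoid(h0 @ W + a)                         # reconstruction, formula (4)
    ph1 = sigmoid(pv1 @ W.T + b)
    M = len(V)
    # <v h>_data - <v h>_model, formula (8)
    W += lr * (ph0.T @ V - ph1.T @ pv1) / M
    a += lr * (V - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

def pretrain_dbn(X, layer_sizes, epochs=5):
    """Greedy layer-wise pretraining: each RBM's hidden activation
    probabilities feed the next RBM's visible layer."""
    weights, inp = [], X
    for n_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((n_hidden, inp.shape[1]))
        a, b = np.zeros(inp.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            W, a, b = cd1_update(inp, W, a, b)
        weights.append((W, b))              # keep only connecting weights
        inp = sigmoid(inp @ W.T + b)        # activations become next input
    return weights
```

As in the text, only the connecting weight matrices (and hidden biases) are kept at the end, to initialize the DNN transfer matrices.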
Specific embodiment 5:
Unlike Specific Embodiment 1 or 2, in Step 2 of this embodiment the RBM network is trained with supervision using the BP algorithm. That is, after the network has been trained without supervision on the sample input data by the RBM technique, the first three connection weight matrices of the coding network are initialized, and the trained weight matrices partly reflect and "memorize" the useful information in the sample data. To further strengthen the network's ability to extract and discriminate speech features, so that the bottleneck features supplied to the subsequent attention model are more robust and discriminative, supervised training with speech-sample labels is performed on top of the unsupervised training. The detailed process is:
Step 2-1: Transfer the DBN weights directly into a deep neural network (Deep Neural Network, DNN); then fine-tune (fine tuning) on the basis of the known partial DNN model parameters. The first three connection weight matrices W1, W2, W3 of the network are known; these three layers and one randomly initialized output-layer weight matrix W4 form the DBN-based bottleneck feature extraction network at the front end of the coding network. Activation functions are inserted between the layers, and the forward pass of the network is computed as follows:
Let f(.) denote the activation function; the first hidden state h1 and the l-th hidden state h_l are respectively:
h1=f (W1x+b1) (9)
hl=f (Wlhl-1+bl) (10)
By the 4th hidden layer h4The last output vector v of network is obtained by softmax function:
V=softmax (W4h3+b4) (11)
When training the network with the above method, a fixed number of samples is randomly drawn from the training set each time, the features of the drawn samples are input to the network, and the network output is obtained by forward computation;
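The forward pass of formulas (9)-(11) can be sketched as follows. A minimal NumPy illustration assuming tanh as the activation f; `params` holds the four (W, b) pairs, with the second hidden layer as the narrow bottleneck.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, params, f=np.tanh):
    """Forward pass of the 4-layer bottleneck network:
    h1 = f(W1 x + b1), h_l = f(W_l h_{l-1} + b_l), v = softmax(W4 h3 + b4).
    params = [(W1,b1), (W2,b2), (W3,b3), (W4,b4)]."""
    h, hiddens = x, []
    for W, b in params[:-1]:
        h = f(W @ h + b)       # formulas (9) and (10)
        hiddens.append(h)
    W4, b4 = params[-1]
    return softmax(W4 @ h + b4), hiddens   # formula (11)
```

The returned `hiddens[1]` is the bottleneck-layer state that later serves as the front-end output.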
Step 2-2: Compute the gradient of each layer's parameters with the BP algorithm from the network output and the training labels, then update the model parameters:
DNN training generally uses the minimum cross-entropy (cross entropy, CE) criterion:
L_CE = -sum_k t_k log v_k   (12)
where t is the target label distribution and v the network output. The network parameters are then fine-tuned with the back-propagation algorithm (BackPropagation, BP), which typically updates the model parameters according to:
W_l^{t+1} = W_l^t - mu * dW_l^t   (13)
b_l^{t+1} = b_l^t - mu * db_l^t   (14)
where W_l^t and b_l^t are the weight matrix and bias vector of layer l after the t-th parameter update, mu is the learning rate, and dW_l^t and db_l^t are the gradients of the weight matrix and bias vector, computed from the first-order derivatives over the training samples:
dW_l^t = (1/M) sum_{m=1..M} dL_CE(m)/dW_l   (15)
db_l^t = (1/M) sum_{m=1..M} dL_CE(m)/db_l   (16)
where M is the number of samples used in each training step;
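For the softmax output layer, the cross-entropy gradient of formulas (15)-(16) has the well-known closed form v - t, and the update of formulas (13)-(14) is a plain gradient step. A minimal sketch for the output layer only (the function names are illustrative):

```python
import numpy as np

def output_layer_grads(H, T_onehot, W4, b4):
    """Minibatch CE gradient for the softmax output layer.
    H: (M, d) bottleneck-layer activations; T_onehot: (M, K) one-hot labels.
    For softmax + cross entropy the error signal is simply v - t."""
    Z = H @ W4.T + b4
    V = np.exp(Z - Z.max(axis=1, keepdims=True))
    V /= V.sum(axis=1, keepdims=True)
    delta = V - T_onehot                  # dL_CE / dz
    M = len(H)
    dW = delta.T @ H / M                  # formula (15), averaged over M
    db = delta.mean(axis=0)               # formula (16)
    return dW, db

def sgd_step(W, b, dW, db, mu=0.1):
    # formulas (13)-(14): parameter minus learning rate times gradient
    return W - mu * dW, b - mu * db
```

Repeating `output_layer_grads` and `sgd_step` over random minibatches is exactly the loop of Steps 2-1 and 2-2.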
Step 2-3: Execute Steps 2-1 and 2-2 cyclically, using whether the error between the network output and the target output falls below a set threshold as the criterion for ending training:
First, during training, the error between the network output and the reference output is usually also computed on a development set (development set) for cross validation (cross validation);
Then, to obtain the bottleneck feature extraction network, delete the state units, transfer matrix, and bias vector of the last output layer of the trained network, use what remains as the front end of the coding network, and take the state of the bottleneck layer as the output of the coding network front end. To extract bottleneck features, the raw speech feature parameters are used as the network input and the bottleneck-layer output vector is computed layer by layer; this vector is the bottleneck feature parameter. Because the model parameters are initialized with RBMs during training, and the alignment information provided by a GMM-HMM model is used during supervised training, the extracted bottleneck features discriminate different phoneme states better and are more robust than traditional features. Applying these bottleneck features to the attention model is therefore equivalent to migrating the knowledge in the network into the attention model.
Specific embodiment 6:
Unlike Specific Embodiment 5, in this embodiment's method:
the multi-scale multi-head attention model consists of three modules: the coding network, the decoding network, and the attention network;
the coding network and the decoding network contain recurrent neural network units;
the front end of the coding network is the DBN feature extraction network, and the back end is a deep recurrent neural network whose purpose is to learn and mine the contextual information of the speech feature sequence and extract high-level information from the primitive features, enhancing their discrimination and representation ability;
the main body of the attention network is a multilayer perceptron; its inputs are the output of the coding network and the hidden-unit state of the decoding network, and its output is their relevance score, which after exponential normalization becomes the attention coefficient;
the decoding network consists of a single-layer recurrent neural network connected to a maxout network; it first obtains the attention coefficients from the attention network, then computes the target vector as the weighted sum of all coding network outputs, and finally takes the target vector as the network input and computes the posterior probability of each phoneme at each position of the output sequence.
Specific embodiment 7:
Unlike Specific Embodiment 6, in Step 3 of this embodiment the high-level feature sequence (h1,h2,...,hT) is constructed by the front and back ends of the coding network of the multi-scale multi-head attention model. The detailed process is:
The back end of the coding network consists of two groups of multi-layer recurrent neural networks based on GRU units, a forward network and a backward network. This bidirectional structure passes past and future information simultaneously, increasing the information content of the high-level features.
Step 3-1: First, take the trained DBN bottleneck features (v1,v2,...,vn) from the coding network front end as the input of the recurrent neural network; then reduce the number of stacked recurrent layers, leaving only one layer to process the temporal information of the feature sequence, and output the high-level feature sequence (h1,h2,...,hT);
Step 3 two, in coding network rear end moment t, the implicit layer state point of c layers of feedforward network and backward network It is notWithDue to the hidden layer unit information direction of transfer of feedforward network and backward network on the contrary, then corresponding transmitting letter Number is respectively formula (17) and formula (18):
Step 3 two, feedforward network and backward network carry out feature between layers by respective transmission function parallel Transmitting, carried out in transmittance process in the time domain it is down-sampled, to reach the target for reducing calculation amount.With in feedforward network, due to will GRU is as basic unit, so hidden layer state computation such as formula (17) (18) is shown, by input feature vector sequence (x1,x2,..., xT) the 1st layer of implicit layer state can be obtainedSimilarly, c layers of implicit stratiform can be calculated by c-1 layers of implicit layer state StateCalculating process implies shown in layer state such as formula (19):
Similarly, to c layers of implicit layer state of network after calculating;
Step 3 three, c layers of hidden state for splicing feedforward network and backward network, obtain coding network moment t's High-level characteristic ht:
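The bidirectional encoding just described (a forward and a backward GRU pass whose hidden states are spliced per frame) can be sketched in NumPy as follows. This is an illustrative sketch, not part of the patent: the GRU gate equations are the standard ones, and all dimensions, initializations, and function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_cand

def init_gru(d_in, d_h, rng):
    """Random small-scale initialization (illustrative)."""
    p = {}
    for g in ("z", "r", "h"):
        p["W" + g] = 0.1 * rng.standard_normal((d_h, d_in))
        p["U" + g] = 0.1 * rng.standard_normal((d_h, d_h))
        p["b" + g] = np.zeros(d_h)
    return p

def encode(v_seq, p_fwd, p_bwd, d_h):
    """Run forward and backward GRUs over the bottleneck features and
    splice their per-frame hidden states (cf. formulas (17)-(20))."""
    T = len(v_seq)
    h = np.zeros(d_h)
    fwd = []
    for t in range(T):                  # forward pass, formula (17)
        h = gru_step(v_seq[t], h, p_fwd)
        fwd.append(h)
    h = np.zeros(d_h)
    bwd = [None] * T
    for t in range(T - 1, -1, -1):      # backward pass, formula (18)
        h = gru_step(v_seq[t], h, p_bwd)
        bwd[t] = h
    # splice forward and backward states into the high-level feature h_t
    return [np.concatenate([fwd[t], bwd[t]]) for t in range(T)]
```

Each output frame then has twice the hidden dimension, carrying both past and future context.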
Specific embodiment 8:
Different from specific embodiment seven, in the present embodiment of the speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism, the construction process of the attention network of the multi-scale multi-head attention model in step four is as follows:
The attention mechanism is a technique for extracting effective features from a feature sequence in sequence-to-sequence models. It assigns attention weights to the relevant region of the feature sequence so that irrelevant features are ignored. Using the multi-scale multi-head attention mechanism, to compute the element y_o of the output sequence of the sequence-to-sequence model, the feature sequence (h_1, h_2, ..., h_T) is mapped through convolution kernels F_i of different scales to one target vector ct_o per head of the attention mechanism. The target vector ct_o compresses the important information of the sequence, transforming the important feature information into a fixed-length vector and thereby facilitating classification at the back end. Then ct_o serves as the input of the decoding network behind each head's attention network, and the recurrent neural network computes, element by element, the hidden-state sequence (s_1, s_2, ..., s_o) and the output sequence (y_1, y_2, ..., y_o). Specifically:
The single-head attention model within the multi-scale multi-head attention model uses the attention mechanism to compute the target vector ct_o corresponding to output-sequence position o ∈ {1, 2, ..., O} as follows:
First, the degree of association between the hidden state s_{o-1} at output position o-1 and the feature at time t is computed, as shown in formula (21):
e_{o,t} = a(s_{o-1}, h_t)  (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, which can be written explicitly as:
e_{o,t} = ω^T tanh(W[s_{o-1}, h_t, f_{o,t}] + b)  (22)
f_o = F_i ∗ α_{o-1}  (23)
where W and b are the input-to-hidden weight matrix and bias vector, respectively, ω is the hidden-to-output weight vector, and F_i is a convolution kernel. For the multi-scale multi-head attention mechanism the sizes of the kernels F_i differ, the convolutional part of each head using a convolution filter of a different size. Owing to the multiple heads and the differently sized convolution filters, the model can not only use the representations of different subspaces at different sequence positions to process the sequence data; the differently sized filters can also model information at the phoneme, syllable, and word levels.
Here, the multi-scale multi-head model not only initializes the convolution-filter parameters randomly in different ways, so that the multi-head model forms an ensemble; the differently sized convolution filters also represent different models, allowing the attention scores of the previous step to be learned and trained under the supervision of multiple models and yielding models of different speech primitives. Models with smaller convolution filters mainly represent phoneme-level models, models of medium size mainly represent syllable-level models, and larger ones represent word-level models. The multi-scale multi-head attention model can therefore model speech primitives of different granularities, achieving the goal of improving speech recognition performance.
Then the degrees of association of the features at all times are exponentially normalized; the normalized values, called the attention coefficients, are:
Finally, taking the attention coefficients as weights, the features at all times are weighted and summed to obtain, under the attention mechanism, the target vector ct_o at output-sequence position o:
The process of computing the target vector with the attention mechanism can be abbreviated as formula (26):
ct_o = attention([h_1, h_2, ..., h_T], s_{o-1})  (26)。
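One head's attention step, formulas (21)-(26), can be sketched in NumPy as below. This is an illustrative sketch, not the patent's implementation: the example kernel widths (e.g. 3, 11, 31 for the phoneme, syllable, and word scales) and all dimensions are assumptions.

```python
import numpy as np

def attend(H, s_prev, alpha_prev, F_i, W, b, omega):
    """Location-aware attention for a single head.
    H: (T, d_h) encoder outputs; s_prev: (d_s,) previous decoder state;
    alpha_prev: (T,) previous attention coefficients; F_i: 1-D convolution
    kernel whose width sets this head's scale."""
    T = H.shape[0]
    f = np.convolve(alpha_prev, F_i, mode="same")   # f_o = F_i * alpha_{o-1}, (23)
    e = np.empty(T)
    for t in range(T):
        u = np.concatenate([s_prev, H[t], [f[t]]])  # [s_{o-1}, h_t, f_{o,t}]
        e[t] = omega @ np.tanh(W @ u + b)           # score, formula (22)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                            # exponential normalization, (24)
    ct = alpha @ H                                  # weighted sum -> ct_o, (25)
    return ct, alpha
```

In the multi-head setting, each head would call `attend` with its own kernel `F_i` and its own `W, b, omega`, yielding one target vector per head.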
Specific embodiment 9:
Different from specific embodiment eight, in the present embodiment of the speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism, the process in step five of computing the element y_o at position o of the output sequence (y_1, y_2, ..., y_o) with the decoding network corresponding to each head's attention network is:
For a single head of the multi-scale multi-head attention model, when the decoding network behind that head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o), the inputs of the network are the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1. The computation proceeds as follows:
First, the decoding network obtains the target vector ct_o from the attention-network computation and takes ct_o and y_{o-1} as the input of the GRU-based recurrent neural network, computing its hidden state s_o as shown in formula (27):
s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])  (27)
Then, with the decoding-network hidden state s_o ∈ R^d given as input, the posterior probability of phoneme i is obtained by the maxout network computation. Each hidden unit of the maxout network has multiple candidate units and selects the one with the largest value as its output; the computation is as follows:
where d is the dimension of the input hidden state s_o, corresponding to the number of hidden units, W_{i,j} ∈ R^{d×m×k} and b_{i,j} ∈ R^{m×k} are the maxout network's parameter matrix and bias vector, respectively, k is the number of candidate units of each maxout hidden unit, and m is the number of output units, corresponding to the number of output phoneme types in the acoustic model;
Finally, as shown in formula (30), the output layer of the maxout network yields the output vector y_o, whose i-th component represents the posterior probability that phoneme i occurs at output-sequence position o;
Finally, all output sequences (y_1, y_2, ..., y_o)_i are integrated into one new output sequence (Y_1, Y_2, ..., Y_o) with a DNN:
Y_i = DNN((y_1, y_2, ..., y_o)_i)  (31)。
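The maxout output layer of the decoder (formulas (28)-(30)) can be sketched in NumPy as below. This is an illustrative sketch under assumed parameter shapes (here the parameters are packed as W: (m, k, d) and b: (m, k)), not the patent's implementation.

```python
import numpy as np

def maxout_posteriors(s_o, W, b):
    """Maxout output layer: s_o is the (d,) decoder hidden state; each of
    the m output units has k candidate linear activations, keeps the
    largest one, and a softmax then yields the phoneme posteriors y_o."""
    z = np.einsum("mkd,d->mk", W, s_o) + b  # all m*k candidate activations
    y = z.max(axis=1)                       # maxout: best candidate per unit
    y = np.exp(y - y.max())                 # numerically stable softmax
    return y / y.sum()                      # posterior over m phoneme types
```

The per-head posterior vectors produced this way would then be merged by the integrating DNN of formula (31).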
In phoneme-sequence recognition, the multi-scale multi-head attention model outperforms traditional models, having stronger sequence-modeling and discriminative ability. To increase the robustness of the model in noisy environments, DBN-based bottleneck features are added on this basis, for the following reasons. First, because the DBN-based bottleneck feature-extraction network has no recurrent structure, the gradients of multiple frames can be computed in parallel on a GPU when training the DBN, effectively improving training efficiency and avoiding the large scale and long training time brought by the multilayer structure of recurrent neural networks. Second, reducing the number of recurrent layers shortens the distance over which gradients are back-propagated through the network, further reducing the training cost; at the same time, using more discriminative features makes the training of the model easier to converge, further improving training efficiency. Finally, the DBN-extracted bottleneck features are more robust to noise, which compensates for the weak noise robustness of the multi-scale multi-head attention model. The multi-scale model not only initializes the convolution-filter parameters randomly in different ways, so that the multi-head model forms an ensemble that outperforms a single-head model; the differently sized convolution filters also represent different models, allowing the attention scores of the previous step to be learned and trained under the supervision of multiple models and yielding models of different speech primitives. Here, models with smaller convolution filters mainly represent phoneme-level models, models of medium size mainly represent syllable-level models, and larger filter sizes correspond to word-level models. The multi-scale multi-head attention model can therefore model speech primitives of different granularities, achieving the goal of improving speech recognition performance.
The above shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should appreciate that the present invention is not limited to the above embodiments; the above embodiments and description only illustrate the principle of the invention, and various changes and improvements may be made to the invention without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The claimed scope of the invention is defined by the appended claims and their equivalents.

Claims (9)

1. A speech recognition model establishing method based on bottleneck features and a multi-scale multi-head attention mechanism, characterized in that the method comprises the following steps:
Step one: using the input sample FBank speech feature vectors X = (x_1, x_2, ..., x_T), perform unsupervised training of the RBM networks in the DBN to obtain the first three connection-weight matrices W_1, W_2, W_3 that initialize the coding network; these three connection-weight matrices together with an output layer W_4 of randomly initialized weights form the DBN-based bottleneck feature-extraction network at the front end of the coding network; RBM denotes a restricted Boltzmann machine; DBN denotes a deep belief network; FBank denotes a filter bank;
Step two: draw sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) and input them into the four-layer bottleneck feature-extraction network of step one, obtaining the network output v by forward computation; then, according to the output v of the bottleneck feature-extraction network and the actual output of the sample training data X, perform supervised training of the bottleneck feature-extraction network with the BP algorithm, computing the gradients of the weights ΔW_t^l and biases Δb_t^l of every layer of the four-layer network, and iteratively updating the weights W_t^l and biases b_t^l;
Step three: remove the last output layer of the four-layer bottleneck feature-extraction network of step two, retaining only the first three layers as the front end of the coding network; the bottleneck feature sequence (v_1, v_2, ..., v_n) output by the front end serves as the input of the RNN, which here is the back end of the coding network and consists of two groups of multilayer GRU-based RNNs, namely a forward network and a backward network;
The input FBank speech feature vectors X thus pass through the bottleneck feature-extraction network at the coding-network front end to yield the bottleneck feature sequence (v_1, v_2, ..., v_n), and this sequence passes through the multilayer GRU-based RNN at the encoder back end to yield the final output, the high-level feature sequence (h_1, h_2, ..., h_T);
Step four: using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h_1, h_2, ..., h_T) output by the coding-network back end into convolution kernels F_i of different scales, modeling phoneme-, syllable-, and word-level speech primitives to obtain the respective target vectors ct_o; then take the ct_o corresponding to each model as the input of the decoding RNN of each head of the multi-scale multi-head attention network, computing one by one the RNN hidden-state sequence (s_1, s_2, ..., s_o) and the output phoneme sequence (y_1, y_2, ..., y_o);
Step five: splice the output sequences (y_1, y_2, ..., y_o) computed by the decoding RNNs corresponding to each head's attention network in the multi-scale multi-head attention network to obtain new output sequences (y_1, y_2, ..., y_o)_i; integrate the new output sequences into the final output sequence (Y_1, Y_2, ..., Y_O) using a DNN.
2. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 1, characterized in that in step one the input FBank speech feature vectors X = (x_1, x_2, ..., x_T) use 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, for a total of 123 parameters; the extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the normalization parameters of the training set are then used to normalize the features of the test and development sets.
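The feature preparation of this claim (41 static parameters spliced with their first- and second-order differences into 123 dimensions, then mean-variance normalized with training-set statistics) can be sketched as follows. This is an illustrative sketch: a plain frame difference stands in for the usual regression-based delta computation, which the patent does not specify.

```python
import numpy as np

def add_deltas(feats):
    """Splice first- and second-order differences onto the static
    features, turning (T, 41) FBank+energy frames into (T, 123)."""
    d1 = np.diff(feats, axis=0, prepend=feats[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    return np.hstack([feats, d1, d2])

def fit_cmvn(train):
    """Estimate per-component mean/std on the training set only."""
    return train.mean(axis=0), train.std(axis=0) + 1e-8

def apply_cmvn(x, mean, std):
    """Normalize any set (train/dev/test) with training-set statistics."""
    return (x - mean) / std
```

Applying `apply_cmvn` with the training-set statistics to the development and test sets matches the claim's requirement that all sets share one normalization.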
3. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 2, characterized in that in steps one and two, in the unsupervised training of the RBM networks in the DBN, the input of the RBM network is the FBank speech features and the output layer of the RBM network is a softmax layer, each output-layer unit corresponding to the posterior probability of a tied triphone state; there are three hidden layers between the input and output layers, the second hidden layer being the bottleneck layer, whose number of state units is smaller than that of the other hidden layers.
4. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 3, characterized in that in step one the process of unsupervised training of the RBM networks in the DBN using the input FBank speech feature vectors X = (x_1, x_2, ..., x_T) is:
Step one-one: add one layer of output phoneme-state classification labels on top of the 4-layer DBN, obtaining a 5-layer bottleneck structure; the speech features serve as the input of the RBM networks of this 5-layer bottleneck feature extractor; the 4-layer DBN is formed by stacking 4 single-layer RBMs, the first being a Gaussian-Bernoulli RBM and the remaining three Bernoulli-Bernoulli RBMs;
Step one-two: in an RBM, the posterior probability p(h|v) of a Bernoulli-Bernoulli RBM is computed as follows:
where E(v, h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*}v denotes the product of the i-th row of W with v; the components of the hidden state are mutually independent;
Step one-three: since each component h_i of the hidden-layer neuron state of a Bernoulli-Bernoulli RBM takes only the values 0 and 1, we have:
where σ(·) denotes the sigmoid function;
The vector of hidden states can be expressed as:
p(h = 1|v) = σ(Wv + b)  (3)
By symmetry, given the hidden states, the visible states of a single-layer Bernoulli-Bernoulli RBM are:
p(v = 1|h) = σ(W^T h + a)  (4)
Step one-four: for the Gaussian-Bernoulli RBM, whose visible layer is Gaussian, p(h = 1|v) is computed as in formula (3), while p(v|h) is computed as:
p(v|h) = N(v; W^T h + a, I)  (5)
where I denotes the identity matrix;
Step one-five: the stacking process of the DBN used for encoding is:
First, a Gaussian-Bernoulli RBM is trained; then the activation probabilities of its hidden units serve as the visible-layer input of the next, Bernoulli-Bernoulli, RBM; the hidden states of the second-layer RBM in turn serve as the input of the third-layer RBM, and so on for all subsequent layers;
Given the training dataset V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood using gradient descent:
During training, the general form of the derivative of the negative log-likelihood is:
where θ = {a, b, W} determines the distribution of the RBM's visible-layer vector v, and ⟨ξ⟩_data and ⟨ξ⟩_model denote the expected values of the random variable ξ estimated from the sample data and from the model, respectively; for the weights of the connection matrix in a single-layer RBM, we have:
Step one-six: train the DBN:
First, initialize the first-layer RBM from the input sample data with the contrastive divergence (CD) algorithm, in which the model expectation is estimated from samples obtained by Gibbs sampling; the Gibbs sampling process in the CD algorithm is specifically:
1) the Gibbs sampler initializes the visible layer with a data sample, giving the visible state v;
2) the posterior probability p(h|v) is computed with formula (3) and taken as the hidden state h;
Next, depending on the type of RBM, formula (4) or formula (5) is used to reconstruct the visible state v from the sample; afterwards the parameters of the first-layer RBM are fixed, the hidden states of that RBM are computed from the sample data and taken as the visible states of the second-layer RBM, which is then trained with the CD algorithm in the same way;
Proceeding backward in this manner, the weight parameters of the DBN connections between all layers are obtained one by one;
Finally, after training, only the weight matrices connecting layer to layer are retained as the initialization parameters of the DBN transfer matrices.
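The CD step of this claim can be sketched for a Bernoulli-Bernoulli RBM as the standard CD-1 update. This is an illustrative sketch, not the patent's implementation: learning rate, shapes, and the use of mean-field probabilities in the reconstruction are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update of a Bernoulli-Bernoulli RBM (cf. formulas (3), (4), (8)).
    v0: (n_v,) data sample; W: (n_h, n_v); a: visible bias; b: hidden bias."""
    ph0 = sigmoid(W @ v0 + b)                  # p(h=1|v), formula (3)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # Gibbs sample of h
    pv1 = sigmoid(W.T @ h0 + a)                # reconstruction p(v=1|h), formula (4)
    ph1 = sigmoid(W @ pv1 + b)
    # positive phase <vh>_data minus negative phase <vh>_model
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, pv1))
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)
    return W, a, b
```

Stacking then proceeds as in the claim: freeze the trained layer, feed its hidden activations to the next RBM, and repeat.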
5. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 4, characterized in that in step two, the supervised training of the bottleneck feature-extraction network with the BP algorithm according to its output v and the actual output of the sample training data X, i.e., supervised training of the speech samples with their labels on top of the unsupervised training, proceeds as follows:
Step two-one: transfer the DBN weights directly into the deep neural network; thereafter fine-tune on the basis of the known partial DNN parameters; the first three connection-weight matrices W_1, W_2, W_3 are known, and together with an output layer W_4 of randomly initialized weights they form the DBN-based bottleneck feature-extraction network of the coding-network front end; activation functions are added between layers, and the forward pass of the network is computed:
Let f(·) denote the activation function; the first hidden state h_1 and the l-th hidden state h_l are respectively:
h_1 = f(W_1 x + b_1)  (9)
h_l = f(W_l h_{l-1} + b_l)  (10)
The final output vector v of the network is obtained through the softmax function:
v = softmax(W_4 h_3 + b_4)  (11)
When training the network with the above method, a fixed number of samples is randomly drawn from the training set each time, the features of the drawn samples are input to the network, and the network output is obtained by forward computation;
Step two-two: compute the gradients of every layer's parameters with the BP algorithm according to the network output and the sample training data, then update the model parameters:
The DBN network training criterion is minimization of the cross-entropy function:
Then the network parameters are fine-tuned with the back-propagation algorithm, the model parameters being updated according to the following formulas:
where W_t^l and b_t^l are, respectively, the weight matrix and bias vector of layer l of the network after the t-th parameter update, μ is the learning rate, and ΔW_t^l and Δb_t^l are the gradients of the weight matrix and bias vector, obtained by computing first derivatives over the training samples:
where M is the number of samples used in each training step;
Step two-three: execute steps two-one and two-two cyclically, using whether the error between the network output and the model's standard output reaches a set threshold as the criterion for terminating training:
First, during training the error between the network output and the standard output is computed on the development set for cross-validation;
Then, on this basis, the state units, transfer matrix, and bias vector of the last output layer of the trained network are all deleted, and what remains serves as the front end of the coding network, with the states of the feature-extraction bottleneck layer as the output of the coding-network front end; taking the primitive speech feature parameters as network input, the bottleneck-layer output vector is computed layer by layer through the network; this vector is the bottleneck feature parameter, i.e., the extracted bottleneck feature.
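The supervised fine-tuning of this claim (forward pass of formulas (9)-(11), cross-entropy criterion, gradient update of formulas (13)-(16)) can be sketched as below. This is a deliberately simplified illustration: it updates only the output layer, exploiting the fact that for softmax plus cross-entropy the output-layer error is v − onehot(label); the full BP recursion through all layers is omitted, and all dimensions and the learning rate are assumptions.

```python
import numpy as np

def forward(x, Ws, bs):
    """Forward pass of formulas (9)-(11): sigmoid hidden layers,
    softmax output; returns posteriors and all layer activations."""
    acts = [x]
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))
        acts.append(h)
    z = Ws[-1] @ h + bs[-1]
    e = np.exp(z - z.max())
    return e / e.sum(), acts

def sgd_step_output_layer(x, label, Ws, bs, mu):
    """One cross-entropy SGD update restricted to the output layer W_4, b_4
    (a simplification of formulas (13)-(16))."""
    v, acts = forward(x, Ws, bs)
    g = v.copy()
    g[label] -= 1.0                        # dL/dz for softmax + cross-entropy
    Ws[-1] = Ws[-1] - mu * np.outer(g, acts[-1])
    bs[-1] = bs[-1] - mu * g
    return -np.log(v[label] + 1e-12)       # cross-entropy loss before update
```

After fine-tuning, the output layer would be discarded and the bottleneck-layer activations kept as features, exactly as the claim describes.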
6. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 5, characterized in that the multi-scale multi-head attention model is composed of three modules: a coding network, a decoding network, and an attention network;
The coding network and the decoding network contain recurrent neural network units;
The front end of the coding network is the DBN feature-extraction network, and the back end is a deep recurrent neural network;
The main body of the attention network is a multilayer perceptron; its inputs are the outputs of the coding network and the hidden-layer unit states of the decoding network, and its output is their degree-of-association score; the attention coefficients are obtained by exponential normalization of these scores;
The decoding network consists of a single-layer recurrent neural network connected to a maxout network; it first obtains the attention coefficients from the attention network, then computes the target vector as the weighted sum of all coding-network outputs, and finally takes the target vector as the network input to compute the posterior probability of each phoneme at each position of the output sequence.
7. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 6, characterized in that in step three the high-level feature sequence (h_1, h_2, ..., h_T) is constructed by the front and back ends of the coding network of the multi-scale multi-head attention model; the specific construction process is:
Step three-one: first, the trained DBN bottleneck features (v_1, v_2, ..., v_n) from the coding-network front end are taken as the input of the recurrent neural network; then the number of stacked recurrent layers is reduced, leaving only one layer to process the temporal information of the feature sequence and output the high-level feature sequence (h_1, h_2, ..., h_T);
Step three-two: at time t in the coding-network back end, the c-th-layer hidden states of the forward network and the backward network are computed separately; since the hidden units of the forward and backward networks pass information in opposite directions, the corresponding transfer functions are given by formula (17) and formula (18), respectively:
In step three-two, the forward network uses the GRU as its basic unit, so the hidden states are computed as in formulas (17) and (18): the first-layer hidden states are obtained from the input feature sequence (x_1, x_2, ..., x_T), and likewise the c-th-layer hidden states are computed from the (c-1)-th-layer hidden states, as shown in formula (19):
Similarly, the c-th-layer hidden states of the backward network are computed;
Step three-three: the c-th-layer hidden states of the forward and backward networks are spliced to obtain the high-level feature h_t of the coding network at time t:
8. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 7, characterized in that in step four the construction process of the attention network of the multi-scale multi-head attention model is:
Using the multi-scale multi-head attention mechanism, to compute the element y_o of the output sequence of the sequence-to-sequence model, the feature sequence (h_1, h_2, ..., h_T) is mapped through convolution kernels F_i of different scales to one target vector ct_o per head of the attention mechanism; ct_o then serves as the input of the decoding network behind each head's attention network, and the recurrent neural network computes, element by element, the hidden-state sequence (s_1, s_2, ..., s_o) and the output sequence (y_1, y_2, ..., y_o); specifically,
The single-head attention model within the multi-scale multi-head attention model uses the attention mechanism to compute the target vector ct_o corresponding to output-sequence position o ∈ {1, 2, ..., O} as follows:
First, the degree of association between the hidden state s_{o-1} at output position o-1 and the feature at time t is computed, as shown in formula (21):
e_{o,t} = a(s_{o-1}, h_t)  (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, expressed as:
e_{o,t} = ω^T tanh(W[s_{o-1}, h_t, f_{o,t}] + b)  (22)
f_o = F_i ∗ α_{o-1}  (23)
where W and b are the input-to-hidden weight matrix and bias vector, respectively, ω is the hidden-to-output weight vector, and F_i is a convolution kernel; for the multi-scale multi-head attention mechanism the sizes of the kernels F_i differ, the convolutional part of each head using a convolution filter of a different size;
The multi-scale multi-head model not only initializes the convolution-filter parameters randomly in different ways, so that the multi-head model forms an ensemble; the differently sized convolution filters also represent different models, allowing the attention scores of the previous step to be learned and trained under the supervision of multiple models and yielding models of different speech primitives;
Then the degrees of association of the features at all times are exponentially normalized; the normalized values, called the attention coefficients, are:
Finally, taking the attention coefficients as weights, the features at all times are weighted and summed to obtain, under the attention mechanism, the target vector ct_o at output-sequence position o:
The process of computing the target vector with the attention mechanism can be abbreviated as formula (26):
ct_o = attention([h_1, h_2, ..., h_T], s_{o-1})  (26)。
9. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 8, characterized in that in step five the process of computing the element y_o at position o of the output sequence (y_1, y_2, ..., y_o) with the decoding network corresponding to each head's attention network is:
For a single head of the multi-scale multi-head attention model, when the decoding network behind that head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o), the inputs of the network are the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1; the computation proceeds as follows:
First, the decoding network obtains the target vector ct_o from the attention-network computation and takes ct_o and y_{o-1} as the input of the GRU-based recurrent neural network, computing its hidden state s_o as shown in formula (27):
s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])  (27)
Then, with the decoding-network hidden state s_o ∈ R^d given as input, the posterior probability of phoneme i is obtained by the maxout network computation; each hidden unit of the maxout network has multiple candidate units and selects the one with the largest value as its output; the computation is as follows:
where d is the dimension of the input hidden state s_o, corresponding to the number of hidden units, W_{i,j} ∈ R^{d×m×k} and b_{i,j} ∈ R^{m×k} are the maxout network's parameter matrix and bias vector, respectively, k is the number of candidate units of each maxout hidden unit, and m is the number of output units, corresponding to the number of output phoneme types in the acoustic model;
Finally, as shown in formula (30), the output layer of the maxout network yields the output vector y_o, whose i-th component represents the posterior probability that phoneme i occurs at output-sequence position o;
Finally, all output sequences (y_1, y_2, ..., y_o)_i are integrated into one new output sequence (Y_1, Y_2, ..., Y_O) with a DNN:
Y_i = DNN((y_1, y_2, ..., y_o)_i)  (31)。
CN201910477034.2A 2019-06-03 2019-06-03 Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism Active CN110211574B (en)

Publications (2)

Publication Number Publication Date
CN110211574A true CN110211574A (en) 2019-09-06
CN110211574B CN110211574B (en) 2022-03-11

CN111680591A (en) * 2020-05-28 2020-09-18 天津大学 Pronunciation inversion method based on feature fusion and attention mechanism
CN112530421A (en) * 2020-11-03 2021-03-19 科大讯飞股份有限公司 Voice recognition method, electronic equipment and storage device
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112992125A (en) * 2021-04-20 2021-06-18 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium
CN113192489A (en) * 2021-05-16 2021-07-30 金陵科技学院 Paint spraying robot voice recognition method based on multi-scale enhancement BiLSTM model
CN113450781A (en) * 2020-03-25 2021-09-28 阿里巴巴集团控股有限公司 Speech processing method, speech encoder, speech decoder and speech recognition system
CN113537472A (en) * 2021-07-26 2021-10-22 北京计算机技术及应用研究所 Bidirectional recurrent neural network with low computation and storage consumption
CN113593526A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Speech emotion recognition method based on deep learning
CN113705322A (en) * 2021-06-11 2021-11-26 北京易达图灵科技有限公司 Handwritten Chinese character recognition method and device based on threshold graph neural network
CN113782014A (en) * 2021-09-26 2021-12-10 联想(北京)有限公司 Voice recognition method and device
CN113936680A (en) * 2021-10-08 2022-01-14 电子科技大学 Single-channel speech enhancement method based on multi-scale information perception convolutional neural network
CN114333900A (en) * 2021-11-30 2022-04-12 南京硅基智能科技有限公司 Method for extracting BNF characteristics end to end, network model, training method and system
CN114999459A (en) * 2022-05-05 2022-09-02 苏州大学 Voice recognition method and system based on multi-scale recursive quantitative analysis
WO2022267960A1 (en) * 2021-06-24 2022-12-29 长沙理工大学 Federated attention dbn collaborative detection system based on client selections

Citations (8)

Publication number Priority date Publication date Assignee Title
US20120253812A1 (en) * 2011-04-01 2012-10-04 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
US20120328161A1 (en) * 2011-06-22 2012-12-27 Palenychka Roman Method and multi-scale attention system for spatiotemporal change determination and object detection
US20140114655A1 (en) * 2012-10-19 2014-04-24 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice
CN104756182A (en) * 2012-11-29 2015-07-01 索尼电脑娱乐公司 Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection
CN106571135A (en) * 2016-10-27 2017-04-19 苏州大学 Whisper speech feature extraction method and system
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN108510983A (en) * 2017-02-24 2018-09-07 百度(美国)有限责任公司 The system and method for automatic unit selection and goal decomposition for sequence labelling
CN109271838A (en) * 2018-07-19 2019-01-25 重庆邮电大学 A kind of three parameter attributes fusion gesture identification method based on fmcw radar

Patent Citations (10)

Publication number Priority date Publication date Assignee Title
US20120253812A1 (en) * 2011-04-01 2012-10-04 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
CN103503060A (en) * 2011-04-01 2014-01-08 索尼电脑娱乐公司 Speech syllable/vowel/phone boundary detection using auditory attention cues
CN105023573A (en) * 2011-04-01 2015-11-04 索尼电脑娱乐公司 Speech syllable/vowel/phone boundary detection using auditory attention cues
US20120328161A1 (en) * 2011-06-22 2012-12-27 Palenychka Roman Method and multi-scale attention system for spatiotemporal change determination and object detection
US20140114655A1 (en) * 2012-10-19 2014-04-24 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice
CN104756182A (en) * 2012-11-29 2015-07-01 索尼电脑娱乐公司 Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection
CN106571135A (en) * 2016-10-27 2017-04-19 苏州大学 Whisper speech feature extraction method and system
CN108510983A (en) * 2017-02-24 2018-09-07 百度(美国)有限责任公司 The system and method for automatic unit selection and goal decomposition for sequence labelling
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN109271838A (en) * 2018-07-19 2019-01-25 重庆邮电大学 A kind of three parameter attributes fusion gesture identification method based on fmcw radar

Non-Patent Citations (3)

Title
ANDROS TJANDRA ET AL.: "Multi-Scale Alignment and Contextual History for Attention Mechanism in Sequence-to-Sequence Model", SLT 2018 *
YOON KIM: "Convolutional Neural Networks for Sentence Classification", arXiv *
LONG XINGYAN: "Research on End-to-End Speech Recognition Technology Based on Attention Mechanism", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (43)

Publication number Priority date Publication date Assignee Title
CN110824915A (en) * 2019-09-30 2020-02-21 华南师范大学 GA-DBN network-based intelligent monitoring method and system for wastewater treatment
CN110824915B (en) * 2019-09-30 2022-06-07 华南师范大学 GA-DBN network-based intelligent monitoring method and system for wastewater treatment
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN110765966B (en) * 2019-10-30 2022-03-25 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN110930978A (en) * 2019-11-08 2020-03-27 北京搜狗科技发展有限公司 Language identification method and device and language identification device
CN110880328A (en) * 2019-11-20 2020-03-13 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111159419B (en) * 2019-12-09 2021-05-25 浙江师范大学 Knowledge tracking data processing method, system and storage medium based on graph convolution
CN111159419A (en) * 2019-12-09 2020-05-15 浙江师范大学 Knowledge tracking data processing method, system and storage medium based on graph convolution
CN111312228A (en) * 2019-12-09 2020-06-19 中国南方电网有限责任公司 End-to-end-based voice navigation method applied to electric power enterprise customer service
CN111243609A (en) * 2020-01-10 2020-06-05 平安科技(深圳)有限公司 Method and device for intelligently detecting effective voice and computer readable storage medium
CN111243578A (en) * 2020-01-10 2020-06-05 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111276131A (en) * 2020-01-22 2020-06-12 厦门大学 Multi-class acoustic feature integration method and system based on deep neural network
CN111309965A (en) * 2020-03-20 2020-06-19 腾讯科技(深圳)有限公司 Audio matching method and device, computer equipment and storage medium
CN111309965B (en) * 2020-03-20 2024-02-13 腾讯科技(深圳)有限公司 Audio matching method, device, computer equipment and storage medium
CN113450781A (en) * 2020-03-25 2021-09-28 阿里巴巴集团控股有限公司 Speech processing method, speech encoder, speech decoder and speech recognition system
CN111145760B (en) * 2020-04-02 2020-06-30 支付宝(杭州)信息技术有限公司 Method and neural network model for speaker recognition
CN111145760A (en) * 2020-04-02 2020-05-12 支付宝(杭州)信息技术有限公司 Method and neural network model for speaker recognition
CN111489738B (en) * 2020-04-13 2021-02-23 深圳市友杰智新科技有限公司 Feature extraction method and voice command identification method based on multi-head attention mechanism
CN111489738A (en) * 2020-04-13 2020-08-04 深圳市友杰智新科技有限公司 Feature extraction method and voice command identification method based on multi-head attention mechanism
CN111477221A (en) * 2020-05-28 2020-07-31 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111680591B (en) * 2020-05-28 2023-01-13 天津大学 Pronunciation inversion method based on feature fusion and attention mechanism
CN111477221B (en) * 2020-05-28 2022-12-30 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111680591A (en) * 2020-05-28 2020-09-18 天津大学 Pronunciation inversion method based on feature fusion and attention mechanism
CN111524519B (en) * 2020-05-28 2022-09-30 中国科学技术大学 End-to-end multi-channel speech recognition method using advanced feature fusion
CN111524519A (en) * 2020-05-28 2020-08-11 中国科学技术大学 End-to-end multi-channel speech recognition method using advanced feature fusion
CN112530421A (en) * 2020-11-03 2021-03-19 科大讯飞股份有限公司 Voice recognition method, electronic equipment and storage device
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112992125A (en) * 2021-04-20 2021-06-18 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112992125B (en) * 2021-04-20 2021-08-03 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium
CN113192489A (en) * 2021-05-16 2021-07-30 金陵科技学院 Paint spraying robot voice recognition method based on multi-scale enhancement BiLSTM model
CN113705322A (en) * 2021-06-11 2021-11-26 北京易达图灵科技有限公司 Handwritten Chinese character recognition method and device based on threshold graph neural network
WO2022267960A1 (en) * 2021-06-24 2022-12-29 长沙理工大学 Federated attention dbn collaborative detection system based on client selections
CN113537472A (en) * 2021-07-26 2021-10-22 北京计算机技术及应用研究所 Bidirectional recurrent neural network with low computation and storage consumption
CN113537472B (en) * 2021-07-26 2024-04-09 北京计算机技术及应用研究所 Construction method of bidirectional recurrent neural network with low calculation and storage consumption
CN113593526A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Speech emotion recognition method based on deep learning
CN113782014A (en) * 2021-09-26 2021-12-10 联想(北京)有限公司 Voice recognition method and device
CN113782014B (en) * 2021-09-26 2024-03-26 联想(北京)有限公司 Speech recognition method and device
CN113936680B (en) * 2021-10-08 2023-08-08 电子科技大学 Single-channel voice enhancement method based on multi-scale information perception convolutional neural network
CN113936680A (en) * 2021-10-08 2022-01-14 电子科技大学 Single-channel speech enhancement method based on multi-scale information perception convolutional neural network
CN114333900A (en) * 2021-11-30 2022-04-12 南京硅基智能科技有限公司 Method for extracting BNF characteristics end to end, network model, training method and system
CN114333900B (en) * 2021-11-30 2023-09-05 南京硅基智能科技有限公司 Method for extracting BNF (BNF) characteristics end to end, network model, training method and training system
CN114999459A (en) * 2022-05-05 2022-09-02 苏州大学 Voice recognition method and system based on multi-scale recursive quantitative analysis

Also Published As

Publication number Publication date
CN110211574B (en) 2022-03-11

Similar Documents

Publication Publication Date Title
CN110211574A (en) Method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism
Zhang et al. End-to-end attention based text-dependent speaker verification
An et al. Deep CNNs with self-attention for speaker identification
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
CN109409296B (en) Video emotion recognition method integrating facial expression recognition and voice emotion recognition
Deng et al. New types of deep neural network learning for speech recognition and related applications: An overview
CN109192199A Data processing method combining a bottleneck-feature acoustic model
Waibel Modular construction of time-delay neural networks for speech recognition
Tan et al. The artificial intelligence renaissance: deep learning and the road to human-level machine intelligence
Elleuch et al. Arabic handwritten characters recognition using deep belief neural networks
Irum et al. Speaker verification using deep neural networks: A
Frinken et al. Deep BLSTM neural networks for unconstrained continuous handwritten text recognition
CN109545227A Method and system for automatic speaker gender identification based on a deep autoencoder network
CN109637526A (en) The adaptive approach of DNN acoustic model based on personal identification feature
Chattopadhyay et al. Optimizing speech emotion recognition using manta-ray based feature selection
CN111899766A (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
Jiang et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit.
Huang et al. A neural net approach to speech recognition
Singh et al. Emoception: An inception inspired efficient speech emotion recognition network
Wang et al. L2 mispronunciation verification based on acoustic phone embedding and Siamese networks
Huilian et al. Speech emotion recognition based on BLSTM and CNN feature fusion
Yang et al. Exploration of acoustic and lexical cues for the interspeech 2020 computational paralinguistic challenge
Cai et al. Deep speaker embeddings with convolutional neural network on supervector for text-independent speaker recognition
Bottou Multilayer neural networks
Hu et al. I-vector and dnn hybrid method for short utterance speaker recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant