CN110211574A - Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism - Google Patents

Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism Download PDF

Info

Publication number
CN110211574A
Authority
CN
China
Prior art keywords
network
layer
output
attention
rbm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910477034.2A
Other languages
Chinese (zh)
Other versions
CN110211574B (en)
Inventor
韩纪庆
唐海桃
郑铁然
郑贵滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201910477034.2A priority Critical patent/CN110211574B/en
Publication of CN110211574A publication Critical patent/CN110211574A/en
Application granted granted Critical
Publication of CN110211574B publication Critical patent/CN110211574B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks

Abstract

A speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism, belonging to the field of modeling methods. Traditional attention models suffer from poor recognition performance and a single attention scale. In the proposed method, a deep belief network is used at the front end to extract bottleneck features, which increases the robustness of the model; the back end then uses a multi-scale multi-head attention model built from convolution kernels of different scales to model speech units at the phoneme, syllable, and word levels separately, computing the recurrent-neural-network hidden-state sequence and output sequence step by step. The decoding network attached to each attention head computes the output element at each position, and finally all output sequences are integrated into a single new output sequence. The invention improves the recognition accuracy of speech recognition systems.

Description

Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism
Technical field
The present invention relates to a training method in the technical field of speech recognition, and in particular to an improved method that increases model robustness by extracting bottleneck features and builds a multi-scale multi-head model to model speech units at the phoneme, syllable, and word levels, thereby improving recognition performance.
Background technique
Speech is one of the most universal and common signals in human society and an important channel through which people express, communicate, and spread information. In the current era of information explosion, massive volumes of speech data are generated continuously on the internet and over telephone channels. To identify, classify, and retrieve speech signals at scale, the demand for automatic speech recognition (Automatic Speech Recognition, ASR) has become increasingly urgent. Compared with speech recognition systems based on the traditional hidden Markov model (Hidden Markov Model, HMM), end-to-end systems fuse the acoustic model, pronunciation dictionary, and language model into a single neural network, which makes the model more concise and allows its parameters to be optimized directly. End-to-end models are also more general: they reduce the dependence on expert linguistic knowledge and lower the difficulty of building a recognition system. Their training data requires no alignment information, so annotated speech can be used for training directly. The attention model drops the output independence assumption and quantifies the alignment between phonemes and features through attention coefficients, which gives it a degree of interpretability. This patent develops research on end-to-end speech recognition around the attention model, focusing on the long training cycles and insufficient recognition performance of current attention-based systems. Because the attention model applies a large number of recurrent-neural-network units internally, its parameters are large in scale and its internal structure is complex, which leads to low training efficiency; although this can be partly mitigated by adding hardware computing resources, it remains a hard problem at the algorithmic level. Because the attention model completely abandons phonetic prior knowledge, it lacks effective initialization parameters and its parameters converge slowly, which is a major reason for the increased training time; how to integrate the model with phonetic features is therefore significant work. The performance gap between attention models and conventional models has several causes: insufficient training due to limited corpus resources, inaccurate alignment between features and phonemes (graphemes), and poor robustness to noise. In view of these problems, this model needs further study and a reasonable solution.
Summary of the invention
The purpose of the present invention is to solve the problems that existing traditional attention models have poor recognition performance and a single attention scale, and to propose a speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism.
A speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism, the method comprising the following steps:
Step 1: Use the input sample FBank speech feature vectors X=(x1,x2,...,xT) to perform unsupervised training of the RBM networks in the DBN, obtaining the first three connection weight matrices W1, W2, W3 that initialize the coding network; these three weight matrices, together with one randomly initialized output-layer weight matrix W4, form the DBN-based bottleneck feature extraction network at the front end of the coding network. RBM denotes a restricted Boltzmann machine (Restricted Boltzmann Machine); DBN denotes a deep belief network (Deep Belief Network); FBank denotes filter-bank features. In the sample FBank speech feature vector X=(x1,x2,...,xT), x1 is the FBank feature of the first frame, x2 that of the second frame, and xT that of frame T; these frame features together constitute the speech feature vector. FBank features are mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) without the discrete cosine transform step.
Step 2: Feed the sampled FBank speech feature vectors X=(x1,x2,...,xT) into the four-layer bottleneck feature extraction network of Step 1 and obtain the network output v by forward computation. Then, from the network output v and the target output of the training data X, perform supervised training of the bottleneck feature extraction network with the BP algorithm: compute the gradient of each layer's weights W_l and biases b_l in the four-layer network, then update the weights and biases iteratively. The supervised training of Step 2 makes the computation faster and more accurate.
Step 3: Remove the last output layer of the four-layer bottleneck feature extraction network of Step 2, retaining only the first three layers as the front end of the coding network, and use the bottleneck feature sequence (v1,v2,...,vn) output by the front end as the input of an RNN. This RNN is the back end of the coding network; it consists of two groups of multi-layer GRU-based RNNs, a forward network and a backward network.
The input FBank speech feature vector X thus passes through the front-end bottleneck feature extraction network to produce the bottleneck feature sequence (v1,v2,...,vn), and the encoder back end, a multi-layer GRU-based RNN, maps this sequence to the high-level feature sequence (h1,h2,...,hT).
The coding network is composed of the bottleneck feature extraction network of Step 1 and the recurrent neural network of Step 3; the bottleneck feature extraction network is called the front end of the coding network. GRU stands for Gated Recurrent Unit; RNN stands for Recurrent Neural Network.
Step 4: Using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h1,h2,...,hT) output by the coding network back end into convolution kernels F_i of different scales, modeling speech units at the phoneme, syllable, and word levels respectively to obtain the corresponding target vectors ct_o, which facilitate back-end classification. Each model's ct_o then serves as the input of the decoding RNN of the corresponding head in the multi-scale multi-head attention network, which computes the RNN hidden-state sequence (s1,s2,...,so) step by step and outputs the phoneme sequence (y1,y2,...,yo).
Step 5: Using the decoding RNN attached to each head of the multi-scale multi-head attention network, compute the output sequences (y1,y2,...,yo) and splice them into a new output sequence (y1,y2,...,yo)_i; then integrate the new output sequences into the final output sequence (Y1,Y2,...,Yo) with a DNN. Here, the output y_o of the decoding network of each head is generated from the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1.
The invention has the following benefits:
The present invention provides a speech recognition method based on bottleneck features and a multi-scale multi-head attention mechanism. The method uses a DBN as the front end of the coding network to extract bottleneck features, which increases the robustness of the speech recognition system, while the decoding network uses a multi-scale multi-head attention model built from convolution kernels of different scales, so that speech units at different levels (phoneme, syllable, word) can each be modeled, improving the recognition accuracy of the system.
The multi-scale multi-head attention model designed by this method outperforms conventional models in phoneme sequence recognition and has stronger temporal modeling and discrimination ability. The designed training model improves recognition accuracy and reduces the system's misclassification and rejection rates.
Detailed description of the invention
Fig. 1 is the multi-scale multi-head attention model combined with the bottleneck feature extraction network;
Fig. 2 is the DBN-based bottleneck feature extraction network.
Specific embodiment
Specific embodiment 1:
The speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism of this embodiment comprises the following steps:
Step 1: Use the input sample FBank speech feature vectors X=(x1,x2,...,xT) to perform unsupervised training of the RBM networks in the DBN, obtaining the first three connection weight matrices W1, W2, W3 that initialize the coding network; these three weight matrices, together with one randomly initialized output-layer weight matrix W4, form the DBN-based bottleneck feature extraction network at the front end of the coding network. RBM denotes a restricted Boltzmann machine (Restricted Boltzmann Machine); DBN denotes a deep belief network (Deep Belief Network); FBank denotes filter-bank features. In the sample FBank speech feature vector X=(x1,x2,...,xT), x1 is the FBank feature of the first frame, x2 that of the second frame, and xT that of frame T; these frame features together constitute the speech feature vector. FBank features are mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) without the discrete cosine transform step.
Step 2: Feed the sampled FBank speech feature vectors X=(x1,x2,...,xT) into the four-layer bottleneck feature extraction network of Step 1 and obtain the network output v by forward computation. Then, from the network output v and the target output of the training data X, perform supervised training of the bottleneck feature extraction network with the BP algorithm: compute the gradient of each layer's weights W_l and biases b_l in the four-layer network, then update the weights and biases iteratively. The supervised training of Step 2 makes the computation faster and more accurate.
Step 3: Remove the last output layer of the four-layer bottleneck feature extraction network of Step 2, retaining only the first three layers as the front end of the coding network, and use the bottleneck feature sequence (v1,v2,...,vn) output by the front end as the input of an RNN. This RNN is the back end of the coding network; it consists of two groups of multi-layer GRU-based RNNs, a forward network and a backward network.
The input FBank speech feature vector X thus passes through the front-end bottleneck feature extraction network to produce the bottleneck feature sequence (v1,v2,...,vn), and the encoder back end, a multi-layer GRU-based RNN, maps this sequence to the high-level feature sequence (h1,h2,...,hT).
The coding network is composed of the bottleneck feature extraction network of Step 1 and the recurrent neural network of Step 3; the bottleneck feature extraction network is called the front end of the coding network. GRU stands for Gated Recurrent Unit; RNN stands for Recurrent Neural Network.
Step 4: Using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h1,h2,...,hT) output by the coding network back end into convolution kernels F_i of different scales, modeling speech units at the phoneme, syllable, and word levels respectively to obtain the corresponding target vectors ct_o, which facilitate back-end classification. Each model's ct_o then serves as the input of the decoding RNN of the corresponding head in the multi-scale multi-head attention network, which computes the RNN hidden-state sequence (s1,s2,...,so) step by step and outputs the phoneme sequence (y1,y2,...,yo).
Step 5: Using the decoding RNN attached to each head of the multi-scale multi-head attention network, compute the output sequences (y1,y2,...,yo) and splice them into a new output sequence (y1,y2,...,yo)_i; then integrate the new output sequences into the final output sequence (Y1,Y2,...,Yo) with a DNN. Here, the output y_o of the decoding network of each head is generated from the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1.
Specific embodiment 2:
Unlike Specific Embodiment 1, in Step 1 of this embodiment the input speech feature vector X=(x1,x2,...,xT) uses 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, for a total of 123 dimensions. The extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the training-set normalization parameters are then used to normalize the development-set and test-set features.
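The 123-dimensional feature layout described above (41 static dimensions plus first- and second-order differences) and the training-set-only normalization can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; the standard delta-regression formula and the function names `add_deltas` and `cmvn` are assumptions.

```python
import numpy as np

def add_deltas(feats, N=2):
    """Append first- and second-order differences (standard delta regression).

    feats: (T, D) static features, e.g. 40 FBank dims + 1 energy = 41 dims.
    Returns (T, 3*D): statics, deltas, delta-deltas -> 123 dims for D=41.
    """
    T = len(feats)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    delta = sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
                for n in range(1, N + 1)) / denom
    # second-order differences: deltas of the deltas
    padded_d = np.pad(delta, ((N, N), (0, 0)), mode="edge")
    ddelta = sum(n * (padded_d[N + n:T + N + n] - padded_d[N - n:T + N - n])
                 for n in range(1, N + 1)) / denom
    return np.concatenate([feats, delta, ddelta], axis=1)

def cmvn(train, *others):
    """Normalize with mean/std estimated on the training set only, then
    apply the same statistics to the development and test sets."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0) + 1e-8
    return tuple((x - mu) / sigma for x in (train,) + others)
```

After `cmvn`, each training-set component has zero mean and unit variance, matching the standard-normal requirement stated above.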
Specific embodiment 3:
Unlike Specific Embodiment 1 or 2, in Steps 1 and 2 of this embodiment the RBM networks in the DBN undergo an unsupervised training process, where RBM training combines unsupervised pre-training (pretraining) with the supervised back-propagation gradient training algorithm. The input of the RBM network is the FBank speech features; the output layer of the network is a softmax layer, in which each output unit corresponds to the posterior probability of a tied triphone state. Between the input and output layers there are three hidden layers; the second hidden layer is the bottleneck layer, whose number of units is smaller than that of the other hidden layers.
Specific embodiment 4:
Unlike Specific Embodiment 1 or 2, in Step 1 of this embodiment the process of unsupervised training of the RBM network with the input FBank speech features is as follows:
Step 1-1: Add one layer that outputs phoneme-state classification labels on top of the 4-layer DBN, yielding a 5-layer bottleneck feature extraction RBM network; the speech features serve as the input of this 5-layer network. The 4-layer DBN is built by stacking 4 single-layer RBMs: the first is a Gauss-Bernoulli RBM and the remaining three are Bernoulli-Bernoulli RBMs.
Step 1-2: In an RBM there are no connections among the neurons within the visible layer or within the hidden layer, so the units within a layer are mutually independent given the other layer. The posterior probabilities p(v|h) and p(h|v) can therefore be computed easily. For a Bernoulli-Bernoulli RBM the posterior p(h|v) is derived from the energy function as follows:
E(v,h) = -a'v - b'h - v'Wh,  p(h|v) = prod_i p(h_i|v)   (1)
where E(v,h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*}v denotes the product of the i-th row of W with v. Formula (1) shows that, given the visible layer, the components of the hidden-layer state are mutually independent.
Step 1-3: Since each component h_i of the Bernoulli-Bernoulli RBM hidden state takes only the values 0 and 1, we have:
p(h_i = 1 | v) = sigma(W_{i,*}v + b_i)   (2)
where sigma(.) denotes the sigmoid function.
The vector of hidden-layer states can then be written as:
P (h=1 | v)=σ (Wv+b) (3)
By symmetry, given the hidden-layer state, the visible-layer state of a single-layer Bernoulli-Bernoulli RBM is:
P (v=1 | h)=σ (WTh+a) (4)
Step 1-4: For a Gauss-Bernoulli RBM, whose visible layer follows a Gaussian distribution, the posterior p(h=1|v) is computed as in formula (3), while p(v|h) is computed as:
P (v | h)=N (v;WTh+a,I) (5)
where I denotes the identity matrix.
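The three conditionals of formulas (3)-(5) can be sketched directly in NumPy. This is a minimal illustration under the conventions above (W maps visible to hidden); the function names are illustrative, not from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # formula (3): P(h=1|v) = sigma(W v + b); W has shape (hidden, visible)
    return sigmoid(W @ v + b)

def p_v_given_h_bernoulli(h, W, a):
    # formula (4): P(v=1|h) = sigma(W^T h + a), Bernoulli-Bernoulli RBM
    return sigmoid(W.T @ h + a)

def sample_v_given_h_gaussian(h, W, a, rng):
    # formula (5): v | h ~ N(W^T h + a, I), Gauss-Bernoulli RBM
    mean = W.T @ h + a
    return mean + rng.standard_normal(mean.shape)
```

Because the units within a layer are conditionally independent, each conditional factorizes per component and a single matrix-vector product suffices.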
Step 1-5: The stacking process of the DBN for encoding is as follows:
First, train a Gauss-Bernoulli RBM, because the first-layer input speech features are real-valued and match a Gaussian distribution. Then use the activation probabilities of its hidden units as the visible-layer input of the next Bernoulli-Bernoulli RBM; the hidden state of the second-layer RBM in turn serves as the input of the third-layer RBM, and so on for all subsequent layers.
Given a training data set V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood with gradient descent:
L(theta) = -sum_{v in V} log p(v; theta)   (6)
During training, the general form of the derivative of the negative log-likelihood is:
-d log p(v)/d theta = <dE(v,h)/d theta>_data - <dE(v,h)/d theta>_model   (7)
where theta = {a, b, W} determines the distribution of the RBM's visible vector v, and <xi>_data and <xi>_model denote the expected value of the random variable xi estimated from the sample data and from the model, respectively. For the weights of the connection matrix in a single-layer RBM:
d log p(v)/d W_ij = <v_i h_j>_data - <v_i h_j>_model   (8)
Taking a Bernoulli-Bernoulli RBM as an example, the expectation <v_i h_j>_data is the frequency with which the visible unit v_i and the hidden unit h_j are simultaneously 1 in the training data, while <v_i h_j>_model is the expectation computed under the RBM's generative model distribution.
Step 1-6: Train the DBN:
First, initialize the first-layer RBM from the input sample data with the contrastive divergence (CD) algorithm. In the CD algorithm the model expectation is estimated with samples obtained by Gibbs sampling (Gibbs Sample); the Gibbs sampling process in the CD algorithm is as follows:
1) Gibbs sampling initializes the visible layer with a data sample, giving the visible state v;
2) compute the posterior p(h|v) with formula (3) to obtain the hidden state h;
Next, depending on the RBM type, reconstruct the visible state v from the sample using formula (4) or formula (5), respectively. Then fix the parameters of the first-layer RBM, compute its hidden state from the sample data, and use it as the visible state of the second-layer RBM; at this point the second-layer RBM is trained with the CD algorithm in the same way;
Proceeding in this manner, the weight parameters connecting the successive layers of the DBN are obtained one by one;
Finally, after training, only the weight matrices connecting the layers are retained, as the initialization parameters of the DBN transfer matrices.
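The greedy layer-wise procedure above can be sketched with a minimal CD-1 update in NumPy. This is an illustrative sketch only: it uses Bernoulli-Bernoulli updates for every layer (the patent's Gaussian first layer is omitted for brevity), and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr=0.1):
    """One CD-1 step on a minibatch V of shape (M, n_visible)."""
    ph0 = sigmoid(V @ W.T + b)                        # up-pass, formula (3)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # Gibbs: sample hidden state
    pv1 = sigmoid(h0 @ W + a)                         # reconstruction, formula (4)
    ph1 = sigmoid(pv1 @ W.T + b)
    M = len(V)
    # <v h>_data - <v h>_model, formula (8)
    W += lr * (ph0.T @ V - ph1.T @ pv1) / M
    a += lr * (V - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

def pretrain_dbn(X, layer_sizes, epochs=5):
    """Greedy layer-wise pretraining: each RBM's hidden activation
    probabilities feed the next RBM's visible layer."""
    weights, inp = [], X
    for n_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((n_hidden, inp.shape[1]))
        a, b = np.zeros(inp.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            W, a, b = cd1_update(inp, W, a, b)
        weights.append((W, b))              # keep only connecting weights
        inp = sigmoid(inp @ W.T + b)        # activations become next input
    return weights
```

As in the text, only the connecting weight matrices (and hidden biases) are kept at the end, to initialize the DNN transfer matrices.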
Specific embodiment 5:
Unlike Specific Embodiment 1 or 2, in Step 2 of this embodiment the RBM network is trained with supervision using the BP algorithm. That is, after the network has been trained without supervision on the sample input data by the RBM technique, the first three connection weight matrices of the coding network are initialized, and the trained weight matrices partly reflect and "memorize" the useful information in the sample data. To further strengthen the network's ability to extract and discriminate speech features, so that the bottleneck features supplied to the subsequent attention model are more robust and discriminative, supervised training with speech-sample labels is performed on top of the unsupervised training. The detailed process is:
Step 2-1: Transfer the DBN weights directly into a deep neural network (Deep Neural Network, DNN); then fine-tune (fine tuning) on the basis of the known partial DNN model parameters. The first three connection weight matrices W1, W2, W3 of the network are known; these three layers and one randomly initialized output-layer weight matrix W4 form the DBN-based bottleneck feature extraction network at the front end of the coding network. Activation functions are inserted between the layers, and the forward pass of the network is computed as follows:
Let f(.) denote the activation function; the first hidden state h1 and the l-th hidden state h_l are respectively:
h1=f (W1x+b1) (9)
hl=f (Wlhl-1+bl) (10)
By the 4th hidden layer h4The last output vector v of network is obtained by softmax function:
V=softmax (W4h3+b4) (11)
When training the network with the above method, a fixed number of samples is randomly drawn from the training set each time, the features of the drawn samples are input to the network, and the network output is obtained by forward computation;
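The forward pass of formulas (9)-(11) can be sketched as follows. A minimal NumPy illustration assuming tanh as the activation f; `params` holds the four (W, b) pairs, with the second hidden layer as the narrow bottleneck.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, params, f=np.tanh):
    """Forward pass of the 4-layer bottleneck network:
    h1 = f(W1 x + b1), h_l = f(W_l h_{l-1} + b_l), v = softmax(W4 h3 + b4).
    params = [(W1,b1), (W2,b2), (W3,b3), (W4,b4)]."""
    h, hiddens = x, []
    for W, b in params[:-1]:
        h = f(W @ h + b)       # formulas (9) and (10)
        hiddens.append(h)
    W4, b4 = params[-1]
    return softmax(W4 @ h + b4), hiddens   # formula (11)
```

The returned `hiddens[1]` is the bottleneck-layer state that later serves as the front-end output.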
Step 2-2: Compute the gradient of each layer's parameters with the BP algorithm from the network output and the training labels, then update the model parameters:
DNN training generally uses the minimum cross-entropy (cross entropy, CE) criterion:
L_CE = -sum_k t_k log v_k   (12)
where t is the target label distribution and v the network output. The network parameters are then fine-tuned with the back-propagation algorithm (BackPropagation, BP), which typically updates the model parameters according to:
W_l^{t+1} = W_l^t - mu * dW_l^t   (13)
b_l^{t+1} = b_l^t - mu * db_l^t   (14)
where W_l^t and b_l^t are the weight matrix and bias vector of layer l after the t-th parameter update, mu is the learning rate, and dW_l^t and db_l^t are the gradients of the weight matrix and bias vector, computed from the first-order derivatives over the training samples:
dW_l^t = (1/M) sum_{m=1..M} dL_CE(m)/dW_l   (15)
db_l^t = (1/M) sum_{m=1..M} dL_CE(m)/db_l   (16)
where M is the number of samples used in each training step;
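For the softmax output layer, the cross-entropy gradient of formulas (15)-(16) has the well-known closed form v - t, and the update of formulas (13)-(14) is a plain gradient step. A minimal sketch for the output layer only (the function names are illustrative):

```python
import numpy as np

def output_layer_grads(H, T_onehot, W4, b4):
    """Minibatch CE gradient for the softmax output layer.
    H: (M, d) bottleneck-layer activations; T_onehot: (M, K) one-hot labels.
    For softmax + cross entropy the error signal is simply v - t."""
    Z = H @ W4.T + b4
    V = np.exp(Z - Z.max(axis=1, keepdims=True))
    V /= V.sum(axis=1, keepdims=True)
    delta = V - T_onehot                  # dL_CE / dz
    M = len(H)
    dW = delta.T @ H / M                  # formula (15), averaged over M
    db = delta.mean(axis=0)               # formula (16)
    return dW, db

def sgd_step(W, b, dW, db, mu=0.1):
    # formulas (13)-(14): parameter minus learning rate times gradient
    return W - mu * dW, b - mu * db
```

Repeating `output_layer_grads` and `sgd_step` over random minibatches is exactly the loop of Steps 2-1 and 2-2.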
Step 2-3: Execute Steps 2-1 and 2-2 cyclically, using whether the error between the network output and the target output falls below a set threshold as the criterion for ending training:
First, during training, the error between the network output and the reference output is usually also computed on a development set (development set) for cross validation (cross validation);
Then, to obtain the bottleneck feature extraction network, delete the state units, transfer matrix, and bias vector of the last output layer of the trained network, use what remains as the front end of the coding network, and take the state of the bottleneck layer as the output of the coding network front end. To extract bottleneck features, the raw speech feature parameters are used as the network input and the bottleneck-layer output vector is computed layer by layer; this vector is the bottleneck feature parameter. Because the model parameters are initialized with RBMs during training, and the alignment information provided by a GMM-HMM model is used during supervised training, the extracted bottleneck features discriminate different phoneme states better and are more robust than traditional features. Applying these bottleneck features to the attention model is therefore equivalent to migrating the knowledge in the network into the attention model.
Specific embodiment 6:
Unlike Specific Embodiment 5, in this embodiment's method:
the multi-scale multi-head attention model consists of three modules: the coding network, the decoding network, and the attention network;
the coding network and the decoding network contain recurrent neural network units;
the front end of the coding network is the DBN feature extraction network, and the back end is a deep recurrent neural network whose purpose is to learn and mine the contextual information of the speech feature sequence and extract high-level information from the primitive features, enhancing their discrimination and representation ability;
the main body of the attention network is a multilayer perceptron; its inputs are the output of the coding network and the hidden-unit state of the decoding network, and its output is their relevance score, which after exponential normalization becomes the attention coefficient;
the decoding network consists of a single-layer recurrent neural network connected to a maxout network; it first obtains the attention coefficients from the attention network, then computes the target vector as the weighted sum of all coding network outputs, and finally takes the target vector as the network input and computes the posterior probability of each phoneme at each position of the output sequence.
Specific embodiment 7:
Unlike Specific Embodiment 6, in Step 3 of this embodiment the high-level feature sequence (h1,h2,...,hT) is constructed by the front and back ends of the coding network of the multi-scale multi-head attention model. The detailed process is:
The back end of the coding network consists of two groups of multi-layer recurrent neural networks based on GRU units, a forward network and a backward network. This bidirectional structure passes past and future information simultaneously, increasing the information content of the high-level features.
Step 3-1: First, take the trained DBN bottleneck features (v1,v2,...,vn) from the coding network front end as the input of the recurrent neural network; then reduce the number of stacked recurrent layers, leaving only one layer to process the temporal information of the feature sequence, and output the high-level feature sequence (h1,h2,...,hT);
Step 3 two, in coding network rear end moment t, the implicit layer state point of c layers of feedforward network and backward network It is notWithDue to the hidden layer unit information direction of transfer of feedforward network and backward network on the contrary, then corresponding transmitting letter Number is respectively formula (17) and formula (18):
Step 3 two, feedforward network and backward network carry out feature between layers by respective transmission function parallel Transmitting, carried out in transmittance process in the time domain it is down-sampled, to reach the target for reducing calculation amount.With in feedforward network, due to will GRU is as basic unit, so hidden layer state computation such as formula (17) (18) is shown, by input feature vector sequence (x1,x2,..., xT) the 1st layer of implicit layer state can be obtainedSimilarly, c layers of implicit stratiform can be calculated by c-1 layers of implicit layer state StateCalculating process implies shown in layer state such as formula (19):
Similarly, to c layers of implicit layer state of network after calculating;
Step 3 three, c layers of hidden state for splicing feedforward network and backward network, obtain coding network moment t's High-level characteristic ht:
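The bidirectional encoding just described (a forward and a backward GRU pass whose hidden states are spliced per frame) can be sketched in NumPy as follows. This is an illustrative sketch, not part of the patent: the GRU gate equations are the standard ones, and all dimensions, initializations, and function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_cand

def init_gru(d_in, d_h, rng):
    """Random small-scale initialization (illustrative)."""
    p = {}
    for g in ("z", "r", "h"):
        p["W" + g] = 0.1 * rng.standard_normal((d_h, d_in))
        p["U" + g] = 0.1 * rng.standard_normal((d_h, d_h))
        p["b" + g] = np.zeros(d_h)
    return p

def encode(v_seq, p_fwd, p_bwd, d_h):
    """Run forward and backward GRUs over the bottleneck features and
    splice their per-frame hidden states (cf. formulas (17)-(20))."""
    T = len(v_seq)
    h = np.zeros(d_h)
    fwd = []
    for t in range(T):                  # forward pass, formula (17)
        h = gru_step(v_seq[t], h, p_fwd)
        fwd.append(h)
    h = np.zeros(d_h)
    bwd = [None] * T
    for t in range(T - 1, -1, -1):      # backward pass, formula (18)
        h = gru_step(v_seq[t], h, p_bwd)
        bwd[t] = h
    # splice forward and backward states into the high-level feature h_t
    return [np.concatenate([fwd[t], bwd[t]]) for t in range(T)]
```

Each output frame then has twice the hidden dimension, carrying both past and future context.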
Specific embodiment 8:
Different from specific embodiment seven, in the present embodiment of the speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism, the construction process of the attention network of the multi-scale multi-head attention model in step four is as follows:
The attention mechanism is a technique for extracting effective features from a feature sequence in sequence-to-sequence models. It assigns attention weights to the relevant region of the feature sequence so that irrelevant features are ignored. Using the multi-scale multi-head attention mechanism, to compute the element y_o of the output sequence of the sequence-to-sequence model, the feature sequence (h_1, h_2, ..., h_T) is mapped through convolution kernels F_i of different scales to one target vector ct_o per head of the attention mechanism. The target vector ct_o compresses the important information of the sequence, transforming the important feature information into a fixed-length vector and thereby facilitating classification at the back end. Then ct_o serves as the input of the decoding network behind each head's attention network, and the recurrent neural network computes, element by element, the hidden-state sequence (s_1, s_2, ..., s_o) and the output sequence (y_1, y_2, ..., y_o). Specifically:
The single-head attention model within the multi-scale multi-head attention model uses the attention mechanism to compute the target vector ct_o corresponding to output-sequence position o ∈ {1, 2, ..., O} as follows:
First, the degree of association between the hidden state s_{o-1} at output position o-1 and the feature at time t is computed, as shown in formula (21):
e_{o,t} = a(s_{o-1}, h_t)  (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, which can be written explicitly as:
e_{o,t} = ω^T tanh(W[s_{o-1}, h_t, f_{o,t}] + b)  (22)
f_o = F_i ∗ α_{o-1}  (23)
where W and b are the input-to-hidden weight matrix and bias vector, respectively, ω is the hidden-to-output weight vector, and F_i is a convolution kernel. For the multi-scale multi-head attention mechanism the sizes of the kernels F_i differ, the convolutional part of each head using a convolution filter of a different size. Owing to the multiple heads and the differently sized convolution filters, the model can not only use the representations of different subspaces at different sequence positions to process the sequence data; the differently sized filters can also model information at the phoneme, syllable, and word levels.
Here, the multi-scale multi-head model not only initializes the convolution-filter parameters randomly in different ways, so that the multi-head model forms an ensemble; the differently sized convolution filters also represent different models, allowing the attention scores of the previous step to be learned and trained under the supervision of multiple models and yielding models of different speech primitives. Models with smaller convolution filters mainly represent phoneme-level models, models of medium size mainly represent syllable-level models, and larger ones represent word-level models. The multi-scale multi-head attention model can therefore model speech primitives of different granularities, achieving the goal of improving speech recognition performance.
Then the degrees of association of the features at all times are exponentially normalized; the normalized values, called the attention coefficients, are:
Finally, taking the attention coefficients as weights, the features at all times are weighted and summed to obtain, under the attention mechanism, the target vector ct_o at output-sequence position o:
The process of computing the target vector with the attention mechanism can be abbreviated as formula (26):
ct_o = attention([h_1, h_2, ..., h_T], s_{o-1})  (26)。
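One head's attention step, formulas (21)-(26), can be sketched in NumPy as below. This is an illustrative sketch, not the patent's implementation: the example kernel widths (e.g. 3, 11, 31 for the phoneme, syllable, and word scales) and all dimensions are assumptions.

```python
import numpy as np

def attend(H, s_prev, alpha_prev, F_i, W, b, omega):
    """Location-aware attention for a single head.
    H: (T, d_h) encoder outputs; s_prev: (d_s,) previous decoder state;
    alpha_prev: (T,) previous attention coefficients; F_i: 1-D convolution
    kernel whose width sets this head's scale."""
    T = H.shape[0]
    f = np.convolve(alpha_prev, F_i, mode="same")   # f_o = F_i * alpha_{o-1}, (23)
    e = np.empty(T)
    for t in range(T):
        u = np.concatenate([s_prev, H[t], [f[t]]])  # [s_{o-1}, h_t, f_{o,t}]
        e[t] = omega @ np.tanh(W @ u + b)           # score, formula (22)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                            # exponential normalization, (24)
    ct = alpha @ H                                  # weighted sum -> ct_o, (25)
    return ct, alpha
```

In the multi-head setting, each head would call `attend` with its own kernel `F_i` and its own `W, b, omega`, yielding one target vector per head.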
Specific embodiment 9:
Different from specific embodiment eight, in the present embodiment of the speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism, the process in step five of computing the element y_o at position o of the output sequence (y_1, y_2, ..., y_o) with the decoding network corresponding to each head's attention network is:
For a single head of the multi-scale multi-head attention model, when the decoding network behind that head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o), the inputs of the network are the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1. The computation proceeds as follows:
First, the decoding network obtains the target vector ct_o from the attention-network computation and takes ct_o and y_{o-1} as the input of the GRU-based recurrent neural network, computing its hidden state s_o as shown in formula (27):
s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])  (27)
Then, with the decoding-network hidden state s_o ∈ R^d given as input, the posterior probability of phoneme i is obtained by the maxout network computation. Each hidden unit of the maxout network has multiple candidate units and selects the one with the largest value as its output; the computation is as follows:
where d is the dimension of the input hidden state s_o, corresponding to the number of hidden units, W_{i,j} ∈ R^{d×m×k} and b_{i,j} ∈ R^{m×k} are the maxout network's parameter matrix and bias vector, respectively, k is the number of candidate units of each maxout hidden unit, and m is the number of output units, corresponding to the number of output phoneme types in the acoustic model;
Finally, as shown in formula (30), the output layer of the maxout network yields the output vector y_o, whose i-th component represents the posterior probability that phoneme i occurs at output-sequence position o;
Finally, all output sequences (y_1, y_2, ..., y_o)_i are integrated into one new output sequence (Y_1, Y_2, ..., Y_o) with a DNN:
Y_i = DNN((y_1, y_2, ..., y_o)_i)  (31)。
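The maxout output layer of the decoder (formulas (28)-(30)) can be sketched in NumPy as below. This is an illustrative sketch under assumed parameter shapes (here the parameters are packed as W: (m, k, d) and b: (m, k)), not the patent's implementation.

```python
import numpy as np

def maxout_posteriors(s_o, W, b):
    """Maxout output layer: s_o is the (d,) decoder hidden state; each of
    the m output units has k candidate linear activations, keeps the
    largest one, and a softmax then yields the phoneme posteriors y_o."""
    z = np.einsum("mkd,d->mk", W, s_o) + b  # all m*k candidate activations
    y = z.max(axis=1)                       # maxout: best candidate per unit
    y = np.exp(y - y.max())                 # numerically stable softmax
    return y / y.sum()                      # posterior over m phoneme types
```

The per-head posterior vectors produced this way would then be merged by the integrating DNN of formula (31).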
In phoneme-sequence recognition, the multi-scale multi-head attention model outperforms traditional models, having stronger sequence-modeling and discriminative ability. To increase the robustness of the model in noisy environments, DBN-based bottleneck features are added on this basis, for the following reasons. First, because the DBN-based bottleneck feature-extraction network has no recurrent structure, the gradients of multiple frames can be computed in parallel on a GPU when training the DBN, effectively improving training efficiency and avoiding the large scale and long training time brought by the multilayer structure of recurrent neural networks. Second, reducing the number of recurrent layers shortens the distance over which gradients are back-propagated through the network, further reducing the training cost; at the same time, using more discriminative features makes the training of the model easier to converge, further improving training efficiency. Finally, the DBN-extracted bottleneck features are more robust to noise, which compensates for the weak noise robustness of the multi-scale multi-head attention model. The multi-scale model not only initializes the convolution-filter parameters randomly in different ways, so that the multi-head model forms an ensemble that outperforms a single-head model; the differently sized convolution filters also represent different models, allowing the attention scores of the previous step to be learned and trained under the supervision of multiple models and yielding models of different speech primitives. Here, models with smaller convolution filters mainly represent phoneme-level models, models of medium size mainly represent syllable-level models, and larger filter sizes correspond to word-level models. The multi-scale multi-head attention model can therefore model speech primitives of different granularities, achieving the goal of improving speech recognition performance.
The above shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should appreciate that the present invention is not limited to the above embodiments; the above embodiments and description only illustrate the principle of the invention, and various changes and improvements may be made to the invention without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The claimed scope of the invention is defined by the appended claims and their equivalents.

Claims (9)

1. A speech recognition model establishing method based on bottleneck features and a multi-scale multi-head attention mechanism, characterized in that the method comprises the following steps:
Step one: using the input sample FBank speech feature vectors X = (x_1, x_2, ..., x_T), perform unsupervised training of the RBM networks in the DBN to obtain the first three connection-weight matrices W_1, W_2, W_3 that initialize the coding network; these three connection-weight matrices together with an output layer W_4 of randomly initialized weights form the DBN-based bottleneck feature-extraction network at the front end of the coding network; RBM denotes a restricted Boltzmann machine; DBN denotes a deep belief network; FBank denotes a filter bank;
Step two: draw sample FBank speech feature vectors X = (x_1, x_2, ..., x_T) and input them into the four-layer bottleneck feature-extraction network of step one, obtaining the network output v by forward computation; then, according to the output v of the bottleneck feature-extraction network and the actual output of the sample training data X, perform supervised training of the bottleneck feature-extraction network with the BP algorithm, computing the gradients of the weights ΔW_t^l and biases Δb_t^l of every layer of the four-layer network, and iteratively updating the weights W_t^l and biases b_t^l;
Step three: remove the last output layer of the four-layer bottleneck feature-extraction network of step two, retaining only the first three layers as the front end of the coding network; the bottleneck feature sequence (v_1, v_2, ..., v_n) output by the front end serves as the input of the RNN, which here is the back end of the coding network and consists of two groups of multilayer GRU-based RNNs, namely a forward network and a backward network;
The input FBank speech feature vectors X thus pass through the bottleneck feature-extraction network at the coding-network front end to yield the bottleneck feature sequence (v_1, v_2, ..., v_n), and this sequence passes through the multilayer GRU-based RNN at the encoder back end to yield the final output, the high-level feature sequence (h_1, h_2, ..., h_T);
Step four: using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h_1, h_2, ..., h_T) output by the coding-network back end into convolution kernels F_i of different scales, modeling phoneme-, syllable-, and word-level speech primitives to obtain the respective target vectors ct_o; then take the ct_o corresponding to each model as the input of the decoding RNN of each head of the multi-scale multi-head attention network, computing one by one the RNN hidden-state sequence (s_1, s_2, ..., s_o) and the output phoneme sequence (y_1, y_2, ..., y_o);
Step five: splice the output sequences (y_1, y_2, ..., y_o) computed by the decoding RNNs corresponding to each head's attention network in the multi-scale multi-head attention network to obtain new output sequences (y_1, y_2, ..., y_o)_i; integrate the new output sequences into the final output sequence (Y_1, Y_2, ..., Y_O) using a DNN.
2. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 1, characterized in that in step one the input FBank speech feature vectors X = (x_1, x_2, ..., x_T) use 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, for a total of 123 parameters; the extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the normalization parameters of the training set are then used to normalize the features of the test and development sets.
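The feature preparation of this claim (41 static parameters spliced with their first- and second-order differences into 123 dimensions, then mean-variance normalized with training-set statistics) can be sketched as follows. This is an illustrative sketch: a plain frame difference stands in for the usual regression-based delta computation, which the patent does not specify.

```python
import numpy as np

def add_deltas(feats):
    """Splice first- and second-order differences onto the static
    features, turning (T, 41) FBank+energy frames into (T, 123)."""
    d1 = np.diff(feats, axis=0, prepend=feats[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    return np.hstack([feats, d1, d2])

def fit_cmvn(train):
    """Estimate per-component mean/std on the training set only."""
    return train.mean(axis=0), train.std(axis=0) + 1e-8

def apply_cmvn(x, mean, std):
    """Normalize any set (train/dev/test) with training-set statistics."""
    return (x - mean) / std
```

Applying `apply_cmvn` with the training-set statistics to the development and test sets matches the claim's requirement that all sets share one normalization.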
3. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 2, characterized in that in steps one and two, in the unsupervised training of the RBM networks in the DBN, the input of the RBM network is the FBank speech features and the output layer of the RBM network is a softmax layer, each output-layer unit corresponding to the posterior probability of a tied triphone state; there are three hidden layers between the input and output layers, the second hidden layer being the bottleneck layer, whose number of state units is smaller than that of the other hidden layers.
4. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 3, characterized in that in step one the process of unsupervised training of the RBM networks in the DBN using the input FBank speech feature vectors X = (x_1, x_2, ..., x_T) is:
Step one-one: add one layer of output phoneme-state classification labels on top of the 4-layer DBN, obtaining a 5-layer bottleneck structure; the speech features serve as the input of the RBM networks of this 5-layer bottleneck feature extractor; the 4-layer DBN is formed by stacking 4 single-layer RBMs, the first being a Gaussian-Bernoulli RBM and the remaining three Bernoulli-Bernoulli RBMs;
Step one-two: in an RBM, the posterior probability p(h|v) of a Bernoulli-Bernoulli RBM is computed as follows:
where E(v, h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*}v denotes the product of the i-th row of W with v; the components of the hidden state are mutually independent;
Step one-three: since each component h_i of the hidden-layer neuron state of a Bernoulli-Bernoulli RBM takes only the values 0 and 1, we have:
where σ(·) denotes the sigmoid function;
The vector of hidden states can be expressed as:
p(h = 1|v) = σ(Wv + b)  (3)
By symmetry, given the hidden states, the visible states of a single-layer Bernoulli-Bernoulli RBM are:
p(v = 1|h) = σ(W^T h + a)  (4)
Step one-four: for the Gaussian-Bernoulli RBM, whose visible layer is Gaussian, p(h = 1|v) is computed as in formula (3), while p(v|h) is computed as:
p(v|h) = N(v; W^T h + a, I)  (5)
where I denotes the identity matrix;
Step one-five: the stacking process of the DBN used for encoding is:
First, a Gaussian-Bernoulli RBM is trained; then the activation probabilities of its hidden units serve as the visible-layer input of the next, Bernoulli-Bernoulli, RBM; the hidden states of the second-layer RBM in turn serve as the input of the third-layer RBM, and so on for all subsequent layers;
Given the training dataset V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood using gradient descent:
During training, the general form of the derivative of the negative log-likelihood is:
where θ = {a, b, W} determines the distribution of the RBM's visible-layer vector v, and ⟨ξ⟩_data and ⟨ξ⟩_model denote the expected values of the random variable ξ estimated from the sample data and from the model, respectively; for the weights of the connection matrix in a single-layer RBM, we have:
Step one-six: train the DBN:
First, initialize the first-layer RBM from the input sample data with the contrastive divergence (CD) algorithm, in which the model expectation is estimated from samples obtained by Gibbs sampling; the Gibbs sampling process in the CD algorithm is specifically:
1) the Gibbs sampler initializes the visible layer with a data sample, giving the visible state v;
2) the posterior probability p(h|v) is computed with formula (3) and taken as the hidden state h;
Next, depending on the type of RBM, formula (4) or formula (5) is used to reconstruct the visible state v from the sample; afterwards the parameters of the first-layer RBM are fixed, the hidden states of that RBM are computed from the sample data and taken as the visible states of the second-layer RBM, which is then trained with the CD algorithm in the same way;
Proceeding backward in this manner, the weight parameters of the DBN connections between all layers are obtained one by one;
Finally, after training, only the weight matrices connecting layer to layer are retained as the initialization parameters of the DBN transfer matrices.
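The CD step of this claim can be sketched for a Bernoulli-Bernoulli RBM as the standard CD-1 update. This is an illustrative sketch, not the patent's implementation: learning rate, shapes, and the use of mean-field probabilities in the reconstruction are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update of a Bernoulli-Bernoulli RBM (cf. formulas (3), (4), (8)).
    v0: (n_v,) data sample; W: (n_h, n_v); a: visible bias; b: hidden bias."""
    ph0 = sigmoid(W @ v0 + b)                  # p(h=1|v), formula (3)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # Gibbs sample of h
    pv1 = sigmoid(W.T @ h0 + a)                # reconstruction p(v=1|h), formula (4)
    ph1 = sigmoid(W @ pv1 + b)
    # positive phase <vh>_data minus negative phase <vh>_model
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, pv1))
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)
    return W, a, b
```

Stacking then proceeds as in the claim: freeze the trained layer, feed its hidden activations to the next RBM, and repeat.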
5. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 4, characterized in that in step two, the supervised training of the bottleneck feature-extraction network with the BP algorithm according to its output v and the actual output of the sample training data X, i.e., supervised training of the speech samples with their labels on top of the unsupervised training, proceeds as follows:
Step two-one: transfer the DBN weights directly into the deep neural network; thereafter fine-tune on the basis of the known partial DNN parameters; the first three connection-weight matrices W_1, W_2, W_3 are known, and together with an output layer W_4 of randomly initialized weights they form the DBN-based bottleneck feature-extraction network of the coding-network front end; activation functions are added between layers, and the forward pass of the network is computed:
Let f(·) denote the activation function; the first hidden state h_1 and the l-th hidden state h_l are respectively:
h_1 = f(W_1 x + b_1)  (9)
h_l = f(W_l h_{l-1} + b_l)  (10)
The final output vector v of the network is obtained through the softmax function:
v = softmax(W_4 h_3 + b_4)  (11)
When training the network with the above method, a fixed number of samples is randomly drawn from the training set each time, the features of the drawn samples are input to the network, and the network output is obtained by forward computation;
Step two-two: compute the gradients of every layer's parameters with the BP algorithm according to the network output and the sample training data, then update the model parameters:
The DBN network training criterion is minimization of the cross-entropy function:
Then the network parameters are fine-tuned with the back-propagation algorithm, the model parameters being updated according to the following formulas:
where W_t^l and b_t^l are, respectively, the weight matrix and bias vector of layer l of the network after the t-th parameter update, μ is the learning rate, and ΔW_t^l and Δb_t^l are the gradients of the weight matrix and bias vector, obtained by computing first derivatives over the training samples:
where M is the number of samples used in each training step;
Step two-three: execute steps two-one and two-two cyclically, using whether the error between the network output and the model's standard output reaches a set threshold as the criterion for terminating training:
First, during training the error between the network output and the standard output is computed on the development set for cross-validation;
Then, on this basis, the state units, transfer matrix, and bias vector of the last output layer of the trained network are all deleted, and what remains serves as the front end of the coding network, with the states of the feature-extraction bottleneck layer as the output of the coding-network front end; taking the primitive speech feature parameters as network input, the bottleneck-layer output vector is computed layer by layer through the network; this vector is the bottleneck feature parameter, i.e., the extracted bottleneck feature.
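The supervised fine-tuning of this claim (forward pass of formulas (9)-(11), cross-entropy criterion, gradient update of formulas (13)-(16)) can be sketched as below. This is a deliberately simplified illustration: it updates only the output layer, exploiting the fact that for softmax plus cross-entropy the output-layer error is v − onehot(label); the full BP recursion through all layers is omitted, and all dimensions and the learning rate are assumptions.

```python
import numpy as np

def forward(x, Ws, bs):
    """Forward pass of formulas (9)-(11): sigmoid hidden layers,
    softmax output; returns posteriors and all layer activations."""
    acts = [x]
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))
        acts.append(h)
    z = Ws[-1] @ h + bs[-1]
    e = np.exp(z - z.max())
    return e / e.sum(), acts

def sgd_step_output_layer(x, label, Ws, bs, mu):
    """One cross-entropy SGD update restricted to the output layer W_4, b_4
    (a simplification of formulas (13)-(16))."""
    v, acts = forward(x, Ws, bs)
    g = v.copy()
    g[label] -= 1.0                        # dL/dz for softmax + cross-entropy
    Ws[-1] = Ws[-1] - mu * np.outer(g, acts[-1])
    bs[-1] = bs[-1] - mu * g
    return -np.log(v[label] + 1e-12)       # cross-entropy loss before update
```

After fine-tuning, the output layer would be discarded and the bottleneck-layer activations kept as features, exactly as the claim describes.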
6. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 5, characterized in that the multi-scale multi-head attention model is composed of three modules: a coding network, a decoding network, and an attention network;
The coding network and the decoding network contain recurrent neural network units;
The front end of the coding network is the DBN feature-extraction network, and the back end is a deep recurrent neural network;
The main body of the attention network is a multilayer perceptron; its inputs are the outputs of the coding network and the hidden-layer unit states of the decoding network, and its output is their degree-of-association score; the attention coefficients are obtained by exponential normalization of these scores;
The decoding network consists of a single-layer recurrent neural network connected to a maxout network; it first obtains the attention coefficients from the attention network, then computes the target vector as the weighted sum of all coding-network outputs, and finally takes the target vector as the network input to compute the posterior probability of each phoneme at each position of the output sequence.
7. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 6, characterized in that in step three the high-level feature sequence (h_1, h_2, ..., h_T) is constructed by the front and back ends of the coding network of the multi-scale multi-head attention model; the specific construction process is:
Step three-one: first, the trained DBN bottleneck features (v_1, v_2, ..., v_n) from the coding-network front end are taken as the input of the recurrent neural network; then the number of stacked recurrent layers is reduced, leaving only one layer to process the temporal information of the feature sequence and output the high-level feature sequence (h_1, h_2, ..., h_T);
Step three-two: at time t in the coding-network back end, the c-th-layer hidden states of the forward network and the backward network are computed separately; since the hidden units of the forward and backward networks pass information in opposite directions, the corresponding transfer functions are given by formula (17) and formula (18), respectively:
In step three-two, the forward network uses the GRU as its basic unit, so the hidden states are computed as in formulas (17) and (18): the first-layer hidden states are obtained from the input feature sequence (x_1, x_2, ..., x_T), and likewise the c-th-layer hidden states are computed from the (c-1)-th-layer hidden states, as shown in formula (19):
Similarly, the c-th-layer hidden states of the backward network are computed;
Step three-three: the c-th-layer hidden states of the forward and backward networks are spliced to obtain the high-level feature h_t of the coding network at time t:
8. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 7, characterized in that in step four the construction process of the attention network of the multi-scale multi-head attention model is:
Using the multi-scale multi-head attention mechanism, to compute the element y_o of the output sequence of the sequence-to-sequence model, the feature sequence (h_1, h_2, ..., h_T) is mapped through convolution kernels F_i of different scales to one target vector ct_o per head of the attention mechanism; ct_o then serves as the input of the decoding network behind each head's attention network, and the recurrent neural network computes, element by element, the hidden-state sequence (s_1, s_2, ..., s_o) and the output sequence (y_1, y_2, ..., y_o); specifically,
The single-head attention model within the multi-scale multi-head attention model uses the attention mechanism to compute the target vector ct_o corresponding to output-sequence position o ∈ {1, 2, ..., O} as follows:
First, the degree of association between the hidden state s_{o-1} at output position o-1 and the feature at time t is computed, as shown in formula (21):
e_{o,t} = a(s_{o-1}, h_t)  (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, expressed as:
e_{o,t} = ω^T tanh(W[s_{o-1}, h_t, f_{o,t}] + b)  (22)
f_o = F_i ∗ α_{o-1}  (23)
where W and b are the input-to-hidden weight matrix and bias vector, respectively, ω is the hidden-to-output weight vector, and F_i is a convolution kernel; for the multi-scale multi-head attention mechanism the sizes of the kernels F_i differ, the convolutional part of each head using a convolution filter of a different size;
The multi-scale multi-head model not only initializes the convolution-filter parameters randomly in different ways, so that the multi-head model forms an ensemble; the differently sized convolution filters also represent different models, allowing the attention scores of the previous step to be learned and trained under the supervision of multiple models and yielding models of different speech primitives;
Then the degrees of association of the features at all times are exponentially normalized; the normalized values, called the attention coefficients, are:
Finally, taking the attention coefficients as weights, the features at all times are weighted and summed to obtain, under the attention mechanism, the target vector ct_o at output-sequence position o:
The process of computing the target vector with the attention mechanism can be abbreviated as formula (26):
ct_o = attention([h_1, h_2, ..., h_T], s_{o-1})  (26)。
9. The speech recognition model establishing method based on bottleneck features and the multi-scale multi-head attention mechanism according to claim 8, characterized in that in step five the process of computing the element y_o at position o of the output sequence (y_1, y_2, ..., y_o) with the decoding network corresponding to each head's attention network is:
For a single head of the multi-scale multi-head attention model, when the decoding network behind that head's attention network computes the element y_o at position o of the output sequence (y_1, y_2, ..., y_o), the inputs of the network are the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1; the computation proceeds as follows:
First, the decoding network obtains the target vector ct_o from the attention-network computation and takes ct_o and y_{o-1} as the input of the GRU-based recurrent neural network, computing its hidden state s_o as shown in formula (27):
s_o = GRU(s_{o-1}, [y_{o-1}, ct_o])  (27)
Then, with the decoding-network hidden state s_o ∈ R^d given as input, the posterior probability of phoneme i is obtained by the maxout network computation; each hidden unit of the maxout network has multiple candidate units and selects the one with the largest value as its output; the computation is as follows:
where d is the dimension of the input hidden state s_o, corresponding to the number of hidden units, W_{i,j} ∈ R^{d×m×k} and b_{i,j} ∈ R^{m×k} are the maxout network's parameter matrix and bias vector, respectively, k is the number of candidate units of each maxout hidden unit, and m is the number of output units, corresponding to the number of output phoneme types in the acoustic model;
Finally, as shown in formula (30), the output layer of the maxout network yields the output vector y_o, whose i-th component represents the posterior probability that phoneme i occurs at output-sequence position o;
Finally, all output sequences (y_1, y_2, ..., y_o)_i are integrated into one new output sequence (Y_1, Y_2, ..., Y_O) with a DNN:
Y_i = DNN((y_1, y_2, ..., y_o)_i)  (31)。
CN201910477034.2A 2019-06-03 2019-06-03 Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism Active CN110211574B (en)

Publications (2)

Publication Number Publication Date
CN110211574A true CN110211574A (en) 2019-09-06
CN110211574B CN110211574B (en) 2022-03-11

CN111680591A (en) * 2020-05-28 2020-09-18 天津大学 Pronunciation inversion method based on feature fusion and attention mechanism
CN112530421A (en) * 2020-11-03 2021-03-19 科大讯飞股份有限公司 Voice recognition method, electronic equipment and storage device
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112992125A (en) * 2021-04-20 2021-06-18 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium
CN113192489A (en) * 2021-05-16 2021-07-30 金陵科技学院 Paint spraying robot voice recognition method based on multi-scale enhancement BiLSTM model
CN113450781A (en) * 2020-03-25 2021-09-28 阿里巴巴集团控股有限公司 Speech processing method, speech encoder, speech decoder and speech recognition system
CN113537472A (en) * 2021-07-26 2021-10-22 北京计算机技术及应用研究所 Bidirectional recurrent neural network with low computation and storage consumption
CN113593526A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Speech emotion recognition method based on deep learning
CN113705322A (en) * 2021-06-11 2021-11-26 北京易达图灵科技有限公司 Handwritten Chinese character recognition method and device based on threshold graph neural network
CN113782014A (en) * 2021-09-26 2021-12-10 联想(北京)有限公司 Voice recognition method and device
CN113936680A (en) * 2021-10-08 2022-01-14 电子科技大学 Single-channel speech enhancement method based on multi-scale information perception convolutional neural network
CN114333900A (en) * 2021-11-30 2022-04-12 南京硅基智能科技有限公司 Method for extracting BNF characteristics end to end, network model, training method and system
CN114999459A (en) * 2022-05-05 2022-09-02 苏州大学 Voice recognition method and system based on multi-scale recursive quantitative analysis
WO2022267960A1 (en) * 2021-06-24 2022-12-29 长沙理工大学 Federated attention dbn collaborative detection system based on client selections

Citations (8)

Publication number Priority date Publication date Assignee Title
US20120253812A1 (en) * 2011-04-01 2012-10-04 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
US20120328161A1 (en) * 2011-06-22 2012-12-27 Palenychka Roman Method and multi-scale attention system for spatiotemporal change determination and object detection
US20140114655A1 (en) * 2012-10-19 2014-04-24 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice
CN104756182A (en) * 2012-11-29 2015-07-01 索尼电脑娱乐公司 Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection
CN106571135A (en) * 2016-10-27 2017-04-19 苏州大学 Whisper speech feature extraction method and system
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN108510983A (en) * 2017-02-24 2018-09-07 百度(美国)有限责任公司 The system and method for automatic unit selection and goal decomposition for sequence labelling
CN109271838A (en) * 2018-07-19 2019-01-25 重庆邮电大学 A kind of three parameter attributes fusion gesture identification method based on fmcw radar

Patent Citations (10)

Publication number Priority date Publication date Assignee Title
US20120253812A1 (en) * 2011-04-01 2012-10-04 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
CN103503060A (en) * 2011-04-01 2014-01-08 索尼电脑娱乐公司 Speech syllable/vowel/phone boundary detection using auditory attention cues
CN105023573A (en) * 2011-04-01 2015-11-04 索尼电脑娱乐公司 Speech syllable/vowel/phone boundary detection using auditory attention cues
US20120328161A1 (en) * 2011-06-22 2012-12-27 Palenychka Roman Method and multi-scale attention system for spatiotemporal change determination and object detection
US20140114655A1 (en) * 2012-10-19 2014-04-24 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice
CN104756182A (en) * 2012-11-29 2015-07-01 索尼电脑娱乐公司 Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection
CN106571135A (en) * 2016-10-27 2017-04-19 苏州大学 Whisper speech feature extraction method and system
CN108510983A (en) * 2017-02-24 2018-09-07 百度(美国)有限责任公司 The system and method for automatic unit selection and goal decomposition for sequence labelling
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN109271838A (en) * 2018-07-19 2019-01-25 重庆邮电大学 A kind of three parameter attributes fusion gesture identification method based on fmcw radar

Non-Patent Citations (3)

Title
ANDROS TJANDRA ET AL.: "Multi-Scale Alignment and Contextual History for Attention Mechanism in Sequence-to-Sequence Model", SLT 2018 *
YOON KIM: "Convolutional Neural Networks for Sentence Classification", arXiv *
LONG XINGYAN: "Research on End-to-End Speech Recognition Technology Based on Attention Mechanism", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (43)

Publication number Priority date Publication date Assignee Title
CN110824915A (en) * 2019-09-30 2020-02-21 华南师范大学 GA-DBN network-based intelligent monitoring method and system for wastewater treatment
CN110824915B (en) * 2019-09-30 2022-06-07 华南师范大学 GA-DBN network-based intelligent monitoring method and system for wastewater treatment
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN110765966B (en) * 2019-10-30 2022-03-25 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN110930978A (en) * 2019-11-08 2020-03-27 北京搜狗科技发展有限公司 Language identification method and device and language identification device
CN110880328A (en) * 2019-11-20 2020-03-13 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111159419B (en) * 2019-12-09 2021-05-25 浙江师范大学 Knowledge tracking data processing method, system and storage medium based on graph convolution
CN111159419A (en) * 2019-12-09 2020-05-15 浙江师范大学 Knowledge tracking data processing method, system and storage medium based on graph convolution
CN111312228A (en) * 2019-12-09 2020-06-19 中国南方电网有限责任公司 End-to-end-based voice navigation method applied to electric power enterprise customer service
CN111243609A (en) * 2020-01-10 2020-06-05 平安科技(深圳)有限公司 Method and device for intelligently detecting effective voice and computer readable storage medium
CN111243578A (en) * 2020-01-10 2020-06-05 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111276131A (en) * 2020-01-22 2020-06-12 厦门大学 Multi-class acoustic feature integration method and system based on deep neural network
CN111309965A (en) * 2020-03-20 2020-06-19 腾讯科技(深圳)有限公司 Audio matching method and device, computer equipment and storage medium
CN111309965B (en) * 2020-03-20 2024-02-13 腾讯科技(深圳)有限公司 Audio matching method, device, computer equipment and storage medium
CN113450781A (en) * 2020-03-25 2021-09-28 阿里巴巴集团控股有限公司 Speech processing method, speech encoder, speech decoder and speech recognition system
CN111145760B (en) * 2020-04-02 2020-06-30 支付宝(杭州)信息技术有限公司 Method and neural network model for speaker recognition
CN111145760A (en) * 2020-04-02 2020-05-12 支付宝(杭州)信息技术有限公司 Method and neural network model for speaker recognition
CN111489738B (en) * 2020-04-13 2021-02-23 深圳市友杰智新科技有限公司 Feature extraction method and voice command identification method based on multi-head attention mechanism
CN111489738A (en) * 2020-04-13 2020-08-04 深圳市友杰智新科技有限公司 Feature extraction method and voice command identification method based on multi-head attention mechanism
CN111477221A (en) * 2020-05-28 2020-07-31 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111680591B (en) * 2020-05-28 2023-01-13 天津大学 Pronunciation inversion method based on feature fusion and attention mechanism
CN111477221B (en) * 2020-05-28 2022-12-30 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111680591A (en) * 2020-05-28 2020-09-18 天津大学 Pronunciation inversion method based on feature fusion and attention mechanism
CN111524519B (en) * 2020-05-28 2022-09-30 中国科学技术大学 End-to-end multi-channel speech recognition method using advanced feature fusion
CN111524519A (en) * 2020-05-28 2020-08-11 中国科学技术大学 End-to-end multi-channel speech recognition method using advanced feature fusion
CN112530421A (en) * 2020-11-03 2021-03-19 科大讯飞股份有限公司 Voice recognition method, electronic equipment and storage device
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112992125A (en) * 2021-04-20 2021-06-18 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112992125B (en) * 2021-04-20 2021-08-03 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium
CN113192489A (en) * 2021-05-16 2021-07-30 金陵科技学院 Paint spraying robot voice recognition method based on multi-scale enhancement BiLSTM model
CN113705322A (en) * 2021-06-11 2021-11-26 北京易达图灵科技有限公司 Handwritten Chinese character recognition method and device based on threshold graph neural network
WO2022267960A1 (en) * 2021-06-24 2022-12-29 长沙理工大学 Federated attention dbn collaborative detection system based on client selections
CN113537472A (en) * 2021-07-26 2021-10-22 北京计算机技术及应用研究所 Bidirectional recurrent neural network with low computation and storage consumption
CN113537472B (en) * 2021-07-26 2024-04-09 北京计算机技术及应用研究所 Construction method of bidirectional recurrent neural network with low calculation and storage consumption
CN113593526A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Speech emotion recognition method based on deep learning
CN113782014A (en) * 2021-09-26 2021-12-10 联想(北京)有限公司 Voice recognition method and device
CN113782014B (en) * 2021-09-26 2024-03-26 联想(北京)有限公司 Speech recognition method and device
CN113936680B (en) * 2021-10-08 2023-08-08 电子科技大学 Single-channel voice enhancement method based on multi-scale information perception convolutional neural network
CN113936680A (en) * 2021-10-08 2022-01-14 电子科技大学 Single-channel speech enhancement method based on multi-scale information perception convolutional neural network
CN114333900A (en) * 2021-11-30 2022-04-12 南京硅基智能科技有限公司 Method for extracting BNF characteristics end to end, network model, training method and system
CN114333900B (en) * 2021-11-30 2023-09-05 南京硅基智能科技有限公司 Method for extracting BNF (BNF) characteristics end to end, network model, training method and training system
CN114999459A (en) * 2022-05-05 2022-09-02 苏州大学 Voice recognition method and system based on multi-scale recursive quantitative analysis

Also Published As

Publication number Publication date
CN110211574B (en) 2022-03-11

Similar Documents

Publication Publication Date Title
CN110211574A (en) Method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism
Zhang et al. End-to-end attention based text-dependent speaker verification
An et al. Deep CNNs with self-attention for speaker identification
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
CN109409296B (en) Video emotion recognition method integrating facial expression recognition and voice emotion recognition
Deng et al. New types of deep neural network learning for speech recognition and related applications: An overview
CN109192199A Data processing method combining a bottleneck-feature acoustic model
Waibel Modular construction of time-delay neural networks for speech recognition
Tan et al. The artificial intelligence renaissance: deep learning and the road to human-level machine intelligence
Elleuch et al. Arabic handwritten characters recognition using deep belief neural networks
Irum et al. Speaker verification using deep neural networks: A
Frinken et al. Deep BLSTM neural networks for unconstrained continuous handwritten text recognition
CN109545227A Method and system for automatic speaker gender identification based on a deep autoencoder network
CN109637526A (en) The adaptive approach of DNN acoustic model based on personal identification feature
Chattopadhyay et al. Optimizing speech emotion recognition using manta-ray based feature selection
CN111899766A (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
Jiang et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit.
Huang et al. A neural net approach to speech recognition
Singh et al. Emoception: An inception inspired efficient speech emotion recognition network
Wang et al. L2 mispronunciation verification based on acoustic phone embedding and Siamese networks
Huilian et al. Speech emotion recognition based on BLSTM and CNN feature fusion
Yang et al. Exploration of acoustic and lexical cues for the interspeech 2020 computational paralinguistic challenge
Cai et al. Deep speaker embeddings with convolutional neural network on supervector for text-independent speaker recognition
Bottou Multilayer neural networks
Hu et al. I-vector and dnn hybrid method for short utterance speaker recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant