CN110211574A - Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism - Google Patents
Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism
- Publication number
- CN110211574A CN110211574A CN201910477034.2A CN201910477034A CN110211574A CN 110211574 A CN110211574 A CN 110211574A CN 201910477034 A CN201910477034 A CN 201910477034A CN 110211574 A CN110211574 A CN 110211574A
- Authority
- CN
- China
- Prior art keywords
- network
- layer
- output
- attention
- rbm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
A speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism, belonging to the field of modeling methods. Traditional attention models suffer from poor recognition performance and a single attention scale. In the proposed method, a deep belief network at the front end extracts bottleneck features, which increases the robustness of the model; the back end uses a multi-scale multi-head attention model composed of convolution kernels of different scales to model speech units at the phoneme, syllable, and word levels separately, computing the recurrent-neural-network hidden-state sequence and output sequence element by element. The decoding network attached to each attention head computes the output-sequence element at each position, and all head outputs are finally integrated into a new output sequence. The invention improves the recognition performance of speech recognition systems.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a method that increases the robustness of a model by extracting bottleneck features and builds a multi-scale multi-head model to model speech units at the phoneme, syllable, and word levels, thereby improving recognition performance.
Background technique
The speech signal is one of the most universal and common signals in human society; it is an important channel for people to express, communicate, and spread information. In the current era of information explosion, massive volumes of speech data are generated on the Internet and over telephone channels at every moment. To recognize, classify, and retrieve speech signals on a larger scale, the demand for automatic speech recognition (ASR) has become ever more urgent. Compared with traditional speech recognition systems based on hidden Markov models (HMMs), end-to-end speech recognition systems fuse the acoustic model, pronunciation dictionary, and language model into a single neural network, which makes the model more concise and easier to optimize directly with respect to its parameters. End-to-end models are also more versatile: they reduce the dependence on expert linguistic knowledge and lower the difficulty of building a speech recognition system. Their training data require no alignment information, so annotated speech can be used directly for training. The attention model discards the output-independence assumption and quantifies the alignment between phonemes and features through attention coefficients, giving it a degree of interpretability. This patent develops research on end-to-end speech recognition from the attention model, focusing on the long training cycles and deficient recognition performance of current attention-based speech recognition systems. Because an attention model applies a large number of recurrent-neural-network units internally, its parameters are large in scale and its internal structure is complex, leading to low training efficiency; although this can be partly alleviated by adding hardware computing resources, it remains a difficult problem at the algorithmic level. The attention model also completely abandons phonetic prior knowledge, so it lacks effective initialization parameters, which slows parameter convergence and is a major cause of the long training time; how to integrate the model with phonetic features is therefore significant work. The performance gap between attention models and conventional models has several causes: insufficient training due to limited corpus resources, inaccurate alignment between features and phonemes (graphemes), and poor robustness to noise. In view of these problems, the model needs further study to find reasonable solutions.
Summary of the invention
The purpose of the present invention is to solve the problems of poor recognition performance and a single attention scale in existing attention models, and to propose a speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism.
A speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism, the method comprising the following steps:
Step 1: Use the input sample FBank speech feature vectors X=(x1,x2,...,xT) to perform unsupervised training of the RBM networks in a DBN, obtaining the first three connection-weight matrices W1, W2, W3 that initialize the coding network. These three weight matrices, together with one randomly initialized output layer W4, form the DBN-based bottleneck feature extraction network at the front end of the coding network. RBM denotes a restricted Boltzmann machine (Restricted Boltzmann Machine); DBN denotes a deep belief network (Deep Belief Network); FBank denotes filter-bank features. In the sample FBank speech feature vector X=(x1,x2,...,xT), x1 is the FBank feature sequence at the first moment, x2 at the second moment, and xT at moment T; these feature sequences together constitute the speech feature vector. FBank features are mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) without the discrete cosine transform step;
Step 2: Feed the sampled FBank speech feature vectors X=(x1,x2,...,xT) into the four-layer bottleneck feature extraction network of Step 1, and obtain the network output v by forward computation. Then, according to the output v and the ground-truth labels of the training data X, perform supervised training of the bottleneck feature extraction network with the BP algorithm: compute the gradients of the weights W^l and biases b^l of each of the four layers, and iteratively update them. The supervised training in Step 2 makes the computation faster and more accurate;
Step 3: Remove the final output layer of the four-layer bottleneck feature extraction network from Step 2, keeping only the first three layers as the front end of the coding network. The bottleneck feature sequence (v1,v2,...,vn) output by the front end serves as the input to the RNN, which here forms the back end of the coding network; it consists of two groups of multi-layer GRU-based RNNs, a forward network and a backward network.
The input FBank speech feature vector X thus passes through the front-end bottleneck feature extraction network to yield the bottleneck feature sequence (v1,v2,...,vn), and the multi-layer GRU RNN at the encoder back end finally turns this sequence into the high-level feature sequence (h1,h2,...,hT);
The coding network is composed of the bottleneck feature extraction network of Step 1, which is called its front end, and the recurrent neural network of Step 3. GRU denotes a gated recurrent unit (Gated Recurrent Unit); RNN denotes a recurrent neural network (Recurrent Neural Network);
Step 4: Using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h1,h2,...,hT) output by the coding-network back end into convolution kernels F_i of different scales, modeling speech units at the phoneme, syllable, and word levels respectively so as to obtain a target vector ct_o for each head, which facilitates classification at the back end. Each ct_o is then used as the input of that head's decoding RNN in the multi-scale multi-head attention network, computing the RNN hidden-state sequence (s1,s2,...,so) and the output phoneme sequence (y1,y2,...,yo) element by element;
Step 5: Using the decoding RNN attached to each head of the multi-scale multi-head attention network, compute each head's output sequence (y1,y2,...,yo) and splice them into a new output sequence (y1,y2,...,yo)_i; the spliced sequences are then integrated into the final output sequence (Y1,Y2,...,Yo) with a DNN. Here, the output y_o of each head's decoding network is generated from the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1.
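The multi-scale convolution in Step 4 can be illustrated with a minimal sketch. This is an assumption-laden toy, not the patent's implementation: a mean-pooling kernel stands in for each learned convolution kernel F_i, and the widths 1, 3, 7 are arbitrary proxies for phoneme-, syllable-, and word-scale receptive fields.

```python
import numpy as np

def conv1d_same(H, width):
    """Smooth the high-level feature sequence H (T, d) with a mean kernel
    of the given width -- a stand-in for one scale's convolution kernel F_i."""
    T = H.shape[0]
    pad = width // 2
    Hp = np.pad(H, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([Hp[t:t + width].mean(axis=0) for t in range(T)])

def multi_scale_heads(H, widths):
    """One feature view per head: width 1 ~ phoneme scale, wider kernels
    ~ syllable / word scale; each view would feed that head's attention
    network and decoding RNN."""
    return [conv1d_same(H, w) for w in widths]

# encoder output (h_1,...,h_T) as a random placeholder
H = np.random.default_rng(0).standard_normal((20, 8))
views = multi_scale_heads(H, widths=[1, 3, 7])
```

Each element of `views` plays the role of one head's input; in the patent the kernels are learned and each head's decoder output sequence is later spliced and integrated by a DNN.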
Beneficial effects of the invention:
The present invention provides a speech recognition method based on bottleneck features and a multi-scale multi-head attention mechanism. The method uses a DBN as the front end of the coding network to extract bottleneck features, which increases the robustness of the speech recognition system, while the decoding network uses a multi-scale multi-head attention model composed of convolution kernels of different scales, which can model speech units of different levels (phoneme, syllable, word) separately and improves the recognition performance of the speech recognition system.
The multi-scale multi-head attention model designed by the method outperforms conventional models in phoneme-sequence recognition and has stronger sequence-modeling and discrimination abilities. The designed speech recognition training model improves recognition accuracy and reduces the system's misclassification and rejection rates.
Description of the drawings
Fig. 1 is the multi-scale multi-head attention model combined with the bottleneck feature extraction network;
Fig. 2 is the DBN-based bottleneck feature extraction network;
Specific embodiments
Embodiment 1:
The speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism of this embodiment comprises the following steps:
Step 1: Use the input sample FBank speech feature vectors X=(x1,x2,...,xT) to perform unsupervised training of the RBM networks in a DBN, obtaining the first three connection-weight matrices W1, W2, W3 that initialize the coding network. These three weight matrices, together with one randomly initialized output layer W4, form the DBN-based bottleneck feature extraction network at the front end of the coding network. RBM denotes a restricted Boltzmann machine (Restricted Boltzmann Machine); DBN denotes a deep belief network (Deep Belief Network); FBank denotes filter-bank features. In the sample FBank speech feature vector X=(x1,x2,...,xT), x1 is the FBank feature sequence at the first moment, x2 at the second moment, and xT at moment T; these feature sequences together constitute the speech feature vector. FBank features are mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) without the discrete cosine transform step;
Step 2: Feed the sampled FBank speech feature vectors X=(x1,x2,...,xT) into the four-layer bottleneck feature extraction network of Step 1, and obtain the network output v by forward computation. Then, according to the output v and the ground-truth labels of the training data X, perform supervised training of the bottleneck feature extraction network with the BP algorithm: compute the gradients of the weights W^l and biases b^l of each of the four layers, and iteratively update them. The supervised training in Step 2 makes the computation faster and more accurate;
Step 3: Remove the final output layer of the four-layer bottleneck feature extraction network from Step 2, keeping only the first three layers as the front end of the coding network. The bottleneck feature sequence (v1,v2,...,vn) output by the front end serves as the input to the RNN, which here forms the back end of the coding network; it consists of two groups of multi-layer GRU-based RNNs, a forward network and a backward network.
The input FBank speech feature vector X thus passes through the front-end bottleneck feature extraction network to yield the bottleneck feature sequence (v1,v2,...,vn), and the multi-layer GRU RNN at the encoder back end finally turns this sequence into the high-level feature sequence (h1,h2,...,hT);
The coding network is composed of the bottleneck feature extraction network of Step 1, which is called its front end, and the recurrent neural network of Step 3. GRU denotes a gated recurrent unit (Gated Recurrent Unit); RNN denotes a recurrent neural network (Recurrent Neural Network);
Step 4: Using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h1,h2,...,hT) output by the coding-network back end into convolution kernels F_i of different scales, modeling speech units at the phoneme, syllable, and word levels respectively so as to obtain a target vector ct_o for each head, which facilitates classification at the back end. Each ct_o is then used as the input of that head's decoding RNN in the multi-scale multi-head attention network, computing the RNN hidden-state sequence (s1,s2,...,so) and the output phoneme sequence (y1,y2,...,yo) element by element;
Step 5: Using the decoding RNN attached to each head of the multi-scale multi-head attention network, compute each head's output sequence (y1,y2,...,yo) and splice them into a new output sequence (y1,y2,...,yo)_i; the spliced sequences are then integrated into the final output sequence (Y1,Y2,...,Yo) with a DNN. Here, the output y_o of each head's decoding network is generated from the target vector ct_o produced by that head's attention network and the element y_{o-1} at position o-1.
Embodiment 2:
Unlike Embodiment 1, in Step 1 of this embodiment the input speech feature vector X=(x1,x2,...,xT) uses 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, for a total of 123 parameters per frame. The extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the training-set normalization parameters are then used to normalize the features of the test and development sets.
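The normalization in Embodiment 2 can be sketched as follows; this is a minimal illustration under the stated 123-dimensional feature layout (40 FBank + energy, plus deltas and delta-deltas), with random arrays standing in for real utterances.

```python
import numpy as np

def cmvn_stats(train_feats):
    """Per-dimension mean and standard deviation over the whole training set."""
    stacked = np.concatenate(train_feats, axis=0)   # (total_frames, 123)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0) + 1e-8                # guard against zero variance
    return mean, std

def normalize(feats, mean, std):
    """Apply the training-set statistics so each component is ~ N(0, 1);
    the same (mean, std) are reused for test and development sets."""
    return (feats - mean) / std

# toy usage: two "utterances" of random 123-dim frames
rng = np.random.default_rng(0)
train = [rng.standard_normal((50, 123)) * 3 + 1,
         rng.standard_normal((40, 123)) * 3 + 1]
mean, std = cmvn_stats(train)
norm0 = normalize(train[0], mean, std)
```

Test- and development-set features would be passed through `normalize` with the same training-set `mean` and `std`, exactly as the embodiment prescribes.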
Embodiment 3:
Unlike Embodiments 1 and 2, in Steps 1 and 2 of this embodiment the RBM networks in the DBN undergo an unsupervised training process, where the training of the RBM network combines unsupervised pre-training with supervised training via the back-propagation gradient algorithm. The input of the RBM network is the FBank speech features; the output layer of the RBM network is a softmax layer, in which each output unit corresponds to the posterior probability of a tied triphone state. Between the input and output layers there are three hidden layers; the second hidden layer is the bottleneck layer, whose number of state units is smaller than that of the other hidden layers.
Embodiment 4:
Unlike Embodiments 1 and 2, in Step 1 of this embodiment the process of unsupervised training of the RBM network with the input FBank speech features is as follows:
Step 1.1: Add a layer that outputs phoneme-state classification labels on top of the 4-layer DBN, obtaining a 5-layer bottleneck feature extraction RBM network whose input is the speech features. The 4-layer DBN is built by stacking 4 single-layer RBMs: the first layer is a Gaussian-Bernoulli RBM and the remaining three layers are Bernoulli-Bernoulli RBMs;
Step 1.2: Inside an RBM there are no connections among the neurons within the visible layer or within the hidden layer, so the units of one layer are mutually independent given the other layer, and the posterior probabilities p(v|h) and p(h|v) can be computed easily. For a Bernoulli-Bernoulli RBM, the posterior p(h|v) follows from the energy function:

E(v,h) = −a^T v − b^T h − v^T W h,  p(h|v) = Π_i p(h_i|v)  (1)

where E(v,h) is the energy of the single-layer RBM, v is the visible-layer vector, a is the visible-layer bias vector, h is the hidden-layer vector, b is the hidden-layer bias vector, W is the weight matrix connecting the visible and hidden layers, and W_{i,*} denotes the i-th row of W. Formula (1) shows that, given the visible-layer state, the components of the hidden-layer state are mutually independent;
Step 1.3: Since each component h_i of the hidden-layer state of a Bernoulli-Bernoulli RBM takes only the values 0 and 1, we have:

p(h_i = 1 | v) = σ(W_{i,*} v + b_i)  (2)

where σ(·) denotes the sigmoid function;
The vector of hidden-layer state probabilities can be written as:

p(h = 1 | v) = σ(W v + b)  (3)

By symmetry, given the hidden-layer state, the visible-layer state of a single-layer Bernoulli-Bernoulli RBM is:

p(v = 1 | h) = σ(W^T h + a)  (4)
Step 1.4: For a Gaussian-Bernoulli RBM, whose visible layer follows a Gaussian distribution, the posterior p(h = 1 | v) is computed exactly as in formula (3), while p(v|h) is computed as:

p(v | h) = N(v; W^T h + a, I)  (5)

where I denotes the identity matrix;
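The conditional distributions in formulas (3)–(5) can be sketched directly; this is a minimal illustration with column-vector conventions and arbitrary layer sizes, not the patent's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """Formula (3): hidden-unit activation probabilities sigma(Wv + b)."""
    return sigmoid(W @ v + b)

def p_v_given_h_bernoulli(h, W, a):
    """Formula (4): visible probabilities of a Bernoulli-Bernoulli RBM."""
    return sigmoid(W.T @ h + a)

def sample_v_given_h_gaussian(h, W, a, rng):
    """Formula (5): Gaussian-Bernoulli RBM, v ~ N(W^T h + a, I)."""
    mean = W.T @ h + a
    return mean + rng.standard_normal(mean.shape)

# toy usage: 8 visible units, 4 hidden units
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 8))
a, b = np.zeros(8), np.zeros(4)
v = (rng.random(8) < 0.5).astype(float)
ph = p_h_given_v(v, W, b)
h = (rng.random(4) < ph).astype(float)       # sample hidden states from (3)
pv = p_v_given_h_bernoulli(h, W, a)
vg = sample_v_given_h_gaussian(h, W, a, rng)
```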
Step 1.5: The stacking process of the encoding DBN is as follows:
First, a Gaussian-Bernoulli RBM is trained, because the speech features input to the first layer are real-valued and match a Gaussian distribution. Then the activation probabilities of its hidden units are used as the visible-layer input of the next Bernoulli-Bernoulli RBM; the hidden state of the second-layer RBM in turn serves as the input of the third-layer RBM, and so on for all subsequent layers;
Given a training dataset V, the parameters of a single-layer RBM are trained with gradient descent by minimizing the negative log-likelihood:

θ* = argmin_θ { −Σ_{v∈V} log p(v) }  (6)

During training, the derivative of the negative log-likelihood has the general form:

−∂ log p(v)/∂θ = <∂E(v,h)/∂θ>_data − <∂E(v,h)/∂θ>_model  (7)

where θ = {a, b, W} determines the distribution of the RBM's visible-layer vector v, and <ξ>_data and <ξ>_model denote the expectation of the random variable ξ estimated from the sample data and from the model, respectively. For the weights of the connection matrix in a single-layer RBM:

∂ log p(v)/∂W_{ij} = <v_i h_j>_data − <v_i h_j>_model  (8)

Taking the Bernoulli-Bernoulli RBM as an example, the expectation <v_i h_j>_data is the frequency with which the visible-layer unit v_i and the hidden-layer unit h_j are simultaneously 1 in the training data, while <v_i h_j>_model is the expectation computed under the RBM's generative model distribution.
Step 1.6: Train the DBN:
First, initialize the first-layer RBM from the input sample data with the contrastive divergence (CD) algorithm, in which the model expectation is estimated with samples obtained by Gibbs sampling. The Gibbs sampling process in the CD algorithm is specifically:
1) Gibbs sampling initializes the visible layer with a data sample, giving the visible-layer state v;
2) The posterior p(h|v) is computed with formula (3) as the hidden-layer state h;
Next, depending on the RBM type, formula (4) or formula (5) is used to sample the reconstructed visible-layer state v;
Afterwards, with the parameters of the first-layer RBM fixed, its hidden-layer state is computed from the sample data and used as the visible-layer state of the second-layer RBM, which is then trained with the CD algorithm in the same way;
Proceeding layer by layer in this manner yields the weight parameters connecting every pair of adjacent DBN layers;
Finally, after training, only the weight matrices connecting the layers are retained as the initialization parameters of the DBN transfer matrices.
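The layer-wise CD training of Step 1.6 can be sketched with a single CD-1 update for a Bernoulli-Bernoulli RBM. This is an illustrative one-step Gibbs chain (data → hidden → reconstruction → hidden) following formulas (3) and (4); the layer sizes, learning rate, and batch are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step: estimate <v h>_data - <v h>_model with a single
    Gibbs transition and take a gradient-ascent step on the weights."""
    ph0 = sigmoid(v0 @ W.T + b)                       # p(h|v0), formula (3)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(h0 @ W + a)                         # reconstruction, formula (4)
    ph1 = sigmoid(pv1 @ W.T + b)                      # p(h|v1)
    m = v0.shape[0]
    W += lr * (ph0.T @ v0 - ph1.T @ pv1) / m          # data term minus model term
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

# tiny demonstration: 8 visible units, 4 hidden units, binary batch of 32
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((4, 8))
a, b = np.zeros(8), np.zeros(4)
data = (rng.random((32, 8)) < 0.5).astype(float)
for _ in range(20):
    W, a, b = cd1_update(data, W, a, b, lr=0.1, rng=rng)
```

Stacking RBMs as the embodiment describes would mean feeding `sigmoid(data @ W.T + b)` of a trained layer in as the next layer's visible data.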
Embodiment 5:
Unlike Embodiments 1 and 2, in Step 2 of this embodiment the RBM network is trained in a supervised fashion with the BP algorithm. After the network has been trained without supervision on the sample input data using the RBM technique, the first 3 connection-weight matrices of the coding network are initialized, and the trained weight matrices partly reflect and "memorize" the effective information in the sample data. To further strengthen the network's ability to extract and discriminate speech features, so that the bottleneck features supplied to the subsequent attention model are more robust and discriminative, supervised training with speech-sample labels is carried out on top of the unsupervised training. The detailed process is:
Step 2.1: Transfer the DBN weights directly into a deep neural network (DNN); afterwards, fine-tune on the basis of the known DNN partial model parameters (fine tuning). The first 3 connection-weight matrices W1, W2, W3 of the network are known; these three layers plus one randomly initialized output layer W4 form the DBN-based bottleneck feature extraction network at the front end of the coding network. An activation function is inserted between the layers, and the forward pass of the network is computed as follows:
Let f(·) denote the activation function. The 1st hidden-layer state h1 and the l-th hidden-layer state h_l are respectively:
h1 = f(W1 x + b1)  (9)
h_l = f(W_l h_{l-1} + b_l)  (10)
The final output vector v of the network is obtained from the third hidden layer h3 through the softmax function:
v = softmax(W4 h3 + b4)  (11)
When training the network in this way, a fixed number of samples is randomly drawn from the training set each time, the features of the drawn samples are fed into the network, and the network output is obtained by forward computation;
Step 2.2: Use the BP algorithm to compute the gradient of each layer's parameters from the network output and the training labels, then update the model parameters:
The DBN training criterion generally minimizes the cross-entropy (CE) function:

J_CE = −Σ_t Σ_k y_t(k) log v_t(k)  (12)

The network parameters are then fine-tuned with the back-propagation (BP) algorithm, which usually updates the model parameters according to:

W_t^l = W_{t-1}^l − μ ΔW_t^l  (13)
b_t^l = b_{t-1}^l − μ Δb_t^l  (14)

where W_t^l and b_t^l are the weight matrix and bias vector of layer l after the t-th parameter update, μ is the learning rate, and ΔW_t^l and Δb_t^l are the gradients of the weight matrix and bias vector, computed as first-order derivatives over the training samples:

ΔW_t^l = (1/M) Σ_{m=1}^{M} ∂J_CE^{(m)}/∂W^l  (15)
Δb_t^l = (1/M) Σ_{m=1}^{M} ∂J_CE^{(m)}/∂b^l  (16)

where M is the number of samples used in each training step;
Step 2.3: Execute Steps 2.1 and 2.2 cyclically, using whether the error between the network output and the reference output reaches the set threshold as the criterion for ending training:
First, during training, a development set is generally also used to compute the error between the network output and the reference output for cross validation;
Then, to obtain the bottleneck feature extraction network, the final output-layer state units, transfer matrix, and bias vector of the trained network are all deleted, and the remainder serves as the front end of the coding network, with the state of the bottleneck layer taken as the front-end output. To extract bottleneck features, the raw speech feature parameters are used as network input and the bottleneck-layer output vector is computed layer by layer; this vector is the bottleneck feature parameter. Because the model parameters were initialized with RBMs during training, and the alignment information provided by a GMM-HMM model is used during supervised training, the extracted bottleneck features discriminate different phoneme states better and are more robust than traditional features. Applying these bottleneck features to the attention model is therefore equivalent to transferring the knowledge in the network into the attention model.
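The fine-tuning loop of Steps 2.1–2.3 can be sketched as follows. This is a minimal assumption-laden illustration of formulas (9)–(11) with sigmoid activations, a softmax output, cross-entropy loss, and minibatch gradient averaging with learning rate μ; the layer sizes (bottleneck = 6 units) and data are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, Ws, bs):
    """Three sigmoid hidden layers initialised from the DBN, then a
    randomly initialised softmax output layer W4 (formulas (9)-(11))."""
    acts = [x]
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W.T + b)))   # f() = sigmoid
        acts.append(h)
    acts.append(softmax(acts[-1] @ Ws[-1].T + bs[-1]))
    return acts

def bp_step(x, y_onehot, Ws, bs, mu):
    """One cross-entropy BP update, gradient averaged over the M samples."""
    acts = forward(x, Ws, bs)
    m = x.shape[0]
    delta = (acts[-1] - y_onehot) / m              # dJ/dz at the softmax layer
    for l in range(len(Ws) - 1, -1, -1):
        gW = delta.T @ acts[l]
        gb = delta.sum(axis=0)
        if l > 0:                                  # back-propagate before updating
            delta = (delta @ Ws[l]) * acts[l] * (1 - acts[l])
        Ws[l] -= mu * gW
        bs[l] -= mu * gb

def ce_loss(x, y, Ws, bs):
    v = forward(x, Ws, bs)[-1]
    return -(y * np.log(v + 1e-12)).sum() / x.shape[0]

# toy run: input 10 -> hidden 8 -> bottleneck 6 -> hidden 8 -> 5 classes
rng = np.random.default_rng(0)
dims = [10, 8, 6, 8, 5]
Ws = [0.3 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(4)]
bs = [np.zeros(dims[i + 1]) for i in range(4)]
x = rng.standard_normal((16, 10))
y = np.eye(5)[rng.integers(0, 5, 16)]
before = ce_loss(x, y, Ws, bs)
for _ in range(100):
    bp_step(x, y, Ws, bs, mu=0.1)
after = ce_loss(x, y, Ws, bs)
```

Deleting the last layer afterwards and reading out the bottleneck activations (the third entry of `forward`'s `acts`) corresponds to the bottleneck feature extraction described above.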
Embodiment 6:
Unlike Embodiment 5, in the speech recognition model building method of this embodiment the multi-scale multi-head attention model is composed of three modules: a coding network, a decoding network, and an attention network;
The coding network and the decoding network both contain recurrent-neural-network units;
The front end of the coding network is the DBN feature extraction network and the back end is a deep recurrent neural network; the aim is to learn and mine the contextual information of the speech feature sequence, extracting high-level information from the primitive features to enhance their discrimination and representation abilities;
The main body of the attention network is a multilayer perceptron. Its inputs are the output of the coding network and the hidden-layer unit states of the decoding network, and its output is their degree-of-association score; after exponential normalization, the score becomes the attention coefficient;
The decoding network consists of a single-layer recurrent neural network connected to a maxout network. It first obtains the attention coefficients from the attention network, then forms the target vector as the weighted sum of all coding-network outputs, and finally takes the target vector as the network input to compute the posterior probability of each phoneme at each position of the output sequence.
Specific embodiment 7:
Unlike specific embodiment six, present embodiment based on bottleneck characteristic and multiple dimensioned bull attention machine
The speech recognition modeling method for building up of system, in the step three, high-level characteristic sequence (h1,h2,...,hT) it is by multiple dimensioned
What the coding network front and back end of bull attention model constructed, detailed process are as follows:
The rear end of coding network is made of two groups of multilayer circulation neural networks based on GRU unit, respectively feedforward network
With backward network.The structure of bilateral network can be passed on the information with future simultaneously, promote the information content of high-level characteristic.
Step 3 one, firstly, by the trained DBN bottleneck characteristic (v in coding network front end1,v2,...,vn) as circulation
The input of neural network;Then, reduce the Recognition with Recurrent Neural Network element stack number of plies, only stay 1 layer network processing feature sequence when
Sequence information exports high-level characteristic sequence (h1,h2,...,hT);
Step 3 two, in coding network rear end moment t, the implicit layer state point of c layers of feedforward network and backward network
It is notWithDue to the hidden layer unit information direction of transfer of feedforward network and backward network on the contrary, then corresponding transmitting letter
Number is respectively formula (17) and formula (18):
Step 3 two, feedforward network and backward network carry out feature between layers by respective transmission function parallel
Transmitting, carried out in transmittance process in the time domain it is down-sampled, to reach the target for reducing calculation amount.With in feedforward network, due to will
GRU is as basic unit, so hidden layer state computation such as formula (17) (18) is shown, by input feature vector sequence (x1,x2,...,
xT) the 1st layer of implicit layer state can be obtainedSimilarly, c layers of implicit stratiform can be calculated by c-1 layers of implicit layer state
StateCalculating process implies shown in layer state such as formula (19):
Similarly, the c-th-layer hidden layer states of the backward network are computed;
Step 3-3: the c-th-layer hidden states of the forward and backward networks are spliced to obtain the high-level feature ht of the encoding network at time t:
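As a concrete illustration, the bidirectional GRU back end of steps 3-1 to 3-3 can be sketched in NumPy as follows. This is a minimal single-layer sketch with illustrative weight names and dimensions (none taken from the patent); the time-domain downsampling of the multilayer case is omitted for clarity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell; weight names and shapes are illustrative."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # stacked weights for update gate z, reset gate r, candidate state
        self.Wz = rng.normal(0, 0.1, (hidden_dim, input_dim + hidden_dim))
        self.Wr = rng.normal(0, 0.1, (hidden_dim, input_dim + hidden_dim))
        self.Wh = rng.normal(0, 0.1, (hidden_dim, input_dim + hidden_dim))
        self.hidden_dim = hidden_dim

    def step(self, x, h_prev):
        xh = np.concatenate([x, h_prev])
        z = sigmoid(self.Wz @ xh)                 # update gate
        r = sigmoid(self.Wr @ xh)                 # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h_prev]))
        return (1 - z) * h_prev + z * h_tilde

def bidirectional_encode(v_seq, fw_cell, bw_cell):
    """Run forward and backward GRUs over the bottleneck sequence and
    splice their states, giving ht = [ht_forward ; ht_backward] (step 3-3)."""
    T = len(v_seq)
    h_fw = np.zeros(fw_cell.hidden_dim)
    h_bw = np.zeros(bw_cell.hidden_dim)
    fw_states, bw_states = [], [None] * T
    for t in range(T):                            # forward pass, formula (17)
        h_fw = fw_cell.step(v_seq[t], h_fw)
        fw_states.append(h_fw)
    for t in reversed(range(T)):                  # backward pass, formula (18)
        h_bw = bw_cell.step(v_seq[t], h_bw)
        bw_states[t] = h_bw
    return [np.concatenate([f, b]) for f, b in zip(fw_states, bw_states)]

# toy bottleneck sequence: T=5 frames of 8-dim features
v_seq = [np.random.default_rng(t).normal(size=8) for t in range(5)]
h_seq = bidirectional_encode(v_seq, GRUCell(8, 16), GRUCell(8, 16, seed=1))
print(len(h_seq), h_seq[0].shape)  # 5 (32,)
```

Each high-level feature has twice the hidden dimension because the forward and backward states are concatenated.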
Specific embodiment 8:
Unlike specific embodiment seven, in the method for establishing a speech recognition model based on bottleneck features and the multi-scale multi-head attention mechanism of this embodiment, the attention network of the multi-scale multi-head attention model in step four is constructed as follows:
The attention mechanism is a technique for extracting effective features from a feature sequence in sequence-to-sequence models. It allocates attention to the relevant parts of the feature sequence and ignores irrelevant features. Using the multi-scale multi-head attention mechanism, to compute an element yo of the output sequence of the sequence-to-sequence model, the feature sequence (h1,h2,...,hT) is mapped through convolution kernels Fi of different scales to a target vector cto for the attention mechanism of each head. The target vector cto compresses the important information of the sequence, transforming the important feature information of the sequence into a fixed-length vector, which facilitates pattern classification at the back end. cto is then used as the input of the decoding network following each head's attention network, and the recurrent neural network's hidden layer state sequence (s1,s2,...,so) and output sequence (y1,y2,...,yo) are computed one by one. Specifically,
The single-head attention model of the multi-scale multi-head attention model computes the target vector cto corresponding to output sequence position o ∈ {1, 2, ..., O} as follows:
First, the degree of association between the hidden layer state so-1 at output sequence position o-1 and the feature at time t is computed, as shown in formula (21):
eo,t = a(so-1, ht)    (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, which can be written explicitly as:
eo,t = ωT tanh(W[so-1, ht, fo,t] + b)    (22)
fo = Fi * αo-1    (23)
where W and b are the input-to-hidden weight matrix and bias vector, respectively, ω is the hidden-to-output weight matrix, and Fi is a convolution kernel. In the multi-scale multi-head attention mechanism the kernel sizes Fi differ: the convolutional part of each head uses a convolution filter of a different size. Owing to the multiple heads and the differing filter sizes, the model can not only use representation information from different subspaces at different sequence positions to process the sequence data, but the differently sized convolution filters can also model information at the phoneme, syllable, and word levels.
Here, the multi-scale multi-head model is not merely initialized with different random convolution filter parameters so that the heads form an ensemble; the differently sized convolution filters also represent different models, allowing the attention scores of the previous time step to be learned and trained under the supervision of multiple models, yielding models of different speech units. The models with smaller convolution filters mainly represent phoneme-level models, those of medium size mainly represent syllable-level models, and the larger ones represent word-level models. The multi-scale multi-head attention model can therefore model speech units at different levels, achieving the goal of improving speech recognition performance.
Then, the degrees of association of the features at all times are exponentially normalized; the normalized values are called the attention coefficients, as follows:
Finally, with the attention coefficients as weights, the features at all times are weighted and summed to obtain the target vector cto of the attention mechanism at output sequence position o:
The computation of the target vector by the attention mechanism can be abbreviated as formula (26):
cto = attention([h1,h2,...,hT], so-1)    (26).
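The single-head computation of formulas (21) through (26) can be sketched as follows. This is a NumPy sketch with illustrative weight names and dimensions (not taken from the patent); the width of the 1-D kernel Fi sets the head's scale, and the previous attention weights are initialized uniformly for the first output position:

```python
import numpy as np

def attention_head(H, s_prev, alpha_prev, F_i, W, b, omega):
    """One head of the multi-scale attention.
    H: (T, d_h) encoder outputs; s_prev: (d_s,) decoder state s_{o-1};
    alpha_prev: (T,) previous attention weights; F_i: (k,) 1-D conv kernel
    whose width sets this head's scale (phoneme/syllable/word level)."""
    T = H.shape[0]
    # fo = Fi * alpha_{o-1}: convolve previous weights, one value per frame (formula (23))
    f = np.convolve(alpha_prev, F_i, mode="same")               # (T,)
    # eo,t = omega^T tanh(W [s_{o-1}; h_t; f_t] + b)  (formula (22))
    e = np.array([omega @ np.tanh(W @ np.concatenate([s_prev, H[t], [f[t]]]) + b)
                  for t in range(T)])
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()           # exponential normalization
    ct = alpha @ H                                              # weighted sum of features
    return ct, alpha

T, d_h, d_s, d_hid = 6, 4, 3, 5
rng = np.random.default_rng(0)
H = rng.normal(size=(T, d_h))
ct, alpha = attention_head(H, rng.normal(size=d_s), np.ones(T) / T,
                           np.ones(3) / 3,                      # small kernel ~ phoneme scale
                           rng.normal(size=(d_hid, d_s + d_h + 1)),
                           np.zeros(d_hid), rng.normal(size=d_hid))
print(ct.shape, round(alpha.sum(), 6))  # (4,) 1.0
```

A multi-scale multi-head model would run several such heads in parallel, each with a different kernel width for F_i.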
Specific embodiment 9:
Unlike specific embodiment eight, in the method for establishing a speech recognition model based on bottleneck features and the multi-scale multi-head attention mechanism of this embodiment, the process in step five, in which the decoding network corresponding to each head's attention network computes the element yo at position o of the output sequence (y1,y2,...,yo), is as follows:
For a single head of the multi-scale multi-head attention model, when the decoding network following that head's attention network computes the element yo at position o of the output sequence (y1,y2,...,yo), the inputs of the network are the target vector cto generated by the head's multi-scale attention network and the element yo-1 at position o-1. The computation proceeds as follows:
First, the decoding network obtains the target vector cto from the attention network and feeds cto and yo-1 as input to the GRU-based recurrent neural network to compute its hidden layer state so, as shown in formula (27):
so = GRU(so-1, [yo-1, cto])    (27)
Then, given the decoding network's hidden layer state so ∈ Rd as input, the posterior probability of phoneme i is obtained by the maxout network. Each hidden layer unit of the maxout network has multiple candidate units and outputs the one with the largest value; the computation is as follows:
where d is the dimension of the input hidden layer state so, corresponding to the number of hidden layer units; Wi,j ∈ Rd×m×k and bi,j ∈ Rm×k are the maxout network's parameter matrix and bias vector, respectively; k is the number of candidate units per maxout hidden layer unit; and m is the number of output layer units, corresponding to the number of output phoneme types in the acoustic model;
Next, the output vector yo is obtained from the output layer of the maxout network as shown in formula (30); the i-th component of yo represents the posterior probability of phoneme i occurring at output sequence position o;
Finally, all output sequences (y1,y2,...,yo)i are integrated by a DNN into a new output sequence (Y1,Y2,...,Yo):
Yi = DNN((y1,y2,...,yo)i)    (31).
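The maxout classification layer of formulas (28) through (30) can be sketched as follows. This is a minimal NumPy sketch with illustrative parameter shapes (the patent's Wi,j ∈ Rd×m×k is rearranged here as (m, k, d) for convenience):

```python
import numpy as np

def maxout_softmax(s, W, b):
    """Maxout classification layer over the decoder state.
    s: (d,) decoder hidden state so; W: (m, k, d); b: (m, k) -- each of the
    m output units has k candidate linear units and keeps the maximum."""
    z = W @ s + b                        # (m, k): candidate activations
    a = z.max(axis=1)                    # maxout: pick the largest candidate per unit
    a = a - a.max()                      # numerical stabilization
    p = np.exp(a) / np.exp(a).sum()      # softmax over the m phoneme classes
    return p                             # posterior P(phoneme i | so)

d, m, k = 8, 10, 3
rng = np.random.default_rng(0)
p = maxout_softmax(rng.normal(size=d), rng.normal(size=(m, k, d)), np.zeros((m, k)))
print(p.shape, round(p.sum(), 6))  # (10,) 1.0
```

The output vector plays the role of yo: its i-th component is the posterior probability of phoneme i at the current output position.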
The multi-scale multi-head attention model outperforms traditional models in phoneme sequence recognition, with stronger temporal modeling and discriminative ability. To increase the model's robustness in noisy environments, DBN-based bottleneck features are added on this basis, for the following reasons. First, because the DBN-based bottleneck feature extraction network has no recursive structure, the gradients of multiple frames can be computed in parallel on a GPU during DBN training, effectively improving training efficiency and avoiding the large scale and long training time caused by the multilayer structure of recurrent neural networks. Second, reducing the number of recurrent layers shortens the distance over which gradients are back-propagated through the network, further reducing training cost; meanwhile, the more discriminative features make model training easier to converge, further improving training efficiency. Finally, the DBN-extracted bottleneck features are more robust to noise, compensating for the weak noise resistance of the multi-scale multi-head attention model. The multi-scale model is not merely initialized with different random convolution filter parameters so that the heads form an ensemble that outperforms a single-head model; the differently sized convolution filters also represent different models, allowing the attention scores of the previous time step to be learned and trained under the supervision of multiple models, yielding models of different speech units. Here, models with smaller convolution filters mainly represent phoneme-level models, those of medium size mainly represent syllable-level models, and larger filter sizes model the word level. The multi-scale multi-head attention model can therefore model speech units at different levels, achieving the goal of improving speech recognition performance.
The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments, which, together with the description, merely illustrate its principles; various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The claimed scope of the invention is defined by the appended claims and their equivalents.
Claims (9)
1. A method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism, characterized in that the method comprises the following steps:
Step one: perform unsupervised training of the RBM networks in the DBN using the input sample FBank speech feature vectors X = (x1,x2,...,xT), obtaining the first three connection weight matrices W1, W2, W3 for initializing the encoding network; these three connection weight matrices together with a randomly initialized output layer W4 form the DBN-based bottleneck feature extraction network at the front end of the encoding network; RBM denotes restricted Boltzmann machine; DBN denotes deep belief network; FBank denotes filter bank;
Step two: feed the sampled FBank speech feature vectors X = (x1,x2,...,xT) into the four-layer bottleneck feature extraction network of step one, obtaining the network output v by forward computation; then, according to the network output v and the actual output of the training data X, perform supervised training of the bottleneck feature extraction network using the BP algorithm, computing the gradients ΔWtl of the weights and Δbtl of the biases of each of the four layers, and iteratively updating the weights Wtl and biases btl;
Step three: remove the last output layer of the four-layer bottleneck feature extraction network of step two, retaining only the first three layers as the front end of the encoding network; the bottleneck feature sequence (v1,v2,...,vn) output by the front end is used as the input of the RNN; this RNN is the back end of the encoding network and consists of two groups of multilayer GRU-based RNNs, a forward network and a backward network; thus the input FBank speech feature vectors X pass through the front-end bottleneck feature extraction network to obtain the bottleneck feature sequence (v1,v2,...,vn), from which the multilayer GRU-based RNN at the back end of the encoder produces the final output, the high-level feature sequence (h1,h2,...,hT);
Step four: using the multi-scale multi-head attention mechanism, feed the high-level feature sequence (h1,h2,...,hT) output by the back end of the encoding network separately into convolution kernels Fi of different scales, modeling phoneme-, syllable-, and word-level speech units to obtain the respective target vectors cto; then use each model's cto as the input of the decoding RNN of each head in the multi-scale multi-head attention network, computing the RNN hidden layer state sequence (s1,s2,...,so) and the output phoneme sequence (y1,y2,...,yo) one by one;
Step five: using the decoding RNNs corresponding to the attention networks of all heads in the multi-scale multi-head attention network, compute the output sequences (y1,y2,...,yo) and splice them into a new output sequence (y1,y2,...,yo)i; integrate the new output sequence into the final output sequence (Y1,Y2,...,Yo) using a DNN.
2. The method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 1, characterized in that: in step one, the input FBank speech feature vectors X = (x1,x2,...,xT) use 40-dimensional FBank features plus energy, spliced with the corresponding first- and second-order differences, totalling 123 parameters; the extracted features are first normalized over the training set so that each component follows a standard normal distribution, and the training-set normalization parameters are then used to normalize the test-set and development-set features.
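The normalization of claim 2 (per-dimension standardization with training-set statistics) can be sketched as follows; the function names and the small epsilon guard are illustrative additions, not from the claim:

```python
import numpy as np

def cmvn_stats(train_feats):
    """Per-dimension mean/std over the training set.
    train_feats: (N, 123) stacked frames of FBank+energy with deltas."""
    mu = train_feats.mean(axis=0)
    sigma = train_feats.std(axis=0) + 1e-8   # guard against zero variance (illustrative)
    return mu, sigma

def apply_cmvn(feats, mu, sigma):
    # same training-set statistics are reused for dev and test sets
    return (feats - mu) / sigma

rng = np.random.default_rng(0)
train = rng.normal(loc=3.0, scale=2.0, size=(1000, 123))
mu, sigma = cmvn_stats(train)
norm = apply_cmvn(train, mu, sigma)
print(norm.shape)  # (1000, 123)
```

After normalization each of the 123 components of the training features has approximately zero mean and unit variance.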
3. The method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 2, characterized in that: in steps one and two, unsupervised training is performed on the RBM networks in the DBN; the input of the RBM network is the FBank speech features, and its output layer is a softmax layer, each output unit corresponding to the posterior probability of a tied triphone state; there are three hidden layers between the input and output layers, the second of which is the bottleneck layer, whose number of state units is smaller than that of the other hidden layers.
4. The method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 3, characterized in that: in step one, the process of unsupervised training of the RBM networks in the DBN using the input FBank speech feature vectors X = (x1,x2,...,xT) is:
Step 1-1: add a layer of output phoneme-state classification labels on top of the 4-layer DBN, obtaining a bottleneck feature network of 5-layer structure; the speech features are the input to the RBM network of this 5-layer bottleneck structure; the 4-layer DBN is formed by stacking 4 single-layer RBMs, the first being a Gaussian-Bernoulli RBM and the remaining three being Bernoulli-Bernoulli RBMs;
Step 1-2: in the RBM, the posterior probability p(h|v) of a Bernoulli-Bernoulli RBM is computed as follows:
where E(v, h) is the energy of a single-layer RBM, v is the visible layer vector, a is the visible layer bias vector, h is the hidden layer vector, b is the hidden layer bias vector, W is the weight matrix connecting the visible and hidden layers, Wi,*v denotes the product of the i-th row of W with v, and the components of the hidden layer state are mutually independent;
Step 1-3: since each component hi of the Bernoulli-Bernoulli RBM's hidden neuron state takes only the values 0 and 1:
where σ(·) denotes the sigmoid function;
The vector of hidden layer states can be expressed as:
p(h = 1 | v) = σ(Wv + b)    (3)
By symmetry, given the hidden layer state, the visible layer state of a single-layer Bernoulli-Bernoulli RBM is:
p(v = 1 | h) = σ(WTh + a)    (4)
Step 1-4: for a Gaussian-Bernoulli RBM, whose visible layer is Gaussian-distributed, p(h = 1 | v) is computed as in formula (3), while p(v | h) is computed as:
p(v | h) = N(v; WTh + a, I)    (5)
where I denotes the identity matrix;
Step 1-5: the stacking process of the encoding DBN is:
First, a Gaussian-Bernoulli RBM is trained; then the activation probabilities of its hidden layer units are used as the input of the visible layer of the next Bernoulli-Bernoulli RBM; the hidden states of the second-layer RBM in turn serve as the input of the third-layer RBM, and so on for all subsequent layers;
Given a training dataset V, the parameters of a single-layer RBM are trained by minimizing the negative log-likelihood using gradient descent:
During training, the general form of the derivative of the negative log-likelihood is:
where θ = {a, b, W} determines the distribution of the RBM's visible layer vector v, and <ξ>data and <ξ>model denote the expectations of the random variable ξ estimated from the sample data and from the model, respectively; for the weights of the connection matrix in a single-layer RBM:
Step 1-6: train the DBN:
First, initialize the first-layer RBM on the input sample data with the contrastive divergence (CD) algorithm, in which samples obtained by Gibbs sampling are used to estimate the model; the Gibbs sampling process in the CD algorithm is specifically:
1) the Gibbs sampler initializes the visible layer with a data sample, giving the visible layer state v;
2) the posterior probability p(h|v) is computed by formula (3) as the hidden layer state h;
Next, depending on the RBM type, formula (4) or formula (5) is used to sample the reconstructed visible layer state v; afterwards, with the parameters of the first-layer RBM fixed, the hidden layer state of that RBM is computed from the sample data and used as the visible layer state of the second-layer RBM; the second-layer RBM is then likewise trained with the CD algorithm;
proceeding in the same way layer by layer, the weight parameters of the DBN connecting each pair of layers are obtained one by one;
Finally, after training, only the weight matrices of the connected layers are retained as the initialization parameters of the DBN transfer matrices.
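The CD training of a single Bernoulli-Bernoulli RBM layer (steps 1-2 to 1-6) can be sketched as follows; this is a one-step contrastive divergence (CD-1) sketch with illustrative sizes, batching, and learning rate, not the patent's exact training schedule:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr=0.1, rng=None):
    """One CD-1 step for a Bernoulli-Bernoulli RBM.
    V: (M, n_v) batch of visible samples; W: (n_h, n_v); a: visible bias; b: hidden bias."""
    if rng is None:
        rng = np.random.default_rng(0)
    # positive phase: p(h=1|v) = sigmoid(W v + b), formula (3)
    ph0 = sigmoid(V @ W.T + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # Gibbs sample of hidden units
    # negative phase: reconstruct v via p(v=1|h) = sigmoid(W^T h + a), formula (4)
    pv1 = sigmoid(h0 @ W + a)
    ph1 = sigmoid(pv1 @ W.T + b)
    M = V.shape[0]
    # gradient approximation <vh>_data - <vh>_model, formula (8)
    W += lr * (ph0.T @ V - ph1.T @ pv1) / M
    a += lr * (V - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(1)
n_v, n_h = 12, 6
V = (rng.random((50, n_v)) < 0.5).astype(float)        # toy binary "data"
W = rng.normal(0, 0.01, (n_h, n_v)); a = np.zeros(n_v); b = np.zeros(n_h)
for _ in range(20):
    W, a, b = cd1_update(V, W, a, b, rng=rng)
print(W.shape)  # (6, 12)
```

In the stacked DBN, the hidden activation probabilities of one trained RBM become the visible data of the next; the first (Gaussian-Bernoulli) layer would replace the visible reconstruction of formula (4) with the Gaussian of formula (5).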
5. The method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 4, characterized in that: in step two, supervised training of the bottleneck feature extraction network with the BP algorithm, according to the network output v and the actual output of the training data X, i.e. supervised training on the speech samples using the speech sample labels on top of the unsupervised training, proceeds as follows:
Step 2-1: transfer the DBN weights directly into the deep neural network; then fine-tune on the basis of the known partial DNN model parameters; with the first three connection matrices W1, W2, W3 of the network known, these three layers together with a randomly initialized output layer W4 form the DBN-based bottleneck feature extraction network at the front end of the encoding network; activation functions are added between the layers, and the forward pass of the network is computed:
Letting f(·) denote the activation function, the first hidden layer state h1 and the l-th hidden layer state hl are respectively:
h1 = f(W1x + b1)    (9)
hl = f(Wlhl-1 + bl)    (10)
The final output vector v of the network is obtained from the third hidden layer h3 through the softmax function:
v = softmax(W4h3 + b4)    (11)
When training the network in this way, a fixed number of samples is randomly drawn from the training set each time, the features of the drawn samples are input to the network, and the network output is obtained by forward computation;
Step 2-2: compute the gradients of each layer's parameters with the BP algorithm from the network output and the sample training data, then update the model parameters:
The DBN network training criterion minimizes the cross-entropy function:
Then the network parameters are fine-tuned with the back-propagation algorithm, the model parameters being updated according to the following formula:
where Wtl and btl are the weight matrix and bias vector of layer l of the network after the t-th parameter update, μ is the learning rate, and ΔWtl and Δbtl are the gradients of the weight matrix and bias vector, computed from the first derivatives over the training samples:
where M is the number of samples used in each training step;
Step 2-3: execute steps 2-1 and 2-2 cyclically, using whether the error between the network output and the target output reaches the set threshold as the criterion for terminating training:
First, during training, the error between the network output and the reference output is computed on the development set for cross-validation; then, on this basis, the state units, transfer matrix, and bias vector of the last output layer of the trained network are all deleted, and the remainder serves as the front end of the encoding network, with the states of the feature-extraction bottleneck layer as the output of the encoding network front end; with the speech feature parameters as network input, the bottleneck layer output vector is computed layer by layer through the network; this vector is the bottleneck feature parameter, completing the extraction of the bottleneck features.
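The bottleneck extraction of step 2-3 (forward the features through the DBN-initialized stack with the softmax layer removed, and read out the narrow layer) can be sketched as follows; the layer sizes, activation choice, and function names are illustrative assumptions:

```python
import numpy as np

def extract_bottleneck(x, weights, biases, bottleneck_idx=1):
    """Forward one FBank frame through the retained layers and return the
    bottleneck layer's activation as the feature (output layer W4 removed)."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        h = np.tanh(W @ h + b)          # f(.) activation, formulas (9)-(10)
        if l == bottleneck_idx:
            return h                    # narrow second hidden layer -> bottleneck feature
    return h

rng = np.random.default_rng(0)
sizes = [123, 1024, 40, 1024]           # 123-dim input, 40-dim narrow second hidden layer
weights = [rng.normal(0, 0.05, (sizes[i + 1], sizes[i])) for i in range(3)]
biases = [np.zeros(sizes[i + 1]) for i in range(3)]
v = extract_bottleneck(rng.normal(size=123), weights, biases)
print(v.shape)  # (40,)
```

Applied frame by frame, this yields the bottleneck feature sequence (v1,v2,...,vn) consumed by the recurrent back end of the encoder.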
6. The method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 5, characterized in that: the multi-scale multi-head attention model consists of three modules: an encoding network, a decoding network, and an attention network;
the encoding and decoding networks both contain recurrent neural network units;
the front end of the encoding network is the DBN feature extraction network, and the back end is a deep recurrent neural network;
the main body of the attention network is a multilayer perceptron whose inputs are the output of the encoding network and the hidden unit states of the decoding network, and whose output is their degree-of-association score; the attention coefficients are obtained from the scores by exponential normalization;
the decoding network consists of a single-layer recurrent neural network connected to a maxout network; it first obtains the attention coefficients from the attention network, then computes the target vector as the weighted sum of all encoding network outputs, and finally takes the target vector as the network input to compute the posterior probability of each phoneme at each position of the output sequence.
7. The method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 6, characterized in that: in step three, the high-level feature sequence (h1,h2,...,hT) is constructed by the front and back ends of the encoding network of the multi-scale multi-head attention model; the specific construction process is:
Step 3-1: first, the trained DBN bottleneck features (v1,v2,...,vn) from the front end of the encoding network are used as the input of the recurrent neural network; then the number of stacked recurrent layers is reduced, leaving only one layer to process the temporal information of the feature sequence and output the high-level feature sequence (h1,h2,...,hT);
Step 3-2: at time t in the back end of the encoding network, the hidden layer states of the c-th layer of the forward network and the backward network propagate hidden-unit information in opposite directions, so the corresponding transfer functions are formulas (17) and (18), respectively:
In step 3-2, the forward network takes the GRU as its basic unit, so the hidden layer states are computed as shown in formulas (17) and (18): the first-layer hidden layer states are obtained from the input feature sequence (x1,x2,...,xT), and similarly the c-th-layer hidden layer states are computed from the (c-1)-th-layer hidden layer states, as shown in formula (19):
Similarly, the c-th-layer hidden layer states of the backward network are computed;
Step 3-3: the c-th-layer hidden states of the forward and backward networks are spliced to obtain the high-level feature ht of the encoding network at time t:
8. The method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 7, characterized in that: in step four, the attention network of the multi-scale multi-head attention model is constructed as follows:
Using the multi-scale multi-head attention mechanism, to compute an element yo of the output sequence of the sequence-to-sequence model, the feature sequence (h1,h2,...,hT) is mapped through convolution kernels Fi of different scales to a target vector cto for the attention mechanism of each head; cto is then used as the input of the decoding network following each head's attention network, and the recurrent neural network's hidden layer state sequence (s1,s2,...,so) and output sequence (y1,y2,...,yo) are computed one by one; wherein,
the single-head attention model of the multi-scale multi-head attention model computes the target vector cto corresponding to output sequence position o ∈ {1, 2, ..., O} as follows:
First, the degree of association between the hidden layer state so-1 at output sequence position o-1 and the feature at time t is computed, as shown in formula (21):
eo,t = a(so-1, ht)    (21)
where a(·) denotes the attention network, a multilayer perceptron with a single hidden layer, expressed as:
eo,t = ωT tanh(W[so-1, ht, fo,t] + b)    (22)
fo = Fi * αo-1    (23)
where W and b are the input-to-hidden weight matrix and bias vector, respectively, ω is the hidden-to-output weight matrix, and Fi is a convolution kernel; in the multi-scale multi-head attention mechanism the kernel sizes Fi differ, the convolutional part of each head using a convolution filter of a different size;
the multi-scale multi-head model is not merely initialized with different random convolution filter parameters so that the heads form an ensemble model; the differently sized convolution filters also represent different models, allowing the attention scores of the previous time step to be learned and trained under the supervision of multiple models, yielding models of different speech units;
then, the degrees of association of the features at all times are exponentially normalized; the normalized values are called the attention coefficients, as follows:
finally, with the attention coefficients as weights, the features at all times are weighted and summed to obtain the target vector cto of the attention mechanism at output sequence position o:
the computation of the target vector by the attention mechanism can be abbreviated as formula (26):
cto = attention([h1,h2,...,hT], so-1)    (26).
9. The method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism according to claim 8, characterized in that: in step five, the process in which the decoding network corresponding to each head's attention network computes the element yo at position o of the output sequence (y1,y2,...,yo) is:
For a single head of the multi-scale multi-head attention model, when the decoding network following that head's attention network computes the element yo at position o of the output sequence (y1,y2,...,yo), the inputs of the network are the target vector cto generated by the head's multi-scale attention network and the element yo-1 at position o-1; the computation proceeds as follows:
First, the decoding network obtains the target vector cto from the attention network and feeds cto and yo-1 as input to the GRU-based recurrent neural network to compute its hidden layer state so, as shown in formula (27):
so = GRU(so-1, [yo-1, cto])    (27)
Then, given the decoding network's hidden layer state so ∈ Rd as input, the posterior probability of phoneme i is obtained by the maxout network; each hidden layer unit of the maxout network has multiple candidate units and outputs the one with the largest value; the computation is as follows:
where d is the dimension of the input hidden layer state so, corresponding to the number of hidden layer units; Wi,j ∈ Rd×m×k and bi,j ∈ Rm×k are the maxout network's parameter matrix and bias vector, respectively; k is the number of candidate units per maxout hidden layer unit; and m is the number of output layer units, corresponding to the number of output phoneme types in the acoustic model;
Next, the output vector yo is obtained from the output layer of the maxout network as shown in formula (30); the i-th component of yo represents the posterior probability of phoneme i occurring at output sequence position o;
Finally, all output sequences (y1,y2,...,yo)i are integrated by a DNN into a new output sequence (Y1,Y2,...,Yo):
Yi = DNN((y1,y2,...,yo)i)    (31).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910477034.2A CN110211574B (en) | 2019-06-03 | 2019-06-03 | Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110211574A true CN110211574A (en) | 2019-09-06 |
CN110211574B CN110211574B (en) | 2022-03-11 |
Family
ID=67790458
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910477034.2A Active CN110211574B (en) | 2019-06-03 | 2019-06-03 | Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110211574B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120253812A1 (en) * | 2011-04-01 | 2012-10-04 | Sony Computer Entertainment Inc. | Speech syllable/vowel/phone boundary detection using auditory attention cues |
CN103503060A (en) * | 2011-04-01 | 2014-01-08 | 索尼电脑娱乐公司 | Speech syllable/vowel/phone boundary detection using auditory attention cues |
CN105023573A (en) * | 2011-04-01 | 2015-11-04 | 索尼电脑娱乐公司 | Speech syllable/vowel/phone boundary detection using auditory attention cues |
US20120328161A1 (en) * | 2011-06-22 | 2012-12-27 | Palenychka Roman | Method and multi-scale attention system for spatiotemporal change determination and object detection |
US20140114655A1 (en) * | 2012-10-19 | 2014-04-24 | Sony Computer Entertainment Inc. | Emotion recognition using auditory attention cues extracted from users voice |
CN104756182A (en) * | 2012-11-29 | 2015-07-01 | 索尼电脑娱乐公司 | Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection |
CN106571135A (en) * | 2016-10-27 | 2017-04-19 | 苏州大学 | Whisper speech feature extraction method and system |
CN108510983A (en) * | 2017-02-24 | 2018-09-07 | 百度(美国)有限责任公司 | Systems and methods for automatic unit selection and target decomposition for sequence labeling |
CN108417202A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | Speech recognition method and system |
CN109271838A (en) * | 2018-07-19 | 2019-01-25 | 重庆邮电大学 | Gesture recognition method based on fusion of three parameter features from FMCW radar |
Non-Patent Citations (3)
Title |
---|
ANDROS TJANDRA ET AL.: "Multi-Scale Alignment and Contextual History for Attention Mechanism in Sequence-to-Sequence Model", SLT 2018 * |
YOON KIM: "Convolutional Neural Networks for Sentence Classification", arXiv * |
龙星延 (LONG, Xingyan): "Research on end-to-end speech recognition technology based on the attention mechanism", China Master's Theses Full-text Database (Information Science and Technology) * |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110824915A (en) * | 2019-09-30 | 2020-02-21 | 华南师范大学 | GA-DBN network-based intelligent monitoring method and system for wastewater treatment |
CN110824915B (en) * | 2019-09-30 | 2022-06-07 | 华南师范大学 | GA-DBN network-based intelligent monitoring method and system for wastewater treatment |
CN110765966A (en) * | 2019-10-30 | 2020-02-07 | 哈尔滨工业大学 | One-stage automatic recognition and translation method for handwritten characters |
CN110765966B (en) * | 2019-10-30 | 2022-03-25 | 哈尔滨工业大学 | One-stage automatic recognition and translation method for handwritten characters |
CN110930978A (en) * | 2019-11-08 | 2020-03-27 | 北京搜狗科技发展有限公司 | Language identification method and device and language identification device |
CN110880328A (en) * | 2019-11-20 | 2020-03-13 | Oppo广东移动通信有限公司 | Arrival reminding method, device, terminal and storage medium |
CN111159419B (en) * | 2019-12-09 | 2021-05-25 | 浙江师范大学 | Knowledge tracking data processing method, system and storage medium based on graph convolution |
CN111159419A (en) * | 2019-12-09 | 2020-05-15 | 浙江师范大学 | Knowledge tracking data processing method, system and storage medium based on graph convolution |
CN111312228A (en) * | 2019-12-09 | 2020-06-19 | 中国南方电网有限责任公司 | End-to-end-based voice navigation method applied to electric power enterprise customer service |
CN111243609A (en) * | 2020-01-10 | 2020-06-05 | 平安科技(深圳)有限公司 | Method and device for intelligently detecting effective voice and computer readable storage medium |
CN111243578A (en) * | 2020-01-10 | 2020-06-05 | 中国科学院声学研究所 | Chinese mandarin character-voice conversion method based on self-attention mechanism |
CN111276131A (en) * | 2020-01-22 | 2020-06-12 | 厦门大学 | Multi-class acoustic feature integration method and system based on deep neural network |
CN111309965A (en) * | 2020-03-20 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Audio matching method and device, computer equipment and storage medium |
CN111309965B (en) * | 2020-03-20 | 2024-02-13 | 腾讯科技(深圳)有限公司 | Audio matching method, device, computer equipment and storage medium |
CN113450781A (en) * | 2020-03-25 | 2021-09-28 | 阿里巴巴集团控股有限公司 | Speech processing method, speech encoder, speech decoder and speech recognition system |
CN111145760B (en) * | 2020-04-02 | 2020-06-30 | 支付宝(杭州)信息技术有限公司 | Method and neural network model for speaker recognition |
CN111145760A (en) * | 2020-04-02 | 2020-05-12 | 支付宝(杭州)信息技术有限公司 | Method and neural network model for speaker recognition |
CN111489738B (en) * | 2020-04-13 | 2021-02-23 | 深圳市友杰智新科技有限公司 | Feature extraction method and voice command identification method based on multi-head attention mechanism |
CN111489738A (en) * | 2020-04-13 | 2020-08-04 | 深圳市友杰智新科技有限公司 | Feature extraction method and voice command identification method based on multi-head attention mechanism |
CN111477221A (en) * | 2020-05-28 | 2020-07-31 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN111680591B (en) * | 2020-05-28 | 2023-01-13 | 天津大学 | Pronunciation inversion method based on feature fusion and attention mechanism |
CN111477221B (en) * | 2020-05-28 | 2022-12-30 | 中国科学技术大学 | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network |
CN111680591A (en) * | 2020-05-28 | 2020-09-18 | 天津大学 | Pronunciation inversion method based on feature fusion and attention mechanism |
CN111524519B (en) * | 2020-05-28 | 2022-09-30 | 中国科学技术大学 | End-to-end multi-channel speech recognition method using advanced feature fusion |
CN111524519A (en) * | 2020-05-28 | 2020-08-11 | 中国科学技术大学 | End-to-end multi-channel speech recognition method using advanced feature fusion |
CN112530421A (en) * | 2020-11-03 | 2021-03-19 | 科大讯飞股份有限公司 | Voice recognition method, electronic equipment and storage device |
CN112633175A (en) * | 2020-12-24 | 2021-04-09 | 哈尔滨理工大学 | Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment |
CN112992125A (en) * | 2021-04-20 | 2021-06-18 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112992125B (en) * | 2021-04-20 | 2021-08-03 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112992126A (en) * | 2021-04-22 | 2021-06-18 | 北京远鉴信息技术有限公司 | Voice authenticity verification method and device, electronic equipment and readable storage medium |
CN113192489A (en) * | 2021-05-16 | 2021-07-30 | 金陵科技学院 | Paint spraying robot voice recognition method based on multi-scale enhancement BiLSTM model |
CN113705322A (en) * | 2021-06-11 | 2021-11-26 | 北京易达图灵科技有限公司 | Handwritten Chinese character recognition method and device based on threshold graph neural network |
WO2022267960A1 (en) * | 2021-06-24 | 2022-12-29 | 长沙理工大学 | Federated attention dbn collaborative detection system based on client selections |
CN113537472A (en) * | 2021-07-26 | 2021-10-22 | 北京计算机技术及应用研究所 | Bidirectional recurrent neural network with low computation and storage consumption |
CN113537472B (en) * | 2021-07-26 | 2024-04-09 | 北京计算机技术及应用研究所 | Construction method of bidirectional recurrent neural network with low calculation and storage consumption |
CN113593526A (en) * | 2021-07-27 | 2021-11-02 | 哈尔滨理工大学 | Speech emotion recognition method based on deep learning |
CN113782014A (en) * | 2021-09-26 | 2021-12-10 | 联想(北京)有限公司 | Voice recognition method and device |
CN113782014B (en) * | 2021-09-26 | 2024-03-26 | 联想(北京)有限公司 | Speech recognition method and device |
CN113936680B (en) * | 2021-10-08 | 2023-08-08 | 电子科技大学 | Single-channel voice enhancement method based on multi-scale information perception convolutional neural network |
CN113936680A (en) * | 2021-10-08 | 2022-01-14 | 电子科技大学 | Single-channel speech enhancement method based on multi-scale information perception convolutional neural network |
CN114333900A (en) * | 2021-11-30 | 2022-04-12 | 南京硅基智能科技有限公司 | Method for extracting BNF characteristics end to end, network model, training method and system |
CN114333900B (en) * | 2021-11-30 | 2023-09-05 | 南京硅基智能科技有限公司 | Method for extracting BNF (BNF) characteristics end to end, network model, training method and training system |
CN114999459A (en) * | 2022-05-05 | 2022-09-02 | 苏州大学 | Voice recognition method and system based on multi-scale recursive quantitative analysis |
Also Published As
Publication number | Publication date |
---|---|
CN110211574B (en) | 2022-03-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110211574A (en) | Method for establishing a speech recognition model based on bottleneck features and a multi-scale multi-head attention mechanism | |
Zhang et al. | End-to-end attention based text-dependent speaker verification | |
An et al. | Deep CNNs with self-attention for speaker identification | |
Cai et al. | A novel learnable dictionary encoding layer for end-to-end language identification | |
CN109409296B (en) | Video emotion recognition method integrating facial expression recognition and voice emotion recognition | |
Deng et al. | New types of deep neural network learning for speech recognition and related applications: An overview | |
CN109192199A (en) | Data processing method combining a bottleneck-feature acoustic model | |
Waibel | Modular construction of time-delay neural networks for speech recognition | |
Tan et al. | The artificial intelligence renaissance: deep learning and the road to human-level machine intelligence | |
Elleuch et al. | Arabic handwritten characters recognition using deep belief neural networks | |
Irum et al. | Speaker verification using deep neural networks: A | |
Frinken et al. | Deep BLSTM neural networks for unconstrained continuous handwritten text recognition | |
CN109545227A (en) | Automatic speaker gender identification method and system based on a deep autoencoder network | |
CN109637526A (en) | Adaptive method for DNN acoustic models based on speaker identification features | |
Chattopadhyay et al. | Optimizing speech emotion recognition using manta-ray based feature selection | |
CN111899766A (en) | Speech emotion recognition method based on optimization fusion of depth features and acoustic features | |
Jiang et al. | Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit. | |
Huang et al. | A neural net approach to speech recognition | |
Singh et al. | Emoception: An inception inspired efficient speech emotion recognition network | |
Wang et al. | L2 mispronunciation verification based on acoustic phone embedding and Siamese networks | |
Huilian et al. | Speech emotion recognition based on BLSTM and CNN feature fusion | |
Yang et al. | Exploration of acoustic and lexical cues for the interspeech 2020 computational paralinguistic challenge | |
Cai et al. | Deep speaker embeddings with convolutional neural network on supervector for text-independent speaker recognition | |
Bottou | Multilayer neural networks | |
Hu et al. | I-vector and dnn hybrid method for short utterance speaker recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||