CN109767759A - End-to-end speech recognition methods based on modified CLDNN structure - Google Patents
- Publication number
- CN109767759A CN109767759A CN201910115486.6A CN201910115486A CN109767759A CN 109767759 A CN109767759 A CN 109767759A CN 201910115486 A CN201910115486 A CN 201910115486A CN 109767759 A CN109767759 A CN 109767759A
- Authority
- CN
- China
- Prior art keywords
- model
- cldnn
- network
- rate
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
An end-to-end speech recognition method based on an improved CLDNN structure is claimed. The traditional CLDNN structure commonly used in speech recognition processes the temporal information of the speech signal with a fully connected LSTM (Long Short-Term Memory) model, which is prone to over-fitting during training and degrades the learning effect. Deeper models often perform better, but increasing model depth by simply stacking network layers causes vanishing gradients, exploding gradients and the "degradation" problem. To address these problems, the invention proposes an improved CLDNN structure: a residual ConvLSTM model is built by combining a residual network with ConvLSTM, and it replaces the fully connected LSTM model in the traditional CLDNN structure. This model structure remedies the shortcomings of the traditional CLDNN model, and model depth can be increased by stacking residual ConvLSTM blocks without vanishing gradients, exploding gradients or "degradation", so the speech recognition system performs better.
Description
Technical field
The invention belongs to the field of speech recognition, and in particular concerns a speech recognition method based on deep learning.
Background technique
Automatic speech recognition has long held a very important position in the field of artificial intelligence. Traditional speech recognition technology, represented by the HMM-GMM model, was once the mainstream and dominated the field for decades. In recent years, thanks to breakthroughs in deep learning, automatic speech recognition has entered a stage of rapid development. End-to-end speech recognition systems based on deep learning have surpassed traditional systems in popularity in academia, and are gradually replacing traditional speech recognition systems in actual production.
Since the 1980s, acoustic models based on the Gaussian mixture model / hidden Markov model (GMM/HMM) have been widely used: the HMM handles the temporal variation in speech, and the GMM maps the acoustic input to hidden Markov states. In recent years, acoustic models based on deep neural networks (DNN) have been shown to perform better on large-vocabulary speech recognition tasks, with large numbers of neurons modeling acoustic features more effectively. Because the DNN is fully connected, it cannot fully exploit the local structure of the speech feature space. A convolutional neural network (CNN) can use its translation invariance to overcome the variability of the speech signal itself and can explain variation in the speech feature space well. A recurrent neural network (RNN) mines contextual information in the sequence through recurrence and overcomes some shortcomings of the DNN, but it is prone to vanishing gradients during training and has difficulty remembering long-range information. The long short-term memory (LSTM) unit uses gating units to preserve the error at the current time step and transmit it selectively to specific units, avoiding the vanishing-gradient problem. The connectionist temporal classification (CTC) algorithm, proposed by Graves et al. in 2006, can be applied to end-to-end speech recognition: it models the correspondence between the acoustic feature sequence and the phoneme sequence without relying on manual alignment of features and phonemes.
Relevant technology companies at home and abroad are continually developing their own end-to-end speech recognition models. Baidu researchers published Deep Speech in 2015 and Deep Speech 2 in 2016, both combining CLDNN with CTC to build speech recognition models of excellent performance. The iFlytek research team proposed the deep fully convolutional neural network (DFCNN, Deep Fully Convolutional Neural Network) structure in 2016, using a large number of stacked convolution and pooling layers to model whole-sentence speech, greatly strengthening the expressive power of the CNN. By accumulating many such convolution-pooling pairs, DFCNN sees a very long history and future context, which guarantees that it expresses the long-range correlation of speech well; it is also more robust than RNN structures. IBM researchers published an article at ICASSP 2016 describing a trainable 14-layer deep CNN model (including fully connected layers) built from 3x3 convolution kernels with pooling after multiple convolution layers; on the Switchboard data set it brought a relative WER reduction of about 10.6% compared with conventional CNN-based models. The MSRA team proposed the residual network in 2015, solving the "degradation" problem that appears as model depth grows; residual networks were later applied to speech recognition models with proven effect. The Google research team presented at ICASSP 2017 an acoustic model structure combining Network-in-Network (NiN), Batch Normalization (BN) and Convolutional LSTM (ConvLSTM); without a language model, it reached 10.5% WER on the WSJ speech recognition task.
Because of its simple construction and excellent performance, CLDNN has always been a popular structure for end-to-end speech recognition. However, the depth of the common CLDNN model is insufficient and the extracted features are not rich enough, so the resulting speech recognition model does not achieve the best effect. The fully connected long short-term memory model (FC-LSTM) in it cannot preserve the local structure of the speech feature space and over-fits easily.
Summary of the invention
The present invention seeks to address the above problems of the prior art. It proposes an end-to-end speech recognition method based on an improved CLDNN structure that effectively solves the over-fitting tendency of the LSTM in the traditional CLDNN, and overcomes the vanishing-gradient, exploding-gradient and "degradation" problems brought by increasing model depth. The technical scheme of the invention is as follows:
An end-to-end speech recognition method based on an improved CLDNN structure, comprising the following steps:
S1, obtaining a speech data set and dividing it into a training set, a cross-validation set and a test set;
S2, pre-processing all speech data to obtain the Mel-frequency cepstral coefficients (MFCC) of the speech signal;
S3, building the improved CLDNN network model, comprising a speech feature abstraction part composed of a convolutional neural network (CNN), a residual convolutional long short-term memory model that processes the temporal information of the speech signal, and a deep neural network (DNN) that maps the processed feature space to the output layer;
S4, constructing the speech recognition loss function, using the CTC loss;
S5, training the improved CLDNN model of step S3 on the training set, optimizing the objective function of step S4 with the Adam optimizer;
S6, performing cross-validation on the model trained in step S5 and adjusting its hyperparameters to obtain the final network model.
Further, the pre-processing of step S2 comprises: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel filtering and discrete cosine transform.
Further, the residual convolutional long short-term memory model in step S3 is obtained as follows: the matrix products in the fully connected LSTM model are replaced with convolution operations to obtain the convolutional LSTM model, and a residual network structure is applied to this model to obtain the residual convolutional LSTM model.
Further, the residual network structure is used to build deep networks: skip connections directly link shallow layers with deep layers so that gradients propagate better to the shallow layers. The residual network is composed of multiple residual blocks, and the deep residual network structure composed of multiple residual blocks replaces the multilayer LSTM (long short-term memory) structure in the traditional CLDNN model.
Further, in step S4 the loss function uses the CTC loss, specifically:
Assume the size of the label alphabet L is K. Given an input sequence X = (x_1, x_2, ..., x_T) and the corresponding output label sequence Y = (y_1, y_2, ..., y_U), the task of CTC is to feed the loss back to the neural network and, for the given input sequence, adjust the internal network parameters to maximize the log-probability of the output labels, i.e. max ln P(Y|X). CTC (connectionist temporal classification) also introduces a blank label to represent frames that map to no element of the label alphabet L.
The softmax layer after the last DNN layer is used as the input of CTC; the softmax output contains K+1 nodes, mapped to the elements of L ∪ {blank}. The probability of an entire CTC path is
P(p|X) = ∏_{t=1}^{T} z_t^{p_t}
where z_t is the softmax output vector at time t and z_t^k is the posterior probability of the k-th label. To solve the alignment problem between the softmax outputs and the label sequence, a CTC path p = (p_1, p_2, ..., p_T), in one-to-one correspondence with the input sequence at frame level, is introduced. The path p corresponds to the label sequence Y through a mapping Φ. Since this mapping is one-to-many, one label sequence can correspond to multiple CTC paths, so the probability of label sequence Y is the sum of the probabilities of all CTC paths corresponding to it:
P(Y|X) = Σ_{p ∈ Φ^{-1}(Y)} P(p|X)
The CTC loss function is defined as the sum of the negative log-probabilities of each training sample's correct labeling:
L_CTC = -Σ_{(X,Y)} ln P(Y|X)
Further, step S5 optimizes the objective function of step S4 with the Adam optimizer:
Compute the gradient at time step t:
g_t = ∇_θ J(θ_{t-1})
First, compute the exponential moving average of the gradient, with m_0 initialized to 0; this aggregates the gradient momentum of previous time steps. The coefficient β_1 is the exponential decay rate controlling the weighting between the momentum and the current gradient; it usually takes a value close to 1 and defaults to 0.9:
m_t = β_1 m_{t-1} + (1 - β_1) g_t
Second, compute the exponential moving average of the squared gradient, with v_0 initialized to 0. The coefficient β_2 is the exponential decay rate controlling the influence of previous squared gradients; it defaults to 0.999:
v_t = β_2 v_{t-1} + (1 - β_2) g_t^2
Third, because m_0 is initialized to 0, m_t is biased toward 0, especially early in training, so the gradient mean m_t is bias-corrected to reduce the influence of this bias on the early stage of training:
m̂_t = m_t / (1 - β_1^t)
Fourth, because v_0 is initialized to 0, v_t is likewise biased toward 0 early in training and is corrected:
v̂_t = v_t / (1 - β_2^t)
Fifth, update the parameters: the initial learning rate α is multiplied by the ratio of the gradient mean to the square root of the corrected second moment, with default learning rate α = 0.001 and ε = 10^-8:
θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)
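The five update steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the Adam rule, not the patent's training code; the quadratic test function f(x) = x² is an arbitrary stand-in for the objective.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient and squared gradient,
    bias-corrected, then the parameter step (steps 1-5 in the text)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment m_t
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment v_t
    m_hat = m / (1 - beta1 ** t)              # bias correction of m_t
    v_hat = v / (1 - beta2 ** t)              # bias correction of v_t
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(x) = x^2 starting from x = 1.0 with the default hyperparameters
x, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2 * x                                 # gradient g_t of f at x
    x, m, v = adam_step(x, g, m, v, t)
```

With α = 0.001 the effective step size early on is close to α, so roughly a thousand steps are needed to move the full unit distance to the minimum.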
Further, step S6 performs cross-validation on the model trained in step S5, adjusts the hyperparameters of the model and obtains the final network model, specifically:
Cross-validation steps:
1. Initialize the weights with random values between -0.5 and 0.5.
2. Divide the learning sample space C into N parts.
3. Take N-1 parts from the learning data file in the prescribed order as training data samples, and use the remaining part as validation data samples. Complete steps 4 to 7.
4. Read in one sample at a time from the training data samples and train on it.
5. Compute the total output error EP for this sample. Modify the weights of the two layers until EP falls below the defined error threshold, then read in the next training sample.
6. When all samples in the N-1 training parts have been learned, one set of weights is produced. Compute the outputs of the validation samples with this set of weights and calculate the validation success rate RATE = (number of validation samples satisfying the EP threshold) / (total number of validation samples).
7. If the validation success rate RATE exceeds the prescribed success rate, end this round of learning; otherwise learn all validation samples.
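The fold-splitting part of the procedure (steps 2 and 3) can be sketched as follows; the per-sample training loop (steps 4 to 7) depends on the model and is omitted. This is an illustrative sketch, not the patented procedure itself, and the fold count N = 5 is an assumed example value:

```python
import numpy as np

def n_fold_splits(n_samples, n_folds):
    """Split sample indices into N parts; each part serves once as the
    validation set while the remaining N-1 parts form the training set."""
    idx = np.arange(n_samples)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, val

# 10 samples, 5 folds: 5 (train, validation) index pairs of sizes (8, 2)
splits = list(n_fold_splits(10, 5))
```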
Hyperparameters:
Learning rate: the learning rate is the step size with which the optimization algorithm updates the network weights; different optimization algorithms call for different learning rates. If the learning rate is too large, the model may fail to converge and the loss function oscillates; if it is too small, the model converges slowly and needs longer training. Typical values are 0.01, 0.001 and 0.0001.
Batch size: the batch size is the number of samples fed into the model per training step. In convolutional neural networks a large batch usually makes the network converge faster, but because memory is limited, too large a batch may exhaust memory or crash the program. Typical values are 16, 32, 64 and 128.
Number of iterations: the number of iterations is the number of times the entire training set is fed through the neural network. When the validation error rate and the training error rate differ little, the current number of iterations is considered suitable; if the validation error first decreases and then increases, the number of iterations is too large and should be reduced, otherwise over-fitting easily occurs.
Advantages and beneficial effects of the invention:
The invention introduces the convolutional long short-term memory model (Convolutional Long Short-Term Memory, ConvLSTM) to replace the FC-LSTM in the common CLDNN model, remedying its inability to preserve the spatial locality of features and its tendency to over-fit. To deepen the model without the "degradation", vanishing-gradient and exploding-gradient problems, the invention also introduces the residual network (Residual Network, ResNet). So that multiple ConvLSTM layers can be stacked to improve model performance without those problems, the invention fuses ConvLSTM with the residual network structure; the residual ConvLSTM block structure is shown in Fig. 1. On this basis, the invention improves the traditional CLDNN structure: since the fully connected LSTM model in the traditional CLDNN cannot preserve the local structure of the feature space and over-fits easily, a deep residual ConvLSTM network composed of multiple residual ConvLSTM blocks replaces the multilayer LSTM structure in the traditional CLDNN model, giving the model better performance on the temporal relations in speech features while resisting over-fitting. The improved CNN-ResConvLSTM-DNN model can be made deeper by stacking more residual ConvLSTM blocks without vanishing gradients, exploding gradients or "degradation", yielding better performance on speech recognition tasks; its structure is shown in Fig. 2.
Detailed description of the invention
Fig. 1 shows the residual convolutional long short-term memory block structure of the preferred embodiment of the invention;
Fig. 2 shows the improved CLDNN model structure proposed by the invention;
Fig. 3 is a flow chart of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention.
The technical solution with which the present invention solves the above technical problems is as follows:
S1, divide the speech data set into a training set, a cross-validation set and a test set;
S2, pre-process all data to obtain the Mel-frequency cepstral coefficients (MFCC) of the speech signal. The pre-processing steps are:
Pre-emphasis: pass the signal through the high-pass filter H(z) = 1 - μz^{-1}.
Framing: divide the whole speech signal into segments with a frame length of 30 ms and a frame shift of 10 ms.
Windowing: apply a Hamming window to each frame,
S'(n) = S(n) × W(n), where W(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
Fast Fourier transform: apply the FFT to each frame to obtain the energy distribution over the spectrum.
Mel filtering: pass the energy spectrum through a bank of triangular Mel filters; the frequency response of a single triangular filter with center frequency f(m) is
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m); H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1); and H_m(k) = 0 elsewhere.
Compute the logarithmic energy of each filter output:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m < M.
Discrete cosine transform: obtain the MFCCs,
C(n) = Σ_{m=0}^{M-1} s(m) cos( πn(m + 1/2) / M ), n = 1, 2, ..., L.
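The pre-processing chain above can be sketched end to end in NumPy. This is a simplified illustration: the sample rate, FFT size, filter count, coefficient count and the HTK-style mel scale are assumed defaults, not values fixed by the patent.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_ms=30, hop_ms=10, n_mels=26, n_mfcc=13, mu=0.97):
    # pre-emphasis: H(z) = 1 - mu * z^-1
    emph = np.append(signal[0], signal[1:] - mu * signal[:-1])
    # framing: 30 ms frames with a 10 ms shift
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emph) - flen) // hop)
    frames = np.stack([emph[i * hop:i * hop + flen] for i in range(n_frames)])
    # Hamming window per frame
    frames = frames * np.hamming(flen)
    # FFT -> power spectrum
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # triangular Mel filter bank (assumed HTK-style mel scale)
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    # log filter-bank energies
    log_e = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the filter axis -> MFCCs
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return log_e @ dct.T

# one second of noise at 16 kHz -> 98 frames of 13 coefficients
feat = mfcc(np.random.randn(16000))
```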
S3, build the improved CLDNN network model. The model comprises a speech feature abstraction part composed of a convolutional neural network (CNN), a residual convolutional long short-term memory part (ResConvLSTM) that processes the temporal information of the speech signal, and a deep neural network (DNN) part that maps the processed feature space to the output layer.
The convolutional LSTM is an extension of the fully connected LSTM in which both the input-to-state and state-to-state transitions have a convolutional structure. Compared with a plain CNN this structure expresses the temporal relations of the features better, and compared with the fully connected LSTM it is less prone to over-fitting:
i_t = σ(W_xi * x_t + W_hi * h_{t-1} + b_i)
f_t = σ(W_xf * x_t + W_hf * h_{t-1} + b_f)
o_t = σ(W_xo * x_t + W_ho * h_{t-1} + b_o)
c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(W_xc * x_t + W_hc * h_{t-1} + b_c)
h_t = o_t ∘ tanh(c_t)
where σ is the sigmoid activation function; i_t, f_t, o_t, c_t and h_t denote the input gate, forget gate, output gate, cell activation and cell output vector at time t; * denotes convolution and ∘ the element-wise product of vectors; W denotes the weight kernels connecting the different gates and b the corresponding bias vectors.
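A single ConvLSTM time step following the gate equations above can be sketched in NumPy. This is a deliberately minimal illustration with one spatial dimension, one channel and a 'same'-padded 3-tap kernel per gate, not the full multi-channel ConvLSTM layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_same(x, w):
    """'same'-padded 1-D convolution: the * operator in the gate equations."""
    pad = len(w) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(w)], w) for i in range(len(x))])

def convlstm_step(x_t, h_prev, c_prev, W, b):
    """One ConvLSTM step: the matrix products of FC-LSTM replaced by convolutions."""
    i = sigmoid(conv_same(x_t, W['xi']) + conv_same(h_prev, W['hi']) + b['i'])
    f = sigmoid(conv_same(x_t, W['xf']) + conv_same(h_prev, W['hf']) + b['f'])
    o = sigmoid(conv_same(x_t, W['xo']) + conv_same(h_prev, W['ho']) + b['o'])
    g = np.tanh(conv_same(x_t, W['xc']) + conv_same(h_prev, W['hc']) + b['c'])
    c = f * c_prev + i * g          # cell state: element-wise products
    h = o * np.tanh(c)              # cell output
    return h, c

rng = np.random.default_rng(0)
W = {k: rng.standard_normal(3) * 0.1
     for k in ('xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc')}
b = {k: 0.0 for k in 'ifoc'}
h, c = convlstm_step(rng.standard_normal(8), np.zeros(8), np.zeros(8), W, b)
```

Because h_t = o_t ∘ tanh(c_t) with o_t in (0, 1), the cell output stays strictly inside (-1, 1).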
The residual network structure builds deep networks by skip connections that directly link shallow layers with deep layers, so that gradients propagate better to the shallow layers. The residual network is composed of multiple residual blocks (Residual Blocks). If the input of a residual block is x_l and its output is x_{l+1}, the structure of the residual block can be expressed as:
x_{l+1} = x_l + F(x_l, w_l) (9)
F(x_l, w_l) = w_l σ(w_{l-1} x_{l-1}) (10)
where σ is the activation function. Thus, for any deeper layer L:
x_L = x_l + Σ_{i=l}^{L-1} F(x_i, w_i) (11)
Assuming a loss function C, we obtain:
∂C/∂x_l = ∂C/∂x_L · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, w_i)) (12)
Here the factor ∂C/∂x_L guarantees that the information can be propagated directly back to any layer x_l, and the constant term 1 guarantees that the network does not suffer from vanishing gradients.
So that multiple convolutional LSTM layers can be stacked to improve model performance without vanishing gradients, exploding gradients or "degradation", the invention fuses the convolutional LSTM model with the residual network structure; the residual ConvLSTM block structure is shown in Fig. 1.
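Equations (9)-(12) reduce to very little code. The sketch below uses a generic transform F in place of the ConvLSTM layers of Fig. 1, to show only the skip-connection arithmetic and its identity path (the "1" in equation (12)); the sizes and transforms are assumed for illustration:

```python
import numpy as np

def residual_block(x, F):
    """x_{l+1} = x_l + F(x_l): the skip connection adds the input unchanged,
    so gradients always have a direct path back to the shallow layers."""
    return x + F(x)

def stack(x, transforms):
    """Composing blocks gives x_L = x_l + sum of F(x_i, w_i), equation (11)."""
    for F in transforms:
        x = residual_block(x, F)
    return x

rng = np.random.default_rng(1)
Ws = [rng.standard_normal((4, 4)) * 0.1 for _ in range(3)]
x0 = rng.standard_normal(4)
out = stack(x0, [lambda x, W=W: np.tanh(x @ W) for W in Ws])

# with F == 0 the residual stack is exactly the identity map
ident = stack(x0, [lambda x: np.zeros_like(x)] * 3)
```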
On this basis, the invention improves the traditional CLDNN structure. Since the fully connected LSTM model in the traditional CLDNN cannot preserve the local structure of the feature space and over-fits easily, a deep residual ConvLSTM network composed of multiple residual ConvLSTM blocks replaces the multilayer LSTM structure in the traditional CLDNN model, giving the model better performance on the temporal relations in speech features while resisting over-fitting. The improved CNN-ResConvLSTM-DNN model can be made deeper by stacking more residual ConvLSTM blocks without vanishing gradients, exploding gradients or "degradation", yielding better performance on speech recognition tasks; its structure is shown in Fig. 2.
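The data flow through the CNN-ResConvLSTM-DNN stack can be traced shape by shape. The sketch below is only a shape walk-through: dense per-frame projections stand in for the real convolution and ConvLSTM layers, and all sizes (frame count, MFCC dimension, hidden width, label count) are assumed example values, not those of Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical sizes: T frames of 13-dim MFCC, K = 28 labels plus 1 blank
T, D, H, K = 100, 13, 32, 28

# CNN front end (stand-in: one dense projection per frame)
W_cnn = rng.standard_normal((D, H)) * 0.1
feat = relu(rng.standard_normal((T, D)) @ W_cnn)

# stack of residual blocks, x_{l+1} = x_l + F(x_l, w_l)
# (stand-in transform F per block in place of a ConvLSTM layer)
def F(x, W):
    return np.tanh(x @ W)

for W_blk in [rng.standard_normal((H, H)) * 0.1 for _ in range(3)]:
    feat = feat + F(feat, W_blk)

# DNN back end + softmax over K+1 outputs (labels plus blank) feeding CTC
W_out = rng.standard_normal((H, K + 1)) * 0.1
posteriors = softmax(feat @ W_out)
```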
S4, construct the objective function, evaluated by the speech recognition word error rate (WER%); the loss function uses the CTC loss.
Assume the size of the label alphabet L is K. Given an input sequence X = (x_1, x_2, ..., x_T) and the corresponding output label sequence Y = (y_1, y_2, ..., y_U), the task of CTC is to feed the loss back to the neural network and, for the given input sequence, adjust the internal network parameters to maximize the log-probability of the output labels, i.e. max ln P(Y|X). CTC also introduces a blank label to represent frames, such as pauses or coughs, that map to no element of the label alphabet L.
The softmax layer after the last DNN layer is used as the input of CTC; the softmax output contains K+1 nodes, mapped to the elements of L ∪ {blank}. The probability of an entire CTC path is
P(p|X) = ∏_{t=1}^{T} z_t^{p_t}
where z_t is the softmax output vector at time t and z_t^k is the posterior probability of the k-th label. To solve the alignment problem between the softmax outputs and the label sequence, a CTC path p = (p_1, p_2, ..., p_T), in one-to-one correspondence with the input sequence at frame level, is introduced. The path p corresponds to the label sequence Y through a mapping Φ. Since this mapping is one-to-many, one label sequence can correspond to multiple CTC paths, so the probability of label sequence Y is the sum of the probabilities of all CTC paths corresponding to it:
P(Y|X) = Σ_{p ∈ Φ^{-1}(Y)} P(p|X)
The CTC loss function is defined as the sum of the negative log-probabilities of each training sample's correct labeling:
L_CTC = -Σ_{(X,Y)} ln P(Y|X)
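The path sum P(Y|X) above can be computed exactly with the standard CTC forward algorithm over the blank-extended label sequence. Below is a NumPy sketch for a single utterance, working in the log domain with blank at index 0; it is an illustration of the loss definition, not the patent's training code.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """-ln P(Y|X) for one utterance: forward algorithm summing over all
    CTC paths p that the mapping collapses to the label sequence Y."""
    T, K = log_probs.shape
    ext = [blank]                      # blank-extended labels: -a-b-...-
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)   # log forward variables
    alpha[0, 0] = log_probs[0, blank]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cand = [alpha[t - 1, s]]               # stay
            if s > 0:
                cand.append(alpha[t - 1, s - 1])   # advance one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cand.append(alpha[t - 1, s - 2])   # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cand) + log_probs[t, ext[s]]
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

# uniform posteriors over {blank, a, b}, T = 2 frames, Y = [a]:
# the paths (a,a), (blank,a) and (a,blank) all map to [a], so P = 3/9
logp = np.log(np.full((2, 3), 1.0 / 3.0))
nll = ctc_neg_log_likelihood(logp, [1])
```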
S5, train the model on the training set, optimizing the objective function with the Adam optimizer;
S6, perform cross-validation on the trained model using the validation set, adjust the hyperparameters of the model and obtain the final network model.
The above embodiments should be understood as merely illustrating rather than limiting the present invention. After reading the contents of the invention, a person skilled in the art can make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the present invention.
Claims (7)
1. An end-to-end speech recognition method based on an improved CLDNN structure, characterized by comprising the following steps:
S1, obtaining a speech data set and dividing it into a training set, a cross-validation set and a test set;
S2, pre-processing all speech data to obtain the Mel-frequency cepstral coefficients (MFCC) of the speech signal;
S3, building the improved CLDNN network model, comprising a speech feature abstraction part composed of a convolutional neural network (CNN), a residual convolutional long short-term memory model that processes the temporal information of the speech signal, and a deep neural network (DNN) that maps the processed feature space to the output layer;
S4, constructing the loss function of speech recognition, using the CTC loss;
S5, training the improved CLDNN model of step S3 on the training set, optimizing the objective function of step S4 with the Adam optimizer;
S6, performing cross-validation on the model trained in step S5 using the validation set, adjusting the hyperparameters of the model and obtaining the final network model.
2. The end-to-end speech recognition method based on an improved CLDNN structure according to claim 1, characterized in that the pre-processing of step S2 comprises: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel filtering and discrete cosine transform.
3. The end-to-end speech recognition method based on an improved CLDNN structure according to claim 1, characterized in that the residual convolutional long short-term memory model in step S3 is obtained as follows: the matrix products in the fully connected LSTM model are replaced with convolution operations to obtain the convolutional LSTM model, and a residual network structure is applied to this model to obtain the residual convolutional LSTM model.
4. The end-to-end speech recognition method based on an improved CLDNN structure according to claim 3, characterized in that the residual network structure is used to build deep networks: skip connections directly link shallow layers with deep layers so that gradients propagate better to the shallow layers; the residual network is composed of multiple residual blocks, and the deep residual network structure composed of multiple residual blocks replaces the multilayer LSTM (long short-term memory) structure in the traditional CLDNN model.
5. The end-to-end speech recognition method based on an improved CLDNN structure according to claim 3, characterized in that in step S4 the loss function uses the CTC loss, specifically:
Assume the size of the label alphabet L is K. Given an input sequence X = (x_1, x_2, ..., x_T) and the corresponding output label sequence Y = (y_1, y_2, ..., y_U), the task of CTC is to feed the loss back to the neural network and, for the given input sequence, adjust the internal network parameters to maximize the log-probability of the output labels, i.e. max ln P(Y|X); CTC (connectionist temporal classification) also introduces a blank label to represent frames that map to no element of the label alphabet L.
The softmax layer after the last DNN layer is used as the input of CTC; the softmax output contains K+1 nodes, mapped to the elements of L ∪ {blank}. The probability of an entire CTC path is
P(p|X) = ∏_{t=1}^{T} z_t^{p_t}
where z_t is the softmax output vector at time t and z_t^k is the posterior probability of the k-th label. To solve the alignment problem between the softmax outputs and the label sequence, a CTC path p = (p_1, p_2, ..., p_T), in one-to-one correspondence with the input sequence at frame level, is introduced; the path p corresponds to the label sequence Y through a mapping Φ. Since this mapping is one-to-many, one label sequence can correspond to multiple CTC paths, so the probability of label sequence Y is the sum of the probabilities of all corresponding CTC paths:
P(Y|X) = Σ_{p ∈ Φ^{-1}(Y)} P(p|X)
The CTC loss function is defined as the sum of the negative log-probabilities of each training sample's correct labeling:
L_CTC = -Σ_{(X,Y)} ln P(Y|X)
6. The end-to-end speech recognition method based on an improved CLDNN structure according to claim 5, characterized in that step S5 optimizes the objective function of step S4 with the Adam optimizer:
Compute the gradient at time step t:
g_t = ∇_θ J(θ_{t-1})
First, compute the exponential moving average of the gradient, with m_0 initialized to 0, aggregating the gradient momentum of previous time steps; the coefficient β_1 is the exponential decay rate controlling the weighting between the momentum and the current gradient, usually close to 1 and defaulting to 0.9:
m_t = β_1 m_{t-1} + (1 - β_1) g_t
Second, compute the exponential moving average of the squared gradient, with v_0 initialized to 0; the coefficient β_2 is the exponential decay rate controlling the influence of previous squared gradients, defaulting to 0.999:
v_t = β_2 v_{t-1} + (1 - β_2) g_t^2
Third, because m_0 is initialized to 0, m_t is biased toward 0, especially early in training, so the gradient mean m_t is bias-corrected to reduce the influence of this bias on the early stage of training:
m̂_t = m_t / (1 - β_1^t)
Fourth, because v_0 is initialized to 0, v_t is likewise biased toward 0 early in training and is corrected:
v̂_t = v_t / (1 - β_2^t)
Fifth, update the parameters: the initial learning rate α is multiplied by the ratio of the gradient mean to the square root of the corrected second moment, with default learning rate α = 0.001 and ε = 10^-8:
θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε).
7. a kind of end-to-end speech recognition methods based on modified CLDNN structure according to claim 6, feature exist
In the step S6 adjusts the hyper parameter of model, obtain final network mould to the model income cross validation after step S5 training
Type specifically includes:
Cross validation step:
1, weight is initialized, weighting value is the random number between -0.5 to 0.5;
2, dividing learning sample space C is N parts;
3, N-1 parts are taken out according to regulation sequence from learning data file and is used as training data sample;Remaining N parts as verifying
Data sample;It completes step 4 and arrives step 7;
4, it is trained reading in a sample since training data sample;
5, it calculates this sample output error and always measures EP;Two layers of weight is modified until EP < (for defined error metrics), under reading
One training sample;
6, until in N-1 parts of training samples all sample learnings terminate, generate one group of weight, verify sample with this group of weight computing
This, calculate verifying sample is proved to be successful rate RATE=(meet EP < verifying number of samples)/(total verifying number of samples)
If 7, verifying sample success rate RATE > rate (rate is defined success rate), terminate the study of this wheel;Otherwise it learns
Practise all verifying samples;
Hyper parameter:
Learning rate: learning rate refers to the amplitude size that network weight is updated in optimization algorithm;Different optimization algorithms determines not
Same learning rate;When learning rate is excessive, it may cause model and do not restrain, loss function constantly concussion up and down;Learning rate is too small then
Cause model convergence rate partially slow, needs longer time training;Usual value is 0.01,0.001,0.0001;
Batch size: batch size is the sample number that trained neural network is sent into model each time, in convolutional neural networks, greatly
Batch can usually make network more rapid convergence, but due to the limitation of memory source, batch is excessive may result in Out of Memory use or
Program Kernel Panic;Usual value is 16,32,64,128;
Number of iterations: the number of iterations is the number of times the entire training set is passed through the neural network for training. When the validation error rate and the training error rate differ only slightly, the current iteration count can be considered appropriate. If the validation error first decreases and then begins to increase, the iteration count is too large and should be reduced; otherwise overfitting is likely.
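The stopping rule just described, halt once the validation error turns upward after its initial decline, can be sketched as follows (the per-epoch error list and the `patience` parameter are illustrative assumptions):

```python
def choose_epochs(valid_errors, patience=1):
    """Return the epoch with the lowest validation error so far, stopping the
    scan once the error has risen for `patience` consecutive epochs."""
    best_err, best_epoch, rises = float("inf"), 0, 0
    for epoch, err in enumerate(valid_errors, start=1):
        if err < best_err:
            best_err, best_epoch, rises = err, epoch, 0
        else:
            rises += 1
            if rises >= patience:
                break   # validation error turned upward: too many iterations
    return best_epoch
```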
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910115486.6A CN109767759B (en) | 2019-02-14 | 2019-02-14 | Method for establishing CLDNN structure applied to end-to-end speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109767759A true CN109767759A (en) | 2019-05-17 |
CN109767759B CN109767759B (en) | 2020-12-22 |
Family
ID=66456247
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910115486.6A Active CN109767759B (en) | 2019-02-14 | 2019-02-14 | Method for establishing CLDNN structure applied to end-to-end speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109767759B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106328122A (en) * | 2016-08-19 | 2017-01-11 | 深圳市唯特视科技有限公司 | Voice identification method using long-short term memory model recurrent neural network |
CN107562784A (en) * | 2017-07-25 | 2018-01-09 | 同济大学 | Short text classification method based on ResLCNN models |
WO2018071389A1 (en) * | 2016-10-10 | 2018-04-19 | Google Llc | Very deep convolutional neural networks for end-to-end speech recognition |
CN108564940A (en) * | 2018-03-20 | 2018-09-21 | 平安科技(深圳)有限公司 | Audio recognition method, server and computer readable storage medium |
Non-Patent Citations (6)
Title |
---|
SUYOUN KIM ET AL.: "Joint CTC-attention based end-to-end speech recognition using multi-task learning", ICASSP 2017 * |
DIEDERIK P. KINGMA ET AL.: "Adam: A method for stochastic optimization", ICLR 2015 * |
SYLVAIN ARLOT: "A survey of cross-validation procedures for model selection", STATISTICS SURVEYS * |
TARA N. SAINATH ET AL.: "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks", ICASSP 2015 * |
LI GANG ET AL.: "Cross-validation intelligent optimization of hyperparameters in supervised machine learning", JOURNAL OF XI'AN TECHNOLOGICAL UNIVERSITY * |
LI RUIQI ET AL.: "A support vector machine based method for assessing the state of health of lithium batteries", 17TH CCSSTAE 2016 * |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110148408A (en) * | 2019-05-29 | 2019-08-20 | 上海电力学院 | A kind of Chinese speech recognition method based on depth residual error |
CN110309771B (en) * | 2019-06-28 | 2023-03-24 | 南京丰厚电子有限公司 | GBDT-INSGAII-based EAS (Acoustic magnetic System) label identification algorithm |
CN110309771A (en) * | 2019-06-28 | 2019-10-08 | 南京丰厚电子有限公司 | A kind of EAS sound magnetic system tag recognition algorithm based on GBDT-INSGAII |
CN110443127A (en) * | 2019-06-28 | 2019-11-12 | 天津大学 | In conjunction with the musical score image recognition methods of residual error convolutional coding structure and Recognition with Recurrent Neural Network |
CN110335591A (en) * | 2019-07-04 | 2019-10-15 | 广州云从信息科技有限公司 | A kind of parameter management method, device, machine readable media and equipment |
CN110600053A (en) * | 2019-07-30 | 2019-12-20 | 广东工业大学 | Cerebral stroke dysarthria risk prediction method based on ResNet and LSTM network |
CN110659773A (en) * | 2019-09-16 | 2020-01-07 | 杭州师范大学 | Flight delay prediction method based on deep learning |
CN110751944A (en) * | 2019-09-19 | 2020-02-04 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for constructing voice recognition model |
WO2021051628A1 (en) * | 2019-09-19 | 2021-03-25 | 平安科技(深圳)有限公司 | Method, apparatus and device for constructing speech recognition model, and storage medium |
CN110634476B (en) * | 2019-10-09 | 2022-06-14 | 深圳大学 | Method and system for rapidly building robust acoustic model |
CN110634476A (en) * | 2019-10-09 | 2019-12-31 | 深圳大学 | Method and system for rapidly building robust acoustic model |
CN110942090A (en) * | 2019-11-11 | 2020-03-31 | 北京迈格威科技有限公司 | Model training method, image processing method, device, electronic equipment and storage medium |
CN110942090B (en) * | 2019-11-11 | 2024-03-29 | 北京迈格威科技有限公司 | Model training method, image processing device, electronic equipment and storage medium |
CN111009235A (en) * | 2019-11-20 | 2020-04-14 | 武汉水象电子科技有限公司 | Voice recognition method based on CLDNN + CTC acoustic model |
US11250854B2 (en) * | 2019-11-25 | 2022-02-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for voice interaction, device and computer-readable storage medium |
CN110992940A (en) * | 2019-11-25 | 2020-04-10 | 百度在线网络技术(北京)有限公司 | Voice interaction method, device, equipment and computer-readable storage medium |
CN111092798A (en) * | 2019-12-24 | 2020-05-01 | 东华大学 | Wearable system based on spoken language understanding |
CN111092798B (en) * | 2019-12-24 | 2021-06-11 | 东华大学 | Wearable system based on spoken language understanding |
CN111243624A (en) * | 2020-01-02 | 2020-06-05 | 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) | Method and system for evaluating personnel state |
CN111243624B (en) * | 2020-01-02 | 2023-04-07 | 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) | Method and system for evaluating personnel state |
CN111429947A (en) * | 2020-03-26 | 2020-07-17 | 重庆邮电大学 | Speech emotion recognition method based on multi-stage residual convolutional neural network |
CN111429947B (en) * | 2020-03-26 | 2022-06-10 | 重庆邮电大学 | Speech emotion recognition method based on multi-stage residual convolutional neural network |
CN111401530B (en) * | 2020-04-22 | 2021-04-09 | 上海依图网络科技有限公司 | Training method for neural network of voice recognition device |
WO2021212684A1 (en) * | 2020-04-22 | 2021-10-28 | 上海依图网络科技有限公司 | Recurrent neural network and training method therefor |
CN111401530A (en) * | 2020-04-22 | 2020-07-10 | 上海依图网络科技有限公司 | Recurrent neural network and training method thereof |
CN111898734A (en) * | 2020-07-10 | 2020-11-06 | 中国科学院精密测量科学与技术创新研究院 | NMR (nuclear magnetic resonance) relaxation time inversion method based on MLP (Multi-layer linear programming) |
CN111898734B (en) * | 2020-07-10 | 2023-06-23 | 中国科学院精密测量科学与技术创新研究院 | NMR relaxation time inversion method based on MLP |
CN112289309A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Robot voice control method based on deep learning |
CN112651313A (en) * | 2020-12-17 | 2021-04-13 | 国网上海市电力公司 | Equipment nameplate double-intelligent identification method, storage medium and terminal |
CN112560453B (en) * | 2020-12-18 | 2023-07-14 | 平安银行股份有限公司 | Voice information verification method and device, electronic equipment and medium |
CN112560453A (en) * | 2020-12-18 | 2021-03-26 | 平安银行股份有限公司 | Voice information verification method and device, electronic equipment and medium |
CN112652296A (en) * | 2020-12-23 | 2021-04-13 | 北京华宇信息技术有限公司 | Streaming voice endpoint detection method, device and equipment |
CN112669827A (en) * | 2020-12-28 | 2021-04-16 | 清华大学 | Joint optimization method and system for automatic speech recognizer |
CN112669827B (en) * | 2020-12-28 | 2022-08-02 | 清华大学 | Joint optimization method and system for automatic speech recognizer |
CN112904220A (en) * | 2020-12-30 | 2021-06-04 | 厦门大学 | UPS (uninterrupted Power supply) health prediction method and system based on digital twinning and machine learning, electronic equipment and storable medium |
CN113327590A (en) * | 2021-04-15 | 2021-08-31 | 中标软件有限公司 | Speech recognition method |
CN113270097B (en) * | 2021-05-18 | 2022-05-17 | 成都傅立叶电子科技有限公司 | Unmanned mechanical control method, radio station voice instruction conversion method and device |
CN113270097A (en) * | 2021-05-18 | 2021-08-17 | 成都傅立叶电子科技有限公司 | Unmanned mechanical control method, radio station voice instruction conversion method and device |
CN113569992A (en) * | 2021-08-26 | 2021-10-29 | 中国电子信息产业集团有限公司第六研究所 | Abnormal data identification method and device, electronic equipment and storage medium |
CN113569992B (en) * | 2021-08-26 | 2024-01-09 | 中国电子信息产业集团有限公司第六研究所 | Abnormal data identification method and device, electronic equipment and storage medium |
CN113852434A (en) * | 2021-09-18 | 2021-12-28 | 中山大学 | LSTM and ResNet assisted deep learning end-to-end intelligent communication method and system |
CN113852434B (en) * | 2021-09-18 | 2023-07-25 | 中山大学 | LSTM and ResNet-assisted deep learning end-to-end intelligent communication method and system |
CN114550706A (en) * | 2022-02-21 | 2022-05-27 | 苏州市职业大学 | Smart campus voice recognition method based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN109767759B (en) | 2020-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109767759A (en) | End-to-end speech recognition method based on a modified CLDNN structure | |
CN110556100B (en) | Training method and system for an end-to-end speech recognition model | |
CN109003601A (en) | Cross-language end-to-end speech recognition method for the low-resource Tujia language | |
CN110222163A (en) | Intelligent question-answering method and system fusing CNN and bidirectional LSTM | |
CN110706692B (en) | Training method and system for a children's speech recognition model | |
CN109801621A (en) | Speech recognition method based on residual gated recurrent units | |
CN112509564A (en) | End-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism | |
CN110444208A (en) | Speech recognition attack defense method and device based on gradient estimation and the CTC algorithm | |
CN106782511A (en) | Speech recognition method using a rectified linear deep autoencoder network | |
CN109189925A (en) | Word-vector model based on mutual information and CNN-based text classification method | |
CN107408384A (en) | Deployed end-to-end speech recognition | |
CN109063820A (en) | Data processing method using a joint time-frequency long-duration recurrent neural network | |
CN103531199A (en) | Ecological sound recognition method based on rapid sparse decomposition and deep learning | |
CN110321418A (en) | Deep-learning-based domain and intent recognition and slot filling method | |
CN110379418A (en) | Speech adversarial sample generation method | |
CN109637526A (en) | Adaptation method for DNN acoustic models based on personal identification features | |
CN107818080A (en) | Term recognition method and device | |
CN109448706A (en) | Neural network language model compression method and system | |
CN111899766B (en) | Speech emotion recognition method based on optimized fusion of deep features and acoustic features | |
CN110009025A (en) | Semi-supervised additive-noise autoencoder for speech lie detection | |
CN108461080A (en) | Acoustic modeling method and apparatus based on HLSTM models | |
CN111882042A (en) | Automatic search method, system and medium for liquid state machine neural network architectures | |
Shi et al. | Construction of English Pronunciation Judgment and Detection Model Based on Deep Learning Neural Networks Data Stream Fusion | |
Jin et al. | Research on objective evaluation of recording audio restoration based on deep learning network | |
CN108388942A (en) | Intelligent information processing method based on big data | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |