CN109767759A - End-to-end speech recognition method based on a modified CLDNN structure - Google Patents

End-to-end speech recognition method based on a modified CLDNN structure

Info

Publication number
CN109767759A
CN109767759A (application CN201910115486.6A)
Authority
CN
China
Prior art keywords
model
cldnn
network
rate
gradient
Prior art date
Legal status
Granted
Application number
CN201910115486.6A
Other languages
Chinese (zh)
Other versions
CN109767759B (en)
Inventors
Feng Yujie (冯昱劼)
Zhang Yi (张毅)
Xu Xuan (徐轩)
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201910115486.6A
Publication of CN109767759A
Application granted
Publication of CN109767759B
Legal status: Active

Links

Abstract

The invention claims an end-to-end speech recognition method based on a modified CLDNN structure. The traditional CLDNN structure commonly used for speech recognition processes the timing information in the speech signal with a fully connected LSTM (Long Short-Term Memory) model, which is prone to over-fitting during training and degrades the learning effect. Deeper models often perform better, but increasing model depth by simply stacking network layers brings vanishing gradients, exploding gradients and the "degradation" problem. To address these issues, the invention proposes a modified CLDNN structure: a residual network is combined with ConvLSTM to build a residual ConvLSTM model, which replaces the fully connected LSTM model in the traditional CLDNN structure. This structure remedies the problems of the traditional CLDNN model, and model depth can be increased by stacking residual ConvLSTM blocks without vanishing gradients, exploding gradients or "degradation", yielding a better-performing speech recognition system.

Description

End-to-end speech recognition method based on a modified CLDNN structure
Technical field
The invention belongs to the field of speech recognition, and in particular relates to a speech recognition method based on deep learning.
Background technique
Automatic speech recognition has long held an important position in the field of artificial intelligence. Traditional speech recognition technology, represented by the HMM-GMM model, was the mainstream approach and dominated the field for decades. In recent years, thanks to breakthroughs in deep learning, automatic speech recognition has entered a stage of rapid development. End-to-end speech recognition systems based on deep learning have now surpassed legacy speech recognition systems in popularity in academia, and have begun to gradually replace them in actual production.
Since the 1980s, acoustic models based on the Gaussian Mixture Model/Hidden Markov Model (GMM/HMM) have been widely used: the HMM handles temporal variation in speech, while the GMM maps acoustic input to hidden Markov states. In recent years, acoustic models based on Deep Neural Networks (DNN) have been shown to perform better on large-vocabulary speech recognition tasks, the activity of a large number of neurons modelling acoustic features more effectively. However, because a DNN is fully connected, it cannot fully exploit the local structure of the speech feature space. A Convolutional Neural Network (CNN) can use its translation invariance to overcome the variability of the speech signal itself, and explains variation in the speech feature space well. A Recurrent Neural Network (RNN) mines context-related information in a sequence through recurrence, overcoming shortcomings of the DNN to some extent, but RNNs are prone to vanishing gradients during training and struggle to remember long-range information. Long Short-Term Memory (LSTM) units use specific gating units to preserve the error at the current moment and transmit it selectively to specific units, thereby avoiding the vanishing-gradient problem. Connectionist Temporal Classification (CTC), proposed by Graves et al. in 2006, can be applied to end-to-end speech recognition systems: it captures the correspondence between the speech feature sequence and the phoneme sequence without relying on manual alignment of features and phonemes.
Technology companies at home and abroad are continually developing their own end-to-end speech recognition models. Baidu researchers published Deep Speech in 2015 and Deep Speech 2 in 2016, both of which build speech recognition models by combining CLDNN with CTC and achieve excellent performance. In 2016 the iFLYTEK research team proposed the Deep Fully Convolutional Neural Network (DFCNN) structure, which models whole-sentence speech with a large stack of convolutional and pooling layers, greatly strengthening the expressive power of the CNN. By accumulating many such convolution-pooling pairs, DFCNN can see a very long history as well as future information, which guarantees that it expresses the long-term correlation of speech well and makes it more robust than RNN structures. IBM researchers published an article at ICASSP 2016 stating that, using 3x3 convolution kernels with pooling layers after multiple convolutional layers, a 14-layer Deep CNN model (including fully connected layers) can be trained; on the Switchboard dataset this model brings a relative WER reduction of about 10.6% compared with traditional CNN approaches. The MSRA team proposed the residual network in 2015, solving the "degradation" problem that appears as model depth grows; residual networks were later also applied to speech recognition models with demonstrated effect. At ICASSP 2017 the Google research team presented an acoustic model structure combining Network-in-Network (NiN), Batch Normalization (BN) and Convolutional LSTM (ConvLSTM); without a language model, it reached a WER of 10.5% on the WSJ speech recognition task.
Owing to its simple construction and excellent performance, CLDNN has always been a popular structure among end-to-end speech recognition models. However, the depth of the common CLDNN model is insufficient and the features it extracts are not rich enough, so the resulting speech recognition model cannot achieve the best effect. The fully connected long short-term memory model (FC-LSTM) within it cannot preserve the local structure of the speech feature space and is prone to over-fitting.
Summary of the invention
The present invention aims to solve the above problems of the prior art. It proposes an end-to-end speech recognition method based on a modified CLDNN structure which effectively solves the over-fitting problem caused by the LSTM in the traditional CLDNN, and overcomes the vanishing-gradient, exploding-gradient and "degradation" problems brought by increasing model depth. The technical scheme of the invention is as follows:
An end-to-end speech recognition method based on a modified CLDNN structure, comprising the following steps:
S1, obtain a voice data set and divide it into a training set, a cross-validation set and a test set;
S2, pre-process all voice data to obtain the Mel-frequency cepstral coefficients (MFCC) of the voice signal;
S3, build the modified CLDNN network model, comprising a speech feature abstraction part composed of a convolutional neural network (CNN), a residual convolutional long short-term memory model that processes the timing information of the voice signal, and a deep neural network (DNN) that maps the processed feature space to the output layer;
S4, build the speech recognition loss function, using the CTC loss;
S5, train the modified CLDNN model of step S3 on the training set, using the Adam optimizer to optimize the objective function of step S4;
S6, perform cross-validation on the model trained in step S5, adjust the hyper-parameters of the model, and obtain the final network model.
Further, the pre-processing of step S2 includes: pre-emphasis, framing, windowing, Fast Fourier Transform (FFT), Mel filtering and discrete cosine transform.
Further, the residual convolutional long short-term memory model in step S3 is obtained as follows: the matrix products in the fully connected long short-term memory model are replaced with convolution operations to obtain the convolutional long short-term memory model (ConvLSTM), and a residual network structure is applied to this model to obtain the residual ConvLSTM model.
Further, the residual network structure is used to construct a deep network: skip connections directly connect the shallow layers and the deep layers so that the gradient can be better propagated back to the shallow layers. The residual network is composed of multiple residual blocks, and a deep residual network structure composed of multiple residual blocks replaces the multi-layer LSTM (long short-term memory) structure in the traditional CLDNN model.
Further, in step S4 the loss function uses the CTC loss, specifically:
Assume the size of the label alphabet L is K. Given an input sequence X = (x_1, x_2, ..., x_T) and the corresponding output label sequence Y = (y_1, y_2, ..., y_U), the task of CTC is to feed the loss value back to the neural network and, by adjusting the internal parameters of the network, maximize the log probability of the output label given the input sequence, i.e. max(ln P(Y|X)). CTC (Connectionist Temporal Classification) also introduces a blank label to represent frames that map to no element of the label alphabet L.
The softmax layer after the last DNN layer serves as the input to CTC; the softmax output contains K+1 nodes mapped to the elements of L ∪ {blank}. The probability of a complete CTC path is:
P(p|X) = ∏_{t=1}^{T} z_t^{p_t}
where z_t is the softmax output vector at time t and z_t^k is the posterior probability of the k-th label. To solve the alignment problem between the softmax output and the label sequence, a CTC path p = (p_1, p_2, ..., p_T) in one-to-one correspondence with the input sequence at the frame level is introduced. A CTC path p is mapped onto the label sequence Y by the mapping Φ; since this mapping is one-to-many, one label sequence can correspond to multiple CTC paths, so the probability of the label sequence Y is the sum of the probabilities of all CTC paths corresponding to it:
P(Y|X) = Σ_{p ∈ Φ^{−1}(Y)} P(p|X)
The loss function of CTC is defined as the sum of the negative log probabilities of the correct labelings of the training samples:
L_CTC = − Σ_{(X,Y)} ln P(Y|X)
Further, step S5 uses the Adam optimizer to optimize the objective function of step S4.
Compute the gradient at time step t:
g_t = ∇_θ J(θ_{t−1})
First, compute the exponential moving average of the gradient, with m_0 initialized to 0; this aggregates the gradient momentum of the previous time steps. The coefficient β1 is the exponential decay rate controlling the weighting between the momentum and the current gradient; it usually takes a value close to 1 and defaults to 0.9:
m_t = β1·m_{t−1} + (1 − β1)·g_t
Second, compute the exponential moving average of the squared gradient, with v_0 initialized to 0. The coefficient β2 is the exponential decay rate controlling the influence of previous squared gradients; it defaults to 0.999:
v_t = β2·v_{t−1} + (1 − β2)·g_t²
Third, because m_0 is initialized to 0, m_t is biased toward 0, especially in the early stage of training; bias correction is therefore applied to the gradient mean m_t to reduce the influence of this bias:
m̂_t = m_t / (1 − β1^t)
Fourth, because v_0 is initialized to 0, v_t is likewise biased toward 0 in the early stage of training and is corrected:
v̂_t = v_t / (1 − β2^t)
Fifth, update the parameters: the initial learning rate α is multiplied by the ratio of the corrected gradient mean to the square root of the corrected gradient variance, with default learning rate α = 0.001 and ε = 10^−8:
θ_t = θ_{t−1} − α·m̂_t / (√v̂_t + ε)
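For reference, the five steps above can be restated as a minimal NumPy sketch; the function name adam_step and the variable names are illustrative, not taken from the patent:

import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta, given gradient g at time step t >= 1."""
    m = beta1 * m + (1 - beta1) * g            # step 1: gradient mean m_t
    v = beta2 * v + (1 - beta2) * g * g        # step 2: squared-gradient average v_t
    m_hat = m / (1 - beta1 ** t)               # step 3: bias correction of m_t
    v_hat = v / (1 - beta2 ** t)               # step 4: bias correction of v_t
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # step 5: parameter update
    return theta, m, v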
Further, step S6 performs cross-validation on the model trained in step S5 and adjusts the hyper-parameters of the model to obtain the final network model, specifically:
Cross-validation steps:
1. Initialize the weights with random values between −0.5 and 0.5.
2. Divide the learning sample space C into N parts (a code sketch of this partition is given after step 7).
3. Take N−1 parts from the learning data file in the prescribed order as training data samples; the remaining part serves as validation data samples. Complete steps 4 to 7.
4. Read in one sample from the training data samples and train on it.
5. Compute the total output error EP of this sample; modify the weights of the two layers until EP < ε (ε being the prescribed error measure), then read in the next training sample.
6. When all samples of the N−1 training parts have been learned, a set of weights is produced; compute the validation samples with this set of weights and calculate the validation success rate RATE = (number of validation samples satisfying EP < ε) / (total number of validation samples).
7. If the validation success rate RATE > rate (rate being the prescribed success rate), end this round of learning; otherwise learn all the validation samples.
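A minimal Python sketch of the N-fold partition of steps 2 and 3 follows; the function name n_fold_indices and the random shuffling are assumptions, and the per-sample training and success-rate test of steps 4 to 7 are only indicated:

import numpy as np

def n_fold_indices(num_samples, n_folds, seed=0):
    """Split sample indices into n_folds parts; each part serves once as validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_samples)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, val

# Example: 5-fold split of 1000 samples
for train_idx, val_idx in n_fold_indices(1000, 5):
    pass  # steps 4-7 (per-sample training and RATE computation) would run here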
Hyper-parameters:
Learning rate: the learning rate is the step size by which the network weights are updated in the optimization algorithm; different optimization algorithms call for different learning rates. An over-large learning rate may keep the model from converging, the loss function oscillating up and down; an over-small one makes convergence slow and requires longer training. Typical values are 0.01, 0.001 and 0.0001.
Batch size: the batch size is the number of samples fed into the model at each training step of the neural network. In convolutional neural networks a large batch usually makes the network converge faster, but owing to the limits of memory resources an over-large batch may exhaust memory or crash the program kernel. Typical values are 16, 32, 64 and 128.
Number of iterations: the number of iterations is the number of times the entire training set is fed through the neural network for training. When the validation error rate and the training error rate differ little, the current number of iterations is considered suitable; if the validation error first decreases and then increases, the number of iterations is too large and should be reduced, otherwise over-fitting easily occurs.
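As a reading aid, the typical values above can be organised into a hypothetical search grid; the epoch values are assumptions, since the text fixes no iteration counts:

from itertools import product

grid = {
    "learning_rate": [0.01, 0.001, 0.0001],
    "batch_size": [16, 32, 64, 128],
    "epochs": [10, 20, 40],          # illustrative; not values fixed by the patent
}
for lr, bs, ep in product(*grid.values()):
    # train a candidate model with (lr, bs, ep) and score it by cross-validation
    pass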
The advantages and beneficial effects of the present invention are as follows:
The invention introduces the Convolutional Long Short-Term Memory (ConvLSTM) model to replace the FC-LSTM in the common CLDNN model, remedying the model's inability to preserve spatial structure locality and its tendency to over-fit. To deepen the model without "degradation", vanishing gradients or exploding gradients, the invention also introduces the residual network (ResNet). In order to stack multiple ConvLSTM layers to improve model performance without vanishing gradients, exploding gradients or "degradation", the invention fuses ConvLSTM with the residual network structure; the residual ConvLSTM block structure is shown in Fig. 1. On this basis, the invention improves the traditional CLDNN structure: to address the facts that the fully connected long short-term memory model in the traditional CLDNN model cannot preserve the local structure of the feature space and over-fits easily, a deep residual ConvLSTM network structure composed of multiple residual ConvLSTM blocks replaces the multi-layer LSTM structure in the traditional CLDNN model, giving the model better performance on the temporal relationships in speech features and making it less prone to over-fitting. The improved CNN-ResConvLSTM-DNN model can be made deeper by stacking more residual ConvLSTM blocks without vanishing gradients, exploding gradients or "degradation", and can deliver better performance in speech recognition tasks; its structure is shown in Fig. 2.
Detailed description of the invention
Fig. 1 shows the residual convolutional long short-term memory block structure of the preferred embodiment provided by the present invention;
Fig. 2 shows the modified CLDNN model structure proposed by the present invention;
Fig. 3 is the flow chart of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the drawings in the embodiments. The described embodiments are only a part of the embodiments of the present invention.
The technical solution that the present invention solves above-mentioned technical problem is:
S1, divide the voice data set into a training set, a cross-validation set and a test set;
S2, pre-process all the data to obtain the Mel-frequency cepstral coefficients (MFCC) of the voice signal; the pre-processing steps are:
Pre-emphasis: pass the signal through the high-pass filter H(z) = 1 − μz^(−1).
Framing: divide the whole voice signal into frames of 30 ms with a frame shift of 10 ms.
Windowing: apply a Hamming window to each frame signal:
S′(n) = S(n) × W(n)
Fast Fourier Transform: apply the FFT to each frame to obtain the energy distribution over the spectrum.
Mel filtering: pass the energy spectrum through a group of Mel filters; the frequency response of a single triangular filter with centre frequency f(m) is defined as:
H_m(k) = 0, for k < f(m−1);
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1).
Compute the logarithmic energy output by each filter group:
s(m) = ln( Σ_{k=0}^{N−1} |X(k)|² · H_m(k) ), 0 ≤ m ≤ M
Discrete cosine transform, giving the cepstral coefficients:
C(n) = Σ_{m=0}^{M−1} s(m) · cos( πn(m + 0.5) / M ), n = 1, 2, ..., where M is the number of triangular filters.
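The whole pre-processing chain of step S2 can be sketched in NumPy/SciPy as follows. The frame length (30 ms), frame shift (10 ms) and step order follow the text; the pre-emphasis coefficient, FFT size, filter count and MFCC order are illustrative defaults, not values fixed by the patent:

import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=16000, frame_ms=30, hop_ms=10, n_filt=26, n_ceps=13, mu=0.97):
    """Pre-emphasis -> framing -> Hamming window -> FFT -> Mel filtering -> log -> DCT."""
    # Pre-emphasis with H(z) = 1 - mu*z^-1 (mu = 0.97 is an assumed default)
    sig = np.append(signal[0], signal[1:] - mu * np.asarray(signal[:-1], dtype=float))
    # Framing: 30 ms frames with a 10 ms shift (assumes len(sig) >= frame length)
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(sig) - flen) // hop
    frames = np.stack([sig[i * hop: i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)             # windowing: S'(n) = S(n) * W(n)
    # FFT power spectrum (nfft = 512 is an illustrative choice)
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular Mel filter bank, evenly spaced on the Mel scale
    mel_max = 2595 * np.log10(1 + fs / 2 / 700)
    hz = 700 * (10 ** (np.linspace(0, mel_max, n_filt + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * hz / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log energy of each filter output, then DCT to obtain the cepstral coefficients
    log_e = np.log(power @ fbank.T + 1e-10)
    return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]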
S3, build the modified CLDNN network model. The model comprises a speech feature abstraction part composed of a convolutional neural network (CNN), a residual convolutional long short-term memory (ResConvLSTM) part that processes the timing information of the voice signal, and a deep neural network (DNN) part that maps the processed feature space to the output layer.
The convolutional long short-term memory model is an extension of the fully connected long short-term memory model in which both the input-to-state and the state-to-state transitions have a convolutional structure. Compared with a plain CNN, this structure expresses the temporal relationships of features better, and compared with the fully connected LSTM it is less prone to over-fitting. Its equations are:
i_t = σ(W_xi ∗ x_t + W_hi ∗ h_{t−1} + b_i)
f_t = σ(W_xf ∗ x_t + W_hf ∗ h_{t−1} + b_f)
o_t = σ(W_xo ∗ x_t + W_ho ∗ h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W_xc ∗ x_t + W_hc ∗ h_{t−1} + b_c)
h_t = o_t ∘ tanh(c_t)
where σ is the sigmoid activation function; i_t, f_t, o_t, c_t and h_t denote the input gate, forget gate, output gate, cell activation and cell output vector at time t respectively; ∗ denotes convolution and ∘ the element-wise product of vectors; W denotes the weight matrices connecting the different gates and b the corresponding bias vectors.
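A minimal PyTorch sketch of one ConvLSTM step implementing the equations above; the kernel size and the fused four-gate convolution are implementation choices, not specified by the patent:

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM time step: the matrix products of FC-LSTM become convolutions."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        # A single convolution yields the pre-activations of all four gates at once.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state                                  # h_{t-1} and c_{t-1}
        i, f, o, g = torch.chunk(self.conv(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                 # cell update c_t
        h = o * torch.tanh(c)                         # output h_t = o_t ∘ tanh(c_t)
        return h, c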
The residual network structure constructs a deep network by directly connecting shallow layers and deep layers through skip connections, so that the gradient can be better propagated back to the shallow layers. The residual network is composed of multiple residual blocks. If the input of a residual block is x_l and its output is x_{l+1}, the structure of the residual block can be expressed as:
x_{l+1} = x_l + F(x_l, w_l) (9)
F(x_l, w_l) = w_l · σ(w_{l−1} · x_l) (10)
where σ is the activation function. Hence, for any deeper layer L:
x_L = x_l + Σ_{i=l}^{L−1} F(x_i, w_i) (11)
Assuming a loss function C, we obtain:
∂C/∂x_l = (∂C/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L−1} F(x_i, w_i)) (12)
Here the factor ∂C/∂x_L guarantees that information can be passed back directly to any layer x_l, and the additive term 1 guarantees that the network does not suffer from vanishing gradients.
In order to stack multiple convolutional long short-term memory layers to improve the performance of the model without vanishing gradients, exploding gradients or "degradation", the present invention fuses the convolutional long short-term memory model with the residual network structure; the residual convolutional long short-term memory block structure is shown in Fig. 1.
On this basis, the present invention improves the traditional CLDNN structure. To address the facts that the fully connected long short-term memory model in the traditional CLDNN model cannot preserve the local structure of the feature space and over-fits easily, a deep residual ConvLSTM network structure composed of multiple residual ConvLSTM blocks replaces the multi-layer LSTM structure in the traditional CLDNN model, giving the model better performance on the temporal relationships in speech features and making it less prone to over-fitting. The improved CNN-ResConvLSTM-DNN model can be made deeper by stacking more residual ConvLSTM blocks without vanishing gradients, exploding gradients or "degradation", and can achieve better performance in speech recognition tasks; its structure is shown in Fig. 2.
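Since Fig. 1 is not reproduced here, the following PyTorch sketch gives one plausible reading of the residual ConvLSTM block, reusing the ConvLSTMCell above: the block input is added to the ConvLSTM output at every time step, realising x_{l+1} = x_l + F(x_l, w_l). The zero-initialised states and the tensor layout are assumptions:

import torch
import torch.nn as nn

class ResConvLSTMBlock(nn.Module):
    """Residual ConvLSTM block: x_{l+1} = x_l + F(x_l), with F a ConvLSTM layer."""
    def __init__(self, channels, kernel=3):
        super().__init__()
        self.cell = ConvLSTMCell(channels, channels, kernel)

    def forward(self, x_seq):                     # x_seq: (T, B, C, H, W)
        T, B, C, H, W = x_seq.shape
        h = x_seq.new_zeros(B, C, H, W)           # zero-initialised states (assumption)
        c = x_seq.new_zeros(B, C, H, W)
        out = []
        for t in range(T):
            h, c = self.cell(x_seq[t], (h, c))
            out.append(x_seq[t] + h)              # identity skip connection
        return torch.stack(out)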
S4, build the objective function, i.e. the speech recognition word error rate (WER%); the loss function uses the CTC loss.
Assume the size of the label alphabet L is K. Given an input sequence X = (x_1, x_2, ..., x_T) and the corresponding output label sequence Y = (y_1, y_2, ..., y_U), the task of CTC is to feed the loss value back to the neural network and, by adjusting the internal parameters of the network, maximize the log probability of the output label given the input sequence, i.e. max(ln P(Y|X)). CTC also introduces a blank label to represent frames, such as pauses or coughs, that map to no element of the label alphabet L.
The softmax layer after the last DNN layer serves as the input to CTC; the softmax output contains K+1 nodes mapped to the elements of L ∪ {blank}. The probability of a complete CTC path is:
P(p|X) = ∏_{t=1}^{T} z_t^{p_t}
where z_t is the softmax output vector at time t and z_t^k is the posterior probability of the k-th label. To solve the alignment problem between the softmax output and the label sequence, a CTC path p = (p_1, p_2, ..., p_T) in one-to-one correspondence with the input sequence at the frame level is introduced. A CTC path p is mapped onto the label sequence Y by the mapping Φ; since this mapping is one-to-many, one label sequence can correspond to multiple CTC paths, so the probability of the label sequence Y is the sum of the probabilities of all CTC paths corresponding to it:
P(Y|X) = Σ_{p ∈ Φ^{−1}(Y)} P(p|X)
The loss function of CTC is defined as the sum of the negative log probabilities of the correct labelings of the training samples:
L_CTC = − Σ_{(X,Y)} ln P(Y|X)
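PyTorch's nn.CTCLoss implements this negative-log-probability loss directly; the snippet below shows the expected tensor shapes. All sizes are illustrative, and assigning index 0 to the blank label is an assumption:

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                  # index 0 plays the role of the blank label here
T, B, K = 100, 8, 29                       # frames, batch size, |L| + 1 (illustrative)
log_probs = torch.randn(T, B, K, requires_grad=True).log_softmax(2)  # last-layer softmax output
targets = torch.randint(1, K, (B, 20))     # label sequences Y, containing no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)  # -ln P(Y|X), batch-reduced
loss.backward()                            # loss value fed back to the network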
S5, train the model on the training set, using the Adam optimizer to optimize the objective function;
S6, perform cross-validation on the trained model using the validation set, adjust the hyper-parameters of the model, and obtain the final network model.
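Putting steps S4 and S5 together, one training step might look as follows; the single linear layer is only a stand-in for the CNN-ResConvLSTM-DNN of Fig. 2, all shapes are illustrative, and Adam uses the defaults discussed above (α = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e−8):

import torch
import torch.nn as nn

T, B, NF, K = 100, 8, 13, 29               # frames, batch, MFCC order, |L| + 1 (all illustrative)
model = nn.Linear(NF, K)                   # stand-in only; NOT the CNN-ResConvLSTM-DNN of Fig. 2
ctc_loss = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8)

feats = torch.randn(T, B, NF)              # a batch of MFCC frames
targets = torch.randint(1, K, (B, 20))
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

log_probs = model(feats).log_softmax(dim=2)        # (T, B, K) per-frame log-probabilities
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
optimizer.zero_grad()
loss.backward()
optimizer.step()                           # one Adam update of step S5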
The above embodiments should be understood as merely illustrating the present invention rather than limiting its scope. After reading the contents recorded herein, a skilled person can make various changes or modifications to the present invention, and such equivalent changes and modifications likewise fall within the scope of the claims of the present invention.

Claims (7)

1. An end-to-end speech recognition method based on a modified CLDNN structure, characterized by comprising the following steps:
S1, obtaining a voice data set and dividing it into a training set, a cross-validation set and a test set;
S2, pre-processing all voice data to obtain the Mel-frequency cepstral coefficients (MFCC) of the voice signal;
S3, building the modified CLDNN network model, comprising a speech feature abstraction part composed of a convolutional neural network (CNN), a residual convolutional long short-term memory model that processes the timing information of the voice signal, and a deep neural network (DNN) that maps the processed feature space to the output layer;
S4, building the loss function for speech recognition, the loss function using the CTC loss;
S5, training the modified CLDNN model of step S3 on the training set, using the Adam optimizer to optimize the objective function of step S4;
S6, performing cross-validation on the model trained in step S5 using the validation set, adjusting the hyper-parameters of the model, and obtaining the final network model.
2. The end-to-end speech recognition method based on a modified CLDNN structure according to claim 1, characterized in that the pre-processing of step S2 comprises: pre-emphasis, framing, windowing, Fast Fourier Transform (FFT), Mel filtering and discrete cosine transform.
3. The end-to-end speech recognition method based on a modified CLDNN structure according to claim 1, characterized in that the residual convolutional long short-term memory model in step S3 is obtained as follows: the matrix products in the fully connected long short-term memory model are replaced with convolution operations to obtain the convolutional long short-term memory model, and a residual network structure is applied to this model to obtain the residual convolutional long short-term memory model.
4. The end-to-end speech recognition method based on a modified CLDNN structure according to claim 3, characterized in that the residual network structure is used to construct a deep network: skip connections directly connect the shallow layers and the deep layers so that the gradient can be better propagated back to the shallow layers; the residual network is composed of multiple residual blocks, and a deep residual network structure composed of multiple residual blocks replaces the multi-layer LSTM (long short-term memory) structure in the traditional CLDNN model.
5. The end-to-end speech recognition method based on a modified CLDNN structure according to claim 3, characterized in that in step S4 the loss function uses the CTC loss, specifically:
assume the size of the label alphabet L is K; given an input sequence X = (x_1, x_2, ..., x_T) and the corresponding output label sequence Y = (y_1, y_2, ..., y_U), the task of CTC is to feed the loss value back to the neural network and, by adjusting the internal parameters of the network, maximize the log probability of the output label given the input sequence, i.e. max(ln P(Y|X)); CTC (Connectionist Temporal Classification) also introduces a blank label to represent frames that map to no element of the label alphabet L;
the softmax layer after the last DNN layer serves as the input to CTC, the softmax output containing K+1 nodes mapped to the elements of L ∪ {blank}; the probability of a complete CTC path is:
P(p|X) = ∏_{t=1}^{T} z_t^{p_t}
where z_t is the softmax output vector at time t and z_t^k is the posterior probability of the k-th label; to solve the alignment problem between the softmax output and the label sequence, a CTC path p = (p_1, p_2, ..., p_T) in one-to-one correspondence with the input sequence at the frame level is introduced; a CTC path p is mapped onto the label sequence Y by the mapping Φ; since this mapping is one-to-many, one label sequence can correspond to multiple CTC paths, so the probability of the label sequence Y is the sum of the probabilities of all CTC paths corresponding to it:
P(Y|X) = Σ_{p ∈ Φ^{−1}(Y)} P(p|X)
the loss function of CTC is defined as the sum of the negative log probabilities of the correct labelings of the training samples:
L_CTC = − Σ_{(X,Y)} ln P(Y|X).
6. The end-to-end speech recognition method based on a modified CLDNN structure according to claim 5, characterized in that step S5 uses the Adam optimizer to optimize the objective function of step S4:
compute the gradient at time step t:
g_t = ∇_θ J(θ_{t−1})
first, compute the exponential moving average of the gradient, with m_0 initialized to 0, aggregating the gradient momentum of the previous time steps; the coefficient β1 is the exponential decay rate controlling the weighting between the momentum and the current gradient, usually close to 1, with default 0.9:
m_t = β1·m_{t−1} + (1 − β1)·g_t
second, compute the exponential moving average of the squared gradient, with v_0 initialized to 0; the coefficient β2 is the exponential decay rate controlling the influence of previous squared gradients, with default 0.999:
v_t = β2·v_{t−1} + (1 − β2)·g_t²
third, because m_0 is initialized to 0, m_t is biased toward 0, especially in the early stage of training, so bias correction is applied to the gradient mean m_t to reduce the influence of this bias:
m̂_t = m_t / (1 − β1^t)
fourth, because v_0 is initialized to 0, v_t is likewise biased toward 0 in the early stage of training and is corrected:
v̂_t = v_t / (1 − β2^t)
fifth, update the parameters by multiplying the initial learning rate α by the ratio of the corrected gradient mean to the square root of the corrected gradient variance, with default learning rate α = 0.001 and ε = 10^−8:
θ_t = θ_{t−1} − α·m̂_t / (√v̂_t + ε).
7. The end-to-end speech recognition method based on a modified CLDNN structure according to claim 6, characterized in that step S6 performs cross-validation on the model trained in step S5 and adjusts the hyper-parameters of the model to obtain the final network model, specifically:
cross-validation steps:
1. initialize the weights with random values between −0.5 and 0.5;
2. divide the learning sample space C into N parts;
3. take N−1 parts from the learning data file in the prescribed order as training data samples, the remaining part serving as validation data samples, and complete steps 4 to 7;
4. read in one sample from the training data samples and train on it;
5. compute the total output error EP of this sample, modify the weights of the two layers until EP < ε (ε being the prescribed error measure), then read in the next training sample;
6. when all samples of the N−1 training parts have been learned, a set of weights is produced; compute the validation samples with this set of weights and calculate the validation success rate RATE = (number of validation samples satisfying EP < ε) / (total number of validation samples);
7. if the validation success rate RATE > rate (rate being the prescribed success rate), end this round of learning, otherwise learn all the validation samples;
hyper-parameters:
learning rate: the learning rate is the step size by which the network weights are updated in the optimization algorithm; different optimization algorithms call for different learning rates; an over-large learning rate may keep the model from converging, with the loss function oscillating; an over-small one makes convergence slow and requires longer training; typical values are 0.01, 0.001 and 0.0001;
batch size: the batch size is the number of samples fed into the model at each training step of the neural network; in convolutional neural networks a large batch usually makes the network converge faster, but owing to memory limits an over-large batch may exhaust memory or crash the program kernel; typical values are 16, 32, 64 and 128;
number of iterations: the number of iterations is the number of times the entire training set is fed through the neural network for training; when the validation error rate and the training error rate differ little, the current number of iterations is considered suitable; if the validation error first decreases and then increases, the number of iterations is too large and should be reduced, otherwise over-fitting easily occurs.
CN201910115486.6A 2019-02-14 2019-02-14 Method for establishing CLDNN structure applied to end-to-end speech recognition Active CN109767759B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910115486.6A CN109767759B (en) 2019-02-14 2019-02-14 Method for establishing CLDNN structure applied to end-to-end speech recognition

Publications (2)

Publication Number Publication Date
CN109767759A true CN109767759A (en) 2019-05-17
CN109767759B CN109767759B (en) 2020-12-22

Family

ID=66456247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910115486.6A Active CN109767759B (en) 2019-02-14 2019-02-14 Method for establishing CLDNN structure applied to end-to-end speech recognition

Country Status (1)

Country Link
CN (1) CN109767759B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328122A (en) * 2016-08-19 2017-01-11 深圳市唯特视科技有限公司 Voice identification method using long-short term memory model recurrent neural network
WO2018071389A1 (en) * 2016-10-10 2018-04-19 Google Llc Very deep convolutional neural networks for end-to-end speech recognition
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models
CN108564940A (en) * 2018-03-20 2018-09-21 平安科技(深圳)有限公司 Audio recognition method, server and computer readable storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Suyoun Kim et al., "Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning", ICASSP 2017 *
Diederik P. Kingma et al., "Adam: A Method for Stochastic Optimization", ICLR 2015 *
Sylvain Arlot, "A Survey of Cross-Validation Procedures for Model Selection", Statistics Surveys *
Tara N. Sainath et al., "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks", ICASSP 2015 *
Li Gang et al., "Intelligent Cross-Validation Optimization of Hyper-Parameters in Supervised Machine Learning", Journal of Xi'an Technological University *
Li Ruiqi et al., "A Support Vector Machine Based Method for Evaluating the State of Health of Lithium Batteries", 17th CCSSTAE 2016 *

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110148408A (en) * 2019-05-29 2019-08-20 上海电力学院 A kind of Chinese speech recognition method based on depth residual error
CN110309771B (en) * 2019-06-28 2023-03-24 南京丰厚电子有限公司 GBDT-INSGAII-based EAS (Acoustic magnetic System) label identification algorithm
CN110309771A (en) * 2019-06-28 2019-10-08 南京丰厚电子有限公司 A kind of EAS sound magnetic system tag recognition algorithm based on GBDT-INSGAII
CN110443127A (en) * 2019-06-28 2019-11-12 天津大学 In conjunction with the musical score image recognition methods of residual error convolutional coding structure and Recognition with Recurrent Neural Network
CN110335591A (en) * 2019-07-04 2019-10-15 广州云从信息科技有限公司 A kind of parameter management method, device, machine readable media and equipment
CN110600053A (en) * 2019-07-30 2019-12-20 广东工业大学 Cerebral stroke dysarthria risk prediction method based on ResNet and LSTM network
CN110659773A (en) * 2019-09-16 2020-01-07 杭州师范大学 Flight delay prediction method based on deep learning
CN110751944A (en) * 2019-09-19 2020-02-04 平安科技(深圳)有限公司 Method, device, equipment and storage medium for constructing voice recognition model
WO2021051628A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Method, apparatus and device for constructing speech recognition model, and storage medium
CN110634476B (en) * 2019-10-09 2022-06-14 深圳大学 Method and system for rapidly building robust acoustic model
CN110634476A (en) * 2019-10-09 2019-12-31 深圳大学 Method and system for rapidly building robust acoustic model
CN110942090A (en) * 2019-11-11 2020-03-31 北京迈格威科技有限公司 Model training method, image processing method, device, electronic equipment and storage medium
CN110942090B (en) * 2019-11-11 2024-03-29 北京迈格威科技有限公司 Model training method, image processing device, electronic equipment and storage medium
CN111009235A (en) * 2019-11-20 2020-04-14 武汉水象电子科技有限公司 Voice recognition method based on CLDNN + CTC acoustic model
US11250854B2 (en) * 2019-11-25 2022-02-15 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for voice interaction, device and computer-readable storage medium
CN110992940A (en) * 2019-11-25 2020-04-10 百度在线网络技术(北京)有限公司 Voice interaction method, device, equipment and computer-readable storage medium
CN111092798A (en) * 2019-12-24 2020-05-01 东华大学 Wearable system based on spoken language understanding
CN111092798B (en) * 2019-12-24 2021-06-11 东华大学 Wearable system based on spoken language understanding
CN111243624A (en) * 2020-01-02 2020-06-05 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) Method and system for evaluating personnel state
CN111243624B (en) * 2020-01-02 2023-04-07 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) Method and system for evaluating personnel state
CN111429947A (en) * 2020-03-26 2020-07-17 重庆邮电大学 Speech emotion recognition method based on multi-stage residual convolutional neural network
CN111429947B (en) * 2020-03-26 2022-06-10 重庆邮电大学 Speech emotion recognition method based on multi-stage residual convolutional neural network
CN111401530B (en) * 2020-04-22 2021-04-09 上海依图网络科技有限公司 Training method for neural network of voice recognition device
WO2021212684A1 (en) * 2020-04-22 2021-10-28 上海依图网络科技有限公司 Recurrent neural network and training method therefor
CN111401530A (en) * 2020-04-22 2020-07-10 上海依图网络科技有限公司 Recurrent neural network and training method thereof
CN111898734A (en) * 2020-07-10 2020-11-06 中国科学院精密测量科学与技术创新研究院 NMR (nuclear magnetic resonance) relaxation time inversion method based on MLP (Multi-layer linear programming)
CN111898734B (en) * 2020-07-10 2023-06-23 中国科学院精密测量科学与技术创新研究院 NMR relaxation time inversion method based on MLP
CN112289309A (en) * 2020-10-30 2021-01-29 西安工程大学 Robot voice control method based on deep learning
CN112651313A (en) * 2020-12-17 2021-04-13 国网上海市电力公司 Equipment nameplate double-intelligent identification method, storage medium and terminal
CN112560453B (en) * 2020-12-18 2023-07-14 平安银行股份有限公司 Voice information verification method and device, electronic equipment and medium
CN112560453A (en) * 2020-12-18 2021-03-26 平安银行股份有限公司 Voice information verification method and device, electronic equipment and medium
CN112652296A (en) * 2020-12-23 2021-04-13 北京华宇信息技术有限公司 Streaming voice endpoint detection method, device and equipment
CN112669827A (en) * 2020-12-28 2021-04-16 清华大学 Joint optimization method and system for automatic speech recognizer
CN112669827B (en) * 2020-12-28 2022-08-02 清华大学 Joint optimization method and system for automatic speech recognizer
CN112904220A (en) * 2020-12-30 2021-06-04 厦门大学 UPS (uninterrupted Power supply) health prediction method and system based on digital twinning and machine learning, electronic equipment and storable medium
CN113327590A (en) * 2021-04-15 2021-08-31 中标软件有限公司 Speech recognition method
CN113270097B (en) * 2021-05-18 2022-05-17 成都傅立叶电子科技有限公司 Unmanned mechanical control method, radio station voice instruction conversion method and device
CN113270097A (en) * 2021-05-18 2021-08-17 成都傅立叶电子科技有限公司 Unmanned mechanical control method, radio station voice instruction conversion method and device
CN113569992A (en) * 2021-08-26 2021-10-29 中国电子信息产业集团有限公司第六研究所 Abnormal data identification method and device, electronic equipment and storage medium
CN113569992B (en) * 2021-08-26 2024-01-09 中国电子信息产业集团有限公司第六研究所 Abnormal data identification method and device, electronic equipment and storage medium
CN113852434A (en) * 2021-09-18 2021-12-28 中山大学 LSTM and ResNet assisted deep learning end-to-end intelligent communication method and system
CN113852434B (en) * 2021-09-18 2023-07-25 中山大学 LSTM and ResNet-assisted deep learning end-to-end intelligent communication method and system
CN114550706A (en) * 2022-02-21 2022-05-27 苏州市职业大学 Smart campus voice recognition method based on deep learning

Also Published As

Publication number Publication date
CN109767759B (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN109767759A (en) End-to-end speech recognition methods based on modified CLDNN structure
CN110556100B (en) Training method and system of end-to-end speech recognition model
CN109003601A (en) A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN110706692B (en) Training method and system of child voice recognition model
CN109801621A (en) A kind of audio recognition method based on residual error gating cycle unit
CN112509564A (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN110444208A (en) A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm
CN106782511A (en) Amendment linear depth autoencoder network audio recognition method
CN109189925A (en) Term vector model based on mutual information and based on the file classification method of CNN
CN107408384A (en) The end-to-end speech recognition of deployment
CN109063820A (en) Utilize the data processing method of time-frequency combination Recognition with Recurrent Neural Network when long
CN103531199A (en) Ecological sound identification method on basis of rapid sparse decomposition and deep learning
CN110321418A (en) A kind of field based on deep learning, intention assessment and slot fill method
CN110379418A (en) A kind of voice confrontation sample generating method
CN109637526A (en) The adaptive approach of DNN acoustic model based on personal identification feature
CN107818080A (en) Term recognition methods and device
CN109448706A (en) Neural network language model compression method and system
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN110009025A (en) A kind of semi-supervised additive noise self-encoding encoder for voice lie detection
CN108461080A (en) A kind of Acoustic Modeling method and apparatus based on HLSTM models
CN111882042A (en) Automatic searching method, system and medium for neural network architecture of liquid state machine
Shi et al. Construction of English Pronunciation Judgment and Detection Model Based on Deep Learning Neural Networks Data Stream Fusion
Jin et al. Research on objective evaluation of recording audio restoration based on deep learning network
CN108388942A (en) Information intelligent processing method based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant