CN106782518A - A speech recognition method based on a hierarchical recurrent neural network language model - Google Patents
A speech recognition method based on a hierarchical recurrent neural network language model
- Publication number
- CN106782518A CN106782518A CN201611059843.4A CN201611059843A CN106782518A CN 106782518 A CN106782518 A CN 106782518A CN 201611059843 A CN201611059843 A CN 201611059843A CN 106782518 A CN106782518 A CN 106782518A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
Abstract
This invention proposes a speech recognition method based on a hierarchical recurrent neural network (RNN) language model. Its main contents are: character-level language modeling with an RNN; extending the RNN structure with external clock and reset signals; character-level language modeling with a hierarchical RNN; and performing speech recognition. The process is: first perform character-level language modeling with an RNN, then extend the RNN structure with external clock and reset signals, perform character-level language modeling with the hierarchical RNN, and finally carry out speech recognition. The invention replaces the traditional single-clock character-level RNN language model with a hierarchical recurrent neural network language model, achieving better recognition accuracy with fewer parameters; the language model supports a large vocabulary while requiring less storage; and the hierarchical language model can be extended to process longer-term information such as sentences, topics, or other contexts.
Description
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method based on a hierarchical recurrent neural network language model.
Background
With the development of modern technology, character-level language models (CLMs) based on recurrent neural networks (RNNs) are widely used in fields such as speech recognition, text generation, and machine translation. Their ability to model words never seen in training is highly useful. However, their performance is generally worse than that of word-level language models (WLMs). Moreover, statistical language models require large storage space, often more than 1 GB, because they must account not only for a large vocabulary but also for combinations of words.
The present invention proposes a speech recognition method based on a hierarchical recurrent neural network language model, whose hierarchical RNN architecture consists of multiple modules with different clock rates. Despite the multi-clock structure, the input and output layers both operate with the character-level clock, which allows existing training methods for character-level RNN language models to be applied directly without any modification. The method first performs character-level language modeling with an RNN, then extends the RNN structure with external clock and reset signals, performs character-level language modeling with the hierarchical RNN, and finally carries out speech recognition. The invention replaces the traditional single-clock character-level RNN language model with a hierarchical recurrent neural network language model, achieving better recognition accuracy with fewer parameters; the language model supports a large vocabulary while requiring less storage; and the hierarchical language model can be extended to process longer-term information such as sentences, topics, or other contexts.
Content of the invention
In view of the problems of low recognition accuracy and large memory footprint, the object of the present invention is to provide a speech recognition method based on a hierarchical recurrent neural network language model: first perform character-level language modeling with an RNN, then extend the RNN structure with external clock and reset signals, perform character-level language modeling with the hierarchical RNN, and finally carry out speech recognition.
To solve the above problems, the present invention provides a speech recognition method based on a hierarchical recurrent neural network language model, whose main contents include:
(1) character-level language modeling with an RNN;
(2) extending the RNN structure with external clock and reset signals;
(3) character-level language modeling with a hierarchical RNN;
(4) performing speech recognition.
The hierarchical recurrent neural network language model combines the advantageous properties of character-level and word-level language models. The recurrent neural network consists of low-level RNNs and high-level RNNs. The low-level RNN uses character-level input and output, and provides short-term embeddings to the high-level RNN, which operates as a word-level RNN. The high-level RNN does not need complicated input and output, because it receives feature information from the low-level network and sends character-prediction information back to the low level in a compressed form. Therefore, in terms of its input and output, the proposed network is a character-level language model (CLM), but it contains a word-level model inside. The low-level module runs on the character input clock, while the high-level module runs on the word-delimiting space symbol (<w>). The hierarchical language model can be extended to process longer-term information such as sentences, topics, or other contexts, and can be trained end-to-end on text characters.
For character-level language modeling with an RNN, the training data is first converted to a sequence of one-hot encoded character vectors x_t, where the characters include the word-boundary symbol <w> (i.e., the space) and, optionally, the sentence-boundary symbol <s>. The RNN is trained to predict the next character x_{t+1} by minimizing the cross-entropy loss of the softmax output that represents the probability distribution of the next character.
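As a minimal sketch of the conversion and training objective described above: the following builds the one-hot sequence x_t and evaluates the next-character cross-entropy loss. The four-symbol vocabulary, the toy character stream, and the uniform stand-in softmax outputs are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def one_hot_sequence(chars, vocab):
    # map each character to a one-hot row vector
    idx = {ch: i for i, ch in enumerate(vocab)}
    seq = np.zeros((len(chars), len(vocab)))
    for t, ch in enumerate(chars):
        seq[t, idx[ch]] = 1.0
    return seq

vocab = ['<w>', 'a', 'b', 'c']             # tiny illustrative character set
chars = ['a', 'b', '<w>', 'c', 'a']        # '<w>' stands for the space / word boundary
x = one_hot_sequence(chars, vocab)         # the sequence x_t of one-hot vectors

inputs, targets = x[:-1], x[1:]            # train to predict x_{t+1} from x_t
probs = np.full((len(inputs), len(vocab)), 1.0 / len(vocab))  # stand-in softmax outputs
loss = -np.mean(np.log((probs * targets).sum(axis=1)))        # cross-entropy to minimize
```

In a real setup, `probs` would be the RNN's softmax outputs and `loss` would be minimized by gradient descent.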
To extend the RNN structure with external clock and reset signals, note that most types of RNNs can be generalized as

s_t = f(x_t, s_{t-1})    (1)
y_t = g(s_t)    (2)

where x_t is the input, s_t is the state, y_t is the output at time step t, f(·) is the recursion function, and g(·) is the output function. For example, an Elman network can be written as

s_t = h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)    (3)
y_t = h_t    (4)

where h_t is the hidden-layer activation, σ(·) is the activation function, W_hx and W_hh are weight matrices, and b_h is a bias vector.

LSTMs with forget gates and peephole connections can also be converted to this generalized form. The forward equations of an LSTM layer are:

i_t = σ(W_ix x_t + W_ih h_{t-1} + W_im m_{t-1} + b_i)    (5)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fm m_{t-1} + b_f)    (6)
m_t = f_t ∘ m_{t-1} + i_t ∘ tanh(W_mx x_t + W_mh h_{t-1} + b_m)    (7)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_om m_t + b_o)    (8)
h_t = o_t ∘ tanh(m_t)    (9)

where i_t, f_t, and o_t are the values of the input, forget, and output gates respectively, m_t is the memory cell activation, h_t is the output activation, σ(·) is the logistic sigmoid function, and ∘ is the element-wise multiplication operator. These equations are covered by the generalized form by setting s_t = [m_t, h_t] and y_t = h_t.
Further, any generalized RNN can be modified to incorporate an external clock signal c_t as

s_t = (1 - c_t) s_{t-1} + c_t f(x_t, s_{t-1})    (10)
y_t = g(s_t)    (11)

where c_t is 0 or 1. The RNN updates its state and output only when c_t = 1; otherwise, when c_t = 0, the state and output keep the same values as in the previous step.

The RNN is reset by setting s_{t-1} to 0. Specifically, Equation (10) becomes

s_t = (1 - c_t)(1 - r_t) s_{t-1} + c_t f(x_t, (1 - r_t) s_{t-1})    (12)

where the reset signal r_t is 0 or 1. When r_t = 1, the RNN forgets the previous context.

If the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable. Therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
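The clocked and reset state update of Equations (10)–(12) can be sketched as follows, using a toy Elman cell as f(·). The state and input sizes and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_hx = rng.normal(size=(3, 2))             # illustrative weights for Eq. (3)
W_hh = rng.normal(size=(3, 3))
b_h = np.zeros(3)

def f(x, s_prev):                          # Elman recursion f(.) of Eq. (3)
    return np.tanh(W_hx @ x + W_hh @ s_prev + b_h)

def step(x, s_prev, c, r):                 # clocked and reset update, Eq. (12)
    s_prev = (1 - r) * s_prev              # r = 1 zeroes the state: forget the context
    return (1 - c) * s_prev + c * f(x, s_prev)

x = np.ones(2)
s0 = np.zeros(3)
s1 = step(x, s0, c=1, r=0)                 # clocked: state updates
s2 = step(x, s1, c=0, r=0)                 # not clocked: state is held
s3 = step(x, s1, c=1, r=1)                 # reset then update: previous context forgotten
```

Because `step` is built only from differentiable operations and multiplications by the binary signals c and r, BPTT applies to it unchanged, as stated above.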
For character-level language modeling with a hierarchical RNN, the proposed hierarchical RNN (HRNN) architecture consists of several RNN modules with different clock rates. Higher-level modules use slower clock rates than lower-level ones, and each clocking of a higher-level module resets the module below it.
Further, regarding the RNN modules with different clock rates: if there are L levels, the RNN consists of L submodules. Each submodule l runs with an external clock c_{l,t} and reset signal r_{l,t}, where l = 1, …, L. The lowest-level module l = 1 has the fastest clock rate, i.e., c_{1,t} = 1 for all t. Higher-level modules l > 1 have slower clock rates, and c_{l,t} can be 1 only when c_{l-1,t} = 1. A lower-level module l < L is reset by the clock signal of the level above, i.e., r_{l,t} = c_{l+1,t}.
The hidden activation of module l < L is fed to the next higher-level module l + 1 with a delay of one time step, to avoid the undesirable reset caused by r_{l,t} = c_{l+1,t} = 1 at that step. This hidden activation vector, or embedding vector, contains compressed short-term context information. Being reset by the higher-level clock signal helps the module concentrate on compressing only short-term information. The next higher-level module l + 1 processes this short-term information and can generate a long-term context vector, which it feeds back to the lower-level module l; this context is propagated without delay.
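The clock and reset scheduling described above can be illustrated for a two-level HRNN (L = 2), where the word-level clock fires at word boundaries and also resets the character level. The character stream itself is an illustrative assumption.

```python
# Derive per-module clock and reset signals from a character stream.
# '<w>' marks word boundaries, as in the text above.
def hrnn_signals(chars):
    c1 = [1] * len(chars)                           # level 1: clocked at every character
    c2 = [1 if ch == '<w>' else 0 for ch in chars]  # level 2: clocked only at <w>
    r1 = list(c2)                                   # r_{1,t} = c_{2,t}: <w> resets level 1
    return c1, c2, r1

c1, c2, r1 = hrnn_signals(['t', 'o', '<w>', 'b', 'e', '<w>'])
```

Note that c2 is 1 only where c1 is 1 (trivially, since c1 is always 1), matching the constraint that a higher-level module may be clocked only when the level below is.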
Further, for character-level language modeling, a two-level (L = 2) HRNN is used, with l = 1 as the character-level module and l = 2 as the word-level module. The word-level module is clocked at word boundaries, <w>, typically the space character. The input and softmax output layers are connected to the character-level module, and the information of whether the current character is a boundary marker (e.g., <w> or <s>) is supplied to the word-level module. Because the HRNN has an extensible architecture, the HRNN CLM can be extended with a sentence-level module l = 3 for sentence-level context modeling. In this case, the sentence-level clock c_{3,t} becomes 1 when the input character is the sentence-boundary marker <s>. In addition, the word-level module should then be clocked at both the word boundary <w> and the sentence boundary <s>. Likewise, the model can be extended with modules of other, higher levels, such as a paragraph-level module or a topic-modeling module.
Further, the two-level HRNN CLM architecture comes in two types; in both models each submodule has two LSTM layers.
In the HLSTM-A architecture, both LSTM layers in the character-level module receive the one-hot encoded character input; the second layer of the character-level module is therefore a generative model conditioned on the context vector.
In HLSTM-B, the second LSTM layer of the character-level module is not directly connected to the character input. Instead, the word embedding from the first LSTM layer is fed to the second LSTM layer, which makes the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector.
Experimental results show that HLSTM-B is more effective for CLM applications.
Because the character-level module is reset at word-boundary markers (i.e., <w> or whitespace), the context vector from the word-level module is the only source of inter-word context information. Therefore, the trained model generates context vectors that carry useful information about the probability distribution over the next word. From this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNN LM whose input is a word-embedding vector and whose output is a compressed descriptor of the next-word probabilities.
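A conceptual sketch of this two-level wiring follows: a character-level module clocked every step and reset at word boundaries, feeding its embedding upward with a one-step delay, while the word-level module returns a context vector that conditions the character level. The toy tanh cells stand in for the LSTM submodules; all sizes, weights, and inputs are illustrative assumptions rather than the patent's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4                                        # shared toy state size
Wc = rng.normal(size=(D, D)) * 0.1           # character-module weights
Ww = rng.normal(size=(D, D)) * 0.1           # word-module weights

def cell(W, x, s):                           # toy recurrent cell standing in for an LSTM pair
    return np.tanh(W @ (x + s))

def hrnn_forward(xs, boundaries):
    s_char = np.zeros(D)                     # character-level state (reset at <w>)
    s_word = np.zeros(D)                     # word-level state: source of the context vector
    delayed = np.zeros(D)                    # embedding fed upward with a one-step delay
    outs = []
    for x, is_boundary in zip(xs, boundaries):
        if is_boundary:                      # c_{2,t} = 1 at a word boundary ...
            s_word = cell(Ww, delayed, s_word)
            s_char = np.zeros(D)             # ... which also resets the character module
        s_char = cell(Wc, x + s_word, s_char)  # context vector conditions the char module
        outs.append(s_char)
        delayed = s_char
    return outs

outs = hrnn_forward([np.ones(D)] * 5, [0, 0, 1, 0, 0])
```

After the boundary at step 3, the character state has been cleared, so everything the model remembers about earlier words flows through `s_word` alone — the "exclusive source of inter-word context" noted above.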
For speech recognition, the speech input is converted to a spectrogram by the Fourier transform, decoding is performed by directed (beam) search using RNN networks, and the recognition result is finally produced.
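The front end described above can be sketched as a short-time Fourier transform producing a magnitude spectrogram. The frame size, hop, Hann window, and toy 440 Hz test signal are common illustrative choices, not taken from the patent.

```python
import numpy as np

def spectrogram(wave, frame=256, hop=128):
    # slice the waveform into overlapping windowed frames, then take
    # the magnitude of the real FFT of each frame
    frames = [wave[i:i + frame] * np.hanning(frame)
              for i in range(0, len(wave) - frame + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

t = np.arange(4096) / 8000.0                 # toy 8 kHz sampling grid
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

The spectrogram frames would then be fed to the acoustic RNN, with the hierarchical CLM rescoring hypotheses during the search-based decoding.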
Brief description of the drawings
Fig. 1 is the system flowchart of the speech recognition method based on a hierarchical recurrent neural network language model of the present invention.
Fig. 2 illustrates training of the RNN-based CLM in the speech recognition method based on a hierarchical recurrent neural network language model of the present invention.
Fig. 3 shows the hierarchical RNN of the speech recognition method based on a hierarchical recurrent neural network language model of the present invention.
Fig. 4 shows the two-level hierarchical LSTM (HLSTM) structure of the CLM in the speech recognition method based on a hierarchical recurrent neural network language model of the present invention.
Specific embodiments
It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the system flowchart of the speech recognition method based on a hierarchical recurrent neural network language model of the present invention. It mainly includes character-level language modeling with an RNN, extending the RNN structure with external clock and reset signals, character-level language modeling with a hierarchical RNN, and performing speech recognition.
To extend the RNN structure with external clock and reset signals, note that most types of RNNs can be generalized as

s_t = f(x_t, s_{t-1})    (1)
y_t = g(s_t)    (2)

where x_t is the input, s_t is the state, y_t is the output at time step t, f(·) is the recursion function, and g(·) is the output function. For example, an Elman network can be written as

s_t = h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)    (3)
y_t = h_t    (4)

where h_t is the hidden-layer activation, σ(·) is the activation function, W_hx and W_hh are weight matrices, and b_h is a bias vector.

LSTMs with forget gates and peephole connections can also be converted to this generalized form. The forward equations of an LSTM layer are:

i_t = σ(W_ix x_t + W_ih h_{t-1} + W_im m_{t-1} + b_i)    (5)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fm m_{t-1} + b_f)    (6)
m_t = f_t ∘ m_{t-1} + i_t ∘ tanh(W_mx x_t + W_mh h_{t-1} + b_m)    (7)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_om m_t + b_o)    (8)
h_t = o_t ∘ tanh(m_t)    (9)

where i_t, f_t, and o_t are the values of the input, forget, and output gates respectively, m_t is the memory cell activation, h_t is the output activation, σ(·) is the logistic sigmoid function, and ∘ is the element-wise multiplication operator. These equations are covered by the generalized form by setting s_t = [m_t, h_t] and y_t = h_t.

Further, any generalized RNN can be modified to incorporate an external clock signal c_t as

s_t = (1 - c_t) s_{t-1} + c_t f(x_t, s_{t-1})    (10)
y_t = g(s_t)    (11)

where c_t is 0 or 1. The RNN updates its state and output only when c_t = 1; otherwise, when c_t = 0, the state and output keep the same values as in the previous step.

The RNN is reset by setting s_{t-1} to 0. Specifically, Equation (10) becomes

s_t = (1 - c_t)(1 - r_t) s_{t-1} + c_t f(x_t, (1 - r_t) s_{t-1})    (12)

where the reset signal r_t is 0 or 1. When r_t = 1, the RNN forgets the previous context.

If the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable. Therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
For speech recognition, the speech input is converted to a spectrogram by the Fourier transform, decoding is performed by directed (beam) search using RNN networks, and the recognition result is finally produced.
Fig. 2 illustrates training of the RNN-based CLM in the speech recognition method based on a hierarchical recurrent neural network language model of the present invention. For training the RNN CLM, the training data is first converted to a sequence of one-hot encoded character vectors x_t, where the characters include the word-boundary symbol <w> (i.e., the space) and, optionally, the sentence-boundary symbol <s>. The RNN is trained to predict the next character x_{t+1} by minimizing the cross-entropy loss of the softmax output that represents the probability distribution of the next character.
Fig. 3 shows the hierarchical RNN of the speech recognition method based on a hierarchical recurrent neural network language model of the present invention. The hierarchical RNN (HRNN) architecture consists of several RNN modules with different clock rates; higher-level modules use slower clock rates than lower-level ones, and each clocking of a higher-level module resets the module below it.
Regarding the RNN modules with different clock rates: if there are L levels, the RNN consists of L submodules. Each submodule l runs with an external clock c_{l,t} and reset signal r_{l,t}, where l = 1, …, L. The lowest-level module l = 1 has the fastest clock rate, i.e., c_{1,t} = 1 for all t. Higher-level modules l > 1 have slower clock rates, and c_{l,t} can be 1 only when c_{l-1,t} = 1. A lower-level module l < L is reset by the clock signal of the level above, i.e., r_{l,t} = c_{l+1,t}.
The hidden activation of module l < L is fed to the next higher-level module l + 1 with a delay of one time step, to avoid the undesirable reset caused by r_{l,t} = c_{l+1,t} = 1 at that step. This hidden activation vector, or embedding vector, contains compressed short-term context information. Being reset by the higher-level clock signal helps the module concentrate on compressing only short-term information. The next higher-level module l + 1 processes this short-term information and can generate a long-term context vector, which it feeds back to the lower-level module l; this context is propagated without delay.
For character-level language modeling, a two-level (L = 2) HRNN is used, with l = 1 as the character-level module and l = 2 as the word-level module. The word-level module is clocked at word boundaries, <w>, typically the space character. The input and softmax output layers are connected to the character-level module, and the information of whether the current character is a boundary marker (e.g., <w> or <s>) is supplied to the word-level module. Because the HRNN has an extensible architecture, the HRNN CLM can be extended with a sentence-level module l = 3 for sentence-level context modeling. In this case, the sentence-level clock c_{3,t} becomes 1 when the input character is the sentence-boundary marker <s>. In addition, the word-level module should then be clocked at both the word boundary <w> and the sentence boundary <s>. Likewise, the model can be extended with modules of other, higher levels, such as a paragraph-level module or a topic-modeling module.
Fig. 4 shows the two-level hierarchical LSTM (HLSTM) structure of the CLM in the speech recognition method based on a hierarchical recurrent neural network language model of the present invention. The two-level HRNN CLM architecture comes in two types; in both models each submodule has two LSTM layers.
In the HLSTM-A architecture, both LSTM layers in the character-level module receive the one-hot encoded character input; the second layer of the character-level module is therefore a generative model conditioned on the context vector.
In HLSTM-B, the second LSTM layer of the character-level module is not directly connected to the character input. Instead, the word embedding from the first LSTM layer is fed to the second LSTM layer, which makes the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector.
Experimental results show that HLSTM-B is more effective for CLM applications.
Because the character-level module is reset at word-boundary markers (i.e., <w> or whitespace), the context vector from the word-level module is the only source of inter-word context information. Therefore, the trained model generates context vectors that carry useful information about the probability distribution over the next word. From this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNN LM whose input is a word-embedding vector and whose output is a compressed descriptor of the next-word probabilities.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and can be realized in other specific forms without departing from its spirit or scope. Furthermore, those skilled in the art may make various changes and modifications to the invention without departing from its spirit and scope, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
1. A speech recognition method based on a hierarchical recurrent neural network language model, characterized by mainly comprising: character-level language modeling with an RNN (step one); extending the RNN structure with external clock and reset signals (step two); character-level language modeling with a hierarchical RNN (step three); and performing speech recognition (step four).
2. The hierarchical recurrent neural network language model according to claim 1, characterized in that it combines the advantageous properties of character-level and word-level language models; the recurrent neural network (RNN) consists of low-level RNNs and high-level RNNs; the low-level RNN uses character-level input and output and provides short-term embeddings to the high-level RNN, which operates as a word-level RNN; the high-level RNN does not need complicated input and output, because it receives feature information from the low-level network and sends character-prediction information back to the low level in a compressed form; therefore, in terms of its input and output, the proposed network is a character-level language model (CLM), but it contains a word-level model inside; the low-level module runs on the character input clock, while the high-level module runs on the word-delimiting space symbol (<w>); the hierarchical language model can be extended to process longer-term information such as sentences, topics, or other contexts; and the hierarchical language model can be trained end-to-end on text characters.
3. The character-level language modeling with an RNN (step one) according to claim 1, characterized in that, for training the RNN CLM, the training data is first converted to a sequence of one-hot encoded character vectors x_t, where the characters include the word-boundary symbol <w> (i.e., the space) and, optionally, the sentence-boundary symbol <s>; the RNN is trained to predict the next character x_{t+1} by minimizing the cross-entropy loss of the softmax output that represents the probability distribution of the next character.
4. The extending of the RNN structure with external clock and reset signals (step two) according to claim 1, characterized in that most types of RNNs can be generalized as

s_t = f(x_t, s_{t-1})    (1)
y_t = g(s_t)    (2)

where x_t is the input, s_t is the state, y_t is the output at time step t, f(·) is the recursion function, and g(·) is the output function; for example, an Elman network can be written as

s_t = h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)    (3)
y_t = h_t    (4)

where h_t is the hidden-layer activation, σ(·) is the activation function, W_hx and W_hh are weight matrices, and b_h is a bias vector;

LSTMs with forget gates and peephole connections can also be converted to this generalized form; the forward equations of an LSTM layer are:

i_t = σ(W_ix x_t + W_ih h_{t-1} + W_im m_{t-1} + b_i)    (5)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fm m_{t-1} + b_f)    (6)
m_t = f_t ∘ m_{t-1} + i_t ∘ tanh(W_mx x_t + W_mh h_{t-1} + b_m)    (7)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_om m_t + b_o)    (8)
h_t = o_t ∘ tanh(m_t)    (9)

where i_t, f_t, and o_t are the values of the input, forget, and output gates respectively, m_t is the memory cell activation, h_t is the output activation, σ(·) is the logistic sigmoid function, and ∘ is the element-wise multiplication operator; these equations are covered by the generalized form by setting s_t = [m_t, h_t] and y_t = h_t.
5. The extending of the RNN structure with external clock and reset signals according to claim 4, characterized in that any generalized RNN can be modified to incorporate an external clock signal c_t as

s_t = (1 - c_t) s_{t-1} + c_t f(x_t, s_{t-1})    (10)
y_t = g(s_t)    (11)

where c_t is 0 or 1; the RNN updates its state and output only when c_t = 1; otherwise, when c_t = 0, the state and output keep the same values as in the previous step;

the RNN is reset by setting s_{t-1} to 0; specifically, Equation (10) becomes

s_t = (1 - c_t)(1 - r_t) s_{t-1} + c_t f(x_t, (1 - r_t) s_{t-1})    (12)

where the reset signal r_t is 0 or 1; when r_t = 1, the RNN forgets the previous context;

if the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable; therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
6. The character-level language modeling with a hierarchical RNN (step three) according to claim 1, characterized in that the proposed hierarchical RNN (HRNN) architecture consists of several RNN modules with different clock rates; higher-level modules use slower clock rates than lower-level ones, and each clocking of a higher-level module resets the module below it.
7. The RNN modules with different clock rates according to claim 6, characterized in that, if there are L levels, the RNN consists of L submodules; each submodule l runs with an external clock c_{l,t} and reset signal r_{l,t}, where l = 1, …, L; the lowest-level module l = 1 has the fastest clock rate, i.e., c_{1,t} = 1 for all t; higher-level modules l > 1 have slower clock rates, and c_{l,t} can be 1 only when c_{l-1,t} = 1; a lower-level module l < L is reset by the clock signal of the level above, i.e., r_{l,t} = c_{l+1,t};
the hidden activation of module l < L is fed to the next higher-level module l + 1 with a delay of one time step, to avoid the undesirable reset caused by r_{l,t} = c_{l+1,t} = 1 at that step; this hidden activation vector, or embedding vector, contains compressed short-term context information; being reset by the higher-level clock signal helps the module concentrate on compressing only short-term information; the next higher-level module l + 1 processes this short-term information and can generate a long-term context vector, which it feeds back to the lower-level module l; this context is propagated without delay.
8. The character-level language modeling according to claim 6, characterised in that a two-level (L = 2) HRNN is used, with l = 1 as the character-level module and l = 2 as the word-level module; the word-level module is clocked at word boundaries <w>, typically the space character; the input character-level module is connected to the softmax output layer, and information about the current boundary token (e.g., <w> or <s>) is passed to the word-level module; because the HRNN has a scalable architecture, the HRNN CLM can be extended with a sentence-level module l = 3 for sentence-level context modeling; in that case, the sentence-level clock c_{3,t} becomes 1 when the input character is the sentence-boundary token <s>; moreover, the word-level module is then clocked at both word boundaries <w> and sentence boundaries <s>; likewise, the model can be extended with further higher-level modules, such as a paragraph-level module or a topic-modeling module.
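The boundary-driven clocks above can be derived from the character stream alone. A minimal sketch, assuming the space character stands for <w> and '.' for <s> (both assumptions for illustration; the patent only names the tokens):

```python
def hierarchy_clocks(chars):
    """Per-level clock signals for a character stream.
    c1 ticks every step; c2 ticks at word boundaries; c3 at sentence
    boundaries.  Per the claim, the word-level clock also ticks at
    sentence boundaries."""
    c1 = [1] * len(chars)                                # character level
    c2 = [1 if ch in (' ', '.') else 0 for ch in chars]  # word level
    c3 = [1 if ch == '.' else 0 for ch in chars]         # sentence level
    # hierarchy invariant: c_{l,t} may be 1 only when c_{l-1,t} = 1
    assert all(hi <= lo for lo, hi in zip(c2, c3))
    return c1, c2, c3
```

The final assertion checks the scalability property the claim relies on: each added level's clock is a strict subsampling of the level below.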
9. The two-level HRNN CLM architecture according to claim 8, characterised in that the two-level HRNN CLM architecture has two types, and in both models each submodule has two LSTM layers;
in the HLSTM-A architecture, both LSTM layers of the character-level module receive the one-hot encoded character input; the second layer of the character-level module is therefore a generative model conditioned on the context vector;
in HLSTM-B, the second LSTM layer of the character-level module has no direct connection to the character input; instead, the embedding from the first LSTM layer is fed into the second LSTM layer, so that the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector;
experimental results show that HLSTM-B is more effective for CLM applications;
since the character-level module is reset at word-boundary tokens (i.e., <w> or whitespace), the context vector from the word-level module is the only source of inter-word contextual information; the trained model therefore generates context vectors that contain useful information about the probability distribution over the next word; from this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNNLM whose input is a word-embedding vector and whose output is a compressed descriptor of the next-word probabilities.
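The wiring difference between the two variants can be made concrete. The sketch below uses placeholder tanh cells rather than the patent's trained LSTMs, and only contrasts what each second layer receives; every name and dimension is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 5, 8                        # character vocab / hidden size (illustrative)

def layer(h, x, W, U):             # placeholder standing in for an LSTM layer
    return np.tanh(W @ x + U @ h)

W1 = rng.normal(size=(H, V + H)); U1 = rng.normal(size=(H, H))
Wa = rng.normal(size=(H, V + H))   # HLSTM-A: layer 2 also sees the character
Wb = rng.normal(size=(H, H + H))   # HLSTM-B: layer 2 sees layer 1's embedding
U2 = rng.normal(size=(H, H))

def char_step(variant, x_char, context, h1, h2):
    """One step of the character-level module.
    x_char: one-hot character input; context: word-level context vector."""
    h1 = layer(h1, np.concatenate([x_char, context]), W1, U1)
    if variant == "A":             # both layers receive the character input
        h2 = layer(h2, np.concatenate([x_char, context]), Wa, U2)
    else:                          # "B": no direct character connection
        h2 = layer(h2, np.concatenate([h1, context]), Wb, U2)
    return h1, h2
```

In both variants the first layer is identical; only the second layer's input changes, which is exactly the distinction the claim draws.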
10. Performing the speech recognition (step four) according to claim 1, characterised in that the speech input is converted into a spectrogram by a Fourier transform, decoding is performed as a directed search using the RNN networks, and the recognition result is finally produced.
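The spectrogram front end named in this claim amounts to a short-time Fourier transform. A minimal sketch; the frame length and hop (25 ms / 10 ms at 16 kHz) are common defaults assumed here, not values from the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform:
    cut the signal into overlapping windowed frames, then map each
    frame to the frequency domain with a real-input FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)
```

A pure 1000 Hz tone at a 16 kHz sampling rate should peak in bin 1000 / (16000 / 400) = 25 of every frame, which is a quick way to sanity-check the frequency axis.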
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611059843.4A CN106782518A (en) | 2016-11-25 | 2016-11-25 | A kind of audio recognition method based on layered circulation neutral net language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106782518A true CN106782518A (en) | 2017-05-31 |
Family
ID=58913229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611059843.4A Withdrawn CN106782518A (en) | 2016-11-25 | 2016-11-25 | A kind of audio recognition method based on layered circulation neutral net language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106782518A (en) |
Non-Patent Citations (1)
Title |
---|
KYUYEON HWANG et al.: "Character-Level Language Modeling with Hierarchical Recurrent Neural Networks", published online: https://arxiv.org/abs/1609.03777v1 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109147773A (en) * | 2017-06-16 | 2019-01-04 | 上海寒武纪信息科技有限公司 | A kind of speech recognition equipment and method |
CN108153943A (en) * | 2017-12-08 | 2018-06-12 | 南京航空航天大学 | The behavior modeling method of power amplifier based on dock cycles neural network |
CN108153943B (en) * | 2017-12-08 | 2021-07-23 | 南京航空航天大学 | Behavior modeling method of power amplifier based on clock cycle neural network |
CN108175426A (en) * | 2017-12-11 | 2018-06-19 | 东南大学 | A kind of lie detecting method that Boltzmann machine is limited based on depth recursion type condition |
CN111480197B (en) * | 2017-12-15 | 2023-06-27 | 三菱电机株式会社 | Speech recognition system |
CN111480197A (en) * | 2017-12-15 | 2020-07-31 | 三菱电机株式会社 | Speech recognition system |
CN111401084A (en) * | 2018-02-08 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Method and device for machine translation and computer readable storage medium |
US11593571B2 (en) | 2018-02-08 | 2023-02-28 | Tencent Technology (Shenzhen) Company Limited | Machine translation method, device, and computer-readable storage medium |
CN111401084B (en) * | 2018-02-08 | 2022-12-23 | 腾讯科技(深圳)有限公司 | Method and device for machine translation and computer readable storage medium |
WO2019154210A1 (en) * | 2018-02-08 | 2019-08-15 | 腾讯科技(深圳)有限公司 | Machine translation method and device, and computer-readable storage medium |
CN108492820B (en) * | 2018-03-20 | 2021-08-10 | 华南理工大学 | Chinese speech recognition method based on cyclic neural network language model and deep neural network acoustic model |
CN108492820A (en) * | 2018-03-20 | 2018-09-04 | 华南理工大学 | Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model |
CN110389996A (en) * | 2018-04-16 | 2019-10-29 | 国际商业机器公司 | Realize the full sentence recurrent neural network language model for being used for natural language processing |
CN108830287A (en) * | 2018-04-18 | 2018-11-16 | 哈尔滨理工大学 | The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method |
CN109086865A (en) * | 2018-06-11 | 2018-12-25 | 上海交通大学 | A kind of series model method for building up based on cutting Recognition with Recurrent Neural Network |
CN109086865B (en) * | 2018-06-11 | 2022-01-28 | 上海交通大学 | Sequence model establishing method based on segmented recurrent neural network |
CN109003614A (en) * | 2018-07-31 | 2018-12-14 | 上海爱优威软件开发有限公司 | A kind of voice transmission method, voice-transmission system and terminal |
CN112673421A (en) * | 2018-11-28 | 2021-04-16 | 谷歌有限责任公司 | Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances |
CN110111797A (en) * | 2019-04-04 | 2019-08-09 | 湖北工业大学 | Method for distinguishing speek person based on Gauss super vector and deep neural network |
CN113077785A (en) * | 2019-12-17 | 2021-07-06 | 中国科学院声学研究所 | End-to-end multi-language continuous voice stream voice content identification method and system |
CN113077785B (en) * | 2019-12-17 | 2022-07-12 | 中国科学院声学研究所 | End-to-end multi-language continuous voice stream voice content identification method and system |
CN113362811A (en) * | 2021-06-30 | 2021-09-07 | 北京有竹居网络技术有限公司 | Model training method, speech recognition method, device, medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106782518A (en) | A kind of audio recognition method based on layered circulation neutral net language model | |
WO2016101688A1 (en) | Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network | |
JP7109302B2 (en) | Text generation model update method and text generation device | |
CN105244020B (en) | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device | |
JP2020520492A (en) | Document abstract automatic extraction method, device, computer device and storage medium | |
TWI610295B (en) | Computer-implemented method of decompressing and compressing transducer data for speech recognition and computer-implemented system of speech recognition | |
Räsänen et al. | Modeling dependencies in multiple parallel data streams with hyperdimensional computing | |
CN110442721B (en) | Neural network language model, training method, device and storage medium | |
CN110083702B (en) | Aspect level text emotion conversion method based on multi-task learning | |
CN113641819A (en) | Multi-task sparse sharing learning-based argument mining system and method | |
CN112764738A (en) | Code automatic generation method and system based on multi-view program characteristics | |
CN110019795B (en) | Sensitive word detection model training method and system | |
WO2024193382A1 (en) | Knowledge injection and training methods and systems for knowledge-enhanced pre-trained language model | |
CN113238797A (en) | Code feature extraction method and system based on hierarchical comparison learning | |
JP2021117989A (en) | Language generation method, device and electronic apparatus | |
CN116306612A (en) | Word and sentence generation method and related equipment | |
CN113869324A (en) | Video common-sense knowledge reasoning implementation method based on multi-mode fusion | |
CN113901789A (en) | Gate-controlled hole convolution and graph convolution based aspect-level emotion analysis method and system | |
CN116431807B (en) | Text classification method and device, storage medium and electronic device | |
US20240071369A1 (en) | Pre-training method, pre-training device, and pre-training program | |
CN112650861A (en) | Personality prediction method, system and device based on task layering | |
US12020694B2 (en) | Efficiency adjustable speech recognition system | |
CN117371433B (en) | Processing method and device of title prediction model | |
WO2024052996A1 (en) | Learning device, conversion device, learning method, conversion method, and program | |
CN117828072B (en) | Dialogue classification method and system based on heterogeneous graph neural network |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20170531 |