CN106782518A - A speech recognition method based on a hierarchical recurrent neural network language model - Google Patents

A speech recognition method based on a hierarchical recurrent neural network language model

Info

Publication number
CN106782518A
CN106782518A (application CN201611059843.4A)
Authority
CN
China
Prior art keywords
rnn
character
level
word
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201611059843.4A
Other languages
Chinese (zh)
Inventor
夏春秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vision Technology Co Ltd
Original Assignee
Shenzhen Vision Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vision Technology Co Ltd filed Critical Shenzhen Vision Technology Co Ltd
Priority to CN201611059843.4A priority Critical patent/CN106782518A/en
Publication of CN106782518A publication Critical patent/CN106782518A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

A speech recognition method based on a hierarchical recurrent neural network language model is proposed. Its main contents include: character-level language modeling with an RNN; extending the RNN structure with external clock and reset signals; character-level language modeling with a hierarchical RNN; and performing speech recognition. The procedure is to first perform character-level language modeling with an RNN, then extend the RNN structure with external clock and reset signals, then perform character-level language modeling with the hierarchical RNN, and finally carry out speech recognition. The invention replaces the traditional single-clock RNN character-level language model with a hierarchical recurrent neural network language model, achieving better recognition accuracy while reducing the number of parameters; the language model handles a large vocabulary while requiring less storage space; and the hierarchical language model can be extended to process information over longer time spans, such as sentences, topics, or other contexts.

Description

A speech recognition method based on a hierarchical recurrent neural network language model
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method based on a hierarchical recurrent neural network language model.
Background technology
With the development of modern technology, character-level language models (CLMs) based on recurrent neural networks (RNNs) are widely used in fields such as speech recognition, text generation, and machine translation. They are highly useful for modeling words never seen in training. However, their performance is generally much worse than that of word-level language models (WLMs). Moreover, statistical language models require large storage space, often more than 1 GB, because they must account not only for a large vocabulary but also for its combinations.
The present invention proposes a speech recognition method based on a hierarchical recurrent neural network language model, whose hierarchical RNN architecture consists of multiple modules running at different clock rates. Despite the multi-clock structure, the input and output layers both operate on the character-level clock, which allows existing RNN character-level language model training methods to be applied directly without any modification. The method first performs character-level language modeling with an RNN, then extends the RNN structure with external clock and reset signals, then performs character-level language modeling with a hierarchical RNN, and finally carries out speech recognition. The invention replaces the traditional single-clock RNN character-level language model with a hierarchical recurrent neural network language model, achieving better recognition accuracy while reducing the number of parameters; the language model handles a large vocabulary while requiring less storage space; and the hierarchical language model can be extended to process information over longer time spans, such as sentences, topics, or other contexts.
The content of the invention
To address the problems of low recognition accuracy and large memory footprint, the object of the present invention is to provide a speech recognition method based on a hierarchical recurrent neural network language model, which first performs character-level language modeling with an RNN, then extends the RNN structure with external clock and reset signals, then performs character-level language modeling with a hierarchical RNN, and finally carries out speech recognition.
To solve the above problems, the present invention provides a speech recognition method based on a hierarchical recurrent neural network language model, whose main contents include:
(1) character-level language modeling with an RNN;
(2) extending the RNN structure with external clock and reset signals;
(3) character-level language modeling with a hierarchical RNN;
(4) performing speech recognition.
The hierarchical recurrent neural network language model combines the advantageous features of character-level and word-level language models. The recurrent neural network (RNN) is composed of low-level RNNs and high-level RNNs. The low-level RNN uses character-level input and output, and provides a short-term embedding to the high-level RNN, which operates as a word-level RNN. The high-level RNN does not need complex input and output, because it receives feature information from the low-level network and sends character-prediction information back to the low level in compressed form. Therefore, considering its input and output, the proposed network is a character-level language model (CLM), but it contains a word-level model inside. The low-level module runs on the character input clock, while the high-level module runs on the word-separator (<w>) clock. The hierarchical language model can be extended to process information over longer time spans, such as sentences, topics, or other contexts, and can be trained end-to-end on text characters.
For character-level language modeling with an RNN, to train RNN CLMs the training data is first converted into a sequence of one-hot encoded character vectors $x_t$, where the characters include a word-boundary symbol <w>, or space, and an optional sentence-boundary symbol <s>. The RNN is trained to predict the next character $x_{t+1}$ by minimizing the cross-entropy loss of the softmax output that represents the probability distribution over the next character.
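As an illustrative aside (not part of the patent), the data preparation just described, mapping a character stream with the <w> boundary symbol to one-hot vectors $x_t$ paired with next-character targets $x_{t+1}$, could be sketched as follows; the vocabulary and function names are assumptions:

```python
# Sketch: convert text into one-hot character vectors x_t, using <w> as the
# explicit word-boundary symbol, paired with next-character class targets for
# a cross-entropy loss. All names here are illustrative, not from the patent.

def make_training_pairs(text, vocab):
    # Replace spaces with the explicit word-boundary symbol <w>.
    chars = ["<w>" if ch == " " else ch for ch in text]
    idx = {c: i for i, c in enumerate(vocab)}

    def one_hot(c):
        v = [0.0] * len(vocab)
        v[idx[c]] = 1.0
        return v

    # (x_t, x_{t+1}) pairs: the RNN is trained to predict the next character.
    xs = [one_hot(c) for c in chars[:-1]]
    ys = [idx[c] for c in chars[1:]]  # class indices for cross-entropy
    return xs, ys

vocab = ["<w>", "a", "b", "c"]
xs, ys = make_training_pairs("ab c", vocab)
```

The softmax output of the trained RNN would then be compared against `ys` with a cross-entropy loss, exactly as the text describes.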
For extending the RNN structure with external clock and reset signals, most types of RNNs can be generalized as
$s_t = f(x_t, s_{t-1})$ (1)
$y_t = g(s_t)$ (2)
where $x_t$ is the input, $s_t$ the state, and $y_t$ the output at time step $t$; $f(\cdot)$ is the recursion function and $g(\cdot)$ the output function. For example, the Elman network can be expressed as
$s_t = h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b_h)$ (3)
$y_t = h_t$ (4)
where $h_t$ is the hidden-layer activation, $\sigma(\cdot)$ the activation function, $W_{hx}$ and $W_{hh}$ weight matrices, and $b_h$ a bias vector.
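A minimal sketch of the Elman step of equations (3)-(4), assuming a logistic sigmoid activation and toy dimensions (the weights and sizes below are illustrative, not from the patent):

```python
import math

# One Elman step, eqs. (3)-(4): s_t = h_t = sigma(W_hx x_t + W_hh h_{t-1} + b_h),
# y_t = h_t. Plain lists are used for clarity.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elman_step(x_t, h_prev, W_hx, W_hh, b_h):
    h_t = []
    for i in range(len(b_h)):
        z = b_h[i]
        z += sum(W_hx[i][j] * x_t[j] for j in range(len(x_t)))
        z += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(sigmoid(z))
    return h_t  # the output y_t equals h_t

# Toy example: 2-d input, 2-d hidden state, zero recurrent weights and biases.
h = elman_step([1.0, 0.0], [0.0, 0.0],
               [[0.5, 0.0], [0.0, 0.5]],
               [[0.0, 0.0], [0.0, 0.0]],
               [0.0, 0.0])
```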
The generalized form also covers LSTMs with forget gates and peephole connections; the forward equations of an LSTM layer are as follows:
$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + W_{im} m_{t-1} + b_i)$ (5)
$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + W_{fm} m_{t-1} + b_f)$ (6)
$m_t = f_t \odot m_{t-1} + i_t \odot \tanh(W_{mx} x_t + W_{mh} h_{t-1} + b_m)$ (7)
$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + W_{om} m_t + b_o)$ (8)
$h_t = o_t \odot \tanh(m_t)$ (9)
where $i_t$, $f_t$, and $o_t$ are the values of the input, forget, and output gates respectively, $m_t$ is the memory-cell activation, $h_t$ the output activation, $\sigma(\cdot)$ the logistic sigmoid function, and $\odot$ the element-wise multiplication operator. These equations are summarized by setting $s_t = [m_t, h_t]$ and $y_t = h_t$.
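Equations (5)-(9) can be sketched directly. One assumption made here that the text does not spell out: the peephole weights $W_{im}$, $W_{fm}$, $W_{om}$ are taken as diagonal (a common convention), hence the element-wise products; all shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of one LSTM step with forget gate and peephole connections,
# following eqs. (5)-(9). Peephole weights are assumed diagonal vectors.
def lstm_step(x, h_prev, m_prev, W, b):
    i = sigmoid(W["ix"] @ x + W["ih"] @ h_prev + W["im"] * m_prev + b["i"])  # (5)
    f = sigmoid(W["fx"] @ x + W["fh"] @ h_prev + W["fm"] * m_prev + b["f"])  # (6)
    m = f * m_prev + i * np.tanh(W["mx"] @ x + W["mh"] @ h_prev + b["m"])    # (7)
    o = sigmoid(W["ox"] @ x + W["oh"] @ h_prev + W["om"] * m + b["o"])       # (8)
    h = o * np.tanh(m)                                                       # (9)
    return h, m

rng = np.random.default_rng(0)
nx, nh = 4, 3
W = {k: rng.standard_normal((nh, nx if k.endswith("x") else nh)) * 0.1
     for k in ["ix", "ih", "fx", "fh", "mx", "mh", "ox", "oh"]}
W.update({k: rng.standard_normal(nh) * 0.1 for k in ["im", "fm", "om"]})
b = {k: np.zeros(nh) for k in ["i", "f", "m", "o"]}
h, m = lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh), W, b)
```

Since $|o_t| < 1$ and $|\tanh(m_t)| < 1$, the output activation is always bounded by 1 in magnitude.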
Further, any generalized RNN can be converted to incorporate an external clock signal $c_t$, as
$s_t = (1 - c_t)\,s_{t-1} + c_t\,f(x_t, s_{t-1})$ (10)
$y_t = g(s_t)$ (11)
where $c_t$ is 0 or 1. The RNN updates its state and output only when $c_t = 1$; otherwise, when $c_t = 0$, the state and output keep the same values as at the previous step.
A reset of the RNN is performed by setting $s_{t-1}$ to 0. Specifically, equation (10) becomes
$s_t = (1 - c_t)(1 - r_t)\,s_{t-1} + c_t\,f(x_t, (1 - r_t)\,s_{t-1})$ (12)
where the reset signal $r_t$ is 0 or 1; when $r_t = 1$, the RNN forgets the previous context.
If the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable. Therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
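The clocked, resettable update of equation (12) reduces to a few lines of code. The recursion function `toy_f` below is a stand-in chosen only for illustration, not the patent's RNN:

```python
# Sketch of eq. (12): s_t = (1 - c_t)(1 - r_t) s_{t-1} + c_t f(x_t, (1 - r_t) s_{t-1}).
# With c_t and r_t restricted to {0, 1} this is a hold / update / reset switch.

def clocked_step(f, x_t, s_prev, c_t, r_t):
    s_in = [(1 - r_t) * s for s in s_prev]  # reset wipes the previous state
    if c_t == 0:
        return s_in                          # clock gated: hold the state
    return f(x_t, s_in)                      # clock active: normal update

def toy_f(x, s):                             # illustrative recursion function
    return [x + si for si in s]

s = [2.0, 3.0]
held    = clocked_step(toy_f, 1.0, s, c_t=0, r_t=0)  # holds previous state
updated = clocked_step(toy_f, 1.0, s, c_t=1, r_t=0)  # ordinary update
reset   = clocked_step(toy_f, 1.0, s, c_t=1, r_t=1)  # forgets previous context
```

Because the branch is a linear gate in $c_t$ and $r_t$, gradients still flow through the active path, which is why BPTT applies unchanged, as the text notes.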
For character-level language modeling with a hierarchical RNN, the proposed hierarchical RNN (HRNN) architecture has several RNN modules with different clock rates. A higher-level module uses a slower clock rate than a lower-level one, and each higher-level clock resets the modules below it.
Further, for the RNN modules with different clock rates: with $L$ levels, the RNN consists of $L$ submodules. Each submodule $l$ runs with an external clock $c_{l,t}$ and reset signal $r_{l,t}$, where $l = 1, \dots, L$. The lowest-level module $l = 1$ has the fastest clock rate, i.e. $c_{1,t} = 1$ for all $t$, while higher-level modules $l > 1$ have slower clock rates, and $c_{l,t}$ can be 1 only when $c_{l-1,t} = 1$. A lower-level module $l < L$ is reset by the clock signal of the level above it, i.e. $r_{l,t} = c_{l+1,t}$.
The hidden activation of module $l < L$ is fed to the next higher-level module $l + 1$ with a delay of one time step, to avoid an undesired reset by $r_{l,t} = c_{l+1,t}$ at time $t$. This hidden activation vector, or embedding vector, contains compressed short-term context information. Resetting a module by the higher-level clock signal helps it concentrate on compressing only short-term information. The next higher-level module $l + 1$ processes this short-term information and can generate a long-term context vector, which is fed back to the lower-level module $l$; this context is propagated without delay.
Further, for character-level language modeling a two-level ($L = 2$) HRNN is used, with $l = 1$ the character-level module and $l = 2$ the word-level module. The word-level module is clocked at word boundaries <w>, typically the space character. The input and softmax output layers are connected to the character-level module, and the information of the current word-boundary mark (e.g. <w> or <s>) is delivered to the word-level module. Because the HRNN has an extensible architecture, the HRNN CLM can be extended with a sentence-level module $l = 3$ for sentence-level context modeling; in that case the sentence-level clock $c_{3,t}$ becomes 1 when the input character is the sentence-boundary mark <s>. In addition, the word-level module should then be clocked at both the word boundary <w> and the sentence boundary <s>. Likewise, the model can be extended with modules of other, higher levels, such as a paragraph-level module or a topic-modeling module.
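The clocking rules above can be illustrated by deriving the two-level signals directly from a character stream. This sketch (not from the patent) assumes the space character has already been mapped to <w>:

```python
# Sketch: derive the two-level HRNN clock and reset signals from a character
# stream, per the rules in the text: c_{1,t} = 1 always, c_{2,t} = 1 only at
# word boundaries <w>, and r_{1,t} = c_{2,t}. Purely illustrative.

def hrnn_signals(chars):
    sig = []
    for ch in chars:
        c1 = 1                         # character-level clock: every step
        c2 = 1 if ch == "<w>" else 0   # word-level clock: word boundaries only
        r1 = c2                        # character module is reset by word clock
        sig.append((c1, c2, r1))
    return sig

sig = hrnn_signals(["h", "i", "<w>", "y", "o"])
```

Note that $c_{2,t} = 1$ implies $c_{1,t} = 1$, matching the constraint that a higher-level clock can tick only when the level below it does.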
Further, the two-level HRNN CLM architecture comes in two types; in both models each submodule has two LSTM layers.
In the HLSTM-A architecture, both LSTM layers of the character-level module receive the one-hot encoded character input; the second layer of the character-level module is therefore a generative model conditioned by the context vector.
In HLSTM-B, the second LSTM layer of the character-level module has no direct connection to the character input. Instead, the word embedding from the first LSTM layer is fed to the second LSTM layer, which makes the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector.
Experimental results indicate that HLSTM-B is more effective for CLM applications.
Because the character-level module is reset by the word-boundary mark (i.e. <w> or blank), the context vector from the word-level module is the only source of inter-word context information. The model is therefore trained to generate a context vector containing useful information about the probability distribution of the next word. From this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNN LM whose input is a word embedding vector and whose output is a compressed descriptor of the next-word probabilities.
For performing speech recognition, the speech input is converted into a spectrogram by the Fourier transform, decoding is performed by directed (beam) search using RNN networks, and the recognition result is finally produced.
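As a hedged sketch of this front end (the frame length, hop size, and window are illustrative choices, not specified in the patent), a magnitude spectrogram via a short-time Fourier transform:

```python
import numpy as np

# Sketch: magnitude spectrogram of a 1-D signal via the short-time Fourier
# transform. Frame length, hop, and Hann window are illustrative assumptions.

def spectrogram(signal, frame_len=256, hop=128):
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # One FFT per frame; keep only the non-negative frequency bins.
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

t = np.arange(4096) / 8000.0                    # toy 8 kHz signal
spec = spectrogram(np.sin(2 * np.pi * 440 * t)) # 440 Hz tone
```

Each row of `spec` is one time frame and each column one frequency bin; this matrix is what the decoding network would consume.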
Brief description of the drawings
Fig. 1 is the system flow chart of a speech recognition method based on a hierarchical recurrent neural network language model according to the present invention.
Fig. 2 shows training of the RNN-based CLM for a speech recognition method based on a hierarchical recurrent neural network language model according to the present invention.
Fig. 3 shows the hierarchical RNN of a speech recognition method based on a hierarchical recurrent neural network language model according to the present invention.
Fig. 4 shows the two-level hierarchical LSTM (HLSTM) CLM structures of a speech recognition method based on a hierarchical recurrent neural network language model according to the present invention.
Specific embodiment
It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments may be combined with each other. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the system flow chart of a speech recognition method based on a hierarchical recurrent neural network language model according to the present invention. It mainly includes character-level language modeling with an RNN, extending the RNN structure with external clock and reset signals, character-level language modeling with a hierarchical RNN, and performing speech recognition.
Here, for extending the RNN structure with external clock and reset signals, most types of RNNs can be generalized as
$s_t = f(x_t, s_{t-1})$ (1)
$y_t = g(s_t)$ (2)
where $x_t$ is the input, $s_t$ the state, and $y_t$ the output at time step $t$; $f(\cdot)$ is the recursion function and $g(\cdot)$ the output function. For example, the Elman network can be expressed as
$s_t = h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b_h)$ (3)
$y_t = h_t$ (4)
where $h_t$ is the hidden-layer activation, $\sigma(\cdot)$ the activation function, $W_{hx}$ and $W_{hh}$ weight matrices, and $b_h$ a bias vector.
The generalized form also covers LSTMs with forget gates and peephole connections; the forward equations of an LSTM layer are as follows:
$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + W_{im} m_{t-1} + b_i)$ (5)
$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + W_{fm} m_{t-1} + b_f)$ (6)
$m_t = f_t \odot m_{t-1} + i_t \odot \tanh(W_{mx} x_t + W_{mh} h_{t-1} + b_m)$ (7)
$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + W_{om} m_t + b_o)$ (8)
$h_t = o_t \odot \tanh(m_t)$ (9)
where $i_t$, $f_t$, and $o_t$ are the values of the input, forget, and output gates respectively, $m_t$ is the memory-cell activation, $h_t$ the output activation, $\sigma(\cdot)$ the logistic sigmoid function, and $\odot$ the element-wise multiplication operator. These equations are summarized by setting $s_t = [m_t, h_t]$ and $y_t = h_t$.
Further, any generalized RNN can be converted to incorporate an external clock signal $c_t$, as
$s_t = (1 - c_t)\,s_{t-1} + c_t\,f(x_t, s_{t-1})$ (10)
$y_t = g(s_t)$ (11)
where $c_t$ is 0 or 1; the RNN updates its state and output only when $c_t = 1$, and otherwise, when $c_t = 0$, the state and output keep the same values as at the previous step.
A reset of the RNN is performed by setting $s_{t-1}$ to 0. Specifically, equation (10) becomes
$s_t = (1 - c_t)(1 - r_t)\,s_{t-1} + c_t\,f(x_t, (1 - r_t)\,s_{t-1})$ (12)
where the reset signal $r_t$ is 0 or 1; when $r_t = 1$, the RNN forgets the previous context.
If the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable; therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
For performing speech recognition, the speech input is converted into a spectrogram by the Fourier transform, decoding is performed by directed (beam) search using RNN networks, and the recognition result is finally produced.
Fig. 2 shows training of the RNN-based CLM for a speech recognition method based on a hierarchical recurrent neural network language model according to the present invention. For training RNN CLMs, the training data is first converted into a sequence of one-hot encoded character vectors $x_t$, where the characters include a word-boundary symbol <w>, or space, and an optional sentence-boundary symbol <s>. The RNN is trained to predict the next character $x_{t+1}$ by minimizing the cross-entropy loss of the softmax output that represents the probability distribution over the next character.
Fig. 3 shows the hierarchical RNN of a speech recognition method based on a hierarchical recurrent neural network language model according to the present invention. The hierarchical RNN (HRNN) architecture has several RNN modules with different clock rates; a higher-level module uses a slower clock rate than a lower-level one, and each higher-level clock resets the modules below it.
For the RNN modules with different clock rates: with $L$ levels, the RNN consists of $L$ submodules. Each submodule $l$ runs with an external clock $c_{l,t}$ and reset signal $r_{l,t}$, where $l = 1, \dots, L$. The lowest-level module $l = 1$ has the fastest clock rate, i.e. $c_{1,t} = 1$ for all $t$, while higher-level modules $l > 1$ have slower clock rates, and $c_{l,t}$ can be 1 only when $c_{l-1,t} = 1$. A lower-level module $l < L$ is reset by the clock signal of the level above it, i.e. $r_{l,t} = c_{l+1,t}$.
The hidden activation of module $l < L$ is fed to the next higher-level module $l + 1$ with a delay of one time step, to avoid an undesired reset by $r_{l,t} = c_{l+1,t}$ at time $t$. This hidden activation vector, or embedding vector, contains compressed short-term context information. Resetting a module by the higher-level clock signal helps it concentrate on compressing only short-term information. The next higher-level module $l + 1$ processes this short-term information and can generate a long-term context vector, which is fed back to the lower-level module $l$; this context is propagated without delay.
For character-level language modeling, a two-level ($L = 2$) HRNN is used, with $l = 1$ the character-level module and $l = 2$ the word-level module. The word-level module is clocked at word boundaries <w>, typically the space character. The input and softmax output layers are connected to the character-level module, and the information of the current word-boundary mark (e.g. <w> or <s>) is delivered to the word-level module. Because the HRNN has an extensible architecture, the HRNN CLM can be extended with a sentence-level module $l = 3$ for sentence-level context modeling, in which case the sentence-level clock $c_{3,t}$ becomes 1 when the input character is the sentence-boundary mark <s>. In addition, the word-level module should then be clocked at both the word boundary <w> and the sentence boundary <s>. Likewise, the model can be extended with modules of other, higher levels, such as a paragraph-level module or a topic-modeling module.
Fig. 4 shows the two-level hierarchical LSTM (HLSTM) CLM structures of a speech recognition method based on a hierarchical recurrent neural network language model according to the present invention. The two-level HRNN CLM architecture comes in two types, and in both models each submodule has two LSTM layers.
In the HLSTM-A architecture, both LSTM layers of the character-level module receive the one-hot encoded character input; the second layer of the character-level module is therefore a generative model conditioned by the context vector.
In HLSTM-B, the second LSTM layer of the character-level module has no direct connection to the character input. Instead, the word embedding from the first LSTM layer is fed to the second LSTM layer, which makes the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector.
Experimental results indicate that HLSTM-B is more effective for CLM applications.
Because the character-level module is reset by the word-boundary mark (i.e. <w> or blank), the context vector from the word-level module is the only source of inter-word context information. The model is therefore trained to generate a context vector containing useful information about the probability distribution of the next word. From this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNN LM whose input is a word embedding vector and whose output is a compressed descriptor of the next-word probabilities.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and the invention can be realized in other specific forms without departing from its spirit and scope. Furthermore, those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Claims (10)

1. A speech recognition method based on a hierarchical recurrent neural network language model, characterized by mainly comprising: character-level language modeling with an RNN (1); extending the RNN structure with external clock and reset signals (2); character-level language modeling with a hierarchical RNN (3); and performing speech recognition (4).
2. The hierarchical recurrent neural network language model according to claim 1, characterized in that it combines the advantageous features of character-level and word-level language models; the recurrent neural network (RNN) is composed of low-level RNNs and high-level RNNs; the low-level RNN uses character-level input and output, and provides a short-term embedding to the high-level RNN, which operates as a word-level RNN; the high-level RNN does not need complex input and output, because it receives feature information from the low-level network and sends character-prediction information back to the low level in compressed form; therefore, considering its input and output, the proposed network is a character-level language model (CLM), but it contains a word-level model inside; the low-level module runs on the character input clock, while the high-level module runs on the word-separator (<w>) clock; the hierarchical language model can be extended to process information over longer time spans, such as sentences, topics, or other contexts; and the hierarchical language model can be trained end-to-end on text characters.
3. The character-level language modeling with an RNN (1) according to claim 1, characterized in that for training RNN CLMs the training data is first converted into a sequence of one-hot encoded character vectors $x_t$, where the characters include a word-boundary symbol <w>, or space, and an optional sentence-boundary symbol <s>; and the RNN is trained to predict the next character $x_{t+1}$ by minimizing the cross-entropy loss of the softmax output that represents the probability distribution over the next character.
4. The extending of the RNN structure with external clock and reset signals (2) according to claim 1, characterized in that most types of RNNs can be generalized as
$s_t = f(x_t, s_{t-1})$ (1)
$y_t = g(s_t)$ (2)
where $x_t$ is the input, $s_t$ the state, and $y_t$ the output at time step $t$; $f(\cdot)$ is the recursion function and $g(\cdot)$ the output function; for example, the Elman network can be expressed as
$s_t = h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b_h)$ (3)
$y_t = h_t$ (4)
where $h_t$ is the hidden-layer activation, $\sigma(\cdot)$ the activation function, $W_{hx}$ and $W_{hh}$ weight matrices, and $b_h$ a bias vector;
the generalized form also covers LSTMs with forget gates and peephole connections, the forward equations of an LSTM layer being as follows:
$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + W_{im} m_{t-1} + b_i)$ (5)
$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + W_{fm} m_{t-1} + b_f)$ (6)
$m_t = f_t \odot m_{t-1} + i_t \odot \tanh(W_{mx} x_t + W_{mh} h_{t-1} + b_m)$ (7)
$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + W_{om} m_t + b_o)$ (8)
$h_t = o_t \odot \tanh(m_t)$ (9)
where $i_t$, $f_t$, and $o_t$ are the values of the input, forget, and output gates respectively, $m_t$ is the memory-cell activation, $h_t$ the output activation, $\sigma(\cdot)$ the logistic sigmoid function, and $\odot$ the element-wise multiplication operator; these equations are summarized by setting $s_t = [m_t, h_t]$ and $y_t = h_t$.
5. The extending of the RNN structure with external clock and reset signals according to claim 4, characterized in that any generalized RNN can be converted to incorporate an external clock signal $c_t$, as
$s_t = (1 - c_t)\,s_{t-1} + c_t\,f(x_t, s_{t-1})$ (10)
$y_t = g(s_t)$ (11)
where $c_t$ is 0 or 1; the RNN updates its state and output only when $c_t = 1$, and otherwise, when $c_t = 0$, the state and output keep the same values as at the previous step;
a reset of the RNN is performed by setting $s_{t-1}$ to 0; specifically, equation (10) becomes
$s_t = (1 - c_t)(1 - r_t)\,s_{t-1} + c_t\,f(x_t, (1 - r_t)\,s_{t-1})$ (12)
where the reset signal $r_t$ is 0 or 1, and when $r_t = 1$, the RNN forgets the previous context;
if the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable; therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
6. The character-level language modeling with a hierarchical RNN (3) according to claim 1, characterized in that the proposed hierarchical RNN (HRNN) architecture has several RNN modules with different clock rates; a higher-level module uses a slower clock rate than a lower-level one, and each higher-level clock resets the modules below it.
7. The RNN modules with different clock rates according to claim 6, characterized in that with $L$ levels the RNN consists of $L$ submodules; each submodule $l$ runs with an external clock $c_{l,t}$ and reset signal $r_{l,t}$, where $l = 1, \dots, L$; the lowest-level module $l = 1$ has the fastest clock rate, i.e. $c_{1,t} = 1$ for all $t$, while higher-level modules $l > 1$ have slower clock rates, and $c_{l,t}$ can be 1 only when $c_{l-1,t} = 1$; and a lower-level module $l < L$ is reset by the clock signal of the level above it, i.e. $r_{l,t} = c_{l+1,t}$;
the hidden activation of module $l < L$ is fed to the next higher-level module $l + 1$ with a delay of one time step, to avoid an undesired reset by $r_{l,t} = c_{l+1,t}$ at time $t$; this hidden activation vector, or embedding vector, contains compressed short-term context information; resetting a module by the higher-level clock signal helps it concentrate on compressing only short-term information; the next higher-level module $l + 1$ processes this short-term information and can generate a long-term context vector, which is fed back to the lower-level module $l$, and this context is propagated without delay.
8. The character-level language modeling according to claim 6, characterized in that a two-level ($L = 2$) HRNN is used, with $l = 1$ the character-level module and $l = 2$ the word-level module; the word-level module is clocked at word boundaries <w>, typically the space character; the input and softmax output layers are connected to the character-level module, and the information of the current word-boundary mark (e.g. <w> or <s>) is delivered to the word-level module; because the HRNN has an extensible architecture, the HRNN CLM can be extended with a sentence-level module $l = 3$ for sentence-level context modeling, in which case the sentence-level clock $c_{3,t}$ becomes 1 when the input character is the sentence-boundary mark <s>; in addition, the word-level module should then be clocked at both the word boundary <w> and the sentence boundary <s>; likewise, the model can be extended with modules of other, higher levels, such as a paragraph-level module or a topic-modeling module.
9. The two-level HRNN CLM architecture according to claim 8, characterized in that it comes in two types, and in both models each submodule has two LSTM layers;
in the HLSTM-A architecture, both LSTM layers of the character-level module receive the one-hot encoded character input, so the second layer of the character-level module is a generative model conditioned by the context vector;
in HLSTM-B, the second LSTM layer of the character-level module has no direct connection to the character input; instead, the word embedding from the first LSTM layer is fed to the second LSTM layer, which makes the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector;
experimental results indicate that HLSTM-B is more effective for CLM applications;
because the character-level module is reset by the word-boundary mark (i.e. <w> or blank), the context vector from the word-level module is the only source of inter-word context information; the model is therefore trained to generate a context vector containing useful information about the probability distribution of the next word; from this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNN LM whose input is a word embedding vector and whose output is a compressed descriptor of the next-word probabilities.
10. The performing of speech recognition (4) according to claim 1, characterized in that the speech input is converted into a spectrogram by the Fourier transform, decoding is performed by directed (beam) search using RNN networks, and the recognition result is finally produced.
CN201611059843.4A 2016-11-25 2016-11-25 A speech recognition method based on a hierarchical recurrent neural network language model Withdrawn CN106782518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611059843.4A CN106782518A (en) 2016-11-25 2016-11-25 A speech recognition method based on a hierarchical recurrent neural network language model


Publications (1)

Publication Number Publication Date
CN106782518A true CN106782518A (en) 2017-05-31

Family

ID=58913229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611059843.4A Withdrawn CN106782518A (en) 2016-11-25 2016-11-25 A speech recognition method based on a hierarchical recurrent neural network language model

Country Status (1)

Country Link
CN (1) CN106782518A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153943A (en) * 2017-12-08 2018-06-12 南京航空航天大学 Behavior modeling method of power amplifier based on clock-cycle recurrent neural network
CN108175426A (en) * 2017-12-11 2018-06-19 东南大学 A kind of lie detecting method that Boltzmann machine is limited based on depth recursion type condition
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN109003614A (en) * 2018-07-31 2018-12-14 上海爱优威软件开发有限公司 A kind of voice transmission method, voice-transmission system and terminal
CN109086865A (en) * 2018-06-11 2018-12-25 上海交通大学 A kind of series model method for building up based on cutting Recognition with Recurrent Neural Network
CN109147773A (en) * 2017-06-16 2019-01-04 上海寒武纪信息科技有限公司 A kind of speech recognition equipment and method
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
WO2019154210A1 (en) * 2018-02-08 2019-08-15 腾讯科技(深圳)有限公司 Machine translation method and device, and computer-readable storage medium
CN110389996A (en) * 2018-04-16 2019-10-29 国际商业机器公司 Realize the full sentence recurrent neural network language model for being used for natural language processing
CN111480197A (en) * 2017-12-15 2020-07-31 三菱电机株式会社 Speech recognition system
CN112673421A (en) * 2018-11-28 2021-04-16 谷歌有限责任公司 Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances
CN113077785A (en) * 2019-12-17 2021-07-06 中国科学院声学研究所 End-to-end multi-language continuous voice stream voice content identification method and system
CN113362811A (en) * 2021-06-30 2021-09-07 北京有竹居网络技术有限公司 Model training method, speech recognition method, device, medium and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KYUYEON HWANG et al.: "Character-Level Language Modeling with Hierarchical Recurrent Neural Networks", published online: https://arxiv.org/abs/1609.03777v1 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147773A (en) * 2017-06-16 2019-01-04 上海寒武纪信息科技有限公司 A kind of speech recognition equipment and method
CN108153943A (en) * 2017-12-08 2018-06-12 南京航空航天大学 Behavior modeling method of power amplifier based on clock-cycle recurrent neural network
CN108153943B (en) * 2017-12-08 2021-07-23 南京航空航天大学 Behavior modeling method of power amplifier based on clock cycle neural network
CN108175426A (en) * 2017-12-11 2018-06-19 东南大学 A kind of lie detecting method that Boltzmann machine is limited based on depth recursion type condition
CN111480197B (en) * 2017-12-15 2023-06-27 三菱电机株式会社 Speech recognition system
CN111480197A (en) * 2017-12-15 2020-07-31 三菱电机株式会社 Speech recognition system
CN111401084A (en) * 2018-02-08 2020-07-10 腾讯科技(深圳)有限公司 Method and device for machine translation and computer readable storage medium
US11593571B2 (en) 2018-02-08 2023-02-28 Tencent Technology (Shenzhen) Company Limited Machine translation method, device, and computer-readable storage medium
CN111401084B (en) * 2018-02-08 2022-12-23 腾讯科技(深圳)有限公司 Method and device for machine translation and computer readable storage medium
WO2019154210A1 (en) * 2018-02-08 2019-08-15 腾讯科技(深圳)有限公司 Machine translation method and device, and computer-readable storage medium
CN108492820B (en) * 2018-03-20 2021-08-10 华南理工大学 Chinese speech recognition method based on cyclic neural network language model and deep neural network acoustic model
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model
CN110389996A (en) * 2018-04-16 2019-10-29 国际商业机器公司 Realize the full sentence recurrent neural network language model for being used for natural language processing
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN109086865A (en) * 2018-06-11 2018-12-25 上海交通大学 A kind of series model method for building up based on cutting Recognition with Recurrent Neural Network
CN109086865B (en) * 2018-06-11 2022-01-28 上海交通大学 Sequence model establishing method based on segmented recurrent neural network
CN109003614A (en) * 2018-07-31 2018-12-14 上海爱优威软件开发有限公司 A kind of voice transmission method, voice-transmission system and terminal
CN112673421A (en) * 2018-11-28 2021-04-16 谷歌有限责任公司 Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
CN113077785A (en) * 2019-12-17 2021-07-06 中国科学院声学研究所 End-to-end multi-language continuous voice stream voice content identification method and system
CN113077785B (en) * 2019-12-17 2022-07-12 中国科学院声学研究所 End-to-end multi-language continuous voice stream voice content identification method and system
CN113362811A (en) * 2021-06-30 2021-09-07 北京有竹居网络技术有限公司 Model training method, speech recognition method, device, medium and equipment

Similar Documents

Publication Publication Date Title
CN106782518A (en) A speech recognition method based on a hierarchical recurrent neural network language model
CN109902293B (en) Text classification method based on local and global mutual attention mechanism
WO2016101688A1 (en) Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network
CN105244020B (en) Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
JP7109302B2 (en) Text generation model update method and text generation device
Arisoy et al. Bidirectional recurrent neural network language models for automatic speech recognition
JP2020520492A (en) Document abstract automatic extraction method, device, computer device and storage medium
CN111461004B (en) Event detection method and device based on graph attention neural network and electronic equipment
TWI610295B (en) Computer-implemented method of decompressing and compressing transducer data for speech recognition and computer-implemented system of speech recognition
US20230196202A1 (en) System and method for automatic building of learning machines using learning machines
CN110083702B (en) Aspect level text emotion conversion method based on multi-task learning
CN110442721B (en) Neural network language model, training method, device and storage medium
CN110019795B (en) Sensitive word detection model training method and system
CN112764738A (en) Code automatic generation method and system based on multi-view program characteristics
CN114881035B (en) Training data augmentation method, device, equipment and storage medium
CN113238797A (en) Code feature extraction method and system based on hierarchical comparison learning
CN113157941B (en) Service characteristic data processing method, service characteristic data processing device, text generating method, text generating device and electronic equipment
CN113641819A (en) Multi-task sparse sharing learning-based argument mining system and method
CN108475346A (en) Neural random access machine
CN113869324A (en) Video common-sense knowledge reasoning implementation method based on multi-mode fusion
JP2021117989A (en) Language generation method, device and electronic apparatus
CN116306612A (en) Word and sentence generation method and related equipment
CN112650861A (en) Personality prediction method, system and device based on task layering
CN113901789A (en) Gate-controlled hole convolution and graph convolution based aspect-level emotion analysis method and system
WO2020250279A1 (en) Model learning device, method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 2017-05-31)