CN106782518A - A speech recognition method based on a hierarchical recurrent neural network language model - Google Patents
A speech recognition method based on a hierarchical recurrent neural network language model
- Publication number
- CN106782518A CN106782518A CN201611059843.4A CN201611059843A CN106782518A CN 106782518 A CN106782518 A CN 106782518A CN 201611059843 A CN201611059843 A CN 201611059843A CN 106782518 A CN106782518 A CN 106782518A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
Abstract
This invention proposes a speech recognition method based on a hierarchical recurrent neural network (RNN) language model. Its main contents are: character-level language modeling with an RNN; extending the RNN structure with external clock and reset signals; character-level language modeling with a hierarchical RNN; and performing speech recognition. The process is: first perform character-level language modeling with an RNN, then extend the RNN structure with external clock and reset signals, perform character-level language modeling with the hierarchical RNN, and finally carry out speech recognition. The invention replaces the traditional single-clock character-level RNN language model with a hierarchical recurrent neural network language model, achieving better recognition accuracy with fewer parameters; the language model supports a large vocabulary while requiring less storage; and the hierarchical language model can be extended to process longer-term information such as sentences, topics, or other contexts.
Description
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method based on a hierarchical recurrent neural network language model.
Background
With the development of modern technology, character-level language models (CLMs) based on recurrent neural networks (RNNs) are widely used in fields such as speech recognition, text generation, and machine translation. Their ability to model words never seen in training is highly useful. However, their performance is generally worse than that of word-level language models (WLMs). Moreover, statistical language models require large storage space, often more than 1 GB, because they must account not only for a large vocabulary but also for combinations of words.
The present invention proposes a speech recognition method based on a hierarchical recurrent neural network language model, whose hierarchical RNN architecture consists of multiple modules with different clock rates. Despite the multi-clock structure, the input and output layers both operate with the character-level clock, which allows existing training methods for character-level RNN language models to be applied directly without any modification. The method first performs character-level language modeling with an RNN, then extends the RNN structure with external clock and reset signals, performs character-level language modeling with the hierarchical RNN, and finally carries out speech recognition. The invention replaces the traditional single-clock character-level RNN language model with a hierarchical recurrent neural network language model, achieving better recognition accuracy with fewer parameters; the language model supports a large vocabulary while requiring less storage; and the hierarchical language model can be extended to process longer-term information such as sentences, topics, or other contexts.
Content of the invention
In view of the problems of low recognition accuracy and large memory footprint, the object of the present invention is to provide a speech recognition method based on a hierarchical recurrent neural network language model: first perform character-level language modeling with an RNN, then extend the RNN structure with external clock and reset signals, perform character-level language modeling with the hierarchical RNN, and finally carry out speech recognition.
To solve the above problems, the present invention provides a speech recognition method based on a hierarchical recurrent neural network language model, whose main contents include:
(1) character-level language modeling with an RNN;
(2) extending the RNN structure with external clock and reset signals;
(3) character-level language modeling with a hierarchical RNN;
(4) performing speech recognition.
The hierarchical recurrent neural network language model combines the advantageous properties of character-level and word-level language models. The recurrent neural network consists of low-level RNNs and high-level RNNs. The low-level RNN uses character-level input and output, and provides short-term embeddings to the high-level RNN, which operates as a word-level RNN. The high-level RNN does not need complicated input and output, because it receives feature information from the low-level network and sends character-prediction information back to the low level in a compressed form. Therefore, in terms of its input and output, the proposed network is a character-level language model (CLM), but it contains a word-level model inside. The low-level module runs on the character input clock, while the high-level module runs on the word-delimiting space symbol (<w>). The hierarchical language model can be extended to process longer-term information such as sentences, topics, or other contexts, and can be trained end-to-end on text characters.
For character-level language modeling with an RNN, the training data is first converted to a sequence of one-hot encoded character vectors x_t, where the characters include the word-boundary symbol <w> (i.e., the space) and, optionally, the sentence-boundary symbol <s>. The RNN is trained to predict the next character x_{t+1} by minimizing the cross-entropy loss of the softmax output that represents the probability distribution of the next character.
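As a minimal sketch of the conversion and training objective described above: the following builds the one-hot sequence x_t and evaluates the next-character cross-entropy loss. The four-symbol vocabulary, the toy character stream, and the uniform stand-in softmax outputs are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def one_hot_sequence(chars, vocab):
    # map each character to a one-hot row vector
    idx = {ch: i for i, ch in enumerate(vocab)}
    seq = np.zeros((len(chars), len(vocab)))
    for t, ch in enumerate(chars):
        seq[t, idx[ch]] = 1.0
    return seq

vocab = ['<w>', 'a', 'b', 'c']             # tiny illustrative character set
chars = ['a', 'b', '<w>', 'c', 'a']        # '<w>' stands for the space / word boundary
x = one_hot_sequence(chars, vocab)         # the sequence x_t of one-hot vectors

inputs, targets = x[:-1], x[1:]            # train to predict x_{t+1} from x_t
probs = np.full((len(inputs), len(vocab)), 1.0 / len(vocab))  # stand-in softmax outputs
loss = -np.mean(np.log((probs * targets).sum(axis=1)))        # cross-entropy to minimize
```

In a real setup, `probs` would be the RNN's softmax outputs and `loss` would be minimized by gradient descent.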
To extend the RNN structure with external clock and reset signals, note that most types of RNNs can be generalized as

s_t = f(x_t, s_{t-1})    (1)
y_t = g(s_t)    (2)

where x_t is the input, s_t is the state, y_t is the output at time step t, f(·) is the recursion function, and g(·) is the output function. For example, an Elman network can be written as

s_t = h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)    (3)
y_t = h_t    (4)

where h_t is the hidden-layer activation, σ(·) is the activation function, W_hx and W_hh are weight matrices, and b_h is a bias vector.

LSTMs with forget gates and peephole connections can also be converted to this generalized form. The forward equations of an LSTM layer are:

i_t = σ(W_ix x_t + W_ih h_{t-1} + W_im m_{t-1} + b_i)    (5)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fm m_{t-1} + b_f)    (6)
m_t = f_t ∘ m_{t-1} + i_t ∘ tanh(W_mx x_t + W_mh h_{t-1} + b_m)    (7)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_om m_t + b_o)    (8)
h_t = o_t ∘ tanh(m_t)    (9)

where i_t, f_t, and o_t are the values of the input, forget, and output gates respectively, m_t is the memory cell activation, h_t is the output activation, σ(·) is the logistic sigmoid function, and ∘ is the element-wise multiplication operator. These equations are covered by the generalized form by setting s_t = [m_t, h_t] and y_t = h_t.
Further, any generalized RNN can be modified to incorporate an external clock signal c_t as

s_t = (1 - c_t) s_{t-1} + c_t f(x_t, s_{t-1})    (10)
y_t = g(s_t)    (11)

where c_t is 0 or 1. The RNN updates its state and output only when c_t = 1; otherwise, when c_t = 0, the state and output keep the same values as in the previous step.

The RNN is reset by setting s_{t-1} to 0. Specifically, Equation (10) becomes

s_t = (1 - c_t)(1 - r_t) s_{t-1} + c_t f(x_t, (1 - r_t) s_{t-1})    (12)

where the reset signal r_t is 0 or 1. When r_t = 1, the RNN forgets the previous context.

If the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable. Therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
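The clocked and reset state update of Equations (10)–(12) can be sketched as follows, using a toy Elman cell as f(·). The state and input sizes and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_hx = rng.normal(size=(3, 2))             # illustrative weights for Eq. (3)
W_hh = rng.normal(size=(3, 3))
b_h = np.zeros(3)

def f(x, s_prev):                          # Elman recursion f(.) of Eq. (3)
    return np.tanh(W_hx @ x + W_hh @ s_prev + b_h)

def step(x, s_prev, c, r):                 # clocked and reset update, Eq. (12)
    s_prev = (1 - r) * s_prev              # r = 1 zeroes the state: forget the context
    return (1 - c) * s_prev + c * f(x, s_prev)

x = np.ones(2)
s0 = np.zeros(3)
s1 = step(x, s0, c=1, r=0)                 # clocked: state updates
s2 = step(x, s1, c=0, r=0)                 # not clocked: state is held
s3 = step(x, s1, c=1, r=1)                 # reset then update: previous context forgotten
```

Because `step` is built only from differentiable operations and multiplications by the binary signals c and r, BPTT applies to it unchanged, as stated above.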
For character-level language modeling with a hierarchical RNN, the proposed hierarchical RNN (HRNN) architecture consists of several RNN modules with different clock rates. Higher-level modules use slower clock rates than lower-level ones, and each clocking of a higher-level module resets the module below it.
Further, regarding the RNN modules with different clock rates: if there are L levels, the RNN consists of L submodules. Each submodule l runs with an external clock c_{l,t} and reset signal r_{l,t}, where l = 1, …, L. The lowest-level module l = 1 has the fastest clock rate, i.e., c_{1,t} = 1 for all t. Higher-level modules l > 1 have slower clock rates, and c_{l,t} can be 1 only when c_{l-1,t} = 1. A lower-level module l < L is reset by the clock signal of the level above, i.e., r_{l,t} = c_{l+1,t}.
The hidden activation of module l < L is fed to the next higher-level module l + 1 with a delay of one time step, to avoid the undesirable reset caused by r_{l,t} = c_{l+1,t} = 1 at that step. This hidden activation vector, or embedding vector, contains compressed short-term context information. Being reset by the higher-level clock signal helps the module concentrate on compressing only short-term information. The next higher-level module l + 1 processes this short-term information and can generate a long-term context vector, which it feeds back to the lower-level module l; this context is propagated without delay.
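The clock and reset scheduling described above can be illustrated for a two-level HRNN (L = 2), where the word-level clock fires at word boundaries and also resets the character level. The character stream itself is an illustrative assumption.

```python
# Derive per-module clock and reset signals from a character stream.
# '<w>' marks word boundaries, as in the text above.
def hrnn_signals(chars):
    c1 = [1] * len(chars)                           # level 1: clocked at every character
    c2 = [1 if ch == '<w>' else 0 for ch in chars]  # level 2: clocked only at <w>
    r1 = list(c2)                                   # r_{1,t} = c_{2,t}: <w> resets level 1
    return c1, c2, r1

c1, c2, r1 = hrnn_signals(['t', 'o', '<w>', 'b', 'e', '<w>'])
```

Note that c2 is 1 only where c1 is 1 (trivially, since c1 is always 1), matching the constraint that a higher-level module may be clocked only when the level below is.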
Further, for character-level language modeling, a two-level (L = 2) HRNN is used, with l = 1 as the character-level module and l = 2 as the word-level module. The word-level module is clocked at word boundaries, <w>, typically the space character. The input and softmax output layers are connected to the character-level module, and the information of whether the current character is a boundary marker (e.g., <w> or <s>) is supplied to the word-level module. Because the HRNN has an extensible architecture, the HRNN CLM can be extended with a sentence-level module l = 3 for sentence-level context modeling. In this case, the sentence-level clock c_{3,t} becomes 1 when the input character is the sentence-boundary marker <s>. In addition, the word-level module should then be clocked at both the word boundary <w> and the sentence boundary <s>. Likewise, the model can be extended with modules of other, higher levels, such as a paragraph-level module or a topic-modeling module.
Further, the two-level HRNN CLM architecture comes in two types; in both models each submodule has two LSTM layers.
In the HLSTM-A architecture, both LSTM layers in the character-level module receive the one-hot encoded character input; the second layer of the character-level module is therefore a generative model conditioned on the context vector.
In HLSTM-B, the second LSTM layer of the character-level module is not directly connected to the character input. Instead, the word embedding from the first LSTM layer is fed to the second LSTM layer, which makes the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector.
Experimental results show that HLSTM-B is more effective for CLM applications.
Because the character-level module is reset at word-boundary markers (i.e., <w> or whitespace), the context vector from the word-level module is the only source of inter-word context information. Therefore, the trained model generates context vectors that carry useful information about the probability distribution over the next word. From this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNN LM whose input is a word-embedding vector and whose output is a compressed descriptor of the next-word probabilities.
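A conceptual sketch of this two-level wiring follows: a character-level module clocked every step and reset at word boundaries, feeding its embedding upward with a one-step delay, while the word-level module returns a context vector that conditions the character level. The toy tanh cells stand in for the LSTM submodules; all sizes, weights, and inputs are illustrative assumptions rather than the patent's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4                                        # shared toy state size
Wc = rng.normal(size=(D, D)) * 0.1           # character-module weights
Ww = rng.normal(size=(D, D)) * 0.1           # word-module weights

def cell(W, x, s):                           # toy recurrent cell standing in for an LSTM pair
    return np.tanh(W @ (x + s))

def hrnn_forward(xs, boundaries):
    s_char = np.zeros(D)                     # character-level state (reset at <w>)
    s_word = np.zeros(D)                     # word-level state: source of the context vector
    delayed = np.zeros(D)                    # embedding fed upward with a one-step delay
    outs = []
    for x, is_boundary in zip(xs, boundaries):
        if is_boundary:                      # c_{2,t} = 1 at a word boundary ...
            s_word = cell(Ww, delayed, s_word)
            s_char = np.zeros(D)             # ... which also resets the character module
        s_char = cell(Wc, x + s_word, s_char)  # context vector conditions the char module
        outs.append(s_char)
        delayed = s_char
    return outs

outs = hrnn_forward([np.ones(D)] * 5, [0, 0, 1, 0, 0])
```

After the boundary at step 3, the character state has been cleared, so everything the model remembers about earlier words flows through `s_word` alone — the "exclusive source of inter-word context" noted above.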
For speech recognition, the speech input is converted to a spectrogram by the Fourier transform, decoding is performed by directed (beam) search using RNN networks, and the recognition result is finally produced.
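The front end described above can be sketched as a short-time Fourier transform producing a magnitude spectrogram. The frame size, hop, Hann window, and toy 440 Hz test signal are common illustrative choices, not taken from the patent.

```python
import numpy as np

def spectrogram(wave, frame=256, hop=128):
    # slice the waveform into overlapping windowed frames, then take
    # the magnitude of the real FFT of each frame
    frames = [wave[i:i + frame] * np.hanning(frame)
              for i in range(0, len(wave) - frame + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

t = np.arange(4096) / 8000.0                 # toy 8 kHz sampling grid
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

The spectrogram frames would then be fed to the acoustic RNN, with the hierarchical CLM rescoring hypotheses during the search-based decoding.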
Brief description of the drawings
Fig. 1 is the system flowchart of the speech recognition method based on a hierarchical recurrent neural network language model of the present invention.
Fig. 2 illustrates training of the RNN-based CLM in the speech recognition method based on a hierarchical recurrent neural network language model of the present invention.
Fig. 3 shows the hierarchical RNN of the speech recognition method based on a hierarchical recurrent neural network language model of the present invention.
Fig. 4 shows the two-level hierarchical LSTM (HLSTM) structure of the CLM in the speech recognition method based on a hierarchical recurrent neural network language model of the present invention.
Specific embodiments
It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the system flowchart of the speech recognition method based on a hierarchical recurrent neural network language model of the present invention. It mainly includes character-level language modeling with an RNN, extending the RNN structure with external clock and reset signals, character-level language modeling with a hierarchical RNN, and performing speech recognition.
To extend the RNN structure with external clock and reset signals, note that most types of RNNs can be generalized as

s_t = f(x_t, s_{t-1})    (1)
y_t = g(s_t)    (2)

where x_t is the input, s_t is the state, y_t is the output at time step t, f(·) is the recursion function, and g(·) is the output function. For example, an Elman network can be written as

s_t = h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)    (3)
y_t = h_t    (4)

where h_t is the hidden-layer activation, σ(·) is the activation function, W_hx and W_hh are weight matrices, and b_h is a bias vector.

LSTMs with forget gates and peephole connections can also be converted to this generalized form. The forward equations of an LSTM layer are:

i_t = σ(W_ix x_t + W_ih h_{t-1} + W_im m_{t-1} + b_i)    (5)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fm m_{t-1} + b_f)    (6)
m_t = f_t ∘ m_{t-1} + i_t ∘ tanh(W_mx x_t + W_mh h_{t-1} + b_m)    (7)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_om m_t + b_o)    (8)
h_t = o_t ∘ tanh(m_t)    (9)

where i_t, f_t, and o_t are the values of the input, forget, and output gates respectively, m_t is the memory cell activation, h_t is the output activation, σ(·) is the logistic sigmoid function, and ∘ is the element-wise multiplication operator. These equations are covered by the generalized form by setting s_t = [m_t, h_t] and y_t = h_t.

Further, any generalized RNN can be modified to incorporate an external clock signal c_t as

s_t = (1 - c_t) s_{t-1} + c_t f(x_t, s_{t-1})    (10)
y_t = g(s_t)    (11)

where c_t is 0 or 1. The RNN updates its state and output only when c_t = 1; otherwise, when c_t = 0, the state and output keep the same values as in the previous step.

The RNN is reset by setting s_{t-1} to 0. Specifically, Equation (10) becomes

s_t = (1 - c_t)(1 - r_t) s_{t-1} + c_t f(x_t, (1 - r_t) s_{t-1})    (12)

where the reset signal r_t is 0 or 1. When r_t = 1, the RNN forgets the previous context.

If the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable. Therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
For speech recognition, the speech input is converted to a spectrogram by the Fourier transform, decoding is performed by directed (beam) search using RNN networks, and the recognition result is finally produced.
Fig. 2 illustrates training of the RNN-based CLM in the speech recognition method based on a hierarchical recurrent neural network language model of the present invention. For training the RNN CLM, the training data is first converted to a sequence of one-hot encoded character vectors x_t, where the characters include the word-boundary symbol <w> (i.e., the space) and, optionally, the sentence-boundary symbol <s>. The RNN is trained to predict the next character x_{t+1} by minimizing the cross-entropy loss of the softmax output that represents the probability distribution of the next character.
Fig. 3 shows the hierarchical RNN of the speech recognition method based on a hierarchical recurrent neural network language model of the present invention. The hierarchical RNN (HRNN) architecture consists of several RNN modules with different clock rates; higher-level modules use slower clock rates than lower-level ones, and each clocking of a higher-level module resets the module below it.
Regarding the RNN modules with different clock rates: if there are L levels, the RNN consists of L submodules. Each submodule l runs with an external clock c_{l,t} and reset signal r_{l,t}, where l = 1, …, L. The lowest-level module l = 1 has the fastest clock rate, i.e., c_{1,t} = 1 for all t. Higher-level modules l > 1 have slower clock rates, and c_{l,t} can be 1 only when c_{l-1,t} = 1. A lower-level module l < L is reset by the clock signal of the level above, i.e., r_{l,t} = c_{l+1,t}.
The hidden activation of module l < L is fed to the next higher-level module l + 1 with a delay of one time step, to avoid the undesirable reset caused by r_{l,t} = c_{l+1,t} = 1 at that step. This hidden activation vector, or embedding vector, contains compressed short-term context information. Being reset by the higher-level clock signal helps the module concentrate on compressing only short-term information. The next higher-level module l + 1 processes this short-term information and can generate a long-term context vector, which it feeds back to the lower-level module l; this context is propagated without delay.
For character-level language modeling, a two-level (L = 2) HRNN is used, with l = 1 as the character-level module and l = 2 as the word-level module. The word-level module is clocked at word boundaries, <w>, typically the space character. The input and softmax output layers are connected to the character-level module, and the information of whether the current character is a boundary marker (e.g., <w> or <s>) is supplied to the word-level module. Because the HRNN has an extensible architecture, the HRNN CLM can be extended with a sentence-level module l = 3 for sentence-level context modeling. In this case, the sentence-level clock c_{3,t} becomes 1 when the input character is the sentence-boundary marker <s>. In addition, the word-level module should then be clocked at both the word boundary <w> and the sentence boundary <s>. Likewise, the model can be extended with modules of other, higher levels, such as a paragraph-level module or a topic-modeling module.
Fig. 4 shows the two-level hierarchical LSTM (HLSTM) structure of the CLM in the speech recognition method based on a hierarchical recurrent neural network language model of the present invention. The two-level HRNN CLM architecture comes in two types; in both models each submodule has two LSTM layers.
In the HLSTM-A architecture, both LSTM layers in the character-level module receive the one-hot encoded character input; the second layer of the character-level module is therefore a generative model conditioned on the context vector.
In HLSTM-B, the second LSTM layer of the character-level module is not directly connected to the character input. Instead, the word embedding from the first LSTM layer is fed to the second LSTM layer, which makes the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector.
Experimental results show that HLSTM-B is more effective for CLM applications.
Because the character-level module is reset at word-boundary markers (i.e., <w> or whitespace), the context vector from the word-level module is the only source of inter-word context information. Therefore, the trained model generates context vectors that carry useful information about the probability distribution over the next word. From this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNN LM whose input is a word-embedding vector and whose output is a compressed descriptor of the next-word probabilities.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and can be realized in other specific forms without departing from its spirit or scope. Furthermore, those skilled in the art may make various changes and modifications to the invention without departing from its spirit and scope, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
1. A speech recognition method based on a hierarchical recurrent neural network language model, characterized by mainly comprising: character-level language modeling with an RNN (step one); extending the RNN structure with external clock and reset signals (step two); character-level language modeling with a hierarchical RNN (step three); and performing speech recognition (step four).
2. The hierarchical recurrent neural network language model according to claim 1, characterized in that it combines the advantageous properties of character-level and word-level language models; the recurrent neural network (RNN) consists of low-level RNNs and high-level RNNs; the low-level RNN uses character-level input and output and provides short-term embeddings to the high-level RNN, which operates as a word-level RNN; the high-level RNN does not need complicated input and output, because it receives feature information from the low-level network and sends character-prediction information back to the low level in a compressed form; therefore, in terms of its input and output, the proposed network is a character-level language model (CLM), but it contains a word-level model inside; the low-level module runs on the character input clock, while the high-level module runs on the word-delimiting space symbol (<w>); the hierarchical language model can be extended to process longer-term information such as sentences, topics, or other contexts; and the hierarchical language model can be trained end-to-end on text characters.
3. The character-level language modeling with an RNN (step one) according to claim 1, characterized in that, for training the RNN CLM, the training data is first converted to a sequence of one-hot encoded character vectors x_t, where the characters include the word-boundary symbol <w> (i.e., the space) and, optionally, the sentence-boundary symbol <s>; the RNN is trained to predict the next character x_{t+1} by minimizing the cross-entropy loss of the softmax output that represents the probability distribution of the next character.
4. The extending of the RNN structure with external clock and reset signals (step two) according to claim 1, characterized in that most types of RNNs can be generalized as

s_t = f(x_t, s_{t-1})    (1)
y_t = g(s_t)    (2)

where x_t is the input, s_t is the state, y_t is the output at time step t, f(·) is the recursion function, and g(·) is the output function; for example, an Elman network can be written as

s_t = h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)    (3)
y_t = h_t    (4)

where h_t is the hidden-layer activation, σ(·) is the activation function, W_hx and W_hh are weight matrices, and b_h is a bias vector;

LSTMs with forget gates and peephole connections can also be converted to this generalized form; the forward equations of an LSTM layer are:

i_t = σ(W_ix x_t + W_ih h_{t-1} + W_im m_{t-1} + b_i)    (5)
f_t = σ(W_fx x_t + W_fh h_{t-1} + W_fm m_{t-1} + b_f)    (6)
m_t = f_t ∘ m_{t-1} + i_t ∘ tanh(W_mx x_t + W_mh h_{t-1} + b_m)    (7)
o_t = σ(W_ox x_t + W_oh h_{t-1} + W_om m_t + b_o)    (8)
h_t = o_t ∘ tanh(m_t)    (9)

where i_t, f_t, and o_t are the values of the input, forget, and output gates respectively, m_t is the memory cell activation, h_t is the output activation, σ(·) is the logistic sigmoid function, and ∘ is the element-wise multiplication operator; these equations are covered by the generalized form by setting s_t = [m_t, h_t] and y_t = h_t.
5. The extending of the RNN structure with external clock and reset signals according to claim 4, characterized in that any generalized RNN can be modified to incorporate an external clock signal c_t as

s_t = (1 - c_t) s_{t-1} + c_t f(x_t, s_{t-1})    (10)
y_t = g(s_t)    (11)

where c_t is 0 or 1; the RNN updates its state and output only when c_t = 1; otherwise, when c_t = 0, the state and output keep the same values as in the previous step;

the RNN is reset by setting s_{t-1} to 0; specifically, Equation (10) becomes

s_t = (1 - c_t)(1 - r_t) s_{t-1} + c_t f(x_t, (1 - r_t) s_{t-1})    (12)

where the reset signal r_t is 0 or 1; when r_t = 1, the RNN forgets the previous context;

if the original RNN equations are differentiable, the extended equations with clock and reset signals are also differentiable; therefore, existing gradient-based training algorithms for RNNs, such as backpropagation through time (BPTT), can be used to train the extended version without any modification.
6. The character-level language modeling with a hierarchical RNN (step three) according to claim 1, characterized in that the proposed hierarchical RNN (HRNN) architecture consists of several RNN modules with different clock rates; higher-level modules use slower clock rates than lower-level ones, and each clocking of a higher-level module resets the module below it.
7. The RNN modules with different clock rates according to claim 6, characterized in that, if there are L levels, the RNN consists of L submodules; each submodule l runs with an external clock c_{l,t} and reset signal r_{l,t}, where l = 1, …, L; the lowest-level module l = 1 has the fastest clock rate, i.e., c_{1,t} = 1 for all t; higher-level modules l > 1 have slower clock rates, and c_{l,t} can be 1 only when c_{l-1,t} = 1; a lower-level module l < L is reset by the clock signal of the level above, i.e., r_{l,t} = c_{l+1,t};
the hidden activation of module l < L is fed to the next higher-level module l + 1 with a delay of one time step, to avoid the undesirable reset caused by r_{l,t} = c_{l+1,t} = 1 at that step; this hidden activation vector, or embedding vector, contains compressed short-term context information; being reset by the higher-level clock signal helps the module concentrate on compressing only short-term information; the next higher-level module l + 1 processes this short-term information and can generate a long-term context vector, which it feeds back to the lower-level module l; this context is propagated without delay.
8. The character-level language modeling according to claim 6, characterised in that a two-level (L = 2) HRNN is used, with l = 1 as the character-level module and l = 2 as the word-level module; the word-level module is clocked at word boundaries <w>, typically the space character; the input character-level module is connected to the softmax output layer, and information about the current boundary token (e.g., <w> or <s>) is passed to the word-level module; because the HRNN has a scalable architecture, the HRNN CLM can be extended with a sentence-level module l = 3 for sentence-level context modeling; in that case, the sentence-level clock c_{3,t} becomes 1 when the input character is the sentence-boundary token <s>; moreover, the word-level module is then clocked at both word boundaries <w> and sentence boundaries <s>; likewise, the model can be extended with further higher-level modules, such as a paragraph-level module or a topic-modeling module.
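The boundary-driven clocks above can be derived from the character stream alone. A minimal sketch, assuming the space character stands for <w> and '.' for <s> (both assumptions for illustration; the patent only names the tokens):

```python
def hierarchy_clocks(chars):
    """Per-level clock signals for a character stream.
    c1 ticks every step; c2 ticks at word boundaries; c3 at sentence
    boundaries.  Per the claim, the word-level clock also ticks at
    sentence boundaries."""
    c1 = [1] * len(chars)                                # character level
    c2 = [1 if ch in (' ', '.') else 0 for ch in chars]  # word level
    c3 = [1 if ch == '.' else 0 for ch in chars]         # sentence level
    # hierarchy invariant: c_{l,t} may be 1 only when c_{l-1,t} = 1
    assert all(hi <= lo for lo, hi in zip(c2, c3))
    return c1, c2, c3
```

The final assertion checks the scalability property the claim relies on: each added level's clock is a strict subsampling of the level below.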
9. The two-level HRNN CLM architecture according to claim 8, characterised in that the two-level HRNN CLM architecture has two types, and in both models each submodule has two LSTM layers;
in the HLSTM-A architecture, both LSTM layers of the character-level module receive the one-hot encoded character input; the second layer of the character-level module is therefore a generative model conditioned on the context vector;
in HLSTM-B, the second LSTM layer of the character-level module has no direct connection to the character input; instead, the embedding from the first LSTM layer is fed into the second LSTM layer, so that the first and second layers of the character-level module work together to estimate the next-character probabilities given the context vector;
experimental results show that HLSTM-B is more effective for CLM applications;
since the character-level module is reset at word-boundary tokens (i.e., <w> or whitespace), the context vector from the word-level module is the only source of inter-word contextual information; the trained model therefore generates context vectors that contain useful information about the probability distribution over the next word; from this point of view, the word-level module in the HRNN CLM architecture can be regarded as a word-level RNNLM whose input is a word-embedding vector and whose output is a compressed descriptor of the next-word probabilities.
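The wiring difference between the two variants can be made concrete. The sketch below uses placeholder tanh cells rather than the patent's trained LSTMs, and only contrasts what each second layer receives; every name and dimension is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 5, 8                        # character vocab / hidden size (illustrative)

def layer(h, x, W, U):             # placeholder standing in for an LSTM layer
    return np.tanh(W @ x + U @ h)

W1 = rng.normal(size=(H, V + H)); U1 = rng.normal(size=(H, H))
Wa = rng.normal(size=(H, V + H))   # HLSTM-A: layer 2 also sees the character
Wb = rng.normal(size=(H, H + H))   # HLSTM-B: layer 2 sees layer 1's embedding
U2 = rng.normal(size=(H, H))

def char_step(variant, x_char, context, h1, h2):
    """One step of the character-level module.
    x_char: one-hot character input; context: word-level context vector."""
    h1 = layer(h1, np.concatenate([x_char, context]), W1, U1)
    if variant == "A":             # both layers receive the character input
        h2 = layer(h2, np.concatenate([x_char, context]), Wa, U2)
    else:                          # "B": no direct character connection
        h2 = layer(h2, np.concatenate([h1, context]), Wb, U2)
    return h1, h2
```

In both variants the first layer is identical; only the second layer's input changes, which is exactly the distinction the claim draws.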
10. Performing the speech recognition (step four) according to claim 1, characterised in that the speech input is converted into a spectrogram by a Fourier transform, decoding is performed as a directed search using the RNN networks, and the recognition result is finally produced.
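The spectrogram front end named in this claim amounts to a short-time Fourier transform. A minimal sketch; the frame length and hop (25 ms / 10 ms at 16 kHz) are common defaults assumed here, not values from the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform:
    cut the signal into overlapping windowed frames, then map each
    frame to the frequency domain with a real-input FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)
```

A pure 1000 Hz tone at a 16 kHz sampling rate should peak in bin 1000 / (16000 / 400) = 25 of every frame, which is a quick way to sanity-check the frequency axis.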
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611059843.4A CN106782518A (en) | 2016-11-25 | 2016-11-25 | A kind of audio recognition method based on layered circulation neutral net language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106782518A true CN106782518A (en) | 2017-05-31 |
Family
ID=58913229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611059843.4A Withdrawn CN106782518A (en) | 2016-11-25 | 2016-11-25 | A kind of audio recognition method based on layered circulation neutral net language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106782518A (en) |
Non-Patent Citations (1)
Title |
---|
KYUYEON HWANG et al.: "Character-Level Language Modeling with Hierarchical Recurrent Neural Networks", published online: https://arxiv.org/abs/1609.03777v1 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109147773A (en) * | 2017-06-16 | 2019-01-04 | 上海寒武纪信息科技有限公司 | A kind of speech recognition equipment and method |
CN108153943A (en) * | 2017-12-08 | 2018-06-12 | 南京航空航天大学 | The behavior modeling method of power amplifier based on dock cycles neural network |
CN108153943B (en) * | 2017-12-08 | 2021-07-23 | 南京航空航天大学 | Behavior modeling method of power amplifier based on clock cycle neural network |
CN108175426A (en) * | 2017-12-11 | 2018-06-19 | 东南大学 | A kind of lie detecting method that Boltzmann machine is limited based on depth recursion type condition |
CN111480197B (en) * | 2017-12-15 | 2023-06-27 | 三菱电机株式会社 | Speech recognition system |
CN111480197A (en) * | 2017-12-15 | 2020-07-31 | 三菱电机株式会社 | Speech recognition system |
CN111401084A (en) * | 2018-02-08 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Method and device for machine translation and computer readable storage medium |
US11593571B2 (en) | 2018-02-08 | 2023-02-28 | Tencent Technology (Shenzhen) Company Limited | Machine translation method, device, and computer-readable storage medium |
CN111401084B (en) * | 2018-02-08 | 2022-12-23 | 腾讯科技(深圳)有限公司 | Method and device for machine translation and computer readable storage medium |
WO2019154210A1 (en) * | 2018-02-08 | 2019-08-15 | 腾讯科技(深圳)有限公司 | Machine translation method and device, and computer-readable storage medium |
CN108492820B (en) * | 2018-03-20 | 2021-08-10 | 华南理工大学 | Chinese speech recognition method based on cyclic neural network language model and deep neural network acoustic model |
CN108492820A (en) * | 2018-03-20 | 2018-09-04 | 华南理工大学 | Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model |
CN110389996A (en) * | 2018-04-16 | 2019-10-29 | 国际商业机器公司 | Realize the full sentence recurrent neural network language model for being used for natural language processing |
CN108830287A (en) * | 2018-04-18 | 2018-11-16 | 哈尔滨理工大学 | The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method |
CN109086865A (en) * | 2018-06-11 | 2018-12-25 | 上海交通大学 | A kind of series model method for building up based on cutting Recognition with Recurrent Neural Network |
CN109086865B (en) * | 2018-06-11 | 2022-01-28 | 上海交通大学 | Sequence model establishing method based on segmented recurrent neural network |
CN109003614A (en) * | 2018-07-31 | 2018-12-14 | 上海爱优威软件开发有限公司 | A kind of voice transmission method, voice-transmission system and terminal |
CN112673421A (en) * | 2018-11-28 | 2021-04-16 | 谷歌有限责任公司 | Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances |
CN110111797A (en) * | 2019-04-04 | 2019-08-09 | 湖北工业大学 | Method for distinguishing speek person based on Gauss super vector and deep neural network |
CN113077785A (en) * | 2019-12-17 | 2021-07-06 | 中国科学院声学研究所 | End-to-end multi-language continuous voice stream voice content identification method and system |
CN113077785B (en) * | 2019-12-17 | 2022-07-12 | 中国科学院声学研究所 | End-to-end multi-language continuous voice stream voice content identification method and system |
CN113362811A (en) * | 2021-06-30 | 2021-09-07 | 北京有竹居网络技术有限公司 | Model training method, speech recognition method, device, medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106782518A (en) | A kind of audio recognition method based on layered circulation neutral net language model | |
WO2016101688A1 (en) | Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network | |
JP7109302B2 (en) | Text generation model update method and text generation device | |
CN105244020B (en) | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device | |
JP2020520492A (en) | Document abstract automatic extraction method, device, computer device and storage medium | |
TWI610295B (en) | Computer-implemented method of decompressing and compressing transducer data for speech recognition and computer-implemented system of speech recognition | |
Räsänen et al. | Modeling dependencies in multiple parallel data streams with hyperdimensional computing | |
CN110442721B (en) | Neural network language model, training method, device and storage medium | |
CN110083702B (en) | Aspect level text emotion conversion method based on multi-task learning | |
CN113641819A (en) | Multi-task sparse sharing learning-based argument mining system and method | |
CN112764738A (en) | Code automatic generation method and system based on multi-view program characteristics | |
CN110019795B (en) | Sensitive word detection model training method and system | |
WO2024193382A1 (en) | Knowledge injection and training methods and systems for knowledge-enhanced pre-trained language model | |
CN113238797A (en) | Code feature extraction method and system based on hierarchical comparison learning | |
JP2021117989A (en) | Language generation method, device and electronic apparatus | |
CN116306612A (en) | Word and sentence generation method and related equipment | |
CN113869324A (en) | Video common-sense knowledge reasoning implementation method based on multi-mode fusion | |
CN113901789A (en) | Gate-controlled hole convolution and graph convolution based aspect-level emotion analysis method and system | |
CN116431807B (en) | Text classification method and device, storage medium and electronic device | |
US20240071369A1 (en) | Pre-training method, pre-training device, and pre-training program | |
CN112650861A (en) | Personality prediction method, system and device based on task layering | |
US12020694B2 (en) | Efficiency adjustable speech recognition system | |
CN117371433B (en) | Processing method and device of title prediction model | |
WO2024052996A1 (en) | Learning device, conversion device, learning method, conversion method, and program | |
CN117828072B (en) | Dialogue classification method and system based on heterogeneous graph neural network |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20170531 |