CN104700828A - Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle - Google Patents

Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle

Info

Publication number
CN104700828A
CN104700828A
Authority
CN
China
Prior art keywords
neural network
input
recurrent neural
long term
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510122982.6A
Other languages
Chinese (zh)
Other versions
CN104700828B (en)
Inventor
杨毅
孙甲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201510122982.6A
Publication of CN104700828A
Priority to PCT/CN2015/092381
Application granted
Publication of CN104700828B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Abstract

Disclosed is a construction method for a deep long short-term memory (LSTM) recurrent neural network acoustic model based on the selective attention principle. Attention gate units are added inside a deep LSTM recurrent neural network acoustic model to represent the instantaneous functional changes of auditory cortex neurons. These attention gates differ from the other gate units: the other gates correspond one-to-one with the time series, whereas the attention gates embody a short-term plasticity effect and therefore appear only at intervals in the time series. By training this neural network acoustic model on a large amount of speech data containing cross-talk noise, robust feature extraction under cross-talk noise and the construction of robust acoustic models can be achieved; suppressing the influence of non-target streams on feature extraction achieves the goal of improving the robustness of the acoustic model. The method can be widely applied in machine-learning fields related to speech recognition, such as speaker recognition, keyword spotting, and human-computer interaction.

Description

Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle
Technical field
The invention belongs to the field of audio technology, and in particular relates to a construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle.
Background art
With the rapid development of information technology, speech recognition technology has reached the conditions for large-scale commercial use. Current speech recognition mainly adopts continuous speech recognition based on statistical models, whose main goal is to find the most probable word sequence represented by the speech signal; this generally involves building an acoustic model and a language model together with the corresponding search and decoding methods. With the rapid development of acoustic and language models, the performance of speech recognition systems has greatly improved under ideal acoustic conditions. The existing Deep Neural Network-Hidden Markov Model (DNN-HMM) approach is preliminarily mature: it can automatically extract effective features through machine learning and can model the contextual information spanning multiple speech frames. However, each layer of such a model has parameters on the order of millions, and the input of each layer is the output of the previous one, so GPU devices are required to train DNN acoustic models, training times are long, and the strong nonlinearity and parameter sharing also make DNNs difficult to adapt through parameter adaptation.
A Recurrent Neural Network (RNN) is a neural network with directed cycles between its units that express the internal dynamic temporal behavior of the network; it has been widely applied in handwriting recognition, language modeling, and related fields. Speech is a complex time-varying signal with complicated correlations across different time scales, so compared with deep neural networks, the recurrent connections of an RNN are better suited to processing such complex time-series data.
As one kind of recurrent neural network, the Long Short-Term Memory (LSTM) model is better suited than a plain RNN to processing and predicting events in long time series with indefinite delays. The deep LSTM-RNN acoustic model with memory blocks proposed by the University of Toronto combines the multi-level representation ability of deep neural networks with the RNN's flexible use of long-span context, reducing the phoneme recognition error rate on the TIMIT corpus to 17.1%.
However, the gradient descent method used in recurrent neural networks suffers from the vanishing gradient problem: as the weights of the network are adjusted, the gradient dissipates layer by layer as the number of layers increases, and its effect on weight adjustment becomes smaller and smaller. The two-layer deep LSTM-RNN acoustic model proposed by Google adds a linear Recurrent Projection Layer to solve the vanishing gradient problem of the original deep LSTM-RNN model. Comparative experiments show that the frame accuracy and convergence speed of a plain RNN are clearly inferior to those of LSTM-RNN and DNN. In terms of word error rate and convergence speed, the best DNN reaches a word error rate of 11.3% after several weeks of training, whereas the two-layer deep LSTM-RNN model reaches 10.9% after 48 hours of training, and 10.7% and 10.5% after 100 and 200 hours, respectively.
The Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) acoustic model proposed by the University of Munich defines separate forward and backward layers in each recurrent layer of the network, uses multi-hidden-layer acoustic feature inputs for higher-level representation, and applies supervised feature projection and enhancement against noise and reverberation. On the 2013 PASCAL CHiME dataset, this method reduced the word error rate from the 55% baseline to 22% over the signal-to-noise-ratio range [-6 dB, 9 dB].
Nevertheless, the complexity of real acoustic environments still severely degrades and disturbs the performance of continuous speech recognition systems. Even the current mainstream DNN acoustic model methods achieve only about 70% recognition rate on continuous speech recognition datasets recorded under complex conditions that include noise, music, spontaneous speech, and repetitions; the noise immunity and robustness of acoustic models in continuous speech recognition systems still leave much room for improvement.
With the rapid development of acoustic and language models, the performance of speech recognition systems has greatly improved under ideal acoustic conditions, and the existing DNN-HMM model is preliminarily mature, able to automatically extract effective features through machine learning and to model the contextual information spanning multiple speech frames. However, most recognition systems remain very sensitive to changes in the acoustic environment, and in particular cannot meet practical performance requirements under cross-talk noise (two or more people speaking simultaneously). Compared with deep neural network acoustic models, the units of a recurrent neural network acoustic model have directed cycles that effectively describe the internal dynamic temporal behavior of the network, making it more suitable for speech data with complex time series. The LSTM network is in turn better suited than a plain RNN to processing and predicting events in long time series with indefinite delays, so acoustic models built on it for speech recognition can achieve better results.
The human brain exhibits a selective attention phenomenon when processing speech in complex scenes. Its basic principle is that the brain has the ability of auditory selective attention: through a top-down control mechanism in the auditory cortex region, it suppresses non-target streams and enhances the target stream. Research shows that during selective attention, the short-term plasticity effect of the auditory cortex increases the ability to discriminate sounds. When attention is highly focused, the primary auditory cortex can begin enhancing the target sound within 50 milliseconds.
Summary of the invention
In order to overcome the shortcomings of the above prior art, the object of the present invention is to provide a construction method for a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle. Attention gate units are added inside the deep LSTM recurrent neural network acoustic model to represent the instantaneous functional changes of auditory cortex neurons. The attention gates differ from the other gate units in that the other gates correspond one-to-one with the time series, whereas the attention gates embody a short-term plasticity effect and therefore appear only at intervals in the time series. By training this neural network acoustic model on a large amount of speech data containing cross-talk noise, robust feature extraction under cross-talk noise and the construction of robust acoustic models can be achieved, and suppressing the influence of non-target streams on feature extraction achieves the goal of improving the robustness of the acoustic model.
To achieve these goals, the technical solution adopted by the present invention is as follows.
A continuous speech recognition method based on the selective attention principle comprises the following steps.
Step 1: build a deep long short-term memory recurrent neural network based on the selective attention principle.
A long short-term memory (LSTM) recurrent neural network is defined from the input to the hidden layer; "deep" means that the output of each LSTM recurrent network is the input of the next LSTM recurrent network, and so on, with the output of the last LSTM recurrent network serving as the output of the whole system. In each LSTM recurrent network, the speech signal x_t is the input at time t and x_{t-1} is the input at time t-1; over the whole utterance the input is x = [x_1, ..., x_T], where t ∈ [1, T] and T is the total duration of the speech signal. The LSTM recurrent network at time t consists of an attention gate, an input gate, an output gate, a forget gate, a memory cell, tanh functions, a hidden layer, and multipliers; the LSTM recurrent network at time t-1 consists of an input gate, an output gate, a forget gate, a memory cell, tanh functions, a hidden layer, and multipliers. The hidden-layer output over the whole utterance is y = [y_1, ..., y_T].
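For illustration only, the following is a minimal numpy sketch of the deep stacking just described, in which each layer's output sequence becomes the next layer's input sequence. A plain LSTM step is used here for brevity (the attention-gated step of the invention is sketched after the formulas in the detailed description); all function names, the dict-based parameter layout, and the dimensions are assumptions of the sketch rather than part of the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, p):
    """One plain LSTM time step (peephole terms omitted for brevity)."""
    i = sigmoid(p["W_ix"] @ x_t + p["W_im"] @ m_prev + p["b_i"])
    f = sigmoid(p["W_fx"] @ x_t + p["W_fm"] @ m_prev + p["b_f"])
    g = np.tanh(p["W_cx"] @ x_t + p["W_cm"] @ m_prev + p["b_c"])
    o = sigmoid(p["W_ox"] @ x_t + p["W_om"] @ m_prev + p["b_o"])
    c = f * c_prev + i * g        # memory cell update
    m = o * np.tanh(c)            # hidden-layer output of this timestep
    return m, c

def deep_lstm_forward(x, layer_params, hidden_dim):
    """'Deep' stacking: the output sequence of each LSTM network is the
    input sequence of the next, and the output of the last network is
    the output of the whole system."""
    seq = list(x)                 # x = [x_1, ..., x_T]
    for p in layer_params:        # one parameter dict per layer
        m = np.zeros(hidden_dim)
        c = np.zeros(hidden_dim)
        out = []
        for x_t in seq:
            m, c = lstm_step(x_t, m, c, p)
            out.append(m)
        seq = out                 # feed this layer's outputs to the next
    return seq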
Step 2: build the deep long short-term memory recurrent neural network acoustic model based on the selective attention principle.
On the basis of step 1, the deep LSTM recurrent neural network corresponding to every s-th timestep has an attention gate, while the deep LSTM recurrent networks at all other timesteps have no attention gate; that is, the deep long short-term memory recurrent neural network acoustic model based on the selective attention principle is composed of deep LSTM recurrent networks in which attention gates appear only at intervals.
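As a sketch of the interval rule only, assuming timesteps are indexed from t = 1 and the attention gate recurs exactly every s steps (one possible reading of the interval described above):

```python
def has_attention_gate(t, s):
    """True at the timesteps whose deep LSTM network has an attention
    gate; the other gates (input, forget, output) exist at every step."""
    return t % s == 0
```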
Recognition in complex environments, and in particular under cross-talk noise, has always been one of the difficult problems of speech recognition and has hindered its large-scale application. Compared with the prior art, the present invention draws on the selective attention phenomenon that the human brain exhibits when processing speech in complex scenes, which suppresses non-target streams and enhances the target stream. Attention gate units are added inside the deep LSTM recurrent neural network acoustic model to represent the instantaneous functional changes of auditory cortex neurons; the attention gates differ from the other gates in that the other gates correspond one-to-one with the time series, whereas the attention gates embody a short-term plasticity effect and therefore appear only at intervals in the time series. Applied to continuous speech recognition datasets containing cross-talk noise, this method can achieve better performance than deep neural network methods.
Description of the drawings
Fig. 1 is a flow diagram of the deep long short-term memory recurrent neural network based on the selective attention principle of the present invention.
Fig. 2 is a flow diagram of the deep long short-term memory neural network acoustic model based on the selective attention principle of the present invention.
Detailed description
Embodiments of the present invention are described in detail below with reference to the drawings and examples.
The present invention uses a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle to achieve continuous speech recognition. However, the models and methods provided by the invention are not limited to continuous speech recognition and can be used in any method or device related to speech recognition.
The present invention mainly comprises the following steps.
Step 1: build the deep long short-term memory recurrent neural network based on the selective attention principle.
As shown in Fig. 1, inputs 101 and 102 are the speech signal inputs x_t and x_{t-1} at times t and t-1 (t ∈ [1, T], where T is the total duration of the speech signal). The LSTM recurrent network at time t consists of the attention gate 103, input gate 104, forget gate 105, memory cell 106, output gate 107, tanh function 108, tanh function 109, hidden layer 110, multiplier 122, multiplier 123, and multiplier 124. The LSTM recurrent network at time t-1 consists of the input gate 112, forget gate 113, memory cell 114, output gate 115, tanh function 116, tanh function 117, hidden layer 118, multiplier 120, and multiplier 121. The hidden-layer outputs at times t and t-1 are output 111 and output 119, respectively.
Input 102 serves simultaneously as the input of the input gate 112, the forget gate 113, the output gate 115, and the tanh function 116. The outputs of the input gate 112 and the tanh function 116 are fed to multiplier 120, whose result is the input of memory cell 114. The output of memory cell 114 is the input of the tanh function 117; the outputs of the tanh function 117 and the output gate 115 are fed to multiplier 121, whose result is the input of hidden layer 118; and the output of hidden layer 118 is output 119.
Input 101, the output of memory cell 114, and the output of multiplier 121 together form the input of the attention gate 103. The output of the attention gate 103 and the output of multiplier 121 together form the input of the tanh function 108. The output of the attention gate 103, the output of memory cell 114, and the output of multiplier 121 together form the inputs of the input gate 104, the forget gate 105, and the output gate 107, respectively. The outputs of the forget gate 105 and memory cell 114 are fed to multiplier 124; the outputs of the input gate 104 and the tanh function 108 are fed to multiplier 122. The outputs of multiplier 124 and multiplier 122 form the input of memory cell 106; the output of memory cell 106 is the input of the tanh function 109; the outputs of the tanh function 109 and the output gate 107 are fed to multiplier 123; the output of multiplier 123 is the input of hidden layer 110; and the output of hidden layer 110 is output 111.
That is, the parameters at each time t ∈ [1, T] are computed according to the following formulas:

$$G_{atten,t} = \mathrm{sigmoid}(W_{ax} x_t + W_{am} m_{t-1} + W_{ac} Cell_{t-1} + b_a)$$

$$G_{input,t} = \mathrm{sigmoid}(W_{ia} G_{atten,t} + W_{im} m_{t-1} + W_{ic} Cell_{t-1} + b_i)$$

$$G_{forget,t} = \mathrm{sigmoid}(W_{fa} G_{atten,t} + W_{fm} m_{t-1} + W_{fc} Cell_{t-1} + b_f)$$

$$Cell_t = G_{forget,t} \odot Cell_{t-1} + G_{input,t} \odot \tanh(W_{ca} G_{atten,t} + W_{cm} m_{t-1} + b_c)$$

$$G_{output,t} = \mathrm{sigmoid}(W_{oa} G_{atten,t} + W_{om} m_{t-1} + W_{oc} Cell_{t-1} + b_o)$$

$$m_t = G_{output,t} \odot \tanh(Cell_t)$$

$$y_t = \mathrm{softmax}_k(W_{ym} m_t + b_y)$$

where G_atten,t is the output of the attention gate 103 at time t; G_input,t is the output of the input gate 104; G_forget,t is the output of the forget gate 105; Cell_t is the output of the memory cell 106; G_output,t is the output of the output gate 107; m_t is the input of the hidden layer 110; and y_t is the output 111 at time t. x_t is the input 101 at time t; m_{t-1} is the input of the hidden layer 118 at time t-1; and Cell_{t-1} is the output of the memory cell 114 at time t-1. W_ax, W_am, and W_ac are the weights between the attention gate a at time t and, respectively, the input x at time t, the hidden-layer input m at time t-1, and the memory cell c at time t-1; W_ia, W_im, and W_ic are the weights between the input gate i at time t and, respectively, the attention gate a at time t, the hidden-layer input m at time t-1, and the memory cell c at time t-1; W_fa, W_fm, and W_fc are the corresponding weights for the forget gate f; W_ca and W_cm are the weights between the memory cell c at time t and, respectively, the attention gate a at time t and the hidden-layer input m at time t-1; and W_oa, W_om, and W_oc are the corresponding weights for the output gate o. b_a, b_i, b_f, b_c, b_o, and b_y are the bias terms of the attention gate a, the input gate i, the forget gate f, the memory cell c, the output gate o, and the output y, respectively; different b denote different biases. The nonlinearities are

$$\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}, \qquad \tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}, \qquad \mathrm{softmax}_k(x) = \frac{e^{x_k}}{\sum_{l=1}^{K} e^{x_l}}$$

where x_k denotes the input of the k-th softmax unit, k ∈ [1, K], the sum runs over all l ∈ [1, K], and ⊙ denotes element-wise multiplication.
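For concreteness, here is a minimal numpy sketch of one attention-gated time step implementing the seven formulas above. The weight and bias names mirror the W and b symbols defined in the text and ⊙ becomes element-wise *; everything else (the dict-based parameter layout, vector shapes, and the matrix-product form of the Cell_{t-1} terms) is an assumption of the sketch, not a prescription of the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))      # shift for numerical stability
    return e / e.sum()

def attention_lstm_step(x_t, m_prev, cell_prev, p):
    """One attention-gated LSTM time step.

    x_t       : input x_t at time t
    m_prev    : hidden-layer input m_{t-1}
    cell_prev : memory cell output Cell_{t-1}
    p         : dict of weight matrices W_* and bias vectors b_*
    Returns (y_t, m_t, Cell_t).
    """
    g_atten = sigmoid(p["W_ax"] @ x_t + p["W_am"] @ m_prev
                      + p["W_ac"] @ cell_prev + p["b_a"])
    g_input = sigmoid(p["W_ia"] @ g_atten + p["W_im"] @ m_prev
                      + p["W_ic"] @ cell_prev + p["b_i"])
    g_forget = sigmoid(p["W_fa"] @ g_atten + p["W_fm"] @ m_prev
                       + p["W_fc"] @ cell_prev + p["b_f"])
    cell = (g_forget * cell_prev
            + g_input * np.tanh(p["W_ca"] @ g_atten
                                + p["W_cm"] @ m_prev + p["b_c"]))
    g_output = sigmoid(p["W_oa"] @ g_atten + p["W_om"] @ m_prev
                       + p["W_oc"] @ cell_prev + p["b_o"])
    m_t = g_output * np.tanh(cell)             # hidden-layer input m_t
    y_t = softmax(p["W_ym"] @ m_t + p["b_y"])  # output y_t
    return y_t, m_t, cell
```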
Step 2: build the deep long short-term memory recurrent neural network acoustic model based on the selective attention principle.
On the basis of step 1, the deep LSTM recurrent neural network corresponding to every s-th timestep (here s = 5) has an attention gate, while the deep LSTM recurrent networks at all other timesteps have no attention gate; that is, the acoustic model based on the selective attention principle is composed of deep LSTM recurrent networks in which attention gates appear only at intervals. Fig. 2 shows the constructed deep long short-term memory recurrent neural network acoustic model based on the selective attention principle: the deep LSTM recurrent network at time t has an attention gate 201, the network at time t-s has an attention gate 202, and so on.
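Continuing the sketch, and reusing sigmoid, softmax, and attention_lstm_step from above, one layer run over the whole utterance might look as follows. At timesteps without an attention gate the gates here read the raw input x_t in place of G_atten,t, which is one reading consistent with the wiring of the time t-1 network in Fig. 1; the exact fallback, the parameter sharing, and the dimensions are assumptions of the sketch:

```python
import numpy as np

def plain_lstm_step(x_t, m_prev, cell_prev, p):
    """A timestep without the attention gate: identical to
    attention_lstm_step except that the gates read x_t directly
    (assumes the input and attention-gate dimensions match)."""
    g_input = sigmoid(p["W_ia"] @ x_t + p["W_im"] @ m_prev
                      + p["W_ic"] @ cell_prev + p["b_i"])
    g_forget = sigmoid(p["W_fa"] @ x_t + p["W_fm"] @ m_prev
                       + p["W_fc"] @ cell_prev + p["b_f"])
    cell = (g_forget * cell_prev
            + g_input * np.tanh(p["W_ca"] @ x_t
                                + p["W_cm"] @ m_prev + p["b_c"]))
    g_output = sigmoid(p["W_oa"] @ x_t + p["W_om"] @ m_prev
                       + p["W_oc"] @ cell_prev + p["b_o"])
    m_t = g_output * np.tanh(cell)
    y_t = softmax(p["W_ym"] @ m_t + p["b_y"])
    return y_t, m_t, cell

def run_layer(x, p, s=5, hidden_dim=64):
    """Run one layer over x = [x_1, ..., x_T], applying the
    attention-gated step every s timesteps (s = 5 in the embodiment,
    cf. Fig. 2) and the plain step at all other timesteps."""
    m = np.zeros(hidden_dim)
    cell = np.zeros(hidden_dim)
    outputs = []
    for t, x_t in enumerate(x, start=1):   # t in [1, T]
        step = attention_lstm_step if t % s == 0 else plain_lstm_step
        y_t, m, cell = step(x_t, m, cell, p)
        outputs.append(y_t)
    return outputs
```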

Claims (2)

1. A construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle, comprising the following steps:
Step 1: build a deep long short-term memory recurrent neural network based on the selective attention principle.
A long short-term memory (LSTM) recurrent neural network is defined from the input to the hidden layer; "deep" means that the output of each LSTM recurrent network is the input of the next LSTM recurrent network, and so on, with the output of the last LSTM recurrent network serving as the output of the whole system. In each LSTM recurrent network, the speech signal x_t is the input at time t and x_{t-1} is the input at time t-1; over the whole utterance the input is x = [x_1, ..., x_T], where t ∈ [1, T] and T is the total duration of the speech signal. The LSTM recurrent network at time t consists of an attention gate, an input gate, an output gate, a forget gate, a memory cell, tanh functions, a hidden layer, and multipliers; the LSTM recurrent network at time t-1 consists of an input gate, an output gate, a forget gate, a memory cell, tanh functions, a hidden layer, and multipliers. The hidden-layer output over the whole utterance is y = [y_1, ..., y_T].
The parameters at each time t ∈ [1, T] are computed according to the following formulas:

$$G_{atten,t} = \mathrm{sigmoid}(W_{ax} x_t + W_{am} m_{t-1} + W_{ac} Cell_{t-1} + b_a)$$

$$G_{input,t} = \mathrm{sigmoid}(W_{ia} G_{atten,t} + W_{im} m_{t-1} + W_{ic} Cell_{t-1} + b_i)$$

$$G_{forget,t} = \mathrm{sigmoid}(W_{fa} G_{atten,t} + W_{fm} m_{t-1} + W_{fc} Cell_{t-1} + b_f)$$

$$Cell_t = G_{forget,t} \odot Cell_{t-1} + G_{input,t} \odot \tanh(W_{ca} G_{atten,t} + W_{cm} m_{t-1} + b_c)$$

$$G_{output,t} = \mathrm{sigmoid}(W_{oa} G_{atten,t} + W_{om} m_{t-1} + W_{oc} Cell_{t-1} + b_o)$$

$$m_t = G_{output,t} \odot \tanh(Cell_t)$$

$$y_t = \mathrm{softmax}_k(W_{ym} m_t + b_y)$$

where G_atten,t is the output of the attention gate at time t; G_input,t is the output of the input gate at time t; G_forget,t is the output of the forget gate at time t; Cell_t is the output of the memory cell at time t; G_output,t is the output of the output gate at time t; m_t is the input of the hidden layer at time t; y_t is the output at time t; x_t is the input at time t; m_{t-1} is the input of the hidden layer at time t-1; and Cell_{t-1} is the output of the memory cell at time t-1. W_ax, W_am, and W_ac are the weights between the attention gate a at time t and, respectively, the input x at time t, the hidden-layer input m at time t-1, and the memory cell c at time t-1; W_ia, W_im, and W_ic are the weights between the input gate i at time t and, respectively, the attention gate a at time t, the hidden-layer input m at time t-1, and the memory cell c at time t-1; W_fa, W_fm, and W_fc are the corresponding weights for the forget gate f; W_ca and W_cm are the weights between the memory cell c at time t and, respectively, the attention gate a at time t and the hidden-layer input m at time t-1; and W_oa, W_om, and W_oc are the corresponding weights for the output gate o. b_a, b_i, b_f, b_c, b_o, and b_y are the bias terms of the attention gate a, the input gate i, the forget gate f, the memory cell c, the output gate o, and the output y, respectively; different b denote different biases. The nonlinearities are

$$\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}, \qquad \tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}, \qquad \mathrm{softmax}_k(x) = \frac{e^{x_k}}{\sum_{l=1}^{K} e^{x_l}}$$

where x_k denotes the input of the k-th softmax unit, k ∈ [1, K], the sum runs over all l ∈ [1, K], and ⊙ denotes element-wise multiplication.
Step 2: build the deep long short-term memory recurrent neural network acoustic model based on the selective attention principle.
On the basis of step 1, the deep LSTM recurrent neural network corresponding to every s-th timestep has an attention gate, while the deep LSTM recurrent networks at all other timesteps have no attention gate; that is, the deep long short-term memory recurrent neural network acoustic model based on the selective attention principle is composed of deep LSTM recurrent networks in which attention gates appear only at intervals.
2. The construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle according to claim 1, characterized in that s = 5.
CN201510122982.6A 2015-03-19 2015-03-19 Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle Active CN104700828B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510122982.6A CN104700828B (en) 2015-03-19 2015-03-19 Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle
PCT/CN2015/092381 WO2016145850A1 (en) 2015-03-19 2015-10-21 Construction method for deep long short-term memory recurrent neural network acoustic model based on selective attention principle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510122982.6A CN104700828B (en) 2015-03-19 2015-03-19 Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle

Publications (2)

Publication Number Publication Date
CN104700828A true CN104700828A (en) 2015-06-10
CN104700828B CN104700828B (en) 2018-01-12

Family

ID=53347887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510122982.6A Active CN104700828B (en) 2015-03-19 2015-03-19 Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle

Country Status (2)

Country Link
CN (1) CN104700828B (en)
WO (1) WO2016145850A1 (en)

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN105354277A (en) * 2015-10-30 2016-02-24 中国船舶重工集团公司第七0九研究所 Recommendation method and system based on recurrent neural network
CN105513591A (en) * 2015-12-21 2016-04-20 百度在线网络技术(北京)有限公司 Method and device for speech recognition by use of LSTM recurrent neural network model
CN105956469A (en) * 2016-04-27 2016-09-21 百度在线网络技术(北京)有限公司 Method and device for identifying file security
WO2016145850A1 (en) * 2015-03-19 2016-09-22 清华大学 Construction method for deep long short-term memory recurrent neural network acoustic model based on selective attention principle
CN106096729A * 2016-06-06 2016-11-09 天津科技大学 Deep policy learning method for complex tasks in large-scale environments
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition
CN106650789A * 2016-11-16 2017-05-10 同济大学 Image description generation method based on a deep LSTM network
CN106683663A (en) * 2015-11-06 2017-05-17 三星电子株式会社 Neural network training apparatus and method, and speech recognition apparatus and method
CN107293291A * 2016-03-30 2017-10-24 中国科学院声学研究所 End-to-end speech recognition method based on an adaptive learning rate
CN107293288A * 2017-06-09 2017-10-24 清华大学 Acoustic model modeling method based on a residual long short-term memory recurrent neural network
CN107484017A * 2017-07-25 2017-12-15 天津大学 Supervised video summarization method based on an attention model
CN107492121A * 2017-07-03 2017-12-19 广州新节奏智能科技股份有限公司 Two-dimensional human skeleton point localization method for monocular depth video
CN107563122A * 2017-09-20 2018-01-09 长沙学院 Crime prediction method based on an interleaved-time-series locally connected recurrent neural network
CN107993636A (en) * 2017-11-01 2018-05-04 天津大学 Music score modeling and generation method based on recurrent neural network
CN108269569A (en) * 2017-01-04 2018-07-10 三星电子株式会社 Audio recognition method and equipment
CN108304914A (en) * 2017-01-12 2018-07-20 三星电子株式会社 System and method for high-order shot and long term memory network
CN108475505A (en) * 2015-11-12 2018-08-31 谷歌有限责任公司 Using partial condition target sequence is generated from list entries
CN108780521A * 2016-02-04 2018-11-09 渊慧科技有限公司 Associative long short-term memory neural network layers
CN109155132A (en) * 2016-03-21 2019-01-04 亚马逊技术公司 Speaker verification method and system
CN109243494A * 2018-10-30 2019-01-18 南京工程学院 Child emotion recognition method based on a multi-attention long short-term memory network
CN109243493A * 2018-10-30 2019-01-18 南京工程学院 Infant cry emotion recognition method based on an improved long short-term memory network
CN109460812A * 2017-09-06 2019-03-12 富士通株式会社 Average-information analysis device, optimization device, and feature visualization device for a neural network
CN109523995A * 2018-12-26 2019-03-26 出门问问信息科技有限公司 Speech recognition method, speech recognition device, readable storage medium, and electronic device
CN109614485A * 2018-11-19 2019-04-12 中山大学 Sentence matching method and device using hierarchical attention based on syntactic structure
CN109866713A (en) * 2019-03-21 2019-06-11 斑马网络技术有限公司 Safety detection method and device, vehicle
CN110085249A * 2019-05-09 2019-08-02 南京工程学院 Single-channel speech enhancement method based on an attention-gated recurrent neural network
CN110135634A * 2019-04-29 2019-08-16 广东电网有限责任公司电网规划研究中心 Medium- and long-term power load forecasting device
CN110192204A * 2016-11-03 2019-08-30 易享信息技术有限公司 Deep neural network model for processing data through multiple linguistic task hierarchies
CN110473529A * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 Streaming speech transcription system based on a self-attention mechanism
CN111081231A (en) * 2016-03-23 2020-04-28 谷歌有限责任公司 Adaptive audio enhancement for multi-channel speech recognition
US11003949B2 (en) 2016-11-09 2021-05-11 Microsoft Technology Licensing, Llc Neural network-based action detection

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102100977B1 * 2016-02-03 2020-04-14 구글 엘엘씨 Compressed recurrent neural network model
US9799327B1 (en) * 2016-02-26 2017-10-24 Google Inc. Speech recognition with attention-based recurrent neural networks
US10769522B2 (en) 2017-02-17 2020-09-08 Wipro Limited Method and system for determining classification of text
CN109543165B (en) * 2018-11-21 2022-09-23 中国人民解放军战略支援部队信息工程大学 Text generation method and device based on circular convolution attention model
CN110473554B (en) * 2019-08-08 2022-01-25 Oppo广东移动通信有限公司 Audio verification method and device, storage medium and electronic equipment
CN111079906B (en) * 2019-12-30 2023-05-05 燕山大学 Cement finished product specific surface area prediction method and system based on long-short-term memory network
CN111314345B (en) * 2020-02-19 2022-09-16 安徽大学 Method and device for protecting sequence data privacy, computer equipment and storage medium
CN111311009B (en) * 2020-02-24 2023-05-26 广东工业大学 Pedestrian track prediction method based on long short-term memory
CN111429938B (en) * 2020-03-06 2022-09-13 江苏大学 Single-channel voice separation method and device and electronic equipment
CN111709754B (en) * 2020-06-12 2023-08-25 中国建设银行股份有限公司 User behavior feature extraction method, device, equipment and system
CN111814849B (en) * 2020-06-22 2024-02-06 浙江大学 DA-RNN-based wind turbine generator set key component fault early warning method
CN111985610A (en) * 2020-07-15 2020-11-24 中国石油大学(北京) System and method for predicting pumping efficiency of oil pumping well based on time sequence data
CN111930602B (en) * 2020-08-13 2023-09-22 中国工商银行股份有限公司 Performance index prediction method and device
CN112001482A (en) * 2020-08-14 2020-11-27 佳都新太科技股份有限公司 Vibration prediction and model training method and device, computer equipment and storage medium
CN112214852B (en) * 2020-10-09 2022-10-14 电子科技大学 Turbine mechanical performance degradation prediction method considering degradation rate
CN112382265A * 2020-10-21 2021-02-19 西安交通大学 Active noise reduction method based on deep recurrent neural network, storage medium and system
CN112434784A (en) * 2020-10-22 2021-03-02 暨南大学 Deep student performance prediction method based on multilayer LSTM
CN112906291B (en) * 2021-01-25 2023-05-19 武汉纺织大学 Modeling method and device based on neural network
CN112784472B (en) * 2021-01-27 2023-03-24 电子科技大学 Simulation method for simulating quantum condition principal equation in quantum transport process by using cyclic neural network
CN113792772B (en) * 2021-09-01 2023-11-03 中国船舶重工集团公司第七一六研究所 Cold and hot data identification method for data hierarchical hybrid storage

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172349A1 (en) * 2007-01-12 2008-07-17 Toyota Engineering & Manufacturing North America, Inc. Neural network controller with fixed long-term and adaptive short-term memory
CN102983819A * 2012-11-08 2013-03-20 南京航空航天大学 Power amplifier simulation method and simulation device
CN103049792A (en) * 2011-11-26 2013-04-17 微软公司 Discriminative pretraining of Deep Neural Network
CN103680496A (en) * 2013-12-19 2014-03-26 百度在线网络技术(北京)有限公司 Deep-neural-network-based acoustic model training method, hosts and system
CN104217226A (en) * 2014-09-09 2014-12-17 天津大学 Dialogue act identification method based on deep neural networks and conditional random fields

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104700828B (en) * 2015-03-19 2018-01-12 清华大学 Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172349A1 (en) * 2007-01-12 2008-07-17 Toyota Engineering & Manufacturing North America, Inc. Neural network controller with fixed long-term and adaptive short-term memory
CN103049792A (en) * 2011-11-26 2013-04-17 微软公司 Discriminative pretraining of Deep Neural Network
CN102983819A * 2012-11-08 2013-03-20 南京航空航天大学 Power amplifier simulation method and simulation device
CN103680496A (en) * 2013-12-19 2014-03-26 百度在线网络技术(北京)有限公司 Deep-neural-network-based acoustic model training method, hosts and system
CN104217226A (en) * 2014-09-09 2014-12-17 天津大学 Dialogue act identification method based on deep neural networks and conditional random fields

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ALEX GRAVES et al.: "Towards end-to-end speech recognition with recurrent neural networks", Proceedings of the 31st International Conference on Machine Learning *

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016145850A1 (en) * 2015-03-19 2016-09-22 清华大学 Construction method for deep long short-term memory recurrent neural network acoustic model based on selective attention principle
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN105185374B (en) * 2015-09-11 2017-03-29 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition
CN105354277A (en) * 2015-10-30 2016-02-24 中国船舶重工集团公司第七0九研究所 Recommendation method and system based on recurrent neural network
CN105354277B (en) * 2015-10-30 2020-11-06 中国船舶重工集团公司第七0九研究所 Recommendation method and system based on recurrent neural network
CN106683663B (en) * 2015-11-06 2022-01-25 三星电子株式会社 Neural network training apparatus and method, and speech recognition apparatus and method
CN106683663A (en) * 2015-11-06 2017-05-17 三星电子株式会社 Neural network training apparatus and method, and speech recognition apparatus and method
CN108475505A * 2015-11-12 2018-08-31 谷歌有限责任公司 Generating target sequences from input sequences using partial conditioning
CN108475505B (en) * 2015-11-12 2023-03-17 谷歌有限责任公司 Generating a target sequence from an input sequence using partial conditions
CN105513591B (en) * 2015-12-21 2019-09-03 百度在线网络技术(北京)有限公司 Method and apparatus for speech recognition with an LSTM recurrent neural network model
CN105513591A (en) * 2015-12-21 2016-04-20 百度在线网络技术(北京)有限公司 Method and device for speech recognition by use of LSTM recurrent neural network model
CN108780521B (en) * 2016-02-04 2023-05-26 渊慧科技有限公司 Associative long short-term memory neural network layers
CN108780521A * 2016-02-04 2018-11-09 渊慧科技有限公司 Associative long short-term memory neural network layers
CN109155132B (en) * 2016-03-21 2023-05-30 亚马逊技术公司 Speaker verification method and system
CN109155132A (en) * 2016-03-21 2019-01-04 亚马逊技术公司 Speaker verification method and system
CN111081231A (en) * 2016-03-23 2020-04-28 谷歌有限责任公司 Adaptive audio enhancement for multi-channel speech recognition
CN111081231B (en) * 2016-03-23 2023-09-05 谷歌有限责任公司 Adaptive audio enhancement for multi-channel speech recognition
CN107293291A * 2016-03-30 2017-10-24 中国科学院声学研究所 End-to-end speech recognition method based on an adaptive learning rate
CN105956469B (en) * 2016-04-27 2019-04-26 百度在线网络技术(北京)有限公司 File security identification method and device
CN105956469A (en) * 2016-04-27 2016-09-21 百度在线网络技术(北京)有限公司 Method and device for identifying file security
CN106096729B (en) * 2016-06-06 2018-11-20 天津科技大学 Deep policy learning method for complex tasks in large-scale environments
CN106096729A * 2016-06-06 2016-11-09 天津科技大学 Deep policy learning method for complex tasks in large-scale environments
CN110192204A * 2016-11-03 2019-08-30 易享信息技术有限公司 Deep neural network model for processing data through multiple linguistic task hierarchies
US11797825B2 (en) 2016-11-03 2023-10-24 Salesforce, Inc. Training a joint many-task neural network model using successive regularization
US11783164B2 (en) 2016-11-03 2023-10-10 Salesforce.Com, Inc. Joint many-task neural network model for multiple natural language processing (NLP) tasks
CN110192204B (en) * 2016-11-03 2023-09-29 硕动力公司 Deep neural network model for processing data through multiple language task hierarchies
US11003949B2 (en) 2016-11-09 2021-05-11 Microsoft Technology Licensing, Llc Neural network-based action detection
CN106650789A * 2016-11-16 2017-05-10 同济大学 Image description generation method based on a deep LSTM network
CN106650789B (en) * 2016-11-16 2023-04-07 同济大学 Image description generation method based on a deep LSTM network
CN108269569B (en) * 2017-01-04 2023-10-27 三星电子株式会社 Speech recognition method and device
CN108269569A * 2017-01-04 2018-07-10 三星电子株式会社 Speech recognition method and apparatus
CN108304914A * 2017-01-12 2018-07-20 三星电子株式会社 System and method for high-order long short-term memory network
CN108304914B (en) * 2017-01-12 2023-12-05 三星电子株式会社 System and method for high-order long-short-term memory network
CN107293288A * 2017-06-09 2017-10-24 清华大学 Acoustic model modeling method based on a residual long short-term memory recurrent neural network
CN107492121A * 2017-07-03 2017-12-19 广州新节奏智能科技股份有限公司 Two-dimensional human skeleton point localization method for monocular depth video
CN107492121B (en) * 2017-07-03 2020-12-29 广州新节奏智能科技股份有限公司 Two-dimensional human body bone point positioning method of monocular depth video
CN107484017A * 2017-07-25 2017-12-15 天津大学 Supervised video summarization method based on an attention model
CN107484017B (en) * 2017-07-25 2020-05-26 天津大学 Supervised video abstract generation method based on attention model
CN109460812A * 2017-09-06 2019-03-12 富士通株式会社 Average-information analysis device, optimization device, and feature visualization device for a neural network
CN107563122B (en) * 2017-09-20 2020-05-19 长沙学院 Crime prediction method based on an interleaved-time-series locally connected recurrent neural network
CN107563122A * 2017-09-20 2018-01-09 长沙学院 Crime prediction method based on an interleaved-time-series locally connected recurrent neural network
CN107993636B (en) * 2017-11-01 2021-12-31 天津大学 Recursive neural network-based music score modeling and generating method
CN107993636A (en) * 2017-11-01 2018-05-04 天津大学 Music score modeling and generation method based on recurrent neural network
CN109243493A * 2018-10-30 2019-01-18 南京工程学院 Infant cry emotion recognition method based on an improved long short-term memory network
CN109243493B (en) * 2018-10-30 2022-09-16 南京工程学院 Infant crying emotion recognition method based on improved long-time and short-time memory network
CN109243494B (en) * 2018-10-30 2022-10-11 南京工程学院 Children emotion recognition method based on multi-attention mechanism long-time memory network
CN109243494A * 2018-10-30 2019-01-18 南京工程学院 Child emotion recognition method based on a multi-attention long short-term memory network
CN109614485A * 2018-11-19 2019-04-12 中山大学 Sentence matching method and device using hierarchical attention based on syntactic structure
CN109614485B (en) * 2018-11-19 2023-03-14 中山大学 Sentence matching method and device of hierarchical Attention based on grammar structure
CN109523995B (en) * 2018-12-26 2019-07-09 出门问问信息科技有限公司 Speech recognition method, speech recognition device, readable storage medium, and electronic device
CN109523995A * 2018-12-26 2019-03-26 出门问问信息科技有限公司 Speech recognition method, speech recognition device, readable storage medium, and electronic device
CN109866713A (en) * 2019-03-21 2019-06-11 斑马网络技术有限公司 Safety detection method and device, vehicle
CN110135634A * 2019-04-29 2019-08-16 广东电网有限责任公司电网规划研究中心 Medium- and long-term power load forecasting device
CN110085249A * 2019-05-09 2019-08-02 南京工程学院 Single-channel speech enhancement method based on an attention-gated recurrent neural network
CN110473529B (en) * 2019-09-09 2021-11-05 北京中科智极科技有限公司 Streaming speech transcription system based on a self-attention mechanism
CN110473529A * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 Streaming speech transcription system based on a self-attention mechanism

Also Published As

Publication number Publication date
CN104700828B (en) 2018-01-12
WO2016145850A1 (en) 2016-09-22

Similar Documents

Publication Publication Date Title
CN104700828A 2015-06-10 Construction method of a deep long short-term memory recurrent neural network acoustic model based on the selective attention principle
CN104538028B 2017-07-28 Continuous speech recognition method based on a deep long short-term memory recurrent neural network
Nakkiran et al. Compressing deep neural networks using a rank-constrained topology
Gelly et al. Optimization of RNN-based speech activity detection
Prabhavalkar et al. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition
Sainath et al. Auto-encoder bottleneck features using deep belief networks
CN107293288B (en) Acoustic model modeling method of residual long-short term memory recurrent neural network
US10783900B2 (en) Convolutional, long short-term memory, fully connected deep neural networks
CN106919977B (en) Feedforward sequence memory neural network and construction method and system thereof
CN105139864B (en) Audio recognition method and device
Huang et al. Sndcnn: Self-normalizing deep cnns with scaled exponential linear units for speech recognition
CN107301864A (en) A kind of two-way LSTM acoustic models of depth based on Maxout neurons
Guiming et al. Speech recognition based on convolutional neural networks
CN108847244A (en) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN110853668B (en) Voice tampering detection method based on multi-feature fusion
Guo et al. Time-delayed bottleneck highway networks using a DFT feature for keyword spotting
CN104464727A (en) Single-channel music singing separation method based on deep belief network
CN110223714A (en) A kind of voice-based Emotion identification method
CN109036467A (en) CFFD extracting method, speech-emotion recognition method and system based on TF-LSTM
CN109410974A (en) Sound enhancement method, device, equipment and storage medium
Liu et al. Pruning deep neural networks by optimal brain damage.
Zhang et al. High order recurrent neural networks for acoustic modelling
Huang et al. Beyond cross-entropy: Towards better frame-level objective functions for deep neural network training in automatic speech recognition
Li et al. Improving long short-term memory networks using maxout units for large vocabulary speech recognition
Cai et al. Convolutional maxout neural networks for low-resource speech recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant