WO2016101688A1 - Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network

Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network

Info

Publication number
WO2016101688A1
WO2016101688A1 (PCT/CN2015/092380)
Authority
WO
WIPO (PCT)
Prior art keywords
output, neural network, short-term memory
Prior art date
Application number
PCT/CN2015/092380
Other languages
French (fr)
Chinese (zh)
Inventor
杨毅 (Yang Yi)
孙甲松 (Sun Jiasong)
Original Assignee
Tsinghua University (清华大学)
Priority date
Filing date
Publication date
Application filed by Tsinghua University (清华大学)
Publication of WO2016101688A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks

Abstract

A continuous speech recognition method based on a deep long short-term memory (LSTM) recurrent neural network. The method uses a noisy speech signal (302) and the original clean speech signal (301) as training samples and constructs two structurally identical deep LSTM recurrent neural network modules (303, 305). A cross-entropy is computed between each pair of corresponding deep LSTM layers (102) of the two modules (303, 305) to measure the difference between them, and the cross-entropy parameters are updated through a linear recurrent projection layer (108), finally yielding a deep LSTM recurrent neural network acoustic model that is robust to environmental noise. The constructed acoustic model improves the recognition rate for continuous noisy speech and avoids the problem that, because of the large parameter scale of deep neural networks (DNN), most computation must be done on GPU devices. With low computational complexity and fast convergence, the method is broadly applicable to machine-learning fields related to speech recognition, such as speaker recognition, keyword recognition, and human-machine interaction.

Description

Continuous speech recognition method based on a deep long short-term memory recurrent neural network

Technical field
The invention belongs to the field of audio technology and in particular relates to a continuous speech recognition method based on a deep long short-term memory (LSTM) recurrent neural network.
Background art
With the rapid development of information technology, speech recognition technology is ready for large-scale commercialization. Current speech recognition mainly uses continuous speech recognition based on statistical models, whose main goal is to find the word sequence with the highest probability given a speech sequence. A continuous speech recognition system usually comprises an acoustic model, a language model, and a decoding method; acoustic modeling, the core technology of continuous speech recognition, has developed rapidly in recent years. The commonly used acoustic model is the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM). Its principle is as follows: the Gaussian mixture model is trained to obtain the probability that each frame of features belongs to each phoneme state, and the hidden Markov model is trained to obtain the transition probabilities between phoneme states (including self-transitions), from which the probability that each phoneme-state sequence produces the current sequence of speech feature vectors is obtained. To account for coarticulation, phonemes are further divided into context-dependent modeling units, which is known as the CD-GMM-HMM method.
In 2011, Microsoft proposed replacing the Gaussian mixture model in the traditional acoustic model with a deep neural network (DNN), forming the new CD-DNN-HMM model, which combines the representational power of the DNN with the sequential modeling capability of the CD-HMM. Its core idea is to apply multi-layer transformations to the acoustic features and to optimize feature extraction and acoustic modeling within the same network. Compared with the traditional GMM-HMM framework, the DNN-HMM model reduced the error rate on an English continuous speech recognition corpus by about 30%. However, each layer of a DNN has on the order of millions of parameters, and the input of each layer is the output of the previous one, so the computational cost is generally high, and performance degrades when speaking rates vary or long temporal sequences must be processed.
A recurrent neural network (RNN) is a neural network with directed cycles between units that express the network's internal dynamic temporal behavior; it is widely used in handwriting recognition and language modeling. Speech signals are complex time-varying signals with complex correlations at different time scales, so the recurrent connections of an RNN are better suited to such complex sequential data than a deep neural network. As a kind of recurrent neural network, the long short-term memory (LSTM) model is better suited than a plain RNN to processing and predicting long sequences whose events are lagged by uncertain intervals. The deep LSTM-RNN acoustic model proposed by the University of Toronto, which adds memory blocks, combines the multi-level representational power of deep neural networks with the RNN's ability to flexibly exploit long-span context, reducing the phoneme recognition error rate on the TIMIT corpus to 17.1%.
However, the gradient descent method used in recurrent neural networks suffers from the vanishing gradient problem: as the number of network layers increases, the gradients dissipate layer by layer while the network weights are being adjusted, so their effect on the weight updates becomes smaller and smaller. Google's proposed two-layer deep LSTM-RNN acoustic model adds a linear recurrent projection layer to the earlier deep LSTM-RNN model to address this problem. Comparative experiments show that the frame accuracy and convergence speed of a plain RNN are clearly inferior to those of the LSTM-RNN and the DNN. In terms of word error rate and convergence speed, the best DNN reached a word error rate of 11.3% after several weeks of training, while the two-layer deep LSTM-RNN model reduced the word error rate to 10.9% after 48 hours of training and to 10.7%/10.5% after 100/200 hours of training.
However, the complexity of real acoustic environments still seriously degrades the performance of continuous speech recognition systems: even with the best current deep neural network methods, only about a 70% recognition rate can be obtained on continuous speech recognition data sets recorded under complex conditions including noise, music, spontaneous speech, and repetition, so the noise immunity and robustness of the acoustic model in continuous speech recognition systems still need improvement. In addition, deep neural network methods have large parameter scales, and most of the computation must be done on GPU devices, which ordinary CPUs cannot match, so such methods are still some distance from the requirements of large-scale commercialization.
Summary of the invention
To overcome the above shortcomings of the prior art, the object of the present invention is to provide a continuous speech recognition method based on a deep long short-term memory recurrent neural network that improves the speech recognition rate for noisy continuous speech signals and features low computational complexity and fast convergence, making it suitable for implementation on an ordinary CPU.
To achieve the above object, the technical solution adopted by the present invention is:

A continuous speech recognition method based on a deep long short-term memory recurrent neural network, comprising:

Step 1: establishing two structurally identical deep LSTM recurrent neural network modules, each comprising multiple long short-term memory layers and linear recurrent projection layers;

Step 2: feeding the original clean speech signal and the noisy signal, respectively, as inputs to the two modules of step 1;

Step 3: computing the cross-entropy over all parameters of the corresponding long short-term memory layers of the two modules to measure the difference in information distribution between them, and updating the cross-entropy parameters through a second linear recurrent projection layer;

Step 4: achieving continuous speech recognition by comparing the final update result with the final output of the deep LSTM recurrent neural network module that takes the original clean speech signal as input.
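The four steps above can be sketched numerically. The sketch below is an illustration only: the toy dimensions, the replacement of each LSTM-plus-projection pair by a simple tanh affine map (the real layers are defined later in the text), and the discretization of the cross-entropy integral as a sum with the inputs shifted positive so the logarithms are defined are all assumptions, not part of the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_forward(x, params):
    # Stand-in for one "LSTM layer + linear recurrent projection layer" pair;
    # here it is just a fixed affine map with tanh, enough to show the wiring.
    W, b = params
    return np.tanh(x @ W + b)

def cross_entropy(x1, x2, eps=1e-8):
    # d(x1, x2) = integral(x1 ln x2) - integral(x2 ln x1), discretized as a sum;
    # inputs are shifted to positive values so the logarithms are defined.
    p1 = np.abs(x1) + eps
    p2 = np.abs(x2) + eps
    return float(np.sum(p1 * np.log(p2)) - np.sum(p2 * np.log(p1)))

T, D, n_layers = 20, 8, 3
clean = rng.standard_normal((T, D))                  # original clean speech features
noisy = clean + 0.1 * rng.standard_normal((T, D))    # noise-corrupted copy

# Step 1: two structurally identical stacks share the same layer parameters.
params = [(rng.standard_normal((D, D)) / np.sqrt(D), np.zeros(D))
          for _ in range(n_layers)]

# Steps 2-3: feed both signals through the stacks and measure, layer by layer,
# the divergence between the corresponding layer outputs.
h_clean, h_noisy, divergences = clean, noisy, []
for p in params:
    h_clean = layer_forward(h_clean, p)
    h_noisy = layer_forward(h_noisy, p)
    divergences.append(cross_entropy(h_clean, h_noisy))

# Step 4 compares the update result against the clean branch's final output;
# here we simply report the per-layer divergences.
print(divergences)
```

Note that d(x, x) = 0 by the symmetry of the formula, so identical clean and noisy branches would produce zero divergence at every layer.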
In the deep LSTM recurrent neural network module, the speech signal x = [x_1, ..., x_T] is the input of the whole module and also the input of the first long short-term memory layer. The output of the first long short-term memory layer is the input of the first linear recurrent projection layer; the output of the first linear recurrent projection layer is the input of the next linear recurrent projection layer, whose output in turn is the input of the one after that, and so on. In the module that takes the original clean speech signal as input, the output of the last linear recurrent projection layer is the output y = [y_1, ..., y_T] of the whole deep LSTM recurrent neural network module, where T is the time length of the speech signal; in the module that takes the noisy signal as input, the output of the last linear recurrent projection layer is discarded.
The long short-term memory layer consists of memory cells, an input gate, an output gate, a forget gate, tanh functions, and multipliers; the long short-term memory layer is the LSTM neural network sub-module. At time t ∈ [1, T], the parameters of the LSTM sub-module are computed as follows:

G_input = sigmoid(W_ix · x + W_ic · Cell' + b_i)

G_forget = sigmoid(W_fx · x + W_fc · Cell' + b_f)

Cell = m' + G_forget ⊙ Cell' + G_input ⊙ tanh(W_cx · x) ⊙ m' + b_c

G_output = sigmoid(W_ox · x + W_oc · Cell' + b_o)

m = tanh(G_output ⊙ Cell ⊙ m')

y = softmax_k(W_ym · m + b_y)

where G_input is the output of the input gate, G_forget the output of the forget gate, Cell the output of the memory cell, Cell' the output of the memory cell at time t-1, G_output the output of the output gate, G'_output the output of the output gate at time t-1, m the output of the linear recurrent projection layer, and m' the output of the linear recurrent projection layer at time t-1; x is the input of the whole deep LSTM recurrent neural network module and y the output of one LSTM sub-module; b_i is the bias of input gate i, b_f the bias of forget gate f, b_c the bias of memory cell c, b_o the bias of output gate o, and b_y the bias of output y, different b denoting different biases; W_ix is the weight between input gate i and input x, W_ic the weight between input gate i and memory cell c, W_fx the weight between forget gate f and input x, W_fc the weight between forget gate f and memory cell c, W_oc the weight between output gate o and memory cell c, and W_ym the weight between output y and output m, with

softmax_k(x) = exp(x_k) / Σ_{l=1}^{K} exp(x_l)

where x_k denotes the input of the k-th (k ∈ [1, K]) softmax function and l ∈ [1, K] indexes the sum over all exp(x_l); ⊙ denotes element-wise multiplication of matrices.
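The sub-module equations can be transcribed into a short numerical sketch. This is an illustration only: the matrix shapes, the random parameter values, and the choice of a nonzero initial m' (so the multiplicative m' terms do not vanish) are assumptions not specified in the text, and the step function follows the patent's formulas literally rather than the standard LSTM update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def lstm_step(x, cell_prev, m_prev, p):
    # One time step, transcribed literally from the formulas above (note they
    # differ from the standard LSTM: m' enters the Cell update and the
    # projection output multiplicatively).
    g_in = sigmoid(p["Wix"] @ x + p["Wic"] @ cell_prev + p["bi"])
    g_fg = sigmoid(p["Wfx"] @ x + p["Wfc"] @ cell_prev + p["bf"])
    cell = m_prev + g_fg * cell_prev + g_in * np.tanh(p["Wcx"] @ x) * m_prev + p["bc"]
    g_out = sigmoid(p["Wox"] @ x + p["Woc"] @ cell_prev + p["bo"])
    m = np.tanh(g_out * cell * m_prev)
    y = softmax(p["Wym"] @ m + p["by"])
    return cell, m, y

rng = np.random.default_rng(1)
D, H, K = 4, 5, 3            # input, hidden/cell, and output (softmax) sizes
p = {"Wix": rng.standard_normal((H, D)), "Wic": rng.standard_normal((H, H)),
     "Wfx": rng.standard_normal((H, D)), "Wfc": rng.standard_normal((H, H)),
     "Wcx": rng.standard_normal((H, D)),
     "Wox": rng.standard_normal((H, D)), "Woc": rng.standard_normal((H, H)),
     "Wym": rng.standard_normal((K, H)),
     "bi": np.zeros(H), "bf": np.zeros(H), "bc": np.zeros(H),
     "bo": np.zeros(H), "by": np.zeros(K)}

cell, m = np.zeros(H), np.full(H, 0.1)   # Cell' and m' at t = 0
for x in rng.standard_normal((6, D)):    # run T = 6 frames through the step
    cell, m, y = lstm_step(x, cell, m, p)

print(y)                                 # a K-dim probability vector
```

Because y is produced by the softmax, each step's output sums to one, matching the formula y = softmax_k(W_ym · m + b_y).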
From the two deep LSTM recurrent neural network modules, the outputs of a pair of LSTM sub-modules at the same level are taken as the two inputs of an update sub-module. An update sub-module consists of a cross-entropy unit and a second linear recurrent projection layer; multiple update sub-modules connected in series form the update module, the output of one update sub-module serving as the input of the next, and the output of the last sub-module being the output of the whole update module.
The cross-entropy in the update sub-module is computed as:

d(x_1, x_2) = ∫ x_1 ln x_2 dt − ∫ x_2 ln x_1 dt

where d is the cross-entropy and x_1 and x_2 are the two inputs of this update sub-module, i.e., the outputs of the LSTM sub-modules of the modules that take the original clean speech signal and the noisy signal as input;

the output of the second linear recurrent projection layer is computed as:

y' = softmax_k(W_y' · d + b_y')

where y' is the output vector of the whole update module, W_y' is the weight from the cross-entropy output to the linear recurrent projection layer output, d is the cross-entropy, and b_y' is the bias.
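The two update formulas can be exercised together in a small sketch. Treating d as a scalar and W_y', b_y' as a K-dimensional weight and bias is an assumption made for illustration, as are the Riemann-sum discretization of the integral and the eps shift that keeps the logarithms defined.

```python
import numpy as np

def cross_entropy(x1, x2, dt=1.0, eps=1e-8):
    # d(x1, x2) = integral(x1 ln x2) - integral(x2 ln x1), with the integral
    # taken as a Riemann sum over the frames.
    p1 = np.abs(x1) + eps
    p2 = np.abs(x2) + eps
    return dt * float(np.sum(p1 * np.log(p2)) - np.sum(p2 * np.log(p1)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
T, K = 50, 4
x_clean = np.abs(rng.standard_normal(T))                   # clean-branch layer output
x_noisy = x_clean + 0.05 * np.abs(rng.standard_normal(T))  # noisy-branch layer output

d = cross_entropy(x_clean, x_noisy)    # scalar divergence between the branches

# y' = softmax_k(W_y' * d + b_y'): the divergence is projected to a K-dim output.
W_yp = rng.standard_normal(K)
b_yp = rng.standard_normal(K)
y_prime = softmax(W_yp * d + b_yp)

print(d, y_prime)
```

By the antisymmetry of the formula, d(x, x) = 0, so two identical layer outputs contribute no update; y' is always a probability vector because of the softmax.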
Existing deep neural network acoustic models perform well in quiet environments but fail when heavy environmental noise sharply reduces the signal-to-noise ratio. Compared with a deep neural network acoustic model, the recurrent neural network acoustic model of the present invention has directed cycles between its units, which effectively describe the dynamic temporal behavior inside the network and are better suited to speech data with complex timing. The long short-term memory network, in turn, is better suited than a plain recurrent network to processing and predicting long sequences with lagged events of uncertain timing, so an acoustic model built on it for speech recognition achieves better results. Furthermore, the deep LSTM recurrent neural network acoustic model structure reduces the influence of noise features on the network parameters, improving the noise immunity and robustness of the speech recognition system under environmental noise interference.
Brief description of the drawings
Figure 1 is a flow chart of the deep long short-term memory neural network model of the present invention.
Figure 2 is a flow chart of the deep LSTM recurrent neural network update module of the present invention.
Figure 3 is a flow chart of the robust deep LSTM neural network acoustic model of the present invention.
Detailed description
Embodiments of the present invention are described in detail below with reference to the drawings and examples.
The present invention proposes a method and apparatus for a robust deep LSTM neural network acoustic model, in particular for continuous speech recognition scenarios. These methods and apparatus are not limited to continuous speech recognition and may be any method and apparatus related to speech recognition.
Step 1: establish two structurally identical deep LSTM recurrent neural network modules, each comprising multiple long short-term memory layers and linear recurrent projection layers, and feed the original clean speech signal and the noisy signal, respectively, as inputs to the two modules.
Figure 1 is a flow chart of the deep LSTM recurrent neural network module of the present invention, comprising the following:
The input 101 is the speech signal x = [x_1, ..., x_T] (T is the time length of the speech signal). Inside the box is the long short-term memory layer 102, i.e., the LSTM neural network sub-module, which consists of memory cells 103, an input gate 104, an output gate 105, a forget gate 106, a tanh function 107, and multipliers. The output of the LSTM sub-module is the input of the linear recurrent projection layer 108, whose output is y = [y_1, ..., y_T], i.e., the output 109 of the LSTM recurrent neural network sub-module; 109 serves as the input of the next LSTM sub-module, and the cycle repeats several times.
At time t ∈ [1, T], the parameters of the LSTM neural network sub-module are computed as follows:

G_input = sigmoid(W_ix · x + W_ic · Cell' + b_i)

G_forget = sigmoid(W_fx · x + W_fc · Cell' + b_f)

Cell = m' + G_forget ⊙ Cell' + G_input ⊙ tanh(W_cx · x) ⊙ m' + b_c

G_output = sigmoid(W_ox · x + W_oc · Cell' + b_o)

m = tanh(G_output ⊙ Cell ⊙ m')

y = softmax_k(W_ym · m + b_y)

where G_input is the output of the input gate, G_forget the output of the forget gate, Cell the output of the memory cell, Cell' the output of the memory cell at time t-1, G_output the output of the output gate, G'_output the output of the output gate at time t-1, m the output of the linear recurrent projection layer, and m' the output of the linear recurrent projection layer at time t-1; x is the input of the whole deep LSTM recurrent neural network module and y the output of one LSTM sub-module; b_i is the bias of input gate i, b_f the bias of forget gate f, b_c the bias of memory cell c, b_o the bias of output gate o, and b_y the bias of output y, different b denoting different biases; W_ix is the weight between input gate i and input x, W_ic the weight between input gate i and memory cell c, W_fx the weight between forget gate f and input x, W_fc the weight between forget gate f and memory cell c, W_oc the weight between output gate o and memory cell c, and W_ym the weight between output y and output m, with

softmax_k(x) = exp(x_k) / Σ_{l=1}^{K} exp(x_l)

where x_k denotes the input of the k-th (k ∈ [1, K]) softmax function and l ∈ [1, K] indexes the sum over all exp(x_l); ⊙ denotes element-wise multiplication of matrices.
Step 2: compute the cross-entropy over all parameters of the corresponding long short-term memory layers of the two modules to measure the difference in information distribution between them, and update the cross-entropy parameters through the second linear recurrent projection layer.
Figure 2 is a flow chart of the deep LSTM recurrent neural network update module of the present invention, comprising the following: the original clean speech signal and the noisy signal (i.e., the original clean speech signal corrupted by environmental noise) are fed separately as inputs to the deep LSTM recurrent neural network module of Figure 1, yielding the outputs of two LSTM sub-modules (the box of Figure 1), which serve as the inputs 201 of this update module. Inside the dashed box is the update sub-module 202, which consists of the cross-entropy 203 and the second linear recurrent projection layer 204. The output of the update sub-module 202 serves as the input of the next update sub-module, and the cycle repeats several times; the output of the last update sub-module is the output 205 of the whole update module.
The cross-entropy 203 in the update sub-module 202 is computed as:

d(x_1, x_2) = ∫ x_1 ln x_2 dt − ∫ x_2 ln x_1 dt

where d is the cross-entropy and x_1 and x_2 are the two inputs of the update module, i.e., the outputs of the two deep LSTM recurrent neural networks obtained by feeding in the original clean speech signal and the noisy signal, respectively.

The output of the linear recurrent projection layer 204 is computed as:

y' = softmax_k(W_y' · d + b_y')

where y' is the output 205 of the whole module, W_y' is the weight from the cross-entropy 203 output to the linear recurrent projection layer 204, d is the cross-entropy, and b_y' is the bias, with

softmax_k(x) = exp(x_k) / Σ_{l=1}^{K} exp(x_l)

where x_k denotes the input of the k-th (k ∈ [1, K]) softmax function and l ∈ [1, K] indexes the sum over all exp(x_l).
Step 3: achieve continuous speech recognition by comparing the final update result with the final output of the deep LSTM recurrent neural network module that takes the original clean speech signal as input.
Figure 3 is a flow chart of the robust deep LSTM neural network acoustic model of the present invention, comprising the following:
From left to right: the deep LSTM recurrent neural network module 303 with the original clean speech signal 301 as input; the deep LSTM recurrent neural network update module 304; and the deep LSTM recurrent neural network module 305 with the noisy signal (i.e., the original clean speech signal corrupted by environmental noise) 302 as input. The parameters are computed as in steps 1 and 2; the final outputs are the output 306 of the module that takes the original clean speech signal as input and the output 307 of the deep LSTM recurrent neural network update module.

Claims (5)

  1. A continuous speech recognition method based on a deep long short-term memory recurrent neural network, characterized by comprising:
    Step 1: establishing two structurally identical deep LSTM recurrent neural network modules, each comprising multiple long short-term memory layers and linear recurrent projection layers;
    Step 2: feeding the original clean speech signal and the noisy signal, respectively, as inputs to the two modules of step 1;
    Step 3: computing the cross-entropy over all parameters of the corresponding long short-term memory layers of the two modules to measure the difference in information distribution between them, and updating the cross-entropy parameters through a second linear recurrent projection layer;
    Step 4: achieving continuous speech recognition by comparing the final update result with the final output of the deep LSTM recurrent neural network module that takes the original clean speech signal as input.
  2. The continuous speech recognition method based on a deep long short-term memory recurrent neural network according to claim 1, characterized in that, in the deep LSTM recurrent neural network module, the speech signal x = [x_1, ..., x_T] is the input of the whole module and also the input of the first long short-term memory layer; the output of the first long short-term memory layer is the input of the first linear recurrent projection layer; the output of the first linear recurrent projection layer is the input of the next linear recurrent projection layer, whose output in turn is the input of the one after that, and so on; wherein, in the module that takes the original clean speech signal as input, the output of the last linear recurrent projection layer is the output y = [y_1, ..., y_T] of the whole deep LSTM recurrent neural network module, T being the time length of the speech signal, while in the module that takes the noisy signal as input, the output of the last linear recurrent projection layer is discarded.
  3. 根据权利要求1或2所述基于深度长短期记忆循环神经网络的连续语音识别方法,其特征在于,所述长短期记忆层由记忆细胞、输入门、输出门、遗忘门、tanh函数以及乘法器组成,其中长短期记忆层即长短期记忆神经网 络子模块,在t∈[1,T]时刻长短期记忆神经网络子模块中的参数按照如下公式计算:The continuous speech recognition method based on the deep long-term and short-term memory cycle neural network according to claim 1 or 2, wherein the long-term and short-term memory layer is composed of a memory cell, an input gate, an output gate, an forgetting gate, a tanh function, and a multiplier Long-term and short-term memory neural network The complex module, the parameters in the short-term memory neural network sub-module at t∈[1,T] time is calculated according to the following formula:
    G_input = sigmoid(W_ix x + W_ic Cell′ + b_i)
    G_forget = sigmoid(W_fx x + W_fc Cell′ + b_f)
    Cell = m′ + G_forget ⊙ Cell′ + G_input ⊙ tanh(W_cx x) ⊙ m′ + b_c
    G_output = sigmoid(W_ox x + W_oc Cell′ + b_o)
    m = tanh(G_output ⊙ Cell ⊙ m′)
    y = softmax_k(W_ym m + b_y)
    where G_input is the output of the input gate, G_forget the output of the forget gate, Cell the output of the memory cell, Cell′ the output of the memory cell at time t−1, G_output the output of the output gate, G′_output the output of the output gate at time t−1, m the output of the linear recurrent projection layer, and m′ the output of the linear recurrent projection layer at time t−1; x is the input of the entire long short-term memory recurrent neural network module and y is the output of one long short-term memory neural network sub-module; b_i is the bias of the input gate i, b_f the bias of the forget gate f, b_c the bias of the memory cell c, b_o the bias of the output gate o, and b_y the bias of the output y, different b denoting different biases; W_ix is the weight between the input gate i and the input x, W_ic the weight between the input gate i and the memory cell c, W_fx the weight between the forget gate f and the input x, W_fc the weight between the forget gate f and the memory cell c, W_cx the weight between the memory cell c and the input x, W_ox the weight between the output gate o and the input x, W_oc the weight between the output gate o and the memory cell c, and W_ym the weight between the output y and the output m; and
    softmax_k(x) = exp(x_k) / Σ_{l=1}^{K} exp(x_l)
    where x_k denotes the input of the k-th (k∈[1,K]) softmax function and l∈[1,K] indexes the summation over all exp(x_l); ⊙ denotes element-wise multiplication of matrices.
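The claim-3 formulas can be transcribed directly into a single time step. Note this variant differs from the standard LSTM cell: the previous projection output m′ enters the cell update multiplicatively, and the output activation multiplies G_output, Cell, and m′ together. The dictionary-based parameter naming (`"W_ix"` etc.) and the dimensions in the usage below are assumptions of this sketch, not part of the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def lstm_submodule_step(x, cell_prev, m_prev, p):
    """One t-step of the long short-term memory sub-module, following
    the claim-3 formulas; p maps names like "W_ix" to weight arrays
    and "b_i" to bias vectors (naming is an assumption)."""
    g_in  = sigmoid(p["W_ix"] @ x + p["W_ic"] @ cell_prev + p["b_i"])
    g_fg  = sigmoid(p["W_fx"] @ x + p["W_fc"] @ cell_prev + p["b_f"])
    cell  = (m_prev + g_fg * cell_prev
             + g_in * np.tanh(p["W_cx"] @ x) * m_prev + p["b_c"])
    g_out = sigmoid(p["W_ox"] @ x + p["W_oc"] @ cell_prev + p["b_o"])
    m     = np.tanh(g_out * cell * m_prev)
    y     = softmax(p["W_ym"] @ m + p["b_y"])
    return cell, m, y
```

With a hidden size H, input size D, and K softmax classes, the gate weights are (H, D) or (H, H) matrices and W_ym is (K, H); iterating the step over t∈[1,T] produces the sub-module's output sequence.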
  4. The continuous speech recognition method based on a deep long short-term memory recurrent neural network according to claim 3, wherein, from the two deep long short-term memory recurrent neural network modules, the outputs of a pair of long short-term memory neural network sub-modules located at the same level are taken as the two inputs of an update sub-module; an update sub-module consists of a cross-entropy unit and linear recurrent projection layer two; multiple update sub-modules connected in series form the update module, the output of one update sub-module serving as the input of the next update sub-module, and the output of the last sub-module being the output of the entire update module.
  5. The continuous speech recognition method based on a deep long short-term memory recurrent neural network according to claim 4, wherein the cross entropy in the update sub-module is calculated according to the following formula:
    d(x_1, x_2) = ∫ x_1 ln x_2 dt − ∫ x_2 ln x_1 dt
    where d is the cross entropy, and x_1 and x_2 are the two inputs of this update sub-module, i.e. the outputs of the long short-term memory neural network sub-modules in the modules whose inputs are the original clean speech signal and the noisy signal respectively;
    the output of linear recurrent projection layer two is calculated as follows:
    y′ = softmax_k(W_y′ d + b_y′)
    where d is the cross entropy, y′ is the output vector of the entire update module, W_y′ is the weight from the cross-entropy output to the output of linear recurrent projection layer two, and b_y′ is the bias.
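A sketch of the claim-5 computation under two stated assumptions: the integrals are discretized as sums over frames with step dt, and the sub-module inputs are positive-valued sequences (required for the logarithms). The function names and the scalar-d-times-weight-vector reading of W_y′ d are this sketch's assumptions.

```python
import numpy as np

def symmetric_cross_entropy(x1, x2, dt=1.0):
    """Discretization of d(x1,x2) = ∫x1 ln x2 dt − ∫x2 ln x1 dt.

    x1, x2: positive-valued sequences sampled at step dt (assumption).
    The expression is antisymmetric: d(x1,x2) = −d(x2,x1), and
    d(x,x) = 0, so identical clean/noisy sub-module outputs yield
    zero difference."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return dt * (np.sum(x1 * np.log(x2)) - np.sum(x2 * np.log(x1)))

def update_submodule_output(d, w_y, b_y):
    """Linear recurrent projection layer two: y' = softmax_k(W_y' d + b_y'),
    reading d as a scalar scaled into the weight vector w_y."""
    z = w_y * d + b_y
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()
```

Chaining these per level, as claim 4 prescribes, propagates the clean/noisy divergence through the series of update sub-modules.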
PCT/CN2015/092380 2014-12-25 2015-10-21 Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network WO2016101688A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410821646.6 2014-12-25
CN201410821646.6A CN104538028B (en) 2014-12-25 2014-12-25 A kind of continuous speech recognition method that Recognition with Recurrent Neural Network is remembered based on depth shot and long term

Publications (1)

Publication Number Publication Date
WO2016101688A1 true WO2016101688A1 (en) 2016-06-30

Family

ID=52853544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/092380 WO2016101688A1 (en) 2014-12-25 2015-10-21 Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network

Country Status (2)

Country Link
CN (1) CN104538028B (en)
WO (1) WO2016101688A1 (en)


Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538028B (en) * 2014-12-25 2017-10-17 清华大学 A kind of continuous speech recognition method that Recognition with Recurrent Neural Network is remembered based on depth shot and long term
CN104952448A (en) * 2015-05-04 2015-09-30 张爱英 Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks
US10909329B2 (en) * 2015-05-21 2021-02-02 Baidu Usa Llc Multilingual image question answering
CN106611599A (en) * 2015-10-21 2017-05-03 展讯通信(上海)有限公司 Voice recognition method and device based on artificial neural network and electronic equipment
KR102494139B1 (en) * 2015-11-06 2023-01-31 삼성전자주식회사 Apparatus and method for training neural network, apparatus and method for speech recognition
CN105389980B (en) * 2015-11-09 2018-01-19 上海交通大学 Short-time Traffic Flow Forecasting Methods based on long short-term memory recurrent neural network
CN105469065B (en) * 2015-12-07 2019-04-23 中国科学院自动化研究所 A kind of discrete emotion identification method based on recurrent neural network
CN105513591B (en) * 2015-12-21 2019-09-03 百度在线网络技术(北京)有限公司 The method and apparatus for carrying out speech recognition with LSTM Recognition with Recurrent Neural Network model
EP3398118B1 (en) * 2016-02-04 2023-07-12 Deepmind Technologies Limited Associative long short-term memory neural network layers
US10235994B2 (en) 2016-03-04 2019-03-19 Microsoft Technology Licensing, Llc Modular deep learning model
CN105559777B (en) * 2016-03-17 2018-10-12 北京工业大学 Electroencephalogramrecognition recognition method based on wavelet packet and LSTM type RNN neural networks
KR102151682B1 (en) * 2016-03-23 2020-09-04 구글 엘엘씨 Adaptive audio enhancement for multi-channel speech recognition
CN111784348A (en) * 2016-04-26 2020-10-16 阿里巴巴集团控股有限公司 Account risk identification method and device
EP3451239A4 (en) 2016-04-29 2020-01-01 Cambricon Technologies Corporation Limited Apparatus and method for executing recurrent neural network and lstm computations
CN106096729B (en) * 2016-06-06 2018-11-20 天津科技大学 A kind of depth-size strategy learning method towards complex task in extensive environment
CN106126492B (en) * 2016-06-07 2019-02-05 北京高地信息技术有限公司 Sentence recognition methods and device based on two-way LSTM neural network
US11449744B2 (en) 2016-06-23 2022-09-20 Microsoft Technology Licensing, Llc End-to-end memory networks for contextual language understanding
CN107808664B (en) * 2016-08-30 2021-07-30 富士通株式会社 Sparse neural network-based voice recognition method, voice recognition device and electronic equipment
US10366163B2 (en) 2016-09-07 2019-07-30 Microsoft Technology Licensing, Llc Knowledge-guided structural attention processing
CN106383888A (en) * 2016-09-22 2017-02-08 深圳市唯特视科技有限公司 Method for positioning and navigation by use of picture retrieval
CN108461080A (en) * 2017-02-21 2018-08-28 中兴通讯股份有限公司 A kind of Acoustic Modeling method and apparatus based on HLSTM models
CN116702843A (en) 2017-05-20 2023-09-05 谷歌有限责任公司 Projection neural network
CN107293288B (en) * 2017-06-09 2020-04-21 清华大学 Acoustic model modeling method of residual long-short term memory recurrent neural network
CN107633842B (en) 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107301864B (en) * 2017-08-16 2020-12-22 重庆邮电大学 Deep bidirectional LSTM acoustic model based on Maxout neuron
CN107657313B (en) * 2017-09-26 2021-05-18 上海数眼科技发展有限公司 System and method for transfer learning of natural language processing task based on field adaptation
CN107993636B (en) * 2017-11-01 2021-12-31 天津大学 Recursive neural network-based music score modeling and generating method
CN108364634A (en) * 2018-03-05 2018-08-03 苏州声通信息科技有限公司 Spoken language pronunciation evaluating method based on deep neural network posterior probability algorithm
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
US10885277B2 (en) 2018-08-02 2021-01-05 Google Llc On-device neural networks for natural language understanding
CN109243494B (en) * 2018-10-30 2022-10-11 南京工程学院 Children emotion recognition method based on multi-attention mechanism long-time memory network
CN110517679B (en) * 2018-11-15 2022-03-08 腾讯科技(深圳)有限公司 Artificial intelligence audio data processing method and device and storage medium
CN111368996B (en) 2019-02-14 2024-03-12 谷歌有限责任公司 Retraining projection network capable of transmitting natural language representation
CN110570845B (en) * 2019-08-15 2021-10-22 武汉理工大学 Voice recognition method based on domain invariant features
CN111429938B (en) * 2020-03-06 2022-09-13 江苏大学 Single-channel voice separation method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5133012A (en) * 1988-12-02 1992-07-21 Kabushiki Kaisha Toshiba Speech recognition system utilizing both a long-term strategic and a short-term strategic scoring operation in a transition network thereof
CN101937675A (en) * 2009-06-29 2011-01-05 展讯通信(上海)有限公司 Voice detection method and equipment thereof
CN102122507A (en) * 2010-01-08 2011-07-13 龚澍 Speech error detection method by front-end processing using artificial neural network (ANN)
US8005674B2 (en) * 2006-11-29 2011-08-23 International Business Machines Corporation Data modeling of class independent recognition models
CN104538028A (en) * 2014-12-25 2015-04-22 清华大学 Continuous voice recognition method based on deep long and short term memory recurrent neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235799B2 (en) * 2011-11-26 2016-01-12 Microsoft Technology Licensing, Llc Discriminative pretraining of deep neural networks


Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086865A (en) * 2018-06-11 2018-12-25 上海交通大学 A kind of series model method for building up based on cutting Recognition with Recurrent Neural Network
CN109086865B (en) * 2018-06-11 2022-01-28 上海交通大学 Sequence model establishing method based on segmented recurrent neural network
CN110147284A (en) * 2019-05-24 2019-08-20 湖南农业大学 Supercomputer workload prediction method based on two-dimentional shot and long term Memory Neural Networks
CN110377889A (en) * 2019-06-05 2019-10-25 安徽继远软件有限公司 A kind of method for editing text and system based on feedforward sequence Memory Neural Networks
CN110377889B (en) * 2019-06-05 2023-06-20 安徽继远软件有限公司 Text editing method and system based on feedforward sequence memory neural network
CN110705743A (en) * 2019-08-23 2020-01-17 国网浙江省电力有限公司 New energy consumption electric quantity prediction method based on long-term and short-term memory neural network
CN110705743B (en) * 2019-08-23 2023-08-18 国网浙江省电力有限公司 New energy consumption electric quantity prediction method based on long-term and short-term memory neural network
CN112488286A (en) * 2019-11-22 2021-03-12 大唐环境产业集团股份有限公司 MBR membrane pollution online monitoring method and system
CN111191559A (en) * 2019-12-25 2020-05-22 国网浙江省电力有限公司泰顺县供电公司 Overhead line early warning system obstacle identification method based on time convolution neural network
CN111191559B (en) * 2019-12-25 2023-07-11 国网浙江省电力有限公司泰顺县供电公司 Overhead line early warning system obstacle recognition method based on time convolution neural network
CN111079906A (en) * 2019-12-30 2020-04-28 燕山大学 Cement product specific surface area prediction method and system based on long-time and short-time memory network
CN111079906B (en) * 2019-12-30 2023-05-05 燕山大学 Cement finished product specific surface area prediction method and system based on long-short-term memory network
CN111241466A (en) * 2020-01-15 2020-06-05 上海海事大学 Ship flow prediction method based on deep learning
CN111241466B (en) * 2020-01-15 2023-10-03 上海海事大学 Ship flow prediction method based on deep learning
CN111414478A (en) * 2020-03-13 2020-07-14 北京科技大学 Social network emotion modeling method based on deep cycle neural network
CN111414478B (en) * 2020-03-13 2023-11-17 北京科技大学 Social network emotion modeling method based on deep cyclic neural network
CN112001482A (en) * 2020-08-14 2020-11-27 佳都新太科技股份有限公司 Vibration prediction and model training method and device, computer equipment and storage medium
CN112466056A (en) * 2020-12-01 2021-03-09 上海旷日网络科技有限公司 Self-service cabinet pickup system and method based on voice recognition
CN112714130A (en) * 2020-12-30 2021-04-27 南京信息工程大学 Big data-based adaptive network security situation sensing method

Also Published As

Publication number Publication date
CN104538028A (en) 2015-04-22
CN104538028B (en) 2017-10-17

Similar Documents

Publication Publication Date Title
WO2016101688A1 (en) Continuous voice recognition method based on deep long-and-short-term memory recurrent neural network
TWI692751B (en) Voice wake-up method, device and electronic equipment
JP7109302B2 (en) Text generation model update method and text generation device
Nakkiran et al. Compressing deep neural networks using a rank-constrained topology
US20190034784A1 (en) Fixed-point training method for deep neural networks based on dynamic fixed-point conversion scheme
US20180018555A1 (en) System and method for building artificial neural network architectures
US9728183B2 (en) System and method for combining frame and segment level processing, via temporal pooling, for phonetic classification
WO2016145850A1 (en) Construction method for deep long short-term memory recurrent neural network acoustic model based on selective attention principle
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
KR20160069329A (en) Method and apparatus for training language model, method and apparatus for recognizing speech
JP2019159654A (en) Time-series information learning system, method, and neural network model
CN108831445A (en) Sichuan dialect recognition methods, acoustic training model method, device and equipment
US20140142929A1 (en) Deep neural networks training for speech and pattern recognition
US20230196202A1 (en) System and method for automatic building of learning machines using learning machines
KR20160089210A (en) Method and apparatus for training language model, method and apparatus for recognizing language
CN109147774B (en) Improved time-delay neural network acoustic model
CN110853630B (en) Lightweight speech recognition method facing edge calculation
CN106340297A (en) Speech recognition method and system based on cloud computing and confidence calculation
KR20220130565A (en) Keyword detection method and apparatus thereof
CN111144124A (en) Training method of machine learning model, intention recognition method, related device and equipment
CN113987179A (en) Knowledge enhancement and backtracking loss-based conversational emotion recognition network model, construction method, electronic device and storage medium
Zhang et al. High order recurrent neural networks for acoustic modelling
CN108461080A (en) A kind of Acoustic Modeling method and apparatus based on HLSTM models
Kang et al. Advanced recurrent network-based hybrid acoustic models for low resource speech recognition
Li et al. Improving long short-term memory networks using maxout units for large vocabulary speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15871761

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15871761

Country of ref document: EP

Kind code of ref document: A1