Best Mode for Carrying Out the Invention
Fig. 2 shows the model structure of a speech recognition device according to an embodiment of the present invention. The recurrent neural network of this embodiment comprises:
A baseline model, formed by connecting two LSTM network layers 2.
An extension model, comprising a plurality of residual network layers 3. Each residual network layer 3 is formed by connecting one LSTM network layer 2 and one addition-function layer. The input of the residual network layer 3 is connected to the output of the preceding network layer; the two inputs of the addition-function layer are connected to the output of the LSTM network layer 2 within the residual network layer 3 and to the output of the preceding network layer, respectively; and the output of the addition-function layer serves as the output of the residual network layer 3.
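The skip connection described above can be sketched in a few lines of Python. This is an illustration only: `toy_lstm` below is an arbitrary placeholder standing in for a real LSTM network layer 2, and all names are hypothetical.

```python
def make_residual_layer(lstm_layer):
    """Wrap an LSTM-like layer with an addition-function (skip) connection:
    the layer output is LSTM(x) + x, element by element."""
    def residual_layer(x):
        return [h + skip for h, skip in zip(lstm_layer(x), x)]
    return residual_layer

# Placeholder for an LSTM network layer 2: here it simply halves each value.
toy_lstm = lambda xs: [0.5 * v for v in xs]

res_layer = make_residual_layer(toy_lstm)
print(res_layer([1.0, 2.0]))  # [1.5, 3.0]
```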
The extension model comprises 1 to 7 residual network layers 3, and the recurrent neural network has a depth of 3 to 9 layers.
The extension depth of the extension model is determined by training: when adding a further residual network layer 3 makes the training result worse, the depth before that added layer is taken as the depth of the recurrent neural network.
In this embodiment, the recurrent neural network is used in a speech recognition device.
The speech recognition device comprises: a convolutional layer 1, the recurrent neural network, a fully connected layer 4, and a CTC layer 5.
The convolutional layer 1 receives the spectral signal of the sound; the output of the convolutional layer 1 is fed into the recurrent neural network, and the recurrent neural network is connected through the fully connected layer 4 to the CTC layer 5. The CTC layer 5 provides the CTC loss function and is used for training on the speech signal.
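The data path just described (spectrum → convolution → recurrent network → fully connected layer → CTC) can be sketched as a simple composition. All stage functions below are placeholder assumptions, not the actual layers:

```python
def speech_recognizer(spectrum, conv, rnn, fc, ctc):
    """Forward path of the device: convolutional layer 1 feeds the
    recurrent neural network, whose output reaches the CTC layer 5
    through the fully connected layer 4."""
    features = conv(spectrum)
    encoded = rnn(features)
    logits = fc(encoded)
    return ctc(logits)

# Placeholder stages that just tag the data to show the ordering.
stage = lambda name: (lambda x: x + [name])
out = speech_recognizer([], stage("conv"), stage("rnn"), stage("fc"), stage("ctc"))
print(out)  # ['conv', 'rnn', 'fc', 'ctc']
```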
The convolutional layer 1 comprises 1 to 3 layers and is usually an invariant convolutional layer.
The fully connected layer 4 comprises one or more layers.
In the recurrent neural network, every network layer is built from identical network nodes: in an LSTM network layer 2, all nodes are LSTM network nodes 6; in a residual network layer 3, all nodes are residual network nodes 8. As shown in Fig. 2, a residual network node 8 consists of one LSTM network node 6 and one addition-function node 9 (labelled ADD in Fig. 2); the addition-function nodes 9 together form the addition-function layer.
Each network layer in the recurrent neural network is a bidirectional network layer; that is, along the width direction of each network layer, different network nodes can pass information to each other, as indicated by the two arrowed lines in the dashed circle 7. In Fig. 2, the nodes of only one network layer are drawn in detail; an ellipsis of three dots indicates that a layer contains further network nodes.
Along the depth direction of the recurrent neural network, the network layers have the same number of network nodes, with a one-to-one correspondence between them.
For a residual network node 8, the output of the preceding network node is fed both to the LSTM network node 6 and to the addition-function node 9; the output of the LSTM network node 6 within the residual network node 8 is also fed to the addition-function node 9, and the output of the addition-function node 9 serves as the output of the residual network node 8. When the (K+1)-th network layer is a residual network layer 3, the output of a residual network node 8 in that layer can be expressed as:
output_{k+1} = LSTM_{k+1}(output_k) + output_k;
where output_{k+1} denotes the output of the residual network node 8 in the (K+1)-th network layer, i.e. the output of its addition-function node 9;
output_k denotes the output of the corresponding residual network node 8 in the K-th network layer, i.e. the output of its addition-function node 9;
LSTM_{k+1}() denotes the function realized by the LSTM network node 6 within the residual network node 8 of the (K+1)-th network layer; and
LSTM_{k+1}(output_k) denotes the output of that LSTM network node 6 when its input is output_k.
For the baseline model, i.e. the first two LSTM network layers 2, the output of each LSTM network node 6 is LSTM_k(output_{k-1}), where LSTM_k() denotes the function realized by the LSTM network node 6 of the K-th LSTM network layer 2, and LSTM_k(output_{k-1}) is its output when the input is output_{k-1}.
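A toy forward pass makes the two formulas concrete. The stand-in `lstm` below is an arbitrary placeholder function chosen for illustration, not a real LSTM node:

```python
lstm = lambda x: 0.5 * x + 1.0   # placeholder for LSTM_k(); arbitrary choice

x = 4.0
h1 = lstm(x)        # baseline layer 1: LSTM_1(x)        = 3.0
h2 = lstm(h1)       # baseline layer 2: LSTM_2(h1)       = 2.5
h3 = lstm(h2) + h2  # residual layer 3: LSTM_3(h2) + h2  = 4.75
print(h1, h2, h3)   # 3.0 2.5 4.75
```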
On the basis of a baseline model composed of two LSTM network layers 2, the recurrent neural network of this embodiment adds residual network layers 3, each formed by connecting an LSTM network layer 2 and an addition-function layer. The residual network layers 3 increase the depth of the recurrent neural network while still allowing it to converge, so a deeper network can be realized, which in turn improves the training effect and performance.
Fig. 3 is a flowchart of the recurrent neural network training method according to an embodiment of the present invention. The method comprises the following steps:
Step 1. Provide a baseline model of the recurrent neural network, the baseline model being formed by connecting two LSTM network layers 2. Step 1 corresponds to the step labelled 301 in Fig. 3.
Step 2. Initialize the baseline model; this corresponds to the step labelled 302 in Fig. 3.
Training of the recurrent neural network starts from the first LSTM network layer 2. In Fig. 3, the training of this first layer is not shown as a separate step but is included in the initialization step. The step labelled 303 in Fig. 3 starts from K = 2; values of K greater than 2 correspond to the subsequent training of the extension model.
Step 3. Add an extension model on top of the baseline model. The extension model comprises a plurality of residual network layers 3; each residual network layer 3 is formed by connecting one LSTM network layer 2 and one addition-function layer, the input of the residual network layer 3 is connected to the output of the preceding network layer, the two inputs of the addition-function layer are connected to the output of the LSTM network layer 2 within the residual network layer 3 and to the output of the preceding network layer, respectively, and the output of the addition-function layer serves as the output of the residual network layer 3.
Each time a residual network layer 3 is added, the recurrent neural network is trained once (the training corresponding to label 303). Adding a residual network layer 3 comprises the following sub-steps:
Step 31. Add a new residual network layer 3 as the (K+1)-th layer. The first K network layers have already been trained and are initialized with the trained model; the (K+1)-th layer is initialized with random parameters.
As shown in the step labelled 307, after a residual network layer 3 has been added, K is usually reset to K = K + 1 to simplify the training loop.
Then, as shown in the step labelled 308, after K has been reset, the first K−1 network layers are initialized with the already trained parameters, and the K-th network layer is initialized with random parameters.
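Sub-step 31 (and its re-indexed form in steps 307/308) amounts to the following initialization rule. The flat per-layer parameter lists are a hypothetical simplification for illustration:

```python
import random

def init_parameters(trained_layers, new_layer_size):
    """First K layers: copy the already trained parameters.
    New layer K+1: random initialization."""
    params = [list(layer) for layer in trained_layers]   # keep trained values
    params.append([random.uniform(-0.1, 0.1) for _ in range(new_layer_size)])
    return params

trained = [[0.2, 0.4], [0.1, 0.3]]            # two trained baseline layers
params = init_parameters(trained, new_layer_size=2)
print(len(params), params[:2])  # 3 [[0.2, 0.4], [0.1, 0.3]]
```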
Step 32. Train the (K+1)-th residual network layer 3, i.e. perform the step labelled 303.
Step 33. Run a performance test and check whether the improvement in the test result is greater than a threshold, i.e. perform the step labelled 304.
As shown at label 304:
If the improvement in the performance test result is greater than the threshold, proceed to Step 34. The threshold in Step 33 is 3%.
If the improvement in the performance test result is smaller than the threshold, proceed to Step 35.
Step 34. Add the (K+1)-th residual network layer 3 to the recurrent neural network, then repeat from Step 31.
Step 35. As shown in the step labelled 309, training ends: no further residual network layer 3 is added, and the existing K network layers are taken as the recurrent neural network.
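Steps 31 to 35 form a greedy depth search, which can be sketched as below. All callables (`train`, `evaluate`, `add_residual_layer`) are placeholders for the actual training pipeline, and the 3% threshold follows Step 33:

```python
def grow_network(train, evaluate, add_residual_layer,
                 base_depth=2, max_depth=9, threshold=0.03):
    """Add residual layers one at a time; keep a new layer only while the
    performance improvement exceeds `threshold` (3% in the embodiment)."""
    depth = base_depth               # baseline: two LSTM network layers
    best = evaluate()
    while depth < max_depth:
        add_residual_layer()         # new layer K+1, randomly initialized
        train()                      # train with the new layer in place
        score = evaluate()
        if score - best <= threshold:
            break                    # gain too small: keep the previous depth
        best = score
        depth += 1
    return depth

# Hypothetical scores: +10% after the first added layer, +2% after the second.
scores = iter([0.50, 0.60, 0.62])
depth = grow_network(lambda: None, lambda: next(scores), lambda: None)
print(depth)  # 3
```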
The method of this embodiment achieves the following: the extension model comprises 1 to 7 residual network layers 3, and the recurrent neural network has a depth of 3 to 9 layers.
In the method of this embodiment, the recurrent neural network is used in a speech recognition device.
The present invention has been described in detail above through specific embodiments, but these embodiments do not limit the invention. Those skilled in the art can make many modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the invention.