CN114596843A - Fusion method based on end-to-end speech recognition model and language model

Fusion method based on end-to-end speech recognition model and language model

Info

Publication number: CN114596843A
Application number: CN202210242872.3A
Authority: CN (China)
Prior art keywords: model, speech recognition, language model, recognition model, decoder
Other languages: Chinese (zh)
Inventors: 柳宇非, 张伟彬, 邢晓芬, 徐向民
Current assignee: South China University of Technology (SCUT)
Original assignee: South China University of Technology (SCUT)
Application filed by: South China University of Technology (SCUT)
Priority application: CN202210242872.3A
Legal status: Pending (the status listed is an assumption, not a legal conclusion)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/26: Speech to text systems
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04: Analysis-synthesis using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 2015/0631: Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

A fusion method based on an end-to-end speech recognition model and a language model, comprising: S1, training an end-to-end speech recognition model with speech-text pairs and training an external language model with text data; S2, taking the decoder of the trained speech recognition model out separately to form an independent model; S3, training the independent model on the text part of the training data and obtaining an estimation model of the internal language model after convergence; S4, decoding with the fused scores of the speech recognition model, the external language model and the estimation model of the internal language model to obtain the decoding result. The algorithm improves the recognition accuracy obtained after fusing the speech recognition model with the language model and has broad application prospects in the field of speech recognition.

Description

Fusion method based on end-to-end speech recognition model and language model
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a fusion method based on an end-to-end speech recognition model and a language model.
Background
The most classical speech recognition approach at present combines a Hidden Markov Model (HMM) with a Deep Neural Network (DNN). Although this approach exploits the short-time stationarity of the speech signal well, it still suffers from drawbacks such as the multi-model cascade of acoustic model, pronunciation dictionary and language model, inconsistent training objectives across the models, and a large decoding space. End-to-end speech recognition simplifies the whole recognition pipeline and uses a single, consistent training objective.
Currently, end-to-end speech recognition models fall mainly into three categories: Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-Transducer) and attention-based sequence models (Attention-based End-to-End Model, A-E2E). Attention-based sequence models use an attention mechanism to align the frame-level speech signal with the character sequence, and their accuracy is the highest among end-to-end speech recognition models. Such an end-to-end framework is broadly divided into three parts: an encoder, a decoder and an attention mechanism. Obtaining a language model that improves recognition is likewise important. The mainstream fusion algorithm for a language model and a speech recognition model at present is Shallow Fusion (SF). This fusion technique works well for traditional speech recognition models, but the improvement it brings to end-to-end speech recognition models is very limited. The main reason is that, unlike a traditional speech recognition model, an end-to-end model models the whole sentence and therefore inevitably learns an Internal Language Model (ILM). This internal language model interferes with the fusion of the speech recognition model and the external language model. As end-to-end models have become more widely used, more and more solutions have been proposed, the best known of which is the Density Ratio method proposed by Masashi Sugiyama. In this method, a small language model is trained on the data used to train the speech recognition model to approximate the ILM, and the approximate ILM is subtracted when the speech recognition model is fused with the external language model, thereby reducing the influence of the ILM. Building on the Density Ratio method, Microsoft proposed Internal Language Model Estimation (ILME), which estimates the language model inside the speech recognition model directly and more accurately, so that a more accurate estimated ILM can be subtracted in the fusion stage, yielding a large performance improvement. However, the ILME method proposed by Microsoft is only applicable to end-to-end speech recognition models with a bidirectional long short-term memory (BLSTM) network encoder and cannot be applied to the more recently proposed Transformer and Conformer encoders, which greatly limits its application. At the same time, the Microsoft method has no adaptive capability, so even when applied to an end-to-end speech recognition model with a BLSTM encoder it cannot achieve the optimal effect ("Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition").
Disclosure of Invention
Aiming at the deficiencies of the existing language model fusion and internal language model estimation techniques, the invention provides a fusion method based on an end-to-end speech recognition model and a language model, and mainly solves the problem that existing internal-language-model estimation algorithms have no adaptive capability. At the same time, the prior art estimates the internal language model with limited accuracy, so the accuracy gain obtained in fusion is limited. The main application scenario of the invention is the attention-based end-to-end speech recognition model (end-to-end speech recognition for short): the internal language model of the end-to-end speech recognition model is estimated by model training, and the estimated internal language model is subtracted during the inference and decoding stage. Compared with traditional language model fusion techniques, the method can greatly improve the recognition accuracy obtained by fusing the end-to-end speech recognition model with an external language model. Moreover, the method can be applied to all attention-based speech recognition models, including models with a Conformer encoder, a BLSTM encoder or a Transformer encoder, and to models with an LSTM decoder or a Transformer decoder, so its range of application is wider.
The invention is realized by at least one of the following technical schemes.
A fusion method based on an end-to-end speech recognition model and a language model comprises the following steps:
S1, training an end-to-end speech recognition model with speech-text pairs, and training an external language model with text data; the end-to-end speech recognition model includes an encoder, a decoder and an attention mechanism;
S2, taking the trained decoder of the end-to-end speech recognition model out separately to form an independent model;
S3, training the independent model separately on the text part of the training data, and obtaining an estimation model of the internal language model after convergence;
S4, decoding with the fused scores of the end-to-end speech recognition model, the external language model and the estimation model of the internal language model to obtain the decoding result.
Further, the trained decoder of the end-to-end speech recognition model is taken out separately to form an independent model; specifically, the topology of the decoder is modified to form the independent model by replacing the attention mechanism with a fully connected network.
Further, the independent model is trained separately on the text part of the data originally used to train the end-to-end speech recognition model, and the estimation model of the internal language model is obtained after convergence;
during training, the parameters originally belonging to the decoder are fixed and only the parameters of the added fully connected network are updated; the parameters to be updated comprise the weights and biases of the newly added fully connected network, and the estimation model of the internal language model is obtained after convergence.
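For illustration only, a minimal PyTorch sketch of this step is given below; the class name, the 512-dimensional width and the attribute layout of the trained decoder are assumptions, not the patent's reference implementation.

```python
# Hypothetical sketch: wrap the trained attention decoder so that the attention
# mechanism is replaced by a small fully connected network, and freeze every
# parameter that came from the original decoder.
import torch.nn as nn

class ILMEstimator(nn.Module):
    def __init__(self, trained_decoder: nn.Module, dim: int = 512):
        super().__init__()
        self.decoder = trained_decoder           # decoder taken from the trained ASR model
        # Newly added fully connected network standing in for the attention mechanism.
        self.fnn_ilme = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )
        # Fix the parameters originally belonging to the decoder; only the
        # weights and biases of fnn_ilme remain trainable.
        for p in self.decoder.parameters():
            p.requires_grad = False
```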
Further, decoding is performed with the Beam Search algorithm, and the decoding score is computed as follows: the score of the external language model is added to the score of the end-to-end speech recognition model, and the score of the estimation model of the internal language model is then subtracted; each score is obtained by passing the normalized probability distribution output by the corresponding model through the natural logarithm function.
Further, the score contributions of the external language model and of the estimation model of the internal language model, relative to the end-to-end speech recognition model, are controlled by setting two separate fusion weights.
Further, the external language model is a recurrent neural network language model or a Transformer language model.
Further, the end-to-end speech recognition model must contain an attention mechanism.
Further, the encoder is a convolution-augmented Transformer (Conformer) encoder, a bidirectional long short-term memory (BLSTM) network encoder, or a Transformer encoder.
Further, the decoder is a long short-term memory (LSTM) network decoder or a Transformer decoder.
Further, the attention mechanism includes, but is not limited to, an additive attention mechanism, a position-sensitive attention mechanism or a monotonic attention mechanism.
Compared with the prior art, the invention has the following beneficial effects: the invention realizes estimation of the internal language model by changing the attention mechanism of the end-to-end speech recognition model, and can therefore be applied to every end-to-end speech recognition model that contains an attention mechanism. Compared with traditional language model fusion algorithms, the method greatly improves the effect of fusing the end-to-end speech recognition model with the language model.
Drawings
FIG. 1 is a block diagram illustrating an overall structure of a fusion method based on an end-to-end speech recognition model and a language model according to an embodiment;
FIG. 2 is a block diagram of an embodiment speech recognition model;
FIG. 3 is a block diagram of an internal language model estimation model modified from the decoder portion of the speech recognition model in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
As shown in fig. 1, fig. 2 and fig. 3, a fusion method based on an end-to-end speech recognition model and a language model is used. The speech recognition model selected in this embodiment is composed of a Conformer encoder, an additive attention mechanism and an LSTM decoder. The Conformer encoder has 12 layers, each 512-dimensional, and the encoder uses eight self-attention heads. Dropout is used during training to prevent the model from overfitting. The decoder consists of two layers of long short-term memory networks, each of width 2048. The language model is an RNN language model composed of three LSTM layers, each with a 2048-dimensional hidden layer; dropout is likewise used during training to prevent overfitting. In this example the speech recognition model is trained on a Chinese general-domain data set, and a medical dialogue data set is selected as the test set. To match the test set, the external language model in this example is trained on a large volume of medical-domain text. In this implementation, 40-dimensional MFCC features are extracted from the selected speech data as the input audio features, with a 25 ms feature-extraction window and a 10 ms window shift. The method specifically comprises the following steps:
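As an illustration, the feature extraction described above could be configured as follows with torchaudio; the 16 kHz sample rate and the file path are assumptions not stated in the embodiment.

```python
# Illustrative sketch: 40-dimensional MFCC features with a 25 ms window and
# a 10 ms shift, assuming 16 kHz audio (400-sample window, 160-sample hop).
import torchaudio

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=40,
    melkwargs={"n_fft": 400, "win_length": 400, "hop_length": 160, "n_mels": 80},
)
waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical path
features = mfcc(waveform)  # shape: (channels, 40, num_frames)
```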
S1, training the end-to-end speech recognition model with the speech-text pairs of the Chinese general-domain training set. During training of the end-to-end speech recognition model the learning rate decays exponentially: it starts at 0.015 and decays to 0.0015 after 400,000 iterations. The external language model is trained on the medical-domain text data; its learning rate likewise decays exponentially, starting at 0.015 and decaying to 0.0015 after 100,000 iterations;
the language model used in the implementation is a long-term and short-term memory network language model, and the formula is as follows:
Figure BDA0003543349360000051
wherein
Figure BDA0003543349360000052
A prediction output representing an external language model at the current time, yiRepresenting words entered at time i, yi-1...y0Text sequence composed of all words from 0 to the time before the current time, LSTM represents long-short term memory network, softmax represents activation function, and the text sequence is output after the activation function is activated by the softmax
Figure BDA0003543349360000061
Is the normalized probability.
As another embodiment, a Transformer language model is used in place of the long short-term memory network language model; it can be expressed as:

\hat{y}_i^{elm} = \mathrm{softmax}(\mathrm{Transformer}(y_{i-1}, \ldots, y_0))

where Transformer denotes a Transformer network and softmax denotes the activation function; the output \hat{y}_i^{elm} after the softmax activation is a normalized probability.
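A minimal sketch of such an external language model in PyTorch follows, using the 2048-dimensional hidden layers and three LSTM layers of this embodiment; the vocabulary size and the embedding choice are illustrative assumptions.

```python
# Sketch of the external LSTM language model: the history y_0 ... y_{i-1} is
# embedded, passed through a stack of LSTM layers, and a softmax yields the
# normalized probability of the next token.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 2048, layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, i) token ids y_0 ... y_{i-1}
        h, _ = self.lstm(self.embed(history))
        return torch.softmax(self.out(h[:, -1]), dim=-1)  # \hat{y}_i^{elm}
```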
S2, the decoder part of the trained speech recognition model is taken out to form an independent model: the internal language model estimation model is obtained by replacing the attention mechanism of the original speech recognition model with a fully connected network and removing the encoder part. The specific formulas are:

s_i^{ilm} = \mathrm{Decoder}(s_{i-1}^{ilm}, \hat{y}_{i-1}^{ilm})
q_i^{ilm} = \mathrm{FNN}_1(s_i^{ilm})
c_i^{ilm} = \mathrm{FNN}_{ilme}(q_i^{ilm})
\hat{y}_i^{ilm} = \mathrm{softmax}(\mathrm{FNN}_2([c_i^{ilm}; s_i^{ilm}]))

where all subscripts and superscripts denote the decoding moment. s_i^{ilm} is the hidden state of the autoregressive decoder at the i-th moment, computed jointly from the hidden state s_{i-1}^{ilm} at the previous moment and the predicted output \hat{y}_{i-1}^{ilm} of the previous moment. The hidden state is converted by the fully connected network FNN_1 into the query vector q_i^{ilm} of the current time. The content vector c_i^{ilm} is obtained directly from the query vector q_i^{ilm} through the fully connected network FNN_{ilme}. The FNN_{ilme} used here is a two-layer fully connected network of width 512 with RELU activations; [c_i^{ilm}; s_i^{ilm}] denotes the concatenation of the content vector and the hidden state at the current time. After the content vector of the current moment is obtained, it is concatenated with the hidden state of the current moment, as in the end-to-end speech recognition model, and sent to the fully connected network FNN_2; the normalized probability output of the current moment is obtained after the softmax activation function.
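The four formulas above can be read as one decoding step of the estimator. A hedged sketch follows, where decoder_cell, fnn1 and fnn2 stand for the trained decoder networks and fnn_ilme is the newly added fully connected network; all names are illustrative.

```python
# One decoding step of the internal-language-model estimation model: the
# content vector is produced by the new fully connected network instead of
# attention, so no encoder output is needed.
import torch

def ilme_step(decoder_cell, fnn1, fnn2, fnn_ilme, s_prev, y_prev_embed):
    s_i = decoder_cell(y_prev_embed, s_prev)         # s_i^{ilm} from s_{i-1}^{ilm}, y_{i-1}^{ilm}
    h_i = s_i[0] if isinstance(s_i, tuple) else s_i  # LSTM cells return (h, c)
    q_i = fnn1(h_i)                                  # query vector q_i^{ilm}
    c_i = fnn_ilme(q_i)                              # content vector c_i^{ilm}
    y_i = torch.softmax(fnn2(torch.cat([c_i, h_i], dim=-1)), dim=-1)
    return y_i, s_i                                  # normalized probability, new state
```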
S3, training the independent model by the text in the Chinese universal data set, wherein the learning rate during training is exponentially decreased to 0.015 at the beginning and exponentially decreased to 0.0015 after 10000 iterations. During training, the parameters originally belonging to the decoder in the fixed model only update the parameters of the newly added full-connection network, and the specific operation formula is as follows:
Figure BDA0003543349360000071
Figure BDA0003543349360000072
Figure BDA0003543349360000073
Figure BDA0003543349360000074
Lossilm=CE(yilm,GT)
Figure BDA0003543349360000075
wherein,
Figure BDA0003543349360000076
is the state of the autoregressive decoder, the hidden state being the hidden state at the previous decoding moment
Figure BDA0003543349360000077
And predicted output of previous time
Figure BDA0003543349360000078
And jointly calculating. The hidden state can be converted into the current query vector by a full-connection network
Figure BDA0003543349360000079
Content vector
Figure BDA00035433493600000710
By query vectors
Figure BDA00035433493600000711
Directly obtained after passing through a full-connection network and outputs a content vector by using a current time attention mechanism
Figure BDA00035433493600000712
And hidden state at the current time
Figure BDA00035433493600000713
The prediction output of the current moment can be obtained through a full-connection network after splicing
Figure BDA00035433493600000714
Figure BDA00035433493600000715
Figure BDA00035433493600000716
Representing the predicted output of the internal language model estimation model, GT representing the standard output at all times, CE representing the cross entropy Loss function, LossilmIs a loss function used to train the internal language model estimation model. Using gradient descent algorithm during training, and using loss function to parameters in newly-added fully-connected network
Figure BDA00035433493600000717
Partial derivative of (2)
Figure BDA00035433493600000718
And updating parameters in the newly-added fully-connected network, wherein the parameters needing to be updated comprise the weight parameters and the bias parameters in the newly-added fully-connected network. Meanwhile, the parameters in the encoder are fixed, and the parameters of the Decoder refer to the parameters in the Decoder network Decoder and FNN1And FNN2Weight and bias parameters in (1).
The converged model is called an estimation model for the internal language model;
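A hedged sketch of this training step is given below, assuming an estimator built as in the earlier sketch that returns per-step logits; the optimizer, decay factor and batch format are illustrative choices, not the patent's reference settings.

```python
# Train only the newly added fully connected network with cross-entropy
# against the ground-truth text; CrossEntropyLoss applies log-softmax
# internally, so the estimator is assumed to return raw logits here.
import torch
import torch.nn as nn

def train_ilm_estimator(ilm_estimator, text_batches, lr: float = 0.015):
    criterion = nn.CrossEntropyLoss()
    trainable = [p for p in ilm_estimator.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=lr)
    # gamma chosen so that 0.015 decays to roughly 0.0015 over 10,000 steps
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99977)

    for inputs, targets in text_batches:                    # text-only training data
        logits = ilm_estimator(inputs)                      # (batch, steps, vocab)
        loss = criterion(logits.transpose(1, 2), targets)   # Loss_ilm = CE(y_ilm, GT)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```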
S4, the medical dialogue data set is decoded with the Beam Search algorithm; the score during decoding is computed by adding the score of the external language model to the score of the speech recognition model and then subtracting the score of the estimation model of the internal language model. The fusion is calculated as:

score_i = \log \hat{y}_i^{s2s} + \lambda_{elm} \log \hat{y}_i^{elm} - \lambda_{ilm} \log \hat{y}_i^{ilm}

where \hat{y}_i^{s2s}, \hat{y}_i^{elm} and \hat{y}_i^{ilm} denote the normalized output probabilities of the speech recognition model, the external language model and the estimated internal language model at the current moment, respectively, and log denotes the natural logarithm. \lambda_{elm} and \lambda_{ilm} denote the fusion weights of the external language model and the internal language model; both need to be set according to the specific situation, and the effect is usually best when the two are equal. score_i is the fused score at the current moment; the subsequent beam search algorithm decodes according to this score and outputs the final result.
S5, the decoding result is the decoding result of the fusion of the speech recognition model and the external language model.
Example 2
The end-to-end speech recognition model used in this implementation can be expressed as the following structure: the speech to be recognized is turned into a feature sequence by feature extraction and input into the encoder of the speech recognition model; the encoder output is passed to the attention mechanism module and stored for later use. The decoder calculates the predicted output of the current moment in an autoregressive manner, combined with the attention mechanism, according to the following formulas:

H = \mathrm{Conformer}(X)
s_i = \mathrm{Decoder}(s_{i-1}, \hat{y}_{i-1})
q_i = \mathrm{FNN}_1(s_i)
c_i = \mathrm{Attention}(H, q_i)
\hat{y}_i = \mathrm{softmax}(\mathrm{FNN}_2([c_i; s_i]))

where X = [x_1, x_2, ..., x_t, ..., x_T] is the sequence of audio features to be recognized, x_t denotes the audio features of the t-th frame, X \in R^{T \times d}, T is the length of the audio sequence and d is the feature dimension; H = [h_1, h_2, ..., h_t, ..., h_T] is the output after encoding by the encoder, and h_t is the encoded output corresponding to the acoustic features at time t. s_i is the state of the autoregressive decoder, computed jointly from the hidden state s_{i-1} at the previous decoding moment and the predicted output of the previous moment. The hidden state is converted into the query vector q_i at the current moment by a fully connected network; the attention mechanism calculates the corresponding content vector c_i from this query vector; and the content vector output by the attention mechanism at the current moment and the hidden state s_i at the current moment are concatenated and passed through a fully connected network to obtain the prediction output \hat{y}_i of the current moment. This process is repeated in an autoregressive manner until the stop symbol is predicted and decoding stops.
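For illustration, the autoregressive decoding loop above can be sketched as follows (greedy decoding for brevity; encoder, decoder_cell, fnn1, fnn2, attention and embed stand for the corresponding trained modules, and the start/stop token handling is an assumption).

```python
# Greedy sketch of the attention-based decoding loop: encode once, then
# repeat decoder steps until the stop symbol is predicted.
import torch

def greedy_decode(encoder, decoder_cell, fnn1, fnn2, attention, embed,
                  x, sos_id, eos_id, max_len=200):
    H = encoder(x)                                   # H = Conformer(X)
    s_i, y_prev = None, torch.tensor([sos_id])
    hypothesis = []
    for _ in range(max_len):
        s_i = decoder_cell(embed(y_prev), s_i)       # s_i from s_{i-1} and y_{i-1}
        h_i = s_i[0] if isinstance(s_i, tuple) else s_i
        q_i = fnn1(h_i)                              # query vector q_i
        c_i = attention(H, q_i)                      # content vector c_i
        y_i = torch.softmax(fnn2(torch.cat([c_i, h_i], dim=-1)), dim=-1)
        token = int(y_i.argmax(dim=-1))
        if token == eos_id:                          # stop symbol: end decoding
            break
        hypothesis.append(token)
        y_prev = torch.tensor([token])
    return hypothesis
```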
The attention mechanism in the end-to-end speech recognition model serves the following function: through the attention mechanism, the decoder acquires the acoustic information processed by the encoder.
As another embodiment, a bidirectional long short-term memory network (BLSTM) encoder or a Transformer encoder may be used in place of the Conformer encoder; these two encoders can be expressed respectively as:
H = \mathrm{BLSTM}(X)
H = \mathrm{Transformer}(X).
example 3
This implementation of the fusion method based on an end-to-end speech recognition model and a language model comprises the following steps:
S1, training an end-to-end speech recognition model with speech-text pairs, and training an external language model with text data; the end-to-end speech recognition model comprises an encoder, a decoder and an attention mechanism, and the decoder acquires the acoustic information processed by the encoder through the attention mechanism;
the encoder is a Conformer encoder, a BLSTM encoder or a Transformer encoder;
the decoder is an LSTM decoder or a transform decoder;
the attention mechanism is an additive attention mechanism, a position-sensitive attention mechanism or a monotonic attention mechanism. The end-to-end speech recognition model can also be formed by combining the modules.
S2, the trained decoder of the end-to-end speech recognition model is taken out separately, and the attention mechanism of the model is replaced with a two-layer fully connected network of width 512 to form an independent model. The two added layers of this fully connected network use RELU activation functions; the final output layer of the model uses no activation function;
S3, training the independent model on the text part of the training data; during training, the parameters originally belonging to the decoder of the end-to-end speech recognition model are fixed and only the parameters of the newly added fully connected layers are updated. The learning rate starts at 0.015 and decays exponentially to 0.0015 after 10,000 iterations. The estimation model of the internal language model is obtained after convergence;
S4, decoding with the fused scores of the end-to-end speech recognition model, the external language model and the estimation model of the internal language model to obtain the decoding result; the Beam Search should use a relatively large beam size as the decoding parameter, and the beam size is 60 in this example.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. A fusion method based on an end-to-end speech recognition model and a language model is characterized by comprising the following steps:
S1, training an end-to-end speech recognition model with speech-text pairs, and training an external language model with text data; the end-to-end speech recognition model includes an encoder, a decoder and an attention mechanism;
S2, taking the trained decoder of the end-to-end speech recognition model out separately to form an independent model;
S3, training the independent model separately on the text part of the training data, and obtaining an estimation model of the internal language model after convergence;
S4, decoding with the fused scores of the end-to-end speech recognition model, the external language model and the estimation model of the internal language model to obtain the decoding result.
2. The method according to claim 1, wherein the trained decoder of the end-to-end speech recognition model is taken out separately to form an independent model, specifically: the topology of the decoder is modified to form the independent model by replacing the attention mechanism with a fully connected network.
3. The method according to claim 2, wherein the independent model is trained separately on the text part of the data originally used to train the end-to-end speech recognition model, and the estimation model of the internal language model is obtained after convergence;
during training, the parameters originally belonging to the decoder are fixed and only the parameters of the added fully connected network are updated; the parameters to be updated comprise the weights and biases of the newly added fully connected network, and the estimation model of the internal language model is obtained after convergence.
4. The fusion method of claim 1, wherein decoding is performed with a Beam Search algorithm and the decoding score is calculated as follows: the score of the external language model is added to the score of the end-to-end speech recognition model, and the score of the estimation model of the internal language model is then subtracted; each score is obtained by passing the normalized probability distribution output by the corresponding model through the natural logarithm function.
5. The method of claim 4, wherein the score weights of the end-to-end speech recognition model, the external language model and the internal language model are controlled by setting two fusion weights.
6. The method of claim 1, wherein the external language model is a recurrent neural network language model.
7. The method of claim 1, wherein the external language model is a Transformer language model.
8. The method of claim 1, wherein the encoder is a convolution-augmented Transformer (Conformer) encoder, a bidirectional long short-term memory network encoder, or a Transformer encoder.
9. The method of claim 1, wherein the decoder is a long-short term memory network decoder or a Transformer decoder.
10. The method for fusing an end-to-end speech recognition model and a language model according to any one of claims 1 to 9, wherein the attention mechanism includes, but is not limited to, an additive attention mechanism, a position-sensitive attention mechanism or a monotonic attention mechanism.
CN202210242872.3A 2022-03-11 2022-03-11 Fusion method based on end-to-end voice recognition model and language model Pending CN114596843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210242872.3A CN114596843A (en) 2022-03-11 2022-03-11 Fusion method based on end-to-end voice recognition model and language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210242872.3A CN114596843A (en) 2022-03-11 2022-03-11 Fusion method based on end-to-end voice recognition model and language model

Publications (1)

Publication Number Publication Date
CN114596843A true CN114596843A (en) 2022-06-07

Family

ID=81809381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210242872.3A Pending CN114596843A (en) 2022-03-11 2022-03-11 Fusion method based on end-to-end voice recognition model and language model

Country Status (1)

Country Link
CN (1) CN114596843A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114944148A (en) * 2022-07-09 2022-08-26 昆明理工大学 Streaming Vietnamese speech recognition method fusing external language knowledge
CN114944148B (en) * 2022-07-09 2023-08-22 昆明理工大学 Streaming Vietnam voice recognition method integrating external language knowledge
CN117351955A (en) * 2023-10-11 2024-01-05 黑龙江大学 Helicopter voice recognition method and device based on transfer learning and end-to-end model

Similar Documents

Publication Publication Date Title
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
CN108305617B (en) Method and device for recognizing voice keywords
CN110189749B (en) Automatic voice keyword recognition method
CN111199727B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN111382582B (en) Neural machine translation decoding acceleration method based on non-autoregressive
CN114023316A (en) TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN111968629A (en) Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN114596843A (en) Fusion method based on end-to-end voice recognition model and language model
CN112052692A (en) Mongolian Chinese neural machine translation method based on grammar supervision and deep reinforcement learning
CN112967739B (en) Voice endpoint detection method and system based on long-term and short-term memory network
CN111783477B (en) Voice translation method and system
CN110532555B (en) Language evaluation generation method based on reinforcement learning
US20220044671A1 (en) Spoken language understanding
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
CN113241075A (en) Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN113488028A (en) Speech transcription recognition training decoding method and system based on rapid skip decoding
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN112967720A (en) End-to-end voice-to-text model optimization method under small amount of accent data
CN115440197A (en) Voice recognition method and system based on domain classification and hot word prefix tree cluster search
CN118471201B (en) Efficient self-adaptive hotword error correction method and system for speech recognition engine
Macoskey et al. Bifocal neural asr: Exploiting keyword spotting for inference optimization
CN116227503A (en) CTC-based non-autoregressive end-to-end speech translation method
CN114999460A (en) Lightweight Chinese speech recognition method combined with Transformer
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination