CN114596843A - Fusion method based on end-to-end speech recognition model and language model

Fusion method based on end-to-end speech recognition model and language model

Info

Publication number: CN114596843A
Application number: CN202210242872.3A
Authority: CN (China)
Prior art keywords: model, speech recognition, language model, recognition model, decoder
Other languages: Chinese (zh)
Inventors: 柳宇非, 张伟彬, 邢晓芬, 徐向民
Current assignee: South China University of Technology (SCUT)
Original assignee: South China University of Technology (SCUT)
Application filed by: South China University of Technology (SCUT)
Priority application: CN202210242872.3A
Legal status: Pending (the status listed is an assumption, not a legal conclusion)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/26: Speech to text systems
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04: Analysis-synthesis using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 2015/0631: Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

A fusion method based on an end-to-end speech recognition model and a language model, comprising: S1, training an end-to-end speech recognition model with speech-text pairs and training an external language model with text data; S2, taking the decoder of the trained speech recognition model out separately to form an independent model; S3, training the independent model on the text part of the training data and obtaining an estimation model of the internal language model after convergence; S4, decoding with the fused scores of the speech recognition model, the external language model and the estimation model of the internal language model to obtain the decoding result. The algorithm improves the recognition accuracy obtained after fusing the speech recognition model with the language model and has broad application prospects in the field of speech recognition.

Description

Fusion method based on end-to-end speech recognition model and language model
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a fusion method based on an end-to-end speech recognition model and a language model.
Background
The most classical speech recognition approach at present combines a Hidden Markov Model (HMM) with a Deep Neural Network (DNN). Although this approach exploits the short-time stationarity of the speech signal well, it still suffers from drawbacks such as the multi-model cascade of acoustic model, pronunciation dictionary and language model, inconsistent training objectives across the models, and a large decoding space. End-to-end speech recognition simplifies the whole recognition pipeline and uses a single, consistent training objective.
Currently, end-to-end speech recognition models fall mainly into three categories: Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-Transducer) and attention-based sequence models (Attention-based End-to-End Model, A-E2E). Attention-based sequence models use an attention mechanism to align the frame-level speech signal with the character sequence, and their accuracy is the highest among end-to-end speech recognition models. Such an end-to-end framework is broadly divided into three parts: an encoder, a decoder and an attention mechanism. Obtaining a language model that improves recognition is likewise important. The mainstream fusion algorithm for a language model and a speech recognition model at present is Shallow Fusion (SF). This fusion technique works well for traditional speech recognition models, but the improvement it brings to end-to-end speech recognition models is very limited. The main reason is that, unlike a traditional speech recognition model, an end-to-end model models the whole sentence and therefore inevitably learns an Internal Language Model (ILM). This internal language model interferes with the fusion of the speech recognition model and the external language model. As end-to-end models have become more widely used, more and more solutions have been proposed, the best known of which is the Density Ratio method proposed by Masashi Sugiyama. In this method, a small language model is trained on the data used to train the speech recognition model to approximate the ILM, and the approximate ILM is subtracted when the speech recognition model is fused with the external language model, thereby reducing the influence of the ILM. Building on the Density Ratio method, Microsoft proposed Internal Language Model Estimation (ILME), which estimates the language model inside the speech recognition model directly and more accurately, so that a more accurate estimated ILM can be subtracted in the fusion stage, yielding a large performance improvement. However, the ILME method proposed by Microsoft is only applicable to end-to-end speech recognition models with a bidirectional long short-term memory (BLSTM) network encoder and cannot be applied to the more recently proposed Transformer and Conformer encoders, which greatly limits its application. At the same time, the Microsoft method has no adaptive capability, so even when applied to an end-to-end speech recognition model with a BLSTM encoder it cannot achieve the optimal effect ("Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition").
Disclosure of Invention
Aiming at the deficiencies of the existing language model fusion and internal language model estimation techniques, the invention provides a fusion method based on an end-to-end speech recognition model and a language model, and mainly solves the problem that existing internal-language-model estimation algorithms have no adaptive capability. At the same time, the prior art estimates the internal language model with limited accuracy, so the accuracy gain obtained in fusion is limited. The main application scenario of the invention is the attention-based end-to-end speech recognition model (end-to-end speech recognition for short): the internal language model of the end-to-end speech recognition model is estimated by model training, and the estimated internal language model is subtracted during the inference and decoding stage. Compared with traditional language model fusion techniques, the method can greatly improve the recognition accuracy obtained by fusing the end-to-end speech recognition model with an external language model. Moreover, the method can be applied to all attention-based speech recognition models, including models with a Conformer encoder, a BLSTM encoder or a Transformer encoder, and to models with an LSTM decoder or a Transformer decoder, so its range of application is wider.
The invention is realized by at least one of the following technical schemes.
A fusion method based on an end-to-end speech recognition model and a language model comprises the following steps:
S1, training an end-to-end speech recognition model with speech-text pairs, and training an external language model with text data; the end-to-end speech recognition model includes an encoder, a decoder and an attention mechanism;
S2, taking the trained decoder of the end-to-end speech recognition model out separately to form an independent model;
S3, training the independent model separately on the text part of the training data, and obtaining an estimation model of the internal language model after convergence;
S4, decoding with the fused scores of the end-to-end speech recognition model, the external language model and the estimation model of the internal language model to obtain the decoding result.
Further, the trained decoder of the end-to-end speech recognition model is taken out separately to form an independent model; specifically, the topology of the decoder is modified to form the independent model by replacing the attention mechanism with a fully connected network.
Further, the independent model is trained separately on the text part of the data originally used to train the end-to-end speech recognition model, and the estimation model of the internal language model is obtained after convergence;
during training, the parameters originally belonging to the decoder are fixed and only the parameters of the added fully connected network are updated; the parameters to be updated comprise the weights and biases of the newly added fully connected network, and the estimation model of the internal language model is obtained after convergence.
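For illustration only, a minimal PyTorch sketch of this step is given below; the class name, the 512-dimensional width and the attribute layout of the trained decoder are assumptions, not the patent's reference implementation.

```python
# Hypothetical sketch: wrap the trained attention decoder so that the attention
# mechanism is replaced by a small fully connected network, and freeze every
# parameter that came from the original decoder.
import torch.nn as nn

class ILMEstimator(nn.Module):
    def __init__(self, trained_decoder: nn.Module, dim: int = 512):
        super().__init__()
        self.decoder = trained_decoder           # decoder taken from the trained ASR model
        # Newly added fully connected network standing in for the attention mechanism.
        self.fnn_ilme = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )
        # Fix the parameters originally belonging to the decoder; only the
        # weights and biases of fnn_ilme remain trainable.
        for p in self.decoder.parameters():
            p.requires_grad = False
```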
Further, decoding is performed with the Beam Search algorithm, and the decoding score is computed as follows: the score of the external language model is added to the score of the end-to-end speech recognition model, and the score of the estimation model of the internal language model is then subtracted; each score is obtained by passing the normalized probability distribution output by the corresponding model through the natural logarithm function.
Further, the score contributions of the external language model and of the estimation model of the internal language model, relative to the end-to-end speech recognition model, are controlled by setting two separate fusion weights.
Further, the external language model is a recurrent neural network language model or a Transformer language model.
Further, the end-to-end speech recognition model must contain an attention mechanism.
Further, the encoder is a convolution-augmented Transformer (Conformer) encoder, a bidirectional long short-term memory (BLSTM) network encoder, or a Transformer encoder.
Further, the decoder is a long short-term memory (LSTM) network decoder or a Transformer decoder.
Further, the attention mechanism includes, but is not limited to, an additive attention mechanism, a position-sensitive attention mechanism or a monotonic attention mechanism.
Compared with the prior art, the invention has the following beneficial effects: the invention realizes estimation of the internal language model by changing the attention mechanism of the end-to-end speech recognition model, and can therefore be applied to every end-to-end speech recognition model that contains an attention mechanism. Compared with traditional language model fusion algorithms, the method greatly improves the effect of fusing the end-to-end speech recognition model with the language model.
Drawings
FIG. 1 is a block diagram illustrating an overall structure of a fusion method based on an end-to-end speech recognition model and a language model according to an embodiment;
FIG. 2 is a block diagram of an embodiment speech recognition model;
FIG. 3 is a block diagram of an internal language model estimation model modified from the decoder portion of the speech recognition model in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
As shown in fig. 1, fig. 2 and fig. 3, a fusion method based on an end-to-end speech recognition model and a language model is used. The speech recognition model selected in this embodiment is composed of a Conformer encoder, an additive attention mechanism and an LSTM decoder. The Conformer encoder has 12 layers, each 512-dimensional, and the encoder uses eight self-attention heads. Dropout is used during training to prevent the model from overfitting. The decoder consists of two layers of long short-term memory networks, each of width 2048. The language model is an RNN language model composed of three LSTM layers, each with a 2048-dimensional hidden layer; dropout is likewise used during training to prevent overfitting. In this example the speech recognition model is trained on a Chinese general-domain data set, and a medical dialogue data set is selected as the test set. To match the test set, the external language model in this example is trained on a large volume of medical-domain text. In this implementation, 40-dimensional MFCC features are extracted from the selected speech data as the input audio features, with a 25 ms feature-extraction window and a 10 ms window shift. The method specifically comprises the following steps:
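As an illustration, the feature extraction described above could be configured as follows with torchaudio; the 16 kHz sample rate and the file path are assumptions not stated in the embodiment.

```python
# Illustrative sketch: 40-dimensional MFCC features with a 25 ms window and
# a 10 ms shift, assuming 16 kHz audio (400-sample window, 160-sample hop).
import torchaudio

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=40,
    melkwargs={"n_fft": 400, "win_length": 400, "hop_length": 160, "n_mels": 80},
)
waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical path
features = mfcc(waveform)  # shape: (channels, 40, num_frames)
```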
S1, training the end-to-end speech recognition model with the speech-text pairs of the Chinese general-domain training set. During training of the end-to-end speech recognition model the learning rate decays exponentially: it starts at 0.015 and decays to 0.0015 after 400,000 iterations. The external language model is trained on the medical-domain text data; its learning rate likewise decays exponentially, starting at 0.015 and decaying to 0.0015 after 100,000 iterations;
the language model used in the implementation is a long-term and short-term memory network language model, and the formula is as follows:
Figure BDA0003543349360000051
wherein
Figure BDA0003543349360000052
A prediction output representing an external language model at the current time, yiRepresenting words entered at time i, yi-1...y0Text sequence composed of all words from 0 to the time before the current time, LSTM represents long-short term memory network, softmax represents activation function, and the text sequence is output after the activation function is activated by the softmax
Figure BDA0003543349360000061
Is the normalized probability.
As another embodiment, a Transformer language model is used in place of the long short-term memory network language model; it can be expressed as:

\hat{y}_i^{elm} = \mathrm{softmax}(\mathrm{Transformer}(y_{i-1}, \ldots, y_0))

where Transformer denotes a Transformer network and softmax denotes the activation function; the output \hat{y}_i^{elm} after the softmax activation is a normalized probability.
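A minimal sketch of such an external language model in PyTorch follows, using the 2048-dimensional hidden layers and three LSTM layers of this embodiment; the vocabulary size and the embedding choice are illustrative assumptions.

```python
# Sketch of the external LSTM language model: the history y_0 ... y_{i-1} is
# embedded, passed through a stack of LSTM layers, and a softmax yields the
# normalized probability of the next token.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 2048, layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, i) token ids y_0 ... y_{i-1}
        h, _ = self.lstm(self.embed(history))
        return torch.softmax(self.out(h[:, -1]), dim=-1)  # \hat{y}_i^{elm}
```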
S2, the decoder part of the trained speech recognition model is taken out to form an independent model: the internal language model estimation model is obtained by replacing the attention mechanism of the original speech recognition model with a fully connected network and removing the encoder part. The specific formulas are:

s_i^{ilm} = \mathrm{Decoder}(s_{i-1}^{ilm}, \hat{y}_{i-1}^{ilm})
q_i^{ilm} = \mathrm{FNN}_1(s_i^{ilm})
c_i^{ilm} = \mathrm{FNN}_{ilme}(q_i^{ilm})
\hat{y}_i^{ilm} = \mathrm{softmax}(\mathrm{FNN}_2([c_i^{ilm}; s_i^{ilm}]))

where all subscripts and superscripts denote the decoding moment. s_i^{ilm} is the hidden state of the autoregressive decoder at the i-th moment, computed jointly from the hidden state s_{i-1}^{ilm} at the previous moment and the predicted output \hat{y}_{i-1}^{ilm} of the previous moment. The hidden state is converted by the fully connected network FNN_1 into the query vector q_i^{ilm} of the current time. The content vector c_i^{ilm} is obtained directly from the query vector q_i^{ilm} through the fully connected network FNN_{ilme}. The FNN_{ilme} used here is a two-layer fully connected network of width 512 with RELU activations; [c_i^{ilm}; s_i^{ilm}] denotes the concatenation of the content vector and the hidden state at the current time. After the content vector of the current moment is obtained, it is concatenated with the hidden state of the current moment, as in the end-to-end speech recognition model, and sent to the fully connected network FNN_2; the normalized probability output of the current moment is obtained after the softmax activation function.
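The four formulas above can be read as one decoding step of the estimator. A hedged sketch follows, where decoder_cell, fnn1 and fnn2 stand for the trained decoder networks and fnn_ilme is the newly added fully connected network; all names are illustrative.

```python
# One decoding step of the internal-language-model estimation model: the
# content vector is produced by the new fully connected network instead of
# attention, so no encoder output is needed.
import torch

def ilme_step(decoder_cell, fnn1, fnn2, fnn_ilme, s_prev, y_prev_embed):
    s_i = decoder_cell(y_prev_embed, s_prev)         # s_i^{ilm} from s_{i-1}^{ilm}, y_{i-1}^{ilm}
    h_i = s_i[0] if isinstance(s_i, tuple) else s_i  # LSTM cells return (h, c)
    q_i = fnn1(h_i)                                  # query vector q_i^{ilm}
    c_i = fnn_ilme(q_i)                              # content vector c_i^{ilm}
    y_i = torch.softmax(fnn2(torch.cat([c_i, h_i], dim=-1)), dim=-1)
    return y_i, s_i                                  # normalized probability, new state
```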
S3, training the independent model by the text in the Chinese universal data set, wherein the learning rate during training is exponentially decreased to 0.015 at the beginning and exponentially decreased to 0.0015 after 10000 iterations. During training, the parameters originally belonging to the decoder in the fixed model only update the parameters of the newly added full-connection network, and the specific operation formula is as follows:
Figure BDA0003543349360000071
Figure BDA0003543349360000072
Figure BDA0003543349360000073
Figure BDA0003543349360000074
Lossilm=CE(yilm,GT)
Figure BDA0003543349360000075
wherein,
Figure BDA0003543349360000076
is the state of the autoregressive decoder, the hidden state being the hidden state at the previous decoding moment
Figure BDA0003543349360000077
And predicted output of previous time
Figure BDA0003543349360000078
And jointly calculating. The hidden state can be converted into the current query vector by a full-connection network
Figure BDA0003543349360000079
Content vector
Figure BDA00035433493600000710
By query vectors
Figure BDA00035433493600000711
Directly obtained after passing through a full-connection network and outputs a content vector by using a current time attention mechanism
Figure BDA00035433493600000712
And hidden state at the current time
Figure BDA00035433493600000713
The prediction output of the current moment can be obtained through a full-connection network after splicing
Figure BDA00035433493600000714
Figure BDA00035433493600000715
Figure BDA00035433493600000716
Representing the predicted output of the internal language model estimation model, GT representing the standard output at all times, CE representing the cross entropy Loss function, LossilmIs a loss function used to train the internal language model estimation model. Using gradient descent algorithm during training, and using loss function to parameters in newly-added fully-connected network
Figure BDA00035433493600000717
Partial derivative of (2)
Figure BDA00035433493600000718
And updating parameters in the newly-added fully-connected network, wherein the parameters needing to be updated comprise the weight parameters and the bias parameters in the newly-added fully-connected network. Meanwhile, the parameters in the encoder are fixed, and the parameters of the Decoder refer to the parameters in the Decoder network Decoder and FNN1And FNN2Weight and bias parameters in (1).
The converged model is called an estimation model for the internal language model;
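A hedged sketch of this training step is given below, assuming an estimator built as in the earlier sketch that returns per-step logits; the optimizer, decay factor and batch format are illustrative choices, not the patent's reference settings.

```python
# Train only the newly added fully connected network with cross-entropy
# against the ground-truth text; CrossEntropyLoss applies log-softmax
# internally, so the estimator is assumed to return raw logits here.
import torch
import torch.nn as nn

def train_ilm_estimator(ilm_estimator, text_batches, lr: float = 0.015):
    criterion = nn.CrossEntropyLoss()
    trainable = [p for p in ilm_estimator.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=lr)
    # gamma chosen so that 0.015 decays to roughly 0.0015 over 10,000 steps
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99977)

    for inputs, targets in text_batches:                    # text-only training data
        logits = ilm_estimator(inputs)                      # (batch, steps, vocab)
        loss = criterion(logits.transpose(1, 2), targets)   # Loss_ilm = CE(y_ilm, GT)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```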
S4, the medical dialogue data set is decoded with the Beam Search algorithm; the score during decoding is computed by adding the score of the external language model to the score of the speech recognition model and then subtracting the score of the estimation model of the internal language model. The fusion is calculated as:

score_i = \log \hat{y}_i^{s2s} + \lambda_{elm} \log \hat{y}_i^{elm} - \lambda_{ilm} \log \hat{y}_i^{ilm}

where \hat{y}_i^{s2s}, \hat{y}_i^{elm} and \hat{y}_i^{ilm} denote the normalized output probabilities of the speech recognition model, the external language model and the estimated internal language model at the current moment, respectively, and log denotes the natural logarithm. \lambda_{elm} and \lambda_{ilm} denote the fusion weights of the external language model and the internal language model; both need to be set according to the specific situation, and the effect is usually best when the two are equal. score_i is the fused score at the current moment; the subsequent beam search algorithm decodes according to this score and outputs the final result.
S5, the decoding result is the decoding result of the fusion of the speech recognition model and the external language model.
Example 2
The end-to-end speech recognition model used in this implementation can be expressed as the following structure: the speech to be recognized is turned into a feature sequence by feature extraction and input into the encoder of the speech recognition model; the encoder output is passed to the attention mechanism module and stored for later use. The decoder calculates the predicted output of the current moment in an autoregressive manner, combined with the attention mechanism, according to the following formulas:

H = \mathrm{Conformer}(X)
s_i = \mathrm{Decoder}(s_{i-1}, \hat{y}_{i-1})
q_i = \mathrm{FNN}_1(s_i)
c_i = \mathrm{Attention}(H, q_i)
\hat{y}_i = \mathrm{softmax}(\mathrm{FNN}_2([c_i; s_i]))

where X = [x_1, x_2, ..., x_t, ..., x_T] is the sequence of audio features to be recognized, x_t denotes the audio features of the t-th frame, X \in R^{T \times d}, T is the length of the audio sequence and d is the feature dimension; H = [h_1, h_2, ..., h_t, ..., h_T] is the output after encoding by the encoder, and h_t is the encoded output corresponding to the acoustic features at time t. s_i is the state of the autoregressive decoder, computed jointly from the hidden state s_{i-1} at the previous decoding moment and the predicted output of the previous moment. The hidden state is converted into the query vector q_i at the current moment by a fully connected network; the attention mechanism calculates the corresponding content vector c_i from this query vector; and the content vector output by the attention mechanism at the current moment and the hidden state s_i at the current moment are concatenated and passed through a fully connected network to obtain the prediction output \hat{y}_i of the current moment. This process is repeated in an autoregressive manner until the stop symbol is predicted and decoding stops.
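For illustration, the autoregressive decoding loop above can be sketched as follows (greedy decoding for brevity; encoder, decoder_cell, fnn1, fnn2, attention and embed stand for the corresponding trained modules, and the start/stop token handling is an assumption).

```python
# Greedy sketch of the attention-based decoding loop: encode once, then
# repeat decoder steps until the stop symbol is predicted.
import torch

def greedy_decode(encoder, decoder_cell, fnn1, fnn2, attention, embed,
                  x, sos_id, eos_id, max_len=200):
    H = encoder(x)                                   # H = Conformer(X)
    s_i, y_prev = None, torch.tensor([sos_id])
    hypothesis = []
    for _ in range(max_len):
        s_i = decoder_cell(embed(y_prev), s_i)       # s_i from s_{i-1} and y_{i-1}
        h_i = s_i[0] if isinstance(s_i, tuple) else s_i
        q_i = fnn1(h_i)                              # query vector q_i
        c_i = attention(H, q_i)                      # content vector c_i
        y_i = torch.softmax(fnn2(torch.cat([c_i, h_i], dim=-1)), dim=-1)
        token = int(y_i.argmax(dim=-1))
        if token == eos_id:                          # stop symbol: end decoding
            break
        hypothesis.append(token)
        y_prev = torch.tensor([token])
    return hypothesis
```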
The attention mechanism in the end-to-end speech recognition model serves the following function: through the attention mechanism, the decoder acquires the acoustic information processed by the encoder.
As another embodiment, a bidirectional long short-term memory network (BLSTM) encoder or a Transformer encoder may be used in place of the Conformer encoder; these two encoders can be expressed respectively as:
H = \mathrm{BLSTM}(X)
H = \mathrm{Transformer}(X).
example 3
This implementation of the fusion method based on an end-to-end speech recognition model and a language model comprises the following steps:
S1, training an end-to-end speech recognition model with speech-text pairs, and training an external language model with text data; the end-to-end speech recognition model comprises an encoder, a decoder and an attention mechanism, and the decoder acquires the acoustic information processed by the encoder through the attention mechanism;
the encoder is a Conformer encoder, a BLSTM encoder or a Transformer encoder;
the decoder is an LSTM decoder or a transform decoder;
the attention mechanism is an additive attention mechanism, a position-sensitive attention mechanism or a monotonic attention mechanism. The end-to-end speech recognition model can also be formed by combining the modules.
S2, the trained decoder of the end-to-end speech recognition model is taken out separately, and the attention mechanism of the model is replaced with a two-layer fully connected network of width 512 to form an independent model. The two added layers of this fully connected network use RELU activation functions; the final output layer of the model uses no activation function;
S3, training the independent model on the text part of the training data; during training, the parameters originally belonging to the decoder of the end-to-end speech recognition model are fixed and only the parameters of the newly added fully connected layers are updated. The learning rate starts at 0.015 and decays exponentially to 0.0015 after 10,000 iterations. The estimation model of the internal language model is obtained after convergence;
S4, decoding with the fused scores of the end-to-end speech recognition model, the external language model and the estimation model of the internal language model to obtain the decoding result; the Beam Search should use a relatively large beam size as the decoding parameter, and the beam size is 60 in this example.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (10)

1. A fusion method based on an end-to-end speech recognition model and a language model is characterized by comprising the following steps:
S1, training an end-to-end speech recognition model with speech-text pairs, and training an external language model with text data; the end-to-end speech recognition model includes an encoder, a decoder and an attention mechanism;
S2, taking the trained decoder of the end-to-end speech recognition model out separately to form an independent model;
S3, training the independent model separately on the text part of the training data, and obtaining an estimation model of the internal language model after convergence;
S4, decoding with the fused scores of the end-to-end speech recognition model, the external language model and the estimation model of the internal language model to obtain the decoding result.
2. The method according to claim 1, wherein the trained decoder of the end-to-end speech recognition model is taken out separately to form an independent model, specifically: the topology of the decoder is modified to form the independent model by replacing the attention mechanism with a fully connected network.
3. The method according to claim 2, wherein the independent model is trained separately on the text part of the data originally used to train the end-to-end speech recognition model, and the estimation model of the internal language model is obtained after convergence;
during training, the parameters originally belonging to the decoder are fixed and only the parameters of the added fully connected network are updated; the parameters to be updated comprise the weights and biases of the newly added fully connected network, and the estimation model of the internal language model is obtained after convergence.
4. The fusion method of claim 1, wherein decoding is performed with a Beam Search algorithm and the decoding score is calculated as follows: the score of the external language model is added to the score of the end-to-end speech recognition model, and the score of the estimation model of the internal language model is then subtracted; each score is obtained by passing the normalized probability distribution output by the corresponding model through the natural logarithm function.
5. The method of claim 4, wherein the score weights of the end-to-end speech recognition model, the external language model and the internal language model are controlled by setting two fusion weights.
6. The method of claim 1, wherein the external language model is a recurrent neural network language model.
7. The method of claim 1, wherein the external language model is a Transformer language model.
8. The method of claim 1, wherein the encoder is a convolution-augmented Transformer (Conformer) encoder, a bidirectional long short-term memory network encoder, or a Transformer encoder.
9. The method of claim 1, wherein the decoder is a long-short term memory network decoder or a Transformer decoder.
10. The method for fusing an end-to-end speech recognition model and a language model according to any one of claims 1 to 9, wherein the attention mechanism includes, but is not limited to, an additive attention mechanism, a position-sensitive attention mechanism or a monotonic attention mechanism.
CN202210242872.3A 2022-03-11 2022-03-11 Fusion method based on end-to-end voice recognition model and language model Pending CN114596843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210242872.3A CN114596843A (en) 2022-03-11 2022-03-11 Fusion method based on end-to-end voice recognition model and language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210242872.3A CN114596843A (en) 2022-03-11 2022-03-11 Fusion method based on end-to-end voice recognition model and language model

Publications (1)

Publication Number Publication Date
CN114596843A true CN114596843A (en) 2022-06-07

Family

ID=81809381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210242872.3A Pending CN114596843A (en) 2022-03-11 2022-03-11 Fusion method based on end-to-end voice recognition model and language model

Country Status (1)

Country Link
CN (1) CN114596843A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114944148A (en) * 2022-07-09 2022-08-26 昆明理工大学 Streaming Vietnamese speech recognition method fusing external language knowledge
CN114944148B (en) * 2022-07-09 2023-08-22 昆明理工大学 Streaming Vietnam voice recognition method integrating external language knowledge
CN117351955A (en) * 2023-10-11 2024-01-05 黑龙江大学 Helicopter voice recognition method and device based on transfer learning and end-to-end model

Similar Documents

Publication Publication Date Title
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
CN108305617B (en) Method and device for recognizing voice keywords
CN110189749B (en) Automatic voice keyword recognition method
CN111199727B (en) Speech recognition model training method, system, mobile terminal and storage medium
CN111382582B (en) Neural machine translation decoding acceleration method based on non-autoregressive
CN114023316A (en) TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN111968629A (en) Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN114596843A (en) Fusion method based on end-to-end voice recognition model and language model
CN112052692A (en) Mongolian Chinese neural machine translation method based on grammar supervision and deep reinforcement learning
CN112967739B (en) Voice endpoint detection method and system based on long-term and short-term memory network
CN111783477B (en) Voice translation method and system
CN110532555B (en) Language evaluation generation method based on reinforcement learning
US20220044671A1 (en) Spoken language understanding
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
CN113241075A (en) Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN113488028A (en) Speech transcription recognition training decoding method and system based on rapid skip decoding
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN112967720A (en) End-to-end voice-to-text model optimization method under small amount of accent data
CN115440197A (en) Voice recognition method and system based on domain classification and hot word prefix tree cluster search
CN118471201B (en) Efficient self-adaptive hotword error correction method and system for speech recognition engine
Macoskey et al. Bifocal neural asr: Exploiting keyword spotting for inference optimization
CN116227503A (en) CTC-based non-autoregressive end-to-end speech translation method
CN114999460A (en) Lightweight Chinese speech recognition method combined with Transformer
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination