CN114596843A - Fusion method based on end-to-end voice recognition model and language model - Google Patents
Fusion method based on end-to-end voice recognition model and language model
- Publication number
- CN114596843A (application CN202210242872.3A)
- Authority
- CN
- China
- Prior art keywords
- model
- speech recognition
- language model
- recognition model
- decoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
S1, training an end-to-end speech recognition model with speech-text pairs, and training an external language model with text data; S2, taking out the trained decoder of the speech recognition model separately to form an independent model; S3, training the independent model on the text portion of the training data and obtaining an estimation model of the internal language model after convergence; S4, fusing the scores of the speech recognition model, the external language model and the estimated internal language model during decoding to obtain the decoding result. The method improves the recognition accuracy obtained after fusing the speech recognition model with the language model and has broad application prospects in the field of speech recognition.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a fusion method based on an end-to-end speech recognition model and a language model.
Background
The most classical speech recognition approach today combines a Hidden Markov Model (HMM) with a Deep Neural Network (DNN). Although this approach exploits the short-time stationarity of the speech signal well, it still suffers from a multi-model cascade of acoustic model, pronunciation dictionary and language model, inconsistent training objectives across the models, and a large decoding space. End-to-end speech recognition simplifies the whole recognition pipeline, with a single, consistent training objective.
Current end-to-end speech recognition models fall mainly into three categories: Connectionist Temporal Classification (CTC), the Recurrent Neural Network Transducer (RNN-Transducer), and attention-based sequence models (Attention-based End-to-End Model, A-E2E). Attention-based sequence models align frame-level speech signals with character sequences through an attention mechanism and achieve the highest accuracy among end-to-end speech recognition models. Such an end-to-end framework is broadly divided into three parts: an encoder, a decoder and an attention mechanism. Obtaining a language model that improves recognition is equally important. The mainstream algorithm for fusing a language model with a speech recognition model is Shallow Fusion (SF). This technique works well for traditional speech recognition models, but the improvement it brings to end-to-end models is very limited. The main reason is that, unlike a traditional speech recognition model, an end-to-end model models the entire sentence and therefore inevitably learns an Internal Language Model (ILM). This internal language model interferes with the fusion of the speech recognition model and the external language model. As end-to-end models have become more widely used, more and more solutions have been proposed, the best known being the Density Ratio method proposed by Masashi Sugiyama. In this method, a small language model is trained on the transcripts used to train the speech recognition model to approximate the ILM, and this approximate ILM is subtracted when the speech recognition model is fused with the external language model, thereby reducing the influence of the ILM. Building on the Density Ratio method, Microsoft proposed Internal Language Model Estimation (ILME), which estimates the language model inside the speech recognition model directly and more accurately, so that a more accurate ILM can be subtracted in the fusion stage, yielding a large performance improvement. However, the ILME method proposed by Microsoft applies only to end-to-end speech recognition models with a bidirectional long short-term memory (BLSTM) network encoder and cannot be applied to the more recent Transformer and Conformer encoders, which greatly limits its applicability. Moreover, the method has no adaptation mechanism, so even an end-to-end model with a BLSTM encoder cannot obtain the best effect ("Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition").
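For reference, the three fusion strategies discussed above can be summarized schematically as follows; this is a sketch of the decoding-time scoring only, and the weights and normalization (including the symbols λ_src and P_srcLM) vary across publications and are not taken from the patent itself:

```latex
% Shallow Fusion (SF): the external LM score is simply added to the ASR score
\mathrm{score}_{SF}(y \mid x) = \log P_{ASR}(y \mid x) + \lambda_{elm}\,\log P_{ELM}(y)

% Density Ratio: a source-domain LM trained on the ASR transcripts approximates the ILM and is subtracted
\mathrm{score}_{DR}(y \mid x) = \log P_{ASR}(y \mid x) + \lambda_{elm}\,\log P_{ELM}(y) - \lambda_{src}\,\log P_{srcLM}(y)

% ILME: the internal LM estimated from the ASR model itself is subtracted
\mathrm{score}_{ILME}(y \mid x) = \log P_{ASR}(y \mid x) + \lambda_{elm}\,\log P_{ELM}(y) - \lambda_{ilm}\,\log P_{ILM}(y)
```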
Disclosure of Invention
Addressing the shortcomings of existing language model fusion and internal language model estimation techniques, the invention provides a fusion method based on an end-to-end speech recognition model and a language model, mainly solving the problem that existing internal language model estimation algorithms lack adaptive capability. In addition, the prior art estimates the internal language model with limited accuracy, so the precision gain at fusion time is limited. The main application scenario of the invention is the attention-based end-to-end speech recognition model, hereafter simply called end-to-end speech recognition: the internal language model inside the end-to-end model is estimated through model training, and the estimated internal language model is subtracted during inference and decoding. Compared with traditional language model fusion, the method greatly improves the recognition accuracy of the fused end-to-end speech recognition model and external language model. Moreover, the method applies to all attention-based speech recognition models, including models with a Conformer encoder, a BLSTM encoder or a Transformer encoder, as well as models with an LSTM decoder or a Transformer decoder, so its range of application is wider.
The invention is realized by at least one of the following technical schemes.
A fusion method based on an end-to-end speech recognition model and a language model comprises the following steps:
s1, training an end-to-end speech recognition model by using speech and text pairs, and training an external language model by using text data; the end-to-end speech recognition model includes an encoder, a decoder, and an attention mechanism;
s2, independently taking out the trained decoder of the end-to-end speech recognition model and forming an independent model;
s3, training the independent model separately using the text portion of the training data, and obtaining an estimation model of the internal language model after convergence;
and S4, fusing the scores of the end-to-end speech recognition model, the external language model and the estimated internal language model during decoding to obtain the decoding result.
Further, the trained decoder of the end-to-end speech recognition model is taken out separately to form an independent model, specifically: the topology of the decoder is modified by replacing the attention mechanism with a fully connected network.
Further, the independent model is trained separately with the text portion of the data used to train the original end-to-end speech recognition model; during training, the parameters originally belonging to the decoder are fixed and only the parameters of the added fully connected network are updated, these being the weight and bias parameters of the newly added fully connected network. An estimation model of the internal language model is obtained after convergence.
Further, decoding is performed with the Beam Search algorithm, and the score used in decoding is calculated as: the score of the end-to-end speech recognition model plus the score of the external language model, minus the score of the estimation model of the internal language model; each score is obtained by passing the normalized probability distribution output by the corresponding model through the natural logarithm function.
Further, the relative score weights of the end-to-end speech recognition model, the external language model and the estimation model of the internal language model are controlled by setting two fusion weights, one for each language model.
Further, the external language model is a recurrent neural network language model or a Transformer language model.
Further, the end-to-end speech recognition model must contain an attention mechanism.
Further, the encoder is a convolution-augmented Transformer (Conformer) encoder, a bidirectional long short-term memory network encoder, or a Transformer encoder.
Further, the decoder is a long short-term memory network decoder or a Transformer decoder.
Further, the attention mechanism includes, but is not limited to, an additive attention mechanism, a position-sensitive attention mechanism, or a monotonic attention mechanism.
Compared with the prior art, the invention has the following beneficial effects: the estimation of the internal language model is realized by changing the attention mechanism of the end-to-end speech recognition model, so the method can be applied to every end-to-end speech recognition model that uses an attention mechanism. Compared with traditional language model fusion algorithms, the method greatly improves the effect of fusing the end-to-end speech recognition model and the language model.
Drawings
FIG. 1 is a block diagram illustrating an overall structure of a fusion method based on an end-to-end speech recognition model and a language model according to an embodiment;
FIG. 2 is a block diagram of an embodiment speech recognition model;
FIG. 3 is a block diagram of an internal language model estimation model modified from the decoder portion of the speech recognition model in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
As shown in Fig. 1, Fig. 2 and Fig. 3, a fusion method based on an end-to-end speech recognition model and a language model is used; the speech recognition model selected in this embodiment is composed of a Conformer encoder, an additive attention mechanism and an LSTM decoder. The Conformer encoder has 12 layers, each 512-dimensional, with eight self-attention heads per encoder layer. Dropout is used during training to prevent overfitting. The decoder consists of two layers of long short-term memory networks, each of width 2048. The language model is an RNN language model composed of three LSTM layers, each with a 2048-dimensional hidden layer; dropout is likewise used during training to prevent overfitting. In this example, the speech recognition model is trained on a Chinese general-purpose data set and a medical dialogue data set is selected as the test set. To match the test set, the external language model is trained on a large volume of medical-domain text. In the implementation, 40-dimensional MFCC features are extracted from the selected speech data as input audio features, with a 25 ms analysis window and a 10 ms window shift.
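As an illustration of the front end described above, the 40-dimensional MFCC features with a 25 ms window and 10 ms shift could be extracted as in the following sketch; the file name and the 16 kHz sampling rate are assumptions, since the patent does not specify them:

```python
import librosa

# Assumed 16 kHz audio; the patent does not state the sampling rate.
audio, sr = librosa.load("utterance.wav", sr=16000)

# 40-dimensional MFCCs, 25 ms analysis window, 10 ms window shift, as described above.
mfcc = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=40,
    n_fft=int(0.025 * sr),       # 25 ms window -> 400 samples at 16 kHz
    win_length=int(0.025 * sr),
    hop_length=int(0.010 * sr),  # 10 ms shift -> 160 samples at 16 kHz
)
# mfcc has shape (40, T): one 40-dimensional feature vector per frame.
```

The method specifically comprises the following steps: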
s1, the end-to-end speech recognition model is trained with the speech-text pairs of the Chinese general-purpose training set. The learning rate decays exponentially during training, starting at 0.015 and decaying to 0.0015 after 400,000 iterations. The external language model is trained with the medical corpus text data, also with an exponentially decaying learning rate that starts at 0.015 and decays to 0.0015 after 100,000 iterations;
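The exponential learning-rate decay described above could be implemented as in the following sketch; the geometric interpolation form is an assumption, since the patent only gives the start value, the end value and the number of iterations:

```python
def exponential_lr(step, total_steps, lr_start=0.015, lr_end=0.0015):
    """Exponentially interpolate the learning rate from lr_start to lr_end over total_steps."""
    progress = min(step / total_steps, 1.0)
    return lr_start * (lr_end / lr_start) ** progress

# Speech recognition model: 0.015 -> 0.0015 over 400,000 iterations.
lr_asr = exponential_lr(step=200_000, total_steps=400_000)
# External language model: 0.015 -> 0.0015 over 100,000 iterations.
lr_elm = exponential_lr(step=50_000, total_steps=100_000)
```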
the language model used in the implementation is a long-term and short-term memory network language model, and the formula is as follows:
whereinA prediction output representing an external language model at the current time, yiRepresenting words entered at time i, yi-1...y0Text sequence composed of all words from 0 to the time before the current time, LSTM represents long-short term memory network, softmax represents activation function, and the text sequence is output after the activation function is activated by the softmaxIs the normalized probability.
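A minimal PyTorch sketch of such an LSTM language model follows; the three layers and 2048-dimensional hidden states match the description above, while the vocabulary size and embedding dimension are assumptions:

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=512, hidden_dim=2048, num_layers=3, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) indices of y_0 ... y_{i-1}
        h, _ = self.lstm(self.embedding(tokens))
        # Log of the normalized probability of the next word at every position.
        return torch.log_softmax(self.proj(h), dim=-1)
```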
As another embodiment, a Transformer language model can replace the long short-term memory network language model; it can be expressed as:

$P_{ELM}(y_i \mid y_{i-1}, \ldots, y_0) = \mathrm{softmax}(\mathrm{Transformer}(y_{i-1}, \ldots, y_0))$

where Transformer denotes a Transformer network and softmax denotes the activation function, whose output is the normalized probability.
And S2, the decoder part of the trained speech recognition model is taken out to form an independent model; the internal language model estimation model is obtained by replacing the attention mechanism of the original speech recognition model with a fully connected network and removing the encoder part. The specific formulas are as follows:

$s_i^{ilm} = \mathrm{Decoder}(s_{i-1}^{ilm}, \hat{y}_{i-1}^{ilm})$

$q_i^{ilm} = \mathrm{FNN}_1(s_i^{ilm})$

$c_i^{ilm} = \mathrm{FNN}_{ilme}(q_i^{ilm})$

$\hat{y}_i^{ilm} = \mathrm{softmax}(\mathrm{FNN}_2([c_i^{ilm}; s_i^{ilm}]))$

where all indices denote the decoding time step. The hidden state $s_i^{ilm}$ of the autoregressive decoder at time i is computed jointly from the hidden state at the previous time and the predicted output of the previous time. The hidden state is converted by the fully connected network $\mathrm{FNN}_1$ into the query vector $q_i^{ilm}$ of the current time. The content vector $c_i^{ilm}$ is obtained directly by passing the query vector through the fully connected network $\mathrm{FNN}_{ilme}$, a two-layer fully connected network of width 512 with RELU activation; $[c_i^{ilm}; s_i^{ilm}]$ denotes the concatenation of the content vector with the hidden state at the current time. After the content vector of the current time is obtained, it is concatenated with the hidden state of the current time, fed into the fully connected network $\mathrm{FNN}_2$, and the normalized probability output of the current time is obtained after the softmax activation function.
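The following PyTorch-style sketch illustrates one decoding step of the internal language model estimation model, with the attention mechanism replaced by the two-layer, 512-wide RELU network described above; the module names and the assumption that the decoder is an LSTM cell are illustrative, not taken verbatim from the patent:

```python
import torch
import torch.nn as nn

class ILMEStep(nn.Module):
    """One autoregressive step of the ILM estimation model (encoder and attention removed).

    decoder_cell, fnn1 and fnn2 are the frozen modules taken from the trained ASR decoder;
    fnn_ilme is the newly added network that replaces the attention mechanism.
    """
    def __init__(self, decoder_cell, fnn1, fnn2, query_dim=512, content_dim=512):
        super().__init__()
        self.decoder_cell = decoder_cell   # e.g. an nn.LSTMCell from the ASR decoder
        self.fnn1 = fnn1                   # query projection FNN_1
        self.fnn2 = fnn2                   # output network FNN_2 (followed by softmax)
        self.fnn_ilme = nn.Sequential(     # two-layer, 512-wide RELU replacement for attention
            nn.Linear(query_dim, 512), nn.ReLU(),
            nn.Linear(512, content_dim), nn.ReLU(),
        )

    def forward(self, prev_output_embed, prev_state):
        # s_i: hidden state from the previous state and the previous predicted output.
        h, c = self.decoder_cell(prev_output_embed, prev_state)
        q_i = self.fnn1(h)                 # query vector of the current time
        c_i = self.fnn_ilme(q_i)           # content vector, produced without any acoustic input
        logits = self.fnn2(torch.cat([c_i, h], dim=-1))
        return torch.log_softmax(logits, dim=-1), (h, c)
```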
S3, the independent model is trained with the text of the Chinese general-purpose data set; the learning rate decays exponentially during training, starting at 0.015 and decaying to 0.0015 after 10,000 iterations. During training, the parameters of the model that originally belonged to the decoder are fixed and only the parameters of the newly added fully connected network are updated. The training objective is:

$\mathrm{Loss}_{ilm} = \mathrm{CE}(y_{ilm}, GT)$

where $y_{ilm}$ denotes the predicted outputs of the internal language model estimation model, computed autoregressively as in the formulas above, GT denotes the reference outputs at all times, CE denotes the cross-entropy loss function, and $\mathrm{Loss}_{ilm}$ is the loss used to train the internal language model estimation model. Training uses the gradient descent algorithm; the partial derivatives of the loss function with respect to the parameters of the newly added fully connected network are used to update those parameters, namely the weight and bias parameters of the newly added fully connected network. Meanwhile, the parameters inherited from the original model are fixed; here the fixed decoder parameters refer to the weight and bias parameters of the decoder network Decoder and of $\mathrm{FNN}_1$ and $\mathrm{FNN}_2$.
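A sketch of the corresponding training step is given below: every parameter inherited from the decoder is frozen and only the newly added network is updated. Batching, the frozen output embedding and the initial decoder state are assumptions rather than details specified by the patent:

```python
import torch
import torch.nn as nn

# ilme_model is an ILMEStep-style model; freeze everything except the added fnn_ilme network.
for name, param in ilme_model.named_parameters():
    param.requires_grad = name.startswith("fnn_ilme")

optimizer = torch.optim.SGD(
    [p for p in ilme_model.parameters() if p.requires_grad], lr=0.015)
criterion = nn.NLLLoss()  # cross-entropy computed on the model's log probabilities

def train_step(token_ids, embed, init_state):
    """token_ids: (batch, seq_len) reference text GT; embed: frozen output embedding."""
    state, loss = init_state, 0.0
    for i in range(1, token_ids.size(1)):
        log_probs, state = ilme_model(embed(token_ids[:, i - 1]), state)
        loss = loss + criterion(log_probs, token_ids[:, i])  # Loss_ilm = CE(y_ilm, GT)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```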
The converged model is called an estimation model for the internal language model;
s4, the medical dialogue data set is decoded with the Beam Search algorithm; the score during decoding is computed as the score of the speech recognition model plus the score of the external language model, minus the score of the estimated internal language model. The fusion is computed as:

$\mathrm{score}_i = \log P_{ASR}(y_i \mid \cdot) + \lambda_{elm}\,\log P_{ELM}(y_i \mid \cdot) - \lambda_{ilm}\,\log P_{ILM}(y_i \mid \cdot)$

where $P_{ASR}$, $P_{ELM}$ and $P_{ILM}$ respectively denote the normalized output probabilities of the speech recognition model, the external language model and the estimated internal language model at the current time, and log denotes the natural logarithm. $\lambda_{elm}$ and $\lambda_{ilm}$ denote the fusion weights of the external and internal language models respectively; both need to be set according to the specific task, and the effect is better when the two parameters are equal. $\mathrm{score}_i$ is the fused score at the current time, on which the subsequent beam search algorithm decodes and outputs the final result.
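A minimal sketch of this fused scoring inside one beam search step is shown below; the three models are assumed to expose per-step log probabilities over the vocabulary, and the weight values are illustrative only:

```python
import torch

def fused_step_scores(asr_log_probs, elm_log_probs, ilm_log_probs,
                      lambda_elm=0.3, lambda_ilm=0.3):
    """Fused per-token scores for one decoding step.

    Each argument is a (beam, vocab) tensor of log probabilities from the speech
    recognition model, the external LM and the estimated internal LM respectively.
    The patent only notes the two weights should be tuned and work well when equal.
    """
    return asr_log_probs + lambda_elm * elm_log_probs - lambda_ilm * ilm_log_probs

# Inside beam search, these scores are added to each hypothesis' running score and
# the top beam_size extensions are kept at every step.
```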
S5, the decoding result obtained in this way is the result of fusing the speech recognition model with the external language model.
Example 2
The structure of the end-to-end speech recognition model used in this implementation can be expressed as follows: the speech to be recognized is converted into a feature sequence by feature extraction and fed into the encoder of the speech recognition model; the encoded output is passed to the attention mechanism module and held for later use. The decoder computes the predicted output of the current time autoregressively, together with the attention mechanism, according to the following formulas:
H=Conformer(X)
qi=FNN1(si)
ci=Attention(H,qi)
where $X = [x_1, x_2, \ldots, x_t, \ldots, x_T]$ is the audio feature sequence to be recognized, $x_t$ denotes the audio features of the t-th frame, $X \in \mathbb{R}^{T \times d}$, T is the length of the audio sequence and d is the feature dimension; $H = [h_1, h_2, \ldots, h_t, \ldots, h_T]$ is the output after encoding by the encoder, and $h_t$ is the encoded output corresponding to the acoustic features at time t. $s_i$ is the hidden state of the autoregressive decoder, computed jointly from the hidden state $s_{i-1}$ of the previous decoding time and the predicted output of the previous time. The hidden state is converted into the query vector $q_i$ of the current time by a fully connected network, and the attention mechanism computes the corresponding content vector $c_i$ from the query vector. The content vector output by the attention mechanism at the current time is concatenated with the hidden state $s_i$ and passed through a fully connected network to obtain the predicted output of the current time. The process is then repeated autoregressively until the stop symbol is predicted and decoding stops.
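The forward computation described above could be sketched as follows; the encoder output, attention module and embedding are assumed to be provided, and the names mirror the symbols in the formulas:

```python
import torch

def decode_step(h_enc, prev_token_embed, prev_state, decoder_cell, fnn1, attention, fnn2):
    """One autoregressive step of the attention-based ASR decoder.

    h_enc: encoder output H = Conformer(X); prev_state: previous LSTM decoder state.
    """
    # s_i: hidden state from the previous state s_{i-1} and the previous predicted output.
    h, c = decoder_cell(prev_token_embed, prev_state)
    q_i = fnn1(h)                # query vector q_i = FNN_1(s_i)
    c_i = attention(h_enc, q_i)  # content vector c_i = Attention(H, q_i)
    logits = fnn2(torch.cat([c_i, h], dim=-1))
    return torch.log_softmax(logits, dim=-1), (h, c)

# The step is repeated, feeding back the embedded prediction, until the stop symbol is produced.
```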
The attention mechanism in the end-to-end speech recognition model has the following functions: the decoder acquires the acoustic information processed by the encoder through an attention mechanism.
As another embodiment, a bidirectional long-short term memory network encoder or a Transformer encoder may be applied instead of the former encoder, and these two encoders may be respectively expressed as:
H=BLSTM(X)
H=Transformer(X)。
example 3
This implementation of the fusion method based on an end-to-end speech recognition model and a language model comprises the following steps:
s1, training an end-to-end speech recognition model by using speech and text pairs, and training an external language model by using text data; the end-to-end speech recognition model comprises an encoder, a decoder and an attention mechanism, and the decoder acquires acoustic information processed by the encoder through the attention mechanism;
the encoder is a Conformer encoder, a BLSTM encoder or a Transformer encoder;
the decoder is an LSTM decoder or a transform decoder;
the attention mechanism is an additive attention mechanism, a position-sensitive attention mechanism or a monotonic attention mechanism. The end-to-end speech recognition model can also be formed by combining the modules.
And S2, the trained decoder of the end-to-end speech recognition model is taken out separately, and the attention mechanism of the model is replaced with a two-layer fully connected network of width 512 to form an independent model. The first two layers of this fully connected network use RELU activation functions, and the final output layer uses no activation function;
and S3, the independent model is trained separately with the text portion of the training data; during training, the parameters originally belonging to the decoder of the end-to-end speech recognition model are fixed and only the parameters of the newly added fully connected layers are updated. The learning rate decays exponentially, starting at 0.015 and decaying to 0.0015 after 10,000 iterations. The estimation model of the internal language model is obtained after convergence;
and S4, the scores of the end-to-end speech recognition model, the external language model and the estimated internal language model are fused during decoding to obtain the decoding result; the Beam Search should use a relatively large beam size as its decoding parameter, and the beam size in this example is 60.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims (10)
1. A fusion method based on an end-to-end speech recognition model and a language model is characterized by comprising the following steps:
s1, training an end-to-end speech recognition model by using speech and text pairs, and training an external language model by using text data; the end-to-end speech recognition model includes an encoder, a decoder, and an attention mechanism;
s2, independently taking out the trained decoder of the end-to-end speech recognition model and forming an independent model;
s3, training the independent model separately using the text portion of the training data, and obtaining an estimation model of the internal language model after convergence;
and S4, fusing the scores of the end-to-end speech recognition model, the external language model and the estimated internal language model during decoding to obtain a decoding result.
2. The method according to claim 1, wherein the trained decoder of the end-to-end speech recognition model is taken out separately to form an independent model, specifically: the topology of the decoder is modified by replacing the attention mechanism with a fully connected network.
3. The method according to claim 2, wherein the independent model is trained separately with the text portion of the data used to train the original end-to-end speech recognition model; during training, the parameters originally belonging to the decoder are fixed and only the parameters of the added fully connected network are updated, the parameters to be updated comprising the weight and bias parameters of the newly added fully connected network, and an estimation model of the internal language model is obtained after convergence.
4. The fusion method of claim 1, wherein decoding is performed with a Beam Search algorithm, and the decoding score is calculated as: the score of the end-to-end speech recognition model plus the score of the external language model, minus the score of the estimation model of the internal language model; each score is obtained by passing the normalized probability distribution output by the corresponding model through the natural logarithm function.
5. The method of claim 4, wherein the score weights of the end-to-end speech recognition model, the external language model and the internal language model are controlled by setting two fusion weights.
6. The method of claim 1, wherein the external language model is a recurrent neural network language model.
7. The method of claim 1, wherein the external language model is a Transformer language model.
8. The method of claim 1, wherein the encoder is a convolution-augmented Transformer (Conformer) encoder, a bidirectional long short-term memory network encoder, or a Transformer encoder.
9. The method of claim 1, wherein the decoder is a long-short term memory network decoder or a Transformer decoder.
10. The method for fusing an end-to-end speech recognition model and a language model according to any one of claims 1 to 9, wherein the attention mechanism includes, but is not limited to, an additive attention mechanism, a position-sensitive attention mechanism, or a monotonic attention mechanism.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210242872.3A CN114596843A (en) | 2022-03-11 | 2022-03-11 | Fusion method based on end-to-end voice recognition model and language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210242872.3A CN114596843A (en) | 2022-03-11 | 2022-03-11 | Fusion method based on end-to-end voice recognition model and language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114596843A true CN114596843A (en) | 2022-06-07 |
Family
ID=81809381
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210242872.3A Pending CN114596843A (en) | 2022-03-11 | 2022-03-11 | Fusion method based on end-to-end voice recognition model and language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114596843A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114944148A (en) * | 2022-07-09 | 2022-08-26 | 昆明理工大学 | Streaming Vietnamese speech recognition method fusing external language knowledge |
CN114944148B (en) * | 2022-07-09 | 2023-08-22 | 昆明理工大学 | Streaming Vietnam voice recognition method integrating external language knowledge |
CN117351955A (en) * | 2023-10-11 | 2024-01-05 | 黑龙江大学 | Helicopter voice recognition method and device based on transfer learning and end-to-end model |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 