CN118038873A - Speech recognition text error correction method based on pronunciation guidance - Google Patents
Speech recognition text error correction method based on pronunciation guidance
- Publication number
- CN118038873A (application number CN202410163742.XA)
- Authority
- CN
- China
- Prior art keywords
- pronunciation
- decoder
- bart
- sequence
- error correction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
Abstract
The invention relates to a speech recognition text error correction method based on pronunciation guidance. The multi-granularity pronunciation feature encoding module consists of a unidirectional GRU and four Transformer layers, and extracts pronunciation features from the pinyin sequence. The pronunciation and semantic representation fusion module fuses these features with the last-layer hidden state of the BART encoder through a gating unit, and feeds the fused features to the BART decoder and to the copy correction decision module. The copy correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden state of the BART decoder as input, calculates by multi-head attention the probability distribution over whether each character should be copied or corrected, and finally keeps or corrects each character in the speech recognition text according to this distribution. The invention effectively reduces the character error rate of speech recognition, alleviates the overcorrection problem of conventional sequence-to-sequence error correction models, and provides a more flexible solution for detecting and correcting character errors in speech recognition.
Description
Technical Field
The invention relates to a speech recognition text error correction method based on pronunciation guidance, and belongs to the field of speech recognition post-processing.
Background Art
Speech recognition error correction is important for improving the recognition accuracy of Automatic Speech Recognition (ASR) systems. When an ASR system transcribes speech into text, it is influenced by factors such as speaker accents and the surrounding environment, so errors occur in the transcribed text, often involving homophones and near-homophones. Speech recognition text error correction is therefore regarded as a key technology for improving recognition accuracy and readability.
Similar to machine translation, mainstream error correction models generally treat the task as sequence-to-sequence: the original sentences containing the various ASR recognition errors are the source inputs, and the outputs are the corresponding corrected sentences. However, sequence-to-sequence methods often suffer from overcorrection, i.e., the model incorrectly modifies already-correct recognition results, or generates corrections inconsistent with the original pronunciation or semantics, causing ambiguity.
Furthermore, the word error rate of ASR systems is typically low (usually less than 10%), meaning that the sentences input to the error correction model overlap substantially with the output sentences, and ASR recognition errors are mainly homophone substitution errors, with a small proportion of insertion and deletion errors. It is therefore important that the text generated by a sequence-to-sequence correction model not only preserves the semantics but also conforms to pronunciation similarity. Recent studies have proposed sequence-to-edit methods that combine sequence labeling and sequence generation: an error detector predicts corrective actions (i.e., retention, deletion, substitution, or insertion) to guide and constrain the behavior of the corrector. In addition, an edit-distance-based scheme uses a length predictor to bridge the length mismatch between the input and target sequences, the predicted length indicating where edit operations occur. Sequence-to-edit methods provide an additional supervisory signal to the correction model, enabling finer control of the correction operations. However, the accuracy of the predictor severely affects correction performance, and using only part of the sequence as decoder input tends to disrupt the structure of the original sentence and lose semantic information.
On the other hand, there is also research interest in integrating phonetic knowledge into error correction models. Correction candidates can be generated from a confusion set by masking candidates whose pronunciation or shape is not similar to the original word. However, manually annotated confusion sets have limited coverage, and merely filtering out characters absent from the confusion set may inadvertently discard correct candidates. These methods treat phonetic information as external knowledge rather than modeling phonetic features directly inside the correction model, and thus do not fully exploit the inherent phonetic features of the corrected words.
Disclosure of Invention
The technical problem the invention aims to solve is as follows: the invention provides a speech recognition text error correction method based on pronunciation guidance. It optimizes the encoder-decoder architecture of the BART pre-trained model, fully exploiting BART's strong ability to reconstruct noisy or corrupted sentences. On top of the BART pre-trained model, it adds a multi-granularity pronunciation feature encoder that extracts pinyin features at the syllable level and the sentence level, and a pronunciation-guided copy correction decision module that controls whether the model copies correct characters from the source sentence or generates new ones, thereby guiding the model to produce correct candidate words that satisfy both pronunciation and semantic constraints, and alleviating the overcorrection problem.
The technical scheme of the invention is as follows: a speech recognition text error correction method based on pronunciation guidance, the method comprising:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module;
Step2, fusing, in the pronunciation and semantic representation fusion module, the pronunciation features with the last-layer hidden state of the BART encoder through a gating unit, and feeding the fused pronunciation-semantic features to the BART decoder and the copy correction decision module;
Step3, the copy correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden state of the BART decoder as input, calculates by multi-head attention the probability distribution over whether each character should be copied or corrected, and finally keeps or corrects each character in the speech recognition text according to this distribution.
Further, Step1 includes:
The multi-granularity pronunciation feature encoding module consists of a unidirectional GRU and a four-layer bidirectional Transformer encoder. It captures pronunciation differences among initials, finals, and tones, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through a single-layer unidirectional GRU and a four-layer bidirectional Transformer module, respectively. For a Chinese character c_i, its pinyin sequence is expressed as p_i = {p_{i,1}, p_{i,2}, …, p_{i,q_i}}, where q_i is the pinyin sequence length of c_i. The syllable-level encoding is expressed as:

h^p_{i,j} = GRU(Emb(p_{i,j}), h^p_{i,j-1})

where Emb(p_{i,j}) is the embedding of the j-th pinyin letter, and h^p_{i,j} and h^p_{i,j-1} are the j-th and (j-1)-th hidden states of the GRU network, respectively.
Sentence-level encoding adopts four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, plus its position embedding, is taken as input, yielding the pronunciation context H_p = {h^p_1, h^p_2, …, h^p_n}, whose length equals that of the input character sequence.
Further, step2 includes: the final layer hidden state output from the BART encoder by the pronunciation and semantic representation fusion moduleCharacterized as semantic, where X is the input sequence, emb (X) and PE (X) represent characters and position embedding functions, respectively; expressed as:
Hs=BARTEnc(Emb(X)+PE(X))
in order to integrate pronunciation and semantic representation, a gating unit is designed to enable the model to determine how many pronunciation features or semantic features each character should adopt; the threshold size is calculated by a linear layer and normalized by the Sigmoid function:
fused features The update is expressed as:
Expressed as/>, for pronunciation context I=1 to n, which are sent to the BART decoder after the enhanced pronunciation characteristics are obtained, and also to the duplication correction decision module to determine whether characters in the source input sentence should be duplicated to the target output.
Further, Step3 includes the following:
The last-layer decoder hidden state d_t at decoding step t is produced by the BART decoder from the fusion features H_fuse obtained in Step2 and the previously generated tokens:

d_t = BARTDec(H_fuse, y_{<t})

The distribution over all tokens in the target vocabulary is then calculated by:

P_vocab(y_t) = Softmax(W_v · d_t + b_v)

The probability of copying from the source tokens rather than generating, P_copy, is calculated via multi-head attention (MHA), where d_t is the query and H_fuse provides the keys and values:

c_t, a_t = MHA(d_t, H_fuse, H_fuse)

where c_t is the attention context vector and a_t the attention weights. The copy probability P_copy is obtained by a linear transformation of the concatenation of c_t and the decoder hidden state d_t, followed by a sigmoid function:

P_copy = Sigmoid(W_c [c_t ; d_t] + b_c)

Here P_copy is a scalar weighting the copy distribution, and 1 − P_copy weights the generation distribution; the final output of the decoder at step t is expressed as:

P(y_t) = P_copy · Σ_{i: x_i = y_t} a_{t,i} + (1 − P_copy) · P_vocab(y_t)

Finally, the model is optimized with the average negative log-likelihood of the target reference token y_t at each step t:

L = −(1/T) Σ_{t=1}^{T} log P(y_t)
The beneficial effects of the invention are as follows:
The invention provides a pronunciation-guided speech recognition text error correction method that adopts BART, with its strong ability to repair corrupted sentences, as the base architecture; constructs a multi-granularity pronunciation feature encoder that guides the copy correction decision module and the BART decoder to identify where errors occur so as to correct them accurately; and uses the pronunciation and semantic representation fusion module to decide whether to copy characters from the original sentence or generate new ones. Compared with existing ASR error correction methods, the invention achieves an 18% improvement on the AISHELL-1 test set and a more remarkable improvement (44%) on MAGICDATA. The invention achieves higher error correction accuracy and effectively alleviates the overcorrection problem of sequence-to-sequence speech recognition error correction models.
Drawings
FIG. 1 is a general block diagram of a speech recognition text error correction method based on pronunciation guidance in accordance with the present invention;
FIG. 2 is an illustration of error correction for a speech recognition text error correction method based on pronunciation guidance in accordance with the present invention;
FIG. 3 is a schematic diagram of attention weight of a duplication correction decision module in a speech recognition text correction method based on pronunciation guidance according to the present invention;
Detailed Description
The invention provides a speech recognition text error correction method based on pronunciation guidance, which is further described below with reference to the accompanying drawings and specific embodiments. It should be noted that the drawings are in a very simplified form and serve only to facilitate a clear and concise description of the embodiments of the present invention.
Example 1: as shown in FIG. 1, the speech recognition text error correction method based on pronunciation guidance comprises the following parts: data selection and preprocessing, the multi-granularity pronunciation feature encoder module, the pronunciation and semantic representation fusion module, and the copy correction decision module. The specific steps are as follows:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module. First, data selection and preprocessing:
Experiments were performed on the public datasets AISHELL-1 and MAGICDATA. AISHELL-1 contains 178 hours of Chinese speech-text paired data: the training set contains about 160 hours of speech corresponding to about 120,000 sentence pairs, and the validation and test sets each contain 15 hours of speech corresponding to about 21,000 sentence pairs. MAGICDATA is another open-source speech corpus containing 755 hours of speech-text paired data: the training set contains about 660 hours of speech corresponding to about 57,000 sentence pairs, and the validation and test sets together contain about 52 hours of speech corresponding to about 26,000 sentence pairs. The domains include interactive question answering, music search, social networking messages, home commands and controls, and the like.
The pinyin sequence of a Chinese character consists of an initial, a final, and a tone. Initials (21 in total) and finals (39 in total) are represented by roman letters, and the 5 tones are represented as {0,1,2,3,4}. A pinyin conversion tool (PyPinyin) is used to convert Chinese characters into pinyin sequences. Taking a Chinese character pronounced 'wei4' as an example, its pinyin sequence is {'w', 'ei', 4}.
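As a rough illustration of this decomposition (a hypothetical helper, not the patent's code or PyPinyin's API), a toned syllable string such as 'wei4' can be split into initial, final, and tone in plain Python; note that 'w' and 'y' are included here as matchable onsets even though they are not among the 21 standard initials:

```python
# Hypothetical sketch: split a pinyin syllable like "wei4" into
# (initial, final, tone). Two-letter initials must be tried first.
INITIALS = [
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
    "w", "y",  # onset letters treated as initials for this decomposition
]

def split_syllable(syllable: str):
    """Split 'wei4' -> ('w', 'ei', 4); tone 0 is assumed when no digit is given."""
    tone = 0
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    initial = ""
    for cand in INITIALS:
        if syllable.startswith(cand):
            initial = cand
            break
    final = syllable[len(initial):]
    return initial, final, tone
```

For example, `split_syllable("wei4")` yields `('w', 'ei', 4)`, matching the example in the text.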
The multi-granularity pronunciation feature encoding module consists of a unidirectional GRU and a four-layer bidirectional Transformer encoder. To capture pronunciation differences among initials, finals, and tones, syllable-level and sentence-level pronunciation features are extracted from the pinyin sequence through a single-layer unidirectional GRU and a four-layer bidirectional Transformer module, respectively. For a Chinese character c_i, its pinyin sequence is expressed as p_i = {p_{i,1}, p_{i,2}, …, p_{i,q_i}}, where q_i is the pinyin sequence length of c_i. The syllable-level encoding is expressed as:

h^p_{i,j} = GRU(Emb(p_{i,j}), h^p_{i,j-1})

where Emb(p_{i,j}) is the embedding of the j-th pinyin letter, and h^p_{i,j} and h^p_{i,j-1} are the j-th and (j-1)-th hidden states of the GRU network, respectively.
Sentence-level encoding adopts four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, plus its position embedding, is taken as input, yielding the pronunciation context H_p = {h^p_1, h^p_2, …, h^p_n}, whose length equals that of the input character sequence.
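The syllable-level GRU encoding above can be sketched minimally in numpy (dimensions, initialization, and the random embedding table are illustrative assumptions, not the patent's settings; the sentence-level Transformer stage is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One unidirectional GRU step: h_j = GRU(Emb(p_j), h_{j-1}). Biases omitted."""
    Wz, Uz, Wr, Ur, Wn, Un = params
    z = sigmoid(x @ Wz + h @ Uz)            # update gate
    r = sigmoid(x @ Wr + h @ Ur)            # reset gate
    n = np.tanh(x @ Wn + (r * h) @ Un)      # candidate state
    return (1.0 - z) * n + z * h

def encode_syllable(pinyin_ids, emb, params, hidden=16):
    """Run the GRU over one character's pinyin letters; the last hidden
    state serves as the character's syllable-level pronunciation code."""
    h = np.zeros(hidden)
    for idx in pinyin_ids:
        h = gru_cell(emb[idx], h, params)
    return h

rng = np.random.default_rng(0)
d, hdim, vocab = 8, 16, 60                  # illustrative sizes
emb = rng.normal(size=(vocab, d))           # pinyin-letter embedding table
params = tuple(rng.normal(scale=0.1, size=s)
               for s in [(d, hdim), (hdim, hdim)] * 3)
code = encode_syllable([3, 7, 4], emb, params)   # e.g. ids for 'w', 'ei', '4'
```

In the full model, one such code per character would then be fed, with position embeddings, into the four-layer bidirectional Transformer to obtain the sentence-level pronunciation context.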
Step2, the pronunciation and semantic representation fusion module fuses the pronunciation features with the last-layer hidden state of the BART encoder through a gating unit, and feeds the fused pronunciation-semantic features to the BART decoder and the copy correction decision module.
The fusion module takes the last-layer hidden state H_s = {h^s_1, …, h^s_n} output by the BART encoder as the semantic representation, where X is the input sequence and Emb(X) and PE(X) denote the character and position embedding functions, respectively:

Hs = BARTEnc(Emb(X) + PE(X))

To integrate the pronunciation and semantic representations, a gating unit is designed so that the model can decide how much pronunciation or semantic information each character should adopt. The gate value is computed by a linear layer and normalized by the Sigmoid function:

g_i = Sigmoid(W_g [h^s_i ; h^p_i] + b_g)

The fused feature is then updated as:

h^fuse_i = g_i ⊙ h^s_i + (1 − g_i) ⊙ h^p_i,  i = 1, …, n

where h^p_i is the pronunciation context of the i-th character. After the pronunciation-enhanced fusion features H_fuse = {h^fuse_1, …, h^fuse_n} are obtained, they are sent to the BART decoder, and also to the copy correction decision module to determine whether characters in the source input sentence should be copied to the target output.
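The gated fusion of semantic and pronunciation states can be sketched as follows (per-character, per-dimension gate; weight shapes and sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_s, h_p, Wg, bg):
    """Fuse semantic states h_s and pronunciation states h_p (both n x d):
    g = sigmoid(Wg [h_s ; h_p] + bg);  h_fuse = g*h_s + (1-g)*h_p."""
    g = sigmoid(np.concatenate([h_s, h_p], axis=-1) @ Wg + bg)
    return g * h_s + (1.0 - g) * h_p, g

rng = np.random.default_rng(1)
n, d = 5, 8                                # 5 characters, hidden size 8 (illustrative)
h_s = rng.normal(size=(n, d))              # BART encoder last-layer states
h_p = rng.normal(size=(n, d))              # pronunciation context from Step1
Wg = rng.normal(scale=0.1, size=(2 * d, d))
bg = np.zeros(d)
h_fuse, g = gated_fusion(h_s, h_p, Wg, bg)
```

Because the gate is strictly between 0 and 1, each fused feature is a convex mixture of the semantic and pronunciation representations, which is what lets the model decide per character how much of each to use.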
Step3, the copy correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden state of the BART decoder as input, calculates by multi-head attention the probability distribution over whether each character should be copied or corrected, and finally keeps or corrects each character in the speech recognition text according to this distribution.
Considering the significant overlap between the source and target sequences in the ASR error correction task, a pronunciation-enhanced copy correction decision mechanism is applied between the BART encoder and decoder. The copy correction decision module determines whether each character in the target sequence is copied from the source sequence or generated by the decoder, and can thus be regarded as an implicit error detector, while the BART decoder acts as the error corrector. The module takes the pronunciation-semantic fusion representation and the last-layer hidden state of the decoder as input, and calculates the copy-or-correct probability distribution by multi-head attention (MHA).
The last-layer decoder hidden state d_t at decoding step t is produced by the BART decoder from the fusion features H_fuse obtained in Step2 and the previously generated tokens:

d_t = BARTDec(H_fuse, y_{<t})

The distribution over all tokens in the target vocabulary is then calculated by:

P_vocab(y_t) = Softmax(W_v · d_t + b_v)

The probability of copying from the source tokens rather than generating, P_copy, is calculated via multi-head attention, where d_t is the query and H_fuse provides the keys and values:

c_t, a_t = MHA(d_t, H_fuse, H_fuse)

where c_t is the attention context vector and a_t the attention weights. The copy probability P_copy is obtained by a linear transformation of the concatenation of c_t and the decoder hidden state d_t, followed by a sigmoid function:

P_copy = Sigmoid(W_c [c_t ; d_t] + b_c)

Here P_copy is a scalar weighting the copy distribution, and 1 − P_copy weights the generation distribution; the final output of the decoder at step t is expressed as:

P(y_t) = P_copy · Σ_{i: x_i = y_t} a_{t,i} + (1 − P_copy) · P_vocab(y_t)

Finally, the model is optimized with the average negative log-likelihood of the target reference token y_t at each step t:

L = −(1/T) Σ_{t=1}^{T} log P(y_t)
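The copy/generate mixture at one decoding step can be sketched as a toy numpy example (single-head dot-product attention stands in for MHA; all sizes and weights are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def copy_generate_step(d_t, H_fuse, src_ids, W_v, W_c, vocab):
    """Mix a copy distribution (attention mass scattered onto the source
    tokens' vocab ids) with the decoder's generation distribution,
    weighted by a scalar copy probability."""
    a_t = softmax(H_fuse @ d_t)                # attention weights over source positions
    c_t = a_t @ H_fuse                         # attention context vector
    p_vocab = softmax(W_v @ d_t)               # generation distribution over vocab
    p_copy = 1.0 / (1.0 + np.exp(-(W_c @ np.concatenate([c_t, d_t]))))
    copy_dist = np.zeros(vocab)
    np.add.at(copy_dist, src_ids, a_t)         # sums weights of repeated source ids
    return p_copy * copy_dist + (1.0 - p_copy) * p_vocab

rng = np.random.default_rng(2)
vocab, d, n = 30, 8, 4
d_t = rng.normal(size=d)                       # decoder last-layer state at step t
H_fuse = rng.normal(size=(n, d))               # fused source representations
src_ids = np.array([5, 9, 9, 2])               # vocab ids of the source characters
W_v = rng.normal(scale=0.1, size=(vocab, d))
W_c = rng.normal(scale=0.1, size=2 * d)
p = copy_generate_step(d_t, H_fuse, src_ids, W_v, W_c, vocab)
```

Since both component distributions sum to 1 and p_copy is in (0,1), the mixture is itself a valid probability distribution over the vocabulary.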
To illustrate the effect of the invention, comparative experiments were designed against other schemes to demonstrate the effectiveness of the proposed method. The evaluation results of the proposed scheme (PGCC) and other representative models (ConvSeq2Seq, BART, FastCorrect, ConstDecoder, DCN) in terms of Character Error Rate (CER) and Character Error Reduction Rate (CERR) are shown in Table 1. The invention achieves a lower character error rate, with an 18% performance improvement, and a particularly remarkable improvement of 44% on MAGICDATA, demonstrating its effectiveness on the ASR error correction task.
Table 1: experimental results of the proposed protocol with other models on AISHELL-1 and MAGICDATA datasets
Analysis of the results in Table 1 shows that pre-trained models perform well compared with classical sequence-to-sequence models. The fine-tuned BART model achieves a lower CER than the ConvSeq2Seq model, showing that BART's pre-training process and objectives greatly benefit error correction. FastCorrect achieves slightly better results than BART because it is pre-trained with a 400M pseudo dataset. The ConstDecoder model uses only part of the sequence as decoder input, resulting in loss of semantic information. The DCN model performs slightly worse than the fine-tuned BART model because it cannot correct insertion or deletion errors. The invention encodes pronunciation features at multiple granularities, enhancing the model's ability to capture pronunciation information, and is more flexible than the error detectors in sequence-to-edit models.
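For reference, CER is the character-level edit distance between hypothesis and reference divided by the reference length, and CERR the relative error reduction achieved by the correction model; a standard sketch (not tied to the patent's evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over characters (substitutions, insertions, deletions)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))            # dp[j] = distance between prefixes
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion from ref
                        dp[j - 1] + 1,                   # insertion into ref
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution / match
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def cerr(cer_before, cer_after):
    """Character Error Reduction Rate relative to the uncorrected CER."""
    return (cer_before - cer_after) / cer_before
```

For example, `cerr(0.10, 0.082)` corresponds to an 18% relative reduction, the magnitude reported for AISHELL-1 above.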
Fig. 2 shows two prediction examples, one in which the input and correction sequences have the same length and one in which the lengths differ. In example (a), the pronunciation of the correct character ('kui4') is similar to that of the wrong character ('hui'); it can be observed that the model focuses more on the pronunciation information (0.23) than on the semantic information (0.73), thereby providing more pronunciation information to the copy correction decision module and the decoder. In example (b), the character ('ji1') is followed by a deletion error: the correct word should be the two-character word ('ji1 ji2', meaning 'positive'), and here too the model attends more to the pronunciation information.
As seen from Fig. 3, the attention weights in the copy correction decision module demonstrate the invention's ability to effectively detect and locate erroneous characters. The module learns the alignment between the source input sequence and the target correct sequence well, assigning higher weights to correct source characters and lower weights to erroneous tokens, in both the equal-length and the variable-length correction cases.
The specific embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (4)
1. A speech recognition text error correction method based on pronunciation guidance, characterized in that the method comprises the following steps:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module;
Step2, fusing, in the pronunciation and semantic representation fusion module, the pronunciation features with the last-layer hidden state of the BART encoder through a gating unit, and feeding the fused pronunciation-semantic features to the BART decoder and the copy correction decision module;
Step3, the copy correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden state of the BART decoder as input, calculates by multi-head attention the probability distribution over whether each character should be copied or corrected, and finally keeps or corrects each character in the speech recognition text according to this distribution.
2. The pronunciation-guided speech recognition text error correction method of claim 1, characterized in that Step1 includes:
The multi-granularity pronunciation feature encoding module consists of a unidirectional GRU and a four-layer bidirectional Transformer encoder. It captures pronunciation differences among initials, finals, and tones, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through a single-layer unidirectional GRU and a four-layer bidirectional Transformer module, respectively. For a Chinese character c_i, its pinyin sequence is denoted p_i = {p_{i,1}, p_{i,2}, …, p_{i,q_i}}, where q_i is the pinyin sequence length of c_i, and the syllable-level code is denoted:

h^p_{i,j} = GRU(Emb(p_{i,j}), h^p_{i,j-1})

where Emb(p_{i,j}) is the embedding of the j-th pinyin letter, and h^p_{i,j} and h^p_{i,j-1} are the j-th and (j-1)-th hidden states of the GRU network, respectively.
Sentence-level encoding adopts four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, plus its position embedding, is taken as input, yielding the pronunciation context H_p = {h^p_1, …, h^p_n}, whose length equals that of the input character sequence.
3. The pronunciation-guided speech recognition text error correction method of claim 1, characterized in that Step2 includes: the pronunciation and semantic representation fusion module takes the last-layer hidden state H_s = {h^s_1, …, h^s_n} output by the BART encoder as the semantic representation, where X is the input sequence and Emb(X) and PE(X) denote the character and position embedding functions, respectively:

Hs = BARTEnc(Emb(X) + PE(X))

To integrate the pronunciation and semantic representations, a gating unit is designed so that the model can decide how much pronunciation or semantic information each character should adopt. The gate value is computed by a linear layer and normalized by the Sigmoid function:

g_i = Sigmoid(W_g [h^s_i ; h^p_i] + b_g)

The fused feature is updated as:

h^fuse_i = g_i ⊙ h^s_i + (1 − g_i) ⊙ h^p_i,  i = 1, …, n

After the pronunciation-enhanced fusion features H_fuse = {h^fuse_1, …, h^fuse_n} are obtained, they are sent to the BART decoder, and also to the copy correction decision module to determine whether characters in the source input sentence should be copied to the target output.
4. A pronunciation-guided speech recognition error correction model as claimed in claim 3, wherein: the Step3 comprises the following steps:
hidden state of last layer decoder Is output from the decoder according to the previous step S tep2 and the resulting fusion feature, where the output d t at decoding step t is represented as:
The distribution of all the tokens in the target vocabulary is then calculated by:
Probability distribution P copy, representing whether to copy from the source tag or to obtain from the generated tag, is calculated by multi-head attention, where d t is the query and H fuse is the key and value;
ct, at = MultiHeadAttention(dt, Hfuse, Hfuse)
where ct is the attention context vector and at is the attention weight; the copy probability pcopy is obtained by a linear transformation of the concatenation of ct and the decoder hidden state dt, followed by the Sigmoid function:
pcopy = Sigmoid(Wc[ct; dt] + bc)
Here pcopy is a scalar used to weight the copy distribution, and 1 - pcopy weights the generation distribution; the final output of the decoder at step t is expressed as:
P(yt) = pcopy · Pcopy(yt) + (1 - pcopy) · Pgen(yt)
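The weighting of the copy and generation probabilities described above follows the familiar pointer-generator pattern; a minimal sketch with hypothetical numbers (the real p_copy, attention weights, and generation distribution come from the trained model):

```python
def final_distribution(p_copy, copy_dist, gen_dist, src_to_vocab):
    # copy_dist:    attention weights over source positions (sums to 1)
    # gen_dist:     distribution over the target vocabulary (sums to 1)
    # src_to_vocab: vocabulary index of the token at each source position
    out = [(1.0 - p_copy) * g for g in gen_dist]
    for pos, w in enumerate(copy_dist):
        # Scatter the copy probability mass into vocabulary space.
        out[src_to_vocab[pos]] += p_copy * w
    return out

gen_dist = [0.1, 0.2, 0.3, 0.4]   # toy 4-token vocabulary
copy_dist = [0.7, 0.3]            # attention over 2 source characters
p = final_distribution(0.6, copy_dist, gen_dist, src_to_vocab=[2, 0])
assert abs(sum(p) - 1.0) < 1e-9   # still a valid probability distribution
```

Because the two distributions are mixed with complementary weights, the result remains a proper distribution over the vocabulary, and source tokens that receive high attention get a boosted chance of being copied verbatim.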
Finally, the model is optimized with the average negative log-likelihood of the target gold label yt at each step t:
L = -(1/T) Σt log P(yt)
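The training objective, the average negative log-likelihood over decoding steps, can be sketched directly; the per-step gold-token probabilities below are hypothetical:

```python
import math

def nll_loss(step_probs):
    # step_probs: P(y_t = gold token) for t = 1..T, taken from the decoder's
    # final output distribution at each step.
    return -sum(math.log(p) for p in step_probs) / len(step_probs)

# Three decoding steps with hypothetical gold-token probabilities.
loss = nll_loss([0.9, 0.5, 0.8])
assert loss > 0.0   # loss is positive whenever any probability is below 1
```

Minimizing this loss pushes the mixed copy/generation distribution to place more mass on the correct token at every step.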
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410163742.XA CN118038873A (en) | 2024-02-05 | 2024-02-05 | Speech recognition text error correction method based on pronunciation guidance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118038873A true CN118038873A (en) | 2024-05-14 |
Family
ID=91001626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410163742.XA Pending CN118038873A (en) | 2024-02-05 | 2024-02-05 | Speech recognition text error correction method based on pronunciation guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118038873A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113811946B (en) | End-to-end automatic speech recognition of digit sequences | |
CN107741928B (en) | Method for correcting error of text after voice recognition based on domain recognition | |
CN114023316B (en) | TCN-Transformer-CTC-based end-to-end Chinese speech recognition method | |
Mao et al. | Speech recognition and multi-speaker diarization of long conversations | |
CN114444479A (en) | End-to-end Chinese speech text error correction method, device and storage medium | |
US11417322B2 (en) | Transliteration for speech recognition training and scoring | |
CN110459208A (en) | Knowledge-transfer-based training method for a sequence-to-sequence speech recognition model | |
CN117099157A (en) | Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation | |
Sagae et al. | Hallucinated n-best lists for discriminative language modeling | |
Arora et al. | Two-pass low latency end-to-end spoken language understanding | |
CN115455946A (en) | Voice recognition error correction method and device, electronic equipment and storage medium | |
CN116757184A (en) | Vietnam voice recognition text error correction method and system integrating pronunciation characteristics | |
Ashihara et al. | SpeechGLUE: How well can self-supervised speech models capture linguistic knowledge? | |
Yang et al. | ASR error correction with constrained decoding on operation prediction | |
CN113571037A (en) | Method and system for synthesizing Chinese braille voice | |
CN115270771B (en) | Fine-grained self-adaptive Chinese spelling error correction method assisted by word-sound prediction task | |
WO2024020154A1 (en) | Using aligned text and speech representations to train automatic speech recognition models without transcribed speech data | |
CN115795008A (en) | Spoken language dialogue state tracking model training method and spoken language dialogue state tracking method | |
KR20210076163A (en) | Transliteration for speech recognition training and scoring | |
CN118038873A (en) | Speech recognition text error correction method based on pronunciation guidance | |
CN115171647A (en) | Voice synthesis method and device with natural pause processing, electronic equipment and computer readable medium | |
Nguyen et al. | User-initiated repetition-based recovery in multi-utterance dialogue systems | |
CN113378553A (en) | Text processing method and device, electronic equipment and storage medium | |
Liu et al. | Investigating for punctuation prediction in Chinese speech transcriptions | |
CN118471201B (en) | Efficient self-adaptive hotword error correction method and system for speech recognition engine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||