CN118038873A - Speech recognition text error correction method based on pronunciation guidance - Google Patents

Speech recognition text error correction method based on pronunciation guidance

Info

Publication number
CN118038873A
CN118038873A (application CN202410163742.XA)
Authority
CN
China
Prior art keywords
pronunciation
decoder
bart
sequence
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410163742.XA
Other languages
Chinese (zh)
Inventor
董凌
余正涛
高盛祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202410163742.XA priority Critical patent/CN118038873A/en
Publication of CN118038873A publication Critical patent/CN118038873A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a speech recognition text error correction method based on pronunciation guidance. A multi-granularity pronunciation feature encoding module, consisting of a unidirectional GRU and four Transformer layers, extracts pronunciation features from the pinyin sequence. A pronunciation and semantic representation fusion module fuses these features with the last-layer hidden states of the BART encoder through a gating unit, and feeds the fused features to the BART decoder and to a copy-correction decision module. The copy-correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden states of the BART decoder as input, computes via multi-head attention the probability that each character should be copied or corrected, and finally keeps or corrects each character of the speech recognition text according to this distribution. The invention effectively reduces the character error rate of speech recognition, alleviates the over-correction problem of conventional sequence-to-sequence error correction models, and provides a more flexible solution for detecting and correcting character errors in speech recognition output.

Description

Speech recognition text error correction method based on pronunciation guidance
Technical Field
The invention relates to a speech recognition text error correction method based on pronunciation guidance, and belongs to the field of speech recognition post-processing.
Background Art
Speech recognition error correction is important for improving the accuracy of automatic speech recognition (ASR) systems. When an ASR system transcribes speech into text, factors such as speaker accent and the surrounding acoustic environment introduce errors into the transcript, most often homophone and near-homophone recognition errors; text error correction is therefore regarded as a key technology for improving recognition accuracy and readability.
Like machine translation, mainstream error correction is generally cast as a sequence-to-sequence task: the original sentence containing ASR recognition errors is the source input, and the corresponding corrected sentence is the output. However, sequence-to-sequence methods often suffer from over-correction, i.e., the model incorrectly modifies already-correct recognition results, or generates corrections inconsistent with the original pronunciation or semantics, causing ambiguity.
Furthermore, the word error rate of ASR systems is typically low (usually below 10%), meaning that the sentence input to the error correction model overlaps substantially with the target output, and ASR errors are mainly homophone substitutions, with a small proportion of insertion and deletion errors. It is therefore important that the text generated by a sequence-to-sequence correction model not only preserves semantics but also respects pronunciation similarity. Recent studies have proposed sequence-to-edit methods that combine sequence labeling with sequence generation: an error detector predicts edit operations (retention, deletion, substitution, or insertion) to guide and constrain the corrector during correction. One edit-distance-based scheme adds a length predictor to bridge the length mismatch between input and target sequences, the predicted lengths indicating where edit operations occur. Sequence-to-edit methods provide the correction model with an additional supervisory signal and enable finer control over edits. However, the accuracy of the predictor strongly limits correction performance, and using only part of the sequence as decoder input tends to break the structure of the original sentence and lose semantic information.
Separately, there has been research interest in integrating phonetic knowledge into error correction models, e.g., generating correction candidates from a confusion set by masking candidates whose pronunciation or shape is dissimilar to the original word. However, manually annotated confusion sets have limited coverage, and simply filtering out characters absent from the confusion set may inadvertently discard correct candidates. These methods treat phonetic information as external knowledge rather than modeling pronunciation features directly inside the correction model, and thus exploit the inherent pronunciation characteristics of the corrected words inefficiently.
Disclosure of Invention
The technical problem addressed by the invention is as follows: the invention provides a speech recognition text error correction method based on pronunciation guidance. It builds on the encoder-decoder architecture of the pre-trained BART model, exploiting BART's strong ability to reconstruct noisy or corrupted sentences; a multi-granularity pronunciation feature encoder is added on top of BART to extract syllable-level and sentence-level pronunciation features from the pinyin sequence; and a pronunciation-guided copy-correction decision module controls whether the model copies correct characters from the source sentence or generates new ones, guiding the model toward candidate words that satisfy both pronunciation and semantic constraints and thereby alleviating over-correction.
The technical scheme of the invention is as follows: a pronunciation-guided speech recognition text error correction method, comprising:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module;
Step2, fusing, by the pronunciation and semantic representation fusion module, the pronunciation features with the last-layer hidden states of the BART encoder through a gating unit, and feeding the fused pronunciation-semantic features to the BART decoder and the copy-correction decision module;
Step3, taking, by the copy-correction decision module, the fused pronunciation-semantic features and the last-layer hidden states of the BART decoder as input, computing via multi-head attention the probability that each character should be copied or corrected, and finally keeping or correcting each character of the speech recognition text according to this distribution.
Further, Step1 includes:
The multi-granularity pronunciation feature encoding module consists of a single-layer unidirectional GRU and a four-layer bidirectional Transformer encoder. To capture pronunciation dissimilarity among different initials, finals, and tones, it extracts syllable-level and sentence-level pronunciation features from the pinyin sequence. For a Chinese character $c_i$, its pinyin sequence is denoted $p_i = \{p_{i,1}, p_{i,2}, \dots, p_{i,q_i}\}$, where $q_i$ is the pinyin sequence length of $c_i$. The syllable-level encoding is

$$h^{syl}_{i,j} = \mathrm{GRU}\big(\mathrm{Emb}(p_{i,j}),\, h^{syl}_{i,j-1}\big)$$

where $\mathrm{Emb}(p_{i,j})$ is the embedding of the $j$-th pinyin letter, and $h^{syl}_{i,j}$ and $h^{syl}_{i,j-1}$ are the $j$-th and $(j-1)$-th hidden states of the GRU network.
Sentence-level encoding uses four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, together with position embeddings, serves as input, yielding the pronunciation context $H_p = \{h^p_1, h^p_2, \dots, h^p_n\}$, whose length equals that of the input character sequence.
Further, Step2 includes: the pronunciation and semantic representation fusion module takes the last-layer hidden states of the BART encoder as the semantic representation $H_s$, where $X$ is the input sequence and $\mathrm{Emb}(X)$ and $\mathrm{PE}(X)$ denote the character and position embedding functions, respectively:

$$H_s = \mathrm{BARTEnc}\big(\mathrm{Emb}(X) + \mathrm{PE}(X)\big)$$

To integrate the pronunciation and semantic representations, a gating unit is designed so that the model can decide how much pronunciation or semantic information each character should adopt. The gate is computed by a linear layer and normalized by the sigmoid function:

$$g = \mathrm{Sigmoid}\big(W_g[H_s; H_p] + b_g\big)$$

The fused features $H_{fuse}$ are updated as

$$H_{fuse} = g \odot H_p + (1 - g) \odot H_s$$

With the pronunciation context $H_p = \{h^p_i\},\ i = 1, \dots, n$, the resulting pronunciation-enhanced features are sent to the BART decoder, and also to the copy-correction decision module to determine whether characters in the source input sentence should be copied to the target output.
Further, Step3 includes the following:
The last-layer hidden state of the BART decoder is computed from the fusion features of Step2; the output $d_t$ at decoding step $t$ is

$$d_t = \mathrm{BARTDec}\big(y_{<t},\, H_{fuse}\big)$$

The distribution over all tokens of the target vocabulary is then

$$P_{gen}(y_t) = \mathrm{Softmax}\big(W_o d_t + b_o\big)$$

The copy distribution, indicating whether a token is copied from the source or obtained by generation, is computed by multi-head attention with $d_t$ as the query and $H_{fuse}$ as the keys and values:

$$c_t,\, a_t = \mathrm{MHA}\big(d_t,\, H_{fuse},\, H_{fuse}\big)$$

where $c_t$ is the attention context vector and $a_t$ the attention weights. The copy probability $P_{copy}$ is obtained by a linear transformation of the concatenation of $c_t$ and the decoder hidden state $d_t$, followed by a sigmoid:

$$P_{copy} = \mathrm{Sigmoid}\big(W_c[c_t; d_t] + b_c\big)$$

Here $P_{copy}$ is a scalar weighting the copy distribution, and $1 - P_{copy}$ weights the generation distribution; the final output of the decoder at step $t$ is

$$P(y_t) = P_{copy} \cdot a_t + (1 - P_{copy}) \cdot P_{gen}(y_t)$$

Finally, the model is optimized by the average negative log-likelihood of the target ground-truth labels $y_t$ over all steps $t$:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(y_t)$$
The beneficial effects of the invention are as follows:
The invention provides a pronunciation-guided speech recognition text error correction method that adopts BART, with its strong capability for repairing corrupted sentences, as the base architecture; constructs a multi-granularity pronunciation feature encoder that guides the copy-correction decision module and the BART decoder to locate error positions for accurate correction; and uses the pronunciation and semantic representation fusion module to decide whether to copy characters from the original sentence or generate new ones. Compared with existing ASR error correction methods, the invention achieves an 18% error reduction on the AISHELL-1 test set and a more pronounced 44% reduction on MAGICDATA. The invention attains higher correction precision and effectively alleviates the over-correction problem of sequence-to-sequence speech recognition error correction models.
Drawings
FIG. 1 is a general block diagram of a speech recognition text error correction method based on pronunciation guidance in accordance with the present invention;
FIG. 2 is an illustration of error correction for a speech recognition text error correction method based on pronunciation guidance in accordance with the present invention;
FIG. 3 is a schematic diagram of the attention weights of the copy-correction decision module in the speech recognition text error correction method based on pronunciation guidance according to the present invention.
Detailed Description
The invention provides a speech recognition text error correction method based on pronunciation guidance, which is further described below with reference to the accompanying drawings and specific embodiments. It should be noted that the drawings are in a greatly simplified form and serve only to facilitate a clear and concise description of embodiments of the present invention.
Example 1: as shown in FIG. 1, the speech recognition text error correction method based on pronunciation guidance comprises the following components: data selection and preprocessing, the multi-granularity pronunciation feature encoder module, the pronunciation and semantic representation fusion module, and the copy-correction decision module. The specific steps are as follows:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module. First, data selection and preprocessing:
Experiments were performed on the public datasets AISHELL-1 and MAGICDATA. AISHELL-1 contains 178 hours of paired Chinese speech and text: the training set contains about 160 hours of speech corresponding to about 120,000 sentence pairs, and the validation and test sets together contain 15 hours of speech corresponding to about 21,000 sentence pairs. MAGICDATA is another open-source speech corpus containing 755 hours of paired speech and text: the training set contains about 660 hours of speech corresponding to about 57,000 sentence pairs, and the validation and test sets contain about 52 hours of speech corresponding to about 26,000 sentence pairs; the domains include interactive question answering, music search, social messaging, and home command and control.
The pinyin sequence of a Chinese character consists of an initial, a final, and a tone. Initials (21 in total) and finals (39 in total) are represented with roman letters, and the 5 tones are represented as {0, 1, 2, 3, 4}. A pinyin conversion tool (pypinyin) is used to convert Chinese characters into pinyin sequences; for example, for a character pronounced "wei4", the pinyin sequence is {'w', 'ei', 4}.
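As an illustration of this preprocessing step, the following sketch uses the open-source pypinyin package to decompose characters into {initial, final, tone} triples; the style flags and the example character below are assumptions for illustration, not details given in the patent:

```python
# Illustrative sketch (not the patent's code): decompose Chinese characters
# into {initial, final, tone} pinyin triples with the pypinyin package.
from pypinyin import pinyin, Style

def to_pinyin_sequence(text: str):
    """Return one [initial, final, tone] triple per character."""
    # strict=False treats 'w'/'y' as initials, matching the 21-initial scheme
    initials = pinyin(text, style=Style.INITIALS, strict=False)
    finals = pinyin(text, style=Style.FINALS, strict=False)
    tones = pinyin(text, style=Style.TONE3)
    triples = []
    for ini, fin, ton in zip(initials, finals, tones):
        tone = ton[0][-1] if ton[0][-1].isdigit() else '0'  # '0' marks neutral tone
        triples.append([ini[0], fin[0], tone])
    return triples

print(to_pinyin_sequence("位"))  # [['w', 'ei', '4']]; 位 is pronounced wei4
```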
The multi-granularity pronunciation feature encoding module consists of a single-layer unidirectional GRU and a four-layer bidirectional Transformer encoder. To capture pronunciation dissimilarity among different initials, finals, and tones, syllable-level and sentence-level pronunciation features are extracted from the pinyin sequence. For a Chinese character $c_i$, its pinyin sequence is denoted $p_i = \{p_{i,1}, p_{i,2}, \dots, p_{i,q_i}\}$, where $q_i$ is the pinyin sequence length of $c_i$. The syllable-level encoding is

$$h^{syl}_{i,j} = \mathrm{GRU}\big(\mathrm{Emb}(p_{i,j}),\, h^{syl}_{i,j-1}\big)$$

where $\mathrm{Emb}(p_{i,j})$ is the embedding of the $j$-th pinyin letter, and $h^{syl}_{i,j}$ and $h^{syl}_{i,j-1}$ are the $j$-th and $(j-1)$-th hidden states of the GRU network.
Sentence-level encoding uses four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, together with position embeddings, serves as input, yielding the pronunciation context $H_p = \{h^p_1, h^p_2, \dots, h^p_n\}$, whose length equals that of the input character sequence.
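As a concrete reading of this module, the following PyTorch sketch stacks a single-layer unidirectional GRU (syllable level) under a four-layer bidirectional Transformer encoder (sentence level); the hidden size, head count, pinyin vocabulary size, and maximum sequence length are illustrative assumptions, not values specified in the patent:

```python
# Minimal sketch of the multi-granularity pronunciation encoder (assumed dims).
import torch
import torch.nn as nn

class PronunciationEncoder(nn.Module):
    def __init__(self, pinyin_vocab: int = 64, d_model: int = 768):
        super().__init__()
        self.emb = nn.Embedding(pinyin_vocab, d_model)
        self.gru = nn.GRU(d_model, d_model, num_layers=1, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.ctx = nn.TransformerEncoder(layer, num_layers=4)  # no causal mask, i.e. bidirectional
        self.pos = nn.Embedding(512, d_model)  # learned position embeddings

    def forward(self, pinyin_ids: torch.Tensor) -> torch.Tensor:
        # pinyin_ids: (batch, n_chars, q) pinyin-letter ids per character
        b, n, q = pinyin_ids.shape
        x = self.emb(pinyin_ids).view(b * n, q, -1)
        _, h_last = self.gru(x)               # syllable level: last GRU hidden state
        h_syl = h_last[-1].view(b, n, -1)     # one vector per Chinese character
        pos = self.pos(torch.arange(n, device=pinyin_ids.device))
        return self.ctx(h_syl + pos)          # sentence-level pronunciation context H_p

hp = PronunciationEncoder()(torch.randint(0, 64, (2, 10, 4)))
print(hp.shape)  # torch.Size([2, 10, 768])
```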
Step2, the pronunciation and semantic representation fusion module fuses the pronunciation features with the last-layer hidden states of the BART encoder through a gating unit, and feeds the fused pronunciation-semantic features to the BART decoder and the copy-correction decision module;
The pronunciation and semantic representation fusion module takes the last-layer hidden states of the BART encoder as the semantic representation $H_s$, where $X$ is the input sequence and $\mathrm{Emb}(X)$ and $\mathrm{PE}(X)$ denote the character and position embedding functions, respectively:

$$H_s = \mathrm{BARTEnc}\big(\mathrm{Emb}(X) + \mathrm{PE}(X)\big)$$

To integrate the pronunciation and semantic representations, a gating unit is designed so that the model can decide how much pronunciation or semantic information each character should adopt. The gate is computed by a linear layer and normalized by the sigmoid function:

$$g = \mathrm{Sigmoid}\big(W_g[H_s; H_p] + b_g\big)$$

The fused features $H_{fuse}$ are updated as

$$H_{fuse} = g \odot H_p + (1 - g) \odot H_s$$

With the pronunciation context $H_p = \{h^p_i\},\ i = 1, \dots, n$, the resulting pronunciation-enhanced features are sent to the BART decoder, and also to the copy-correction decision module to determine whether characters in the source input sentence should be copied to the target output.
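A minimal sketch of the gating unit follows; the patent specifies a linear layer plus sigmoid but not its exact inputs, so the gate over the concatenation of the semantic and pronunciation states is an assumption:

```python
# Sketch of the gated pronunciation-semantic fusion (assumed parameterization).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_s: torch.Tensor, h_p: torch.Tensor) -> torch.Tensor:
        # per-character gate g in (0, 1): how much pronunciation to keep
        g = torch.sigmoid(self.gate(torch.cat([h_s, h_p], dim=-1)))
        return g * h_p + (1 - g) * h_s  # H_fuse: pronunciation-enhanced representation

h_fuse = GatedFusion()(torch.randn(2, 10, 768), torch.randn(2, 10, 768))
```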
Step3, the copy-correction decision module takes the fused pronunciation-semantic features and the last-layer hidden states of the BART decoder as input, computes via multi-head attention the probability that each character should be copied or corrected, and finally keeps or corrects each character of the speech recognition text according to this distribution.
In the ASR error correction task there is significant overlap between the source and target sequences, so a pronunciation-enhanced copy-correction decision mechanism is applied between the BART encoder and decoder. The copy-correction decision module determines whether each character of the target sequence is copied from the source sequence or generated by the decoder; it can therefore be viewed as an implicit error detector, with the BART decoder acting as the error corrector. The module takes the pronunciation-semantic fusion representation and the last-layer hidden states of the decoder as input and computes the copy-versus-correction probability distribution by multi-head attention (MHA).
The last-layer hidden state of the BART decoder is computed from the fusion features of Step2; the output $d_t$ at decoding step $t$ is

$$d_t = \mathrm{BARTDec}\big(y_{<t},\, H_{fuse}\big)$$

The distribution over all tokens of the target vocabulary is then

$$P_{gen}(y_t) = \mathrm{Softmax}\big(W_o d_t + b_o\big)$$

The copy distribution, indicating whether a token is copied from the source or obtained by generation, is computed by multi-head attention with $d_t$ as the query and $H_{fuse}$ as the keys and values:

$$c_t,\, a_t = \mathrm{MHA}\big(d_t,\, H_{fuse},\, H_{fuse}\big)$$

where $c_t$ is the attention context vector and $a_t$ the attention weights. The copy probability $P_{copy}$ is obtained by a linear transformation of the concatenation of $c_t$ and the decoder hidden state $d_t$, followed by a sigmoid:

$$P_{copy} = \mathrm{Sigmoid}\big(W_c[c_t; d_t] + b_c\big)$$

Here $P_{copy}$ is a scalar weighting the copy distribution, and $1 - P_{copy}$ weights the generation distribution; the final output of the decoder at step $t$ is

$$P(y_t) = P_{copy} \cdot a_t + (1 - P_{copy}) \cdot P_{gen}(y_t)$$

Finally, the model is optimized by the average negative log-likelihood of the target ground-truth labels $y_t$ over all steps $t$:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(y_t)$$
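The copy-correction decision can be read as a pointer-generator-style mixture. The sketch below, with assumed shapes and an assumed scatter-based mapping from source positions to vocabulary ids, mixes the attention-derived copy distribution $a_t$ with the generation distribution using the scalar gate $P_{copy}$:

```python
# Sketch of the copy-correction decision step (assumed shapes and vocab mapping).
import torch
import torch.nn as nn

class CopyCorrectionDecision(nn.Module):
    def __init__(self, d_model: int = 768, vocab: int = 21128):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.copy_gate = nn.Linear(2 * d_model, 1)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, d_t, h_fuse, src_ids):
        # d_t: (b, 1, d) decoder state; h_fuse: (b, n, d); src_ids: (b, n)
        c_t, a_t = self.mha(d_t, h_fuse, h_fuse)      # c_t: context, a_t: (b, 1, n)
        p_gen = torch.softmax(self.out(d_t), dim=-1)  # generation distribution
        p_copy = torch.sigmoid(self.copy_gate(torch.cat([c_t, d_t], dim=-1)))
        # scatter the copy attention onto the vocabulary ids of the source characters
        copy_dist = torch.zeros_like(p_gen).scatter_add_(-1, src_ids.unsqueeze(1), a_t)
        return p_copy * copy_dist + (1 - p_copy) * p_gen  # final P(y_t)

dec = CopyCorrectionDecision()
p = dec(torch.randn(2, 1, 768), torch.randn(2, 10, 768),
        torch.randint(0, 21128, (2, 10)))
print(p.shape)  # torch.Size([2, 1, 21128])
```

At training time this mixture $P(y_t)$ would feed the average negative log-likelihood loss given above.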
To illustrate the effect of the invention, comparative experiments against other schemes were designed to demonstrate the effectiveness of the proposed method. The character error rate (CER) and character error reduction rate (CERR) of the proposed scheme (PGCC) and other representative models (ConvSeq2Seq, BART, FastCorrect, ConstDecoder, DCN) are shown in Table 1. The invention achieves a lower character error rate, with an 18% performance improvement overall and a particularly notable 44% improvement on MAGICDATA, demonstrating its effectiveness on the ASR error correction task.
Table 1: Experimental results of the proposed scheme and comparison models on the AISHELL-1 and MAGICDATA datasets
Analysis of the results in Table 1 shows that pre-trained models perform well compared with classical sequence-to-sequence models. The fine-tuned BART model achieves a lower CER than ConvSeq2Seq, indicating that BART's pre-training process and objectives greatly benefit error correction. FastCorrect achieves slightly better results than BART because it is pre-trained on a 400M pseudo-data corpus. The ConstDecoder model uses only part of the sequence as decoder input, leading to loss of semantic information, and the DCN model performs slightly worse than fine-tuned BART because it cannot correct insertion or deletion errors. The invention encodes pronunciation features at multiple granularities, strengthening the model's ability to capture pronunciation information, and is more flexible than the error detectors of sequence-to-edit models.
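For reference, the two metrics reported above can be computed as follows; this is a plain-Python sketch of the standard definitions (CER as Levenshtein distance normalized by reference length, CERR as the relative CER reduction against the uncorrected ASR output), not code from the patent:

```python
# Sketch of the evaluation metrics: CER via single-row Levenshtein DP, and CERR.
def cer(ref: str, hyp: str) -> float:
    d = list(range(len(hyp) + 1))        # edit distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and (mis)match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

def cerr(cer_asr: float, cer_corrected: float) -> float:
    return (cer_asr - cer_corrected) / cer_asr  # fraction of character errors removed
```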
FIG. 2 shows two prediction examples: one where the input and corrected sequences have the same length, and one where the lengths differ. In panel (a), the pronunciation of the correct character (kui4) is similar to that of the erroneous character (hui4); it can be observed that the gate weights the pronunciation information more heavily than the semantics (0.73 vs. 0.23), so more pronunciation information is provided to the copy-correction decision module and the decoder. In panel (b), the character (ji1) is followed by a deletion error; the correct word should be "积极" (ji1 ji2), meaning "positive", and the model likewise attends more to the pronunciation information.
As seen in FIG. 3, the attention weights in the copy-correction decision module demonstrate the invention's ability to detect and locate erroneous characters effectively. The module learns a good alignment between the source input sequence and the target correct sequence, assigning higher weights to correct source characters and lower weights to erroneous tokens, for both equal-length and variable-length corrections.
While specific embodiments of the present invention have been described in detail with reference to the accompanying drawings, the invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (4)

1. A speech recognition text error correction method based on pronunciation guidance, characterized in that the method comprises the following steps:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module;
Step2, fusing, by the pronunciation and semantic representation fusion module, the pronunciation features with the last-layer hidden states of the BART encoder through a gating unit, and feeding the fused pronunciation-semantic features to the BART decoder and the copy-correction decision module;
Step3, taking, by the copy-correction decision module, the fused pronunciation-semantic features and the last-layer hidden states of the BART decoder as input, computing via multi-head attention the probability that each character should be copied or corrected, and finally keeping or correcting each character of the speech recognition text according to this distribution.
2. The speech recognition text error correction method based on pronunciation guidance of claim 1, characterized in that Step1 includes:
the multi-granularity pronunciation feature encoding module consists of a single-layer unidirectional GRU and a four-layer bidirectional Transformer encoder; to capture pronunciation dissimilarity among different initials, finals, and tones, it extracts syllable-level and sentence-level pronunciation features from the pinyin sequence; for a Chinese character $c_i$, its pinyin sequence is denoted $p_i = \{p_{i,1}, p_{i,2}, \dots, p_{i,q_i}\}$, where $q_i$ is the pinyin sequence length of $c_i$, and the syllable-level encoding is

$$h^{syl}_{i,j} = \mathrm{GRU}\big(\mathrm{Emb}(p_{i,j}),\, h^{syl}_{i,j-1}\big)$$

where $\mathrm{Emb}(p_{i,j})$ is the embedding of the $j$-th pinyin letter, and $h^{syl}_{i,j}$ and $h^{syl}_{i,j-1}$ are the $j$-th and $(j-1)$-th hidden states of the GRU network;
sentence-level encoding uses four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, together with position embeddings, serves as input, yielding a pronunciation context $H_p = \{h^p_1, h^p_2, \dots, h^p_n\}$ whose length equals that of the input character sequence.
3. The speech recognition text error correction method based on pronunciation guidance of claim 1, characterized in that Step2 includes: the pronunciation and semantic representation fusion module takes the last-layer hidden states of the BART encoder as the semantic representation $H_s$, where $X$ is the input sequence and $\mathrm{Emb}(X)$ and $\mathrm{PE}(X)$ denote the character and position embedding functions, respectively:

$$H_s = \mathrm{BARTEnc}\big(\mathrm{Emb}(X) + \mathrm{PE}(X)\big)$$

to integrate the pronunciation and semantic representations, a gating unit lets the model decide how much pronunciation or semantic information each character should adopt; the gate is computed by a linear layer and normalized by the sigmoid function:

$$g = \mathrm{Sigmoid}\big(W_g[H_s; H_p] + b_g\big)$$

the fused features $H_{fuse}$ are updated as

$$H_{fuse} = g \odot H_p + (1 - g) \odot H_s$$

with the pronunciation context $H_p = \{h^p_i\},\ i = 1, \dots, n$, the resulting pronunciation-enhanced features are sent to the BART decoder and also to the copy-correction decision module to determine whether characters in the source input sentence should be copied to the target output.
4. The speech recognition text error correction method based on pronunciation guidance of claim 3, characterized in that Step3 includes:
the last-layer hidden state of the BART decoder is computed from the fusion features of Step2, the output $d_t$ at decoding step $t$ being

$$d_t = \mathrm{BARTDec}\big(y_{<t},\, H_{fuse}\big)$$

the distribution over all tokens of the target vocabulary is then

$$P_{gen}(y_t) = \mathrm{Softmax}\big(W_o d_t + b_o\big)$$

the copy distribution, indicating whether a token is copied from the source or obtained by generation, is computed by multi-head attention with $d_t$ as the query and $H_{fuse}$ as the keys and values:

$$c_t,\, a_t = \mathrm{MHA}\big(d_t,\, H_{fuse},\, H_{fuse}\big)$$

where $c_t$ is the attention context vector and $a_t$ the attention weights; the copy probability $P_{copy}$ is obtained by a linear transformation of the concatenation of $c_t$ and the decoder hidden state $d_t$, followed by a sigmoid:

$$P_{copy} = \mathrm{Sigmoid}\big(W_c[c_t; d_t] + b_c\big)$$

here $P_{copy}$ is a scalar weighting the copy distribution and $1 - P_{copy}$ weights the generation distribution; the final output of the decoder at step $t$ is

$$P(y_t) = P_{copy} \cdot a_t + (1 - P_{copy}) \cdot P_{gen}(y_t)$$

finally, the model is optimized by the average negative log-likelihood of the target ground-truth labels $y_t$ over all steps $t$:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(y_t)$$
CN202410163742.XA, filed 2024-02-05 (priority 2024-02-05): Speech recognition text error correction method based on pronunciation guidance. Status: Pending. Publication: CN118038873A (en).

Priority Applications (1)

Application Number: CN202410163742.XA; Priority Date: 2024-02-05; Filing Date: 2024-02-05; Title: Speech recognition text error correction method based on pronunciation guidance

Applications Claiming Priority (1)

Application Number: CN202410163742.XA; Priority Date: 2024-02-05; Filing Date: 2024-02-05; Title: Speech recognition text error correction method based on pronunciation guidance

Publications (1)

Publication Number: CN118038873A; Publication Date: 2024-05-14

Family

ID=91001626

Family Applications (1)

Application Number: CN202410163742.XA; Title: Speech recognition text error correction method based on pronunciation guidance; Priority Date: 2024-02-05; Filing Date: 2024-02-05

Country Status (1)

Country: CN; Publication: CN118038873A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination