CN118038873A - Speech recognition text error correction method based on pronunciation guidance - Google Patents
Speech recognition text error correction method based on pronunciation guidance
- Publication number
- CN118038873A (application number CN202410163742.XA)
- Authority
- CN
- China
- Prior art keywords
- pronunciation
- decoder
- bart
- sequence
- error correction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
Abstract
The invention relates to a speech recognition text error correction method based on pronunciation guidance. The multi-granularity pronunciation feature encoding module consists of a unidirectional GRU and four Transformer layers, and extracts pronunciation features from the pinyin sequence. The pronunciation and semantic representation fusion module fuses these features with the last-layer hidden state of the BART encoder through a gating unit, and feeds the fused features to the BART decoder and to the copy correction decision module. The copy correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden state of the BART decoder as input, calculates by multi-head attention the probability distribution over whether each character should be copied or corrected, and finally keeps or corrects each character in the speech recognition text according to this distribution. The invention effectively reduces the character error rate of speech recognition, alleviates the overcorrection problem of conventional sequence-to-sequence error correction models, and provides a more flexible solution for detecting and correcting character errors in speech recognition.
Description
Technical Field
The invention relates to a speech recognition text error correction method based on pronunciation guidance, and belongs to the field of speech recognition post-processing.
Background Art
Speech recognition error correction is important for improving the recognition accuracy of Automatic Speech Recognition (ASR) systems. When an ASR system transcribes speech into text, it is influenced by factors such as speaker accents and the surrounding environment, so errors occur in the transcribed text, often involving homophones and near-homophones. Speech recognition text error correction is therefore regarded as a key technology for improving recognition accuracy and readability.
Similar to machine translation, mainstream error correction models generally treat the task as sequence-to-sequence: the original sentences containing the various ASR recognition errors are the source inputs, and the outputs are the corresponding corrected sentences. However, sequence-to-sequence methods often suffer from overcorrection, i.e., the model incorrectly modifies already-correct recognition results, or generates corrections inconsistent with the original pronunciation or semantics, causing ambiguity.
Furthermore, the word error rate of ASR systems is typically low (usually less than 10%), meaning that the sentences input to the error correction model overlap substantially with the output sentences, and ASR recognition errors are mainly homophone substitution errors, with a small proportion of insertion and deletion errors. It is therefore important that the text generated by a sequence-to-sequence correction model not only preserves the semantics but also conforms to pronunciation similarity. Recent studies have proposed sequence-to-edit methods that combine sequence labeling and sequence generation: an error detector predicts corrective actions (i.e., retention, deletion, substitution, or insertion) to guide and constrain the behavior of the corrector. In addition, an edit-distance-based scheme uses a length predictor to bridge the length mismatch between the input and target sequences, the predicted length indicating where edit operations occur. Sequence-to-edit methods provide an additional supervisory signal to the correction model, enabling finer control of the correction operations. However, the accuracy of the predictor severely affects correction performance, and using only part of the sequence as decoder input tends to disrupt the structure of the original sentence and lose semantic information.
On the other hand, there is also research interest in integrating phonetic knowledge into error correction models. Correction candidates can be generated from a confusion set by masking candidates whose pronunciation or shape is not similar to the original word. However, manually annotated confusion sets have limited coverage, and merely filtering out characters absent from the confusion set may inadvertently discard correct candidates. These methods treat phonetic information as external knowledge rather than modeling phonetic features directly inside the correction model, and thus do not fully exploit the inherent phonetic features of the corrected words.
Disclosure of Invention
The technical problem the invention aims to solve is as follows: the invention provides a speech recognition text error correction method based on pronunciation guidance. It optimizes the encoder-decoder architecture of the BART pre-trained model, fully exploiting BART's strong ability to reconstruct noisy or corrupted sentences. On top of the BART pre-trained model, it adds a multi-granularity pronunciation feature encoder that extracts pinyin features at the syllable level and the sentence level, and a pronunciation-guided copy correction decision module that controls whether the model copies correct characters from the source sentence or generates new ones, thereby guiding the model to produce correct candidate words that satisfy both pronunciation and semantic constraints, and alleviating the overcorrection problem.
The technical scheme of the invention is as follows: a speech recognition text error correction method based on pronunciation guidance, the method comprising:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module;
Step2, fusing, in the pronunciation and semantic representation fusion module, the pronunciation features with the last-layer hidden state of the BART encoder through a gating unit, and feeding the fused pronunciation-semantic features to the BART decoder and the copy correction decision module;
Step3, the copy correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden state of the BART decoder as input, calculates by multi-head attention the probability distribution over whether each character should be copied or corrected, and finally keeps or corrects each character in the speech recognition text according to this distribution.
Further, Step1 includes:
The multi-granularity pronunciation feature encoding module consists of a unidirectional GRU and a four-layer bidirectional Transformer encoder. It captures pronunciation differences among initials, finals, and tones, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through a single-layer unidirectional GRU and a four-layer bidirectional Transformer module, respectively. For a Chinese character c_i, its pinyin sequence is expressed as p_i = {p_{i,1}, p_{i,2}, …, p_{i,q_i}}, where q_i is the pinyin sequence length of c_i. The syllable-level encoding is expressed as:

h^p_{i,j} = GRU(Emb(p_{i,j}), h^p_{i,j-1})

where Emb(p_{i,j}) is the embedding of the j-th pinyin letter, and h^p_{i,j} and h^p_{i,j-1} are the j-th and (j-1)-th hidden states of the GRU network, respectively.
Sentence-level encoding adopts four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, plus its position embedding, is taken as input, yielding the pronunciation context H_p = {h^p_1, h^p_2, …, h^p_n}, whose length equals that of the input character sequence.
Further, step2 includes: the final layer hidden state output from the BART encoder by the pronunciation and semantic representation fusion moduleCharacterized as semantic, where X is the input sequence, emb (X) and PE (X) represent characters and position embedding functions, respectively; expressed as:
Hs=BARTEnc(Emb(X)+PE(X))
in order to integrate pronunciation and semantic representation, a gating unit is designed to enable the model to determine how many pronunciation features or semantic features each character should adopt; the threshold size is calculated by a linear layer and normalized by the Sigmoid function:
fused features The update is expressed as:
Expressed as/>, for pronunciation context I=1 to n, which are sent to the BART decoder after the enhanced pronunciation characteristics are obtained, and also to the duplication correction decision module to determine whether characters in the source input sentence should be duplicated to the target output.
Further, Step3 includes the following:
The last-layer decoder hidden state d_t at decoding step t is produced by the BART decoder from the fusion features H_fuse obtained in Step2 and the previously generated tokens:

d_t = BARTDec(H_fuse, y_{<t})

The distribution over all tokens in the target vocabulary is then calculated by:

P_vocab(y_t) = Softmax(W_v · d_t + b_v)

The probability of copying from the source tokens rather than generating, P_copy, is calculated via multi-head attention (MHA), where d_t is the query and H_fuse provides the keys and values:

c_t, a_t = MHA(d_t, H_fuse, H_fuse)

where c_t is the attention context vector and a_t the attention weights. The copy probability P_copy is obtained by a linear transformation of the concatenation of c_t and the decoder hidden state d_t, followed by a sigmoid function:

P_copy = Sigmoid(W_c [c_t ; d_t] + b_c)

Here P_copy is a scalar weighting the copy distribution, and 1 − P_copy weights the generation distribution; the final output of the decoder at step t is expressed as:

P(y_t) = P_copy · Σ_{i: x_i = y_t} a_{t,i} + (1 − P_copy) · P_vocab(y_t)

Finally, the model is optimized with the average negative log-likelihood of the target reference token y_t at each step t:

L = −(1/T) Σ_{t=1}^{T} log P(y_t)
The beneficial effects of the invention are as follows:
The invention provides a pronunciation-guided speech recognition text error correction method that adopts BART, with its strong ability to repair corrupted sentences, as the base architecture; constructs a multi-granularity pronunciation feature encoder that guides the copy correction decision module and the BART decoder to identify where errors occur so as to correct them accurately; and uses the pronunciation and semantic representation fusion module to decide whether to copy characters from the original sentence or generate new ones. Compared with existing ASR error correction methods, the invention achieves an 18% improvement on the AISHELL-1 test set and a more remarkable improvement (44%) on MAGICDATA. The invention achieves higher error correction accuracy and effectively alleviates the overcorrection problem of sequence-to-sequence speech recognition error correction models.
Drawings
FIG. 1 is a general block diagram of a speech recognition text error correction method based on pronunciation guidance in accordance with the present invention;
FIG. 2 is an illustration of error correction for a speech recognition text error correction method based on pronunciation guidance in accordance with the present invention;
FIG. 3 is a schematic diagram of attention weight of a duplication correction decision module in a speech recognition text correction method based on pronunciation guidance according to the present invention;
Detailed Description
The invention provides a speech recognition text error correction method based on pronunciation guidance, which is further described below with reference to the accompanying drawings and specific embodiments. It should be noted that the drawings are in a very simplified form and serve only to facilitate a clear and concise description of the embodiments of the present invention.
Example 1: as shown in FIG. 1, the speech recognition text error correction method based on pronunciation guidance comprises the following parts: data selection and preprocessing, the multi-granularity pronunciation feature encoder module, the pronunciation and semantic representation fusion module, and the copy correction decision module. The specific steps are as follows:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module. First, data selection and preprocessing:
Experiments were performed on the public datasets AISHELL-1 and MAGICDATA. AISHELL-1 contains 178 hours of Chinese speech-text paired data: the training set contains about 160 hours of speech corresponding to about 120,000 sentence pairs, and the validation and test sets each contain 15 hours of speech corresponding to about 21,000 sentence pairs. MAGICDATA is another open-source speech corpus containing 755 hours of speech-text paired data: the training set contains about 660 hours of speech corresponding to about 57,000 sentence pairs, and the validation and test sets together contain about 52 hours of speech corresponding to about 26,000 sentence pairs. The domains include interactive question answering, music search, social networking messages, home commands and controls, and the like.
The pinyin sequence of a Chinese character consists of an initial, a final, and a tone. Initials (21 in total) and finals (39 in total) are represented by roman letters, and the 5 tones are represented as {0,1,2,3,4}. A pinyin conversion tool (PyPinyin) is used to convert Chinese characters into pinyin sequences. Taking a Chinese character pronounced 'wei4' as an example, its pinyin sequence is {'w', 'ei', 4}.
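As a rough illustration of this decomposition (a hypothetical helper, not the patent's code or PyPinyin's API), a toned syllable string such as 'wei4' can be split into initial, final, and tone in plain Python; note that 'w' and 'y' are included here as matchable onsets even though they are not among the 21 standard initials:

```python
# Hypothetical sketch: split a pinyin syllable like "wei4" into
# (initial, final, tone). Two-letter initials must be tried first.
INITIALS = [
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
    "w", "y",  # onset letters treated as initials for this decomposition
]

def split_syllable(syllable: str):
    """Split 'wei4' -> ('w', 'ei', 4); tone 0 is assumed when no digit is given."""
    tone = 0
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    initial = ""
    for cand in INITIALS:
        if syllable.startswith(cand):
            initial = cand
            break
    final = syllable[len(initial):]
    return initial, final, tone
```

For example, `split_syllable("wei4")` yields `('w', 'ei', 4)`, matching the example in the text.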
The multi-granularity pronunciation feature encoding module consists of a unidirectional GRU and a four-layer bidirectional Transformer encoder. To capture pronunciation differences among initials, finals, and tones, syllable-level and sentence-level pronunciation features are extracted from the pinyin sequence through a single-layer unidirectional GRU and a four-layer bidirectional Transformer module, respectively. For a Chinese character c_i, its pinyin sequence is expressed as p_i = {p_{i,1}, p_{i,2}, …, p_{i,q_i}}, where q_i is the pinyin sequence length of c_i. The syllable-level encoding is expressed as:

h^p_{i,j} = GRU(Emb(p_{i,j}), h^p_{i,j-1})

where Emb(p_{i,j}) is the embedding of the j-th pinyin letter, and h^p_{i,j} and h^p_{i,j-1} are the j-th and (j-1)-th hidden states of the GRU network, respectively.
Sentence-level encoding adopts four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, plus its position embedding, is taken as input, yielding the pronunciation context H_p = {h^p_1, h^p_2, …, h^p_n}, whose length equals that of the input character sequence.
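The syllable-level GRU encoding above can be sketched minimally in numpy (dimensions, initialization, and the random embedding table are illustrative assumptions, not the patent's settings; the sentence-level Transformer stage is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One unidirectional GRU step: h_j = GRU(Emb(p_j), h_{j-1}). Biases omitted."""
    Wz, Uz, Wr, Ur, Wn, Un = params
    z = sigmoid(x @ Wz + h @ Uz)            # update gate
    r = sigmoid(x @ Wr + h @ Ur)            # reset gate
    n = np.tanh(x @ Wn + (r * h) @ Un)      # candidate state
    return (1.0 - z) * n + z * h

def encode_syllable(pinyin_ids, emb, params, hidden=16):
    """Run the GRU over one character's pinyin letters; the last hidden
    state serves as the character's syllable-level pronunciation code."""
    h = np.zeros(hidden)
    for idx in pinyin_ids:
        h = gru_cell(emb[idx], h, params)
    return h

rng = np.random.default_rng(0)
d, hdim, vocab = 8, 16, 60                  # illustrative sizes
emb = rng.normal(size=(vocab, d))           # pinyin-letter embedding table
params = tuple(rng.normal(scale=0.1, size=s)
               for s in [(d, hdim), (hdim, hdim)] * 3)
code = encode_syllable([3, 7, 4], emb, params)   # e.g. ids for 'w', 'ei', '4'
```

In the full model, one such code per character would then be fed, with position embeddings, into the four-layer bidirectional Transformer to obtain the sentence-level pronunciation context.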
Step2, the pronunciation and semantic representation fusion module fuses the pronunciation features with the last-layer hidden state of the BART encoder through a gating unit, and feeds the fused pronunciation-semantic features to the BART decoder and the copy correction decision module.
The fusion module takes the last-layer hidden state H_s = {h^s_1, …, h^s_n} output by the BART encoder as the semantic representation, where X is the input sequence and Emb(X) and PE(X) denote the character and position embedding functions, respectively:

Hs = BARTEnc(Emb(X) + PE(X))

To integrate the pronunciation and semantic representations, a gating unit is designed so that the model can decide how much pronunciation or semantic information each character should adopt. The gate value is computed by a linear layer and normalized by the Sigmoid function:

g_i = Sigmoid(W_g [h^s_i ; h^p_i] + b_g)

The fused feature is then updated as:

h^fuse_i = g_i ⊙ h^s_i + (1 − g_i) ⊙ h^p_i,  i = 1, …, n

where h^p_i is the pronunciation context of the i-th character. After the pronunciation-enhanced fusion features H_fuse = {h^fuse_1, …, h^fuse_n} are obtained, they are sent to the BART decoder, and also to the copy correction decision module to determine whether characters in the source input sentence should be copied to the target output.
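The gated fusion of semantic and pronunciation states can be sketched as follows (per-character, per-dimension gate; weight shapes and sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_s, h_p, Wg, bg):
    """Fuse semantic states h_s and pronunciation states h_p (both n x d):
    g = sigmoid(Wg [h_s ; h_p] + bg);  h_fuse = g*h_s + (1-g)*h_p."""
    g = sigmoid(np.concatenate([h_s, h_p], axis=-1) @ Wg + bg)
    return g * h_s + (1.0 - g) * h_p, g

rng = np.random.default_rng(1)
n, d = 5, 8                                # 5 characters, hidden size 8 (illustrative)
h_s = rng.normal(size=(n, d))              # BART encoder last-layer states
h_p = rng.normal(size=(n, d))              # pronunciation context from Step1
Wg = rng.normal(scale=0.1, size=(2 * d, d))
bg = np.zeros(d)
h_fuse, g = gated_fusion(h_s, h_p, Wg, bg)
```

Because the gate is strictly between 0 and 1, each fused feature is a convex mixture of the semantic and pronunciation representations, which is what lets the model decide per character how much of each to use.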
Step3, the copy correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden state of the BART decoder as input, calculates by multi-head attention the probability distribution over whether each character should be copied or corrected, and finally keeps or corrects each character in the speech recognition text according to this distribution.
Considering the significant overlap between the source and target sequences in the ASR error correction task, a pronunciation-enhanced copy correction decision mechanism is applied between the BART encoder and decoder. The copy correction decision module determines whether each character in the target sequence is copied from the source sequence or generated by the decoder, and can thus be regarded as an implicit error detector, while the BART decoder acts as the error corrector. The module takes the pronunciation-semantic fusion representation and the last-layer hidden state of the decoder as input, and calculates the copy-or-correct probability distribution by multi-head attention (MHA).
The last-layer decoder hidden state d_t at decoding step t is produced by the BART decoder from the fusion features H_fuse obtained in Step2 and the previously generated tokens:

d_t = BARTDec(H_fuse, y_{<t})

The distribution over all tokens in the target vocabulary is then calculated by:

P_vocab(y_t) = Softmax(W_v · d_t + b_v)

The probability of copying from the source tokens rather than generating, P_copy, is calculated via multi-head attention, where d_t is the query and H_fuse provides the keys and values:

c_t, a_t = MHA(d_t, H_fuse, H_fuse)

where c_t is the attention context vector and a_t the attention weights. The copy probability P_copy is obtained by a linear transformation of the concatenation of c_t and the decoder hidden state d_t, followed by a sigmoid function:

P_copy = Sigmoid(W_c [c_t ; d_t] + b_c)

Here P_copy is a scalar weighting the copy distribution, and 1 − P_copy weights the generation distribution; the final output of the decoder at step t is expressed as:

P(y_t) = P_copy · Σ_{i: x_i = y_t} a_{t,i} + (1 − P_copy) · P_vocab(y_t)

Finally, the model is optimized with the average negative log-likelihood of the target reference token y_t at each step t:

L = −(1/T) Σ_{t=1}^{T} log P(y_t)
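The copy/generate mixture at one decoding step can be sketched as a toy numpy example (single-head dot-product attention stands in for MHA; all sizes and weights are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def copy_generate_step(d_t, H_fuse, src_ids, W_v, W_c, vocab):
    """Mix a copy distribution (attention mass scattered onto the source
    tokens' vocab ids) with the decoder's generation distribution,
    weighted by a scalar copy probability."""
    a_t = softmax(H_fuse @ d_t)                # attention weights over source positions
    c_t = a_t @ H_fuse                         # attention context vector
    p_vocab = softmax(W_v @ d_t)               # generation distribution over vocab
    p_copy = 1.0 / (1.0 + np.exp(-(W_c @ np.concatenate([c_t, d_t]))))
    copy_dist = np.zeros(vocab)
    np.add.at(copy_dist, src_ids, a_t)         # sums weights of repeated source ids
    return p_copy * copy_dist + (1.0 - p_copy) * p_vocab

rng = np.random.default_rng(2)
vocab, d, n = 30, 8, 4
d_t = rng.normal(size=d)                       # decoder last-layer state at step t
H_fuse = rng.normal(size=(n, d))               # fused source representations
src_ids = np.array([5, 9, 9, 2])               # vocab ids of the source characters
W_v = rng.normal(scale=0.1, size=(vocab, d))
W_c = rng.normal(scale=0.1, size=2 * d)
p = copy_generate_step(d_t, H_fuse, src_ids, W_v, W_c, vocab)
```

Since both component distributions sum to 1 and p_copy is in (0,1), the mixture is itself a valid probability distribution over the vocabulary.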
To illustrate the effect of the invention, comparative experiments were designed against other schemes to demonstrate the effectiveness of the proposed method. The evaluation results of the proposed scheme (PGCC) and other representative models (ConvSeq2Seq, BART, FastCorrect, ConstDecoder, DCN) in terms of Character Error Rate (CER) and Character Error Reduction Rate (CERR) are shown in Table 1. The invention achieves a lower character error rate, with an 18% performance improvement, and a particularly remarkable improvement of 44% on MAGICDATA, demonstrating its effectiveness on the ASR error correction task.
Table 1: experimental results of the proposed protocol with other models on AISHELL-1 and MAGICDATA datasets
Analysis of the results in Table 1 shows that pre-trained models perform well compared with classical sequence-to-sequence models. The fine-tuned BART model achieves a lower CER than the ConvSeq2Seq model, showing that BART's pre-training process and objectives greatly benefit error correction. FastCorrect achieves slightly better results than BART because it is pre-trained with a 400M pseudo dataset. The ConstDecoder model uses only part of the sequence as decoder input, resulting in loss of semantic information. The DCN model performs slightly worse than the fine-tuned BART model because it cannot correct insertion or deletion errors. The invention encodes pronunciation features at multiple granularities, enhancing the model's ability to capture pronunciation information, and is more flexible than the error detectors in sequence-to-edit models.
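For reference, CER is the character-level edit distance between hypothesis and reference divided by the reference length, and CERR the relative error reduction achieved by the correction model; a standard sketch (not tied to the patent's evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over characters (substitutions, insertions, deletions)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))            # dp[j] = distance between prefixes
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion from ref
                        dp[j - 1] + 1,                   # insertion into ref
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution / match
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def cerr(cer_before, cer_after):
    """Character Error Reduction Rate relative to the uncorrected CER."""
    return (cer_before - cer_after) / cer_before
```

For example, `cerr(0.10, 0.082)` corresponds to an 18% relative reduction, the magnitude reported for AISHELL-1 above.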
Fig. 2 shows two prediction examples, one in which the input and correction sequences have the same length and one in which the lengths differ. In example (a), the pronunciation of the correct character ('kui4') is similar to that of the wrong character ('hui'); it can be observed that the model focuses more on the pronunciation information (0.23) than on the semantic information (0.73), thereby providing more pronunciation information to the copy correction decision module and the decoder. In example (b), the character ('ji1') is followed by a deletion error: the correct word should be the two-character word ('ji1 ji2', meaning 'positive'), and here too the model attends more to the pronunciation information.
As seen from Fig. 3, the attention weights in the copy correction decision module demonstrate the invention's ability to effectively detect and locate erroneous characters. The module learns the alignment between the source input sequence and the target correct sequence well, assigning higher weights to correct source characters and lower weights to erroneous tokens, in both the equal-length and the variable-length correction cases.
The specific embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (4)
1. A speech recognition text error correction method based on pronunciation guidance, characterized in that the method comprises the following steps:
Step1, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through the constructed multi-granularity pronunciation feature encoding module;
Step2, fusing, in the pronunciation and semantic representation fusion module, the pronunciation features with the last-layer hidden state of the BART encoder through a gating unit, and feeding the fused pronunciation-semantic features to the BART decoder and the copy correction decision module;
Step3, the copy correction decision module takes the pronunciation-semantic fusion features and the last-layer hidden state of the BART decoder as input, calculates by multi-head attention the probability distribution over whether each character should be copied or corrected, and finally keeps or corrects each character in the speech recognition text according to this distribution.
2. The pronunciation-guided speech recognition text error correction method of claim 1, characterized in that Step1 includes:
The multi-granularity pronunciation feature encoding module consists of a unidirectional GRU and a four-layer bidirectional Transformer encoder. It captures pronunciation differences among initials, finals, and tones, extracting syllable-level and sentence-level pronunciation features from the pinyin sequence through a single-layer unidirectional GRU and a four-layer bidirectional Transformer module, respectively. For a Chinese character c_i, its pinyin sequence is denoted p_i = {p_{i,1}, p_{i,2}, …, p_{i,q_i}}, where q_i is the pinyin sequence length of c_i, and the syllable-level code is denoted:

h^p_{i,j} = GRU(Emb(p_{i,j}), h^p_{i,j-1})

where Emb(p_{i,j}) is the embedding of the j-th pinyin letter, and h^p_{i,j} and h^p_{i,j-1} are the j-th and (j-1)-th hidden states of the GRU network, respectively.
Sentence-level encoding adopts four bidirectional Transformer blocks: the last GRU hidden state of each Chinese character, plus its position embedding, is taken as input, yielding the pronunciation context H_p = {h^p_1, …, h^p_n}, whose length equals that of the input character sequence.
3. The pronunciation-guided speech recognition text error correction method of claim 1, characterized in that Step2 includes: the pronunciation and semantic representation fusion module takes the last-layer hidden state H_s = {h^s_1, …, h^s_n} output by the BART encoder as the semantic representation, where X is the input sequence and Emb(X) and PE(X) denote the character and position embedding functions, respectively:

Hs = BARTEnc(Emb(X) + PE(X))

To integrate the pronunciation and semantic representations, a gating unit is designed so that the model can decide how much pronunciation or semantic information each character should adopt. The gate value is computed by a linear layer and normalized by the Sigmoid function:

g_i = Sigmoid(W_g [h^s_i ; h^p_i] + b_g)

The fused feature is updated as:

h^fuse_i = g_i ⊙ h^s_i + (1 − g_i) ⊙ h^p_i,  i = 1, …, n

After the pronunciation-enhanced fusion features H_fuse = {h^fuse_1, …, h^fuse_n} are obtained, they are sent to the BART decoder, and also to the copy correction decision module to determine whether characters in the source input sentence should be copied to the target output.
4. A pronunciation-guided speech recognition error correction model as claimed in claim 3, wherein: the Step3 comprises the following steps:
hidden state of last layer decoder Is output from the decoder according to the previous step S tep2 and the resulting fusion feature, where the output d t at decoding step t is represented as:
The distribution of all the tokens in the target vocabulary is then calculated by:
Probability distribution P copy, representing whether to copy from the source tag or to obtain from the generated tag, is calculated by multi-head attention, where d t is the query and H fuse is the key and value;
ct, at = MultiHeadAttention(dt, Hfuse, Hfuse)
where ct is the attention context vector and at is the attention weight; the copy probability pcopy is obtained by a linear transformation of the concatenation of ct and the decoder hidden state dt, followed by the Sigmoid function:
pcopy = Sigmoid(Wc[ct; dt] + bc)
Here pcopy is a scalar used to weight the copy distribution, and 1 - pcopy weights the generation distribution; the final output of the decoder at step t is expressed as:
P(yt) = pcopy · Pcopy(yt) + (1 - pcopy) · Pgen(yt)
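The weighting of the copy and generation probabilities described above follows the familiar pointer-generator pattern; a minimal sketch with hypothetical numbers (the real p_copy, attention weights, and generation distribution come from the trained model):

```python
def final_distribution(p_copy, copy_dist, gen_dist, src_to_vocab):
    # copy_dist:    attention weights over source positions (sums to 1)
    # gen_dist:     distribution over the target vocabulary (sums to 1)
    # src_to_vocab: vocabulary index of the token at each source position
    out = [(1.0 - p_copy) * g for g in gen_dist]
    for pos, w in enumerate(copy_dist):
        # Scatter the copy probability mass into vocabulary space.
        out[src_to_vocab[pos]] += p_copy * w
    return out

gen_dist = [0.1, 0.2, 0.3, 0.4]   # toy 4-token vocabulary
copy_dist = [0.7, 0.3]            # attention over 2 source characters
p = final_distribution(0.6, copy_dist, gen_dist, src_to_vocab=[2, 0])
assert abs(sum(p) - 1.0) < 1e-9   # still a valid probability distribution
```

Because the two distributions are mixed with complementary weights, the result remains a proper distribution over the vocabulary, and source tokens that receive high attention get a boosted chance of being copied verbatim.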
Finally, the model is optimized with the average negative log-likelihood of the target gold label yt at each step t:
L = -(1/T) Σt log P(yt)
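The training objective, the average negative log-likelihood over decoding steps, can be sketched directly; the per-step gold-token probabilities below are hypothetical:

```python
import math

def nll_loss(step_probs):
    # step_probs: P(y_t = gold token) for t = 1..T, taken from the decoder's
    # final output distribution at each step.
    return -sum(math.log(p) for p in step_probs) / len(step_probs)

# Three decoding steps with hypothetical gold-token probabilities.
loss = nll_loss([0.9, 0.5, 0.8])
assert loss > 0.0   # loss is positive whenever any probability is below 1
```

Minimizing this loss pushes the mixed copy/generation distribution to place more mass on the correct token at every step.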
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410163742.XA CN118038873A (en) | 2024-02-05 | 2024-02-05 | Speech recognition text error correction method based on pronunciation guidance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118038873A true CN118038873A (en) | 2024-05-14 |
Family
ID=91001626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410163742.XA Pending CN118038873A (en) | 2024-02-05 | 2024-02-05 | Speech recognition text error correction method based on pronunciation guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118038873A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113811946B (en) | End-to-end automatic speech recognition of digit sequences | |
CN107741928B (en) | Method for correcting error of text after voice recognition based on domain recognition | |
CN114023316B (en) | TCN-Transformer-CTC-based end-to-end Chinese speech recognition method | |
Mao et al. | Speech recognition and multi-speaker diarization of long conversations | |
CN114444479A (en) | End-to-end Chinese speech text error correction method, device and storage medium | |
US11417322B2 (en) | Transliteration for speech recognition training and scoring | |
CN110459208A (en) | Knowledge-transfer-based training method for a sequence-to-sequence speech recognition model | |
CN117099157A (en) | Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation | |
Sagae et al. | Hallucinated n-best lists for discriminative language modeling | |
Arora et al. | Two-pass low latency end-to-end spoken language understanding | |
CN115455946A (en) | Voice recognition error correction method and device, electronic equipment and storage medium | |
CN116757184A (en) | Vietnam voice recognition text error correction method and system integrating pronunciation characteristics | |
Ashihara et al. | SpeechGLUE: How well can self-supervised speech models capture linguistic knowledge? | |
Yang et al. | ASR error correction with constrained decoding on operation prediction | |
CN113571037A (en) | Method and system for synthesizing Chinese braille voice | |
CN115270771B (en) | Fine-grained self-adaptive Chinese spelling error correction method assisted by word-sound prediction task | |
WO2024020154A1 (en) | Using aligned text and speech representations to train automatic speech recognition models without transcribed speech data | |
CN115795008A (en) | Spoken language dialogue state tracking model training method and spoken language dialogue state tracking method | |
KR20210076163A (en) | Transliteration for speech recognition training and scoring | |
CN118038873A (en) | Speech recognition text error correction method based on pronunciation guidance | |
CN115171647A (en) | Voice synthesis method and device with natural pause processing, electronic equipment and computer readable medium | |
Nguyen et al. | User-initiated repetition-based recovery in multi-utterance dialogue systems | |
CN113378553A (en) | Text processing method and device, electronic equipment and storage medium | |
Liu et al. | Investigating for punctuation prediction in Chinese speech transcriptions | |
CN118471201B (en) | Efficient self-adaptive hotword error correction method and system for speech recognition engine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||