CN114781377B - Error correction model, training and error correction method for non-aligned text - Google Patents

Error correction model, training and error correction method for non-aligned text

Info

Publication number
CN114781377B
Authority
CN
China
Prior art keywords
text
vector
module
decoding
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210696857.6A
Other languages
Chinese (zh)
Other versions
CN114781377A (en)
Inventor
许程冲
赵文博
肖清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom Guangdong Industrial Internet Co Ltd
Original Assignee
China Unicom Guangdong Industrial Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unicom Guangdong Industrial Internet Co Ltd filed Critical China Unicom Guangdong Industrial Internet Co Ltd
Priority to CN202210696857.6A
Publication of CN114781377A
Application granted
Publication of CN114781377B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units

Abstract

The invention provides an error correction model for non-aligned text, together with training and error correction methods. The model comprises an encoder model and a decoder model. The preprocessing module and coded-word embedding module of the encoder model produce the first text vector E and output it to a coding layer; the coding layer obtains a text feature vector and outputs it to a decoding layer of the decoder model. The phoneme extraction module, decoded-word embedding module, and first decoding multi-head attention calculation module of the decoder model output several second phoneme vectors to the decoding layer. The decoding layer fuses the second phoneme vectors into a phoneme feature vector, decodes by combining the text feature vector and the phoneme feature vector to obtain a decoding feature vector, and takes the decoding feature vector as the error-corrected text of the original text. Because every processing stage of text error correction is corrected and optimized during training of the end-to-end model, error accumulation is avoided and error correction accuracy is effectively improved.

Description

Error correction model, training and error correction method for non-aligned text
Technical Field
The invention relates to the field of text error correction, and in particular to an error correction model and to training and error correction methods for non-aligned text.
Background
Automatic Speech Recognition (ASR) is a basic task of intelligent speech in natural language processing, widely applied in scenarios such as intelligent customer service and intelligent outbound calling. In automatic speech recognition, the recognition result is often not accurate enough; for example, the recognized text may contain wrong characters, extra characters, or missing characters. The task that handles only wrong characters is called aligned text error correction, while the task that also handles extra and missing characters, so that input and output lengths may differ, is called non-aligned text error correction. Non-aligned text error correction can be applied to tasks such as spelling correction and speech recognition optimization, improving the accuracy of the resulting text.
Error correction of automatic speech recognition results is a critical task for downstream natural language processing services. Existing text error correction schemes generally adopt pipeline processing, i.e., three sequential steps: error detection, candidate recall, and candidate ranking. Error detection means detecting and locating the erroneous points in the text; candidate recall means recalling correct candidate words for those points; candidate ranking means ranking the recalled candidates with a ranking algorithm and replacing the erroneous point with the highest-scoring candidate. In existing schemes the three steps are realized by three independent models, but this pipeline inevitably makes each downstream model depend strongly on the result of its upstream model: when one model makes an error, that error keeps accumulating downstream, so the final result can be badly wrong. Assuming each model has accuracy $p$, the final error correction accuracy is $p^3$; if $p = 90\%$, the final accuracy is only $0.9^3 \approx 73\%$.
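The arithmetic can be checked directly (an illustrative Python snippet, not part of the patent):

```python
# Error accumulation in a three-stage pipeline: the stages' accuracies multiply.
p = 0.90                 # accuracy of each of the three pipeline models
final = p ** 3           # detection x recall x ranking accuracies multiply
print(f"{final:.1%}")    # 72.9% -- the ~73% figure quoted above
```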
Disclosure of Invention
The invention aims to overcome at least one defect of the prior art by providing an error correction model for non-aligned text, together with training and error correction methods, to solve the problem that error accumulation in traditional pipeline text error correction schemes leads to large errors in the final result.
The technical solution adopted by the invention is as follows:
in a first aspect, the invention provides a non-aligned text error correction model, comprising an encoder model and a decoder model. The encoder model comprises a preprocessing module, a coded-word embedding module, and at least one coding layer; the decoder model comprises a phoneme extraction module, a decoded-word embedding module, a first decoding multi-head attention calculation module, and at least one decoding layer. The preprocessing module preprocesses and encodes the externally input original text S_o to obtain an initial text vector V_0 and outputs it to the coded-word embedding module. The coded-word embedding module converts the initial text vector V_0 into a first text vector E of a specified dimension and outputs E to the coding layer. The coding layer encodes the first text vector E to obtain a text feature vector M, and either outputs M as the first text vector E to the next coding layer or outputs M directly to a decoding layer of the decoder model. The phoneme extraction module extracts phoneme information from the externally input original text S_o, encodes the extracted phoneme information to obtain several initial phoneme vectors V, and outputs them to the decoded-word embedding module. The decoded-word embedding module converts each initial phoneme vector V into a first phoneme vector e of a specified dimension and outputs the several first phoneme vectors e to the first decoding multi-head attention calculation module. The first decoding multi-head attention calculation module performs multi-head self-attention calculation on each first phoneme vector e to obtain several second phoneme vectors A and outputs them to the decoding layer. The decoding layer fuses the several second phoneme vectors A into a phoneme feature vector V_p, decodes by combining the text feature vector M and the phoneme feature vector V_p to obtain a decoding feature vector V_d, and either outputs V_d as one of the second phoneme vectors A to the next decoding layer or directly takes V_d as the error-corrected text of the original text S_o.
The non-aligned text error correction model provided by the invention consists of an encoder model and a decoder model. The error correction process requires no manual intervention: the input is the original text to be corrected, and the output of the final decoding layer is the error-corrected text. During error correction, the decoding layer performs fused decoding on the text features produced by the coding layer and the phoneme features produced inside the decoder model, and the resulting decoding feature vector serves as the error-corrected text of the original text. By fusing the text features and the phoneme features, the decoder takes both the semantic features and the pronunciation features of the text into account when correcting errors.
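For orientation, the overall wiring just described can be sketched in PyTorch as follows. This is a minimal sketch: module names, dimensions, and layer counts are assumptions, standard Transformer layers stand in for the patented coding/decoding layers, and the plain sum below is a stand-in for the learned phoneme fusion described later.

```python
import torch
import torch.nn as nn

class NonAlignedCorrector(nn.Module):
    def __init__(self, vocab_size, phoneme_vocab_size,
                 d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)           # coded-word embedding
        self.phone_embed = nn.Embedding(phoneme_vocab_size, d_model)  # decoded-word embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)                                      # stacked coding layers
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)                                      # stacked decoding layers
        self.out = nn.Linear(d_model, vocab_size)                     # corrected-text logits

    def forward(self, text_ids, initial_ids, final_ids):
        M = self.encoder(self.text_embed(text_ids))                   # text feature vector M
        # Plain sum as a stand-in for the learned fusion V_p = A_i W_i + A_f W_f.
        V_p = self.phone_embed(initial_ids) + self.phone_embed(final_ids)
        V_d = self.decoder(tgt=V_p, memory=M)                         # fuses M and V_p
        return self.out(V_d)
```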
Further, the coding layer comprises an encoding multi-head attention calculation module, a first encoding normalization module, an encoding forward propagation module, and a second encoding normalization module. The encoding multi-head attention calculation module performs multi-head self-attention calculation on the first text vector E to obtain a second text vector a and outputs it to the first encoding normalization module. The first encoding normalization module normalizes the second text vector a to obtain a third text vector V_a and outputs it to the encoding forward propagation module. The encoding forward propagation module performs forward propagation processing on the third text vector V_a to obtain a fourth text vector V_f and transmits it to the second encoding normalization module. The second encoding normalization module normalizes the fourth text vector V_f to obtain the text feature vector M, and either outputs M as the first text vector E to the next coding layer or outputs M directly to the decoding layer.
In the coding layer, the multi-head attention mechanism, normalization, and forward propagation together extract an effective text feature vector from the original text, and repeated processing by multiple coding layers yields a more accurate text feature vector.
Further, the decoding layer comprises a vector fusion module, a second decoding multi-head attention calculation module, a first decoding normalization module, a decoding forward propagation module, and a second decoding normalization module. The vector fusion module fuses the several second phoneme vectors A to obtain the phoneme feature vector V_p and outputs it to the second decoding multi-head attention calculation module. The second decoding multi-head attention calculation module combines the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtains a fusion attention vector N, and outputs it to the first decoding normalization module. The first decoding normalization module normalizes the fusion attention vector N to obtain a first decoding vector V_A and outputs it to the decoding forward propagation module. The decoding forward propagation module performs forward propagation processing on the first decoding vector V_A to obtain a second decoding vector V_F and transmits it to the second decoding normalization module. The second decoding normalization module normalizes the second decoding vector V_F to obtain the decoding feature vector V_d, and either outputs V_d as one of the second phoneme vectors A to the next decoding layer or directly takes V_d as the error-corrected text of the original text S_o.
In the decoding layer, the several second phoneme vectors of the original text, extracted via the multi-head attention mechanism, are first fused into a phoneme feature vector. The multi-head attention mechanism then fuses the phoneme feature vector with the text feature vector into a fusion attention vector that contains both the textual features and the phonemic features of the text, so the decoding layer considers both kinds of features during error correction. Finally, normalization and forward propagation turn this fusion attention vector into the decoding feature vector, and repeated processing by multiple decoding layers yields a more accurate decoding feature vector to serve as the error-corrected text.
Further, the second decoding multi-head attention calculation module being used for combining the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtain a fusion attention vector N, and output it to the first decoding normalization module specifically comprises: the second decoding multi-head attention calculation module combines the text feature vector M and the phoneme feature vector V_p according to the formula

$$N = \mathrm{softmax}\!\left(\frac{Q_3 K_1^{\top}}{\sqrt{d_1}}\right) V_1$$

to perform multi-head self-attention calculation, obtain the fusion attention vector N, and output it to the first decoding normalization module. Here K_1 and V_1 are linear transformations of the text feature vector M, computed as K_1 = M W_k and V_1 = M W_v, where W_k and W_v are training parameters of the non-aligned text error correction model; Q_3 is a linear transformation of the phoneme feature vector V_p, computed as Q_3 = V_p W_p, where W_p is a training parameter of the non-aligned text error correction model; d_1 is the dimension of K_1; and K_1^T is the transposed matrix of K_1.
In the decoding layer, when the multi-head attention mechanism produces the fusion attention vector, K and V are linear transformations of the text feature vector while Q is a linear transformation of the phoneme feature vector, so the phoneme features drive the attention over the text features. The training parameters in these linear transformations are all tuned to their optimal values during training of the non-aligned text error correction model.
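A single-head version of this fused attention can be read off directly from the formula above. This is an illustrative sketch: W_k, W_v, and W_p stand for the learned training parameters, and the real module replicates the computation across several heads and concatenates the results.

```python
import torch
import torch.nn.functional as F

def fused_attention(M, V_p, W_k, W_v, W_p):
    K1 = M @ W_k                                  # K1 = M W_k
    V1 = M @ W_v                                  # V1 = M W_v
    Q3 = V_p @ W_p                                # Q3 = V_p W_p
    d1 = K1.size(-1)
    scores = Q3 @ K1.transpose(-2, -1) / d1 ** 0.5
    return F.softmax(scores, dim=-1) @ V1         # fusion attention vector N

M = torch.randn(1, 6, 512)                        # text features (batch, text len, width)
V_p = torch.randn(1, 8, 512)                      # phoneme features (batch, phoneme len, width)
W_k, W_v, W_p = (torch.randn(512, 512) for _ in range(3))
print(fused_attention(M, V_p, W_k, W_v, W_p).shape)   # torch.Size([1, 8, 512])
```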
Further, the phoneme information comprises pinyin initial information and pinyin final information. The several initial phoneme vectors V accordingly include an initial-consonant phoneme vector V_i and a final phoneme vector V_f; the several first phoneme vectors e include a first initial phoneme vector e_i and a first final phoneme vector e_f; and the several second phoneme vectors A include a second initial phoneme vector A_i and a second final phoneme vector A_f.
Phoneme information represents the pronunciation characteristics of the text. Here it takes the pinyin initial and pinyin final of each character as the basic pronunciation features; these are then encoded into phoneme vectors by the modules and decoding layers of the decoder model and fused with the text feature vector.
Further, the vector fusion module fuses the several second phoneme vectors A according to the formula

$$V_p = A_i W_i + A_f W_f$$

to obtain the phoneme feature vector V_p and outputs it to the second decoding multi-head attention calculation module, where W_i and W_f are training parameters of the non-aligned text error correction model.
Furthermore, residual network connections are used between the encoding multi-head attention calculation module and the first encoding normalization module, and between the encoding forward propagation module and the second encoding normalization module. Likewise, residual network connections are used between the second decoding multi-head attention calculation module and the first decoding normalization module, and between the decoding forward propagation module and the second decoding normalization module.
Connecting each multi-head attention calculation module to its normalization module, and each forward propagation module to its normalization module, through residual networks improves the generalization ability of the non-aligned text error correction model.
In a second aspect, the invention provides a training method for the non-aligned text error correction model, comprising: constructing a training data set and randomly deleting, replacing, and/or repeating the content of each sample in the training data set to obtain a preprocessed training data set; initializing a neural network model composed of an encoder and a decoder, and feeding the training data set into the neural network model in batches for training until the value of the loss function no longer decreases significantly, yielding the non-aligned text error correction model.
In a third aspect, the invention provides an error correction method for non-aligned text, comprising: inputting the original text to be processed into the above non-aligned text error correction model, so that the model corrects the original text and outputs its error-corrected text.
Compared with the prior art, the invention has the following beneficial effects:
the non-aligned text error correction model provided by the invention comprises a decoder model and an encoder model; the input of the whole model is the original text, the output is the error-corrected text, and every error correction step is contained within the model. Each processing stage can therefore be corrected and optimized during training of the end-to-end model, avoiding the error accumulation problem of the traditional pipeline. The model obtains more accurate and effective features by stacking multiple coding and decoding layers, and corrects the original text by jointly considering its semantic features and pronunciation features, effectively improving error correction accuracy.
Drawings
Fig. 1 is a schematic diagram of a module composition of a non-aligned text error correction model in embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the module composition of the coding layer and the decoding layer in embodiment 1 of the present invention.
Fig. 3 is a diagram illustrating specific data transmission of the phoneme extraction module 210 in the decoder model 200 according to embodiment 1 of the present invention.
FIG. 4 is a flowchart illustrating steps S210-S230 of the training method in embodiment 2 of the present invention.
Fig. 5 is a schematic flow chart of the training process and the inference stage in embodiments 2 and 3 of the present invention.
Fig. 6 is a flowchart illustrating the step S310 of the error correction method in embodiment 3 of the present invention.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For the purpose of better illustrating the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
This embodiment provides a non-aligned text error correction model. It is an end-to-end error correction model built on an encoder-decoder structure; the whole text error correction process is contained in a single end-to-end neural network model, which avoids the error accumulation problem of traditional pipeline text error correction models.
As shown in fig. 1, the non-aligned text error correction model includes an encoder model 100 and a decoder model 200.
The encoder model 100 includes a preprocessing module 110, a coded word embedding module 120, and at least one coding layer 130.
In the present embodiment, the encoder model 100 is implemented based on a variant of the Transformer architecture (a network structure composed entirely of attention mechanisms), such as the BERT model (Bidirectional Encoder Representations from Transformers), the DistilBERT model, or the RoBERTa model.
The preprocessing module 110 preprocesses and encodes the externally input original text S_o to obtain an initial text vector V_0, which it outputs to the coded-word embedding module 120.

The original text S_o is the text to be corrected. Preprocessing converts the externally input original text S_o into a form compatible with the data type and length that the encoder model 100 can process; in this embodiment it specifically means segmenting the original text S_o, i.e., splitting it into a text sequence. After preprocessing, each element of the text sequence is encoded against a vocabulary so that the unstructured text is converted into structured information: each element is mapped to a corresponding vector, and these vectors together form the initial text vector V_0 passed to the other modules of the encoder model 100. More specifically, the encoding uses one-hot encoding, which represents N states with an N-bit state register.
The coded-word embedding module 120 converts the initial text vector V_0 into a first text vector E of a specified dimension and outputs E to the coding layer.

Word embedding maps the high-dimensional space whose size equals the vocabulary into a continuous vector space of much lower dimension, so that every word or phrase corresponds to a real-valued vector. The coded-word embedding module 120 thus converts the initial text vector V_0 into a first text vector E whose dimension is lower than that of V_0.
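To make the two steps concrete, here is a toy sketch; the vocabulary, ids, and dimensions are invented for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "再": 1, "见": 2}
ids = torch.tensor([[vocab["再"], vocab["见"]]])              # structured V_0 as indices

one_hot = nn.functional.one_hot(ids, num_classes=len(vocab))  # sparse one-hot form of V_0
embed = nn.Embedding(len(vocab), embedding_dim=8)             # coded-word embedding module
E = embed(ids)                                                # dense first text vector E
print(one_hot.shape, E.shape)   # torch.Size([1, 2, 3]) torch.Size([1, 2, 8])
```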
The coding layer 130 encodes the first text vector E to obtain the text feature vector M, and either outputs M as the first text vector E to the next coding layer or outputs M directly to the decoder model 200.

In a specific embodiment there are several coding layers 130. Each coding layer 130 encodes the first text vector E into a text feature vector M, which becomes the input of the next coding layer 130; that layer continues encoding the text feature vector M from the previous layer into a new text feature vector M. Through this repeated encoding across multiple coding layers 130, a more accurate text feature vector M is obtained that effectively characterizes the original text S_o at the textual level.
Specifically, as shown in fig. 2, the encoding layer 130 includes an encoding multi-head attention calculation module 131, a first encoding normalization module 132, an encoding forward propagation module 133, and a second encoding normalization module 134.
The encoded multi-head attention calculation module 131 is used for the first text vectorEPerforming multi-head self-attention calculation to obtain a second text vectoraAnd outputs it to the first code normalization module.
Multi-headed self-attention computation refers to inputting an initial vector into a plurality of parallel attention-based computation modules. The encoding multi-head attention calculation module 131 is formed by combining and connecting a plurality of parallel attention calculation modules in parallel.
Specifically, each attention calculation module in the encoding multi-head attention calculation module 131 performs attention calculation on the first text vector E according to the formula

$$a = \mathrm{softmax}\!\left(\frac{Q_1 K_1^{\top}}{\sqrt{d_1}}\right) V_1$$

Each attention calculation module computes its result independently, and the results of all attention calculation modules are finally concatenated to obtain the second text vector a. Here K_1 and V_1 are linear transformations of the first text vector E, computed as K_1 = E W_k and V_1 = E W_v, where W_k and W_v are training parameters of the non-aligned text error correction model; Q_1 is likewise a linear transformation of E, computed as Q_1 = E W_q, where W_q is a training parameter of the model; d_1 is the dimension of K_1; and K_1^T is the transposed matrix of K_1.
The first encoding normalization module 132 normalizes the second text vector a to obtain a third text vector V_a and outputs it to the encoding forward propagation module 133.

Normalization, also called data standardization, limits the data to a fixed range and converts dimensional data into dimensionless data, which makes the third text vector V_a easier to fuse with the phoneme vectors during subsequent decoding and eliminates the adverse effect of ill-scaled values.

In a preferred embodiment, the encoding multi-head attention calculation module 131 and the first encoding normalization module 132 are connected through a residual network, improving the generalization ability of the non-aligned text error correction model provided in this embodiment.

The encoding forward propagation module 133 performs forward propagation processing on the third text vector V_a to obtain a fourth text vector V_f and transmits it to the second encoding normalization module 134.

Forward propagation processing means that, in a neural network, information flows directly from one layer of neurons to the next until the output; in this embodiment it can be implemented by a fully connected layer.

The second encoding normalization module 134 normalizes the fourth text vector V_f to obtain the text feature vector M, and either outputs M as the first text vector E to the next coding layer or outputs M directly to the decoder model 200.

In a preferred embodiment, the encoding forward propagation module 133 and the second encoding normalization module 134 are also connected through a residual network, further improving the generalization ability of the non-aligned text error correction model provided in this embodiment.
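Assembling modules 131-134 into one coding layer might look like the following sketch; the width, head count, feed-forward size, and activation are assumptions, and the residual connections correspond to the preferred embodiment above.

```python
import torch.nn as nn

class EncodingLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # module 131
        self.norm1 = nn.LayerNorm(d_model)                                     # module 132
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))                     # module 133
        self.norm2 = nn.LayerNorm(d_model)                                     # module 134

    def forward(self, E):
        a, _ = self.attn(E, E, E)        # second text vector a
        V_a = self.norm1(E + a)          # residual + norm -> third text vector V_a
        V_f = self.ffn(V_a)              # fourth text vector V_f
        return self.norm2(V_a + V_f)     # residual + norm -> text feature vector M
```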
In the present embodiment, as shown in fig. 2, the decoder model 200 includes a phoneme extraction module 210, a decoded-word embedding module 220, a first decoding multi-head attention calculation module 230, and at least one decoding layer 240.
The phoneme extraction module 210 extracts phoneme information from the externally input original text S_o, encodes the extracted phoneme information to obtain several initial phoneme vectors V, and outputs them to the decoded-word embedding module 220.

Phoneme information is any information that can represent the pronunciation of the original text S_o; for example, it may be the pinyin of the original text S_o, phonetic symbols, or any other suitable notation for the pronunciation of S_o.
In this embodiment, the phoneme information specifically refers to the pinyin initial and pinyin final of each character in the original text S_o. The phoneme extraction module 210 converts each character of the original text S_o into pinyin to generate a pinyin sequence P_o; for example, the text S_o "再见" generates the pinyin sequence P_o "zaijian". As shown in FIG. 3, the phoneme extraction module 210 splits the pinyin sequence P_o into a pinyin initial sequence P_i and a pinyin final sequence P_f; continuing the example, for the pinyin sequence P_o "zaijian", the initial sequence P_i is "z j" and the final sequence P_f is "ai ian". The phoneme extraction module encodes the initial sequence P_i and the final sequence P_f separately to obtain an initial-consonant phoneme vector V_i and a final phoneme vector V_f, which are input to the decoded-word embedding module 220 as the initial phoneme vectors V.
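The initial/final split in this example can be reproduced with the pypinyin library (`pip install pypinyin`); this is one possible tool, not an implementation prescribed by the patent, and it uses the library's default strict splitting, under which zero-initial syllables yield an empty initial.

```python
from pypinyin import pinyin, Style

text = "再见"
initials = [p[0] for p in pinyin(text, style=Style.INITIALS)]  # ['z', 'j']
finals = [p[0] for p in pinyin(text, style=Style.FINALS)]      # ['ai', 'ian']
print(initials, finals)
```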
The decoded-word embedding module 220 converts each of the several initial phoneme vectors V into a first phoneme vector e of a specified dimension and outputs the several first phoneme vectors e to the first decoding multi-head attention calculation module 230.

The first decoding multi-head attention calculation module 230 performs multi-head self-attention calculation on each of the several first phoneme vectors e to obtain several second phoneme vectors A and outputs them to the decoding layer.

The first decoding multi-head attention calculation module 230 is composed of several attention calculation modules connected in parallel.
Each attention calculation module in the first decoding multi-head attention calculation module 230 performs attention calculation on each first phoneme vector e, i.e., separately on the first initial phoneme vector e_i and the first final phoneme vector e_f, according to the formula

$$A = \mathrm{softmax}\!\left(\frac{Q_2 K_2^{\top}}{\sqrt{d_2}}\right) V_2$$

Each attention calculation module computes its result independently, and the results are finally concatenated to obtain the second initial phoneme vector A_i corresponding to e_i and the second final phoneme vector A_f corresponding to e_f. Here K_2 and V_2 are linear transformations of the first initial phoneme vector e_i or of the first final phoneme vector e_f, computed as K_2 = e_i W_k and V_2 = e_i W_v, or as K_2 = e_f W_k and V_2 = e_f W_v, where W_k and W_v are training parameters of the non-aligned text error correction model; Q_2 is the corresponding linear transformation of e_i or e_f, computed as Q_2 = e_i W_q or Q_2 = e_f W_q, where W_q is a training parameter of the model; d_2 is the dimension of K_2; and K_2^T is the transposed matrix of K_2.
The decoding layer 240 fuses the several second phoneme vectors A, i.e., the second initial phoneme vector A_i and the second final phoneme vector A_f, into the phoneme feature vector V_p, decodes by combining the text feature vector M and the phoneme feature vector V_p to obtain the decoding feature vector V_d, and either outputs V_d as one of the second phoneme vectors A to the next decoding layer or directly takes V_d as the error-corrected text of the original text S_o.

In a specific embodiment there are several decoding layers 240. Each decoding layer 240 fuses the several second phoneme vectors A into a phoneme feature vector V_p, then combines the text feature vector M and the phoneme feature vector V_p to obtain a decoding feature vector V_d, which becomes the input of the next decoding layer 240; that layer continues decoding on the basis of the decoding feature vector V_d from the previous layer to obtain a new decoding feature vector V_d. Through this repeated processing across multiple decoding layers 240, a more accurate decoding feature vector V_d is obtained to serve as the error-corrected text of the original text S_o.
In a specific embodiment, as shown in fig. 2, the decoding layer 240 includes a vector fusion module 241, a second decoding multi-head attention calculation module 242, a first decoding normalization module 243, a decoding forward propagation module 244, and a second decoding normalization module 245.
The vector fusion module 241 fuses the second initial phoneme vector A_i and the second final phoneme vector A_f to obtain the phoneme feature vector V_p and outputs it to the second decoding multi-head attention calculation module 242.

Specifically, the vector fusion module 241 fuses the second initial phoneme vector A_i and the second final phoneme vector A_f according to the formula

$$V_p = A_i W_i + A_f W_f$$

where W_i and W_f are training parameters of the non-aligned text error correction model.
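In tensor terms the fusion is two matrix multiplications and an addition; the batch size, sequence length, and width below are assumed for illustration.

```python
import torch

A_i = torch.randn(1, 8, 512)      # second initial phoneme vectors
A_f = torch.randn(1, 8, 512)      # second final phoneme vectors
W_i = torch.randn(512, 512)       # learned fusion parameter
W_f = torch.randn(512, 512)       # learned fusion parameter
V_p = A_i @ W_i + A_f @ W_f       # phoneme feature vector V_p
print(V_p.shape)                  # torch.Size([1, 8, 512])
```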
The second decoding multi-head attention calculation module 242 combines the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtains the fusion attention vector N, and outputs it to the first decoding normalization module 243.

The second decoding multi-head attention calculation module 242 is composed of several attention calculation modules connected in parallel.

Each attention calculation module in the second decoding multi-head attention calculation module 242 performs attention calculation combining the text feature vector M and the phoneme feature vector V_p according to the formula

$$N = \mathrm{softmax}\!\left(\frac{Q_3 K_1^{\top}}{\sqrt{d_1}}\right) V_1$$

Each attention calculation module computes its result independently, and the results are finally concatenated to obtain the fusion attention vector N. Here K_1 and V_1 are linear transformations of the text feature vector M, computed as K_1 = M W_k and V_1 = M W_v, where W_k and W_v are training parameters of the non-aligned text error correction model; Q_3 is a linear transformation of the phoneme feature vector V_p, computed as Q_3 = V_p W_p, where W_p is a training parameter of the model; d_1 is the dimension of K_1; and K_1^T is the transposed matrix of K_1.
The first decoding normalization module 243 normalizes the fusion attention vector N to obtain a first decoding vector V_A and outputs it to the decoding forward propagation module 244.

The decoding forward propagation module 244 performs forward propagation processing on the first decoding vector V_A to obtain a second decoding vector V_F and transmits it to the second decoding normalization module 245.

In this embodiment, the forward propagation processing can be implemented by one fully connected layer.

The second decoding normalization module 245 normalizes the second decoding vector V_F to obtain the decoding feature vector V_d, and either outputs V_d as one of the second phoneme vectors A to the next decoding layer or directly takes V_d as the error-corrected text of the original text S_o.
In a specific embodiment, when the decoding feature vector V_d is output as one of the second phoneme vectors A to the next decoding layer, the vector fusion module 241 of that next decoding layer 240 fuses the second initial phoneme vector A_i and the second final phoneme vector A_f together with the decoding feature vector V_d output by the previous layer, using a linear combination of the three vectors analogous to the two-vector fusion formula above.
The non-aligned text error correction model provided by this embodiment comprises a decoder model and an encoder model whose neural network parameters are all updated simultaneously during training. The model's input is the original text and its output is the error-corrected text; phoneme extraction, phoneme encoding, language encoding, feature fusion, and decoding of the original text are all contained within the error correction model, so every processing stage is corrected and optimized during training of the end-to-end model. This ensures the accuracy of sentences corrected with the trained model and eliminates the error accumulation problem of pipeline processing. Meanwhile, the model obtains more accurate and effective features by stacking multiple coding and decoding layers, and during decoding it fuses the text feature vector generated by the encoder model with the phoneme vectors generated in the decoder model for the same original text; that is, it corrects the original text by jointly considering its semantic features and pronunciation features, effectively improving error correction accuracy.
Example 2
Based on the same concept as that of embodiment 1, this embodiment provides a training method for a non-aligned text correction model, which is shown in fig. 4 and 5, and includes the following steps:
S210, constructing a training data set;

in this step, the training data set is constructed by obtaining a number of original texts and their corresponding corrected texts; each original text and its corrected text form a sentence pair, i.e., one sample. After construction, the data set can be split into a training set, a validation set, and a test set according to a preset ratio, where the training set is used to train the non-aligned text error correction model and the validation and test sets are used to validate and test the model after training. The preset ratio can be 8:1:1 and may be adjusted to the actual deployment scenario.
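A minimal sketch of the 8:1:1 split described above; the ratio and seed are the adjustable choices mentioned in the text.

```python
import random

def split_dataset(pairs, seed=42):
    """pairs: list of (original_text, corrected_text) sentence pairs."""
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    train = pairs[: int(0.8 * n)]
    val = pairs[int(0.8 * n): int(0.9 * n)]
    test = pairs[int(0.9 * n):]
    return train, val, test
```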
S220, randomly deleting, replacing and/or repeating the content of each sample in the training data set to obtain a preprocessed training data set;
in this step, randomly deleting, replacing, and/or repeating the content of each sample in the training data set helps the error correction model recognize texts of various kinds and improves its generalization ability.

The three operations of deleting, replacing, and repeating sample content can be applied selectively according to the actual situation.

Specifically, random deletion deletes each character in a sample with probability p_0, with the number of deleted characters capped at 30% of the sentence length (the proportion can be adjusted to the actual situation). Random replacement replaces each character with probability p_1 by a homophone or near-homophone, with the number of replaced characters capped at 30% of the sentence length. Random repetition repeats each character with probability p_2, inserting the copy at the current position, with the number of repeated characters capped at 30% of the sentence length.
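The three corruption operations might be sketched as follows; the probabilities, the 30% cap, and the tiny homophone table are illustrative placeholders (a real system would use a proper confusion set).

```python
import random

HOMOPHONES = {"在": "再", "见": "建"}   # hypothetical confusion table

def corrupt(text, p_del=0.05, p_rep=0.05, p_dup=0.05, max_ratio=0.3):
    budget = int(len(text) * max_ratio)  # per-operation cap on changed characters
    out, n_del, n_rep, n_dup = [], 0, 0, 0
    for ch in text:
        r = random.random()
        if r < p_del and n_del < budget:
            n_del += 1                    # random deletion: drop the character
            continue
        if r < p_del + p_rep and ch in HOMOPHONES and n_rep < budget:
            out.append(HOMOPHONES[ch])    # random replacement with a homophone
            n_rep += 1
            continue
        out.append(ch)
        if r > 1 - p_dup and n_dup < budget:
            out.append(ch)                # random repetition at the current position
            n_dup += 1
    return "".join(out)
```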
S230, initializing a neural network model composed of an encoder and a decoder, and feeding the training set into the neural network model in batches for training until the value of the loss function no longer decreases significantly, yielding the non-aligned text error correction model described in embodiment 1.

In this step, the neural network parameters trained and updated during training are the six parameters W_f, W_i, W_k, W_v, W_q, and W_p described in embodiment 1.

In a specific embodiment, the error correction model can use the per-character cross entropy as the loss function during training, computing the loss between the output sequence and the target sequence at each position in turn and summing them to obtain the final loss. Meanwhile, the Adam (Adaptive Moment Estimation) optimization algorithm serves as the training optimizer, combined with learning rate warmup and decay strategies, to update the model parameters until the value of the loss function no longer decreases significantly.
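A training step matching this description might look like the following sketch; the dataloader contract and hyperparameters are assumptions, warmup here is linear, and decay is omitted for brevity. The model follows the forward signature assumed in the earlier sketches.

```python
import torch
import torch.nn as nn

def train(model, loader, vocab_size, steps=10000, warmup=1000, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda s: min(1.0, (s + 1) / warmup))        # learning-rate warmup
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)         # 0 assumed to be padding
    step = 0
    while step < steps:
        for text_ids, initial_ids, final_ids, target_ids in loader:
            logits = model(text_ids, initial_ids, final_ids)
            # Per-position cross entropy, summed/averaged over the sequence.
            loss = loss_fn(logits.view(-1, vocab_size), target_ids.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
            step += 1
            if step >= steps:
                return
```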
Example 3
Based on the same concept as that of embodiment 1, this embodiment provides a method for correcting a non-aligned text, which is shown in fig. 5 and 6 and includes the following steps:
S310, inputting the original text to be processed into the non-aligned text error correction model described in embodiment 1, so that the model corrects the original text to be processed and outputs its error-corrected text.
It should be understood that the above embodiments of the present invention are only examples that clearly illustrate its technical solutions and do not limit its specific implementations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the claims of the present invention shall fall within their scope of protection.

Claims (9)

1. A non-aligned text error correction model, comprising: an encoder model and a decoder model;
the encoder model comprises a preprocessing module, a coded-word embedding module, and at least one coding layer;
the decoder model comprises a phoneme extraction module, a decoded-word embedding module, a first decoding multi-head attention calculation module, and at least one decoding layer;
the preprocessing module is used for preprocessing and encoding an externally input original text S_o to obtain an initial text vector V_0 and outputting it to the coded-word embedding module;
the coded-word embedding module is used for converting the initial text vector V_0 into a first text vector E of a specified dimension and outputting the first text vector E to the coding layer;
the coding layer is used for encoding the first text vector E to obtain a text feature vector M, and for outputting the text feature vector M as a first text vector E to the next coding layer or directly outputting the text feature vector M to a decoding layer of the decoder model;
the phoneme extraction module is used for extracting phoneme information from the externally input original text S_o and encoding the extracted phoneme information to obtain several initial phoneme vectors V, and outputting them to the decoded-word embedding module;
the decoded-word embedding module is used for converting the several initial phoneme vectors V into first phoneme vectors e of a specified dimension and outputting the several first phoneme vectors e to the first decoding multi-head attention calculation module;
the first decoding multi-head attention calculation module is used for performing multi-head self-attention calculation on each of the several first phoneme vectors e to obtain several second phoneme vectors A and outputting them to the decoding layer;
the decoding layer comprises a vector fusion module, a second decoding multi-head attention calculation module, a first decoding normalization module, a decoding forward propagation module, and a second decoding normalization module;
the vector fusion module is used for fusing the several second phoneme vectors A to obtain the phoneme feature vector V_p and outputting it to the second decoding multi-head attention calculation module;
the second decoding multi-head attention calculation module is used for combining the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtaining a fusion attention vector N, and outputting it to the first decoding normalization module;
the first decoding normalization module is used for normalizing the fusion attention vector N to obtain a first decoding vector V_A and outputting it to the decoding forward propagation module;
the decoding forward propagation module is used for performing forward propagation processing on the first decoding vector V_A to obtain a second decoding vector V_F and transmitting it to the second decoding normalization module;
the second decoding normalization module is used for normalizing the second decoding vector V_F to obtain a decoding feature vector V_d, and for outputting the decoding feature vector V_d as one of the second phoneme vectors A to the next decoding layer or directly taking the decoding feature vector V_d as the error-corrected text of the original text S_o.
2. The non-aligned text error correction model of claim 1, wherein
the coding layer comprises an encoding multi-head attention calculation module, a first encoding normalization module, an encoding forward propagation module, and a second encoding normalization module;
the encoding multi-head attention calculation module is used for performing multi-head self-attention calculation on the first text vector E to obtain a second text vector a and outputting it to the first encoding normalization module;
the first encoding normalization module is used for normalizing the second text vector a to obtain a third text vector V_a and outputting it to the encoding forward propagation module;
the encoding forward propagation module is used for performing forward propagation processing on the third text vector V_a to obtain a fourth text vector V_f and transmitting it to the second encoding normalization module;
the second encoding normalization module is used for normalizing the fourth text vector V_f to obtain the text feature vector M, and for outputting it as the first text vector E to the next coding layer or directly outputting the text feature vector M to the decoding layer.
3. The non-aligned text error correction model of claim 1, wherein the second decoding multi-head attention calculation module being used for combining the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtain a fusion attention vector N, and output it to the first decoding normalization module specifically comprises:
the second decoding multi-head attention calculation module is used for combining the text feature vector M and the phoneme feature vector V_p according to the formula

$$N = \mathrm{softmax}\!\left(\frac{Q_3 K_1^{\top}}{\sqrt{d_1}}\right) V_1$$

to perform multi-head self-attention calculation, obtain the fusion attention vector N, and output it to the first decoding normalization module; wherein K_1 and V_1 are linear transformations of the text feature vector M, computed as K_1 = M W_k and V_1 = M W_v, W_k and W_v being training parameters of the non-aligned text error correction model; Q_3 is a linear transformation of the phoneme feature vector V_p, computed as Q_3 = V_p W_p, W_p being a training parameter of the non-aligned text error correction model; d_1 is the dimension of K_1; and K_1^T is the transposed matrix of K_1.
4. The non-aligned text error correction model of claim 1, wherein
the phoneme information comprises pinyin initial information and pinyin final information;
the several initial phoneme vectors V include an initial-consonant phoneme vector V_i and a final phoneme vector V_f;
accordingly, the several first phoneme vectors e include a first initial phoneme vector e_i and a first final phoneme vector e_f;
accordingly, the several second phoneme vectors A include a second initial phoneme vector A_i and a second final phoneme vector A_f.
5. The non-aligned text error correction model of claim 4, wherein
the vector fusion module is used for fusing the several second phoneme vectors A according to the formula

$$V_p = A_i W_i + A_f W_f$$

to obtain the phoneme feature vector V_p and outputting it to the second decoding multi-head attention calculation module; wherein W_i and W_f are training parameters of the non-aligned text error correction model.
6. The non-aligned text error correction model of claim 2, wherein
residual network connections are used between the encoding multi-head attention calculation module and the first encoding normalization module, and between the encoding forward propagation module and the second encoding normalization module.
7. The non-aligned text error correction model of any one of claims 1 to 5, wherein
residual network connections are used between the second decoding multi-head attention calculation module and the first decoding normalization module, and between the decoding forward propagation module and the second decoding normalization module.
8. A training method for a non-aligned text error correction model, comprising:
constructing a training data set, and randomly deleting, replacing, and repeating the content of each sample in the training data set to obtain a preprocessed training data set;
initializing a neural network model composed of an encoder and a decoder, and inputting the training data set into the neural network model in batches for training until the value of the loss function of the neural network model no longer decreases significantly, to obtain the non-aligned text error correction model according to any one of claims 1 to 7.
9. An error correction method for non-aligned text, comprising:
inputting the original text to be processed into the non-aligned text error correction model according to any one of claims 1 to 7, so that the non-aligned text error correction model corrects the original text to be processed, and outputting the error-corrected text of the original text to be processed.
CN202210696857.6A 2022-06-20 2022-06-20 Error correction model, training and error correction method for non-aligned text Active CN114781377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210696857.6A CN114781377B (en) 2022-06-20 2022-06-20 Error correction model, training and error correction method for non-aligned text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210696857.6A CN114781377B (en) 2022-06-20 2022-06-20 Error correction model, training and error correction method for non-aligned text

Publications (2)

Publication Number Publication Date
CN114781377A (en) 2022-07-22
CN114781377B (en) 2022-09-09

Family

ID=82420349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210696857.6A Active CN114781377B (en) 2022-06-20 2022-06-20 Error correction model, training and error correction method for non-aligned text

Country Status (1)

Country Link
CN (1) CN114781377B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665675B (en) * 2023-07-25 2023-12-12 上海蜜度信息技术有限公司 Voice transcription method, system, electronic equipment and storage medium
CN116991874B (en) * 2023-09-26 2024-03-01 海信集团控股股份有限公司 Text error correction and large model-based SQL sentence generation method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444479A (en) * 2022-04-11 2022-05-06 南京云问网络技术有限公司 End-to-end Chinese speech text error correction method, device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297833A (en) * 2020-02-21 2021-08-24 华为技术有限公司 Text error correction method and device, terminal equipment and computer storage medium
US11514888B2 (en) * 2020-08-13 2022-11-29 Google Llc Two-level speech prosody transfer
CN113409757A (en) * 2020-12-23 2021-09-17 腾讯科技(深圳)有限公司 Audio generation method, device, equipment and storage medium based on artificial intelligence
CN114373480A (en) * 2021-12-17 2022-04-19 腾讯音乐娱乐科技(深圳)有限公司 Training method of voice alignment network, voice alignment method and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444479A (en) * 2022-04-11 2022-05-06 南京云问网络技术有限公司 End-to-end Chinese speech text error correction method, device and storage medium

Also Published As

Publication number Publication date
CN114781377A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN114781377B (en) Error correction model, training and error correction method for non-aligned text
CN114444479B (en) End-to-end Chinese speech text error correction method, device and storage medium
CN109522403B (en) Abstract text generation method based on fusion coding
CN110765772A (en) Text neural network error correction model after Chinese speech recognition with pinyin as characteristic
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN113297841A (en) Neural machine translation method based on pre-training double-word vectors
CN111783477B (en) Voice translation method and system
CN113569562B (en) Method and system for reducing cross-modal and cross-language barriers of end-to-end voice translation
CN113327595B (en) Pronunciation deviation detection method and device and storage medium
CN115935957B (en) Sentence grammar error correction method and system based on syntactic analysis
WO2023193542A1 (en) Text error correction method and system, and device and storage medium
CN114662476A (en) Character sequence recognition method fusing dictionary and character features
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN112349288A (en) Chinese speech recognition method based on pinyin constraint joint learning
CN115602161A (en) Chinese speech enhancement recognition and text error correction method
CN117437909B (en) Speech recognition model construction method based on hotword feature vector self-attention mechanism
CN115270771B (en) Fine-grained self-adaptive Chinese spelling error correction method assisted by word-sound prediction task
CN115223549A (en) Vietnamese speech recognition corpus construction method
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN114239548A (en) Triple extraction method for merging dependency syntax and pointer generation network
CN115034236A (en) Chinese-English machine translation method based on knowledge distillation
CN114912441A (en) Text error correction model generation method, error correction method, system, device and medium
CN114005434A (en) End-to-end voice confidence calculation method, device, server and medium
US20240135089A1 (en) Text error correction method, system, device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant