CN114781377B - Error correction model, training and error correction method for non-aligned text - Google Patents

Error correction model, training and error correction method for non-aligned text

Info

Publication number
CN114781377B
Authority
CN
China
Prior art keywords
text
vector
module
decoding
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210696857.6A
Other languages
Chinese (zh)
Other versions
CN114781377A (en)
Inventor
许程冲
赵文博
肖清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom Guangdong Industrial Internet Co Ltd
Original Assignee
China Unicom Guangdong Industrial Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unicom Guangdong Industrial Internet Co Ltd filed Critical China Unicom Guangdong Industrial Internet Co Ltd
Priority to CN202210696857.6A
Publication of CN114781377A
Application granted
Publication of CN114781377B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units

Abstract

The invention provides an error correction model for non-aligned text, together with training and error correction methods. The model comprises an encoder model and a decoder model. The preprocessing module and coded-word embedding module of the encoder model produce the first text vector E and output it to a coding layer; the coding layer obtains a text feature vector and outputs it to a decoding layer of the decoder model. The phoneme extraction module, decoded-word embedding module, and first decoding multi-head attention calculation module of the decoder model output several second phoneme vectors to the decoding layer. The decoding layer fuses the second phoneme vectors into a phoneme feature vector, decodes by combining the text feature vector and the phoneme feature vector to obtain a decoding feature vector, and takes the decoding feature vector as the error-corrected text of the original text. Because every processing stage of text error correction is corrected and optimized during training of the end-to-end model, error accumulation is avoided and error correction accuracy is effectively improved.

Description

Error correction model, training and error correction method for non-aligned text
Technical Field
The invention relates to the field of text error correction, and in particular to an error correction model and to training and error correction methods for non-aligned text.
Background
Automatic Speech Recognition (ASR) is a basic task of intelligent speech in natural language processing, widely applied in scenarios such as intelligent customer service and intelligent outbound calling. In automatic speech recognition, the recognition result is often not accurate enough; for example, the recognized text may contain wrong characters, extra characters, or missing characters. The task that handles only wrong characters is called aligned text error correction, while the task that also handles extra and missing characters, so that input and output lengths may differ, is called non-aligned text error correction. Non-aligned text error correction can be applied to tasks such as spelling correction and speech recognition optimization, improving the accuracy of the resulting text.
Error correction of automatic speech recognition results is a critical task for downstream natural language processing services. Existing text error correction schemes generally adopt pipeline processing, i.e., three sequential steps: error detection, candidate recall, and candidate ranking. Error detection means detecting and locating the erroneous points in the text; candidate recall means recalling correct candidate words for those points; candidate ranking means ranking the recalled candidates with a ranking algorithm and replacing the erroneous point with the highest-scoring candidate. In existing schemes the three steps are realized by three independent models, but this pipeline inevitably makes each downstream model depend strongly on the result of its upstream model: when one model makes an error, that error keeps accumulating downstream, so the final result can be badly wrong. Assuming each model has accuracy $p$, the final error correction accuracy is $p^3$; if $p = 90\%$, the final accuracy is only $0.9^3 \approx 73\%$.
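The arithmetic can be checked directly (an illustrative Python snippet, not part of the patent):

```python
# Error accumulation in a three-stage pipeline: the stages' accuracies multiply.
p = 0.90                 # accuracy of each of the three pipeline models
final = p ** 3           # detection x recall x ranking accuracies multiply
print(f"{final:.1%}")    # 72.9% -- the ~73% figure quoted above
```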
Disclosure of Invention
The invention aims to overcome at least one defect of the prior art by providing an error correction model for non-aligned text, together with training and error correction methods, to solve the problem that error accumulation in traditional pipeline text error correction schemes leads to large errors in the final result.
The technical solution adopted by the invention is as follows:
in a first aspect, the invention provides a non-aligned text error correction model, comprising an encoder model and a decoder model. The encoder model comprises a preprocessing module, a coded-word embedding module, and at least one coding layer; the decoder model comprises a phoneme extraction module, a decoded-word embedding module, a first decoding multi-head attention calculation module, and at least one decoding layer. The preprocessing module preprocesses and encodes the externally input original text S_o to obtain an initial text vector V_0 and outputs it to the coded-word embedding module. The coded-word embedding module converts the initial text vector V_0 into a first text vector E of a specified dimension and outputs E to the coding layer. The coding layer encodes the first text vector E to obtain a text feature vector M, and either outputs M as the first text vector E to the next coding layer or outputs M directly to a decoding layer of the decoder model. The phoneme extraction module extracts phoneme information from the externally input original text S_o, encodes the extracted phoneme information to obtain several initial phoneme vectors V, and outputs them to the decoded-word embedding module. The decoded-word embedding module converts each initial phoneme vector V into a first phoneme vector e of a specified dimension and outputs the several first phoneme vectors e to the first decoding multi-head attention calculation module. The first decoding multi-head attention calculation module performs multi-head self-attention calculation on each first phoneme vector e to obtain several second phoneme vectors A and outputs them to the decoding layer. The decoding layer fuses the several second phoneme vectors A into a phoneme feature vector V_p, decodes by combining the text feature vector M and the phoneme feature vector V_p to obtain a decoding feature vector V_d, and either outputs V_d as one of the second phoneme vectors A to the next decoding layer or directly takes V_d as the error-corrected text of the original text S_o.
The non-aligned text error correction model provided by the invention consists of an encoder model and a decoder model. The error correction process requires no manual intervention: the input is the original text to be corrected, and the output of the final decoding layer is the error-corrected text. During error correction, the decoding layer performs fused decoding on the text features produced by the coding layer and the phoneme features produced inside the decoder model, and the resulting decoding feature vector serves as the error-corrected text of the original text. By fusing the text features and the phoneme features, the decoder takes both the semantic features and the pronunciation features of the text into account when correcting errors.
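For orientation, the overall wiring just described can be sketched in PyTorch as follows. This is a minimal sketch: module names, dimensions, and layer counts are assumptions, standard Transformer layers stand in for the patented coding/decoding layers, and the plain sum below is a stand-in for the learned phoneme fusion described later.

```python
import torch
import torch.nn as nn

class NonAlignedCorrector(nn.Module):
    def __init__(self, vocab_size, phoneme_vocab_size,
                 d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)           # coded-word embedding
        self.phone_embed = nn.Embedding(phoneme_vocab_size, d_model)  # decoded-word embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)                                      # stacked coding layers
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)                                      # stacked decoding layers
        self.out = nn.Linear(d_model, vocab_size)                     # corrected-text logits

    def forward(self, text_ids, initial_ids, final_ids):
        M = self.encoder(self.text_embed(text_ids))                   # text feature vector M
        # Plain sum as a stand-in for the learned fusion V_p = A_i W_i + A_f W_f.
        V_p = self.phone_embed(initial_ids) + self.phone_embed(final_ids)
        V_d = self.decoder(tgt=V_p, memory=M)                         # fuses M and V_p
        return self.out(V_d)
```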
Further, the coding layer comprises an encoding multi-head attention calculation module, a first encoding normalization module, an encoding forward propagation module, and a second encoding normalization module. The encoding multi-head attention calculation module performs multi-head self-attention calculation on the first text vector E to obtain a second text vector a and outputs it to the first encoding normalization module. The first encoding normalization module normalizes the second text vector a to obtain a third text vector V_a and outputs it to the encoding forward propagation module. The encoding forward propagation module performs forward propagation processing on the third text vector V_a to obtain a fourth text vector V_f and transmits it to the second encoding normalization module. The second encoding normalization module normalizes the fourth text vector V_f to obtain the text feature vector M, and either outputs M as the first text vector E to the next coding layer or outputs M directly to the decoding layer.
In the coding layer, the multi-head attention mechanism, normalization, and forward propagation together extract an effective text feature vector from the original text, and repeated processing by multiple coding layers yields a more accurate text feature vector.
Further, the decoding layer comprises a vector fusion module, a second decoding multi-head attention calculation module, a first decoding normalization module, a decoding forward propagation module, and a second decoding normalization module. The vector fusion module fuses the several second phoneme vectors A to obtain the phoneme feature vector V_p and outputs it to the second decoding multi-head attention calculation module. The second decoding multi-head attention calculation module combines the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtains a fusion attention vector N, and outputs it to the first decoding normalization module. The first decoding normalization module normalizes the fusion attention vector N to obtain a first decoding vector V_A and outputs it to the decoding forward propagation module. The decoding forward propagation module performs forward propagation processing on the first decoding vector V_A to obtain a second decoding vector V_F and transmits it to the second decoding normalization module. The second decoding normalization module normalizes the second decoding vector V_F to obtain the decoding feature vector V_d, and either outputs V_d as one of the second phoneme vectors A to the next decoding layer or directly takes V_d as the error-corrected text of the original text S_o.
In the decoding layer, the several second phoneme vectors of the original text, extracted via the multi-head attention mechanism, are first fused into a phoneme feature vector. The multi-head attention mechanism then fuses the phoneme feature vector with the text feature vector into a fusion attention vector that contains both the textual features and the phonemic features of the text, so the decoding layer considers both kinds of features during error correction. Finally, normalization and forward propagation turn this fusion attention vector into the decoding feature vector, and repeated processing by multiple decoding layers yields a more accurate decoding feature vector to serve as the error-corrected text.
Further, the second decoding multi-head attention calculation module being used for combining the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtain a fusion attention vector N, and output it to the first decoding normalization module specifically comprises: the second decoding multi-head attention calculation module combines the text feature vector M and the phoneme feature vector V_p according to the formula

$$N = \mathrm{softmax}\!\left(\frac{Q_3 K_1^{\top}}{\sqrt{d_1}}\right) V_1$$

to perform multi-head self-attention calculation, obtain the fusion attention vector N, and output it to the first decoding normalization module. Here K_1 and V_1 are linear transformations of the text feature vector M, computed as K_1 = M W_k and V_1 = M W_v, where W_k and W_v are training parameters of the non-aligned text error correction model; Q_3 is a linear transformation of the phoneme feature vector V_p, computed as Q_3 = V_p W_p, where W_p is a training parameter of the non-aligned text error correction model; d_1 is the dimension of K_1; and K_1^T is the transposed matrix of K_1.
In the decoding layer, when the multi-head attention mechanism produces the fusion attention vector, K and V are linear transformations of the text feature vector while Q is a linear transformation of the phoneme feature vector, so the phoneme features drive the attention over the text features. The training parameters in these linear transformations are all tuned to their optimal values during training of the non-aligned text error correction model.
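A single-head version of this fused attention can be read off directly from the formula above. This is an illustrative sketch: W_k, W_v, and W_p stand for the learned training parameters, and the real module replicates the computation across several heads and concatenates the results.

```python
import torch
import torch.nn.functional as F

def fused_attention(M, V_p, W_k, W_v, W_p):
    K1 = M @ W_k                                  # K1 = M W_k
    V1 = M @ W_v                                  # V1 = M W_v
    Q3 = V_p @ W_p                                # Q3 = V_p W_p
    d1 = K1.size(-1)
    scores = Q3 @ K1.transpose(-2, -1) / d1 ** 0.5
    return F.softmax(scores, dim=-1) @ V1         # fusion attention vector N

M = torch.randn(1, 6, 512)                        # text features (batch, text len, width)
V_p = torch.randn(1, 8, 512)                      # phoneme features (batch, phoneme len, width)
W_k, W_v, W_p = (torch.randn(512, 512) for _ in range(3))
print(fused_attention(M, V_p, W_k, W_v, W_p).shape)   # torch.Size([1, 8, 512])
```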
Further, the phoneme information comprises pinyin initial information and pinyin final information. The several initial phoneme vectors V accordingly include an initial-consonant phoneme vector V_i and a final phoneme vector V_f; the several first phoneme vectors e include a first initial phoneme vector e_i and a first final phoneme vector e_f; and the several second phoneme vectors A include a second initial phoneme vector A_i and a second final phoneme vector A_f.
Phoneme information represents the pronunciation characteristics of the text. Here it takes the pinyin initial and pinyin final of each character as the basic pronunciation features; these are then encoded into phoneme vectors by the modules and decoding layers of the decoder model and fused with the text feature vector.
Further, the vector fusion module fuses the several second phoneme vectors A according to the formula

$$V_p = A_i W_i + A_f W_f$$

to obtain the phoneme feature vector V_p and outputs it to the second decoding multi-head attention calculation module, where W_i and W_f are training parameters of the non-aligned text error correction model.
Furthermore, residual network connections are used between the encoding multi-head attention calculation module and the first encoding normalization module, and between the encoding forward propagation module and the second encoding normalization module. Likewise, residual network connections are used between the second decoding multi-head attention calculation module and the first decoding normalization module, and between the decoding forward propagation module and the second decoding normalization module.
Connecting each multi-head attention calculation module to its normalization module, and each forward propagation module to its normalization module, through residual networks improves the generalization ability of the non-aligned text error correction model.
In a second aspect, the invention provides a training method for the non-aligned text error correction model, comprising: constructing a training data set and randomly deleting, replacing, and/or repeating the content of each sample in the training data set to obtain a preprocessed training data set; initializing a neural network model composed of an encoder and a decoder, and feeding the training data set into the neural network model in batches for training until the value of the loss function no longer decreases significantly, yielding the non-aligned text error correction model.
In a third aspect, the invention provides an error correction method for non-aligned text, comprising: inputting the original text to be processed into the above non-aligned text error correction model, so that the model corrects the original text and outputs its error-corrected text.
Compared with the prior art, the invention has the following beneficial effects:
the non-aligned text error correction model provided by the invention comprises a decoder model and an encoder model; the input of the whole model is the original text, the output is the error-corrected text, and every error correction step is contained within the model. Each processing stage can therefore be corrected and optimized during training of the end-to-end model, avoiding the error accumulation problem of the traditional pipeline. The model obtains more accurate and effective features by stacking multiple coding and decoding layers, and corrects the original text by jointly considering its semantic features and pronunciation features, effectively improving error correction accuracy.
Drawings
Fig. 1 is a schematic diagram of a module composition of a non-aligned text error correction model in embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the module composition of the coding layer and the decoding layer in embodiment 1 of the present invention.
Fig. 3 is a diagram illustrating specific data transmission of the phoneme extraction module 210 in the decoder model 200 according to embodiment 1 of the present invention.
FIG. 4 is a flowchart illustrating steps S210-S230 of the training method in embodiment 2 of the present invention.
Fig. 5 is a schematic flow chart of the training process and the inference stage in embodiments 2 and 3 of the present invention.
Fig. 6 is a flowchart illustrating the step S310 of the error correction method in embodiment 3 of the present invention.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For the purpose of better illustrating the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
This embodiment provides a non-aligned text error correction model. It is an end-to-end error correction model built on an encoder-decoder structure; the whole text error correction process is contained in a single end-to-end neural network model, which avoids the error accumulation problem of traditional pipeline text error correction models.
As shown in fig. 1, the non-aligned text error correction model includes an encoder model 100 and a decoder model 200.
The encoder model 100 includes a preprocessing module 110, a coded word embedding module 120, and at least one coding layer 130.
In the present embodiment, the encoder model 100 is implemented based on a variant of the Transformer architecture (a network structure composed entirely of attention mechanisms), such as the BERT model (Bidirectional Encoder Representations from Transformers), the DistilBERT model, or the RoBERTa model.
The preprocessing module 110 preprocesses and encodes the externally input original text S_o to obtain an initial text vector V_0, which it outputs to the coded-word embedding module 120.

The original text S_o is the text to be corrected. Preprocessing converts the externally input original text S_o into a form compatible with the data type and length that the encoder model 100 can process; in this embodiment it specifically means segmenting the original text S_o, i.e., splitting it into a text sequence. After preprocessing, each element of the text sequence is encoded against a vocabulary so that the unstructured text is converted into structured information: each element is mapped to a corresponding vector, and these vectors together form the initial text vector V_0 passed to the other modules of the encoder model 100. More specifically, the encoding uses one-hot encoding, which represents N states with an N-bit state register.
The coded-word embedding module 120 converts the initial text vector V_0 into a first text vector E of a specified dimension and outputs E to the coding layer.

Word embedding maps the high-dimensional space whose size equals the vocabulary into a continuous vector space of much lower dimension, so that every word or phrase corresponds to a real-valued vector. The coded-word embedding module 120 thus converts the initial text vector V_0 into a first text vector E whose dimension is lower than that of V_0.
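To make the two steps concrete, here is a toy sketch; the vocabulary, ids, and dimensions are invented for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "再": 1, "见": 2}
ids = torch.tensor([[vocab["再"], vocab["见"]]])              # structured V_0 as indices

one_hot = nn.functional.one_hot(ids, num_classes=len(vocab))  # sparse one-hot form of V_0
embed = nn.Embedding(len(vocab), embedding_dim=8)             # coded-word embedding module
E = embed(ids)                                                # dense first text vector E
print(one_hot.shape, E.shape)   # torch.Size([1, 2, 3]) torch.Size([1, 2, 8])
```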
The coding layer 130 encodes the first text vector E to obtain the text feature vector M, and either outputs M as the first text vector E to the next coding layer or outputs M directly to the decoder model 200.

In a specific embodiment there are several coding layers 130. Each coding layer 130 encodes the first text vector E into a text feature vector M, which becomes the input of the next coding layer 130; that layer continues encoding the text feature vector M from the previous layer into a new text feature vector M. Through this repeated encoding across multiple coding layers 130, a more accurate text feature vector M is obtained that effectively characterizes the original text S_o at the textual level.
Specifically, as shown in fig. 2, the encoding layer 130 includes an encoding multi-head attention calculation module 131, a first encoding normalization module 132, an encoding forward propagation module 133, and a second encoding normalization module 134.
The encoded multi-head attention calculation module 131 is used for the first text vectorEPerforming multi-head self-attention calculation to obtain a second text vectoraAnd outputs it to the first code normalization module.
Multi-headed self-attention computation refers to inputting an initial vector into a plurality of parallel attention-based computation modules. The encoding multi-head attention calculation module 131 is formed by combining and connecting a plurality of parallel attention calculation modules in parallel.
Specifically, each attention calculation module in the encoding multi-head attention calculation module 131 performs attention calculation on the first text vector E according to the formula

$$a = \mathrm{softmax}\!\left(\frac{Q_1 K_1^{\top}}{\sqrt{d_1}}\right) V_1$$

Each attention calculation module computes its result independently, and the results of all attention calculation modules are finally concatenated to obtain the second text vector a. Here K_1 and V_1 are linear transformations of the first text vector E, computed as K_1 = E W_k and V_1 = E W_v, where W_k and W_v are training parameters of the non-aligned text error correction model; Q_1 is likewise a linear transformation of E, computed as Q_1 = E W_q, where W_q is a training parameter of the model; d_1 is the dimension of K_1; and K_1^T is the transposed matrix of K_1.
The first encoding normalization module 132 normalizes the second text vector a to obtain a third text vector V_a and outputs it to the encoding forward propagation module 133.

Normalization, also called data standardization, limits the data to a fixed range and converts dimensional data into dimensionless data, which makes the third text vector V_a easier to fuse with the phoneme vectors during subsequent decoding and eliminates the adverse effect of ill-scaled values.

In a preferred embodiment, the encoding multi-head attention calculation module 131 and the first encoding normalization module 132 are connected through a residual network, improving the generalization ability of the non-aligned text error correction model provided in this embodiment.

The encoding forward propagation module 133 performs forward propagation processing on the third text vector V_a to obtain a fourth text vector V_f and transmits it to the second encoding normalization module 134.

Forward propagation processing means that, in a neural network, information flows directly from one layer of neurons to the next until the output; in this embodiment it can be implemented by a fully connected layer.

The second encoding normalization module 134 normalizes the fourth text vector V_f to obtain the text feature vector M, and either outputs M as the first text vector E to the next coding layer or outputs M directly to the decoder model 200.

In a preferred embodiment, the encoding forward propagation module 133 and the second encoding normalization module 134 are also connected through a residual network, further improving the generalization ability of the non-aligned text error correction model provided in this embodiment.
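Assembling modules 131-134 into one coding layer might look like the following sketch; the width, head count, feed-forward size, and activation are assumptions, and the residual connections correspond to the preferred embodiment above.

```python
import torch.nn as nn

class EncodingLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # module 131
        self.norm1 = nn.LayerNorm(d_model)                                     # module 132
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))                     # module 133
        self.norm2 = nn.LayerNorm(d_model)                                     # module 134

    def forward(self, E):
        a, _ = self.attn(E, E, E)        # second text vector a
        V_a = self.norm1(E + a)          # residual + norm -> third text vector V_a
        V_f = self.ffn(V_a)              # fourth text vector V_f
        return self.norm2(V_a + V_f)     # residual + norm -> text feature vector M
```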
In the present embodiment, as shown in fig. 2, the decoder model 200 includes a phoneme extraction module 210, a decoded-word embedding module 220, a first decoding multi-head attention calculation module 230, and at least one decoding layer 240.
The phoneme extraction module 210 extracts phoneme information from the externally input original text S_o, encodes the extracted phoneme information to obtain several initial phoneme vectors V, and outputs them to the decoded-word embedding module 220.

Phoneme information is any information that can represent the pronunciation of the original text S_o; for example, it may be the pinyin of the original text S_o, phonetic symbols, or any other suitable notation for the pronunciation of S_o.
In this embodiment, the phoneme information specifically refers to the pinyin initial and pinyin final of each character in the original text S_o. The phoneme extraction module 210 converts each character of the original text S_o into pinyin to generate a pinyin sequence P_o; for example, the text S_o "再见" generates the pinyin sequence P_o "zaijian". As shown in FIG. 3, the phoneme extraction module 210 splits the pinyin sequence P_o into a pinyin initial sequence P_i and a pinyin final sequence P_f; continuing the example, for the pinyin sequence P_o "zaijian", the initial sequence P_i is "z j" and the final sequence P_f is "ai ian". The phoneme extraction module encodes the initial sequence P_i and the final sequence P_f separately to obtain an initial-consonant phoneme vector V_i and a final phoneme vector V_f, which are input to the decoded-word embedding module 220 as the initial phoneme vectors V.
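The initial/final split in this example can be reproduced with the pypinyin library (`pip install pypinyin`); this is one possible tool, not an implementation prescribed by the patent, and it uses the library's default strict splitting, under which zero-initial syllables yield an empty initial.

```python
from pypinyin import pinyin, Style

text = "再见"
initials = [p[0] for p in pinyin(text, style=Style.INITIALS)]  # ['z', 'j']
finals = [p[0] for p in pinyin(text, style=Style.FINALS)]      # ['ai', 'ian']
print(initials, finals)
```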
The decoded-word embedding module 220 converts each of the several initial phoneme vectors V into a first phoneme vector e of a specified dimension and outputs the several first phoneme vectors e to the first decoding multi-head attention calculation module 230.

The first decoding multi-head attention calculation module 230 performs multi-head self-attention calculation on each of the several first phoneme vectors e to obtain several second phoneme vectors A and outputs them to the decoding layer.

The first decoding multi-head attention calculation module 230 is composed of several attention calculation modules connected in parallel.
Each attention calculation module in the first decoding multi-head attention calculation module 230 performs attention calculation on each first phoneme vector e, i.e., separately on the first initial phoneme vector e_i and the first final phoneme vector e_f, according to the formula

$$A = \mathrm{softmax}\!\left(\frac{Q_2 K_2^{\top}}{\sqrt{d_2}}\right) V_2$$

Each attention calculation module computes its result independently, and the results are finally concatenated to obtain the second initial phoneme vector A_i corresponding to e_i and the second final phoneme vector A_f corresponding to e_f. Here K_2 and V_2 are linear transformations of the first initial phoneme vector e_i or of the first final phoneme vector e_f, computed as K_2 = e_i W_k and V_2 = e_i W_v, or as K_2 = e_f W_k and V_2 = e_f W_v, where W_k and W_v are training parameters of the non-aligned text error correction model; Q_2 is the corresponding linear transformation of e_i or e_f, computed as Q_2 = e_i W_q or Q_2 = e_f W_q, where W_q is a training parameter of the model; d_2 is the dimension of K_2; and K_2^T is the transposed matrix of K_2.
The decoding layer 240 fuses the several second phoneme vectors A, i.e., the second initial phoneme vector A_i and the second final phoneme vector A_f, into the phoneme feature vector V_p, decodes by combining the text feature vector M and the phoneme feature vector V_p to obtain the decoding feature vector V_d, and either outputs V_d as one of the second phoneme vectors A to the next decoding layer or directly takes V_d as the error-corrected text of the original text S_o.

In a specific embodiment there are several decoding layers 240. Each decoding layer 240 fuses the several second phoneme vectors A into a phoneme feature vector V_p, then combines the text feature vector M and the phoneme feature vector V_p to obtain a decoding feature vector V_d, which becomes the input of the next decoding layer 240; that layer continues decoding on the basis of the decoding feature vector V_d from the previous layer to obtain a new decoding feature vector V_d. Through this repeated processing across multiple decoding layers 240, a more accurate decoding feature vector V_d is obtained to serve as the error-corrected text of the original text S_o.
In a specific embodiment, as shown in fig. 2, the decoding layer 240 includes a vector fusion module 241, a second decoding multi-head attention calculation module 242, a first decoding normalization module 243, a decoding forward propagation module 244, and a second decoding normalization module 245.
The vector fusion module 241 fuses the second initial phoneme vector A_i and the second final phoneme vector A_f to obtain the phoneme feature vector V_p and outputs it to the second decoding multi-head attention calculation module 242.

Specifically, the vector fusion module 241 fuses the second initial phoneme vector A_i and the second final phoneme vector A_f according to the formula

$$V_p = A_i W_i + A_f W_f$$

where W_i and W_f are training parameters of the non-aligned text error correction model.
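In tensor terms the fusion is two matrix multiplications and an addition; the batch size, sequence length, and width below are assumed for illustration.

```python
import torch

A_i = torch.randn(1, 8, 512)      # second initial phoneme vectors
A_f = torch.randn(1, 8, 512)      # second final phoneme vectors
W_i = torch.randn(512, 512)       # learned fusion parameter
W_f = torch.randn(512, 512)       # learned fusion parameter
V_p = A_i @ W_i + A_f @ W_f       # phoneme feature vector V_p
print(V_p.shape)                  # torch.Size([1, 8, 512])
```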
The second decoding multi-head attention calculation module 242 combines the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtains the fusion attention vector N, and outputs it to the first decoding normalization module 243.

The second decoding multi-head attention calculation module 242 is composed of several attention calculation modules connected in parallel.

Each attention calculation module in the second decoding multi-head attention calculation module 242 performs attention calculation combining the text feature vector M and the phoneme feature vector V_p according to the formula

$$N = \mathrm{softmax}\!\left(\frac{Q_3 K_1^{\top}}{\sqrt{d_1}}\right) V_1$$

Each attention calculation module computes its result independently, and the results are finally concatenated to obtain the fusion attention vector N. Here K_1 and V_1 are linear transformations of the text feature vector M, computed as K_1 = M W_k and V_1 = M W_v, where W_k and W_v are training parameters of the non-aligned text error correction model; Q_3 is a linear transformation of the phoneme feature vector V_p, computed as Q_3 = V_p W_p, where W_p is a training parameter of the model; d_1 is the dimension of K_1; and K_1^T is the transposed matrix of K_1.
The first decoding normalization module 243 normalizes the fusion attention vector N to obtain a first decoding vector V_A and outputs it to the decoding forward propagation module 244.

The decoding forward propagation module 244 performs forward propagation processing on the first decoding vector V_A to obtain a second decoding vector V_F and transmits it to the second decoding normalization module 245.

In this embodiment, the forward propagation processing can be implemented by one fully connected layer.

The second decoding normalization module 245 normalizes the second decoding vector V_F to obtain the decoding feature vector V_d, and either outputs V_d as one of the second phoneme vectors A to the next decoding layer or directly takes V_d as the error-corrected text of the original text S_o.
In a specific embodiment, when the decoding feature vector V_d is output as one of the second phoneme vectors A to the next decoding layer, the vector fusion module 241 of that next decoding layer 240 fuses the second initial phoneme vector A_i and the second final phoneme vector A_f together with the decoding feature vector V_d output by the previous layer, using a linear combination of the three vectors analogous to the two-vector fusion formula above.
The non-aligned text error correction model provided by this embodiment comprises a decoder model and an encoder model whose neural network parameters are all updated simultaneously during training. The model's input is the original text and its output is the error-corrected text; phoneme extraction, phoneme encoding, language encoding, feature fusion, and decoding of the original text are all contained within the error correction model, so every processing stage is corrected and optimized during training of the end-to-end model. This ensures the accuracy of sentences corrected with the trained model and eliminates the error accumulation problem of pipeline processing. Meanwhile, the model obtains more accurate and effective features by stacking multiple coding and decoding layers, and during decoding it fuses the text feature vector generated by the encoder model with the phoneme vectors generated in the decoder model for the same original text; that is, it corrects the original text by jointly considering its semantic features and pronunciation features, effectively improving error correction accuracy.
Example 2
Based on the same concept as that of embodiment 1, this embodiment provides a training method for a non-aligned text correction model, which is shown in fig. 4 and 5, and includes the following steps:
S210, constructing a training data set;

in this step, the training data set is constructed by obtaining a number of original texts and their corresponding corrected texts; each original text and its corrected text form a sentence pair, i.e., one sample. After construction, the data set can be split into a training set, a validation set, and a test set according to a preset ratio, where the training set is used to train the non-aligned text error correction model and the validation and test sets are used to validate and test the model after training. The preset ratio can be 8:1:1 and may be adjusted to the actual deployment scenario.
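A minimal sketch of the 8:1:1 split described above; the ratio and seed are the adjustable choices mentioned in the text.

```python
import random

def split_dataset(pairs, seed=42):
    """pairs: list of (original_text, corrected_text) sentence pairs."""
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    train = pairs[: int(0.8 * n)]
    val = pairs[int(0.8 * n): int(0.9 * n)]
    test = pairs[int(0.9 * n):]
    return train, val, test
```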
S220, randomly deleting, replacing and/or repeating the content of each sample in the training data set to obtain a preprocessed training data set;
in this step, randomly deleting, replacing, and/or repeating the content of each sample in the training data set helps the error correction model recognize texts of various kinds and improves its generalization ability.

The three operations of deleting, replacing, and repeating sample content can be applied selectively according to the actual situation.

Specifically, random deletion deletes each character in a sample with probability p_0, with the number of deleted characters capped at 30% of the sentence length (the proportion can be adjusted to the actual situation). Random replacement replaces each character with probability p_1 by a homophone or near-homophone, with the number of replaced characters capped at 30% of the sentence length. Random repetition repeats each character with probability p_2, inserting the copy at the current position, with the number of repeated characters capped at 30% of the sentence length.
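The three corruption operations might be sketched as follows; the probabilities, the 30% cap, and the tiny homophone table are illustrative placeholders (a real system would use a proper confusion set).

```python
import random

HOMOPHONES = {"在": "再", "见": "建"}   # hypothetical confusion table

def corrupt(text, p_del=0.05, p_rep=0.05, p_dup=0.05, max_ratio=0.3):
    budget = int(len(text) * max_ratio)  # per-operation cap on changed characters
    out, n_del, n_rep, n_dup = [], 0, 0, 0
    for ch in text:
        r = random.random()
        if r < p_del and n_del < budget:
            n_del += 1                    # random deletion: drop the character
            continue
        if r < p_del + p_rep and ch in HOMOPHONES and n_rep < budget:
            out.append(HOMOPHONES[ch])    # random replacement with a homophone
            n_rep += 1
            continue
        out.append(ch)
        if r > 1 - p_dup and n_dup < budget:
            out.append(ch)                # random repetition at the current position
            n_dup += 1
    return "".join(out)
```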
S230, initializing a neural network model composed of an encoder and a decoder, and feeding the training set into the neural network model in batches for training until the value of the loss function no longer decreases significantly, yielding the non-aligned text error correction model described in embodiment 1.

In this step, the neural network parameters trained and updated during training are the six parameters W_f, W_i, W_k, W_v, W_q, and W_p described in embodiment 1.

In a specific embodiment, the error correction model can use the per-character cross entropy as the loss function during training, computing the loss between the output sequence and the target sequence at each position in turn and summing them to obtain the final loss. Meanwhile, the Adam (Adaptive Moment Estimation) optimization algorithm serves as the training optimizer, combined with learning rate warmup and decay strategies, to update the model parameters until the value of the loss function no longer decreases significantly.
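A training step matching this description might look like the following sketch; the dataloader contract and hyperparameters are assumptions, warmup here is linear, and decay is omitted for brevity. The model follows the forward signature assumed in the earlier sketches.

```python
import torch
import torch.nn as nn

def train(model, loader, vocab_size, steps=10000, warmup=1000, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda s: min(1.0, (s + 1) / warmup))        # learning-rate warmup
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)         # 0 assumed to be padding
    step = 0
    while step < steps:
        for text_ids, initial_ids, final_ids, target_ids in loader:
            logits = model(text_ids, initial_ids, final_ids)
            # Per-position cross entropy, summed/averaged over the sequence.
            loss = loss_fn(logits.view(-1, vocab_size), target_ids.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
            step += 1
            if step >= steps:
                return
```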
Example 3
Based on the same concept as that of embodiment 1, this embodiment provides a method for correcting a non-aligned text, which is shown in fig. 5 and 6 and includes the following steps:
S310, inputting the original text to be processed into the non-aligned text error correction model described in embodiment 1, so that the model corrects the original text to be processed and outputs its error-corrected text.
It should be understood that the above embodiments of the present invention are only examples that clearly illustrate its technical solutions and do not limit its specific implementations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the claims of the present invention shall fall within their scope of protection.

Claims (9)

1. A non-aligned text error correction model, comprising: an encoder model and a decoder model;
the encoder model comprises a preprocessing module, a coded-word embedding module, and at least one coding layer;
the decoder model comprises a phoneme extraction module, a decoded-word embedding module, a first decoding multi-head attention calculation module, and at least one decoding layer;
the preprocessing module is used for preprocessing and encoding an externally input original text S_o to obtain an initial text vector V_0 and outputting it to the coded-word embedding module;
the coded-word embedding module is used for converting the initial text vector V_0 into a first text vector E of a specified dimension and outputting the first text vector E to the coding layer;
the coding layer is used for encoding the first text vector E to obtain a text feature vector M, and for outputting the text feature vector M as a first text vector E to the next coding layer or directly outputting the text feature vector M to a decoding layer of the decoder model;
the phoneme extraction module is used for extracting phoneme information from the externally input original text S_o and encoding the extracted phoneme information to obtain several initial phoneme vectors V, and outputting them to the decoded-word embedding module;
the decoded-word embedding module is used for converting the several initial phoneme vectors V into first phoneme vectors e of a specified dimension and outputting the several first phoneme vectors e to the first decoding multi-head attention calculation module;
the first decoding multi-head attention calculation module is used for performing multi-head self-attention calculation on each of the several first phoneme vectors e to obtain several second phoneme vectors A and outputting them to the decoding layer;
the decoding layer comprises a vector fusion module, a second decoding multi-head attention calculation module, a first decoding normalization module, a decoding forward propagation module, and a second decoding normalization module;
the vector fusion module is used for fusing the several second phoneme vectors A to obtain the phoneme feature vector V_p and outputting it to the second decoding multi-head attention calculation module;
the second decoding multi-head attention calculation module is used for combining the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtaining a fusion attention vector N, and outputting it to the first decoding normalization module;
the first decoding normalization module is used for normalizing the fusion attention vector N to obtain a first decoding vector V_A and outputting it to the decoding forward propagation module;
the decoding forward propagation module is used for performing forward propagation processing on the first decoding vector V_A to obtain a second decoding vector V_F and transmitting it to the second decoding normalization module;
the second decoding normalization module is used for normalizing the second decoding vector V_F to obtain a decoding feature vector V_d, and for outputting the decoding feature vector V_d as one of the second phoneme vectors A to the next decoding layer or directly taking the decoding feature vector V_d as the error-corrected text of the original text S_o.
2. The non-aligned text error correction model of claim 1, wherein
the coding layer comprises an encoding multi-head attention calculation module, a first encoding normalization module, an encoding forward propagation module, and a second encoding normalization module;
the encoding multi-head attention calculation module is used for performing multi-head self-attention calculation on the first text vector E to obtain a second text vector a and outputting it to the first encoding normalization module;
the first encoding normalization module is used for normalizing the second text vector a to obtain a third text vector V_a and outputting it to the encoding forward propagation module;
the encoding forward propagation module is used for performing forward propagation processing on the third text vector V_a to obtain a fourth text vector V_f and transmitting it to the second encoding normalization module;
the second encoding normalization module is used for normalizing the fourth text vector V_f to obtain the text feature vector M, and for outputting it as the first text vector E to the next coding layer or directly outputting the text feature vector M to the decoding layer.
3. The non-aligned text error correction model of claim 1, wherein the second decoding multi-head attention calculation module being used for combining the text feature vector M and the phoneme feature vector V_p to perform multi-head self-attention calculation, obtain a fusion attention vector N, and output it to the first decoding normalization module specifically comprises:
the second decoding multi-head attention calculation module is used for combining the text feature vector M and the phoneme feature vector V_p according to the formula

$$N = \mathrm{softmax}\!\left(\frac{Q_3 K_1^{\top}}{\sqrt{d_1}}\right) V_1$$

to perform multi-head self-attention calculation, obtain the fusion attention vector N, and output it to the first decoding normalization module; wherein K_1 and V_1 are linear transformations of the text feature vector M, computed as K_1 = M W_k and V_1 = M W_v, W_k and W_v being training parameters of the non-aligned text error correction model; Q_3 is a linear transformation of the phoneme feature vector V_p, computed as Q_3 = V_p W_p, W_p being a training parameter of the non-aligned text error correction model; d_1 is the dimension of K_1; and K_1^T is the transposed matrix of K_1.
4. The non-aligned text error correction model of claim 1, wherein
the phoneme information comprises pinyin initial information and pinyin final information;
the several initial phoneme vectors V include an initial-consonant phoneme vector V_i and a final phoneme vector V_f;
accordingly, the several first phoneme vectors e include a first initial phoneme vector e_i and a first final phoneme vector e_f;
accordingly, the several second phoneme vectors A include a second initial phoneme vector A_i and a second final phoneme vector A_f.
5. The non-aligned text error correction model of claim 4, wherein
the vector fusion module is used for fusing the several second phoneme vectors A according to the formula

$$V_p = A_i W_i + A_f W_f$$

to obtain the phoneme feature vector V_p and outputting it to the second decoding multi-head attention calculation module; wherein W_i and W_f are training parameters of the non-aligned text error correction model.
6. The non-aligned text error correction model of claim 2, wherein
residual network connections are used between the encoding multi-head attention calculation module and the first encoding normalization module, and between the encoding forward propagation module and the second encoding normalization module.
7. The non-aligned text error correction model of any one of claims 1 to 5, wherein
residual network connections are used between the second decoding multi-head attention calculation module and the first decoding normalization module, and between the decoding forward propagation module and the second decoding normalization module.
8. A training method for a non-aligned text error correction model, comprising:
constructing a training data set, and randomly deleting, replacing, and repeating the content of each sample in the training data set to obtain a preprocessed training data set;
initializing a neural network model composed of an encoder and a decoder, and inputting the training data set into the neural network model in batches for training until the value of the loss function of the neural network model no longer decreases significantly, to obtain the non-aligned text error correction model according to any one of claims 1 to 7.
9. An error correction method for non-aligned text, comprising:
inputting the original text to be processed into the non-aligned text error correction model according to any one of claims 1 to 7, so that the non-aligned text error correction model corrects the original text to be processed, and outputting the error-corrected text of the original text to be processed.
CN202210696857.6A 2022-06-20 2022-06-20 Error correction model, training and error correction method for non-aligned text Active CN114781377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210696857.6A CN114781377B (en) 2022-06-20 2022-06-20 Error correction model, training and error correction method for non-aligned text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210696857.6A CN114781377B (en) 2022-06-20 2022-06-20 Error correction model, training and error correction method for non-aligned text

Publications (2)

Publication Number Publication Date
CN114781377A (en) 2022-07-22
CN114781377B (en) 2022-09-09

Family

ID=82420349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210696857.6A Active CN114781377B (en) 2022-06-20 2022-06-20 Error correction model, training and error correction method for non-aligned text

Country Status (1)

Country Link
CN (1) CN114781377B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665675B (en) * 2023-07-25 2023-12-12 上海蜜度信息技术有限公司 Voice transcription method, system, electronic equipment and storage medium
CN116991874B (en) * 2023-09-26 2024-03-01 海信集团控股股份有限公司 Text error correction and large model-based SQL sentence generation method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444479A (en) * 2022-04-11 2022-05-06 南京云问网络技术有限公司 End-to-end Chinese speech text error correction method, device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297833A (en) * 2020-02-21 2021-08-24 华为技术有限公司 Text error correction method and device, terminal equipment and computer storage medium
US11514888B2 (en) * 2020-08-13 2022-11-29 Google Llc Two-level speech prosody transfer
CN113409757A (en) * 2020-12-23 2021-09-17 腾讯科技(深圳)有限公司 Audio generation method, device, equipment and storage medium based on artificial intelligence
CN114373480A (en) * 2021-12-17 2022-04-19 腾讯音乐娱乐科技(深圳)有限公司 Training method of voice alignment network, voice alignment method and electronic equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444479A (en) * 2022-04-11 2022-05-06 南京云问网络技术有限公司 End-to-end Chinese speech text error correction method, device and storage medium

Also Published As

Publication number Publication date
CN114781377A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN114781377B (en) Error correction model, training and error correction method for non-aligned text
CN114444479B (en) End-to-end Chinese speech text error correction method, device and storage medium
CN109522403B (en) Abstract text generation method based on fusion coding
CN110765772A (en) Text neural network error correction model after Chinese speech recognition with pinyin as characteristic
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN113297841A (en) Neural machine translation method based on pre-training double-word vectors
CN111783477B (en) Voice translation method and system
CN113569562B (en) Method and system for reducing cross-modal and cross-language barriers of end-to-end voice translation
CN113327595B (en) Pronunciation deviation detection method and device and storage medium
CN115935957B (en) Sentence grammar error correction method and system based on syntactic analysis
WO2023193542A1 (en) Text error correction method and system, and device and storage medium
CN114662476A (en) Character sequence recognition method fusing dictionary and character features
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN112349288A (en) Chinese speech recognition method based on pinyin constraint joint learning
CN115602161A (en) Chinese speech enhancement recognition and text error correction method
CN117437909B (en) Speech recognition model construction method based on hotword feature vector self-attention mechanism
CN115270771B (en) Fine-grained self-adaptive Chinese spelling error correction method assisted by word-sound prediction task
CN115223549A (en) Vietnamese speech recognition corpus construction method
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN114239548A (en) Triple extraction method for merging dependency syntax and pointer generation network
CN115034236A (en) Chinese-English machine translation method based on knowledge distillation
CN114912441A (en) Text error correction model generation method, error correction method, system, device and medium
CN114005434A (en) End-to-end voice confidence calculation method, device, server and medium
US20240135089A1 (en) Text error correction method, system, device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant