CN114781386A - Method and device for acquiring text error correction training corpus and electronic equipment - Google Patents


Info

Publication number
CN114781386A
Authority
CN
China
Prior art keywords: text, character, predicted, probability, pinyin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210537412.3A
Other languages
Chinese (zh)
Inventor
桂睿 (Gui Rui)
马芸 (Ma Yun)
曹宇慧 (Cao Yuhui)
黄硕 (Huang Shuo)
陈永锋 (Chen Yongfeng)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210537412.3A
Publication of CN114781386A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a method for acquiring text error correction training corpora, relating to the field of data processing technology, and in particular to big data, natural language processing, and artificial intelligence. The scheme is as follows: acquire a reference error correction model and an initial training corpus; input a text to be corrected into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to the predicted character at each position in each predicted text; determine the characters to be rewritten and the rewriting modes in the text to be corrected according to the predicted texts, the first probabilities, and the labeled text; and rewrite the characters to be rewritten based on the rewriting modes to obtain the text error correction training corpus for the target domain. The reference error correction model is thus used to predict domain text so that its weak points in that domain can be identified, and training corpora are generated targeting those weak points, which improves the quality of the generated corpora and provides the conditions for a reliable and accurate domain-specific text error correction model.

Description

Method and device for acquiring text error correction training corpus and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, in particular to big data, natural language processing, and artificial intelligence, and specifically to a method and an apparatus for acquiring text error correction training corpora, and an electronic device.
Background
In the process of training a text error correction model, the quality of the training corpus directly affects the model's reliability. A dependable method for generating text error correction training corpora is therefore urgently needed to improve the reliability and accuracy of text error correction models.
Disclosure of Invention
The disclosure provides a method and a device for acquiring text error correction training corpora.
According to an aspect of the present disclosure, a method for obtaining a text error correction corpus is provided, including:
acquiring a reference error correction model and an initial training corpus, wherein the reference error correction model is generated based on the training of a universal field corpus, and the initial training corpus comprises a text to be error corrected of a target field and a corresponding labeled text;
inputting a text to be corrected into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to a predicted character at each position in each predicted text;
determining characters to be rewritten and rewriting modes in the text to be corrected according to the plurality of predicted texts, the first probability corresponding to the predicted characters at each position and the labeled text;
and rewriting the character to be rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field.
According to another aspect of the present disclosure, there is provided an apparatus for acquiring a text error correction corpus, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a reference error correction model and an initial training corpus, the reference error correction model is generated based on the training of a universal field corpus, and the initial training corpus comprises a text to be corrected of a target field and a corresponding labeled text;
the prediction module is used for inputting the text to be corrected into the reference error correction model so as to obtain a plurality of predicted texts and a first probability corresponding to a predicted character at each position in each predicted text;
the determining module is used for determining characters to be rewritten and rewriting modes in the text to be corrected according to the plurality of predicted texts, the first probability corresponding to the predicted characters at each position and the labeled text;
and the rewriting module is used for rewriting the character to be rewritten based on the rewriting mode so as to obtain the updated text error correction training corpus corresponding to the target field.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the above embodiments.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the above-described embodiments.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method of the above embodiment.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a method for acquiring a text error correction corpus according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of another method for acquiring a text error correction corpus according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of another method for acquiring a text error correction training corpus according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of another method for acquiring a text error correction corpus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an apparatus for acquiring a text error correction training corpus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing a method for acquiring text error correction training corpus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Big data, or massive data, refers to data sets so large that they cannot be captured, managed, processed, and organized into information that supports business decisions within a reasonable time using current mainstream software tools.
NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence, and the content of NLP research includes but is not limited to the following branch fields: text classification, information extraction, automatic summarization, intelligent question answering, topic recommendation, machine translation, subject word recognition, knowledge base construction, deep text representation, named entity recognition, text generation, text analysis (lexical, syntactic, grammatical, etc.), speech recognition and synthesis, and the like.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies include computer vision, speech recognition, natural language processing, deep learning, big data processing, and knowledge graph technologies.
According to the present disclosure, weak points in the training texts are detected and confusable characters are substituted at those weak points, so as to generate a domain error correction text set that better matches the real errors of the domain. Training the text error correction model with this set improves the model's reliability.
The following describes a method, an apparatus, an electronic device, and a storage medium for acquiring a text error correction corpus in an embodiment of the disclosure in detail with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a method for acquiring a text error correction training corpus according to an embodiment of the present disclosure.
As shown in fig. 1, the method includes:
step 101, obtaining a reference error correction model and an initial training corpus, wherein the reference error correction model is generated based on general field corpus training, and the initial training corpus includes a text to be corrected of a target field and a corresponding tagged text.
Generally, the confusable characters differ across domains; for example, one character may frequently be mistaken for a near-homophone in fiction text but rarely in news text. Therefore, in the present disclosure, text error correction corpora are generated for a specific domain, so that a text error correction model trained on them is highly reliable for error correction in that domain.
In the present disclosure, part of the characters in target-domain texts may be replaced with the confusable characters listed for them in the domain confusion character set, so as to generate the texts to be corrected for that domain. For example, in "今天天气真好" ("the weather is really nice today"), the character 气 may be replaced with its homophone 起 to produce the error text "今天天起真好".
The domain confusion character set may include a pinyin confusion set and an error-prone character confusion set. The pinyin confusion set can be obtained by collecting the pinyin of each character in the domain texts and associating each pinyin with the characters in the domain that share the same or a similar pinyin; for example, the confusion set for the pinyin "da" might contain characters such as 打, 大, and 达. The error-prone character confusion set can be obtained by counting, for each character in the domain texts, the characters it is commonly confused with, and associating each character with those confusable characters; for example, the confusion set for 已 might contain 己 and 以.
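The confusion-set replacement described above can be sketched as follows. This is a minimal illustration: the confusion dictionaries, the `corrupt` helper, and the example characters are assumptions for demonstration, not part of the disclosure.

```python
import random

# Hypothetical domain confusion sets (placeholder entries).
pinyin_confusion = {          # pinyin -> characters sharing that or a similar pinyin
    "da": ["打", "大", "达"],
    "qi": ["气", "起", "期"],
}
error_prone_confusion = {     # character -> characters it is commonly mistaken for
    "已": ["己", "以"],
}

def corrupt(text, char_to_pinyin, rate=0.1, seed=0):
    """Replace a fraction of characters with confusable ones to build error texts."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = list(error_prone_confusion.get(ch, []))
        py = char_to_pinyin.get(ch)
        if py:
            candidates += [c for c in pinyin_confusion.get(py, []) if c != ch]
        out.append(rng.choice(candidates) if candidates and rng.random() < rate else ch)
    return "".join(out)
```

With `rate=1.0`, every character that has confusable candidates is replaced, which is how an error text such as 今天天起真好 can be produced from 今天天气真好.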
In the present disclosure, an initial text error correction model may be trained on a general-domain corpus to obtain the reference error correction model. The general-domain corpus contains error texts and their corresponding labeled texts from all domains; the error texts can be generated based on a general confusion character set that covers confusable characters from all domains. The text error correction model may be a knowledge-enhanced semantic representation model such as ERNIE, which the present disclosure does not limit.
Step 102, inputting the text to be corrected into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to a predicted character at each position in each predicted text.
In the present disclosure, the reference error correction model may output a plurality of predicted texts and the first probability of the predicted character at each position in each predicted text, according to the probabilities of the predicted characters at each position. For example, the probability of each predicted character at each position may be compared with a threshold: when the probability of a predicted character is greater than the threshold, that character and its probability are output at the position; when the probability is less than the threshold, the predicted character and its probability are not output.
It can be understood that, since multiple predicted characters at a given position may each have a probability greater than the threshold, multiple predicted texts may be output.
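The per-position thresholding just described can be sketched as below; `position_probs` and the helper name are hypothetical, assuming the reference model exposes a probability distribution over characters at each position.

```python
from itertools import product

def enumerate_predicted_texts(position_probs, threshold=0.05):
    """position_probs: one {char: probability} dict per position.
    Characters whose probability clears the threshold are kept; every
    combination of kept characters yields one predicted text, which is
    why several predicted texts can be output for one input."""
    kept = [
        {ch: p for ch, p in probs.items() if p > threshold}
        for probs in position_probs
    ]
    texts = []
    for combo in product(*(k.items() for k in kept)):
        chars, probs = zip(*combo)
        texts.append(("".join(chars), list(probs)))
    return texts
```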
And 103, determining characters to be rewritten and rewriting modes in the text to be corrected according to the plurality of predicted texts, the first probability corresponding to the predicted characters at each position and the labeled text.
The rewriting modes may include a pinyin-based rewriting mode, a font-based rewriting mode, and the like. In the pinyin-based rewriting mode, a character may be replaced with a character whose pinyin differs from its own; in the font-based rewriting mode, a character may be replaced with a character whose glyph differs from its own. The present disclosure does not limit this.
In the present disclosure, whether each position is an error-prone point, that is, a weak point of the reference error correction model, may be determined according to the probability of the predicted character at each position in each predicted text and the corresponding character in the labeled text. The training corpus is then updated according to the weak points predicted by the reference error correction model, so the resulting corpus is more targeted, and the text error correction model generated from it is more reliable and accurate.
And 104, rewriting the character to be rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field.
In the present disclosure, after the rewriting mode of a character to be rewritten is determined, the character is rewritten in that mode to generate an adversarial error correction sample, thereby obtaining the updated text error correction training corpus for the target domain. Performing character replacement at the weak points of error correction in this way improves the quality of the generated error correction samples, and training the text error correction model on these samples improves its reliability.
Optionally, since a rewritten adversarial error text may itself be a fluent sentence (for example, rewriting "end of year" to "middle of year"), it may affect the reliability of the trained text error correction model, so the rewritten corpus can be filtered. For example, unsupervised training on domain texts can produce a fluency evaluation model; each group of rewritten texts and the labeled text corresponding to a text to be corrected are then input into this model, and the fluency of each rewritten text and of the labeled text is determined from its output. If the fluency of the labeled text is lower than that of any rewritten text, the group of rewritten texts corresponding to that text to be corrected is removed from the corpus.
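The fluency filtering step can be sketched as follows; the `fluency` callable stands in for the unsupervised fluency evaluation model, which is assumed here rather than implemented.

```python
def filter_by_fluency(groups, fluency):
    """groups: list of (labeled_text, [rewritten_text, ...]) pairs.
    A group is dropped when the labeled text is less fluent than any
    of its rewritten texts, as described above."""
    kept = []
    for labeled, rewrites in groups:
        base = fluency(labeled)
        if all(base >= fluency(r) for r in rewrites):
            kept.append((labeled, rewrites))
    return kept
```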
Optionally, after the updated text error correction training corpus for the target domain is obtained, the reference error correction model may be trained with it to obtain a text error correction model for that domain, thereby improving the reliability of domain text error correction.
In the present disclosure, after a reference error correction model generated by training on a general-domain corpus and an initial training corpus containing the texts to be corrected of a target domain and the corresponding labeled texts are obtained, a text to be corrected may be input into the reference error correction model to obtain a plurality of predicted texts and the first probability of the predicted character at each position in each predicted text. The characters to be rewritten and the rewriting modes in the text to be corrected are then determined according to the predicted texts, the first probabilities, and the labeled text, and the characters to be rewritten are rewritten based on the rewriting modes to obtain the updated text error correction training corpus for the target domain. The reference error correction model is thus used to predict domain texts to determine its weak points in domain prediction, and training corpora are generated for those weak points, which improves the quality of the generated corpora and provides the conditions for a reliable and accurate domain-specific text error correction model.
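Putting the steps of fig. 1 together, the flow can be sketched as a minimal pipeline. The callables `predict`, `find_weak`, and `rewrite` are hypothetical stand-ins for the reference-model inference, weak-point detection, and confusion-based rewriting of this disclosure.

```python
def build_domain_corpus(pairs, predict, find_weak, rewrite):
    """pairs: (text_to_correct, labeled_text) items of the initial corpus.
    For each pair, the model's predictions are inspected for weak points,
    and one adversarial sample is generated per weak point found."""
    corpus = []
    for err, gold in pairs:
        predictions = predict(err)  # predicted texts plus first probabilities
        for pos, mode in find_weak(predictions, gold):
            corpus.append((rewrite(err, pos, mode), gold))
    return corpus
```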
Fig. 2 is a schematic flowchart of a method for acquiring a text error correction corpus according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes:
step 201, a reference error correction model and an initial training corpus are obtained, wherein the reference error correction model is generated based on the training of a corpus in a general field, and the initial training corpus includes a text to be error corrected and a corresponding labeled text in a target field.
Step 202, inputting the text to be corrected into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to a predicted character at each position in each predicted text.
In the present disclosure, for a specific implementation process of step 201 to step 202, reference may be made to detailed description of any embodiment of the present disclosure, which is not described herein again.
And 203, determining a first target probability corresponding to the marking character at each position in the marking text according to the matching degree of the prediction text and the marking text and the first probability corresponding to the prediction character at each position.
In the present disclosure, each predicted text may be matched with the labeled text to determine whether each character in the predicted text is consistent with each character in the labeled text, and when a character at a certain position in a certain predicted text is consistent with the labeled text, a first probability corresponding to the character in the predicted text may be determined as a first target probability.
It can be understood that the first target probability is the probability assigned to the labeled character; since the labeled character is the correct character, the first target probability is the probability that the reference error correction model predicts the correct result.
And 204, determining a first error probability corresponding to each position according to the difference value between the maximum first probability corresponding to each position and the first target probability.
In the present disclosure, in each predicted text, when the predicted character at a position differs from the labeled character, the prediction at that position is wrong. The difference between the maximum first probability at that position (the wrongly predicted character) and the first target probability (the correctly predicted character) is taken as the first error probability for the position, from which the weak points of the reference error correction model are determined.
Step 205, determining the characters to be rewritten and the rewriting mode in the text to be corrected according to the first error probability corresponding to each position in the text to be corrected.
In the present disclosure, the first error probability of each position in the text to be corrected may be compared with a preset threshold. When the first error probability at a position is greater than the preset threshold, the probability that the position is predicted correctly is low, so the character at that position can be regarded as a weak point of the reference error correction model. The character at that position is therefore determined as a character to be rewritten, and the rewriting mode is determined as the font-based rewriting mode.
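Steps 203 to 205 can be sketched numerically as below; `position_probs` is a hypothetical aggregation of the first probabilities across the predicted texts, and the threshold value is an assumption.

```python
def weak_points(position_probs, labeled_text, threshold=0.3):
    """position_probs: one {predicted_char: first probability} dict per position.
    The first target probability is the probability assigned to the labeled
    character; the first error probability is the maximum first probability
    minus the target probability. Positions whose error probability exceeds
    the threshold are returned as weak points (characters to rewrite)."""
    weak = []
    for i, (probs, gold) in enumerate(zip(position_probs, labeled_text)):
        target = probs.get(gold, 0.0)          # first target probability
        error = max(probs.values()) - target   # first error probability
        if error > threshold:
            weak.append((i, gold, error))
    return weak
```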
In this way, the weak points of the reference error correction model are determined from the first error probability of each position in the text to be corrected, and the characters to be rewritten are determined from those weak points, so the resulting training corpus is more targeted and the text error correction model generated from it is more reliable and accurate.
And step 206, rewriting the character to be rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field.
In the present disclosure, the specific implementation process of step 206 may refer to the detailed description of any embodiment of the present disclosure, which is not repeated herein.
In the present disclosure, after a reference error correction model generated by training on a general-domain corpus and an initial training corpus containing the texts to be corrected of a target domain and the corresponding labeled texts are obtained, a text to be corrected may be input into the reference error correction model to obtain a plurality of predicted texts and the first probability of the predicted character at each position in each predicted text. The first target probability of the labeled character at each position in the labeled text is then determined according to the degree of matching between the predicted texts and the labeled text and the first probabilities of the predicted characters, and the first error probability of each position is determined from the difference between the maximum first probability at that position and the first target probability. The characters to be rewritten and the rewriting modes in the text to be corrected can thus be determined from the first error probabilities, and the characters to be rewritten are rewritten based on the rewriting modes to obtain the updated corpus. The reference error correction model is used to predict domain texts to determine its weak points in domain prediction, and training corpora are generated for those weak points, which improves the quality of the generated corpora and provides the conditions for a reliable and accurate domain-specific text error correction model.
In the present disclosure, the text error correction model may have multiple kinds of weak points, and the prediction errors may differ across weak-point types. For example, a font weak point is a position where the pinyin is predicted correctly but the character glyph is not, while a pinyin weak point is a position where the glyph is predicted correctly but the pinyin is not. To make the obtained corpus generalize better, the corpus may therefore include the pinyin corresponding to each text, so that the various types of weak points of the reference error correction model can be determined from the prediction results and different rewrites can be performed for different weak-point types. The rewritten corpus then targets the weak points of the text error correction model more directly, which further improves the model's reliability. This is described in detail below with reference to fig. 3.
Fig. 3 is a schematic flow chart of a method for acquiring a text error correction training corpus according to an embodiment of the present disclosure.
As shown in fig. 3, the method includes:
step 301, a reference error correction model and an initial training corpus are obtained, wherein the reference error correction model is generated based on the training of the corpus of the general field, and the initial training corpus includes a text to be error corrected of a target field, a labeled text corresponding to the text to be error corrected, a first pinyin sequence corresponding to the text to be error corrected and a second pinyin sequence corresponding to the labeled text.
For a specific explanation of the error correction model and the text to be corrected, reference may be made to detailed description of any embodiment of the present disclosure, which is not described herein again.
In the present disclosure, the text to be corrected may be input into a pinyin generator to generate the first pinyin sequence corresponding to it. The labeled text is the correctly corrected text, so the corresponding second pinyin sequence is the correct pinyin sequence; it may also be understood as the labeled pinyin sequence.
Step 302, inputting the text to be corrected and the first pinyin sequence into a reference correction model to obtain a plurality of predicted texts, a first probability corresponding to a predicted character at each position in each predicted text, a plurality of predicted pinyin sequences and a second probability corresponding to a predicted pinyin at each position in each predicted pinyin sequence.
In the present disclosure, the pinyin vector and the character vector corresponding to each character in the text to be corrected may be fused. For example, the pinyin vector and character vector of each character are concatenated to generate a fused vector for that character; the fused vectors of the text to be corrected are then input into the reference error correction model, which outputs a plurality of predicted texts, the first probability of the predicted character at each position in each predicted text, a plurality of predicted pinyin sequences, and the second probability of the predicted pinyin at each position in each predicted pinyin sequence, according to the probabilities of the predicted characters and predicted pinyins at each position.
Alternatively, the probability of each predicted character at each position may be multiplied by the probability of each predicted pinyin to determine the joint probability of each (predicted character, predicted pinyin) pair. When a joint probability is greater than a threshold, the corresponding predicted character, the probability of that character, the corresponding predicted pinyin, and the probability of that pinyin may be output at that position; when the joint probability is not greater than the threshold, the predicted character and the corresponding predicted pinyin are not output.
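The joint-probability filter just described can be sketched as follows; the candidate sets, probability values, and threshold are illustrative assumptions, not values taken from the disclosure.

```python
# Hedged sketch of the joint-probability filter: at a given position, a
# candidate (character, pinyin) pair is kept only when the product of the
# character's first probability and the pinyin's second probability
# exceeds the threshold.

def joint_candidates(char_probs, pinyin_probs, threshold):
    """char_probs / pinyin_probs map candidates at one position to their
    probabilities; returns the surviving (char, pinyin, joint) triples."""
    kept = []
    for ch, p_ch in char_probs.items():
        for py, p_py in pinyin_probs.items():
            joint = p_ch * p_py
            if joint > threshold:
                kept.append((ch, py, joint))
    return kept

# With a 0.5 threshold, only the ("d", "de") pair survives (0.8 * 0.9 = 0.72).
cands = joint_candidates({"d": 0.8, "g": 0.2}, {"de": 0.9, "ge": 0.1}, 0.5)
```

Because several pairs at a position may clear the threshold, the same sketch also explains why more than one corrected text can be produced.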
In the present disclosure, the pinyin vector corresponding to each character can be obtained by inputting the pinyin subsequence corresponding to that character into a neural network model for vector mapping. Alternatively, the vectors corresponding to each pinyin letter in the character's pinyin subsequence may be fused to obtain the pinyin vector for that character, which is not limited by the present disclosure.
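A minimal sketch of the vector construction described above, assuming toy embedding sizes, mean-pooling of per-letter vectors as the fusion, and concatenation for the character/pinyin fusion; none of these specific choices are prescribed by the disclosure.

```python
import numpy as np

def pinyin_vector(letter_vectors):
    # Fuse the vectors of the letters in one character's pinyin
    # subsequence; the element-wise mean is one simple choice.
    return np.mean(letter_vectors, axis=0)

def fuse(char_vec, pin_vec):
    # Concatenate (splice) the character vector with the pinyin vector
    # to form the fused per-character input of the reference model.
    return np.concatenate([char_vec, pin_vec])

# Toy 2-dim letter vectors for a pinyin subsequence such as "de".
letters = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
pv = pinyin_vector(letters)                  # element-wise mean: [0.5, 0.5]
fused = fuse(np.array([1.0, 2.0, 3.0]), pv)  # 3-dim char vec + 2-dim pinyin vec
```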
It can be understood that, since the joint probability at a given position may exceed the threshold for more than one candidate, a plurality of corrected texts may be output.
Therefore, by inputting the text to be corrected and the first pinyin sequence into the reference error correction model, the text and its corresponding pinyin sequence can be predicted simultaneously, which provides the conditions for determining the various weak points of the reference model.
Step 303, determining a first target probability corresponding to the labeled character at each position in the labeled text according to the matching degree between the predicted text and the labeled text and the first probability corresponding to the predicted character at each position.
Step 304, determining a first error probability for each position according to the difference between the maximum first probability corresponding to that position and the first target probability.
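Steps 303 and 304 can be sketched as below; the candidate distribution and the labeled character are invented for illustration.

```python
# The first target probability is the model's probability for the labeled
# (correct) character at a position; the first error probability is the gap
# between the largest first probability there and that target probability.

def first_error_probability(position_probs, labeled_char):
    """position_probs maps candidate characters at one position to their
    first probabilities; labeled_char is the annotated correct character."""
    target = position_probs.get(labeled_char, 0.0)  # first target probability
    return max(position_probs.values()) - target    # first error probability

# The model strongly favors a wrong character, so the gap is large.
err = first_error_probability({"A": 0.7, "B": 0.2}, "B")
```

A large gap means the model confidently prefers a wrong character at that position, which is exactly the weak-point signal the later steps act on.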
In the present disclosure, the specific implementation process of steps 303 to 304 may refer to the detailed description of any embodiment of the present disclosure, and is not described herein again.
Step 305, determining a second target probability corresponding to the second pinyin at each position in the second pinyin sequence according to the matching degree between the predicted pinyin sequence and the second pinyin sequence and the second probability corresponding to the predicted pinyin at each position.
In the present disclosure, each predicted pinyin sequence may be matched against the second pinyin sequence to determine whether each pinyin in the predicted pinyin sequence is consistent with the pinyin at the same position in the second pinyin sequence. When the predicted pinyin at a certain position in a predicted pinyin sequence is consistent with the pinyin at that position in the second pinyin sequence, the second probability corresponding to that predicted pinyin may be determined as the second target probability.
It can be understood that the second target probability is the probability corresponding to the labeled pinyin, and the labeled pinyin is the correct pinyin, so the second target probability is the probability that the reference error correction model predicts the correct pinyin.
Step 306, determining a second error probability for each position according to the difference between the maximum second probability corresponding to that position and the second target probability.
In the present disclosure, when the predicted pinyin at a position in a predicted pinyin sequence differs from the labeled pinyin at that position, the pinyin prediction there is wrong. The difference between the maximum second probability at that position and the second target probability corresponding to the correct pinyin may then be determined as the second error probability for the position, and the weak points of the reference error correction model are determined according to the second error probabilities.
Step 307, determining the position to be rewritten in the text to be corrected and/or the first pinyin sequence, and the rewriting mode, according to the first error probability and the second error probability corresponding to each position in the text to be corrected.
For a specific explanation of the rewriting mode, reference may be made to the detailed description of any embodiment of the disclosure, which is not repeated herein.
In the present disclosure, the first error probability corresponding to each position in the text to be corrected may be compared with a preset threshold. When the first error probability at a certain position is greater than the preset threshold, the probability that the position is predicted correctly is low, so the character at that position may be determined to be a font weak point of the reference error correction model and marked as a character to be rewritten.
Optionally, the second error probability corresponding to each position in the text to be corrected may likewise be compared with a preset threshold. When the second error probability at a certain position is greater than the preset threshold, the probability that the position is predicted correctly is low, so the character at that position may be determined to be a pinyin weak point of the reference error correction model and marked as a character to be rewritten.
In addition, different types of weak points correspond to different prediction errors. For example, a font weak point is a position where the pinyin prediction is correct but the character (font) prediction is wrong, while a pinyin weak point is a position where the character prediction is correct but the pinyin prediction is wrong. Thus, a font weak point can be rewritten with a character that has the same pinyin but a different glyph, i.e., based on the font rewriting mode, and a pinyin weak point can be rewritten with a character of a different pinyin, i.e., based on the pinyin rewriting mode. In this way, the rewritten training corpus targets the weak points of the text error correction model more precisely, further improving the reliability of the text error correction model.
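The mode-selection logic described above might look like the following; the threshold value, the return labels, and the decision rule itself are assumptions for illustration.

```python
# A position where the character prediction fails but the pinyin prediction
# succeeds is a font weak point (rewrite with a same-pinyin character); the
# opposite pattern is a pinyin weak point (rewrite with a different-pinyin
# character).

def choose_rewrite_mode(first_err, second_err, threshold=0.3):
    """first_err: character-level error probability at a position;
    second_err: pinyin-level error probability at the same position."""
    if first_err > threshold >= second_err:
        return "font"    # same pinyin, different glyph
    if second_err > threshold >= first_err:
        return "pinyin"  # different pinyin
    return None          # no rewrite at this position

mode = choose_rewrite_mode(0.6, 0.1)  # character wrong, pinyin right
```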
Optionally, the pinyin corresponding to the character to be rewritten in the first pinyin sequence may also be determined as the pinyin to be rewritten, and the pinyin to be rewritten may then be replaced with the pinyin of a confusing character corresponding to the character to be rewritten.
Step 308, rewriting the character to be rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field.
In this disclosure, the specific implementation process of step 308 may refer to the detailed description of any embodiment of the present disclosure, which is not repeated herein.
In the present disclosure, after the reference error correction model and the initial training corpus are obtained, the text to be corrected and the first pinyin sequence may be input into the reference error correction model to obtain a plurality of predicted texts, a first probability corresponding to the predicted character at each position in each predicted text, a plurality of predicted pinyin sequences, and a second probability corresponding to the predicted pinyin at each position in each predicted pinyin sequence. The position to be rewritten and the rewriting mode in the text to be corrected and/or the first pinyin sequence may then be determined based on the matching degree between the predicted text and the labeled text, the first probability corresponding to the predicted character at each position, the matching degree between the predicted pinyin sequence and the second pinyin sequence, and the second probability corresponding to the predicted pinyin at each position. Finally, the character to be rewritten is rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field. In this way, the initial training corpus of the target field is predicted with the reference error correction model to determine the model's weak points when correcting target-field text, and characters are then replaced based on those weak points, so that more targeted training corpora can be obtained; training the text error correction model on these corpora can improve the reliability and accuracy of text error correction in the target field.
Fig. 4 is a schematic flow chart of a method for acquiring a text error correction training corpus according to an embodiment of the present disclosure.
As shown in fig. 4, the method includes:
Step 401, a reference error correction model and an initial training corpus are obtained, where the reference error correction model is generated by training on a general-field corpus, and the initial training corpus includes a text to be corrected in a target field and a corresponding labeled text.
Step 402, inputting the text to be corrected into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to the predicted character at each position in each predicted text.
Step 403, determining the characters to be rewritten and the rewriting mode in the text to be corrected according to the plurality of predicted texts, the first probabilities corresponding to the predicted characters at each position, and the labeled text.
In the present disclosure, the specific implementation process of steps 401 to 403 may refer to the detailed description of any embodiment of the present disclosure, and is not described herein again.
Step 404, obtaining a confusion character set corresponding to the character to be rewritten and a distribution probability of each confusion character in the confusion character set in the target field.
In the present disclosure, it is considered that, in a given field, each character may be confused with different characters with different probabilities, and different characters may themselves be confused with different overall probabilities. For example, in the original Chinese, the probability that "的" is mistaken for "地" may differ from the probability that "的" is mistaken for "得", and the probability that "的" is confused at all may differ from that of "好". In order to make the generated error-corrected text approximate real field text as closely as possible, the distribution probability of each confusing character in the target field and the distribution probability of each character being confused can be counted, so that error-corrected text better matching the actual error patterns of field text can be generated according to these distribution probabilities.
Step 405, determining the target character corresponding to the character to be rewritten according to the distribution probability of each confusing character in the target field.
In the present disclosure, when a certain character to be rewritten is rewritten, a plurality of target characters may be generated in proportion to the distribution probability of each confusing character corresponding to that character in the target field. For example, if the distribution probability of the confusing character "d" is 0.7 and that of the confusing character "g" is 0.3, and 10 error-corrected sentences need to be generated, the target characters include seven instances of "d" and three of "g".
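The proportional generation in the example above can be sketched as follows; deterministic rounding stands in for whatever sampling scheme an actual implementation would use.

```python
# Generate target characters in proportion to each confusing character's
# distribution probability in the target field, reproducing the
# "seven 'd' and three 'g' out of 10 sentences" example above.

def target_characters(distribution, n_sentences):
    """distribution maps each confusing character to its distribution
    probability in the target field; returns n_sentences target characters."""
    targets = []
    for ch, prob in distribution.items():
        targets.extend([ch] * round(prob * n_sentences))
    return targets[:n_sentences]

chars = target_characters({"d": 0.7, "g": 0.3}, 10)
```

Replacing each character to be rewritten with these target characters in turn then yields the plurality of updated training corpora described in the next step.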
Step 406, replacing the character to be rewritten with the target character to generate the updated text error correction training corpus.
In the present disclosure, each character to be rewritten in the text to be corrected may be replaced multiple times, each time with one of the plurality of target characters, so as to generate a plurality of updated text error correction training corpora.
After acquiring the reference error correction model generated by training on a general-field corpus and the initial training corpus comprising the text to be corrected in the target field and the corresponding labeled text, the text to be corrected may be input into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to the predicted character at each position in each predicted text. The characters to be rewritten and the rewriting mode in the text to be corrected are then determined according to the plurality of predicted texts, the first probabilities corresponding to the predicted characters at each position, and the labeled text. Next, the confusion character set corresponding to each character to be rewritten and the distribution probability of each confusing character in that set in the target field can be obtained, the target character corresponding to the character to be rewritten is determined according to those distribution probabilities, and the target character replaces the character to be rewritten to generate the updated text error correction training corpus. In this way, the reference error correction model is used to predict field text so as to determine its weak points in that field, and characters are then rewritten at those weak points according to the distribution probability of each confusing character in the field, generating training corpora that better match the actual error patterns of field text. This improves the quality of the generated training corpora and provides conditions for obtaining a reliable and accurate field-specific text error correction model.
In order to implement the foregoing embodiment, an apparatus for acquiring a text error correction corpus is further provided in the embodiments of the present disclosure. Fig. 5 is a schematic structural diagram of an apparatus for acquiring a text error correction training corpus according to an embodiment of the present disclosure.
As shown in fig. 5, the apparatus 500 for acquiring text error correction corpus includes: an acquisition module 510, a prediction module 520, a determination module 530, and a rewrite module 540.
An obtaining module 510, configured to obtain a reference error correction model and an initial training corpus, where the reference error correction model is generated by training on a general-field corpus, and the initial training corpus includes a text to be corrected in a target field and a corresponding labeled text;
the prediction module 520 is configured to input the text to be corrected into the reference correction model to obtain a plurality of predicted texts and a first probability corresponding to a predicted character at each position in each predicted text;
a determining module 530, configured to determine, according to the multiple predicted texts, the first probability corresponding to the predicted character at each position, and the labeled text, a character to be rewritten and a rewriting mode in the text to be corrected;
and a rewriting module 540, configured to rewrite the character to be rewritten based on the rewriting mode, so as to obtain an updated text error correction training corpus corresponding to the target field.
In a possible implementation manner of the embodiment of the present disclosure, the determining module 530 is configured to:
determining a first target probability corresponding to the labeled character at each position in the labeled text according to the matching degree between the predicted text and the labeled text and the first probability corresponding to the predicted character at each position;
determining a first error probability corresponding to each position according to the difference value between the maximum first probability corresponding to each position and the first target probability;
and determining characters to be rewritten and rewriting modes in the text to be corrected according to the first error probability corresponding to each position in the text to be corrected.
In a possible implementation manner of the embodiment of the present disclosure, the initial training corpus further includes a first pinyin sequence corresponding to the text to be corrected and a second pinyin sequence corresponding to the labeled text, and the prediction module 520 is configured to:
inputting the text to be corrected and the first pinyin sequence into a reference correction model to obtain a plurality of predicted texts, a first probability corresponding to a predicted character at each position in each predicted text, a plurality of predicted pinyin sequences and a second probability corresponding to a predicted pinyin at each position in each predicted pinyin sequence.
In a possible implementation manner of the embodiment of the present disclosure, the determining module 530 is configured to:
determining a first target probability corresponding to the labeled character at each position in the labeled text according to the matching degree between the predicted text and the labeled text and the first probability corresponding to the predicted character at each position;
determining a first error probability corresponding to each position according to the difference value between the maximum first probability corresponding to each position and the first target probability;
determining a second target probability corresponding to the second pinyin of each position in the second pinyin sequence according to the matching degree of the predicted pinyin sequence and the second probability corresponding to the predicted pinyin of each position;
determining a second error probability corresponding to each position according to the difference value between the maximum second probability corresponding to each position and the second target probability;
and determining the position to be rewritten in the text to be corrected and/or the first pinyin sequence and the rewriting mode according to the first error probability and the second error probability corresponding to each position in the text to be corrected.
In a possible implementation manner of the embodiment of the present disclosure, the prediction module 520 is configured to:
determining a pinyin subsequence corresponding to each character in the first pinyin sequence;
aggregating each pinyin subsequence to determine a pinyin vector corresponding to each character;
fusing a character vector corresponding to each character in the text to be corrected with the pinyin vector to obtain a fused vector corresponding to each character;
and inputting the fusion vector corresponding to the text to be corrected into the reference error correction model.
In a possible implementation manner of the embodiment of the present disclosure, the rewriting module 540 is configured to:
acquiring a confusion character set corresponding to a character to be rewritten and the distribution probability of each confusion character in the confusion character set in a target field;
determining a target character corresponding to the character to be rewritten according to the distribution probability of each confusing character in the target field;
and replacing the characters to be rewritten by using the target characters to generate the updated text error correction training corpus.
It should be noted that the explanation of the foregoing embodiment of the method for obtaining text error correction training corpus is also applicable to the apparatus of this embodiment, and therefore, the description thereof is omitted here.
With the apparatus of the embodiments of the present disclosure, after the reference error correction model generated by training on a general-field corpus and the initial training corpus comprising the text to be corrected in the target field and the corresponding labeled text are obtained, the text to be corrected can be input into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to the predicted character at each position in each predicted text. The characters to be rewritten and the rewriting mode in the text to be corrected are then determined according to the plurality of predicted texts, the first probabilities corresponding to the predicted characters at each position, and the labeled text, and the characters to be rewritten are rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field. In this way, the reference error correction model predicts field text to determine its weak points in that field, and training corpora are generated targeting those weak points, which improves the quality of the generated corpora and provides conditions for obtaining a reliable and accurate field-specific text error correction model.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601 which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 602 or a computer program loaded from a storage unit 608 into a RAM (Random Access Memory) 603. In the RAM 603, various programs and data necessary for the operation of the device 600 can also be stored. The calculation unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An I/O (Input/Output) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, and the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing Unit 601 include, but are not limited to, a CPU (Central Processing Unit), a GPU (graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing Units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable Processor, controller, microcontroller, and the like. The calculation unit 601 performs the respective methods and processes described above, such as the acquisition method of the text correction corpus. For example, in some embodiments, the method of obtaining the text correction corpus may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 600 via ROM 602 and/or communications unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the above-described method for obtaining text error correction training corpus may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method of obtaining the text correction corpus in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be realized in digital electronic circuitry, Integrated circuitry, FPGAs (Field Programmable Gate arrays), ASICs (Application-Specific Integrated circuits), ASSPs (Application Specific Standard products), SOCs (System On Chip, System On a Chip), CPLDs (Complex Programmable Logic devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (erasable Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of large management difficulty and weak service extensibility in a conventional physical host and VPS service (Virtual Private Server). The server may also be a server of a distributed system, or a server incorporating a blockchain.
According to an embodiment of the present disclosure, the present disclosure further provides a computer program product; when instructions in the computer program product are executed by a processor, the method for acquiring a text error correction training corpus proposed by the above embodiments of the present disclosure is performed.
It should be understood that the steps in the various flows shown above may be reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A method for acquiring text correction training corpora comprises the following steps:
acquiring a reference error correction model and an initial training corpus, wherein the reference error correction model is generated based on the training of a universal field corpus, and the initial training corpus comprises a text to be corrected of a target field and a corresponding labeled text;
inputting the text to be corrected into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to a predicted character at each position in each predicted text;
determining characters to be rewritten and rewriting modes in the text to be corrected according to the plurality of predicted texts, the first probability corresponding to the predicted characters at each position and the labeled text;
and rewriting the character to be rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field.
2. The method of claim 1, wherein the determining the characters to be rewritten and the rewriting mode in the text to be corrected according to the plurality of predicted texts, the first probability corresponding to the predicted characters at each position, and the labeled text comprises:
determining a first target probability corresponding to the labeled character at each position in the labeled text according to a degree of matching between the predicted texts and the labeled text and the first probability corresponding to the predicted character at each position;
determining a first error probability corresponding to each position according to a difference between the maximum first probability corresponding to the position and the first target probability;
and determining characters to be rewritten and rewriting modes in the text to be corrected according to the first error probability corresponding to each position in the text to be corrected.
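The error-probability computation in claim 2 can be sketched as follows. This is an illustrative reading only; the function names, the list-based probability representation, and the thresholding rule for selecting characters are assumptions, not part of the claimed method:

```python
def first_error_probabilities(predicted_probs, target_probs):
    # predicted_probs[i] holds the model's probability for each candidate
    # character at position i; target_probs[i] is the first target probability,
    # i.e. the probability assigned to the labeled character at that position.
    # The first error probability is the gap between the model's most
    # confident prediction and the labeled character's probability.
    return [max(cands) - target for cands, target in zip(predicted_probs, target_probs)]


def characters_to_rewrite(text, predicted_probs, target_probs, threshold=0.1):
    # Positions with a large gap are candidates for rewriting: there the
    # reference error correction model disagrees strongly with the label.
    gaps = first_error_probabilities(predicted_probs, target_probs)
    return [(i, ch) for (i, ch), gap in zip(enumerate(text), gaps) if gap > threshold]
```

A position where the model's top prediction already matches the labeled character has a gap of zero and is left untouched.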
3. The method according to claim 1, wherein the initial training corpus further comprises a first pinyin sequence corresponding to the text to be corrected and a second pinyin sequence corresponding to the labeled text, and the inputting the text to be corrected into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to a predicted character at each position in each predicted text comprises:
and inputting the text to be corrected and the first pinyin sequence into the reference error correction model to obtain a plurality of predicted texts, a first probability corresponding to the predicted character at each position in each predicted text, a plurality of predicted pinyin sequences, and a second probability corresponding to the predicted pinyin at each position in each predicted pinyin sequence.
4. The method of claim 3, wherein the determining the characters to be rewritten and the rewriting mode in the text to be corrected according to the plurality of predicted texts, the first probability corresponding to the predicted characters at each position, and the labeled text comprises:
determining a first target probability corresponding to the labeled character at each position in the labeled text according to a degree of matching between the predicted texts and the labeled text and the first probability corresponding to the predicted character at each position;
determining a first error probability corresponding to each position according to a difference between the maximum first probability corresponding to the position and the first target probability;
determining a second target probability corresponding to the second pinyin at each position in the second pinyin sequence according to a degree of matching between the predicted pinyin sequences and the second pinyin sequence and the second probability corresponding to the predicted pinyin at each position;
determining a second error probability corresponding to each position according to a difference between the maximum second probability corresponding to the position and the second target probability;
and determining the position to be rewritten and the rewriting mode in the text to be corrected and/or the first pinyin sequence according to the first error probability and the second error probability corresponding to each position in the text to be corrected.
5. The method of claim 3, wherein the inputting the text to be corrected and the first pinyin sequence into the reference error correction model comprises:
determining a pinyin subsequence corresponding to each character in the first pinyin sequence;
aggregating each pinyin subsequence to determine a pinyin vector corresponding to each character;
fusing a character vector corresponding to each character in the text to be corrected with the pinyin vector to obtain a fused vector corresponding to each character;
and inputting the fusion vector corresponding to the text to be corrected into the reference error correction model.
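The pinyin aggregation and fusion of claim 5 can be sketched as follows. The claim does not fix the aggregation or fusion operators, so the element-wise mean and element-wise addition used here are assumptions for illustration:

```python
def pinyin_vector(letter_embeddings):
    # Aggregate the embeddings of one character's pinyin letters
    # (its pinyin subsequence) into a single pinyin vector, here by
    # taking the element-wise mean.
    n = len(letter_embeddings)
    dim = len(letter_embeddings[0])
    return [sum(vec[d] for vec in letter_embeddings) / n for d in range(dim)]


def fuse(char_vector, pinyin_vec):
    # Fuse the character vector with its pinyin vector; element-wise
    # addition keeps the fused vector at the same dimensionality, ready
    # to be fed into the reference error correction model.
    return [c + p for c, p in zip(char_vector, pinyin_vec)]
```

Concatenation or a learned gating layer would be equally valid fusion choices; addition is used only to keep the sketch minimal.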
6. The method according to any one of claims 1 to 5, wherein the rewriting the character to be rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field comprises:
acquiring a confusion character set corresponding to the character to be rewritten and the distribution probability of each confusion character in the confusion character set in the target field;
determining a target character corresponding to the character to be rewritten according to the distribution probability of each confusion character in the target field;
and replacing the character to be rewritten by using the target character to generate an updated text error correction training corpus.
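Claim 6's confusion-set replacement can be sketched as follows. Sampling a confusion character in proportion to its distribution probability in the target field is one plausible reading of "determining a target character according to the distribution probability"; the names and the sampling scheme are assumptions:

```python
import random


def pick_target_character(confusion_set, field_probs, rng=None):
    # Choose a replacement for the character to be rewritten, weighting
    # each confusion character by its distribution probability in the
    # target field, so frequent in-domain confusions are sampled more often.
    rng = rng or random.Random()
    chars = list(confusion_set)
    weights = [field_probs.get(c, 0.0) for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


def rewrite(text, index, target_char):
    # Replace the character to be rewritten with the sampled target
    # character, yielding one updated training-corpus example.
    return text[:index] + target_char + text[index + 1:]
```

Seeding the random generator makes corpus generation reproducible across runs.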
7. An apparatus for acquiring a text error correction training corpus, comprising:
an acquisition module, configured to acquire a reference error correction model and an initial training corpus, wherein the reference error correction model is generated based on training with a corpus of a general field, and the initial training corpus comprises a text to be corrected in a target field and a corresponding labeled text;
a prediction module, configured to input the text to be corrected into the reference error correction model to obtain a plurality of predicted texts and a first probability corresponding to the predicted character at each position in each predicted text;
a determining module, configured to determine characters to be rewritten and rewriting modes in the text to be corrected according to the plurality of predicted texts, the first probability corresponding to the predicted character at each position, and the labeled text; and
a rewriting module, configured to rewrite the character to be rewritten based on the rewriting mode to obtain the updated text error correction training corpus corresponding to the target field.
8. The apparatus of claim 7, wherein the determining module is configured to:
determine a first target probability corresponding to the labeled character at each position in the labeled text according to a degree of matching between the predicted texts and the labeled text and the first probability corresponding to the predicted character at each position;
determine a first error probability corresponding to each position according to a difference between the maximum first probability corresponding to the position and the first target probability;
and determining characters to be rewritten and rewriting modes in the text to be corrected according to the first error probability corresponding to each position in the text to be corrected.
9. The apparatus according to claim 7, wherein the initial training corpus further comprises a first pinyin sequence corresponding to the text to be corrected and a second pinyin sequence corresponding to the labeled text, and the prediction module is configured to:
input the text to be corrected and the first pinyin sequence into the reference error correction model to obtain a plurality of predicted texts, a first probability corresponding to the predicted character at each position in each predicted text, a plurality of predicted pinyin sequences, and a second probability corresponding to the predicted pinyin at each position in each predicted pinyin sequence.
10. The apparatus of claim 9, wherein the determining module is configured to:
determine a first target probability corresponding to the labeled character at each position in the labeled text according to a degree of matching between the predicted texts and the labeled text and the first probability corresponding to the predicted character at each position;
determine a first error probability corresponding to each position according to a difference between the maximum first probability corresponding to the position and the first target probability;
determine a second target probability corresponding to the second pinyin at each position in the second pinyin sequence according to a degree of matching between the predicted pinyin sequences and the second pinyin sequence and the second probability corresponding to the predicted pinyin at each position;
determine a second error probability corresponding to each position according to a difference between the maximum second probability corresponding to the position and the second target probability;
and determining the position to be rewritten and the rewriting mode in the text to be corrected and/or the first pinyin sequence according to the first error probability and the second error probability corresponding to each position in the text to be corrected.
11. The apparatus of claim 9, wherein the prediction module is configured to:
determining a pinyin subsequence corresponding to each character in the first pinyin sequence;
aggregating each pinyin subsequence to determine a pinyin vector corresponding to each character;
fusing a character vector corresponding to each character in the text to be corrected with the pinyin vector to obtain a fused vector corresponding to each character;
and inputting the fusion vector corresponding to the text to be corrected into the reference error correction model.
12. The apparatus according to any one of claims 7 to 11, wherein the rewriting module is configured to:
acquiring a confusion character set corresponding to the character to be rewritten and the distribution probability of each confusion character in the confusion character set in the target field;
determining a target character corresponding to the character to be rewritten according to the distribution probability of each confusion character in the target field;
and replacing the character to be rewritten with the target character to generate an updated text error correction training corpus.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202210537412.3A 2022-05-17 2022-05-17 Method and device for acquiring text error correction training corpus and electronic equipment Pending CN114781386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210537412.3A CN114781386A (en) 2022-05-17 2022-05-17 Method and device for acquiring text error correction training corpus and electronic equipment

Publications (1)

Publication Number Publication Date
CN114781386A true CN114781386A (en) 2022-07-22

Family

ID=82437539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210537412.3A Pending CN114781386A (en) 2022-05-17 2022-05-17 Method and device for acquiring text error correction training corpus and electronic equipment

Country Status (1)

Country Link
CN (1) CN114781386A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933268A (en) * 2024-03-21 2024-04-26 山东大学 End-to-end unsupervised adversarial text rewriting method and device

Similar Documents

Publication Publication Date Title
CN113705187B (en) Method and device for generating pre-training language model, electronic equipment and storage medium
CN113807098A (en) Model training method and device, electronic equipment and storage medium
US20180025121A1 (en) Systems and methods for finer-grained medical entity extraction
CN112579727B (en) Document content extraction method and device, electronic equipment and storage medium
CN113220836A (en) Training method and device of sequence labeling model, electronic equipment and storage medium
CN113722493A (en) Data processing method, device, storage medium and program product for text classification
CN113887627A (en) Noise sample identification method and device, electronic equipment and storage medium
CN113221565A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN113836925A (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN112541070A (en) Method and device for excavating slot position updating corpus, electronic equipment and storage medium
CN112270168A (en) Dialogue emotion style prediction method and device, electronic equipment and storage medium
CN112926308A (en) Method, apparatus, device, storage medium and program product for matching text
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN113408273B (en) Training method and device of text entity recognition model and text entity recognition method and device
US20220171931A1 (en) Management of text-item recognition systems
CN112270169B (en) Method and device for predicting dialogue roles, electronic equipment and storage medium
CN114781386A (en) Method and device for acquiring text error correction training corpus and electronic equipment
CN114764874B (en) Deep learning model training method, object recognition method and device
CN114239583B (en) Method, device, equipment and medium for training entity chain finger model and entity chain finger
CN114429106B (en) Page information processing method and device, electronic equipment and storage medium
CN114141236B (en) Language model updating method and device, electronic equipment and storage medium
CN115600592A (en) Method, device, equipment and medium for extracting key information of text content
CN114416941A (en) Generation method and device of dialogue knowledge point determination model fusing knowledge graph
CN114647727A (en) Model training method, device and equipment applied to entity information recognition
CN113641724A (en) Knowledge tag mining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination