CN116579327A - Text error correction model training method, text error correction method, device and storage medium - Google Patents

Text error correction model training method, text error correction method, device and storage medium

Info

Publication number
CN116579327A
Authority
CN
China
Prior art keywords
error correction
character
correction model
text
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310863192.8A
Other languages
Chinese (zh)
Other versions
CN116579327B (en)
Inventor
孙俊
田志豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Uni Entropy Intelligent Technology Wuxi Co ltd
Original Assignee
Uni Entropy Intelligent Technology Wuxi Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Uni Entropy Intelligent Technology Wuxi Co ltd filed Critical Uni Entropy Intelligent Technology Wuxi Co ltd
Priority to CN202310863192.8A priority Critical patent/CN116579327B/en
Publication of CN116579327A publication Critical patent/CN116579327A/en
Application granted granted Critical
Publication of CN116579327B publication Critical patent/CN116579327B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of text error correction, and particularly discloses a text error correction model training method, a text error correction method, a device and a storage medium. The training method comprises the following steps: acquiring a training data set; preprocessing the training data set to obtain an input sequence; training an initial error correction model according to the input sequence to obtain a prediction probability of each character in the input sequence, wherein the prediction probability represents the probability that each character is predicted to be other characters in a candidate set corresponding to the character, each character corresponds to a candidate set, and the candidate set comprises a set of characters with multi-modal associated characteristics with the corresponding character; constructing a negative sample data set corresponding to each character according to the prediction probability of the character, and determining a positive sample corresponding to the character; and optimizing the initial error correction model according to the negative sample data set and the positive sample to obtain the target error correction model. The text error correction model training method provided by the invention can improve the accuracy of text error correction.

Description

Text error correction model training method, text error correction method, device and storage medium
Technical Field
The present invention relates to the field of text correction technologies, and in particular, to a text correction model training method, a text correction method, an electronic device, and a computer readable storage medium.
Background
Chinese spelling correction is an important task that aims to correct spelling errors at the text or character level. It requires comprehensive knowledge of character similarity, language modeling and reasoning, which makes text correction one of the most challenging tasks in natural language processing, and Chinese text correction is more difficult than English text correction. The technique plays an important role in various natural language processing applications such as search optimization, machine translation and entry annotation.
The traditional Chinese spelling correction (CSC) method first detects misspelled characters through a language model and generates candidate characters, then filters out wrong candidates through a language model or rules, finally obtains the character the model considers correct, and replaces the original wrong character with it. CSC is nevertheless very challenging, because it mainly suffers from confusable characters, such as phonetically similar and visually similar characters.
Pre-trained language models (PLM) such as BERT have been used for CSC tasks and are the mainstream solution. However, there is still a significant gap between an already trained PLM and the CSC task target. A PLM provides an information representation from a semantic point of view, but if only the semantics in CSC are considered, many characters are appropriate as candidates for correction. Moreover, during pre-training the mask mechanism has no constraint of phonetic and visual similarity, so the PLM can easily predict semantically correct or common characters, but these characters are not necessarily the truly correct characters.
Therefore, how to improve the error correction accuracy of the training model is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a text error correction model training method, a text error correction method, electronic equipment and a computer readable storage medium, which solve the problem of low error correction accuracy in the related technology.
As a first aspect of the present invention, there is provided a text error correction model training method, including:
acquiring a training data set, wherein the training data set comprises an original sentence and a target sentence;
preprocessing the training data set to obtain an input sequence;
training an initial error correction model according to the input sequence to obtain the prediction probability of each character in the input sequence, wherein the prediction probability represents the probability that each character is predicted to be other characters in a candidate set corresponding to the character, each character corresponds to a candidate set, and the candidate set comprises a set of characters with multi-modal associated characteristics with the corresponding character;
constructing a negative sample data set corresponding to each character according to the prediction probability of the character, and determining a positive sample corresponding to the character;
and optimizing the initial error correction model according to the negative sample data set and the positive sample to obtain a target error correction model.
Further, constructing a negative sample data set corresponding to each character according to the prediction probability of the character, and determining a positive sample data set corresponding to the character, including:
taking a set corresponding to a character, wherein the prediction probability is larger than a preset probability threshold value and the actual character and the correct character are not in accordance, as a negative sample data set;
the correct character is taken as a positive sample.
Further, taking a set corresponding to a character whose prediction probability is greater than a preset probability threshold and whose actual character does not conform to the correct character as a negative sample data set, including:
comparing the prediction probability of each character with a preset probability threshold;
if the prediction probability of the current character is larger than a preset probability threshold, comparing the current character with the correct character in the target sentence;
and if the current character does not accord with the correct character in the target sentence, taking a plurality of candidate characters in the candidate set corresponding to the current character as a negative sample data set.
Further, optimizing the initial error correction model according to the negative sample data set and the positive sample to obtain a target error correction model, including:
respectively acquiring the prediction probability of the negative sample data set and the prediction probability of the positive sample;
and optimizing the initial error correction model according to a comparison result of the prediction probability of the negative sample data set and the prediction probability of the positive sample, so as to obtain a target error correction model.
Further, optimizing the initial error correction model according to a comparison result of the prediction probability of the negative sample data set and the prediction probability of the positive sample to obtain a target error correction model, including:
constructing an optimization loss function according to the prediction probability of the negative sample data set and the prediction probability of the positive sample, wherein the optimization loss function is used for increasing the prediction probability of the positive sample and reducing the prediction probability of the negative sample data set, and is used for maximizing the difference value between the prediction probability of the negative sample data set and the prediction probability of the positive sample;
and training the initial error correction model according to the optimized loss function to obtain a target error correction model.
Further, optimizing the initial error correction model according to a comparison result of the prediction probability of the negative sample data set and the prediction probability of the positive sample to obtain a target error correction model, and further comprising:
determining a target loss function according to the optimized loss function and an initial loss function corresponding to the initial error correction model;
and training the initial error correction model according to the target loss function to obtain a target error correction model.
Further, training an initial error correction model according to the input sequence to obtain a prediction probability of each character in the input sequence, including:
inputting the input sequence and the multi-modal associated feature of each character in the input sequence into an error detection network to perform error probability prediction on each character in the input sequence, and obtaining an error probability matrix corresponding to the input sequence, wherein the multi-modal associated feature comprises semantic information, word pronunciation information and font information;
and inputting the error probability matrix into an error correction network for training to obtain the prediction probability of each character in the input sequence.
As another aspect of the present invention, there is provided a text error correction method, including:
acquiring an input sequence corresponding to a text to be corrected;
inputting an input sequence corresponding to the text to be corrected into a text correction model to obtain a text correction prediction result, wherein the text correction model is obtained according to the text correction model training method;
and decoding the text error correction prediction result to obtain a target text corresponding to the text to be corrected.
As another aspect of the present invention, there is provided an electronic apparatus, including: a processor and a memory, the memory being used for storing computer instructions, and the processor being used for loading and executing the computer instructions to realize the text error correction model training method or the text error correction method.
As another aspect of the present invention, there is provided a computer readable storage medium for storing computer instructions which, when loaded and executed by a processor, implement the text error correction model training method described above, or implement the text error correction method described above.
According to the text error correction model training method provided by the invention, the prediction probability of each character is obtained after the initial error correction model is trained, and then the positive and negative samples are constructed to optimize the initial error correction model, so that the target error correction model is obtained, which can improve the accuracy of text error correction.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate the invention and together with the description serve to explain, without limitation, the invention.
Fig. 1 is a flowchart of a text error correction model training method provided by the invention.
Fig. 2 is a flowchart for obtaining a predicted probability of each character according to the present invention.
Fig. 3a is a block diagram of the overall structure of the text error correction model provided by the present invention.
Fig. 3b is a block diagram of the GBERT model according to the present invention.
Fig. 4 is a flowchart of positive and negative sample construction provided by the present invention.
Fig. 5 is a flowchart of obtaining the target error correction model provided by the present invention.
Fig. 6 is a flowchart of a text error correction method provided by the present invention.
Fig. 7 is a block diagram of an electronic device according to the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art will better understand the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this embodiment, a text error correction model training method is provided, and fig. 1 is a flowchart of the text error correction model training method provided in an embodiment of the present invention, as shown in fig. 1, including:
s100, acquiring a training data set, wherein the training data set comprises an original sentence and a target sentence;
in the embodiment of the present invention, a training data set is prepared, and the training data set may specifically include an original sentence and a target sentence, as shown in table 1, which is an example of a specific training data set in the embodiment of the present invention.
S200, preprocessing the training data set to obtain an input sequence;
preprocessing the training data set specifically includes removing invalid characters in sentences, converting the sentences into specific vectors, and setting the dimensions of the specific vectors according to the needs of the model, which is not limited herein. In addition, the specific process of converting text into vectors is well known to those skilled in the art, and will not be described here.
S300, training an initial error correction model according to the input sequence to obtain the prediction probability of each character in the input sequence, wherein the prediction probability represents the probability that each character is predicted to be other characters in a candidate set corresponding to the character, each character corresponds to a candidate set, and the candidate set comprises a set of characters with multi-modal associated characteristics with the corresponding character;
in the embodiment of the present invention, the initial error correction model may specifically include an error detection network and an error correction network, where the error detection network first checks whether each character in the input sequence is an error character, and the error correction network trains an error probability matrix formed by the input sequence to obtain a prediction probability of each character.
It should be appreciated that, for example, if the entered content is "the weather is good today", a prediction probability would be output for each character in the sentence.
Specifically, as shown in fig. 2, training an initial error correction model according to the input sequence to obtain a prediction probability of each character in the input sequence, including:
s310, inputting the input sequence and multi-modal associated features of each character in the input sequence into an error detection network to predict the error probability of each character in the input sequence and obtain an error probability matrix corresponding to the input sequence, wherein the multi-modal associated features comprise semantic information, word sound information and font information;
fig. 3a is an overall structural block diagram of the text error correction model, and fig. 3b is a structural diagram of the GBERT model. The GBERT model may specifically include an error detection network GRU network and an error correction network Bert network.
The embedding vector output by the embedding layer is received and first taken as the input of the GRU network, and the probability of each character being erroneous is predicted through the GRU network; the obtained probability takes into account three parts of the input sequence: semantic information, word pronunciation information and font information.
It will be appreciated that the text error correction model shown in fig. 3a examines the input sequence to determine which part has the greatest probability of being erroneous, and then replaces the erroneous part with an appropriate candidate character.
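A minimal sketch of such a detection network is shown below, assuming PyTorch. The layer sizes and the simple summation used to fuse the three embeddings (semantic, pronunciation, glyph) are illustrative assumptions, not the exact GBERT configuration.

```python
import torch
import torch.nn as nn

class DetectionNetwork(nn.Module):
    """Bi-directional GRU that predicts a per-character error probability from fused embeddings."""
    def __init__(self, embed_dim=768, hidden=256):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, semantic_emb, pinyin_emb, glyph_emb):
        # Fuse the three modalities; plain summation is assumed here.
        fused = semantic_emb + pinyin_emb + glyph_emb
        states, _ = self.gru(fused)
        # One error probability per character in the sequence.
        return torch.sigmoid(self.classifier(states)).squeeze(-1)

# Usage with random tensors: a batch of 2 sentences, 16 characters each.
det = DetectionNetwork()
embeddings = [torch.randn(2, 16, 768) for _ in range(3)]
err_prob = det(*embeddings)          # shape (2, 16), values in (0, 1)
print(err_prob.shape)
```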
S320, inputting the error probability matrix into an error correction network for training, and obtaining the prediction probability of each character in the input sequence.
The obtained error probability matrix is taken as the input of the error correction network, and the Bert error correction network predicts each character in the sequence according to the result of the GRU, obtaining the probability of each character based on character-level features. The specific prediction probability is calculated as follows:
P(y_i = j | X) = softmax(W·h_i + b)_j,

wherein y_i = j denotes that the i-th character x_i in the input sequence X is predicted as the j-th character of the vocabulary, P(y_i = j | X) is the corresponding conditional probability, W ∈ R^{vocab×hidden} and b ∈ R^{vocab} are trainable parameters, vocab represents the vocabulary size, hidden is the hidden state size, and h_i ∈ R^{hidden} is the hidden state output of the model for the i-th character x_i.
In the embodiment of the present invention, the vocabulary vocab may be specifically understood as providing, for each character, the corresponding candidate characters; for example, for the character "天" (day), the corresponding candidate set may contain characters such as "填" (fill), "添" (add) and "田" (field).
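To make the formula concrete, a small sketch of the prediction-probability computation is given below. The tensor shapes follow the dimensions named above (vocab, hidden); the specific sizes and variable names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

hidden, vocab = 768, 21128          # assumed sizes for illustration
W = torch.randn(vocab, hidden)      # trainable parameter W ∈ R^{vocab×hidden}
b = torch.randn(vocab)              # trainable parameter b ∈ R^{vocab}

def prediction_probability(h_i):
    """P(y_i = j | X) = softmax(W·h_i + b)_j: distribution of the i-th character over the vocabulary."""
    logits = W @ h_i + b
    return F.softmax(logits, dim=-1)

h_i = torch.randn(hidden)           # hidden state of the i-th character from the correction network
p = prediction_probability(h_i)
print(p.shape, float(p.sum()))      # torch.Size([21128]), sums to 1.0
```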
S400, constructing a negative sample data set corresponding to each character according to the prediction probability of the character, and determining a positive sample corresponding to the character;
in an embodiment of the present invention, as shown in fig. 4, the method specifically may include:
s410, taking a set corresponding to a character, wherein the prediction probability is larger than a preset probability threshold value and the actual character and the correct character do not accord with each other, as a negative sample data set;
specifically, the method comprises the following steps:
comparing the prediction probability of each character with a preset probability threshold;
if the prediction probability of the current character is larger than a preset probability threshold, comparing the current character with the correct character in the target sentence;
and if the current character does not accord with the correct character in the target sentence, taking a plurality of candidate characters in the candidate set corresponding to the current character as a negative sample data set.
S420, taking the correct character as a positive sample.
It should be noted that a negative sample dataset is defined as the common characters to which the PLM wrongly assigned a high predictive probability before the optimization process. From observation, negative samples that can form common collocations or phrases with the context tend to be given a higher probability than the correct (golden) character, causing the model to make a wrong correction. Thus, in an embodiment of the present invention, the negative sample dataset used in the next stage is determined by calculating the predictive probability of each character. Based on the model's original predicted probability (i.e., the prediction probability of each character), if the model corrects the input character to a wrong character, a negative sample dataset is selected for that input character. The negative sample set Neg is selected from the candidate set T:
T = { t | t ∈ V and t ≠ y_i },

Neg_i = arg top-K_{t ∈ T} P(y'_i = t | X), Neg_i ⊆ T,

wherein y_i and y'_i denote the positive and negative samples of the i-th character respectively, Neg_i consists of the first K characters of the vocabulary V ranked by predictive probability, P(y'_i = t | X) is the probability that the i-th character in the input sequence X is predicted as the negative sample t, the optimum value of K is empirically selected to be 5, and Neg_i is a subset of the candidate set T. The positive sample is selected as the correct character given by the label in the dataset.
For example, the original sentence is "I ate breakfast" in which the character "菜" (dish) is mistakenly written in place of "餐" (meal), and the target sentence is the correct "I ate breakfast". The error probability of each character in the original sentence is calculated to form an error probability matrix, from which the prediction probability is further calculated. The prediction probability calculated for "菜" (dish) is 95%, which is greater than the preset probability threshold of 75%, so "菜" is compared with "餐" (meal) in the target sentence; since it does not match the correct character in the target sentence, the candidate characters in the candidate set corresponding to "菜" are taken as the negative sample data set, and "餐" is taken as the positive sample.
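A minimal sketch of this positive/negative sample construction is shown below. The threshold of 0.75 and K = 5 follow the example and the empirical choice above; the function and variable names are assumptions.

```python
import torch

def build_samples(pred_probs, golden_ids, threshold=0.75, k=5):
    """For each character, keep the golden character as the positive sample and,
    when the model confidently predicts a wrong character, take the top-K
    vocabulary candidates as the negative sample set Neg."""
    positives, negatives = [], []
    for probs, gold in zip(pred_probs, golden_ids):
        positives.append(gold)
        top_prob, top_id = probs.max(dim=-1)
        # Build a negative set only when the model is confident and its prediction is wrong.
        if top_prob > threshold and top_id.item() != gold:
            probs = probs.clone()
            probs[gold] = -1.0                 # exclude the golden character (t ≠ y_i)
            neg_ids = probs.topk(k).indices    # first K characters by predictive probability
            negatives.append(neg_ids.tolist())
        else:
            negatives.append([])
    return positives, negatives

# Usage with toy numbers: 3 characters over a vocabulary of 6.
pred = torch.tensor([[0.02, 0.90, 0.02, 0.02, 0.02, 0.02],
                     [0.80, 0.05, 0.05, 0.05, 0.03, 0.02],
                     [0.10, 0.10, 0.20, 0.20, 0.20, 0.20]])
pos, neg = build_samples(pred, golden_ids=[1, 3, 4])
print(pos, neg)   # the 2nd character gets a negative set: prob 0.8 > 0.75 but prediction 0 != golden 3
```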
And S500, optimizing the initial error correction model according to the negative sample data set and the positive sample to obtain a target error correction model.
In the embodiment of the invention, the initial error correction model is continuously optimized on the basis of the obtained negative sample data set and the positive sample so as to obtain the target error correction model.
Specifically, as shown in fig. 5, optimizing the initial error correction model according to the negative sample data set and the positive sample to obtain a target error correction model includes:
s510, respectively acquiring the prediction probability of the negative sample data set and the prediction probability of the positive sample;
s520, optimizing the initial error correction model according to a comparison result of the prediction probability of the negative sample data set and the prediction probability of the positive sample, and obtaining a target error correction model.
Further specifically, optimizing the initial error correction model according to a comparison result of the prediction probability of the negative sample data set and the prediction probability of the positive sample to obtain a target error correction model, including:
constructing an optimization loss function according to the prediction probability of the negative sample data set and the prediction probability of the positive sample, wherein the optimization loss function is used for increasing the prediction probability of the positive sample and reducing the prediction probability of the negative sample data set, and is used for maximizing the difference value between the prediction probability of the negative sample data set and the prediction probability of the positive sample;
and training the initial error correction model according to the optimized loss function to obtain a target error correction model.
It should be appreciated that after obtaining the positive/negative samples and their corresponding predictive probabilities, a model is trained by a Comparative Probability Optimization (CPO) target, which is defined as:
L_CPO = -(1/N) Σ_{i=1}^{N} (1/K) Σ_{k=1}^{K} ( P(y_i | X) - P(y'_{i,k} | X) ),

where N represents the number of data samples used in a single training step, K is the selected negative sample size, y'_{i,k} is the k-th negative sample in the negative sample dataset Neg for the i-th character, P(y_i | X) is the probability that the i-th character in the input sequence X is predicted as the positive sample y_i, and P(y'_{i,k} | X) is the probability that the i-th character in the input sequence X is predicted as the k-th negative sample y'_{i,k}. The CPO objective aims to teach the model to increase the predictive probability of the positive sample, decrease the predictive probability of the negative samples, and maximize the difference between the original probabilities of the two.
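The following sketch implements the CPO objective as it is described in the text (raise the positive-sample probability, lower the negative-sample probabilities, widen the gap between them). The analytic form is reconstructed from that description and should be read as an assumption rather than the verbatim formula of the disclosure.

```python
import torch

def cpo_loss(pos_probs, neg_probs):
    """Comparative Probability Optimization (CPO) objective.

    pos_probs: tensor of shape (N,)    - P(y_i | X) for the positive samples
    neg_probs: tensor of shape (N, K)  - P(y'_{i,k} | X) for the K negative samples
    Minimizing this loss increases the positive probabilities, decreases the
    negative ones and maximizes the margin between the two.
    """
    margin = pos_probs.unsqueeze(-1) - neg_probs     # (N, K) pairwise differences
    return -margin.mean()                            # average over N samples and K negatives

# Usage with toy probabilities: N = 2 characters, K = 5 negatives each.
pos = torch.tensor([0.30, 0.25])
neg = torch.rand(2, 5) * 0.2
print(cpo_loss(pos, neg))
```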
In order to maintain the generalization performance of the model, optimizing the initial error correction model according to the comparison result of the prediction probability of the negative sample data set and the prediction probability of the positive sample, to obtain a target error correction model, and further comprising:
determining a target loss function according to the optimized loss function and an initial loss function corresponding to the initial error correction model;
and training the initial error correction model according to the target loss function to obtain a target error correction model.
It should be understood that on the basis of the original target L_orig and the above-mentioned optimized CPO objective L_CPO, the target loss function can be obtained:

L = λ1 · L_orig + λ2 · L_CPO,

wherein λ1 and λ2 are the weighting factors of the two targets, which work best when they are both 1; in the experiments, the cross-entropy loss function is used as the original loss function L_orig of the model.
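A short sketch of this combined objective is given below, with λ1 = λ2 = 1 as stated above and cross-entropy standing in for the original loss; the function signature and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, golden_ids, pos_probs, neg_probs, lambda1=1.0, lambda2=1.0):
    """L = lambda1 * L_orig + lambda2 * L_CPO, with cross-entropy as the original loss."""
    l_orig = F.cross_entropy(logits, golden_ids)                 # original model objective
    l_cpo = -(pos_probs.unsqueeze(-1) - neg_probs).mean()        # CPO term from the sketch above
    return lambda1 * l_orig + lambda2 * l_cpo

# Usage with toy shapes: a batch of 4 characters, vocabulary of 10, K = 5 negatives.
logits = torch.randn(4, 10)
golden = torch.randint(0, 10, (4,))
pos_p = torch.rand(4)
neg_p = torch.rand(4, 5)
print(total_loss(logits, golden, pos_p, neg_p))
```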
And finally obtaining the target error correction model through the optimization.
In summary, according to the text error correction model training method provided by the embodiment of the invention, the prediction probability of each character is obtained after the initial error correction model is trained, and then the positive and negative samples are constructed to optimize the initial error correction model, so that the target error correction model is obtained, thereby improving the accuracy of text error correction.
The implementation process of the text error correction model training method is described below with reference to a specific example.
First, a training dataset is prepared; the dataset in text form is converted into vector form and then input into the model for training. The training data consists of manually annotated samples from three public datasets, SIGHAN13, SIGHAN14 and SIGHAN15, together with 271K samples automatically generated based on ASR and OCR, for a total of 281K samples. Each data item consists of one original sentence and one target sentence, as shown in Table 1 below:
TABLE 1 Chinese spelling error correction training data sample
Test dataset: evaluation was performed using the SIGHAN13, SIGHAN14 and SIGHAN15 test sets, and the statistics of the data sets used are shown in table 2. The original data set of the SIGHAN adopts traditional Chinese, and for convenience in testing, the original data set is completely converted into simplified Chinese by using OpenCC 4.
Table 2 statistics of dataset
Training is carried out on the basis of the data set to obtain a target error correction model.
In order to further verify the effectiveness of the proposed method, comparative analysis and evaluation were performed on the SIGHAN dataset against several currently mainstream methods, all using the three evaluation values P, R and F. Table 3 illustrates the performance of the proposed method and the comparison models, giving the results of each model.
Wherein the evaluation index P represents precision (Precision), R represents recall (Recall), and F represents the F1 score (F1-Score):

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2 × P × R / (P + R).
In the above expressions, TP, FP and FN represent the number of texts predicted as correct and actually correct, the number predicted as correct but actually incorrect, and the number predicted as incorrect but actually correct, respectively.
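For clarity, the three evaluation values can be computed from these counts as shown below; this is the standard definition and not specific to the disclosure.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision / recall / F1 computed from TP, FP and FN counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example with made-up counts.
print(precision_recall_f1(tp=90, fp=10, fn=20))   # (0.9, 0.818..., 0.857...)
```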
Table 3 Comparison of performance of different methods on the character-level error correction task on the SIGHAN15 dataset
Comparing the method of the present embodiment (GBERT (CPO)) with SpellGCN and Realist, the experimental results show that the method of the present embodiment performs better than Realist and SpellGCN in both character-level and sentence-level error correction. Both methods are exceeded on all three evaluation criteria, and the improvement appears not only in the detection network but also in the error correction network.
In summary, after the multi-modal features are added, the model can cope with multiple error types; on this basis, comparative probability optimization is added to the model, which effectively reduces the gap between the pre-trained model and the actual Chinese spelling error correction task and improves the effect of the model. Experiments prove that the method performs excellently on the sentence-level Chinese spelling error correction task.
As another embodiment of the present invention, there is provided a text error correction method, including, as shown in fig. 6:
s610, acquiring an input sequence corresponding to a text to be corrected;
s620, inputting an input sequence corresponding to the text to be corrected into a text correction model to obtain a text correction prediction result, wherein the text correction model is obtained according to the text correction model training method;
s630, decoding the text error correction prediction result to obtain a target text corresponding to the text to be corrected.
According to the text error correction method provided by the embodiment of the invention, the accurate text error correction model can be obtained through the text error correction model training method, so that the text error correction accuracy can be improved.
The specific implementation process of the text error correction method provided in the embodiment of the present invention may refer to the description of the text error correction training method, which is not repeated here.
As another embodiment of the present invention, there is provided an electronic apparatus including: a processor and a memory, the memory being used for storing computer instructions, and the processor being used for loading and executing the computer instructions to realize the text error correction model training method or the text error correction method.
As shown in fig. 7, the electronic device may include: at least one processor 71, such as a CPU (Central Processing Unit ), at least one communication interface 73, a memory 74, at least one communication bus 72. Wherein the communication bus 72 is used to enable connected communication between these components. The communication interface 73 may include a Display screen (Display) and a Keyboard (Keyboard), and the optional communication interface 73 may further include a standard wired interface and a wireless interface. The memory 74 may be a high-speed RAM memory (Random Access Memory, volatile random access memory) or a non-volatile memory (non-volatile memory), such as at least one disk memory. The memory 74 may alternatively be at least one memory device located remotely from the processor 71. Wherein the memory 74 stores an application program and the processor 71 invokes the program code stored in the memory 74 for performing any of the method steps described above.
The communication bus 72 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus, an extended industry standard architecture (extended industry standard architecture, EISA) bus, or the like. The communication bus 72 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 7, but not only one bus or one type of bus.
Wherein the memory 74 may include volatile memory (English) such as random-access memory (RAM); the memory may also include a nonvolatile memory (english: non-volatile memory), such as a flash memory (english: flash memory), a hard disk (english: hard disk drive, abbreviated as HDD) or a solid state disk (english: solid-state drive, abbreviated as SSD); memory 74 may also include a combination of the above types of memory.
The processor 71 may be a central processor (English: central processing unit, abbreviated: CPU), a network processor (English: network processor, abbreviated: NP) or a combination of CPU and NP.
The processor 71 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (English: application-specific integrated circuit, abbreviated: ASIC), a programmable logic device (English: programmable logic device, abbreviated: PLD), or a combination thereof. The PLD may be a complex programmable logic device (English: complex programmable logic device, abbreviated: CPLD), a field-programmable gate array (English: field-programmable gate array, abbreviated: FPGA), generic array logic (English: generic array logic, abbreviated: GAL), or any combination thereof.
Optionally, the memory 74 is also used for storing program instructions. Processor 71 may invoke program instructions to implement the text correction model training method as shown in the fig. 1 embodiment of the present invention or to implement the text correction method as shown in the fig. 6 embodiment of the present invention.
As another embodiment of the present invention, a computer readable storage medium is provided, wherein the computer readable storage medium is configured to store computer instructions that, when loaded and executed by a processor, implement the text error correction model training method described above, or implement the text error correction method described above.
In an embodiment of the present invention, a non-transitory computer readable storage medium is provided, where the computer readable storage medium stores computer executable instructions that can perform the text error correction model training method or the text error correction method in any of the above method embodiments. Wherein the storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Flash Memory (Flash Memory), a Hard Disk (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kind described above.
It is to be understood that the above embodiments are merely illustrative of the application of the principles of the present invention, but not in limitation thereof. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.

Claims (10)

1. A text error correction model training method, comprising:
acquiring a training data set, wherein the training data set comprises an original sentence and a target sentence;
preprocessing the training data set to obtain an input sequence;
training an initial error correction model according to the input sequence to obtain the prediction probability of each character in the input sequence, wherein the prediction probability represents the probability that each character is predicted to be other characters in a candidate set corresponding to the character, each character corresponds to a candidate set, and the candidate set comprises a set of characters with multi-modal associated characteristics with the corresponding character;
constructing a negative sample data set corresponding to each character according to the prediction probability of the character, and determining a positive sample corresponding to the character;
and optimizing the initial error correction model according to the negative sample data set and the positive sample to obtain a target error correction model.
2. The text error correction model training method of claim 1, wherein constructing a negative sample data set corresponding to each character according to the prediction probability of the character and determining a positive sample data set corresponding to the character comprises:
taking a set corresponding to a character, wherein the prediction probability is larger than a preset probability threshold value and the actual character and the correct character are not in accordance, as a negative sample data set;
the correct character is taken as a positive sample.
3. The text error correction model training method of claim 2, wherein the set corresponding to the character whose predicted probability is greater than a preset probability threshold and whose actual character does not conform to the correct character is used as the negative sample data set, comprising:
comparing the prediction probability of each character with a preset probability threshold;
if the prediction probability of the current character is larger than a preset probability threshold, comparing the current character with the correct character in the target sentence;
and if the current character does not accord with the correct character in the target sentence, taking a plurality of candidate characters in the candidate set corresponding to the current character as a negative sample data set.
4. The text error correction model training method of claim 1, wherein optimizing the initial error correction model based on the negative sample data set and the positive sample to obtain a target error correction model comprises:
respectively acquiring the prediction probability of the negative sample data set and the prediction probability of the positive sample;
and optimizing the initial error correction model according to a comparison result of the prediction probability of the negative sample data set and the prediction probability of the positive sample, so as to obtain a target error correction model.
5. The text error correction model training method of claim 4, wherein optimizing the initial error correction model based on a comparison of the predictive probability of the negative sample dataset and the predictive probability of the positive sample, to obtain a target error correction model, comprises:
constructing an optimization loss function according to the prediction probability of the negative sample data set and the prediction probability of the positive sample, wherein the optimization loss function is used for increasing the prediction probability of the positive sample and reducing the prediction probability of the negative sample data set, and is used for maximizing the difference value between the prediction probability of the negative sample data set and the prediction probability of the positive sample;
and training the initial error correction model according to the optimized loss function to obtain a target error correction model.
6. The text error correction model training method of claim 5, wherein optimizing the initial error correction model based on a comparison of the predictive probability of the negative sample dataset and the predictive probability of the positive sample, obtains a target error correction model, further comprising:
determining a target loss function according to the optimized loss function and an initial loss function corresponding to the initial error correction model;
and training the initial error correction model according to the target loss function to obtain a target error correction model.
7. The text error correction model training method of claim 1, wherein training an initial error correction model based on the input sequence to obtain a predictive probability for each character in the input sequence comprises:
inputting the input sequence and the multi-modal associated feature of each character in the input sequence into an error detection network to perform error probability prediction on each character in the input sequence, and obtaining an error probability matrix corresponding to the input sequence, wherein the multi-modal associated feature comprises semantic information, word pronunciation information and font information;
and inputting the error probability matrix into an error correction network for training to obtain the prediction probability of each character in the input sequence.
8. A method for text correction, comprising:
acquiring an input sequence corresponding to a text to be corrected;
inputting an input sequence corresponding to the text to be corrected into a text correction model to obtain a text correction prediction result, wherein the text correction model is obtained according to the text correction model training method of any one of claims 1 to 7;
and decoding the text error correction prediction result to obtain a target text corresponding to the text to be corrected.
9. An electronic device, comprising: a processor and a memory for storing computer instructions, the processor being configured to load and execute the computer instructions to implement the text error correction model training method of any one of claims 1 to 7 or to implement the text error correction method of claim 8.
10. A computer readable storage medium storing computer instructions which, when loaded and executed by a processor, implement the text error correction model training method of any one of claims 1 to 7, or the text error correction method of claim 8.
CN202310863192.8A 2023-07-14 2023-07-14 Text error correction model training method, text error correction method, device and storage medium Active CN116579327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310863192.8A CN116579327B (en) 2023-07-14 2023-07-14 Text error correction model training method, text error correction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310863192.8A CN116579327B (en) 2023-07-14 2023-07-14 Text error correction model training method, text error correction method, device and storage medium

Publications (2)

Publication Number Publication Date
CN116579327A true CN116579327A (en) 2023-08-11
CN116579327B CN116579327B (en) 2023-09-26

Family

ID=87541784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310863192.8A Active CN116579327B (en) 2023-07-14 2023-07-14 Text error correction model training method, text error correction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116579327B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117971357A (en) * 2024-03-29 2024-05-03 苏州元脑智能科技有限公司 Finite state automaton verification method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127953A (en) * 2023-04-18 2023-05-16 之江实验室 Chinese spelling error correction method, device and medium based on contrast learning
CN116136957A (en) * 2023-04-18 2023-05-19 之江实验室 Text error correction method, device and medium based on intention consistency
CN116187304A (en) * 2023-04-26 2023-05-30 中国传媒大学 Automatic text error correction algorithm and system based on improved BERT

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127953A (en) * 2023-04-18 2023-05-16 之江实验室 Chinese spelling error correction method, device and medium based on contrast learning
CN116136957A (en) * 2023-04-18 2023-05-19 之江实验室 Text error correction method, device and medium based on intention consistency
CN116187304A (en) * 2023-04-26 2023-05-30 中国传媒大学 Automatic text error correction algorithm and system based on improved BERT

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李悦; 吴敏; 吴桂兴; 郭燕: "Preposition error correction system based on the maximum entropy model" (基于最大熵模型的介词纠错系统), 计算机系统应用 (Computer Systems & Applications), no. 01

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117971357A (en) * 2024-03-29 2024-05-03 苏州元脑智能科技有限公司 Finite state automaton verification method and device, electronic equipment and storage medium
CN117971357B (en) * 2024-03-29 2024-06-07 苏州元脑智能科技有限公司 Finite state automaton verification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116579327B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
US10417350B1 (en) Artificial intelligence system for automated adaptation of text-based classification models for multiple languages
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
Kim et al. Two-stage multi-intent detection for spoken language understanding
US20200380216A1 (en) Artificial intelligence system using phrase tables to evaluate and improve neural network based machine translation
JP5901001B1 (en) Method and device for acoustic language model training
US20150095017A1 (en) System and method for learning word embeddings using neural language models
WO2012039686A1 (en) Methods and systems for automated text correction
US11941361B2 (en) Automatically identifying multi-word expressions
US11003993B1 (en) Training recurrent neural networks to generate sequences
CN112613324A (en) Semantic emotion recognition method, device, equipment and storage medium
CN114818891B (en) Small sample multi-label text classification model training method and text classification method
US11593557B2 (en) Domain-specific grammar correction system, server and method for academic text
CN116579327B (en) Text error correction model training method, text error correction method, device and storage medium
CN114595327A (en) Data enhancement method and device, electronic equipment and storage medium
CN115309910B (en) Language-text element and element relation joint extraction method and knowledge graph construction method
CN114144774A (en) Question-answering system
CN115563959A (en) Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium
JP7194759B2 (en) Translation data generation system
Rajan et al. Survey of nlp resources in low-resource languages nepali, sindhi and konkani
US8977538B2 (en) Constructing and analyzing a word graph
CN114580446A (en) Neural machine translation method and device based on document context
CN114626378B (en) Named entity recognition method, named entity recognition device, electronic equipment and computer readable storage medium
US20220129784A1 (en) Predicting topic sentiment using a machine learning model trained with observations in which the topics are masked
CN113673247A (en) Entity identification method, device, medium and electronic equipment based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant