CN115809655A - Chinese character symbol correction method and system based on attribution network and BERT - Google Patents

Chinese character symbol correction method and system based on attribution network and BERT

Info

Publication number
CN115809655A
Authority
CN
China
Prior art keywords
character
model
training
error
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111073538.1A
Other languages
Chinese (zh)
Inventor
陆雪松
陈贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202111073538.1A priority Critical patent/CN115809655A/en
Publication of CN115809655A publication Critical patent/CN115809655A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese character correction method based on an attribution network and BERT, which comprises the following steps: collecting a Chinese text data set containing erroneous characters and dividing it into a training set and a test set used for training and testing a model, or training on the complete data set; training a BERT-based binary classification model used as the attribution network; performing feedforward calculation on the sentences in the training set with the trained binary classification model, the output being denoted F(X); computing the gradient of the output F(X) with respect to X to obtain error attribution information; setting a threshold filter to process the obtained error attribution information; merging the filtered error attribution information into a second BERT model for character correction training; and predicting the target characters of the erroneous characters. The invention also discloses a system for realizing the method. The method significantly improves accuracy and recall in the field of Chinese text error correction, and has strong universality and extensibility.

Description

Chinese character symbol correction method and system based on attribution network and BERT
Technical Field
The invention belongs to the technical field of natural language processing and deep learning, and relates to a Chinese character correction method and system based on an attribution network and BERT.
Background
Chinese character spelling correction is valuable research work with wide application in real life. One example is the automatic correction of query keywords on a search platform: a user may input erroneous characters with similar pronunciation or shape when searching, and if the platform can correctly recognize and correct them, the interference caused by spelling errors is eliminated. Another example is spelling correction feedback for Chinese beginners: for learners of Chinese, character correction can help them reduce the probability of writing wrong characters. In addition, when text is generated by techniques such as speech recognition and optical character recognition, many wrongly written characters are produced, and the text requires further correction processing.
The first category of currently common Chinese character correction tools is based on statistical language models, such as pycorrector, which uses word segmentation, common-dictionary matching, and the like to detect errors, and then uses confusion-set substitution and statistical-language-model perplexity calculation to recall the erroneous characters. This approach is simple and easy to implement, but it depends too much on the quality of the constructed dictionary and confusion set, and lacks flexibility.
The second category is based on RNN models. Early work used RNNs to evaluate the perplexity of sentences or phrases to detect errors, but such methods have difficulty correctly fixing the errors in a sentence. To address this, a more flexible Encoder-Decoder model was adopted, an attention mechanism was added, and a pointer network was used to copy similar characters from a confusion set at the Decoder side to achieve more accurate recall. This approach is end-to-end, and it is difficult to fully guarantee that the generated sentences keep complete and fluent semantics during decoding.
The third category is based on pre-trained language models. With the rise of pre-training, models such as BERT can explicitly model the relationship between characters and have become useful tools for character spelling correction. Various attempts have been made, such as filtering BERT candidate sets with phonetic and glyph information, or classifying BERT-encoded characters with graph neural networks that incorporate confusion-set information. These attempts have achieved certain effects, but problems remain, such as insufficient detection capability and the difficulty of constructing a high-quality confusion set.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a Chinese character correction method based on an attribution network and BERT. The method constructs a BERT-based attribution network from a binary classification model with true/false labels, judging the probability that a sentence contains erroneous characters and extracting character-error information from the sentence; the character-error information is then integrated into BERT's self-attention score calculation so that BERT pays more attention to the erroneous characters, improving the whole model's ability to detect and correct them. Accuracy and recall are still greatly improved without using a confusion set, giving the model strong flexibility and transferability.
The Chinese character correction method based on an attribution network and BERT provided by the invention comprises the following steps:
step one, collecting a Chinese text data set containing erroneous characters and dividing it into a training set, used for training the model, and a test set, used for evaluating the effect of the model; alternatively, training with the complete data set without division;
step two, training a BERT-based binary classification model used as the attribution network;
step three, performing feedforward calculation on the sentences in the training set with the binary classification model trained in step two, the output being denoted F(X), where X = x_{1:n} is the input sentence and n is the number of Chinese characters;
step four, computing the gradient of the step-three output F(X) with respect to X to obtain error attribution information;
step five, setting a threshold filter to process the error attribution information obtained in step four;
step six, fusing the error attribution information filtered in step five into a second BERT model, and performing character correction training with the correct/erroneous sentence pairs of the Chinese data set;
step seven, in the inference stage, when a new sentence requiring Chinese character correction is obtained, calculating its error attribution information with the BERT model of step two, and predicting the target characters of the erroneous characters in the sentence with the BERT model of step six combined with the error attribution information.
In step one, the Chinese text data set comes from the Chinese Spelling Check (CSC) data set Sighan13 and from additional data containing erroneous characters automatically generated with the tool OpenCC; the Chinese data set contains both correct and erroneous sentences.
In step two, the binary classification model predicts whether a sentence contains an error (1 for error, 0 for no error), the final output being the probability that the sentence contains erroneous characters. The attribution network is this trained binary classification model: for a given input X = x_{1:n}, the binary model outputs F(X), and the gradient is used to compute which components of X play a key role in the model output F(X), i.e., which characters in X are responsible for the prediction, where n is the number of Chinese characters in X.
The formula for the gradient is

$$\nabla_x F(x) = \left[\frac{\partial F(x)}{\partial x_1}, \frac{\partial F(x)}{\partial x_2}, \ldots, \frac{\partial F(x)}{\partial x_n}\right]$$

where $\nabla_x F(x)$ is a vector of the same dimensionality as x, i denotes the index of an input component, and n is the number of Chinese characters in the input sentence.
In step four, taking the gradient of the output F(X) yields an attribution vector for each character in X; the vectors are reduced by their L2 norms and the results normalized to give a scalar attribution score, namely the probability that each character in X is erroneous.
In step five, the threshold filter resets to 0 any per-character probability obtained in step four that is below 0.5.
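As a minimal illustration of steps three to five, the following PyTorch sketch computes the gradient attribution, reduces it to per-character scores, and applies the threshold filter. The wrapper `binary_model` (mapping input embeddings to the scalar error probability F(X)) and the max-based normalization are assumptions for illustration; the patent does not fix an exact interface or normalization.

```python
import torch

def error_attribution(binary_model, input_embeds, threshold=0.5):
    """Steps three to five: gradient attribution, per-character scoring, filtering.

    binary_model : assumed callable mapping embeddings of shape
                   (batch, seq_len, hidden) to error probabilities F(X)
                   of shape (batch,); a hypothetical interface.
    input_embeds : character embeddings of the input sentences.
    """
    input_embeds = input_embeds.detach().requires_grad_(True)
    f_x = binary_model(input_embeds)          # step three: feedforward, F(X)
    f_x.sum().backward()                      # step four: gradient of F(X) w.r.t. X
    grads = input_embeds.grad                 # one attribution vector per character
    scores = grads.norm(p=2, dim=-1)          # L2 norm -> scalar per character
    # Normalize per sentence; dividing by the maximum reproduces the 0..1 scores
    # of Table 1 below, but the exact normalization here is an assumption.
    scores = scores / scores.max(dim=-1, keepdim=True).values
    # Step five: threshold filter, zero out probabilities below 0.5.
    return torch.where(scores < threshold, torch.zeros_like(scores), scores)
```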
In step six, the blending operation adds the filtered error attribution information to the self-attention score calculation of each layer of the second BERT model, strengthening the information of the erroneous characters themselves so that the model focuses on them and achieves better detection and recall; the self-attention of the BERT is calculated as:

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{Scores}\right)V$$

where Q, K, and V are the Query, Key, and Value matrices converted from the model input X, d_k is the number of columns of the Q and K matrices, QK^T is an n×n matrix whose rows each represent the association of a character x_i (i ≤ n) in the sentence X with all characters, including itself, and Scores is also an n×n matrix whose rows each hold the error probability of every character in X; the n row vectors of Scores are identical.
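A sketch of this fused attention score for a single layer is given below, assuming `scores` is the filtered (batch, n) error-probability vector that is broadcast to the n identical rows of the Scores matrix; all names are illustrative, and whether Scores is added before or after the 1/√d_k scaling is not fixed by the text (the sketch adds it after).

```python
import math
import torch

def fused_self_attention(Q, K, V, scores):
    """Softmax(QK^T / sqrt(d_k) + Scores) V, with identical rows in Scores."""
    d_k = Q.size(-1)
    attn = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, n, n)
    attn = attn + scores.unsqueeze(-2)               # broadcast one row to all n rows
    return torch.softmax(attn, dim=-1) @ V
```

Broadcasting a (batch, 1, n) score vector over the n rows matches the statement that the n row vectors of Scores are identical.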
In step six, model training adopts the following cross-entropy loss function:

$$L = \sum_{i=1}^{n} l_{cel}(\hat{y}_i, y_i)$$

where $\hat{y}_i$ is the character predicted by the model, $y_i$ is the correct character, and $l_{cel}$ is the cross-entropy function.
In step seven, for each input character x_i, the prediction formula of the target character is:

$$\hat{y}_i = \text{Softmax}(w\, e_i)$$

where $\hat{y}_i$ is the predicted probability distribution over all candidate characters, w is a parameter matrix, and e_i is the BERT-encoded representation of x_i; finally, the character with the largest Softmax value is taken as the predicted target character.
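A sketch of this prediction step, assuming w has shape (vocab_size, hidden) and the e_i are row vectors of the BERT output; names and shapes are illustrative.

```python
import torch

def predict_targets(w, encodings):
    """y_i = Softmax(w * e_i); the argmax index is the predicted character.

    w         : (vocab_size, hidden) parameter matrix (assumed shape)
    encodings : (seq_len, hidden) BERT representations e_i
    """
    probs = torch.softmax(encodings @ w.T, dim=-1)  # distribution over candidates
    return probs.argmax(dim=-1)                     # character with the largest Softmax value
```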
The specific content of the method is as follows. First, the data set is collected; then an attribution network is designed. In general terms, attribution means finding, for a given input X and model F(X), which components of the input X play a key role in the model's prediction. The invention adopts a simple and effective attribution technique, namely directly taking the gradient of the model's predicted output, with the formula:
$$\nabla_x F(x) = \left[\frac{\partial F(x)}{\partial x_1}, \frac{\partial F(x)}{\partial x_2}, \ldots, \frac{\partial F(x)}{\partial x_n}\right]$$

From a mathematical point of view, expanding according to Taylor:

$$F(x + \Delta x) \approx F(x) + \sum_{i=1}^{n} \frac{\partial F(x)}{\partial x_i}\, \Delta x_i$$

Since $\nabla_x F(x)$ is a vector of the same dimension as $x$ and $\frac{\partial F(x)}{\partial x_i}$ is its $i$-th component, for a $\Delta x_i$ of the same magnitude, the greater the absolute value of $\frac{\partial F(x)}{\partial x_i}$, the greater the change of $F(x + \Delta x)$ relative to $F(x)$. That is, $\left|\frac{\partial F(x)}{\partial x_i}\right|$ measures the sensitivity of the model to the $i$-th component of the input and is used as the importance measure of the $i$-th component. Based on this, a BERT-based binary classification model is trained/fine-tuned with a small amount of data to predict whether a sentence contains an error; the gradient of the predicted error probability is then taken to obtain the attribution vector of each character, i.e., the degree to which each character contributes to the sentence being erroneous. The L2 norm of each vector is computed and the resulting values are normalized, and the scores so obtained can be regarded as the error probability of each character.
A threshold filter then resets to 0 any character whose normalized error probability in the attribution detection information is below 0.5, reducing the interference of correct characters with the detection of erroneous characters in the subsequent model and improving the model's detection capability.
After the attribution detection information is extracted, it is fused into the self-attention score of each layer of the BERT model, strengthening the information of the erroneous characters themselves so that the model focuses on them, thereby achieving better detection and recall. The concrete formula is:
$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{Scores}\right)V$$

where Q, K, and V are the Query, Key, and Value matrices obtained from the model input X, d_k is the number of columns of the Q and K matrices, and QK^T and Scores are both n×n matrices: each row of Scores holds the error probability of every character in the sentence X, and all rows of Scores are identical, while each row of QK^T represents the association of a character x_i with the other characters, including itself. As can be seen from this formula, for an input sentence X = (x_1, x_2, …, x_n) of length n, QK^T expresses the degree of association, i.e., the information flow, between each character x_i (i ≤ n) and the other characters, including itself; this is how BERT learns context information. Adding the attribution detection information to the self-attention score QK^T strengthens the mapping between erroneously input characters and correctly output characters, which improves the prediction.
After BERT extracts the sentence representation E = (e_1, e_2, …, e_n), a fully connected layer is used to predict the target character corresponding to e_i, with the prediction formula:

$$\hat{y}_i = \text{Softmax}(w\, e_i)$$

where $\hat{y}_i$ is the predicted probability distribution over all candidate characters, w is a parameter matrix, and e_i is the BERT-encoded representation of x_i; the first character of the BERT prediction candidate set, i.e., the character with the largest Softmax value, is taken as the final correction result.
The principle of the invention is as follows: BERT has a certain error detection and correction capability and can generate a candidate set for each character in a sentence after encoding, but because of the mismatch between the MASK characters seen during masked-language-model pre-training and the inputs seen during downstream fine-tuning, BERT's ability to detect and recall erroneous Chinese characters is insufficient. For example, correcting the sentence rendered here as "want you live good" with BERT returns it unchanged, indicating that no error was detected; correcting "I introduce his Taiwan family dish" generates "I introduce his Taiwan new dish", indicating that BERT is constrained by its pre-trained parameters and ignores the information of the "family" character. After accurate detection information is extracted by the attribution network and blended into BERT, the model can learn more mapping information between erroneous characters and the correct characters (labels) and pays more attention to the erroneous characters, improving their detection and recall.
By adopting this technical scheme, the Chinese character correction method based on the attribution network and BERT dispenses with confusion sets and related filtering mechanisms, can efficiently correct erroneous characters in Chinese sentences, and greatly improves detection and correction accuracy and recall.
The invention also provides a system for realizing the method, comprising a training module and an inference module;
the training module fully executes the process of training the BERT-based binary classification model and the Chinese character correction model on the training set;
the inference module performs the operation of predicting the target characters of erroneous characters on the test set or on the user's data set.
The beneficial effects of the invention include the following. In the field of text error correction, the two important indexes are accuracy and recall, and compared with prior related techniques the method significantly improves both.
1) Compared with a plain BERT model, the method has stronger error detection capability. Because BERT is a large-scale pre-trained model based on a masking task in which part of the characters are replaced by MASK tokens carrying only context information, its ability to detect erroneous characters is insufficient. With the attribution network added, the model pays more attention to the erroneous characters, so error correction accuracy and recall improve simultaneously.
2) The invention abandons the confusion set, making the model simpler and more convenient. A confusion set is complex to build, time consuming, and labor intensive, and requires constant maintenance. How to apply a confusion set is itself a complicated problem, and an ill-suited one introduces new interference into the model.
3) The invention has strong universality and extensibility. Having abandoned the confusion set and relying only on BERT, the model can be applied to error correction not only in Chinese but in other languages whose basic unit is the character, such as Japanese and Korean.
4) The method can be used for search keyword correction, optical character recognition, speech recognition, automatic correction of wrongly written characters in Chinese language education, and the like.
Drawings
FIG. 1 is a model structure of the present invention.
FIG. 2 is a prediction flow chart of the present invention.
FIG. 3 is a training flow diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following specific examples and the accompanying drawings. Except for the contents specifically mentioned below, the procedures, conditions, and experimental methods for carrying out the present invention are common general knowledge in the art, and the present invention is not particularly limited thereto.
Examples
First, a data set is collected. The data set adopted by the invention is Sighan13, a standard data set of the CSC (Chinese Spelling Check) task. This is a small Chinese data set derived from Chinese test essays; each piece of data (a sentence) contains several erroneous characters and carries the correct character labels. Considering that the training set is small, and since experiments show that for BERT-based character correction a larger training set gives better results, 270,000 additional samples are added here: sentences containing erroneous characters are automatically generated from correct sentences with the tool OpenCC, in the same format as the Sighan13 data set.
Second, a binary classification model is trained to serve as the attribution network. This classification model is also based on BERT; the invention selects a 12-layer BERT model. The training data include correct and erroneous sentences, labeled correct/erroneous. After 16 epochs of training, accuracy on the test set reaches 98.6%.
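A minimal fine-tuning sketch with the HuggingFace transformers library is given below; the checkpoint name, optimizer, and hyperparameters other than the 16 epochs mentioned above are assumptions, not the patent's prescription.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)  # 12-layer BERT; labels: 0 correct / 1 erroneous
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(sentences, labels):
    """One fine-tuning step of the binary classifier used as the attribution network."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```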
Third, prediction is performed with the model trained in the second step. Training data of dimension (32, 128, 768) are input, the dimensions being the batch size, the sentence length, and the hidden-layer dimension, respectively. The input data first pass through the binary classification model trained in the second step, which predicts whether the input sentence contains an error; this output is denoted F(X).
Fourth, the gradient of F(X) is taken to obtain the attribution information. The attribution information has dimension (32, 128, 768) and implicitly contains the error detection information; normalizing it over the last dimension yields error probability information of dimension (32, 128, 1). Table 1 shows two examples of detecting erroneous characters with the designed attribution network. It can be seen that the attribution network correctly locates the erroneous characters, with good discrimination between correct and erroneous characters.
As shown in Table 1 below, for the input rendered here as "i am a very tall line today", the attribution detection information is "0.17, 0.15, 0.11, 0.4, 0.67, 1.0", representing the probability that each character is erroneous. For the input "classmates were back corrected", the attribution detection information is "0.12, 0.07, 0.04, 1.00, 0.43, 0.20".
TABLE 1
Input sentence                      Attribution detection information
"i am a very tall line today"       0.17, 0.15, 0.11, 0.40, 0.67, 1.00
"classmates were back corrected"    0.12, 0.07, 0.04, 1.00, 0.43, 0.20
Fifth, to improve the discrimination of the error probabilities in the attribution detection information, highlighting the probability of erroneous characters and lowering that of correct characters, the invention applies a simple filter: probability values below 0.5 are set to 0. For example, "0.17, 0.15, 0.11, 0.4, 0.67, 1.0" becomes "0, 0, 0, 0, 0.67, 1.0" after the filter. This reduces the interference of correct characters with the detection of erroneous characters in the subsequent model and improves its detection capability.
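Reproducing the example above, the filter is a single thresholding operation:

```python
import torch

scores = torch.tensor([0.17, 0.15, 0.11, 0.40, 0.67, 1.00])
filtered = torch.where(scores < 0.5, torch.zeros_like(scores), scores)
print(filtered)  # tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.6700, 1.0000])
```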
Sixth, the filtered attribution detection information is integrated into the second BERT model by adding it to the self-attention score of each BERT layer. In the attention formula above, the value of QK^T represents the association of each character x_i in the sentence with the other characters, including itself; a larger value indicates that more of that character's information is carried. In Scores, the probability value of an erroneous character is the highest and that of a correct character the lowest, so QK^T + Scores means the erroneous character carries a much larger amount of information: besides the context information, the other, correct characters also carry more information about the erroneous character. The model can therefore learn more mapping information between erroneous characters and correct labels, improving the detection and recall of erroneous characters.
Seventh, the output after the attribution network and BERT encoding also has dimension (32, 128, 768). A fully connected layer is then used to predict the probability distribution of the target character for each input character, with the formula:

$$\hat{y}_i = \text{Softmax}(w\, e_i)$$
the dimension is 768 (vocabulary size). Calculating the loss of each predicted character in an output sentence during model training, wherein a cross entropy loss function is adopted, and the formula is as follows:
Figure BDA0003261269730000071
where $\hat{y}_i$ is the character predicted by the model, $y_i$ is the correct character, and $l_{cel}$ is the cross-entropy function. In addition, in the inference and test phase after model training is complete, the candidate character with the largest value of Softmax(w · e_i) is taken directly as the final correction. In the first example of Table 1, the Softmax value of the character "xing" is the largest among the prediction candidates for the erroneous character "line", so "line" is corrected to "xing"; in the second example, the correct character likewise obtains the largest Softmax value among the prediction candidates for the erroneous character and replaces it.
The invention is evaluated on the Sighan13 data set against pycorrector, an RNN-based LMC model, and a BERT model, recorded respectively as pycorrector, LMC, and BERT; the model of the invention is recorded as BERT-G. A comparison with the BERT model is also carried out on the Sighan13 + 270,000-sample data set. The experimental results are shown in Table 2: on both the small-scale and the large-scale data set, detection and correction improve markedly after the attribution network is added, indicating that the Chinese character error correction method based on the attribution network and BERT achieves good results.
TABLE 2
[Table 2: detection and correction accuracy and recall of pycorrector, LMC, BERT, and BERT-G on the Sighan13 and Sighan13 + 270,000-sample data sets]
The protection content of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept, which is set forth in the following claims.

Claims (10)

1. A method for Chinese character correction based on an attribution network and BERT, the method comprising the steps of:
step one, collecting a Chinese text data set containing erroneous characters and dividing the data set into a training set, used for training a model, and a test set, used for evaluating the effect of the model; or performing model training with the complete data set;
step two, training a BERT-based binary classification model used as the attribution network;
step three, performing feedforward calculation on the sentences in the training set with the binary classification model trained in step two, the output being denoted F(X), where X is the input sentence;
step four, computing the gradient of the output F(X) of step three with respect to X to obtain error attribution information;
step five, setting a threshold filter to process the error attribution information obtained in step four;
step six, fusing the error attribution information filtered in step five into a second BERT model, and performing character correction training with the correct/erroneous sentence pairs in the Chinese data set;
step seven, in the inference stage, when a new sentence requiring Chinese character correction is obtained, calculating the error attribution information of the sentence with the BERT model of step two, and predicting the target character of each erroneous character in the sentence with the BERT model of step six combined with the error attribution information.
2. The method of claim 1, wherein in step one, the Chinese text data set comes from the Chinese Spelling Check data set Sighan13 and from additional data containing erroneous characters automatically generated with the tool OpenCC; the Chinese text data set contains both correct and erroneous sentences.
3. The method according to claim 1, wherein in step two, the binary classification model predicts whether there is an error in the sentence, the final output being the probability that the sentence contains erroneous characters; the attribution network refers to the trained binary classification model: for a given input X = x_{1:n}, the binary model outputs F(X), and the gradient is used to calculate which components of X play a key role in the model output F(X), where n is the number of Chinese characters in X.
4. The method of claim 1, wherein the gradient is formulated as

$$\nabla_x F(x) = \left[\frac{\partial F(x)}{\partial x_1}, \frac{\partial F(x)}{\partial x_2}, \ldots, \frac{\partial F(x)}{\partial x_n}\right]$$

where $\nabla_x F(x)$ is a vector of the same dimensionality as x, i denotes the index of an input component, and n is the number of Chinese characters in the input sentence.
5. The method of claim 1, wherein in step four, taking the gradient of the output F(X) yields an attribution vector for each character in X, and the attribution vectors are reduced by their L2 norms and normalized to obtain a scalar attribution score, namely the probability that each character in X is erroneous.
6. The method according to claim 1, wherein in step five, the threshold filter resets to 0 any per-character probability obtained in step four that is below 0.5.
7. The method according to claim 1, wherein in step six, the merging operation adds the filtered error attribution information to the self-attention score calculation of each layer of the second BERT model, strengthening the information of the erroneous characters themselves so that the model focuses on them and achieves better detection and recall; the self-attention of the BERT is calculated as:

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{Scores}\right)V$$

where Q, K, and V are the Query, Key, and Value matrices converted from the model input X, d_k is the number of columns of the Q and K matrices, and Scores is an n×n matrix in which each row holds the error probability of every character in the sentence X; for an input sentence X = (x_1, x_2, …, x_n) of length n, QK^T is an n×n matrix whose rows represent the association of each character x_i (i ≤ n) with the other characters, including itself; the n row vectors of Scores are identical.
8. The method of claim 1, wherein in step six, the loss of each predicted character in the output sentence is computed with the following cross-entropy loss function:

$$L = \sum_{i=1}^{n} l_{cel}(\hat{y}_i, y_i)$$

where $\hat{y}_i$ is the character predicted by the model, $y_i$ is the correct character, and $l_{cel}$ is the cross-entropy function.
9. The method of claim 1, wherein in step seven, for each input character x_i, the prediction formula of the target character is:

$$\hat{y}_i = \text{Softmax}(w\, e_i)$$

where $\hat{y}_i$ is the predicted probability distribution over all candidate characters, w is a parameter matrix, and e_i is the BERT-encoded representation of x_i; the character with the largest Softmax value is taken as the predicted target character.
10. A system for implementing the method according to any one of claims 1 to 9, wherein the system comprises a training module and an inference module; the training module fully executes the process of training the BERT-based binary classification model and the Chinese character correction model on the training set;
the inference module performs the operation of predicting the target characters of erroneous characters on the test set or on the user's data set.
CN202111073538.1A 2021-09-14 2021-09-14 Chinese character symbol correction method and system based on attribution network and BERT Pending CN115809655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111073538.1A CN115809655A (en) 2021-09-14 2021-09-14 Chinese character symbol correction method and system based on attribution network and BERT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111073538.1A CN115809655A (en) 2021-09-14 2021-09-14 Chinese character symbol correction method and system based on attribution network and BERT

Publications (1)

Publication Number Publication Date
CN115809655A true CN115809655A (en) 2023-03-17

Family

ID=85481385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111073538.1A Pending CN115809655A (en) 2021-09-14 2021-09-14 Chinese character symbol correction method and system based on attribution network and BERT

Country Status (1)

Country Link
CN (1) CN115809655A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151084A (en) * 2023-10-31 2023-12-01 山东齐鲁壹点传媒有限公司 Chinese spelling and grammar error correction method, storage medium and equipment
CN117151084B (en) * 2023-10-31 2024-02-23 山东齐鲁壹点传媒有限公司 Chinese spelling and grammar error correction method, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN108829801B (en) Event trigger word extraction method based on document level attention mechanism
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN110489760A (en) Based on deep neural network text auto-collation and device
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110276069B (en) Method, system and storage medium for automatically detecting Chinese braille error
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN112733533A (en) Multi-mode named entity recognition method based on BERT model and text-image relation propagation
CN114386371B (en) Method, system, equipment and storage medium for correcting Chinese spelling error
CN113836919A (en) Building industry text error correction method based on transfer learning
CN112214989A (en) Chinese sentence simplification method based on BERT
Olaleye et al. Attention-based keyword localisation in speech using visual grounding
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN115809655A (en) Chinese character symbol correction method and system based on attribution network and BERT
CN113420117A (en) Emergency classification method based on multivariate feature fusion
CN112949288B (en) Text error detection method based on character sequence
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN116681061A (en) English grammar correction technology based on multitask learning and attention mechanism
CN112528003B (en) Multi-item selection question-answering method based on semantic sorting and knowledge correction
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination