CN116681061A - English grammar correction technology based on multitask learning and attention mechanism - Google Patents

Info

Publication number
CN116681061A
CN116681061A
Authority
CN
China
Prior art keywords
model
english
training
grammar
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310630375.5A
Other languages
Chinese (zh)
Inventor
赵铁军 (Zhao Tiejun)
朱聪慧 (Zhu Conghui)
曹海龙 (Cao Hailong)
刘梓航 (Liu Zihang)
徐冰 (Xu Bing)
杨沐昀 (Yang Muyun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310630375.5A
Publication of CN116681061A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

An English grammar correction technique based on multitask learning and an attention mechanism, relating to English grammar correction technology. The invention aims to solve the problems that existing English grammar correction technology adapts poorly and corrects the grammar of some complex sentences inaccurately. The method comprises the following steps: for an input sentence, reading an English subword vocabulary and an edit-tag vocabulary from a database; feeding the sentence into a pre-trained encoding model to obtain a contextual representation of the whole sentence; passing the resulting context feature vectors through a self-attention layer; judging whether each input subword needs an editing operation, and classifying the edit labels of the input subwords with a vocabulary-sized classifier; post-processing the words of the input sentence according to the meaning of the correction labels predicted by the model, and feeding the post-processed result back into the model for multiple iterations to obtain the final result. The invention belongs to the technical field of natural language processing.

Description

English grammar correction technology based on multitask learning and attention mechanism
Technical Field
The invention relates to an English grammar correction technology, and belongs to the technical field of natural language processing.
Background
With the spread of technologies such as the internet and mobile devices, English is used ever more widely, and English grammar correction technology is receiving growing attention. Correct English grammar is critical for effective communication. In written or spoken communication, grammatical errors can obscure the intended meaning and lead to confusion or misinterpretation. By correcting grammatical errors, we can ensure that the information we convey is clear and accurate. Good grammar also improves the readability of written text: when readers encounter grammatical errors, they may need more effort to understand the text, which is tiring and distracting. Correcting grammatical errors makes text easier to read and more engaging. Grammar correction is also very important for language learning. By identifying and correcting errors, learners can better understand grammar rules and improve their own writing and speaking abilities. In addition, automatic grammar correction tools can give learners immediate feedback, allowing them to correct their own mistakes faster and more effectively. However, current English grammar correction technology still has room for improvement, and the English grammar correction technique based on multitask learning and an attention mechanism proposed by the invention focuses on solving the following problems:
1. The grammatical errors defined in the field of English grammar correction are complex. To match this training difficulty, the invention proposes a training scheme jointly constrained by multitask learning: grammar checking is the precursor task of grammar correction, so the first training task is to detect whether the words in a sentence contain errors; the second training task is to predict the edit labels corresponding to the words in the sentence; the third training task is a margin loss based on contrastive learning, which encourages the model to raise its classification confidence and give correct classifications more decisively;
2. English grammar correction depends heavily on modeling the syntactic and semantic information of English sentences. To further encode the syntactic and semantic information of an input sentence, the invention proposes attention modeling over the encoder's hidden-layer outputs: first, the outputs of the last three layers of the pre-trained encoder are averaged to obtain a more complete semantic representation; then, through an attention layer, each subword of the whole sentence attends to the parts carrying syntactic relations, yielding a better contextual representation that fuses syntactic and semantic information.
Disclosure of Invention
The invention aims to solve the problems that existing English grammar correction technology adapts poorly and corrects the grammar of some complex sentences inaccurately, and accordingly proposes an English grammar correction technique based on multitask learning and an attention mechanism.
The technical solution adopted by the invention to solve these problems comprises the following steps (a code sketch of one correction pass follows the list):
step 1, for an input sentence, reading an English subword vocabulary and an edit-tag vocabulary from a database;
step 2, feeding the sentence into a pre-trained encoding model to obtain a contextual representation of the whole sentence;
step 3, passing the resulting context feature vectors through a self-attention layer, so that the semantic vectors of all words in the sentence interact further via the self-attention mechanism;
step 4, using a binary classifier to judge whether each input subword needs an editing operation, classifying each input subword with a vocabulary-sized classifier, and selecting the highest-scoring class as the edit label of the corresponding subword;
step 5, post-processing the words of the input sentence according to the meaning of the correction labels predicted by the model, and feeding the post-processed result back into the model for multiple iterations to obtain the final result.
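To make the flow concrete, the following is a minimal Python sketch of one correction pass. Every name in it (`tokenizer`, `model`, `edit_vocab`, `apply_edits`) is an illustrative placeholder for the components described below, not a name taken from the patent:

```python
# A sketch of a single correction pass over steps 1-5.
# All component names are illustrative placeholders.
def correct_once(sentence, model, tokenizer, edit_vocab):
    subwords = tokenizer.tokenize(sentence)        # step 1: subword segmentation
    context = model.encode(subwords)               # step 2: pre-trained encoder
    context = model.attend(context)                # step 3: self-attention layer
    needs_edit = model.detect(context)             # step 4: binary edit/no-edit decision
    tags = model.label(context).argmax(-1)         # step 4: highest-scoring edit label
    return apply_edits(subwords, tags, needs_edit, edit_vocab)  # step 5: post-process
```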
Furthermore, the pre-trained English encoding model adopted in step 2 is one of RoBERTa, XLNet and DeBERTa. All three pre-trained models are improved versions of BERT: they use more pre-training corpora and more reasonable pre-training tasks and modeling mechanisms, and they perform well across many English semantic modeling tasks. The specific process comprises the following steps:
step 2.1, loading an English subword tokenizer and segmenting each word of the input English sentence into subword form;
step 2.2, mapping the subword sequence of the English sentence into 768-dimensional vectors through the word-embedding layer of the pre-trained English encoding model;
step 2.3, passing the embedded vectors through the 12-layer pre-trained English encoding model, and stacking and averaging the hidden vectors output by the last three layers of the model, thereby obtaining hidden vectors containing richer semantic information (a code sketch of steps 2.1 to 2.3 follows).
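A minimal sketch of steps 2.1 to 2.3 using Hugging Face Transformers. The `roberta-base` checkpoint is an assumption chosen to match the text's description of a 12-layer, 768-dimensional BERT-style encoder; the averaging of the last three hidden layers follows the text:

```python
# Sketch of subword segmentation, embedding, encoding, and last-three-layer averaging.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)

sentence = "Go to home now"
inputs = tokenizer(sentence, return_tensors="pt")   # step 2.1: subword segmentation
with torch.no_grad():
    outputs = encoder(**inputs)                     # steps 2.2-2.3: embed and encode

# hidden_states holds the embedding layer plus all 12 encoder layers;
# stack the last three layers and average them to get the context representation
last_three = torch.stack(outputs.hidden_states[-3:], dim=0)  # (3, B, T, 768)
context = last_three.mean(dim=0)                             # (B, T, 768)
```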
Further, the self-attention layer in step 3 performs a self-attention operation on the semantically informed encoded representation output by step 2, so that the semantic representations across the whole sentence interact further:

Attn(x) = (W_2 · tanh(W_1 · x + b_1) + b_2) · x (1),

in formula (1), x is the semantic representation of the sentence obtained in step 2; W_1, W_2, b_1 and b_2 are trainable parameters; h is the size of the last dimension of x; tanh is the hyperbolic tangent function, serving as the activation function that gives the attention layer nonlinear capacity. The self-attention layer lets the representation vectors within the sentence interact further and assigns higher attention scores to the components carrying syntactic relations, so that the model can further model the syntactic relations of the sentence. A minimal implementation sketch follows.
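A sketch of the attention layer in equation (1). The formula leaves the reading of the final "· x" implicit; here it is interpreted as token-to-token attention weights followed by a softmax normalization, which is an assumption rather than a detail stated in the text:

```python
import torch
import torch.nn as nn

class SentenceAttention(nn.Module):
    """Attn(x) = (W2 * tanh(W1 * x + b1) + b2) . x, per equation (1)."""
    def __init__(self, h: int = 768):
        super().__init__()
        self.w1 = nn.Linear(h, h)   # W1 and b1
        self.w2 = nn.Linear(h, h)   # W2 and b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.w2(torch.tanh(self.w1(x)))                      # (B, T, h)
        # token-to-token interaction; the softmax is an assumption,
        # equation (1) leaves the normalization implicit
        weights = torch.softmax(scores @ x.transpose(1, 2), dim=-1)   # (B, T, T)
        return weights @ x                                            # (B, T, h)
```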
Further, in step 4 the optimization direction of the model is additionally constrained during training by multitask learning; three loss constraints help the model achieve better results:

Loss = Loss_contrast + Loss_detect + Loss_label

Equation (2) is the first loss constraint (the detection loss Loss_detect), where P_d(f_i|X) denotes the probability, predicted by the model from the input sequence X, that the word at position i is erroneous.

Equation (3) is the second loss constraint (the edit-label loss Loss_label), where P_l(y_i|X) denotes the probability of the edit label predicted for the word at position i from the input sequence X.

Equation (4) is the third loss constraint (the margin loss Loss_contrast), where y is an all-ones vector that determines the optimization direction; P(y_i|x_i) denotes the probability the model assigns, once the subword x_i is input, to the true edit label y_i; P_top5 denotes the five highest label probabilities output by the model; mask_l is a Boolean vector that handles the case where the model's top-5 outputs contain the true label: if the probability of the true label y_i ranks within the top 5, the mask at that position is 0 and the position does not take part in the loss, while the remaining four positions are 1; mean is an averaging operation along the last dimension of the vector; margin is the minimum acceptable gap between the two probabilities, with a reference value between 0.1 and 0.3. The goal of the margin loss is that, for each input, the probability of the true label output by the model exceeds the average of the model's top-5 output probabilities; a suitable margin value tunes the model's confidence in correct classifications. Finally, through the multitask learning of these three constraints, the model achieves better results with higher classification confidence. An implementation sketch of the three losses follows.
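A PyTorch sketch of the three constraints. Because equations (2) to (4) appear only as images in the original publication, the cross-entropy forms of the detection and label losses are assumptions; the margin loss follows the prose description (true-label probability versus the mean of the top-5 output probabilities, with the true label masked out when it ranks in the top 5). All names are illustrative:

```python
import torch
import torch.nn.functional as F

def multitask_loss(detect_logits, label_logits, detect_gold, label_gold,
                   margin: float = 0.2):
    """detect_logits: (B, T, 2)  binary error-detection scores
    label_logits:  (B, T, V)  edit-label scores over the tag vocabulary
    detect_gold:   (B, T)     long tensor, 1 where the word is erroneous
    label_gold:    (B, T)     long tensor of gold edit-label ids
    """
    # Loss_detect: cross-entropy on the error/no-error decision (eq. 2, assumed form)
    loss_detect = F.cross_entropy(detect_logits.reshape(-1, 2),
                                  detect_gold.reshape(-1))

    # Loss_label: cross-entropy on the edit label (eq. 3, assumed form)
    loss_label = F.cross_entropy(label_logits.reshape(-1, label_logits.size(-1)),
                                 label_gold.reshape(-1))

    # Loss_contrast: the gold-label probability should exceed the mean of the
    # top-5 probabilities; positions where the gold label ranks in the top 5
    # are masked out, so only the remaining slots are averaged
    probs = label_logits.softmax(dim=-1)                       # (B, T, V)
    gold_prob = probs.gather(-1, label_gold.unsqueeze(-1))     # (B, T, 1)
    top5_prob, top5_idx = probs.topk(5, dim=-1)                # (B, T, 5)
    mask = (top5_idx != label_gold.unsqueeze(-1)).float()      # 0 where gold is in top 5
    top5_mean = (top5_prob * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    y = torch.ones_like(top5_mean)                             # all-ones direction vector
    loss_contrast = F.margin_ranking_loss(gold_prob.squeeze(-1), top5_mean, y,
                                          margin=margin)

    return loss_contrast + loss_detect + loss_label
```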
Further, multi-stage training is adopted in the training process of step 4 to raise the training difficulty gradually.

In the field of grammatical error correction, only a small portion of sentences contain grammatical errors; that is, most components must remain unchanged, which requires the model to see error-free sentences during training. The invention adopts multi-stage training so that the training difficulty rises gradually while correction accuracy is preserved. Training of the model is divided into three stages (an illustrative stage schedule sketch follows this list):

Stage one: training on artificially generated pseudo data, built by applying a verb-tense table and randomly deleting and inserting English words, 9 million sentences in total. All of this data contains errors; using a sufficient amount of it as pre-training gives the model an initial ability to correct grammatically erroneous sentences.

Stage two: training on erroneous data from university papers and the writing of non-native English speakers. Compared with the pseudo data, this data is closer to real usage and of relatively higher quality; this stage helps the model fit real-world English sentences.

Stage three: training on erroneous data from English proficiency certificate examinations and from papers by native English speakers. At this stage the training data also includes sentences without grammatical errors, which further raises the difficulty: the model must distinguish whether a real sentence contains grammatical errors at all, preventing over-correction at test time from hurting the correction results. Native-speaker data is harder still, so it further raises the training difficulty and improves the English grammar correction ability.
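Purely for illustration, the curriculum can be expressed as a sequential schedule; the file names and the `train_one_stage` callback are placeholders, not artifacts of the patent:

```python
def run_curriculum(model, train_one_stage):
    """Sequentially fine-tune through the three stages described above."""
    stages = [
        ("stage1_pseudo", "pseudo_errors_9M.jsonl"),     # synthetic errors only
        ("stage2_learner", "non_native_writing.jsonl"),  # real non-native learner errors
        ("stage3_native", "native_exams_papers.jsonl"),  # native data, incl. error-free sentences
    ]
    for name, data_path in stages:
        train_one_stage(model, data_path)  # difficulty rises stage by stage
```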
Further, in step 5 an iterative correction scheme is used to find as many of the grammatical errors occurring in the sentence as possible.
The beneficial effects of the invention are as follows: the English grammar correction technique based on multitask learning and an attention mechanism uses a pre-trained model with stronger modeling of English semantic information, introduces an attention mechanism over that semantic information to obtain a contextual representation that better fuses syntactic and semantic information, and introduces three tasks that constrain model training from different angles, yielding an English grammatical error correction system with higher accuracy and confidence. Because English grammatical errors are of many types and modifications can span long ranges, the method lets users quickly find the types of grammatical errors appearing in a text and gives revision suggestions toward a correct sentence, improving correction efficiency and interpretability. The invention can also be applied in search technology, correcting the query entered by the user to help the search system better identify user intent and the query target, improving search quality and user experience; in English learning for non-native speakers, offering targeted improvement for various grammatical errors to help users raise their English grammar level quickly; or in proofreading systems for English articles, giving appropriate sentence revision suggestions to improve the accuracy and readability of articles.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a model structure of an English grammar correction model based on a multitask learning and attention mechanism in the present invention;
FIG. 3 is a diagram illustrating the text correction effect in an embodiment of the present invention.
Detailed Description
Specific Embodiment 1: this embodiment is described with reference to FIGS. 1 to 3. The English grammar correction technique based on multitask learning and an attention mechanism of this embodiment is implemented through the following steps:
step 1, for an input sentence, reading an English subword vocabulary and an edit-tag vocabulary from a database;
step 2, feeding the sentence into a pre-trained encoding model to obtain a contextual representation of the whole sentence;
step 3, passing the resulting context feature vectors through a self-attention layer, so that the semantic vectors of all words in the sentence interact further via the self-attention mechanism;
step 4, using a binary classifier to judge whether each input subword needs an editing operation, classifying each input subword with a vocabulary-sized classifier, and selecting the highest-scoring class as the edit label of the corresponding subword;
step 5, post-processing the words of the input sentence according to the meaning of the correction labels predicted by the model, and feeding the post-processed result back into the model for multiple iterations to obtain the final result.
In this embodiment, step 1 is a preparatory step. In the field of English grammar correction, error types include subject-verb disagreement, verb tense errors, missing articles, preposition misuse, and so on. The common approach in this field is to give a correction suggestion for each word in the English sentence and then post-process according to those suggestions to form the final correction result. Unlike word segmentation in Chinese correction technology, English subword tokenization splits some words into several partial subwords; for example, the English word "transformers" is segmented into the two parts "transform" and "ers". Therefore the English subword vocabulary is loaded first to guide the segmentation of the input sentence. After the sentence is segmented, the English grammar correction technique predicts the corresponding editing operation for each subword; for example, the edit tag corresponding to the subword "to" in the sentence "Go to home now" is "$DELETE", meaning the current subword is deleted. Thus, by predicting over all subwords of the whole sentence, an edit-tag sequence is obtained, and after the edit tags are processed, the grammar-corrected sentence is obtained. A small illustration follows.
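The segmentation and tag-application steps can be illustrated in a few lines. The RoBERTa tokenizer and the uppercase tag names are assumptions for illustration (the text only names the "$delete"-style tag), and the exact subword split depends on the learned vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("transformers"))  # e.g. ['transform', 'ers'], vocabulary-dependent

# Applying a predicted edit-tag sequence to the example sentence
tokens = ["Go", "to", "home", "now"]
tags = ["$KEEP", "$DELETE", "$KEEP", "$KEEP"]    # one edit label per token
corrected = [t for t, tag in zip(tokens, tags) if tag != "$DELETE"]
print(" ".join(corrected))                       # "Go home now"
```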
Step 2 converts the input English sentence into vectors of numbers: through the encoded representation of BERT, a feature vector representing the contextual and semantic information of the input sentence is obtained.
Step 3 passes the context feature vectors output by the pre-trained encoding model in step 2 through a self-attention layer, letting the representation vectors of different words interact through a word-level attention mechanism. Words linked by syntactic dependencies receive higher attention than other words in the sentence, letting the model understand which words depend on each other syntactically and further fusing the syntactic and semantic information in the input sentence.
Step 4 applies a binary classifier to the representation obtained in step 3 to judge whether each subword needs an editing operation, and a vocabulary-sized classifier to predict its edit label, taking the highest-scoring class as the edit label of the corresponding subword.
Step 5 post-processes the edit-tag sequence output by step 4 together with the input token sequence, applying the meaning of each edit tag to each word of the sentence to form the grammar correction result. Since sentences with English grammatical errors usually contain several overlapping errors, iterative error correction must be applied so that as many errors as possible are corrected, forming the final correction result.
Specific Embodiment 2: this embodiment is described with reference to FIGS. 1 to 3. In step 2 of the English grammar correction technique based on multitask learning and an attention mechanism of this embodiment, the pre-trained English encoding model adopted is one of RoBERTa, XLNet and DeBERTa. All three pre-trained models are improved versions of BERT, using more pre-training corpora and more reasonable pre-training tasks and modeling mechanisms, and performing well across many English semantic modeling tasks. The specific process comprises the following steps:
step 2.1, loading an English subword tokenizer and segmenting each word of the input English sentence into subword form;
step 2.2, mapping the subword sequence of the English sentence into 768-dimensional vectors through the word-embedding layer of the pre-trained English encoding model;
step 2.3, passing the embedded vectors through the 12-layer pre-trained English encoding model, and stacking and averaging the hidden vectors output by the last three layers of the model, thereby obtaining hidden vectors containing richer semantic information.
Specific Embodiment 3: this embodiment is described with reference to FIGS. 1 to 3. The self-attention layer in step 3 of the English grammar correction technique based on multitask learning and an attention mechanism of this embodiment performs a self-attention operation on the semantically informed encoded representation output by step 2, so that the semantic representations across the whole sentence interact further:

Attn(x) = (W_2 · tanh(W_1 · x + b_1) + b_2) · x (1),

in formula (1), x is the semantic representation of the sentence obtained in step 2; W_1, W_2, b_1 and b_2 are trainable parameters; h is the size of the last dimension of x; tanh is the hyperbolic tangent function, serving as the activation function that gives the attention layer nonlinear capacity. The self-attention layer lets the representation vectors within the sentence interact further and assigns higher attention scores to the components carrying syntactic relations, so that the model can further model the syntactic relations of the sentence.
Specific Embodiment 4: this embodiment is described with reference to FIGS. 1 to 3. In step 4 of the English grammar correction technique based on multitask learning and an attention mechanism of this embodiment, the optimization direction of the model is additionally constrained during training by multitask learning; three loss constraints help the model achieve better results:

Loss = Loss_contrast + Loss_detect + Loss_label

Equation (2) is the first loss constraint (the detection loss Loss_detect), where P_d(f_i|X) denotes the probability, predicted by the model from the input sequence X, that the word at position i is erroneous.

Equation (3) is the second loss constraint (the edit-label loss Loss_label), where P_l(y_i|X) denotes the probability of the edit label predicted for the word at position i from the input sequence X.

Equation (4) is the third loss constraint (the margin loss Loss_contrast), where y is an all-ones vector that determines the optimization direction; P(y_i|x_i) denotes the probability the model assigns, once the subword x_i is input, to the true edit label y_i; P_top5 denotes the five highest label probabilities output by the model; mask_l is a Boolean vector that handles the case where the model's top-5 outputs contain the true label: if the probability of the true label y_i ranks within the top 5, the mask at that position is 0 and the position does not take part in the loss, while the remaining four positions are 1; mean is an averaging operation along the last dimension of the vector; margin is the minimum acceptable gap between the two probabilities, with a reference value between 0.1 and 0.3. The goal of the margin loss is that, for each input, the probability of the true label output by the model exceeds the average of the model's top-5 output probabilities; a suitable margin value tunes the model's confidence in correct classifications. Finally, through the multitask learning of these three constraints, the model achieves better results with higher classification confidence.
Specific Embodiment 5: this embodiment is described with reference to FIGS. 1 to 3. In the training process of step 4 of the English grammar correction technique based on multitask learning and an attention mechanism of this embodiment, multi-stage training is adopted to raise the training difficulty gradually.

In the field of grammatical error correction, only a small portion of sentences contain grammatical errors; that is, most components must remain unchanged, which requires the model to see error-free sentences during training. The invention adopts multi-stage training so that the training difficulty rises gradually while correction accuracy is preserved. Training of the model is divided into three stages:

Stage one: training on artificially generated pseudo data, built by applying a verb-tense table and randomly deleting and inserting English words, 9 million sentences in total. All of this data contains errors; using a sufficient amount of it as pre-training gives the model an initial ability to correct grammatically erroneous sentences.

Stage two: training on erroneous data from university papers and the writing of non-native English speakers. Compared with the pseudo data, this data is closer to real usage and of relatively higher quality; this stage helps the model fit real-world English sentences.

Stage three: training on erroneous data from English proficiency certificate examinations and from papers by native English speakers. At this stage the training data also includes sentences without grammatical errors, which further raises the difficulty: the model must distinguish whether a real sentence contains grammatical errors at all, preventing over-correction at test time from hurting the correction results. Native-speaker data is harder still, so it further raises the training difficulty and improves the English grammar correction ability.
Specific Embodiment 6: this embodiment is described with reference to FIGS. 1 to 3. In step 5 of the English grammar correction technique based on multitask learning and an attention mechanism of this embodiment, an iterative correction scheme is used to find as many of the grammatical errors occurring in sentences as possible.

Grammatical errors in some English sentences are nested; that is, the model can only find some errors after the preceding errors have been corrected. Under the edit-tag prediction scheme proposed by the invention, for example, the number of a noun may have to be corrected first and the words then joined with a hyphen to complete the final correction, which requires multiple correction steps. In addition, the model's semantic modeling can drift on sentences containing many grammatical errors, which also calls for an iterative approach that lets the model understand the sentence step by step. Multiple rounds of iterative error correction enable the model to find the deeper, implicit grammatical errors in a sentence; a minimal sketch of this loop follows.
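The loop below reuses the single-pass `correct_once` placeholder from the earlier sketch; the fixed-point stopping rule and the iteration cap are assumptions consistent with the description:

```python
def iterative_correct(sentence, correct_pass, max_rounds: int = 5):
    """Repeatedly apply a single correction pass until no further edit is made."""
    for _ in range(max_rounds):
        corrected = correct_pass(sentence)
        if corrected == sentence:   # fixed point: no more edits proposed
            break
        sentence = corrected
    return sentence
```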
Examples
Following the steps above, the invention can be implemented as a simple automatic English grammar correction module that can be embedded into any existing system for a plug-and-play effect. The specific verification results of the invention are as follows:
the embodiment is carried out according to the flow shown in fig. 1, and a Chinese spelling correction system based on multi-mode pre-training fusion is built. After the system is started, the input English text to be corrected is firstly taken out from the database and divided into sub words, and then the sub word sequences are input into a pre-training English coding model. The context characteristics output by the model pass through the self-attention layer to finish further interaction, and the final grammar correction result is obtained by using the representation containing the syntax information and the semantic information.
A passage from the writing of a non-native English speaker was selected, and the correction results of the system built by the invention are shown in FIG. 3. According to the corrections and revision suggestions given in the figure, the English grammar correction technique based on multitask learning and an attention mechanism can correct missing-article errors: "with certain" → "with a certain"; it can also correct preposition misuse: "diagnosed out" → "diagnosed"; and it can likewise correct the verb tense errors that learners easily make. In this way, users can see the English grammatical errors in an article at a glance and correct them quickly.
The present invention is not limited to the preferred embodiments described above; various modifications and equivalent variations in detail, and other embodiments of the kinds described above, fall within the spirit and scope of the present invention.

Claims (6)

1. An English grammar correction technique based on multitask learning and an attention mechanism, characterized in that it is implemented through the following steps:
step 1, for an input sentence, reading an English subword vocabulary and an edit-tag vocabulary from a database;
step 2, feeding the sentence into a pre-trained encoding model to obtain a contextual representation of the whole sentence;
step 3, passing the resulting context feature vectors through a self-attention layer, so that the semantic vectors of all words in the sentence interact further via the self-attention mechanism;
step 4, using a binary classifier to judge whether each input subword needs an editing operation, classifying each input subword with a vocabulary-sized classifier, and selecting the highest-scoring class as the edit label of the corresponding subword;
step 5, post-processing the words of the input sentence according to the meaning of the correction labels predicted by the model, and feeding the post-processed result back into the model for multiple iterations to obtain the final result.
2. The English grammar correction technique based on multitask learning and an attention mechanism of claim 1, characterized in that: the pre-trained English encoding model adopted in step 2 is one of RoBERTa, XLNet and DeBERTa; all three pre-trained models are improved versions of BERT, using more pre-training corpora and more reasonable pre-training tasks and modeling mechanisms, and performing well across many English semantic modeling tasks; the specific process comprises the following steps:
step 2.1, loading an English subword tokenizer and segmenting each word of the input English sentence into subword form;
step 2.2, mapping the subword sequence of the English sentence into 768-dimensional vectors through the word-embedding layer of the pre-trained English encoding model;
step 2.3, passing the embedded vectors through the 12-layer pre-trained English encoding model, and stacking and averaging the hidden vectors output by the last three layers of the model, thereby obtaining hidden vectors containing richer semantic information.
3. The English grammar correction technique based on multitask learning and an attention mechanism of claim 1, characterized in that: the self-attention layer in step 3 performs a self-attention operation on the semantically informed encoded representation output by step 2, so that the semantic representations across the whole sentence interact further:

Attn(x) = (W_2 · tanh(W_1 · x + b_1) + b_2) · x (1),

in formula (1), x is the semantic representation of the sentence obtained in step 2; W_1, W_2, b_1 and b_2 are trainable parameters; h is the size of the last dimension of x; tanh is the hyperbolic tangent function, serving as the activation function that gives the attention layer nonlinear capacity; the self-attention layer lets the representation vectors within the sentence interact further and assigns higher attention scores to the components carrying syntactic relations, so that the model can further model the syntactic relations of the sentence.
4. The English grammar correction technique based on multitask learning and an attention mechanism of claim 1, characterized in that: in step 4 the optimization direction of the model is additionally constrained during training by multitask learning; three loss constraints help the model achieve better results:

Loss = Loss_contrast + Loss_detect + Loss_label

equation (2) is the first loss constraint (the detection loss Loss_detect), where P_d(f_i|X) denotes the probability, predicted by the model from the input sequence X, that the word at position i is erroneous; equation (3) is the second loss constraint (the edit-label loss Loss_label), where P_l(y_i|X) denotes the probability of the edit label predicted for the word at position i from the input sequence X; equation (4) is the third loss constraint (the margin loss Loss_contrast), where y is an all-ones vector that determines the optimization direction; P(y_i|x_i) denotes the probability the model assigns, once the subword x_i is input, to the true edit label y_i; P_top5 denotes the five highest label probabilities output by the model; mask_l is a Boolean vector that handles the case where the model's top-5 outputs contain the true label: if the probability of the true label y_i ranks within the top 5, the mask at that position is 0 and the position does not take part in the loss, while the remaining four positions are 1; mean is an averaging operation along the last dimension of the vector; margin is the minimum acceptable gap between the two probabilities, with a reference value between 0.1 and 0.3; the goal of the margin loss is that, for each input, the probability of the true label output by the model exceeds the average of the model's top-5 output probabilities, and a suitable margin value tunes the model's confidence in correct classifications; finally, through the multitask learning of these three constraints, the model achieves better results with higher classification confidence.
5. The English grammar correction technique based on multitask learning and an attention mechanism of claim 1 or 4, characterized in that: multi-stage training is adopted in the training process of step 4 to raise the training difficulty gradually;

in the field of grammatical error correction, only a small portion of sentences contain grammatical errors, i.e. most components must remain unchanged, which requires the model to see error-free sentences during training; the invention adopts multi-stage training so that the training difficulty rises gradually while correction accuracy is preserved; training of the model is divided into three stages:

stage one: training on artificially generated pseudo data, built by applying a verb-tense table and randomly deleting and inserting English words, 9 million sentences in total; all of this data contains errors, and using a sufficient amount of it as pre-training gives the model an initial ability to correct grammatically erroneous sentences;

stage two: training on erroneous data from university papers and the writing of non-native English speakers; compared with the pseudo data, this data is closer to real usage and of relatively higher quality, and this stage helps the model fit real-world English sentences;

stage three: training on erroneous data from English proficiency certificate examinations and from papers by native English speakers; at this stage the training data also includes sentences without grammatical errors, which further raises the difficulty, since the model must distinguish whether a real sentence contains grammatical errors at all, preventing over-correction at test time from hurting the correction results; native-speaker data is harder still, so it further raises the training difficulty and improves the English grammar correction ability.
6. The English grammar correction technique based on multitask learning and an attention mechanism of claim 1, characterized in that: in step 5, an iterative correction scheme is used to find as many of the grammatical errors occurring in the sentence as possible.
CN202310630375.5A 2023-05-31 2023-05-31 English grammar correction technology based on multitask learning and attention mechanism Pending CN116681061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310630375.5A CN116681061A (en) 2023-05-31 2023-05-31 English grammar correction technology based on multitask learning and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310630375.5A CN116681061A (en) 2023-05-31 2023-05-31 English grammar correction technology based on multitask learning and attention mechanism

Publications (1)

Publication Number Publication Date
CN116681061A true CN116681061A (en) 2023-09-01

Family

ID=87784761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310630375.5A Pending CN116681061A (en) 2023-05-31 2023-05-31 English grammar correction technology based on multitask learning and attention mechanism

Country Status (1)

Country Link
CN (1) CN116681061A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117744635A (en) * 2024-02-12 2024-03-22 长春职业技术学院 English text automatic correction system and method based on intelligent AI
CN117744635B (en) * 2024-02-12 2024-04-30 长春职业技术学院 English text automatic correction system and method based on intelligent AI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination