CN115935957B - Sentence grammar error correction method and system based on syntactic analysis - Google Patents


Info

Publication number
CN115935957B
Authority
CN
China
Prior art keywords
sentence
vector
word
dependency
mask
Prior art date
Legal status
Active
Application number
CN202211701494.7A
Other languages
Chinese (zh)
Other versions
CN115935957A (en)
Inventor
车万翔
孙博
王一轩
朱庆福
张斯尧
Current Assignee
Guangdong Nanfang Network Information Technology Co ltd
Original Assignee
Guangdong Nanfang Network Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Nanfang Network Information Technology Co ltd
Priority to CN202211701494.7A
Publication of CN115935957A
Application granted
Publication of CN115935957B
Legal status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a sentence grammar error correction method and system based on syntactic analysis. The method comprises: encoding an unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model on the first sentence vector according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model; masking and replacing words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model; building a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model; performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector; and decoding the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected. The embodiment enhances syntactic perception and semantic representation capability and improves the accuracy of sentence grammar correction.

Description

Sentence grammar error correction method and system based on syntactic analysis
Technical Field
The invention relates to the field of sentence grammar error correction, in particular to a sentence grammar error correction method and system based on syntactic analysis.
Background
Grammar error correction is an important component of the text correction field: given a Chinese sentence, a model outputs a grammatically corrected version of that sentence. Pre-training-based methods are a common class of text correction methods and mainly solve Chinese spelling errors, such as near-homophone character confusions (rendered literally in translation as the "tentacle (deposit) card" and "eye (mirror) snake" examples), which a large-scale pre-training model can resolve from its pre-training corpus.
In the prior art, text error correction is performed through a text encoder, and this approach has several defects. Although the BERT language model achieves strong results on natural language processing tasks, its pre-training stage involves no syntax-related pre-training task, so the model performs poorly on grammar error correction tasks; moreover, error correction models based on pre-training models depend too heavily on data and generalize poorly to corpora not covered by the training set. For some grammar-related text errors, methods based simply on a pre-trained model have great limitations. For example, in "the full-time staff discusses and listens to (listens to and discusses) reports", the sentence contains no spelling error, but the order of "listening" and "discussing" is reversed; such grammar errors are not solved by a large-scale pre-training corpus, similar examples are numerous, the pre-training corpus cannot cover them all, and the model generalizes poorly. The existing method for handling spelling errors and grammar errors together runs models in series: in the first step the text is input into a spelling correction model to obtain the spelling-corrected result, and in the second step that result is input into a grammar correction model to obtain the grammar-corrected result. If the result of the first-step spelling correction is wrong, the input to the grammar correction model is also a wrong sentence, which greatly degrades the performance of the grammar correction model and lowers the accuracy of grammar correction.
Disclosure of Invention
The invention provides a sentence grammar error correction method and system based on syntactic analysis, which enhance syntactic perception capability and semantic representation capability, reduce the step-by-step propagation of errors caused by running models in series, and improve the accuracy of sentence grammar correction.
In order to solve the above technical problems, an embodiment of the present invention provides a sentence grammar error correction method based on syntactic analysis, including:
encoding an unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model on the first sentence vector according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model;
masking and replacing words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model;
building a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model;
performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector;
and decoding the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
In the embodiment of the invention, the unlabeled corpus is encoded through the basic pre-training model to obtain the first sentence vector, and the dependency prediction model is pre-trained on the first sentence vector according to the unlabeled corpus and word dependency relationships; words of the unlabeled corpus are masked and replaced according to the first preset rule to obtain the mask corpus, on which the mask language model is trained; the syntax-enhanced text encoder is built according to the text encoder, the dependency prediction model and the mask language model; the sentence to be corrected is encoded by the syntax-enhanced text encoder to obtain a sentence vector; and the sentence vector is decoded by the syntax decoder to obtain the corrected sentence, completing sentence grammar correction. Pre-training the dependency prediction model effectively enhances syntactic perception and semantic representation capability, and through the mask language model the word vectors encoded by the model contain richer context information. Building the syntax-enhanced text encoder from the text encoder, the dependency prediction model and the mask language model avoids excessive dependence on data and improves generalization to corpora not covered by the training set; generalization is stronger under the same data and model parameters, grammar errors can be corrected on top of spelling-error correction, both types of text correction can be handled by a single model, the step-by-step propagation of errors caused by running models in series is reduced, and sentence grammar correction accuracy is improved.
As a preferred scheme, performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector is specifically:
marking a grammar error correction data set of the field where the sentence to be corrected is located according to a data format of a preset triplet to obtain a field data set;
adjusting the syntax-enhanced text encoder according to the field data set to obtain a field text encoder;
and performing sentence encoding on the sentence to be corrected with the field text encoder to obtain the second sentence vector.
In the embodiment of the invention, the model is first pre-trained and then fine-tuned on the field data set according to domain knowledge and the labeled field data; fine-tuning the syntax-enhanced text encoder improves sentence grammar correction in different fields and achieves accurate sentence correction.
As a preferred scheme, encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector, and pre-training the dependency prediction model on the first sentence vector according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model, is specifically:
inputting the unlabeled corpus into a syntactic analyzer to obtain triples of word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label;
encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector;
selecting word pairs according to a preset selection rule and the triples of word dependency relationships of each sentence, and obtaining word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector comprises a first word vector and a second word vector;
and performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model.
As a preferred scheme, performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model, is specifically:
inputting the first word vector into a pooling layer to output a third word vector, and inputting the second word vector into the pooling layer to output a fourth word vector; inputting the third word vector and the fourth word vector into a classifier to predict the dependency relationship of the word pair, and calculating the loss value through the cross entropy loss function according to the dependency label and the predicted dependency relationship of the word pair;
and determining the gradient of the current cross entropy loss function according to the loss value, performing gradient descent to update the parameters of the dependency prediction model, and obtaining the dependency prediction model.
As a preferred scheme, masking and replacing words of the unlabeled corpus according to the first preset rule to obtain the mask corpus, and training the mask corpus to obtain the mask language model, is specifically:
selecting 15% of the words in each sentence of the unlabeled corpus to mask according to the first preset rule to obtain mask words, and dividing the mask words into three groups according to a preset proportion to obtain a first group of mask words, a second group of mask words and a third group of mask words;
replacing the first group of mask words with mask marks, leaving the second group of mask words unreplaced, and replacing each of the third group of mask words with a random word, to obtain the mask corpus;
and carrying out mask language model training on the mask corpus to obtain a mask language model.
As a preferred scheme, a syntax-enhanced text encoder is built according to the text encoder, the dependency prediction model and the mask language model, specifically:
according to the output vectors of the word vectors in the linear layers and the activation function, a first attention layer is established, and the formula is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Attention is the first attention layer, Q is the output vector of the word vectors at the first linear layer, K is the output vector of the word vectors at the second linear layer, V is the output vector of the word vectors at the third linear layer, softmax is the activation function, and d_k is the word-vector hidden-layer dimension;
the feed-forward neural network layer is established through a two-layer fully connected neural network on top of the attention layer, and the formula is as follows:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 is the first weight matrix, W_2 is the second weight matrix, b_1 is the first bias term, b_2 is the second bias term, and X is the first attention layer output;
establishing a text encoder according to the first attention layer and the feedforward neural network layer;
a syntactically enhanced text encoder is constructed from the text encoder, the dependency prediction model, and the mask language model.
As a preferred scheme, decoding the second sentence vector by a syntax decoder to obtain a corrected sentence corresponding to the sentence to be corrected, specifically:
establishing a second attention layer through a masking operation;
establishing a third attention layer according to the output vector of the linear layer and the activation function;
establishing a softmax layer that predicts the probability of the output word;
establishing a syntax decoder according to the second attention layer, the third attention layer and the softmax layer;
And decoding the second sentence vector through the syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
In order to solve the same technical problem, the embodiment of the invention also provides a sentence grammar error correction system based on syntactic analysis, which comprises: a dependency prediction module, a mask language module, a syntax reinforcement module, an encoding module and a decoding module;
the dependency prediction module is used for encoding the unlabeled corpus through the basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model on the first sentence vector according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model;
the mask language module is used for masking and replacing words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model;
the syntax reinforcement module is used for building a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model;
the encoding module is used for performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector;
the decoding module is used for decoding the second sentence vector through the syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
Preferably, the encoding module includes a labeling unit, an adjusting unit and a sentence encoding unit;
the labeling unit is used for labeling the grammar error correction data set of the field where the sentence to be corrected is located according to the data format of the preset triples to obtain a field data set;
the adjusting unit is used for adjusting the syntax-enhanced text encoder according to the field data set to obtain a field text encoder;
the sentence encoding unit is used for performing sentence encoding on the sentence to be corrected through the field text encoder to obtain a second sentence vector.
Preferably, the dependency prediction module includes a syntactic analysis unit, a basic encoding unit, a word pair unit and a training unit;
the syntactic analysis unit is used for inputting the unlabeled corpus into the syntactic analyzer to obtain a triplet of word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label;
the basic encoding unit is used for encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector;
the word pair unit is used for selecting word pairs according to a preset selection rule and the triples of word dependency relationships of each sentence, and obtaining word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector comprises a first word vector and a second word vector;
the training unit is used for performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model.
Drawings
Fig. 1: a schematic flow chart of an embodiment of the sentence grammar error correction method based on syntactic analysis provided by the invention;
Fig. 2: a schematic diagram of the output of the dependency syntax analyzer in an embodiment of the sentence grammar error correction method based on syntactic analysis provided by the invention;
Fig. 3: a structure diagram of the syntax-enhanced text encoder in an embodiment of the sentence grammar error correction method based on syntactic analysis provided by the invention;
Fig. 4: a sentence correction flow chart of an embodiment of the sentence grammar error correction method based on syntactic analysis provided by the invention;
Fig. 5: a connection schematic diagram of another embodiment of the sentence grammar error correction system based on syntactic analysis provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, a schematic flow chart of a sentence grammar error correction method based on syntactic analysis is provided in an embodiment of the present invention. The sentence grammar error correction method is suitable for correcting the grammar of text sentences: sentence encoding is performed by a syntax-enhanced text encoder, which enhances syntactic perception capability and semantic representation capability, reduces the step-by-step propagation of errors caused by running models in series, and improves sentence grammar correction accuracy. The sentence grammar error correction method comprises steps 101 to 105, as follows:
step 101: encoding the non-labeling corpus through the basic pre-training model to obtain a first sentence vector, and pre-training the dependency prediction model according to the non-labeling corpus and the word dependency relationship to obtain the dependency prediction model.
Optionally, step 101 specifically includes steps 1011 to 1014, as follows:
step 1011: inputting the unlabeled corpus into a syntactic analyzer to obtain triples of word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label.
In order to make the encoded sentences imply richer syntactic knowledge, the dependency prediction model learns syntactic knowledge in the pre-training stage, which enhances the model's awareness of syntax. First, pre-training is performed on large-scale unlabeled data; the pre-training tasks comprise pre-training of the dependency prediction model and pre-training of the mask language model. The purpose of the dependency prediction model is to let the model learn additional syntactic knowledge, and the purpose of the mask language model is to make the word vectors encoded by the model contain richer context information. The syntactic knowledge is derived from the results of a dependency parser (syntactic analyzer), whose output is illustrated in fig. 2: the parser recognizes the dependency relationship (tag) between two words t_1 and t_2, forming a triplet (t_1, t_2, tag). The goal of the model is to acquire such dependency knowledge (t_1, t_2, tag).
In the process of establishing the syntax-enhanced text encoder, this embodiment first obtains the dependency prediction model through pre-training. To obtain the labels of the pre-training data, the large-scale unlabeled corpus {x_0, x_1, x_2, ..., x_n} is input into the syntactic analyzer to obtain the dependency syntax output {s_0, s_1, s_2, ..., s_n} for each sentence, where s_i is a series of word-dependency triples {a_0, a_1, a_2, ..., a_m} and a_j = (t_1, t_2, tag). The meaning of the triplet is: t_1 is a child of t_2, so t_2 is the parent node vector, t_1 is the child node vector, and the dependency label is tag;
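As a concrete illustration of step 1011, the sketch below shows how such triples might be collected; the patent does not name a specific parser, so the `parser.parse` interface returning (child, parent, tag) arcs is a hypothetical stand-in:

```python
# Sketch of step 1011 (label extraction), assuming a generic dependency parser.
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # a_j = (t1, t2, tag): t1 is the child of t2

def extract_dependency_triples(corpus: Iterable[str], parser) -> List[List[Triple]]:
    """Map each unlabeled sentence x_i to its dependency output s_i."""
    outputs = []
    for sentence in corpus:
        arcs = parser.parse(sentence)  # assumed to yield (child, parent, tag)
        outputs.append([(t1, t2, tag) for (t1, t2, tag) in arcs])
    return outputs
```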
step 1012: encoding the non-labeling corpus through a basic pre-training model to obtain a first sentence vector;
in this embodiment, the input of the text encoder's dependency prediction model pre-training process is a non-labeled corpus, taking a sentence containing n words as an example, in a syntactically enhanced pre-trained text encoder, the original sentence { w ] is encoded by an existing pre-training model (e.g., BERT, roBERTa) 1 ,w 2 …w n Through the existing pre-training mouldAfter coding, a sentence vector { t } of a sentence is obtained 1 ,t 2 …t n }。
Step 1013: selecting word pairs according to a preset selection rule and a triplet of word dependency relationship of each sentence, and obtaining word pair vectors according to first sentence vectors corresponding to the word pairs; wherein the term pair vector comprises a first term vector and a second term vector;
in this embodiment, some word pair vectors { t } are selected by a predetermined selection rule (e.g., random, ordered) based on the dependency triples obtained in step 1011 i ,t j First word vector is { t } i Second word vector is { t } j }。
Step 1014: and (3) pre-training the word pair vector by a dependency prediction model according to the dependency label, calculating a loss value through a cross entropy loss function, and updating the model parameters by gradient descent according to the loss value to update the parameters of the dependency prediction model so as to obtain the dependency prediction model.
Optionally, step 1014 specifically includes: performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model, specifically:
inputting the first word vector into a pooling layer to output a third word vector, and inputting the second word vector into the pooling layer to output a fourth word vector; inputting the third word vector and the fourth word vector into a classifier to predict the dependency relationship of the word pair, and calculating the loss value through the cross entropy loss function according to the dependency label and the predicted dependency relationship of the word pair; and determining the gradient of the current cross entropy loss function according to the loss value, performing gradient descent to update the parameters of the dependency prediction model, and obtaining the dependency prediction model.
In this embodiment, the text encoder outputs for the two words are input to the pooling layer, and the output of the pooling layer is used as the input of the classifier; by letting the model predict the dependency relationships of word pairs, the syntactic perceptibility of the model is enhanced. For a selected word pair, word 1 (the first word vector) t_i, ..., t_{i+m} and word 2 (the second word vector) t_j, ..., t_{j+k}, the two word vectors are respectively input into the pooling layer to obtain the vectors corresponding to the two words: the third word vector t_1 = pooling(t_i, ..., t_{i+m}) and the fourth word vector t_2 = pooling(t_j, ..., t_{j+k}). After obtaining the word pair representation (t_1, t_2), it is input into the classifier, classifier(t_1, t_2), to predict the dependency relationship; the label is derived from step 1011, and the loss value loss is calculated using the cross entropy loss function, with the formula as follows:
L = −(1/N) Σ_i Σ_c y_ic · log(p_ic)

where L is the loss value loss, N is the number of samples (summed over i), M is the number of categories (summed over c), y_ic is an indicator function (0 or 1) that takes 1 if the true class of sample i equals c and 0 otherwise, and p_ic is the predicted probability that sample i belongs to category c.
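The following PyTorch sketch makes steps 1013 and 1014 concrete; mean-pooling and a linear classifier over the concatenated pair are assumptions, since the patent fixes neither the pooling function nor the classifier form:

```python
# Sketch of the dependency prediction head (step 1014).
import torch
import torch.nn as nn

class DependencyPredictionHead(nn.Module):
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden_size, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()  # the cross entropy loss above

    def forward(self, span1, span2, labels):
        # span1: (batch, m+1, hidden) subword vectors t_i .. t_{i+m} of word 1
        # span2: (batch, k+1, hidden) subword vectors t_j .. t_{j+k} of word 2
        t1 = span1.mean(dim=1)  # third word vector, pooling(t_i .. t_{i+m})
        t2 = span2.mean(dim=1)  # fourth word vector, pooling(t_j .. t_{j+k})
        logits = self.classifier(torch.cat([t1, t2], dim=-1))
        return self.loss_fn(logits, labels)  # loss value for gradient descent
```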
In this embodiment, the loss value loss is calculated using the output of the classifier and the labels from step 1011, and the model parameters are then updated by gradient descent. The gradient descent process is as follows:

a. Determine the gradient of the loss function at the current position: for each parameter θ_i, the gradient is ∂L/∂θ_i.

b. Multiply the gradient of the loss function by the step length α to obtain the descent distance at the current position, i.e., α · ∂L/∂θ_i.

c. Determine whether the gradient descent distance of every θ_i is smaller than a threshold ε; if so, the algorithm terminates, and the current θ_i (i = 0, 1, ..., n) are the final result; otherwise go to step d.

d. Update every θ_i as θ_i := θ_i − α · ∂L/∂θ_i; after the update is finished, continue with step a.
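In framework terms, steps a to d correspond to one optimizer update; the sketch below assumes the `encoder` and the head class from the earlier sketches, and plain SGD with an illustrative learning rate and label count:

```python
# One gradient-descent update (steps a to d); hyperparameters are illustrative.
head = DependencyPredictionHead(encoder.config.hidden_size, num_labels=40)
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-4)

def pretraining_step(span1, span2, labels):
    loss = head(span1, span2, labels)  # cross entropy loss value
    optimizer.zero_grad()
    loss.backward()                    # step a: gradients dL/d(theta_i)
    optimizer.step()                   # steps b and d: theta_i -= lr * gradient
    return loss.item()                 # monitored against the stop criterion (c)
```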
Step 102: masking and replacing words of the unlabeled corpus according to the first preset rule to obtain the mask corpus, and performing mask language model training on the mask corpus to obtain the mask language model.
Optionally, step 102 specifically includes: selecting 15% of the words in each sentence of the unlabeled corpus to mask according to the first preset rule to obtain mask words, and dividing the mask words into three groups according to a preset proportion to obtain a first group of mask words, a second group of mask words and a third group of mask words; replacing the first group of mask words with mask marks, leaving the second group of mask words unreplaced, and replacing each of the third group of mask words with a random word, to obtain the mask corpus; and performing mask language model training on the mask corpus to obtain the mask language model.
In this embodiment, to enhance the context representation capability of the mask language model, training proceeds as follows: 15% of the words in each sentence are randomly selected as mask words and divided into three groups according to the preset proportion (8:1:1). Of the mask words, 80% (the first group) are replaced with the actual [MASK] mark, 10% (the second group) are left unchanged, and 10% (the third group) are each replaced with a random word, yielding the mask corpus. The mask language model is then trained on the mask corpus, with a training process similar to the dependency prediction model training method. The loss of the mask language model is that of a multi-category classification task, so the cross entropy loss function of step 1014 is applied to calculate its loss value, and the sum of the loss values of the dependency prediction and mask language model training tasks is used as the total output loss.
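A sketch of this 15% and 8:1:1 masking rule follows; operating on a pre-tokenized word list and drawing random replacements from a plain vocabulary list are simplifications for illustration:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_rate=0.15):
    """Return (masked tokens, labels); labels keep the original masked words."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:   # only ~15% of words are selected
            continue
        labels[i] = tok                    # the model must recover this word
        r = random.random()
        if r < 0.8:                        # first group: replace with mask mark
            masked[i] = mask_token
        elif r < 0.9:                      # second group: leave unchanged
            pass
        else:                              # third group: replace with random word
            masked[i] = random.choice(vocab)
    return masked, labels
```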
Step 103: a syntactically enhanced text encoder is built from the text encoder, the dependency prediction model, and the mask language model.
Optionally, step 103 specifically includes: establishing a first attention layer according to the output vectors of the word vectors in the linear layers and the activation function, with the formula as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Attention is the first attention layer, Q is the output vector of the word vectors at the first linear layer, K is the output vector of the word vectors at the second linear layer, V is the output vector of the word vectors at the third linear layer, softmax is the activation function, and d_k is the word-vector hidden-layer dimension;
the feed-forward neural network layer is established through a two-layer fully connected neural network on top of the attention layer, with the formula as follows:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 is the first weight matrix, W_2 is the second weight matrix, b_1 is the first bias term, b_2 is the second bias term, and X is the first attention layer output;
establishing a text encoder according to the first attention layer and the feedforward neural network layer;
a syntactically enhanced text encoder is constructed from the text encoder, the dependency prediction model, and the mask language model.
In this embodiment, as shown in fig. 3, in the application of actual sentence grammar correction, the trained syntax-enhanced text encoder is used to encode sentences. The syntax-enhanced text encoder structure comprises an attention mechanism (the first attention layer) and a feedforward neural network layer. The output of the attention mechanism (first attention layer) is:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q, K, V are respectively the output vectors of the word vectors at three linear layers, softmax is the activation function, and d_k is the word-vector hidden-layer dimension. The feedforward neural network layer is simpler: a two-layer fully connected network whose first layer uses the ReLU activation function and whose second layer applies no activation function, with the formula:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 and W_2 are weight matrices, b_1 and b_2 are bias terms, and X is the output of the attention layer (first attention layer).
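Put together, the two formulas correspond to the following PyTorch sketch of one encoder block; single-head attention and the omission of residual connections and layer normalization are simplifications for clarity:

```python
import math
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # first linear layer  -> Q
        self.k_proj = nn.Linear(d_model, d_model)  # second linear layer -> K
        self.v_proj = nn.Linear(d_model, d_model)  # third linear layer  -> V
        self.w1 = nn.Linear(d_model, d_ff)         # W_1, b_1 (ReLU layer)
        self.w2 = nn.Linear(d_ff, d_model)         # W_2, b_2 (no activation)

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        attn_out = torch.softmax(scores, dim=-1) @ v
        # max(0, X W_1 + b_1) W_2 + b_2
        return self.w2(torch.relu(self.w1(attn_out)))
```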
Step 104: and sentence coding is carried out on the sentence to be corrected through a syntactic reinforcement text encoder, and a second sentence vector is obtained.
Optionally, step 104 specifically includes: labeling the grammar error correction data set of the field where the sentence to be corrected is located according to the data format of a preset triplet to obtain a field data set; adjusting the syntax-enhanced text encoder according to the field data set to obtain a field text encoder; and performing sentence encoding on the sentence to be corrected with the field text encoder to obtain a second sentence vector.
In this embodiment, as shown in fig. 4, in order to make the encoded sentence contain richer syntactic knowledge, step a is implemented by steps 101 to 103: the pre-training tasks comprise the dependency prediction model and the mask language model, where the dependency prediction model aims to let the model learn additional syntactic knowledge, and the mask language model aims to make the word vectors encoded by the model contain richer context information. In step b, to adapt to the grammar error correction task, the grammar error correction data set of the field is labeled to obtain the field data set, in which each datum is a triplet of the format (sequence number, sentence containing a grammar error, corrected sentence), for example: (1, the whole plant staff discusses and listens to the report, the whole plant staff listens to and discusses the report). The field data set is used for fine-tuning the model; if no field data set is available, the pre-trained syntax-enhanced text encoder can encode the sentence to be corrected directly to obtain the corresponding sentence vector, and decoding then continues to obtain the corrected sentence and complete sentence grammar correction. Step c encodes the sentence to be corrected with the pre-trained field text encoder to obtain the second sentence vector, step d decodes using the second sentence vector encoded in step c, and step e finally outputs the corrected sentence. A sketch of the field data format follows.
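The triplet data format of step b can be sketched as below; the class name and field names are illustrative, and the example is the patent's word-order example in its translated form:

```python
from dataclasses import dataclass

@dataclass
class FieldExample:
    seq_no: int   # sequence number
    source: str   # sentence containing a grammar error
    target: str   # corrected sentence

# The patent's word-order example (translated), as used for fine-tuning:
field_data = [
    FieldExample(
        1,
        "the whole plant staff discusses and listens to the report",
        "the whole plant staff listens to and discusses the report",
    ),
]
```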
Step 105: and decoding the second sentence vector through a syntax decoder to obtain a correction sentence corresponding to the sentence to be corrected.
Optionally, step 105 specifically includes: establishing a second attention layer through a masking operation;
establishing a third attention layer according to the output vector of the linear layer and the activation function;
establishing a softmax layer that predicts the probability of the output word;
establishing a syntax decoder according to the second attention layer, the third attention layer and the softmax layer;
and decoding the second sentence vector through a syntax decoder to obtain a correction sentence corresponding to the sentence to be corrected.
In this embodiment, the corrected sentence is output by the decoder decoding the encoded second sentence vector. The decoder structure comprises two attention layers: the first (the second attention layer) uses a masking operation, and the second (the third attention layer) is similar to the first attention layer in step 104; a final softmax layer calculates the probability of the next output word. The decoding process can adopt a greedy search or beam search strategy, and the specific decoding process is as follows: the decoder takes a start token as input, passes it through the self-attention layer, then through the encoder-decoder attention layer, and then through the feed-forward layer to output the final representation of the token; a linear layer plus a softmax layer then predict which word in the dictionary to output. The decoder decodes only one word at each step; it outputs the word, appends it to the decoder input, and repeats the above operation until <eos> is decoded. When decoding ends, the sentence output is finished, and the corrected sentence corresponding to the sentence to be corrected is finally output.
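A greedy-search sketch of this loop; the `decoder` callable returning next-word logits from the encoder states and the tokens generated so far is an assumed interface, not the patent's fixed API:

```python
def greedy_decode(decoder, encoder_states, bos_id, eos_id, max_len=128):
    tokens = [bos_id]                              # start token
    for _ in range(max_len):
        logits = decoder(encoder_states, tokens)   # scores over the dictionary
        next_id = int(logits.argmax())             # most probable next word
        tokens.append(next_id)                     # feed it back as decoder input
        if next_id == eos_id:                      # stop once <eos> is decoded
            break
    return tokens[1:]
```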
As an example of this embodiment, take the sentence to be corrected (translated from Chinese): "For your care of me, a lot is paid." The text is encoded by the syntax-enhanced text encoder using the dependency syntax, then decoded with the beam search strategy, and the corrected sentence is output: "Your effort to take care of me is very much."
In the embodiment of the invention, the unlabeled corpus is encoded through the basic pre-training model to obtain the first sentence vector, and the dependency prediction model is pre-trained on the first sentence vector according to the unlabeled corpus and word dependency relationships; words of the unlabeled corpus are masked and replaced according to the first preset rule to obtain the mask corpus, on which the mask language model is trained; the syntax-enhanced text encoder is built according to the text encoder, the dependency prediction model and the mask language model; the sentence to be corrected is encoded by the syntax-enhanced text encoder to obtain a sentence vector; and the sentence vector is decoded by the syntax decoder to obtain the corrected sentence, completing sentence grammar correction. Pre-training the dependency prediction model effectively enhances syntactic perception and semantic representation capability, and through the mask language model the word vectors encoded by the model contain richer context information. Building the syntax-enhanced text encoder from the text encoder, the dependency prediction model and the mask language model avoids excessive dependence on data and improves generalization to corpora not covered by the training set; generalization is stronger under the same data and model parameters, grammar errors can be corrected on top of spelling-error correction, both types of text correction can be handled by a single model, the step-by-step propagation of errors caused by running models in series is reduced, and sentence grammar correction accuracy is improved.
Example two
Accordingly, referring to fig. 5, fig. 5 is a schematic connection diagram of a second embodiment of a sentence grammar error correction system based on syntactic analysis according to the present invention. As shown in fig. 5, the syntax analysis based sentence syntax error correction system includes a dependency prediction module 501, a mask language module 502, a syntax reinforcement module 503, an encoding module 504, and a decoding module 505;
the dependency prediction module 501 is configured to encode the unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-train the first sentence vector into a dependency prediction model according to the unlabeled corpus and the word dependency relationship to obtain the dependency prediction model.
The dependency prediction module 501 includes: a syntax analysis unit 5011, a base encoding unit 5012, a word pair unit 5013, and a training unit 5014.
The syntactic analysis unit 5011 is used for inputting the unlabeled corpus into a syntactic analyzer to obtain triples of word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label.
The basic encoding unit 5012 is configured to encode the unlabeled corpus through the basic pre-training model, to obtain the first sentence vector.
The word pair unit 5013 is configured to select word pairs according to a preset selection rule and the triples of word dependency relationships of each sentence, and to obtain word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector includes a first word vector and a second word vector.
The training unit 5014 is configured to perform dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculate a loss value through a cross entropy loss function, and perform gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model.
The mask language module 502 is configured to mask and replace words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and to perform mask language model training on the mask corpus to obtain a mask language model.
The syntax reinforcement module 503 is configured to build a syntax-enhanced text encoder based on the text encoder, the dependency prediction model, and the mask language model.
The encoding module 504 is configured to perform sentence encoding on the sentence to be corrected by using a syntactically enhanced text encoder, so as to obtain a second sentence vector.
The encoding module 504 includes: a labeling unit 5041, an adjusting unit 5042, and a sentence encoding unit 5043.
The labeling unit 5041 is configured to label, according to a data format of a preset triplet, a grammar error correction data set of a domain where a sentence to be corrected is located, and obtain a domain data set.
The adjusting unit 5042 is configured to adjust the syntax enhanced text encoder according to the domain data set, to obtain a domain text encoder.
The sentence coding unit 5043 is configured to perform sentence coding on the sentence to be corrected by using the field text encoder, so as to obtain a second sentence vector.
The decoding module 505 is configured to decode the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
By implementing the embodiment of the invention, the following defects of the prior art are addressed: existing error correction models based on pre-training models depend too heavily on data and generalize poorly to corpora not covered by the training set. With grammar correction reinforced by syntactic analysis and pre-training, a syntax-enhanced text encoder is established for sentence grammar correction, which effectively enhances the syntactic perception capability and semantic representation capability of the model; generalization is stronger under the same data and model parameters, the grammar correction problem can be solved on top of correcting spelling errors, and the phenomenon of error accumulation in a pipeline system is greatly reduced. The existing method for solving spelling errors and grammar errors runs models in series: in the first step the text is input into a spelling correction model to obtain the spelling-corrected result, and in the second step that result is input into a grammar correction model to obtain the grammar-corrected result. If the result of the first-step spelling correction is erroneous, the input of the grammar correction model is also an erroneous sentence, which greatly affects the performance of the grammar correction model. The syntax-enhanced text encoder model can therefore solve both types of text correction with a single model, greatly reducing the step-by-step propagation of errors caused by running models in series.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.

Claims (7)

1. A syntax analysis-based sentence grammar error correction method, comprising:
encoding an unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model;
wherein encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector, and pre-training the dependency prediction model on the first sentence vector according to the unlabeled corpus and the word dependency relationships to obtain the dependency prediction model, specifically comprises the following steps:
inputting the unlabeled corpus into a syntactic analyzer to obtain triples of the word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label;
encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector;
selecting word pairs according to a preset selection rule and the triples of the word dependency relationships of each sentence, and obtaining word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector comprises a first word vector and a second word vector;
performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model;
masking and replacing words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model;
establishing a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model;
wherein establishing the syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model specifically comprises the following steps:
establishing a first attention layer according to the output vectors of the word vectors in the linear layers and the activation function, with the formula as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Attention is the first attention layer, Q is the output vector of the word vectors at the first linear layer, K is the output vector of the word vectors at the second linear layer, V is the output vector of the word vectors at the third linear layer, softmax is the activation function, and d_k is the word-vector hidden-layer dimension;
and establishing a feedforward neural network layer through a two-layer fully-connected neural network on top of the attention layer, with the formula as follows:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 is the first weight matrix, W_2 is the second weight matrix, b_1 is the first bias term, b_2 is the second bias term, and X is the first attention layer output;
establishing the text encoder according to the first attention layer and the feedforward neural network layer;
constructing the syntactically enhanced text encoder from the text encoder, the dependency prediction model, and the mask language model;
performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector;
and decoding the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
2. The sentence grammar error correction method based on syntactic analysis according to claim 1, wherein performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector is specifically:
marking a grammar error correction data set of the field where the sentence to be corrected is located according to a data format of a preset triplet, and obtaining a field data set;
according to the field data set, adjusting the syntax enhanced text encoder to obtain a field text encoder;
and performing sentence encoding on the sentence to be corrected through the field text encoder to obtain the second sentence vector.
3. The sentence grammar error correction method based on syntactic analysis according to claim 1, wherein performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model, is specifically:
inputting the first word vector into a pooling layer to output a third word vector, and inputting the second word vector into the pooling layer to output a fourth word vector; inputting the third word vector and the fourth word vector into a classifier to predict the dependency relationship of the word pair, and calculating the loss value through the cross entropy loss function according to the dependency label and the predicted dependency relationship of the word pair;
and determining the gradient of the current cross entropy loss function according to the loss value, performing gradient descent to update the parameters of the dependency prediction model, and obtaining the dependency prediction model.
4. The sentence grammar error correction method based on syntactic analysis according to claim 1, wherein masking and replacing the words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model, is specifically:
selecting 15% of the words in each sentence of the unlabeled corpus to mask according to the first preset rule to obtain mask words, and dividing the mask words into three groups according to a preset proportion to obtain a first group of mask words, a second group of mask words and a third group of mask words;
replacing the first group of mask words with mask marks, leaving the second group of mask words unreplaced, and replacing each of the third group of mask words with a random word, to obtain the mask corpus;
and carrying out mask language model training on the mask corpus to obtain a mask language model.
5. The syntax analysis based sentence grammar error correction method of claim 1, wherein said decoding the second sentence vector by a syntax decoder obtains a corrected sentence corresponding to the sentence to be corrected, specifically:
establishing a second attention layer through a masking operation;
establishing a third attention layer according to the output vector of the linear layer and the activation function;
establishing a softmax layer that predicts the probability of the output word;
establishing the syntax decoder according to the second attention layer, the third attention layer and the softmax layer;
and decoding the second sentence vector through the syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
6. A syntax analysis based sentence grammar error correction system, comprising: a dependency prediction module, a mask language module, a syntax reinforcement module, an encoding module and a decoding module;
the dependency prediction module is used for encoding the unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model on the first sentence vector according to the unlabeled corpus and the word dependency relationships to obtain the dependency prediction model;
the dependency prediction module comprises a syntactic analysis unit, a basic encoding unit, a word pair unit and a training unit;
the syntactic analysis unit is used for inputting the unlabeled corpus into a syntactic analyzer to obtain triples of the word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label;
The basic coding unit is used for coding the non-labeling corpus through the basic pre-training model to obtain the first sentence vector;
the word pair unit is used for selecting word pairs according to a preset selection rule and the triples of word dependency relationships of each sentence, and obtaining word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector comprises a first word vector and a second word vector;
the training unit is used for performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model;
the mask language module is used for masking and replacing the words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model;
the syntax reinforcement module is used for establishing a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model;
wherein establishing the syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model specifically comprises the following steps:
establishing a first attention layer according to the output vectors of the word vectors in the linear layers and the activation function, with the formula as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Attention is the first attention layer, Q is the output vector of the word vectors at the first linear layer, K is the output vector of the word vectors at the second linear layer, V is the output vector of the word vectors at the third linear layer, softmax is the activation function, and d_k is the word-vector hidden-layer dimension;
and establishing a feedforward neural network layer through a two-layer fully-connected neural network on top of the attention layer, with the formula as follows:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 is the first weight matrix, W_2 is the second weight matrix, b_1 is the first bias term, b_2 is the second bias term, and X is the first attention layer output;
establishing the text encoder according to the first attention layer and the feedforward neural network layer;
constructing the syntactically enhanced text encoder from the text encoder, the dependency prediction model, and the mask language model;
the encoding module is used for performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector;
the decoding module is used for decoding the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
7. The sentence grammar error correction system based on syntactic analysis according to claim 6, wherein the encoding module comprises a labeling unit, an adjusting unit and a sentence encoding unit;
the labeling unit is used for labeling the grammar error correction data set of the field where the sentence to be corrected is located according to the data format of the preset triplet to obtain a field data set;
the adjusting unit is used for adjusting the syntax enhanced text encoder according to the field data set to obtain a field text encoder;
the sentence encoding unit is used for performing sentence encoding on the sentence to be corrected through the field text encoder to obtain the second sentence vector.
CN202211701494.7A 2022-12-29 2022-12-29 Sentence grammar error correction method and system based on syntactic analysis Active CN115935957B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211701494.7A CN115935957B (en) 2022-12-29 2022-12-29 Sentence grammar error correction method and system based on syntactic analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211701494.7A CN115935957B (en) 2022-12-29 2022-12-29 Sentence grammar error correction method and system based on syntactic analysis

Publications (2)

Publication Number Publication Date
CN115935957A (en) 2023-04-07
CN115935957B (en) 2023-10-13

Family

ID=86555785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211701494.7A Active CN115935957B (en) 2022-12-29 2022-12-29 Sentence grammar error correction method and system based on syntactic analysis

Country Status (1)

Country Link
CN (1) CN115935957B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116227466B (en) * 2023-05-06 2023-08-18 之江实验室 Sentence generation method, device and equipment with similar semantic different expressions
CN116775497B (en) * 2023-08-17 2023-11-14 北京遥感设备研究所 Database test case generation demand description coding method


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102013230B1 (en) * 2012-10-31 2019-08-23 십일번가 주식회사 Apparatus and method for syntactic parsing based on syntactic preprocessing

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062205A (en) * 2019-12-16 2020-04-24 北京大学 Dynamic mask training method in Chinese automatic grammar error correction
CN112364631A (en) * 2020-09-21 2021-02-12 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
WO2022121178A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Training method and apparatus and recognition method and apparatus for text error correction model, and computer device
CN112668313A (en) * 2020-12-25 2021-04-16 平安科技(深圳)有限公司 Intelligent sentence error correction method and device, computer equipment and storage medium
CN113609824A (en) * 2021-08-10 2021-11-05 上海交通大学 Multi-turn dialog rewriting method and system based on text editing and grammar error correction
CN114386403A (en) * 2022-01-07 2022-04-22 北京方寸无忧科技发展有限公司 Method and system for correcting multiple same wrong words in Chinese text
CN114970506A (en) * 2022-06-09 2022-08-30 广东外语外贸大学 Grammar error correction method and system based on multi-granularity grammar error template learning fine tuning
CN115034218A (en) * 2022-06-10 2022-09-09 哈尔滨福涛科技有限责任公司 Chinese grammar error diagnosis method based on multi-stage training and editing level voting

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Shamil Chollampatt et al.; A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction; pp. 1-8 *
Che Wanxiang et al.; Using Over-Training to Improve the Speed of the Joint POS Tagging and Dependency Parsing Model; Intelligent Computer and Applications; Vol. 4, No. 4; pp. 21-24 *
Shi Jianting et al.; Research on News Text Error Correction Based on Soft-Masked BERT; Computer Technology and Development; Vol. 32, No. 5; pp. 202-207 *

Also Published As

Publication number Publication date
CN115935957A (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN115935957B (en) Sentence grammar error correction method and system based on syntactic analysis
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN111563166B (en) Pre-training model method for classifying mathematical problems
CN110196913A (en) Multiple entity relationship joint abstracting method and device based on text generation formula
CN110688394B (en) NL generation SQL method for novel power supply urban rail train big data operation and maintenance
CN112559702B (en) Method for generating natural language problem in civil construction information field based on Transformer
CN113254616B (en) Intelligent question-answering system-oriented sentence vector generation method and system
CN114781377B (en) Error correction model, training and error correction method for non-aligned text
CN114648015B (en) Dependency relationship attention model-based aspect-level emotional word recognition method
CN114970503A (en) Word pronunciation and font knowledge enhancement Chinese spelling correction method based on pre-training
CN115630145A (en) Multi-granularity emotion-based conversation recommendation method and system
CN114154504A (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN117094325B (en) Named entity identification method in rice pest field
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN112199952A (en) Word segmentation method, multi-mode word segmentation model and system
CN116522165A (en) Public opinion text matching system and method based on twin structure
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT
CN116127978A (en) Nested named entity extraction method based on medical text
CN113642630B (en) Image description method and system based on double-path feature encoder
CN115270792A (en) Medical entity identification method and device
CN114297408A (en) Relation triple extraction method based on cascade binary labeling framework
CN113705222A (en) Slot recognition model training method and device and slot filling method and device
CN113672737A (en) Knowledge graph entity concept description generation system
JP2017182277A (en) Coding device, decoding device, discrete series conversion device, method and program
CN113553837A (en) Reading understanding model training method and device and text analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant