CN115935957B - Sentence grammar error correction method and system based on syntactic analysis - Google Patents


Info

Publication number
CN115935957B
Authority
CN
China
Prior art keywords
sentence
vector
word
dependency
mask
Prior art date
Legal status
Active
Application number
CN202211701494.7A
Other languages
Chinese (zh)
Other versions
CN115935957A (en)
Inventor
车万翔
孙博
王一轩
朱庆福
张斯尧
Current Assignee
Guangdong Nanfang Network Information Technology Co ltd
Original Assignee
Guangdong Nanfang Network Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Nanfang Network Information Technology Co ltd
Priority to CN202211701494.7A
Publication of CN115935957A
Application granted
Publication of CN115935957B
Legal status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a sentence grammar error correction method and system based on syntactic analysis. The method comprises: encoding an unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model on the first sentence vector according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model; masking and replacing words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model; building a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model; performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector; and decoding the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected. The embodiment enhances syntactic perception and semantic representation capability and improves the accuracy of sentence grammar correction.

Description

Sentence grammar error correction method and system based on syntactic analysis
Technical Field
The invention relates to the field of sentence grammar error correction, in particular to a sentence grammar error correction method and system based on syntactic analysis.
Background
Grammar error correction is an important component of the text correction field: given a Chinese sentence, a model outputs a grammatically corrected version of that sentence. Pre-training-based methods are a common class of text correction methods and mainly solve Chinese spelling errors, such as near-homophone character confusions (rendered literally in translation as the "tentacle (deposit) card" and "eye (mirror) snake" examples), which a large-scale pre-training model can resolve from its pre-training corpus.
In the prior art, text error correction is performed through a text encoder, and this approach has several defects. Although the BERT language model achieves strong results on natural language processing tasks, its pre-training stage involves no syntax-related pre-training task, so the model performs poorly on grammar error correction tasks; moreover, error correction models based on pre-training models depend too heavily on data and generalize poorly to corpora not covered by the training set. For some grammar-related text errors, methods based simply on a pre-trained model have great limitations. For example, in "the full-time staff discusses and listens to (listens to and discusses) reports", the sentence contains no spelling error, but the order of "listening" and "discussing" is reversed; such grammar errors are not solved by a large-scale pre-training corpus, similar examples are numerous, the pre-training corpus cannot cover them all, and the model generalizes poorly. The existing method for handling spelling errors and grammar errors together runs models in series: in the first step the text is input into a spelling correction model to obtain the spelling-corrected result, and in the second step that result is input into a grammar correction model to obtain the grammar-corrected result. If the result of the first-step spelling correction is wrong, the input to the grammar correction model is also a wrong sentence, which greatly degrades the performance of the grammar correction model and lowers the accuracy of grammar correction.
Disclosure of Invention
The invention provides a sentence grammar error correction method and system based on syntactic analysis, which enhance syntactic perception capability and semantic representation capability, reduce the step-by-step propagation of errors caused by running models in series, and improve the accuracy of sentence grammar correction.
In order to solve the above technical problems, an embodiment of the present invention provides a sentence grammar error correction method based on syntactic analysis, including:
encoding an unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model on the first sentence vector according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model;
masking and replacing words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model;
building a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model;
performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector;
and decoding the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
In the embodiment of the invention, the unlabeled corpus is encoded through the basic pre-training model to obtain the first sentence vector, and the dependency prediction model is pre-trained on the first sentence vector according to the unlabeled corpus and word dependency relationships; words of the unlabeled corpus are masked and replaced according to the first preset rule to obtain the mask corpus, on which the mask language model is trained; the syntax-enhanced text encoder is built according to the text encoder, the dependency prediction model and the mask language model; the sentence to be corrected is encoded by the syntax-enhanced text encoder to obtain a sentence vector; and the sentence vector is decoded by the syntax decoder to obtain the corrected sentence, completing sentence grammar correction. Pre-training the dependency prediction model effectively enhances syntactic perception and semantic representation capability, and through the mask language model the word vectors encoded by the model contain richer context information. Building the syntax-enhanced text encoder from the text encoder, the dependency prediction model and the mask language model avoids excessive dependence on data and improves generalization to corpora not covered by the training set; generalization is stronger under the same data and model parameters, grammar errors can be corrected on top of spelling-error correction, both types of text correction can be handled by a single model, the step-by-step propagation of errors caused by running models in series is reduced, and sentence grammar correction accuracy is improved.
As a preferred scheme, performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector is specifically:
marking a grammar error correction data set of the field where the sentence to be corrected is located according to a data format of a preset triplet to obtain a field data set;
adjusting the syntax-enhanced text encoder according to the field data set to obtain a field text encoder;
and performing sentence encoding on the sentence to be corrected with the field text encoder to obtain the second sentence vector.
In the embodiment of the invention, the model is first pre-trained and then fine-tuned on the field data set according to domain knowledge and the labeled field data; fine-tuning the syntax-enhanced text encoder improves sentence grammar correction in different fields and achieves accurate sentence correction.
As a preferred scheme, encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector, and pre-training the dependency prediction model on the first sentence vector according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model, is specifically:
inputting the unlabeled corpus into a syntactic analyzer to obtain triples of word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label;
encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector;
selecting word pairs according to a preset selection rule and the triples of word dependency relationships of each sentence, and obtaining word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector comprises a first word vector and a second word vector;
and performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model.
As a preferred scheme, performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model, is specifically:
inputting the first word vector into a pooling layer to output a third word vector, and inputting the second word vector into the pooling layer to output a fourth word vector; inputting the third word vector and the fourth word vector into a classifier to predict the dependency relationship of the word pair, and calculating the loss value through the cross entropy loss function according to the dependency label and the predicted dependency relationship of the word pair;
and determining the gradient of the current cross entropy loss function according to the loss value, performing gradient descent to update the parameters of the dependency prediction model, and obtaining the dependency prediction model.
As a preferred scheme, masking and replacing words of the unlabeled corpus according to the first preset rule to obtain the mask corpus, and training the mask corpus to obtain the mask language model, is specifically:
selecting 15% of the words in each sentence of the unlabeled corpus to mask according to the first preset rule to obtain mask words, and dividing the mask words into three groups according to a preset proportion to obtain a first group of mask words, a second group of mask words and a third group of mask words;
replacing the first group of mask words with mask marks, leaving the second group of mask words unreplaced, and replacing each of the third group of mask words with a random word, to obtain the mask corpus;
and carrying out mask language model training on the mask corpus to obtain a mask language model.
As a preferred scheme, a syntax-enhanced text encoder is built according to the text encoder, the dependency prediction model and the mask language model, specifically:
according to the output vectors of the word vectors in the linear layers and the activation function, a first attention layer is established, and the formula is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Attention is the first attention layer, Q is the output vector of the word vectors at the first linear layer, K is the output vector of the word vectors at the second linear layer, V is the output vector of the word vectors at the third linear layer, softmax is the activation function, and d_k is the word-vector hidden-layer dimension;
the feed-forward neural network layer is established through a two-layer fully connected neural network on top of the attention layer, and the formula is as follows:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 is the first weight matrix, W_2 is the second weight matrix, b_1 is the first bias term, b_2 is the second bias term, and X is the first attention layer output;
establishing a text encoder according to the first attention layer and the feedforward neural network layer;
a syntactically enhanced text encoder is constructed from the text encoder, the dependency prediction model, and the mask language model.
As a preferred scheme, decoding the second sentence vector by a syntax decoder to obtain a corrected sentence corresponding to the sentence to be corrected, specifically:
establishing a second attention layer through a masking operation;
establishing a third attention layer according to the output vector of the linear layer and the activation function;
establishing a softmax layer that predicts the probability of the output word;
establishing a syntax decoder according to the second attention layer, the third attention layer and the softmax layer;
And decoding the second sentence vector through the syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
In order to solve the same technical problem, the embodiment of the invention also provides a sentence grammar error correction system based on syntactic analysis, which comprises: a dependency prediction module, a mask language module, a syntax reinforcement module, an encoding module and a decoding module;
the dependency prediction module is used for encoding the unlabeled corpus through the basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model on the first sentence vector according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model;
the mask language module is used for masking and replacing words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model;
the syntax reinforcement module is used for building a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model;
the encoding module is used for performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector;
the decoding module is used for decoding the second sentence vector through the syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
Preferably, the encoding module includes a labeling unit, an adjusting unit and a sentence encoding unit;
the labeling unit is used for labeling the grammar error correction data set of the field where the sentence to be corrected is located according to the data format of the preset triples to obtain a field data set;
the adjusting unit is used for adjusting the syntax-enhanced text encoder according to the field data set to obtain a field text encoder;
the sentence encoding unit is used for performing sentence encoding on the sentence to be corrected through the field text encoder to obtain a second sentence vector.
Preferably, the dependency prediction module includes a syntactic analysis unit, a basic encoding unit, a word pair unit and a training unit;
the syntactic analysis unit is used for inputting the unlabeled corpus into the syntactic analyzer to obtain a triplet of word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label;
the basic encoding unit is used for encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector;
the word pair unit is used for selecting word pairs according to a preset selection rule and the triples of word dependency relationships of each sentence, and obtaining word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector comprises a first word vector and a second word vector;
the training unit is used for performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model.
Drawings
Fig. 1: a schematic flow chart of an embodiment of the sentence grammar error correction method based on syntactic analysis provided by the invention;
Fig. 2: a schematic diagram of the output of the dependency syntax analyzer in an embodiment of the sentence grammar error correction method based on syntactic analysis provided by the invention;
Fig. 3: a structure diagram of the syntax-enhanced text encoder in an embodiment of the sentence grammar error correction method based on syntactic analysis provided by the invention;
Fig. 4: a sentence correction flow chart of an embodiment of the sentence grammar error correction method based on syntactic analysis provided by the invention;
Fig. 5: a connection schematic diagram of another embodiment of the sentence grammar error correction system based on syntactic analysis provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, a schematic flow chart of a sentence grammar error correction method based on syntactic analysis is provided in an embodiment of the present invention. The sentence grammar error correction method is suitable for correcting the grammar of text sentences: sentence encoding is performed by a syntax-enhanced text encoder, which enhances syntactic perception capability and semantic representation capability, reduces the step-by-step propagation of errors caused by running models in series, and improves sentence grammar correction accuracy. The sentence grammar error correction method comprises steps 101 to 105, as follows:
step 101: encoding the non-labeling corpus through the basic pre-training model to obtain a first sentence vector, and pre-training the dependency prediction model according to the non-labeling corpus and the word dependency relationship to obtain the dependency prediction model.
Optionally, step 101 specifically includes steps 1011 to 1014, as follows:
step 1011: inputting the unlabeled corpus into a syntactic analyzer to obtain triples of word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label.
In order to make the encoded sentences imply richer syntactic knowledge, the dependency prediction model learns syntactic knowledge in the pre-training stage, which enhances the model's awareness of syntax. First, pre-training is performed on large-scale unlabeled data; the pre-training tasks comprise pre-training of the dependency prediction model and pre-training of the mask language model. The purpose of the dependency prediction model is to let the model learn additional syntactic knowledge, and the purpose of the mask language model is to make the word vectors encoded by the model contain richer context information. The syntactic knowledge is derived from the results of a dependency parser (syntactic analyzer), whose output is illustrated in fig. 2: the parser recognizes the dependency relationship (tag) between two words t_1 and t_2, forming a triplet (t_1, t_2, tag). The goal of the model is to acquire such dependency knowledge (t_1, t_2, tag).
In the process of establishing the syntax-enhanced text encoder, this embodiment first obtains the dependency prediction model through pre-training. To obtain the labels of the pre-training data, the large-scale unlabeled corpus {x_0, x_1, x_2, ..., x_n} is input into the syntactic analyzer to obtain the dependency syntax output {s_0, s_1, s_2, ..., s_n} for each sentence, where s_i is a series of word-dependency triples {a_0, a_1, a_2, ..., a_m} and a_j = (t_1, t_2, tag). The meaning of the triplet is: t_1 is a child of t_2, so t_2 is the parent node vector, t_1 is the child node vector, and the dependency label is tag;
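As a concrete illustration of step 1011, the sketch below shows how such triples might be collected; the patent does not name a specific parser, so the `parser.parse` interface returning (child, parent, tag) arcs is a hypothetical stand-in:

```python
# Sketch of step 1011 (label extraction), assuming a generic dependency parser.
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # a_j = (t1, t2, tag): t1 is the child of t2

def extract_dependency_triples(corpus: Iterable[str], parser) -> List[List[Triple]]:
    """Map each unlabeled sentence x_i to its dependency output s_i."""
    outputs = []
    for sentence in corpus:
        arcs = parser.parse(sentence)  # assumed to yield (child, parent, tag)
        outputs.append([(t1, t2, tag) for (t1, t2, tag) in arcs])
    return outputs
```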
step 1012: encoding the non-labeling corpus through a basic pre-training model to obtain a first sentence vector;
in this embodiment, the input of the text encoder's dependency prediction model pre-training process is a non-labeled corpus, taking a sentence containing n words as an example, in a syntactically enhanced pre-trained text encoder, the original sentence { w ] is encoded by an existing pre-training model (e.g., BERT, roBERTa) 1 ,w 2 …w n Through the existing pre-training mouldAfter coding, a sentence vector { t } of a sentence is obtained 1 ,t 2 …t n }。
Step 1013: selecting word pairs according to a preset selection rule and a triplet of word dependency relationship of each sentence, and obtaining word pair vectors according to first sentence vectors corresponding to the word pairs; wherein the term pair vector comprises a first term vector and a second term vector;
in this embodiment, some word pair vectors { t } are selected by a predetermined selection rule (e.g., random, ordered) based on the dependency triples obtained in step 1011 i ,t j First word vector is { t } i Second word vector is { t } j }。
Step 1014: and (3) pre-training the word pair vector by a dependency prediction model according to the dependency label, calculating a loss value through a cross entropy loss function, and updating the model parameters by gradient descent according to the loss value to update the parameters of the dependency prediction model so as to obtain the dependency prediction model.
Optionally, step 1014 specifically includes: performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model, specifically:
inputting the first word vector into a pooling layer to output a third word vector, and inputting the second word vector into the pooling layer to output a fourth word vector; inputting the third word vector and the fourth word vector into a classifier to predict the dependency relationship of the word pair, and calculating the loss value through the cross entropy loss function according to the dependency label and the predicted dependency relationship of the word pair; and determining the gradient of the current cross entropy loss function according to the loss value, performing gradient descent to update the parameters of the dependency prediction model, and obtaining the dependency prediction model.
In this embodiment, the text encoder outputs for the two words are input to the pooling layer, and the output of the pooling layer is used as the input of the classifier; by letting the model predict the dependency relationships of word pairs, the syntactic perceptibility of the model is enhanced. For a selected word pair, word 1 (the first word vector) t_i, ..., t_{i+m} and word 2 (the second word vector) t_j, ..., t_{j+k}, the two word vectors are respectively input into the pooling layer to obtain the vectors corresponding to the two words: the third word vector t_1 = pooling(t_i, ..., t_{i+m}) and the fourth word vector t_2 = pooling(t_j, ..., t_{j+k}). After obtaining the word pair representation (t_1, t_2), it is input into the classifier, classifier(t_1, t_2), to predict the dependency relationship; the label is derived from step 1011, and the loss value loss is calculated using the cross entropy loss function, with the formula as follows:
L = −(1/N) Σ_i Σ_c y_ic · log(p_ic)

where L is the loss value loss, N is the number of samples (summed over i), M is the number of categories (summed over c), y_ic is an indicator function (0 or 1) that takes 1 if the true class of sample i equals c and 0 otherwise, and p_ic is the predicted probability that sample i belongs to category c.
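The following PyTorch sketch makes steps 1013 and 1014 concrete; mean-pooling and a linear classifier over the concatenated pair are assumptions, since the patent fixes neither the pooling function nor the classifier form:

```python
# Sketch of the dependency prediction head (step 1014).
import torch
import torch.nn as nn

class DependencyPredictionHead(nn.Module):
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden_size, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()  # the cross entropy loss above

    def forward(self, span1, span2, labels):
        # span1: (batch, m+1, hidden) subword vectors t_i .. t_{i+m} of word 1
        # span2: (batch, k+1, hidden) subword vectors t_j .. t_{j+k} of word 2
        t1 = span1.mean(dim=1)  # third word vector, pooling(t_i .. t_{i+m})
        t2 = span2.mean(dim=1)  # fourth word vector, pooling(t_j .. t_{j+k})
        logits = self.classifier(torch.cat([t1, t2], dim=-1))
        return self.loss_fn(logits, labels)  # loss value for gradient descent
```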
In this embodiment, the loss value loss is calculated using the output of the classifier and the labels from step 1011, and the model parameters are then updated by gradient descent. The gradient descent process is as follows:

a. Determine the gradient of the loss function at the current position: for each parameter θ_i, the gradient is ∂L/∂θ_i.

b. Multiply the gradient of the loss function by the step length α to obtain the descent distance at the current position, i.e., α · ∂L/∂θ_i.

c. Determine whether the gradient descent distance of every θ_i is smaller than a threshold ε; if so, the algorithm terminates, and the current θ_i (i = 0, 1, ..., n) are the final result; otherwise go to step d.

d. Update every θ_i as θ_i := θ_i − α · ∂L/∂θ_i; after the update is finished, continue with step a.
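In framework terms, steps a to d correspond to one optimizer update; the sketch below assumes the `encoder` and the head class from the earlier sketches, and plain SGD with an illustrative learning rate and label count:

```python
# One gradient-descent update (steps a to d); hyperparameters are illustrative.
head = DependencyPredictionHead(encoder.config.hidden_size, num_labels=40)
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-4)

def pretraining_step(span1, span2, labels):
    loss = head(span1, span2, labels)  # cross entropy loss value
    optimizer.zero_grad()
    loss.backward()                    # step a: gradients dL/d(theta_i)
    optimizer.step()                   # steps b and d: theta_i -= lr * gradient
    return loss.item()                 # monitored against the stop criterion (c)
```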
Step 102: masking and replacing words of the unlabeled corpus according to the first preset rule to obtain the mask corpus, and performing mask language model training on the mask corpus to obtain the mask language model.
Optionally, step 102 specifically includes: selecting 15% of the words in each sentence of the unlabeled corpus to mask according to the first preset rule to obtain mask words, and dividing the mask words into three groups according to a preset proportion to obtain a first group of mask words, a second group of mask words and a third group of mask words; replacing the first group of mask words with mask marks, leaving the second group of mask words unreplaced, and replacing each of the third group of mask words with a random word, to obtain the mask corpus; and performing mask language model training on the mask corpus to obtain the mask language model.
In this embodiment, to enhance the context representation capability of the mask language model, training proceeds as follows: 15% of the words in each sentence are randomly selected as mask words and divided into three groups according to the preset proportion (8:1:1). Of the mask words, 80% (the first group) are replaced with the actual [MASK] mark, 10% (the second group) are left unchanged, and 10% (the third group) are each replaced with a random word, yielding the mask corpus. The mask language model is then trained on the mask corpus, with a training process similar to the dependency prediction model training method. The loss of the mask language model is that of a multi-category classification task, so the cross entropy loss function of step 1014 is applied to calculate its loss value, and the sum of the loss values of the dependency prediction and mask language model training tasks is used as the total output loss.
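A sketch of this 15% and 8:1:1 masking rule follows; operating on a pre-tokenized word list and drawing random replacements from a plain vocabulary list are simplifications for illustration:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_rate=0.15):
    """Return (masked tokens, labels); labels keep the original masked words."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:   # only ~15% of words are selected
            continue
        labels[i] = tok                    # the model must recover this word
        r = random.random()
        if r < 0.8:                        # first group: replace with mask mark
            masked[i] = mask_token
        elif r < 0.9:                      # second group: leave unchanged
            pass
        else:                              # third group: replace with random word
            masked[i] = random.choice(vocab)
    return masked, labels
```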
Step 103: a syntactically enhanced text encoder is built from the text encoder, the dependency prediction model, and the mask language model.
Optionally, step 103 specifically includes: establishing a first attention layer according to the output vectors of the word vectors in the linear layers and the activation function, with the formula as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Attention is the first attention layer, Q is the output vector of the word vectors at the first linear layer, K is the output vector of the word vectors at the second linear layer, V is the output vector of the word vectors at the third linear layer, softmax is the activation function, and d_k is the word-vector hidden-layer dimension;
the feed-forward neural network layer is established through a two-layer fully connected neural network on top of the attention layer, with the formula as follows:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 is the first weight matrix, W_2 is the second weight matrix, b_1 is the first bias term, b_2 is the second bias term, and X is the first attention layer output;
establishing a text encoder according to the first attention layer and the feedforward neural network layer;
a syntactically enhanced text encoder is constructed from the text encoder, the dependency prediction model, and the mask language model.
In this embodiment, as shown in fig. 3, in the application of actual sentence grammar correction, the trained syntax-enhanced text encoder is used to encode sentences. The syntax-enhanced text encoder structure comprises an attention mechanism (the first attention layer) and a feedforward neural network layer. The output of the attention mechanism (first attention layer) is:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q, K, V are respectively the output vectors of the word vectors at three linear layers, softmax is the activation function, and d_k is the word-vector hidden-layer dimension. The feedforward neural network layer is simpler: a two-layer fully connected network whose first layer uses the ReLU activation function and whose second layer applies no activation function, with the formula:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 and W_2 are weight matrices, b_1 and b_2 are bias terms, and X is the output of the attention layer (first attention layer).
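Put together, the two formulas correspond to the following PyTorch sketch of one encoder block; single-head attention and the omission of residual connections and layer normalization are simplifications for clarity:

```python
import math
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # first linear layer  -> Q
        self.k_proj = nn.Linear(d_model, d_model)  # second linear layer -> K
        self.v_proj = nn.Linear(d_model, d_model)  # third linear layer  -> V
        self.w1 = nn.Linear(d_model, d_ff)         # W_1, b_1 (ReLU layer)
        self.w2 = nn.Linear(d_ff, d_model)         # W_2, b_2 (no activation)

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        attn_out = torch.softmax(scores, dim=-1) @ v
        # max(0, X W_1 + b_1) W_2 + b_2
        return self.w2(torch.relu(self.w1(attn_out)))
```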
Step 104: and sentence coding is carried out on the sentence to be corrected through a syntactic reinforcement text encoder, and a second sentence vector is obtained.
Optionally, step 104 specifically includes: labeling the grammar error correction data set of the field where the sentence to be corrected is located according to the data format of a preset triplet to obtain a field data set; adjusting the syntax-enhanced text encoder according to the field data set to obtain a field text encoder; and performing sentence encoding on the sentence to be corrected with the field text encoder to obtain a second sentence vector.
In this embodiment, as shown in fig. 4, in order to make the encoded sentence contain richer syntactic knowledge, step a is implemented by steps 101 to 103: the pre-training tasks comprise the dependency prediction model and the mask language model, where the dependency prediction model aims to let the model learn additional syntactic knowledge, and the mask language model aims to make the word vectors encoded by the model contain richer context information. In step b, to adapt to the grammar error correction task, the grammar error correction data set of the field is labeled to obtain the field data set, in which each datum is a triplet of the format (sequence number, sentence containing a grammar error, corrected sentence), for example: (1, the whole plant staff discusses and listens to the report, the whole plant staff listens to and discusses the report). The field data set is used for fine-tuning the model; if no field data set is available, the pre-trained syntax-enhanced text encoder can encode the sentence to be corrected directly to obtain the corresponding sentence vector, and decoding then continues to obtain the corrected sentence and complete sentence grammar correction. Step c encodes the sentence to be corrected with the pre-trained field text encoder to obtain the second sentence vector, step d decodes using the second sentence vector encoded in step c, and step e finally outputs the corrected sentence. A sketch of the field data format follows.
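The triplet data format of step b can be sketched as below; the class name and field names are illustrative, and the example is the patent's word-order example in its translated form:

```python
from dataclasses import dataclass

@dataclass
class FieldExample:
    seq_no: int   # sequence number
    source: str   # sentence containing a grammar error
    target: str   # corrected sentence

# The patent's word-order example (translated), as used for fine-tuning:
field_data = [
    FieldExample(
        1,
        "the whole plant staff discusses and listens to the report",
        "the whole plant staff listens to and discusses the report",
    ),
]
```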
Step 105: and decoding the second sentence vector through a syntax decoder to obtain a correction sentence corresponding to the sentence to be corrected.
Optionally, step 105 specifically includes: establishing a second attention layer through a masking operation;
establishing a third attention layer according to the output vector of the linear layer and the activation function;
establishing a softmax layer that predicts the probability of the output word;
establishing a syntax decoder according to the second attention layer, the third attention layer and the softmax layer;
and decoding the second sentence vector through a syntax decoder to obtain a correction sentence corresponding to the sentence to be corrected.
In this embodiment, the corrected sentence is output by the decoder decoding the encoded second sentence vector. The decoder structure comprises two attention layers: the first (the second attention layer) uses a masking operation, and the second (the third attention layer) is similar to the first attention layer in step 104; a final softmax layer calculates the probability of the next output word. The decoding process can adopt a greedy search or beam search strategy, and the specific decoding process is as follows: the decoder takes a start token as input, passes it through the self-attention layer, then through the encoder-decoder attention layer, and then through the feed-forward layer to output the final representation of the token; a linear layer plus a softmax layer then predict which word in the dictionary to output. The decoder decodes only one word at each step; it outputs the word, appends it to the decoder input, and repeats the above operation until <eos> is decoded. When decoding ends, the sentence output is finished, and the corrected sentence corresponding to the sentence to be corrected is finally output.
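A greedy-search sketch of this loop; the `decoder` callable returning next-word logits from the encoder states and the tokens generated so far is an assumed interface, not the patent's fixed API:

```python
def greedy_decode(decoder, encoder_states, bos_id, eos_id, max_len=128):
    tokens = [bos_id]                              # start token
    for _ in range(max_len):
        logits = decoder(encoder_states, tokens)   # scores over the dictionary
        next_id = int(logits.argmax())             # most probable next word
        tokens.append(next_id)                     # feed it back as decoder input
        if next_id == eos_id:                      # stop once <eos> is decoded
            break
    return tokens[1:]
```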
As an example of this embodiment, take the sentence to be corrected (translated from Chinese): "For your care of me, a lot is paid." The text is encoded by the syntax-enhanced text encoder using the dependency syntax, then decoded with the beam search strategy, and the corrected sentence is output: "Your effort to take care of me is very much."
In the embodiment of the invention, the unlabeled corpus is encoded through the basic pre-training model to obtain the first sentence vector, and the dependency prediction model is pre-trained on the first sentence vector according to the unlabeled corpus and word dependency relationships; words of the unlabeled corpus are masked and replaced according to the first preset rule to obtain the mask corpus, on which the mask language model is trained; the syntax-enhanced text encoder is built according to the text encoder, the dependency prediction model and the mask language model; the sentence to be corrected is encoded by the syntax-enhanced text encoder to obtain a sentence vector; and the sentence vector is decoded by the syntax decoder to obtain the corrected sentence, completing sentence grammar correction. Pre-training the dependency prediction model effectively enhances syntactic perception and semantic representation capability, and through the mask language model the word vectors encoded by the model contain richer context information. Building the syntax-enhanced text encoder from the text encoder, the dependency prediction model and the mask language model avoids excessive dependence on data and improves generalization to corpora not covered by the training set; generalization is stronger under the same data and model parameters, grammar errors can be corrected on top of spelling-error correction, both types of text correction can be handled by a single model, the step-by-step propagation of errors caused by running models in series is reduced, and sentence grammar correction accuracy is improved.
Example two
Accordingly, referring to fig. 5, fig. 5 is a schematic connection diagram of a second embodiment of a sentence grammar error correction system based on syntactic analysis according to the present invention. As shown in fig. 5, the syntax analysis based sentence syntax error correction system includes a dependency prediction module 501, a mask language module 502, a syntax reinforcement module 503, an encoding module 504, and a decoding module 505;
the dependency prediction module 501 is configured to encode the unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-train the first sentence vector into a dependency prediction model according to the unlabeled corpus and the word dependency relationship to obtain the dependency prediction model.
The dependency prediction module 501 includes: a syntax analysis unit 5011, a base encoding unit 5012, a word pair unit 5013, and a training unit 5014.
The syntactic analysis unit 5011 is used for inputting the unlabeled corpus into a syntactic analyzer to obtain triples of word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label.
The basic encoding unit 5012 is configured to encode the unlabeled corpus through the basic pre-training model, to obtain the first sentence vector.
The word pair unit 5013 is configured to select word pairs according to a preset selection rule and the triples of word dependency relationships of each sentence, and to obtain word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector includes a first word vector and a second word vector.
The training unit 5014 is configured to perform dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculate a loss value through a cross entropy loss function, and perform gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model.
The mask language module 502 is configured to mask and replace words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and to perform mask language model training on the mask corpus to obtain a mask language model.
The syntax reinforcement module 503 is configured to build a syntax-enhanced text encoder based on the text encoder, the dependency prediction model, and the mask language model.
The encoding module 504 is configured to perform sentence encoding on the sentence to be corrected by using a syntactically enhanced text encoder, so as to obtain a second sentence vector.
The encoding module 504 includes: a labeling unit 5041, an adjusting unit 5042, and a sentence encoding unit 5043.
The labeling unit 5041 is configured to label, according to a data format of a preset triplet, a grammar error correction data set of a domain where a sentence to be corrected is located, and obtain a domain data set.
The adjusting unit 5042 is configured to adjust the syntax enhanced text encoder according to the domain data set, to obtain a domain text encoder.
The sentence coding unit 5043 is configured to perform sentence coding on the sentence to be corrected by using the field text encoder, so as to obtain a second sentence vector.
The decoding module 505 is configured to decode the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
By implementing the embodiment of the invention, the following defects of the prior art are addressed: existing error correction models based on pre-training models depend too heavily on data and generalize poorly to corpora not covered by the training set. With grammar correction reinforced by syntactic analysis and pre-training, a syntax-enhanced text encoder is established for sentence grammar correction, which effectively enhances the syntactic perception capability and semantic representation capability of the model; generalization is stronger under the same data and model parameters, the grammar correction problem can be solved on top of correcting spelling errors, and the phenomenon of error accumulation in a pipeline system is greatly reduced. The existing method for solving spelling errors and grammar errors runs models in series: in the first step the text is input into a spelling correction model to obtain the spelling-corrected result, and in the second step that result is input into a grammar correction model to obtain the grammar-corrected result. If the result of the first-step spelling correction is erroneous, the input of the grammar correction model is also an erroneous sentence, which greatly affects the performance of the grammar correction model. The syntax-enhanced text encoder model can therefore solve both types of text correction with a single model, greatly reducing the step-by-step propagation of errors caused by running models in series.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.

Claims (7)

1. A syntax analysis-based sentence grammar error correction method, comprising:
encoding an unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model according to the unlabeled corpus and word dependency relationships to obtain the dependency prediction model;
wherein encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector, and pre-training the dependency prediction model on the first sentence vector according to the unlabeled corpus and the word dependency relationships to obtain the dependency prediction model, specifically comprises the following steps:
inputting the unlabeled corpus into a syntactic analyzer to obtain triples of the word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label;
encoding the unlabeled corpus through the basic pre-training model to obtain the first sentence vector;
selecting word pairs according to a preset selection rule and the triples of the word dependency relationships of each sentence, and obtaining word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector comprises a first word vector and a second word vector;
performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model;
masking and replacing words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model;
establishing a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model;
wherein establishing the syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model specifically comprises the following steps:
establishing a first attention layer according to the output vectors of the word vectors in the linear layers and the activation function, with the formula as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Attention is the first attention layer, Q is the output vector of the word vectors at the first linear layer, K is the output vector of the word vectors at the second linear layer, V is the output vector of the word vectors at the third linear layer, softmax is the activation function, and d_k is the word-vector hidden-layer dimension;
and establishing a feedforward neural network layer through a two-layer fully-connected neural network on top of the attention layer, with the formula as follows:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 is the first weight matrix, W_2 is the second weight matrix, b_1 is the first bias term, b_2 is the second bias term, and X is the first attention layer output;
establishing the text encoder according to the first attention layer and the feedforward neural network layer;
constructing the syntactically enhanced text encoder from the text encoder, the dependency prediction model, and the mask language model;
performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector;
and decoding the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
2. The sentence grammar error correction method based on syntactic analysis according to claim 1, wherein performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector is specifically:
marking a grammar error correction data set of the field where the sentence to be corrected is located according to a data format of a preset triplet, and obtaining a field data set;
according to the field data set, adjusting the syntax enhanced text encoder to obtain a field text encoder;
and performing sentence encoding on the sentence to be corrected through the field text encoder to obtain the second sentence vector.
3. The sentence grammar error correction method based on syntactic analysis according to claim 1, wherein performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model, is specifically:
inputting the first word vector into a pooling layer to output a third word vector, and inputting the second word vector into the pooling layer to output a fourth word vector; inputting the third word vector and the fourth word vector into a classifier to predict the dependency relationship of the word pair, and calculating the loss value through the cross entropy loss function according to the dependency label and the predicted dependency relationship of the word pair;
and determining the gradient of the current cross entropy loss function according to the loss value, performing gradient descent to update the parameters of the dependency prediction model, and obtaining the dependency prediction model.
4. The sentence grammar error correction method based on syntactic analysis according to claim 1, wherein masking and replacing the words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model, is specifically:
selecting 15% of the words in each sentence of the unlabeled corpus to mask according to the first preset rule to obtain mask words, and dividing the mask words into three groups according to a preset proportion to obtain a first group of mask words, a second group of mask words and a third group of mask words;
replacing the first group of mask words with mask marks, leaving the second group of mask words unreplaced, and replacing each of the third group of mask words with a random word, to obtain the mask corpus;
and carrying out mask language model training on the mask corpus to obtain a mask language model.
5. The syntax analysis based sentence grammar error correction method of claim 1, wherein said decoding the second sentence vector by a syntax decoder obtains a corrected sentence corresponding to the sentence to be corrected, specifically:
establishing a second attention layer through a masking operation;
establishing a third attention layer according to the output vector of the linear layer and the activation function;
establishing a softmax layer that predicts the probability of the output word;
establishing the syntax decoder according to the second attention layer, the third attention layer and the softmax layer;
and decoding the second sentence vector through the syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
6. A syntax analysis based sentence grammar error correction system, comprising: a dependency prediction module, a mask language module, a syntax reinforcement module, an encoding module and a decoding module;
the dependency prediction module is used for encoding the unlabeled corpus through a basic pre-training model to obtain a first sentence vector, and pre-training a dependency prediction model on the first sentence vector according to the unlabeled corpus and the word dependency relationships to obtain the dependency prediction model;
the dependency prediction module comprises a syntactic analysis unit, a basic encoding unit, a word pair unit and a training unit;
the syntactic analysis unit is used for inputting the unlabeled corpus into a syntactic analyzer to obtain triples of the word dependency relationship of each sentence; wherein the triplet includes a parent node vector, a child node vector, and a dependency label;
The basic coding unit is used for coding the non-labeling corpus through the basic pre-training model to obtain the first sentence vector;
the word pair unit is used for selecting word pairs according to a preset selection rule and the triples of word dependency relationships of each sentence, and obtaining word pair vectors according to the first sentence vectors corresponding to the word pairs; wherein each word pair vector comprises a first word vector and a second word vector;
the training unit is used for performing dependency prediction model pre-training on the word pair vectors according to the dependency labels, calculating a loss value through a cross entropy loss function, and performing gradient descent according to the loss value to update the parameters of the dependency prediction model, obtaining the dependency prediction model;
the mask language module is used for masking and replacing the words of the unlabeled corpus according to a first preset rule to obtain a mask corpus, and performing mask language model training on the mask corpus to obtain a mask language model;
the syntax reinforcement module is used for establishing a syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model;
wherein establishing the syntax-enhanced text encoder according to the text encoder, the dependency prediction model and the mask language model specifically comprises the following steps:
establishing a first attention layer according to the output vectors of the word vectors in the linear layers and the activation function, with the formula as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Attention is the first attention layer, Q is the output vector of the word vectors at the first linear layer, K is the output vector of the word vectors at the second linear layer, V is the output vector of the word vectors at the third linear layer, softmax is the activation function, and d_k is the word-vector hidden-layer dimension;
and establishing a feedforward neural network layer through a two-layer fully-connected neural network on top of the attention layer, with the formula as follows:

max(0, XW_1 + b_1)W_2 + b_2

wherein W_1 is the first weight matrix, W_2 is the second weight matrix, b_1 is the first bias term, b_2 is the second bias term, and X is the first attention layer output;
establishing the text encoder according to the first attention layer and the feedforward neural network layer;
constructing the syntactically enhanced text encoder from the text encoder, the dependency prediction model, and the mask language model;
the encoding module is used for performing sentence encoding on the sentence to be corrected through the syntax-enhanced text encoder to obtain a second sentence vector;
the decoding module is used for decoding the second sentence vector through a syntax decoder to obtain the corrected sentence corresponding to the sentence to be corrected.
7. The sentence grammar error correction system based on syntactic analysis according to claim 6, wherein the encoding module comprises a labeling unit, an adjusting unit and a sentence encoding unit;
the labeling unit is used for labeling the grammar error correction data set of the field where the sentence to be corrected is located according to the data format of the preset triplet to obtain a field data set;
the adjusting unit is used for adjusting the syntax enhanced text encoder according to the field data set to obtain a field text encoder;
the sentence encoding unit is used for performing sentence encoding on the sentence to be corrected through the field text encoder to obtain the second sentence vector.
CN202211701494.7A 2022-12-29 2022-12-29 Sentence grammar error correction method and system based on syntactic analysis Active CN115935957B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211701494.7A CN115935957B (en) 2022-12-29 2022-12-29 Sentence grammar error correction method and system based on syntactic analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211701494.7A CN115935957B (en) 2022-12-29 2022-12-29 Sentence grammar error correction method and system based on syntactic analysis

Publications (2)

Publication Number Publication Date
CN115935957A (en) 2023-04-07
CN115935957B (en) 2023-10-13

Family

ID=86555785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211701494.7A Active CN115935957B (en) 2022-12-29 2022-12-29 Sentence grammar error correction method and system based on syntactic analysis

Country Status (1)

Country Link
CN (1) CN115935957B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116227466B (en) * 2023-05-06 2023-08-18 之江实验室 Sentence generation method, device and equipment with similar semantic different expressions
CN116775497B (en) * 2023-08-17 2023-11-14 北京遥感设备研究所 Database test case generation demand description coding method


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102013230B1 (en) * 2012-10-31 2019-08-23 십일번가 주식회사 Apparatus and method for syntactic parsing based on syntactic preprocessing

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062205A (en) * 2019-12-16 2020-04-24 北京大学 Dynamic mask training method in Chinese automatic grammar error correction
CN112364631A (en) * 2020-09-21 2021-02-12 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
WO2022121178A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Training method and apparatus and recognition method and apparatus for text error correction model, and computer device
CN112668313A (en) * 2020-12-25 2021-04-16 平安科技(深圳)有限公司 Intelligent sentence error correction method and device, computer equipment and storage medium
CN113609824A (en) * 2021-08-10 2021-11-05 上海交通大学 Multi-turn dialog rewriting method and system based on text editing and grammar error correction
CN114386403A (en) * 2022-01-07 2022-04-22 北京方寸无忧科技发展有限公司 Method and system for correcting multiple same wrong words in Chinese text
CN114970506A (en) * 2022-06-09 2022-08-30 广东外语外贸大学 Grammar error correction method and system based on multi-granularity grammar error template learning fine tuning
CN115034218A (en) * 2022-06-10 2022-09-09 哈尔滨福涛科技有限责任公司 Chinese grammar error diagnosis method based on multi-stage training and editing level voting

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Shamil Chollampatt et al.; A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction; pp. 1-8 *
Che Wanxiang et al.; Using Over-Training to Improve the Speed of the Joint POS Tagging and Dependency Parsing Model; Intelligent Computer and Applications; Vol. 4, No. 4; pp. 21-24 *
Shi Jianting et al.; Research on News Text Error Correction Based on Soft-Masked BERT; Computer Technology and Development; Vol. 32, No. 5; pp. 202-207 *

Also Published As

Publication number Publication date
CN115935957A (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN115935957B (en) Sentence grammar error correction method and system based on syntactic analysis
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN111563166B (en) Pre-training model method for classifying mathematical problems
CN110196913A (en) Multiple entity relationship joint abstracting method and device based on text generation formula
CN110688394B (en) NL generation SQL method for novel power supply urban rail train big data operation and maintenance
CN112559702B (en) Method for generating natural language problem in civil construction information field based on Transformer
CN113254616B (en) Intelligent question-answering system-oriented sentence vector generation method and system
CN114781377B (en) Error correction model, training and error correction method for non-aligned text
CN114648015B (en) Dependency relationship attention model-based aspect-level emotional word recognition method
CN114970503A (en) Word pronunciation and font knowledge enhancement Chinese spelling correction method based on pre-training
CN115630145A (en) Multi-granularity emotion-based conversation recommendation method and system
CN114154504A (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN117094325B (en) Named entity identification method in rice pest field
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN112199952A (en) Word segmentation method, multi-mode word segmentation model and system
CN116522165A (en) Public opinion text matching system and method based on twin structure
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT
CN116127978A (en) Nested named entity extraction method based on medical text
CN113642630B (en) Image description method and system based on double-path feature encoder
CN115270792A (en) Medical entity identification method and device
CN114297408A (en) Relation triple extraction method based on cascade binary labeling framework
CN113705222A (en) Slot recognition model training method and device and slot filling method and device
CN113672737A (en) Knowledge graph entity concept description generation system
JP2017182277A (en) Coding device, decoding device, discrete series conversion device, method and program
CN113553837A (en) Reading understanding model training method and device and text analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant