CN117151084B - Chinese spelling and grammar error correction method, storage medium and equipment - Google Patents
Chinese spelling and grammar error correction method, storage medium and equipment
- Publication number
- CN117151084B CN202311425616.9A
- Authority
- CN
- China
- Prior art keywords
- loss
- sequence
- spelling
- length
- error correction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of language processing, and particularly relates to a Chinese spelling and grammar error correction method, a storage medium and equipment, which can detect and correct spelling errors and grammar errors in an input text. Whereas the original RoBERTa model can only handle spelling error correction tasks, the improved model adds a generator so that text spelling errors and text grammar errors are corrected at the same time, which significantly improves error correction efficiency.
Description
Technical Field
The invention belongs to the technical field of language processing, and particularly relates to a Chinese spelling and grammar error correction method, a storage medium, and equipment.
Background
In conventional natural language processing, spelling correction and grammar correction are typically handled as two separate tasks. Spelling error correction focuses on detecting and correcting spelling errors, while grammar error correction aims at repairing grammatical errors. However, processing them separately can result in loss of information and accumulation of errors. Moreover, the native RoBERTa model can only handle text spelling error correction and cannot detect grammar errors at the same time.
Disclosure of Invention
In order to handle spelling and grammar error correction simultaneously, the present application provides a new RoBERTa-based method that realizes unified detection and correction of text spelling errors and grammar errors. By feeding text containing errors into the new model, more comprehensive context information can be obtained and compared against the correct text to discover and correct spelling and grammar errors at the same time. This integrated approach better handles complex error cases in text and achieves higher accuracy and robustness. The technical solution is as follows.
A Chinese spelling and grammar error correction method comprises the following steps:
S1. Using a RoBERTa encoder model, encode the input sequence X = (x_1, x_2, ..., x_n) to obtain the output sequence H = (h_1, h_2, ..., h_n), where x_n is the token at the n-th position of the input sequence X and h_n is the token representation at the n-th position of the output sequence H;
S2. Add a CNN convolutional layer after the RoBERTa encoder output sequence H, and extract the local features C of the encoder output through convolution kernels to obtain a local feature tensor; fuse the local feature tensor with the encoder output sequence H through a residual connection to obtain the fused semantic representation sequence H';
S3. Perform a max pooling operation on the fused semantic representation sequence H' to obtain a fixed-length representation vector V;
S4. Pass the representation vector V into a fully connected layer to obtain the prediction distribution of the target sequence length;
S5. Input the encoder output and the target word into a decoder module; by combining an attention mechanism and a pointer network, the decoder simultaneously corrects spelling errors and repairs grammar errors.
Preferably, in step S2, the output local features C are extracted by the following formula:
C=Conv1D(H);
wherein Conv1D is a 1-dimensional convolution function.
Preferably, in step S2, the extracted local feature tensor and the output sequence H of the RoBERTa model are combined and fused through a residual connection to obtain the fused semantic representation sequence H' = (h'_1, h'_2, ..., h'_n), and the fused semantic representation sequence H' is taken as the input of step S3.
Preferably, in step S4, target sequence length prediction is performed on the pooled representation vector V through the fully connected layer to obtain the prediction distribution p_len:

p_len = WV + b;

where W is the fully connected layer weight and b is the bias term.
Preferably, in step S5:

S51. Calculate the attention weight a_t:

a_t = softmax(e_t), e_ti = v^T tanh(W_h h_i + W_s s_{t-1} + b_attn)

where h_i is the encoder output at the i-th position, b_attn is a bias parameter, v is a learnable weight vector for mapping the context information in the attention mechanism to the appropriate dimension, and W_h and W_s are learnable weight matrices that map h_i and the decoder state s_{t-1} of the previous time step to the appropriate dimension;
S52. Based on a_t, generate the context vector c_t:

c_t = Σ_i a_ti * h_i

where a_ti denotes the attention weight on the i-th position of the input sequence at time t, and h_i denotes the RoBERTa semantic features of the i-th position;
S53. Taking c_t as input, update the decoder state s_t:

s_t = RNN([s_{t-1}, c_t])
S54. From s_t and c_t, calculate the probability distribution p_vocab of generated words, and based on H, generate the copy probability distribution p_copy over the input sequence:

p_vocab = softmax(Linear([c_t, s_t]))

p_copy = sigmoid(Linear(H))
S55. Generate the final distribution p using the pointer mechanism:

p = p_copy * a_t + (1 - p_copy) * p_vocab.
Preferably, in step S5, a loss function Loss is calculated, comprising a generation loss, a pointer network loss, and a length prediction loss; the calculation process is as follows:

Calculate the generation loss, i.e., the cross entropy between the decoder's predicted distribution over the vocabulary and the target word:

loss1 = CrossEntropyLoss(p, y);

where y is the target word;

Calculate the pointer network loss, i.e., the loss of directly copying original words with the pointer network, computed as the cross entropy between the attention/copy distribution and the target word:

loss2 = CrossEntropyLoss(p_copy, y);

Calculate the length prediction loss, i.e., the loss between the predicted length and the target length:

loss3 = L1Loss(p_len, l_tag)

where p_len is the predicted length and l_tag is the target length;
Synthesize the loss function Loss:

Loss = w_1 * loss1 + w_2 * loss2 + w_3 * loss3

where w_1, w_2 and w_3 are loss weights that are trained together with the model.
A computer readable storage medium contains stored program instructions which, when executed, perform the above RoBERTa-based spelling and grammar error correction method.
An electronic device comprises a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, perform the steps of the above RoBERTa-based spelling and grammar error correction method.
Compared with the prior art, the beneficial effects of the application are as follows:
the invention realizes the unification of text spelling check and grammar check by adding the modules such as the convolution layer, residual error connection, pointer network and the like on the RoBERTa model, optimizes the error correction performance and has obvious technical progress.
Drawings
Fig. 1 is a flow chart of the present application.
Detailed Description
The following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application.
RoBERTa was proposed in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach". It is an enhanced version of BERT: the model architecture is unchanged, but the pretraining procedure is more carefully tuned, making RoBERTa a robustly optimized variant of the BERT model.
A Chinese spelling and grammar error correction method comprises the following steps:
S1. Using a RoBERTa encoder model, take the Chinese sentence to be corrected as the input sequence X = (x_1, x_2, ..., x_n) and encode it to obtain the output sequence H = (h_1, h_2, ..., h_n);
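By way of illustration, a minimal Python sketch of step S1 follows; it assumes the Hugging Face transformers library and the public hfl/chinese-roberta-wwm-ext checkpoint (a BERT-architecture Chinese RoBERTa, loaded with BertTokenizer/BertModel) standing in for the patent's unspecified encoder:

```python
# Sketch of step S1: encode a Chinese sentence with a RoBERTa encoder.
# Assumption: hfl/chinese-roberta-wwm-ext stands in for the patent's
# RoBERTa encoder model; the patent names no concrete checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

sentence = "我门去公园玩。"  # contains a spelling error: 我门 should be 我们
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    H = encoder(**inputs).last_hidden_state  # output sequence H: (1, n, 768)
print(H.shape)
```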
S2. Add a CNN convolutional layer after the RoBERTa encoder output sequence H, and extract the local features C of the encoder output through convolution kernels to obtain a local feature tensor; fuse the local feature tensor with the encoder output sequence H through a residual connection to obtain the fused semantic representation sequence H';
The output local features C are extracted by the following formula:
C=Conv1D(H);
wherein Conv1D is a 1-dimensional convolution function.
The extracted local feature tensor and the output sequence H of the RoBERTa model are combined and fused through a residual connection to obtain the fused semantic representation sequence H' = (h'_1, h'_2, ..., h'_n), and the fused semantic representation sequence H' is taken as the input of step S3.
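A minimal PyTorch sketch of step S2 follows; the kernel size, the same-length padding, and the module name ConvResidualFusion are illustrative assumptions, none of which the patent fixes:

```python
import torch
import torch.nn as nn

class ConvResidualFusion(nn.Module):
    """Step S2 sketch: 1-D convolution over the encoder output H,
    fused with H through a residual connection to give H'."""
    def __init__(self, hidden_size: int, kernel_size: int = 3):
        super().__init__()
        # same-length padding so C and H can be added element-wise
        self.conv = nn.Conv1d(hidden_size, hidden_size,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, n, hidden); Conv1d expects (batch, hidden, n)
        C = self.conv(H.transpose(1, 2)).transpose(1, 2)  # local features C
        return H + C                                      # H' = H + Conv1D(H)

H = torch.randn(2, 20, 768)           # RoBERTa-base hidden size assumed
H_fused = ConvResidualFusion(768)(H)  # fused sequence H': (2, 20, 768)
```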
S3. Perform max pooling on the input H' to obtain a fixed-length representation vector V.
S4. Pass the representation vector V into a fully connected layer to obtain the prediction distribution of the target sequence length;
The length of the target sequence is predicted from the pooled representation vector V through the fully connected layer, giving the prediction distribution p_len:

p_len = WV + b;

where W is the fully connected layer weight and b is the bias term.
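A minimal sketch of steps S3-S4 (PyTorch assumed; note that although the patent calls p_len a "prediction distribution", loss3 below compares it to the target length with an L1 loss, so this sketch predicts a scalar length per example):

```python
import torch
import torch.nn as nn

class LengthPredictor(nn.Module):
    """Steps S3-S4 sketch: max-pool H' into a fixed-length vector V,
    then predict the target sequence length via p_len = W V + b."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 1)  # scalar length head (assumption)

    def forward(self, H_fused: torch.Tensor) -> torch.Tensor:
        V = H_fused.max(dim=1).values   # S3: max pooling over positions
        return self.fc(V).squeeze(-1)   # S4: predicted length p_len, (batch,)

p_len = LengthPredictor(768)(torch.randn(2, 20, 768))  # two predicted lengths
```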
S5. Input the encoder output and the target word into a decoder module; by combining an attention mechanism and a pointer network, the decoder simultaneously corrects spelling errors and repairs grammar errors.
S51. Calculate the attention weight a_t:

a_t = softmax(e_t), e_ti = v^T tanh(W_h h_i + W_s s_{t-1} + b_attn)

where h_i is the encoder output at the i-th position, b_attn is a bias parameter, v is a learnable weight vector for mapping the context information in the attention mechanism to the appropriate dimension, and W_h and W_s are learnable weight matrices that map h_i and the decoder state s_{t-1} of the previous time step to the appropriate dimension;
S52. Based on a_t, generate the context vector c_t:

c_t = Σ_i a_ti * h_i

where a_ti denotes the attention weight on the i-th position of the input sequence at time t, and h_i denotes the RoBERTa semantic features of the i-th position.
S53. Taking c_t as input, update the decoder state s_t:

s_t = RNN([s_{t-1}, c_t])
S54. From s_t and c_t, calculate the probability distribution p_vocab of generated words, and based on H, generate the copy probability distribution p_copy over the input sequence:

p_vocab = softmax(Linear([c_t, s_t]))

p_copy = sigmoid(Linear(H))
S55. Generate the final distribution p using the pointer mechanism:

p = p_copy * a_t + (1 - p_copy) * p_vocab.
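A minimal PyTorch sketch of one decoding step follows. A GRU cell stands in for the RNN; the scoring vector v, computing p_copy from the context vector c_t rather than per-position from H, and scattering the attention weights onto the source token ids are simplifying assumptions made so that the pointer mixture p = p_copy * a_t + (1 - p_copy) * p_vocab is well-defined over a single vocabulary axis:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerDecoderStep(nn.Module):
    """One S5 decoding step: additive attention over H (S51-S52),
    GRU state update (S53), and pointer mixing (S54-S55)."""
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.W_h = nn.Linear(hidden, hidden, bias=False)
        self.W_s = nn.Linear(hidden, hidden)        # its bias plays b_attn
        self.v = nn.Linear(hidden, 1, bias=False)   # scoring vector v
        self.rnn = nn.GRUCell(hidden, hidden)
        self.gen = nn.Linear(2 * hidden, vocab)     # p_vocab head
        self.copy_gate = nn.Linear(hidden, 1)       # p_copy head

    def forward(self, H, s_prev, src_ids):
        # S51: e_ti = v^T tanh(W_h h_i + W_s s_{t-1} + b_attn); a_t = softmax(e_t)
        scores = self.v(torch.tanh(self.W_h(H) + self.W_s(s_prev).unsqueeze(1)))
        a_t = F.softmax(scores.squeeze(-1), dim=-1)          # (batch, n)
        # S52: c_t = sum_i a_ti * h_i
        c_t = torch.bmm(a_t.unsqueeze(1), H).squeeze(1)
        # S53: s_t = RNN([s_{t-1}, c_t])
        s_t = self.rnn(c_t, s_prev)
        # S54: generation distribution and copy gate
        p_vocab = F.softmax(self.gen(torch.cat([c_t, s_t], dim=-1)), dim=-1)
        p_copy = torch.sigmoid(self.copy_gate(c_t))          # (batch, 1)
        # S55: scatter a_t onto the source token ids, then mix
        copy_dist = torch.zeros_like(p_vocab).scatter_add(1, src_ids, a_t)
        p = p_copy * copy_dist + (1 - p_copy) * p_vocab
        return p, copy_dist, s_t
```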
A loss function Loss is calculated, comprising a generation loss, a pointer network loss, and a length prediction loss; the calculation process is as follows:

Calculate the generation loss, i.e., the cross entropy between the decoder's predicted distribution over the vocabulary and the target word:

loss1 = CrossEntropyLoss(p, y);

where y is the target word;

Calculate the pointer network loss, i.e., the loss of directly copying original words with the pointer network, computed as the cross entropy between the attention/copy distribution and the target word:

loss2 = CrossEntropyLoss(p_copy, y);

Calculate the length prediction loss, i.e., the loss between the predicted length and the target length:

loss3 = L1Loss(p_len, l_tag)

where p_len is the predicted length and l_tag is the target length;
Synthesize the loss function Loss:

Loss = w_1 * loss1 + w_2 * loss2 + w_3 * loss3

where w_1, w_2 and w_3 are loss weights that are trained together with the model.
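A minimal sketch of the combined loss (PyTorch assumed; the weights w_1, w_2, w_3 are modeled as learnable parameters, following the statement that they are trained with the model, and NLL of log-probabilities replaces CrossEntropyLoss because p and the copy distribution are already post-softmax probabilities):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    """Loss = w1*loss1 + w2*loss2 + w3*loss3 with learnable weights."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))  # w1, w2, w3, trained with the model

    def forward(self, p, copy_dist, y, p_len, l_tag):
        eps = 1e-12  # numerical floor; p and copy_dist are probabilities
        loss1 = F.nll_loss(torch.log(p + eps), y)          # generation loss
        # in practice loss2 would be masked to targets present in the source
        loss2 = F.nll_loss(torch.log(copy_dist + eps), y)  # pointer loss
        loss3 = F.l1_loss(p_len, l_tag)                    # length loss
        return self.w[0] * loss1 + self.w[1] * loss2 + self.w[2] * loss3
```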
A computer readable storage medium contains stored program instructions which, when executed, perform the above RoBERTa-based spelling and grammar error correction method.
An electronic device comprises a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, perform the steps of the above RoBERTa-based spelling and grammar error correction method.
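Tying the sketches above together, a hypothetical forward pass for one training example could be wired as follows. All module names are the illustrative ones introduced above, not names from the patent; following the patent's update s_t = RNN([s_{t-1}, c_t]), the decoder state depends only on the context vector, so decoding simply iterates over target positions:

```python
import torch

def forward_pass(encoder, fusion, length_head, decoder_step, criterion,
                 inputs, src_ids, y, l_tag):
    """Hypothetical end-to-end wiring of the S1-S5 sketches above."""
    H = encoder(**inputs).last_hidden_state             # S1: RoBERTa encoding
    H_fused = fusion(H)                                 # S2: conv + residual fusion
    p_len = length_head(H_fused)                        # S3-S4: length prediction
    s = torch.zeros(H_fused.size(0), H_fused.size(-1))  # initial decoder state
    total = 0.0
    for t in range(y.size(1)):                          # S5: step-by-step decoding
        p, copy_dist, s = decoder_step(H_fused, s, src_ids)
        total = total + criterion(p, copy_dist, y[:, t], p_len, l_tag)
    return total / y.size(1)                            # averaged combined Loss
```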
Experimental data 1: spelling error correction performance
Model | Precision | Recall | F1 |
---|---|---|---|
BERT | 0.8107 | 0.6390 | 0.7147 |
RoBERTa | 0.825 | 0.7293 | 0.7742 |
Present invention | 0.8713 | 0.7634 | 0.8138 |
The experimental results show that the spelling error correction Precision, Recall, and F1 scores of the method are significantly better than those of the BERT and original RoBERTa models, greatly improving the spelling error correction effect.
Experimental data 2: grammar error correction performance
Model | Precision | Recall | F0.5 |
---|---|---|---|
ConvSeq2Seq | 0.362 | 0.354 | 0.360 |
T5 | 0.506 | 0.496 | 0.504 |
Present invention | 0.576 | 0.567 | 0.574 |
As can be seen from the experimental results, spelling and grammar error correction are greatly improved in both Precision and Recall compared with ConvSeq2Seq and T5.
Experimental data 3: execution efficiency
Model | QPS |
---|---|
BERT | 3 |
RoBERTa | 3 |
ConvSeq2Seq | 5 |
T5 | 94 |
Present invention | 51 |
The T5 model is a pre-trained language model based on the Transformer architecture; its advantages include high training efficiency, strong generalization ability, and adaptability to a variety of natural language processing tasks.
Most natural language generation tasks are implemented on the basis of the Seq2Seq model; ConvSeq2Seq is a relatively recent CNN-based approach.
In terms of computational efficiency, because it combines spelling and grammar error correction, the invention is slower than BERT and RoBERTa, which can only perform spelling correction, but shows a clear speed improvement over the grammar error correction models.
Claims (6)
1. A Chinese spelling and grammar error correction method is characterized by comprising the following steps:
S1. Using a RoBERTa encoder model, encode the input sequence X = (x_1, x_2, ..., x_n) to obtain the output sequence H = (h_1, h_2, ..., h_n), where x_n is the token at the n-th position of the input sequence X and h_n is the token representation at the n-th position of the output sequence H;
S2. Add a CNN convolutional layer after the RoBERTa encoder output sequence H, and extract the local features C of the encoder output through convolution kernels to obtain a local feature tensor; fuse the local feature tensor with the encoder output sequence H through a residual connection to obtain the fused semantic representation sequence H';
S3. Perform a max pooling operation on the fused semantic representation sequence H' to obtain a fixed-length representation vector V;

S4. Pass the representation vector V into a fully connected layer to obtain the prediction distribution of the target sequence length;

S5. Input the encoder output and the target word y into a decoder module; by combining an attention mechanism and a pointer network, the decoder simultaneously corrects spelling errors and repairs grammar errors;
S51. Calculate the attention weight a_t:

a_t = softmax(e_t), e_ti = v^T tanh(W_h h_i + W_s s_{t-1} + b_attn)

where h_i is the encoder output at the i-th position, b_attn is a bias parameter, v is a learnable weight vector for mapping the context information in the attention mechanism to the appropriate dimension, and W_h and W_s are learnable weight matrices that map h_i and the decoder state s_{t-1} of the previous time step to the appropriate dimension;
S52. Based on a_t, generate the context vector c_t:

c_t = Σ_i a_ti * h_i

where a_ti denotes the attention weight on the i-th position of the input sequence at time t, and h_i denotes the RoBERTa semantic features of the i-th position;
S53. Taking c_t as input, update the decoder state s_t:

s_t = RNN([s_{t-1}, c_t])
S54. From s_t and c_t, calculate the probability distribution p_vocab of generated words, and based on H, generate the copy probability distribution p_copy over the input sequence:

p_vocab = softmax(Linear([c_t, s_t]))

p_copy = sigmoid(Linear(H))
S55. Generate the final distribution p using the pointer mechanism:

p = p_copy * a_t + (1 - p_copy) * p_vocab;
Calculate a loss function Loss, comprising a generation loss, a pointer network loss, and a length prediction loss; the calculation process is as follows:

Calculate the generation loss, i.e., the cross entropy between the decoder's predicted distribution over the vocabulary and the target word:

loss1 = CrossEntropyLoss(p, y);

where y is the target word;

Calculate the pointer network loss, i.e., the loss of directly copying original words with the pointer network, computed as the cross entropy between the attention/copy distribution and the target word:

loss2 = CrossEntropyLoss(p_copy, y);

Calculate the length prediction loss, i.e., the loss between the predicted length and the target length:

loss3 = L1Loss(p_len, l_tag)

where p_len is the predicted length and l_tag is the target length;
Synthesize the loss function Loss:

Loss = w_1 * loss1 + w_2 * loss2 + w_3 * loss3

where w_1, w_2 and w_3 are loss weights that are trained together with the model.
2. The Chinese spelling and grammar error correction method according to claim 1, wherein in step S2, the output local features C are extracted by the following formula:
C=Conv1D(H);
wherein Conv1D is a 1-dimensional convolution function.
3. The method of claim 2, wherein in step S2, the extracted local feature tensor and the output sequence H of the RoBERTa model are combined and fused through a residual connection to obtain the fused semantic representation sequence H' = (h'_1, h'_2, ..., h'_n), and the fused semantic representation sequence H' is taken as the input of step S3.
4. The Chinese spelling and grammar error correction method according to claim 1, wherein in step S4, target sequence length prediction is performed on the pooled representation vector V through the fully connected layer to obtain the prediction distribution p_len,

p_len = WV + b;

where W is the fully connected layer weight and b is the bias term.
5. A computer readable storage medium containing program instructions stored thereon which, when executed, perform the Chinese spelling and grammar error correction method according to any one of claims 1-4.
6. An electronic device, comprising: a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, perform the steps in the method of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311425616.9A CN117151084B (en) | 2023-10-31 | 2023-10-31 | Chinese spelling and grammar error correction method, storage medium and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311425616.9A CN117151084B (en) | 2023-10-31 | 2023-10-31 | Chinese spelling and grammar error correction method, storage medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117151084A CN117151084A (en) | 2023-12-01 |
CN117151084B (en) | 2024-02-23
Family
ID=88910495
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311425616.9A Active CN117151084B (en) | 2023-10-31 | 2023-10-31 | Chinese spelling and grammar error correction method, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117151084B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117454906B (en) * | 2023-12-22 | 2024-05-24 | 创云融达信息技术(天津)股份有限公司 | Text proofreading method and system based on natural language processing and machine learning |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021164310A1 (en) * | 2020-02-21 | 2021-08-26 | 华为技术有限公司 | Text error correction method and apparatus, and terminal device and computer storage medium |
WO2022095563A1 (en) * | 2020-11-06 | 2022-05-12 | 北京世纪好未来教育科技有限公司 | Text error correction adaptation method and apparatus, and electronic device, and storage medium |
CN112861517A (en) * | 2020-12-24 | 2021-05-28 | 杭州电子科技大学 | Chinese spelling error correction model |
CN115809655A (en) * | 2021-09-14 | 2023-03-17 | 华东师范大学 | Chinese character symbol correction method and system based on attribution network and BERT |
CN114417839A (en) * | 2022-01-19 | 2022-04-29 | 北京工业大学 | Entity relation joint extraction method based on global pointer network |
WO2023184633A1 (en) * | 2022-03-31 | 2023-10-05 | 上海蜜度信息技术有限公司 | Chinese spelling error correction method and system, storage medium, and terminal |
CN114912419A (en) * | 2022-04-19 | 2022-08-16 | 中国人民解放军国防科技大学 | Unified machine reading understanding method based on reorganization confrontation |
CN115080715A (en) * | 2022-05-30 | 2022-09-20 | 重庆理工大学 | Span extraction reading understanding method based on residual error structure and bidirectional fusion attention |
CN115438154A (en) * | 2022-09-19 | 2022-12-06 | 上海大学 | Chinese automatic speech recognition text restoration method and system based on representation learning |
CN115690002A (en) * | 2022-10-11 | 2023-02-03 | 河海大学 | Remote sensing image change detection method and system based on Transformer and dense feature fusion |
CN116127952A (en) * | 2023-01-16 | 2023-05-16 | 之江实验室 | Multi-granularity Chinese text error correction method and device |
CN116187334A (en) * | 2023-04-20 | 2023-05-30 | 山东齐鲁壹点传媒有限公司 | Comment generation method based on mt5 model fusion ner entity identification |
CN116757164A (en) * | 2023-06-21 | 2023-09-15 | 张丽莉 | GPT generation language recognition and detection system |
CN116822464A (en) * | 2023-07-03 | 2023-09-29 | 成都数之联科技股份有限公司 | Text error correction method, system, equipment and storage medium |
Non-Patent Citations (3)
Title |
---|
Guo, W.D.; Chen, W.B.; Chang, C.H. Prediction of hourly inflow for reservoirs at mountain catchments using residual error data and multiple-ahead correction technique. Hydrology Research, 2023, pp. 1072-1093. *
Wang Quanbin; Tan Ying. A Chinese grammatical error correction method based on data augmentation and copying. CAAI Transactions on Intelligent Systems, (01), pp. 105-112. *
Sun Qiujie; Liang Jinggui; Li Si. A Chinese grammatical error correction model based on a BART noiser. Journal of Computer Applications, 2022, pp. 860-866. *
Also Published As
Publication number | Publication date |
---|---|
CN117151084A (en) | 2023-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11113479B2 (en) | Utilizing a gated self-attention memory network model for predicting a candidate answer match to a query | |
CN112270379B (en) | Training method of classification model, sample classification method, device and equipment | |
US11210306B2 (en) | Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system | |
US11741109B2 (en) | Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system | |
CN107836000B (en) | Improved artificial neural network method and electronic device for language modeling and prediction | |
CN109887484B (en) | Dual learning-based voice recognition and voice synthesis method and device | |
WO2022142041A1 (en) | Training method and apparatus for intent recognition model, computer device, and storage medium | |
CN110046248B (en) | Model training method for text analysis, text classification method and device | |
CN117151084B (en) | Chinese spelling and grammar error correction method, storage medium and equipment | |
CN111859978A (en) | Emotion text generation method based on deep learning | |
CN110737764A (en) | personalized dialogue content generating method | |
WO2023197613A1 (en) | Small sample fine-turning method and system and related apparatus | |
CN111354333B (en) | Self-attention-based Chinese prosody level prediction method and system | |
KR20190061488A (en) | A program coding system based on artificial intelligence through voice recognition and a method thereof | |
Pramanik et al. | Text normalization using memory augmented neural networks | |
CN116308754B (en) | Bank credit risk early warning system and method thereof | |
CN114417872A (en) | Contract text named entity recognition method and system | |
CN115293139A (en) | Training method of voice transcription text error correction model and computer equipment | |
CN115293138A (en) | Text error correction method and computer equipment | |
CN114281982B (en) | Book propaganda abstract generation method and system adopting multi-mode fusion technology | |
CN117151121B (en) | Multi-intention spoken language understanding method based on fluctuation threshold and segmentation | |
Huang et al. | Fast Neural Network Language Model Lookups at N-Gram Speeds. | |
CN117668157A (en) | Retrieval enhancement method, device, equipment and medium based on knowledge graph | |
CN110543566B (en) | Intention classification method based on self-attention neighbor relation coding | |
CN115129826B (en) | Electric power field model pre-training method, fine tuning method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | ||
CB03 | Change of inventor or designer information |
Inventor after: Song Yao; Wei Chuanqiang; Si Junbo; Li Zhe; Liu Peng

Inventor before: Song Yao; Wei Chuanqiang; Si Junbo; Li Zhe; Liu Peng
GR01 | Patent grant | ||
GR01 | Patent grant |