CN115630635B

CN115630635B - Chinese text proofreading method, system and equipment based on retrieval and multiple stages

Info

Publication number: CN115630635B
Application number: CN202211639239.4A
Authority: CN
Inventors: 曹自强; 宋思琦; 吕奇; 耿磊; 付国宏
Original assignee: Suzhou University
Current assignee: Suzhou Chada Software Research & Development Co ltd
Priority date: 2022-12-20
Filing date: 2022-12-20
Publication date: 2023-04-25
Anticipated expiration: 2042-12-20
Also published as: CN115630635A

Abstract

An embodiment of the invention provides a Chinese text proofreading method, system and equipment based on retrieval and multiple stages. The method comprises: inputting a text to be corrected, searching a database for the text most similar to it, and splicing the most similar text with the text to be corrected to obtain a spliced text; performing spelling correction on the spliced text; performing sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result; comparing the perplexity of the first modification result with that of a second modification result, which is obtained by applying sequence-to-sequence grammar correction, controlled by a set threshold range, to the text to be corrected; and taking the modification result with the lower perplexity as the final modification result. The invention can effectively improve the robustness of the system and, while correcting multiple types of text errors, improve the accuracy of both error detection and error correction.

Description

Chinese text proofreading method, system and equipment based on retrieval and multiple stages

Technical Field

The invention relates to the technical field of automatic Chinese text proofreading, in particular to a Chinese text proofreading method, system and equipment based on retrieval and multiple stages.

Background

Chinese text proofreading detects and corrects errors in Chinese text so as to obtain correct sentences that preserve the original meaning. Common errors fall into four categories: redundant words, missing words, misspellings and misordered words, of which misspellings are the most frequent. Because Chinese text proofreading can effectively correct text errors, many researchers have devoted themselves to it.

Current common Chinese text proofreading methods include Chinese spelling error correction methods, which address only spelling errors, sequence-to-edit based methods and sequence-to-sequence based methods. Each of these methods has its own strengths and weaknesses with respect to the two key metrics of Chinese proofreading, namely the accuracy of error detection and the accuracy of error correction. The Chinese spelling error correction method performs well on both detection and correction, but it targets only spelling errors and cannot effectively correct other error types; it is also limited by its training data, has limited correction capability when transferred from the pre-training stage to a downstream task, and has poor robustness. The sequence-to-edit based approach can correct all four error types, but it is weak at detecting errors, so it often fails to find and correct the errors in a text, and it does not correct spelling errors as well as a dedicated spelling error correction model. The sequence-to-sequence based approach is better at detecting errors, but it is weaker at correcting them and is also relatively weak at correcting spelling errors.

Therefore, a new text proofreading system is needed to solve the above problems.

Disclosure of Invention

Therefore, the embodiment of the invention provides a Chinese text proofreading method, a Chinese text proofreading system and Chinese text proofreading equipment based on retrieval and multiple stages, which are used for solving the problems of low accuracy and poor robustness of error detection and error correction of the Chinese text proofreading method in the prior art.

The invention provides a Chinese text proofreading method based on retrieval and multiple stages, which comprises the following steps:

S1: inputting a text to be corrected, searching a database for the text most similar to the text to be corrected, and splicing the most similar text with the text to be corrected to obtain a spliced text;

S2: performing spelling correction on the spliced text;

S3: performing sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;

S4: comparing the perplexity of the first modification result with that of a second modification result, the second modification result being obtained by applying sequence-to-sequence grammar correction, controlled by a set threshold range, to the text to be corrected;

S5: taking the modification result with the lower perplexity as the final modification result.

Preferably, the spelling correction of the spliced text comprises the following steps:

processing the erroneous and correct sentence pair into a character-aligned format and feeding it into a BERT encoder to obtain the features of the original sentence, and using a Glyce encoder with a CNN structure as a visual information encoder to obtain the pronunciation and glyph features of the characters;

fusing the pronunciation and glyph features with the features of the original sentence, inputting the fused features into a Transformer encoder, and obtaining the spelling-corrected sentence through a fully connected layer.

Preferably, the sequence-to-edit grammar correction of the spelling-corrected text to obtain the first modification result is performed as follows:

grammar correction is performed on the spelling-corrected text based on the GECToR model to obtain the first modification result.

Preferably, the basic training data of the GECToR model are erroneous-correct sentence pairs; the input is an erroneous sentence, which is converted into the corresponding transformation tags; iterative error correction is performed through a BERT encoder and sequence-tag transformation; and the corrected sentence is finally output.

Preferably, the second modification result is obtained from the text to be corrected by sequence-to-sequence grammar correction with a set threshold range as follows:

the text to be corrected is corrected by the seq2seq model, with a set threshold range controlling the correction, to obtain the second modification result.

Preferably, the basic training data of the seq2seq model are correction sentence pairs, each composed of an original sentence and a correct sentence; the input is an erroneous sentence, and the corrected sentence is output through an encoder-decoder model.

Preferably, the erroneous sentence is processed by BPE before being input into the encoder-decoder model, and the output of the encoder-decoder model is restored to its original form.

Preferably, the perplexity of the first modification result is compared with that of the second modification result, the second modification result being obtained from the text to be corrected by sequence-to-sequence grammar correction with a set threshold range, wherein the perplexity formula is expressed as follows:

$$PPL(S) = P(\omega_1 \omega_2 \cdots \omega_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i)}}$$

wherein $S$ represents a sentence, $N$ represents the sentence length, $\omega_i$ represents the $i$-th word, and $p(\omega_i)$ represents the probability of the $i$-th word.

The invention provides a Chinese text proofreading system based on retrieval and multiple stages, which comprises:

the input module is used for inputting the text to be corrected;

the retrieval module is used for searching for the text most similar to the text to be corrected, and splicing the most similar text with the text to be corrected to obtain a spliced text;

the spelling correction module is used for performing spelling correction on the spliced text;

the sequence-to-edit module is used for performing sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;

the sequence-to-sequence module is used for performing sequence-to-sequence grammar correction, controlled by a set threshold range, on the text to be corrected to obtain a second modification result;

the perplexity selection module is used for comparing the perplexity of the first modification result with that of the second modification result;

and the output module is used for outputting the modification result with the lower perplexity.

The invention also provides a Chinese text proofreading device which comprises the above Chinese text proofreading method based on retrieval and multiple stages, and is used for realizing Chinese text proofreading.

Compared with the prior art, the technical scheme of the invention has the following advantages:

The invention provides a Chinese text proofreading method, system and equipment based on retrieval and multiple stages. A retrieval module is added on top of the Chinese spelling error correction method to supply a plausible correct modification for the erroneous fragment, which enhances the robustness and accuracy of text proofreading and reduces cost without using additional manually labeled data. In the sequence-to-sequence grammar error correction model, the output is controlled by a threshold so that miscorrection is avoided, which enhances the accuracy of text proofreading. The spelling error correction model, the sequence-to-sequence grammar error correction model and the sequence-to-edit grammar error correction model are combined under one strategy, which improves the accuracy of text proofreading while retaining the respective advantages of the three models. Finally, the invention selects between modification results using the perplexity of a language model, which further improves the robustness and accuracy of text proofreading.

Drawings

For a clearer description of embodiments of the invention or of solutions in the prior art, reference will be made below to the accompanying drawings, which are used in the embodiments and which are intended to illustrate, but not to limit, the invention, and from which other drawings can be obtained without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is a flow chart of a retrieval-based multi-stage Chinese text proofreading method according to an embodiment;

FIG. 2 is a flow chart of spelling error correction model inference according to an embodiment;

FIG. 3 is a flow chart of GECToR model inference according to an embodiment;

FIG. 4 is a flow chart of seq2seq model inference according to an embodiment;

FIG. 5 is a block diagram of a retrieval-based multi-stage Chinese text proofreading system according to an embodiment.

Detailed Description

The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.

The database used in the embodiments of the present invention is the data set provided by Wang et al., comprising 271329 correct-incorrect sentence pairs; the correct sentences therein are used as the data for the retrieval module's database.

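Referring to FIG. 1, an embodiment of the present invention provides a retrieval-based multi-stage Chinese text proofreading method, comprising:

S1: inputting a text to be corrected, searching a database for the text most similar to the text to be corrected, and splicing the most similar text with the text to be corrected to obtain a spliced text;

S2: performing spelling correction on the spliced text;

S3: performing sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;

S4: comparing the perplexity of the first modification result with that of a second modification result, the second modification result being obtained by applying sequence-to-sequence grammar correction, controlled by a set threshold range, to the text to be corrected;

S5: taking the modification result with the lower perplexity as the final modification result.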

In the retrieval-based multi-stage Chinese text proofreading method provided by the invention, the text most similar to the text to be corrected is searched for in a database and spliced with the text to be corrected, which supplies a plausible correct modification for the erroneous text and thereby enhances the robustness and accuracy of text proofreading and reduces cost without using additional manually labeled data. The second modification result is obtained from the text to be corrected by sequence-to-sequence grammar correction with a set threshold range, which avoids miscorrection and enhances the accuracy of text proofreading. By combining the spelling error correction model, the sequence-to-sequence grammar error correction model and the sequence-to-edit grammar error correction model, the accuracy of text proofreading is improved while the respective advantages of the three models are retained. The perplexity comparison further improves the robustness and accuracy of text proofreading.

Further, in step S1:

The text to be corrected is input, the BM25 retrieval algorithm is used to search the database for the text most similar to the text to be corrected, and the most similar text is spliced with the text to be corrected to obtain a spliced text. This supplies a plausible correct modification for the erroneous text, which enhances the robustness and accuracy of text proofreading and reduces cost without using additional manually labeled data.
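Step S1 can be illustrated with a minimal sketch. It assumes character-level terms, the common BM25 parameters k1 = 1.5 and b = 0.75, a non-negative IDF variant, and a hypothetical "[SEP]" separator with the retrieved sentence placed first; none of these choices is specified in the text above.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each corpus sentence against the query, using characters as terms.

    k1 and b are common defaults, assumed here for illustration only.
    """
    if not corpus:
        raise ValueError("corpus must not be empty")
    docs = [list(doc) for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0

    # Document frequency: in how many sentences each character occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):  # each distinct query character counts once
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_and_splice(text, corpus, sep="[SEP]"):
    """Return the best-matching corpus sentence spliced with the input text.

    The separator and the order (retrieved sentence first) are hypothetical.
    """
    scores = bm25_scores(text, corpus)
    best_index = max(range(len(corpus)), key=scores.__getitem__)
    return corpus[best_index] + sep + text
```

In this sketch the corpus would hold only the correct sentences of the retrieval database, so the spliced text carries a likely-correct reference alongside the sentence being corrected.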

Further, in step S2:

As shown in FIG. 2, the spliced text is input into the spelling error correction model. The spelling error correction model can only correct sentences that contain nothing but spelling errors, that is, sentences whose erroneous and correct versions have the same length. The erroneous and correct sentence pairs are therefore first processed into a character-aligned format and then fed into a BERT encoder to obtain the features of the original sentence. Because most spelling errors fall into two types, pronunciation errors and glyph errors, a Glyce encoder with a CNN structure is used as a visual information encoder to obtain the pronunciation and glyph features of the characters. Finally, the pronunciation and glyph features are fused with the original sentence features and input into a Transformer encoder, and the output passes through a fully connected layer to obtain the corrected sentence.
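The character-aligned format mentioned above can be sketched as follows. This is only an illustration of the equal-length constraint; the function name and the triple layout are hypothetical, and the encoders themselves are not modelled here.

```python
def align_pair(wrong, correct):
    """Pair up characters of an erroneous sentence and its correction.

    Returns (wrong_char, correct_char, is_error) triples. Spelling-only
    errors keep sentence length unchanged, so unequal lengths are rejected.
    """
    if len(wrong) != len(correct):
        raise ValueError("spelling-only pairs must have equal length")
    return [(w, c, w != c) for w, c in zip(wrong, correct)]
```

A pair that fails the length check contains an insertion or deletion and would fall to the grammar correction stages rather than the spelling model.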

Further, in step S3:

As shown in FIG. 3, grammar correction is performed on the spelling-corrected text based on the GECToR model (a sequence-to-edit grammar error correction model). The idea of the GECToR model is to convert the grammar correction task into a sequence tagging task in which every token is assigned a tag; the tag types are shown in Table 1. Because the training data consist only of erroneous-correct sentence pairs, each pair must first be converted into the corresponding transformation tags before being fed to the model. The model structure of GECToR is a BERT-like Transformer model with two fully connected layers and a softmax on top. Through the tag transformations, error correction operations such as insertion, deletion and replacement can be realized, and multiple rounds of iterative tagging can be performed until no new error is found, at which point the final result is output.
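The tag-and-iterate idea can be sketched as below. The tag names ($KEEP, $DELETE, $REPLACE_x, $APPEND_x) follow the convention of the published GECToR model and are an assumption here, since Table 1 is not reproduced in this text; `tagger` is a hypothetical stand-in for the trained model.

```python
def apply_edit_tags(tokens, tags):
    """Apply one edit tag per token and return the edited token list."""
    if len(tokens) != len(tags):
        raise ValueError("one tag is required per token")
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)
        elif tag == "$DELETE":
            continue  # drop the token
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])
        elif tag.startswith("$APPEND_"):
            out.extend([token, tag[len("$APPEND_"):]])  # insert after token
        else:
            raise ValueError(f"unknown tag: {tag}")
    return out


def iterative_correct(tokens, tagger, max_rounds=3):
    """Re-tag and edit until the tagger proposes no change or rounds run out.

    max_rounds is an illustrative cap, not a value taken from the text.
    """
    for _ in range(max_rounds):
        tags = tagger(tokens)
        if all(tag == "$KEEP" for tag in tags):
            break  # no new error found
        tokens = apply_edit_tags(tokens, tags)
    return tokens
```

Tagging rather than generating keeps each round cheap, which is what makes repeated passes over the same sentence practical.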

TABLE 1

Further, in step S4:

As shown in FIG. 4, the text to be corrected is corrected by the seq2seq model (a sequence-to-sequence grammar error correction model), with a set threshold range controlling the correction, to obtain the second modification result. For a seq2seq model applied to the grammar correction task, the basic training data are correction sentence pairs, each composed of an original sentence and a correct sentence; the input is an erroneous sentence, and an encoder-decoder model directly outputs the corrected sentence. Before a sentence is input into the model it must undergo BPE processing, and the output sentence, which is also in BPE form, must be restored to its original form; this is done by deleting the redundant spaces in the BPE result file. Because some characters in a sentence are not in the vocabulary and are ignored, some corrected sentences contain [UNK]; in that case the corrected sentence is set equal to the original sentence, that is, no modification is made.
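The restoration and [UNK] fallback described above can be sketched as follows. Handling of the "@@" continuation marker is an assumption borrowed from the subword-nmt convention; the text above only mentions deleting redundant spaces.

```python
def restore_seq2seq_output(bpe_output, source, unk="[UNK]"):
    """Undo BPE segmentation on a decoded Chinese sentence.

    If the restored sentence still contains the unknown-token marker,
    fall back to the unmodified source sentence.
    """
    # Join sub-word pieces (assumed "@@ " marker), then drop remaining spaces.
    restored = bpe_output.replace("@@ ", "").replace(" ", "")
    if unk in restored:
        return source  # no modification is made
    return restored
```

Removing every space is reasonable only because Chinese text is written without word spacing; this sketch would not suit mixed Chinese-English input.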

The perplexity of the first modification result and that of the second modification result are compared using a language model to obtain the final correction result. The lower the perplexity, the more plausible the sentence. The perplexity formula is expressed as follows:

$$PPL(S) = P(\omega_1 \omega_2 \cdots \omega_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i)}}$$

wherein $S$ represents a sentence, $N$ represents the sentence length, $\omega_i$ represents the $i$-th word, and $p(\omega_i)$ represents the probability of the $i$-th word.
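The perplexity-based selection can be sketched as below, assuming a hypothetical `token_probs_fn` that returns the per-word probabilities a language model assigns to a sentence. The computation is done in log space, which is mathematically equivalent to the N-th root of the product of inverse probabilities but avoids underflow on long sentences.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sentence given the probability of each of its words."""
    if not token_probs:
        raise ValueError("sentence must contain at least one word")
    if any(p <= 0.0 for p in token_probs):
        raise ValueError("probabilities must be positive")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def select_by_perplexity(candidates, token_probs_fn):
    """Return the candidate sentence with the lowest perplexity."""
    return min(candidates, key=lambda s: perplexity(token_probs_fn(s)))
```

With two candidates, this reduces to the comparison in step S5: whichever of the two modification results the language model finds less surprising is kept.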

In order to further illustrate the technical principles of the present invention, specific examples are provided below for illustration.

Take as an example the results of correcting one text with each of the three methods separately: the Chinese spelling error correction model, the sequence-to-sequence grammar error correction model and the sequence-to-edit grammar error correction model.

the input error correction text is: "held full claims are lost by the victim's immediate economic staff and obtained I expect, which can be processed lightly as appropriate. "

Text corrected by the Chinese spelling error correction model: "reported full reimbursement victim near economic loss, and obtained I expect, can be processed lightly as appropriate. "

Text corrected based on a sequence-to-sequence grammar correction model: "reported full compensation is lost by the victim's immediate economic staff and understanding is obtained, which can be processed lightly as appropriate. "

Text corrected based on a sequence-to-edit grammar correction model: "reported full reimbursement is invaded for economic loss in close relatives and understanding is obtained, and light treatment can be considered. "

It can be seen that each of the three models leaves some errors uncorrected when used on its own.

The scheme of the invention is as follows:

The retrieval algorithm is used to change ' top of the grammar into ' right ', the sequence-to-edit grammar error correction model is used to change ' I expect ' into ' forgiveness ', and perplexity selection is finally performed against the result produced by the sequence-to-sequence grammar error correction model. The final result is: "reported full reimbursement victim is close to the economic loss and obtains understanding, can be treated from light. "

As shown in FIG. 5, an embodiment of the present invention provides a retrieval-based multi-stage Chinese text proofreading system comprising:

an input module 10, configured to input the text to be corrected;

a retrieval module 20, configured to find the text most similar to the text to be corrected, and splice the most similar text with the text to be corrected to obtain a spliced text;

a spelling correction module 30, configured to perform spelling correction on the spliced text;

a sequence-to-edit module 40, configured to perform sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;

a sequence-to-sequence module 50, configured to perform sequence-to-sequence grammar correction, controlled by a set threshold range, on the text to be corrected to obtain a second modification result;

a perplexity selection module 60, configured to compare the perplexity of the first modification result with that of the second modification result;

and an output module 70, configured to output the modification result with the lower perplexity.
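How the modules fit together can be sketched as a single function that takes each module as a callable. All parameter names are hypothetical, and the sketch assumes that the spelling correction callable returns only the corrected input sentence, with the retrieved reference already stripped; the text above does not say how that is done.

```python
def proofread(text, retrieve_and_splice, spell_correct, seq2edit, seq2seq, perplexity):
    """Wire modules 10-70 together; each argument after `text` is a callable."""
    spliced = retrieve_and_splice(text)   # retrieval module 20
    spelled = spell_correct(spliced)      # spelling correction module 30
    result_one = seq2edit(spelled)        # sequence-to-edit module 40
    result_two = seq2seq(text)            # sequence-to-sequence module 50 (original input)
    # Perplexity selection module 60 and output module 70.
    return min((result_one, result_two), key=perplexity)
```

Note that the two branches start from different inputs: the sequence-to-edit branch sees the retrieval-assisted, spelling-corrected sentence, while the sequence-to-sequence branch works directly on the original text.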

The system is used for implementing the retrieval-based multi-stage Chinese text proofreading method described above; to avoid redundancy, the details are not repeated here.

The invention also provides a Chinese text proofreading device which comprises the above Chinese text proofreading method based on retrieval and multiple stages, and is used for realizing Chinese text proofreading. The technical principle and advantageous effects of the device are similar to those of the above method and are not described again here.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom still fall within the scope of the present invention.

Claims

1. A retrieval-based multi-stage Chinese text proofreading method, comprising:

S2: performing spelling correction on the spliced text;

S5: taking the modification result with the lower perplexity as the final modification result;

the spelling correction of the spliced text comprises the following steps:

2. The retrieval-based multi-stage Chinese text proofreading method of claim 1, wherein the sequence-to-edit grammar correction of the spelling-corrected text to obtain the first modification result is performed as follows:

3. The retrieval-based multi-stage Chinese text proofreading method of claim 2, wherein the basic training data of the GECToR model are erroneous-correct sentence pairs, the input is an erroneous sentence, the input sentence is converted into the corresponding transformation tags, iterative error correction is performed through a BERT encoder and sequence-tag transformation, and the corrected sentence is finally output.

4. The retrieval-based multi-stage Chinese text proofreading method of claim 1, wherein the second modification result is obtained from the text to be corrected by sequence-to-sequence grammar correction with a set threshold range as follows:

5. The retrieval-based multi-stage Chinese text proofreading method of claim 4, wherein the basic training data of the seq2seq model are correction sentence pairs, each composed of an original sentence and a correct sentence, the input is an erroneous sentence, and the corrected sentence is output through an encoder-decoder model.

6. The retrieval-based multi-stage Chinese text proofreading method of claim 5, wherein the erroneous sentence is subjected to BPE processing before being input into the encoder-decoder model, and the output of the encoder-decoder model is subjected to restoration processing.

7. The retrieval-based multi-stage Chinese text proofreading method of claim 1, wherein the perplexity of the first modification result is compared with that of the second modification result, the second modification result being obtained from the text to be corrected by sequence-to-sequence grammar correction with a set threshold range, and wherein the perplexity formula is expressed as follows:

$$PPL(S) = P(\omega_1 \omega_2 \cdots \omega_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i)}}$$

wherein $S$ represents a sentence, $N$ represents the sentence length, $\omega_i$ represents the $i$-th word, and $p(\omega_i)$ represents the probability of the $i$-th word.

8. A retrieval-based multi-stage Chinese text proofreading system, comprising:

the input module is used for inputting the text to be corrected;

the output module is used for outputting the modification result with the lower perplexity;

the spelling correction of the spliced text comprises the following steps:

9. A Chinese text proofreading device comprising the retrieval-based multi-stage Chinese text proofreading method according to any one of claims 1 to 7, for implementing Chinese text proofreading.