CN115630635B - Chinese text proofreading method, system and equipment based on retrieval and multiple stages - Google Patents

Publication number
CN115630635B
Authority
CN
China
Prior art keywords
text
sequence
correction
sentence
modification result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211639239.4A
Other languages
Chinese (zh)
Other versions
CN115630635A (en)
Inventor
曹自强
宋思琦
吕奇
耿磊
付国宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Chada Software Research & Development Co ltd
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202211639239.4A priority Critical patent/CN115630635B/en
Publication of CN115630635A publication Critical patent/CN115630635A/en
Application granted granted Critical
Publication of CN115630635B publication Critical patent/CN115630635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiments of the invention provide a Chinese text proofreading method, system and device based on retrieval and multiple stages. The method comprises: inputting a text to be corrected, searching a database for the text most similar to the text to be corrected, and splicing the most similar text with the text to be corrected to obtain a spliced text; performing spelling correction on the spliced text; performing sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result; comparing the perplexity of the first modification result with that of a second modification result, the second modification result being obtained by applying sequence-to-sequence grammar correction to the text to be corrected under a set threshold range; and taking the modification result with the lower perplexity as the final modification result. The invention can effectively improve the robustness of the system, and improves the accuracy of error detection and error correction while handling multiple types of text errors.

Description

Chinese text proofreading method, system and equipment based on retrieval and multiple stages
Technical Field
The invention relates to the technical field of automatic Chinese text proofreading, in particular to a Chinese text proofreading method, system and equipment based on retrieval and multiple stages.
Background
Chinese text proofreading detects and corrects errors in Chinese text so as to obtain correct sentences that preserve the original meaning. Common errors fall into four categories: redundant words, missing words, misspellings and word-order errors, of which misspellings are the most likely to occur. Chinese text proofreading can effectively correct text errors, and many researchers are devoted to it.
The Chinese text proofreading methods in common use today include Chinese spelling error correction methods, which address spelling errors only, sequence-to-edit based methods and sequence-to-sequence based methods. Each of these methods has its own strengths and weaknesses with respect to the two key metrics of Chinese proofreading, namely the accuracy of error detection and the accuracy of error correction. Chinese spelling error correction methods perform well in both detection and correction, but they target spelling errors only and cannot effectively correct other error types; they are also limited by their training data, have limited correction capability when transferred from the pre-training stage to a downstream task, and have poor robustness. Sequence-to-edit based methods can correct all four error types, but they are weak at error detection, often failing to find and correct the errors in a text, and they are less capable of correcting spelling errors than dedicated spelling error correction models. Sequence-to-sequence based methods are better at detecting errors, but they are weaker at correcting them and are likewise relatively weak at correcting spelling errors.
Therefore, a new text proofreading system needs to be proposed to solve the above problems.
Disclosure of Invention
Therefore, the embodiment of the invention provides a Chinese text proofreading method, a Chinese text proofreading system and Chinese text proofreading equipment based on retrieval and multiple stages, which are used for solving the problems of low accuracy and poor robustness of error detection and error correction of the Chinese text proofreading method in the prior art.
The invention provides a Chinese text proofreading method based on retrieval and multiple stages, which comprises the following steps:
S1: inputting a text to be corrected, searching a database for the text most similar to the text to be corrected, and splicing the most similar text with the text to be corrected to obtain a spliced text;
S2: performing spelling correction on the spliced text;
S3: performing sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;
S4: comparing the perplexity of the first modification result with that of a second modification result, the second modification result being obtained by applying sequence-to-sequence grammar correction to the text to be corrected under a set threshold range;
S5: taking the modification result with the lower perplexity as the final modification result.
Preferably, the spelling correction of the spliced text comprises the following steps:
processing the erroneous-correct sentence pairs into a character-aligned format and feeding them into a BERT encoder to obtain features of the original sentence, and using a Glyce encoder with a CNN structure as a visual information encoder to obtain pronunciation and glyph features of the characters in the text;
fusing the pronunciation and glyph features with the features of the original sentence, inputting the fused features into a Transformer encoder, and obtaining the spelling-corrected sentence through a fully connected layer.
Preferably, the sequence-to-edit grammar correction of the spelling-corrected text to obtain the first modification result is performed as follows:
performing grammar correction on the spelling-corrected text based on the GECToR model to obtain the first modification result.
Preferably, the basic training data of the GECToR model are erroneous-correct sentence pairs; the input is an erroneous sentence, which is converted into the corresponding transformation tags; iterative error correction is carried out by a BERT encoder and sequence-tag transformations, and the corrected sentence is finally output.
Preferably, the second modification result is obtained from the text to be corrected by sequence-to-sequence grammar correction under a set threshold range as follows:
the text to be corrected is corrected based on a seq2seq model, with the correction controlled by a set threshold range, to obtain the second modification result.
Preferably, the basic training data of the seq2seq model are correction sentence pairs, each composed of an original sentence and a correct sentence; the input is an erroneous sentence, and the corrected sentence is output by an encoder-decoder model.
Preferably, the erroneous sentence is subjected to BPE processing before being input into the encoder-decoder model, and the output of the encoder-decoder model is restored from BPE form.
Preferably, the perplexity of the first modification result is compared with that of the second modification result, the second modification result being obtained by applying sequence-to-sequence grammar correction to the text to be corrected under a set threshold range, wherein the perplexity formula is expressed as follows:
$$PPL(S) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i)}}$$
where $S$ denotes the sentence, $N$ denotes the sentence length, $\omega_i$ denotes the $i$-th word, and $p(\omega_i)$ denotes the probability of the $i$-th word.
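As a worked illustration, and assuming that the perplexity referred to here is the standard language-model perplexity (the N-th root of the inverse product of the word probabilities, which matches the symbol definitions given for the sentence, its length, the i-th word and its probability), the quantity can equivalently be written in logarithmic form, which is numerically more stable for long sentences. The two-word example values are hypothetical.

```latex
PPL(S) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i)}}
       = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln p(\omega_i)\right)

% Hypothetical example with N = 2, p(\omega_1) = 0.5, p(\omega_2) = 0.25:
PPL(S) = \sqrt{\frac{1}{0.5 \times 0.25}} = \sqrt{8} \approx 2.83
```

A sentence whose words are all assigned higher probabilities yields a smaller value, which is why the result with the lower perplexity is preferred.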
The invention provides a Chinese text proofreading system based on retrieval and multiple stages, which comprises:
an input module, configured to input a text to be corrected;
a retrieval module, configured to search for the text most similar to the text to be corrected, and to splice the most similar text with the text to be corrected to obtain a spliced text;
a spelling correction module, configured to perform spelling correction on the spliced text;
a sequence-to-edit module, configured to perform sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;
a sequence-to-sequence module, configured to perform sequence-to-sequence grammar correction on the text to be corrected under a set threshold range to obtain a second modification result;
a perplexity selection module, configured to compare the perplexity of the first modification result with that of the second modification result;
and an output module, configured to output the modification result with the lower perplexity.
The invention also provides a Chinese text proofreading device that implements the above Chinese text proofreading method based on retrieval and multiple stages, so as to carry out Chinese text proofreading.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the invention provides a Chinese text proofreading method, system and device based on retrieval and multiple stages, in which a retrieval module is added on top of a Chinese spelling error correction method to supply correct modification suggestions for erroneous fragments, so that the robustness and accuracy of text correction are enhanced, and the cost is reduced, without using additional manually labeled data; for the sequence-to-sequence grammar error correction model, the output is controlled by a threshold so that erroneous corrections are avoided, which enhances the accuracy of text correction; the spelling error correction model, the sequence-to-sequence grammar error correction model and the sequence-to-edit grammar error correction model are combined under a single strategy, which improves the accuracy of text correction while retaining the respective advantages of the three models; and the invention provides a method for selecting a modification result by the perplexity of a language model, which improves the robustness and accuracy of text correction.
Drawings
For a clearer description of embodiments of the invention or of solutions in the prior art, reference will be made below to the accompanying drawings, which are used in the embodiments and which are intended to illustrate, but not to limit, the invention, and from which other drawings can be obtained without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart of a Chinese text proofreading method based on retrieval and multiple stages according to an embodiment;
FIG. 2 is a flow chart of the inference of the spelling error correction model according to an embodiment;
FIG. 3 is a flow chart of the inference of the GECToR model according to an embodiment;
FIG. 4 is a flow chart of the inference of the seq2seq model according to an embodiment;
FIG. 5 is a block diagram of a Chinese text proofreading system based on retrieval and multiple stages according to an embodiment.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
The database used in the embodiments of the present invention is built from the data set provided by Wang et al., which comprises 271329 correct-incorrect sentence pairs; the correct sentences therein are used as the data of the retrieval module's database.
Referring to FIG. 1, an embodiment of the present invention provides a Chinese text proofreading method based on retrieval and multiple stages, comprising:
S1: inputting a text to be corrected, searching a database for the text most similar to the text to be corrected, and splicing the most similar text with the text to be corrected to obtain a spliced text;
S2: performing spelling correction on the spliced text;
S3: performing sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;
S4: comparing the perplexity of the first modification result with that of a second modification result, the second modification result being obtained by applying sequence-to-sequence grammar correction to the text to be corrected under a set threshold range;
S5: taking the modification result with the lower perplexity as the final modification result.
In the Chinese text proofreading method based on retrieval and multiple stages provided by the invention, the text most similar to the text to be corrected is searched for in a database and spliced with the text to be corrected, which supplies correct modification suggestions for the erroneous text, so that the robustness and accuracy of text correction are enhanced, and the cost is reduced, without using additional manually labeled data; the second modification result is obtained from the text to be corrected by sequence-to-sequence grammar correction under a set threshold range, which avoids erroneous corrections and enhances the accuracy of text correction; by using the spelling error correction model, the sequence-to-sequence grammar error correction model and the sequence-to-edit grammar error correction model together, the accuracy of text correction is improved while the respective advantages of the three models are retained; and the perplexity comparison improves the robustness and accuracy of text correction.
Further, in step S1:
a text to be corrected is input, the BM25 retrieval algorithm is used to search the database for the text most similar to the text to be corrected, and the most similar text is spliced with the text to be corrected to obtain a spliced text, which supplies correct modification suggestions for the erroneous text, so that the robustness and accuracy of text correction are enhanced, and the cost is reduced, without using additional manually labeled data.
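The retrieval and splicing of step S1 can be sketched as follows. This is a minimal illustration only: the character-level tokenization, the parameters `k1` and `b`, the smoothed IDF variant, the `[SEP]` separator and the order in which the two texts are joined are all assumptions that the text does not specify, and the names `BM25`, `most_similar` and `splice` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: character-level tokens, so no Chinese word segmenter is needed.
    return [ch for ch in text if not ch.isspace()]


class BM25:
    """Small from-scratch BM25 scorer over a list of correct sentences."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.corpus = list(corpus)
        self.docs = [tokenize(d) for d in self.corpus]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        n = len(self.docs)
        # One common smoothed IDF variant; it is always positive.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, idx):
        tf, dl = self.tfs[idx], len(self.docs[idx])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            num = tf[t] * (self.k1 + 1)
            den = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * num / den
        return total

    def most_similar(self, query):
        best = max(range(len(self.docs)), key=lambda i: self.score(query, i))
        return self.corpus[best]


def splice(retrieved, source, sep="[SEP]"):
    # Assumption: the retrieved text is placed before the text to be corrected.
    return f"{retrieved}{sep}{source}"
```

In use, the database of correct sentences is indexed once, and each incoming text is spliced with its best match before being passed to the spelling correction stage.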
Further, in step S2:
as shown in FIG. 2, the spliced text is input into the spelling error correction model. The spelling error correction model can only correct sentences that contain nothing but spelling errors, that is, the erroneous sentence and the correct sentence must be aligned in length; therefore, the erroneous-correct sentence pairs are first processed into a character-aligned format and then fed into a BERT encoder to obtain the features of the original sentence. Since most spelling errors can be attributed to either pronunciation errors or glyph errors, a Glyce encoder with a CNN structure is used as a visual information encoder to obtain the pronunciation and glyph features of the characters in the text. Finally, the pronunciation and glyph features are fused with the original-sentence features and input into a Transformer encoder, and the corrected sentence is obtained through a fully connected layer.
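The length-alignment constraint described above can be illustrated with a small sketch of how an erroneous-correct pair might be put into a character-aligned form. The function names are hypothetical, and the encoders themselves (BERT, Glyce, Transformer) are not modeled here.

```python
def align_pair(wrong, correct):
    """Pair the characters of an erroneous sentence with those of its correction.

    Spelling correction only substitutes characters, so both sentences
    must have the same length; anything else is rejected.
    """
    if len(wrong) != len(correct):
        raise ValueError("spelling-correction pairs must have equal length")
    return list(zip(wrong, correct))


def substitution_positions(wrong, correct):
    # Indices at which the erroneous character differs from the correct one.
    return [i for i, (w, c) in enumerate(align_pair(wrong, correct)) if w != c]
```

Pairs that fail the length check involve insertions or deletions and are therefore left to the later grammar correction stages.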
Further, in step S3:
as shown in FIG. 3, grammar correction is performed on the spelling-corrected text based on the GECToR model (a sequence-to-edit grammar error correction model). The idea of the GECToR model is to convert the grammar correction task into a sequence labeling task in which every token is assigned a tag; the tag types are shown in Table 1. Since the training data consist only of erroneous-correct sentence pairs, each pair must first be converted into the corresponding transformation tags before being fed to the model. The model structure of GECToR is a BERT-like Transformer model with two fully connected layers and a softmax layer on top. Through the tag transformations, error correction operations such as insertion, deletion and replacement can be realized, and multiple rounds of iterative tagging can be performed until no new error is found, after which the final result is output.
TABLE 1
[Table 1: tag types of the GECToR model (table image not reproduced)]
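The tag-based editing of step S3 can be sketched as follows. Because Table 1 is not available in this text, the tag names below follow the basic tag set of the published GECToR model (`$KEEP`, `$DELETE`, `$APPEND_x`, `$REPLACE_x`) and are an assumption about the table's content; `apply_tags`, `iterative_correct` and the `tagger` callable, which stands in for the trained model, are hypothetical names.

```python
def apply_tags(tokens, tags):
    """Apply one edit tag per token and return the edited token list."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(tok)
        elif tag == "$DELETE":
            continue
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])
        elif tag.startswith("$APPEND_"):
            # Keep the token and insert the new token right after it.
            out.append(tok)
            out.append(tag[len("$APPEND_"):])
        else:
            raise ValueError(f"unknown tag: {tag}")
    return out


def iterative_correct(tokens, tagger, max_rounds=5):
    """Re-tag and re-edit until the tagger proposes no further change."""
    for _ in range(max_rounds):
        tags = tagger(tokens)
        if all(t == "$KEEP" for t in tags):
            break
        tokens = apply_tags(tokens, tags)
    return tokens
```

The cap on the number of rounds (`max_rounds`) is an illustrative safeguard; the text only states that tagging is repeated until no new error is found.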
Further, in step S4:
as shown in FIG. 4, the text to be corrected is corrected based on the seq2seq model (a sequence-to-sequence grammar error correction model), with the correction controlled by a set threshold range, to obtain the second modification result. For a seq2seq model applied to the grammar correction task, the basic training data are correction sentence pairs, each composed of an original sentence and a correct sentence; the input is an erroneous sentence, and the corrected sentence is output directly by an encoder-decoder model. Before a sentence is input into the model it must undergo BPE processing, and the output sentence is likewise in BPE form and must be restored to its original form; this is done by deleting the redundant spaces in the BPE result file. Because some words in a sentence are not in the vocabulary, some modified sentences contain [UNK]; in this case the modified sentence is set equal to the original sentence, i.e. no modification is made.
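The post-processing of the seq2seq output described above can be sketched as follows. The `@@` continuation marker follows a widely used BPE convention and is an assumption, as is the removal of every remaining space (reasonable for Chinese text, which is written without spaces); `restore_bpe` and `postprocess` are hypothetical names.

```python
def restore_bpe(bpe_text, joiner="@@"):
    # Merge subword pieces marked with a continuation marker (assumed "@@ ").
    merged = bpe_text.replace(joiner + " ", "")
    # Assumption: Chinese text carries no spaces, so the remaining ones are redundant.
    return merged.replace(" ", "")


def postprocess(source, bpe_output, unk="[UNK]"):
    """Return the restored seq2seq output, or the source if it contains [UNK]."""
    if unk in bpe_output:
        # Out-of-vocabulary output: fall back to the unmodified original sentence.
        return source
    return restore_bpe(bpe_output)
```

Falling back to the original sentence keeps an unknown-token placeholder from ever reaching the perplexity comparison.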
The perplexity of the first modification result and that of the second modification result are compared using a language model to obtain the final correction result; the lower the perplexity, the more reasonable the sentence. The perplexity formula is expressed as follows:
$$PPL(S) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i)}}$$
where $S$ denotes the sentence, $N$ denotes the sentence length, $\omega_i$ denotes the $i$-th word, and $p(\omega_i)$ denotes the probability of the $i$-th word.
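The perplexity-based selection can be sketched as follows, assuming the standard language-model perplexity and assuming that the per-word probabilities have already been produced by some language model, which is not modeled here. The function names and the candidate format are hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sentence from its per-word probabilities.

    Computed in the log domain, which is equivalent to taking the N-th
    root of the inverse product of the probabilities but avoids underflow.
    """
    if not token_probs:
        raise ValueError("at least one word probability is required")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def select_by_perplexity(candidates):
    # candidates: list of (sentence, per-word probabilities) pairs.
    return min(candidates, key=lambda c: perplexity(c[1]))[0]
```

Here `select_by_perplexity` would receive the first and second modification results together with their word probabilities and return whichever sentence the language model finds more plausible.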
In order to further illustrate the technical principles of the present invention, specific examples are provided below for illustration.
Take as an example the results of correcting one text with each of three methods, namely the Chinese spelling error correction model, the sequence-to-sequence grammar error correction model and the sequence-to-edit grammar error correction model:
the input error correction text is: "held full claims are lost by the victim's immediate economic staff and obtained I expect, which can be processed lightly as appropriate. "
Text corrected by the Chinese spelling error correction model: "reported full reimbursement victim near economic loss, and obtained I expect, can be processed lightly as appropriate. "
Text corrected based on a sequence-to-sequence grammar correction model: "reported full compensation is lost by the victim's immediate economic staff and understanding is obtained, which can be processed lightly as appropriate. "
Text corrected based on a sequence-to-edit grammar correction model: "reported full reimbursement is invaded for economic loss in close relatives and understanding is obtained, and light treatment can be considered. "
It can be seen that each of the three models leaves some errors uncorrected when it is used on its own.
The scheme of the invention is as follows:
' top of the grammar is changed into ' right ' by the retrieval algorithm, ' I expect ' is changed into ' forgiveness ' by the sequence-to-edit grammar error correction model, and perplexity-based selection is finally performed against the result modified by the sequence-to-sequence grammar error correction model, giving the final result: "reported full reimbursement victim is close to the economic loss and obtains understanding, can be treated from light. "
As shown in FIG. 5, an embodiment of the present invention provides a Chinese text proofreading system based on retrieval and multiple stages, comprising:
an input module 10, configured to input a text to be corrected;
a retrieval module 20, configured to search for the text most similar to the text to be corrected, and to splice the most similar text with the text to be corrected to obtain a spliced text;
a spelling correction module 30, configured to perform spelling correction on the spliced text;
a sequence-to-edit module 40, configured to perform sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;
a sequence-to-sequence module 50, configured to perform sequence-to-sequence grammar correction on the text to be corrected under a set threshold range to obtain a second modification result;
a perplexity selection module 60, configured to compare the perplexity of the first modification result with that of the second modification result;
and an output module 70, configured to output the modification result with the lower perplexity.
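The way the modules listed above fit together can be sketched as a single function. Every stage is passed in as a callable, so the sketch says nothing about the models themselves; the function name, the `[SEP]` separator, and the assumption that the spelling correction stage returns only the corrected input sentence (without the retrieved text) are illustrative choices that the text does not specify.

```python
def proofread(text, retrieve, spell_correct, seq2edit, seq2seq, lm_perplexity,
              sep="[SEP]"):
    """Wire the stages together; each argument after `text` is a callable."""
    # Retrieval module 20: find the most similar text and splice it with the input.
    spliced = f"{retrieve(text)}{sep}{text}"
    # Spelling correction module 30 (assumed to return the corrected input only).
    spelled = spell_correct(spliced)
    # Sequence-to-edit module 40: first modification result.
    result_one = seq2edit(spelled)
    # Sequence-to-sequence module 50 works on the original input: second result.
    result_two = seq2seq(text)
    # Modules 60 and 70: output the result with the lower perplexity.
    # On a tie, min() returns the first candidate, i.e. result_one.
    return min((result_one, result_two), key=lm_perplexity)
```

Passing the stages as callables keeps the two correction branches independent, so either model can be replaced without touching the selection logic.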
The system is used to implement the above Chinese text proofreading method based on retrieval and multiple stages; to avoid redundancy, the details are not repeated here.
The invention also provides a Chinese text proofreading device that implements the above Chinese text proofreading method based on retrieval and multiple stages, so as to carry out Chinese text proofreading. The technical principle and advantageous effects of the device are similar to those of the above method and are not described again here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom still fall within the scope of the present invention.

Claims (9)

1. A search and multi-stage based Chinese text proofreading method, comprising:
S1: inputting a text to be corrected, searching a database for the text most similar to the text to be corrected, and splicing the most similar text with the text to be corrected to obtain a spliced text;
S2: performing spelling correction on the spliced text;
S3: performing sequence-to-edit grammar correction on the spelling-corrected text to obtain a first modification result;
S4: comparing the perplexity of the first modification result with that of a second modification result, the second modification result being obtained by applying sequence-to-sequence grammar correction to the text to be corrected under a set threshold range;
S5: taking the modification result with the lower perplexity as the final modification result;
the spelling correction of the spliced text comprises the following steps:
processing the erroneous-correct sentence pairs into a character-aligned format and feeding them into a BERT encoder to obtain features of the original sentence, and using a Glyce encoder with a CNN structure as a visual information encoder to obtain pronunciation and glyph features of the characters in the text;
fusing the pronunciation and glyph features with the features of the original sentence, inputting the fused features into a Transformer encoder, and obtaining the spelling-corrected sentence through a fully connected layer.
2. The search and multi-stage based Chinese text proofreading method of claim 1, wherein the sequence-to-edit grammar correction of the spelling-corrected text to obtain the first modification result is performed as follows:
performing grammar correction on the spelling-corrected text based on the GECToR model to obtain the first modification result.
3. The search and multi-stage based Chinese text proofreading method of claim 2, wherein the basic training data of the GECToR model are erroneous-correct sentence pairs; the input is an erroneous sentence, which is converted into the corresponding transformation tags; iterative error correction is carried out by a BERT encoder and sequence-tag transformations, and the corrected sentence is finally output.
4. The search and multi-stage based Chinese text proofreading method of claim 1, wherein the second modification result is obtained from the text to be corrected by sequence-to-sequence grammar correction under a set threshold range as follows:
the text to be corrected is corrected based on a seq2seq model, with the correction controlled by a set threshold range, to obtain the second modification result.
5. The search and multi-stage based Chinese text proofreading method of claim 4, wherein the basic training data of the seq2seq model are correction sentence pairs, each composed of an original sentence and a correct sentence; the input is an erroneous sentence, and the corrected sentence is output by an encoder-decoder model.
6. The search and multi-stage based Chinese text proofreading method of claim 5, wherein the erroneous sentence is subjected to BPE processing before being input into the encoder-decoder model, and the output of the encoder-decoder model is restored from BPE form.
7. The search and multi-stage based Chinese text proofreading method of claim 1, wherein the perplexity of the first modification result is compared with that of the second modification result, the second modification result being obtained by applying sequence-to-sequence grammar correction to the text to be corrected under a set threshold range, and wherein the perplexity formula is expressed as follows:
$$PPL(S) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i)}}$$
wherein $S$ represents a sentence, $N$ represents the sentence length, $\omega_i$ represents the $i$-th word, and $p(\omega_i)$ represents the probability of the $i$-th word.
8. A retrieval-based, multi-stage Chinese text proofreading system, comprising:
an input module, configured to input the text to be corrected;
a retrieval module, configured to search for the text most similar to the text to be corrected and to splice the most similar text with the text to be corrected to obtain a spliced text;
a spelling correction module, configured to perform spelling correction on the spliced text;
a sequence editing module, configured to perform grammar correction based on sequence editing on the spelling-corrected text to obtain a first modification result;
a sequence-to-sequence module, configured to perform sequence-to-sequence grammar correction on the text to be corrected, with a set threshold range, to obtain a second modification result;
a perplexity selection module, configured to compare the perplexity of the first modification result with that of the second modification result; and
an output module, configured to output the modification result with the lower perplexity;
wherein the spelling correction of the spliced text comprises the following steps:
processing the erroneous-correct sentence pairs into a word-aligned format and feeding them into a BERT encoder to obtain features of the original sentence, and using a Glyce encoder with a CNN structure as a visual information encoder to obtain pronunciation and glyph features of the characters in the text; and
fusing the pronunciation and glyph features with the features of the original sentence, inputting the fused features into a Transformer encoder, and obtaining the spelling-corrected sentence through a fully connected layer.
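The module arrangement of claim 8 can be sketched as a small pipeline. This is an assumption-laden outline rather than the patented system: each model stage is passed in as a hypothetical callable, retrieval uses a simple character-overlap similarity from the Python standard library in place of whatever retriever the patent uses, the "[SEP]" separator is assumed, and the spelling stage is assumed to return only the corrected target sentence.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Sequence


def retrieve_most_similar(query: str, corpus: Sequence[str]) -> str:
    """Return the corpus sentence most similar to the query (assumed metric)."""
    return max(corpus, key=lambda s: SequenceMatcher(None, query, s).ratio())


@dataclass
class ProofreadingPipeline:
    corpus: Sequence[str]
    spell_correct: Callable[[str], str]    # hypothetical spelling stage
    edit_correct: Callable[[str], str]     # hypothetical sequence-editing stage
    seq2seq_correct: Callable[[str], str]  # hypothetical seq2seq stage
    perplexity: Callable[[str], float]     # hypothetical language-model scorer
    separator: str = "[SEP]"               # assumed splice marker

    def run(self, text: str) -> str:
        # Retrieval module: find the most similar text and splice it on.
        similar = retrieve_most_similar(text, self.corpus)
        spliced = f"{similar}{self.separator}{text}"
        # Branch one: spelling correction, then sequence-editing correction.
        result_one = self.edit_correct(self.spell_correct(spliced))
        # Branch two: seq2seq correction applied to the original input.
        result_two = self.seq2seq_correct(text)
        # Selection module: output the result with the lower perplexity.
        return min((result_one, result_two), key=self.perplexity)
```

Keeping the stages as injected callables makes the two correction branches easy to swap or test in isolation.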
9. A Chinese text proofreading device that applies the retrieval-based, multi-stage Chinese text proofreading method according to any one of claims 1 to 7 to implement Chinese text proofreading.
CN202211639239.4A 2022-12-20 2022-12-20 Chinese text proofreading method, system and equipment based on retrieval and multiple stages Active CN115630635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211639239.4A CN115630635B (en) 2022-12-20 2022-12-20 Chinese text proofreading method, system and equipment based on retrieval and multiple stages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211639239.4A CN115630635B (en) 2022-12-20 2022-12-20 Chinese text proofreading method, system and equipment based on retrieval and multiple stages

Publications (2)

Publication Number Publication Date
CN115630635A CN115630635A (en) 2023-01-20
CN115630635B CN115630635B (en) 2023-04-25

Family

ID=84910787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211639239.4A Active CN115630635B (en) 2022-12-20 2022-12-20 Chinese text proofreading method, system and equipment based on retrieval and multiple stages

Country Status (1)

Country Link
CN (1) CN115630635B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779970B (en) * 2021-09-24 2023-05-23 北京字跳网络技术有限公司 Text error correction method, device, equipment and computer readable storage medium
CN114065738B (en) * 2022-01-11 2022-05-17 湖南达德曼宁信息技术有限公司 Chinese spelling error correction method based on multitask learning

Also Published As

Publication number Publication date
CN115630635A (en) 2023-01-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231225

Address after: 215000 Bamboo Garden Road, Suzhou high tech Zone, Jiangsu Province, No. 209

Patentee after: SUZHOU CHADA SOFTWARE RESEARCH & DEVELOPMENT Co.,Ltd.

Address before: No. 188, Shihu West Road, Wuzhong District, Suzhou City, Jiangsu Province

Patentee before: SOOCHOW University