CN113435186B - Chinese text error correction system, method, device and computer readable storage medium - Google Patents


Info

Publication number
CN113435186B
CN113435186B (application CN202110675560.7A)
Authority
CN
China
Prior art keywords
text
chinese
corrected
semantic
correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110675560.7A
Other languages
Chinese (zh)
Other versions
CN113435186A (en)
Inventor
海月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xijin Information Technology Co ltd
Original Assignee
Shanghai Xijin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xijin Information Technology Co ltd filed Critical Shanghai Xijin Information Technology Co ltd
Priority to CN202110675560.7A priority Critical patent/CN113435186B/en
Publication of CN113435186A publication Critical patent/CN113435186A/en
Application granted granted Critical
Publication of CN113435186B publication Critical patent/CN113435186B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06N20/00 Machine learning

Abstract

The invention relates to a Chinese text error correction system, method, and device based on a machine learning model, and a computer readable storage medium. The system comprises a Chinese text pre-training module, a Chinese text input module, a Chinese spell checking module, a Chinese spell correction module, a semantic correction module, and a grammar evaluation module. The Chinese text error correction method attends to coherence between adjacent sentences, avoids cases where a single phrase is correct in isolation but the wrong homophone has been chosen, and ensures that the semantics remain fluent and unbiased when the several phrases of a whole sentence are read together.

Description

Chinese text error correction system, method, device and computer readable storage medium
Technical Field
The invention relates to the technical field of computer word processing, in particular to a Chinese text error correction system, method and device based on a machine learning model and a computer readable storage medium.
Background
Although Chinese is the most widely spoken language in the world, its development in the field of machine learning still faces many limitations: Chinese pronunciation, character forms, grammatical ordering, and the like are complex, so there is strong demand for Chinese spell checking and error correction in both manual input and machine recognition.
Patent CN111639489A checks and corrects various errors in Chinese text with several machine learning methods, turning disfluent text into fluent, readable Chinese: it locates erroneous characters via perplexity, selects a correct modification using a confusion set and a language-model score to replace them, and finally returns a correct Chinese expression. The method supports multithreading, processes multiple text sentences concurrently, and corrects efficiently. However, it tends to ignore coherence between adjacent sentences, so an individual phrase may be correct while the wrong homophone is chosen, causing mis-correction and semantic bias when the several phrases of a whole sentence are combined. Moreover, similar phrases appearing earlier and later in the same text may receive different corrections, producing inconsistencies across the context and further semantic bias.
Disclosure of Invention
In view of the above, the present invention provides a Chinese text error correction system, method, device, and computer readable storage medium based on a machine learning model, so as to solve the technical problems in the existing express-delivery industry, where incomplete or non-standard information filled in by consumers causes delivery errors, and in turn delays, low delivery efficiency, and economic loss.
In order to solve the above problems, the present invention provides a Chinese text error correction system based on a machine learning model, the system comprising: a Chinese text pre-training module for pre-training Chinese text and obtaining its perplexity, confusion set, language model, and semantic model; a Chinese text input module for preprocessing the input text, deleting invalid punctuation and abnormal-length whitespace, and converting between Chinese and English punctuation and between encoding formats; a Chinese spell checking module for automatically returning the position of an incorrect character when a character in the Chinese text is misspelled; a Chinese spell correction module for locating the positions of wrong characters via the Chinese text pre-training module and the Chinese spell checking module, substituting candidate words one by one, computing a fluency result with the language model, and outputting the best spelling-corrected text as a first corrected text; a semantic correction module for computing a semantic fluency result over adjacent sentences of the first corrected text with the semantic model from the Chinese text pre-training module, reselecting candidate words to substitute characters one by one so that the semantics of adjacent sentences are unified, and outputting several semantically unified corrected texts as second corrected texts; and a grammar evaluation module for taking the second corrected texts as input, evaluating the total semantic score of each second corrected text, sorting the scores in descending order, and outputting the highest-scoring second corrected text as the final corrected text.
A Chinese text error correction method uses the above Chinese text error correction system and comprises the following steps. S1: pre-train Chinese text to obtain its perplexity, confusion set, language model, and semantic model. S2: preprocess the input text, deleting invalid punctuation and abnormal-length whitespace, and converting between Chinese and English punctuation and between encoding formats. S3: treat each character or punctuation mark as a position and process character by character; when a spelling error exists in the Chinese text, the system returns the position of the incorrect character. S4: after all suspected errors are located by error detection, substitute candidate words character by character, obtain fluency scores for the resulting set of candidate short texts from the language model, and finally output the best spelling-corrected text as the first corrected text. S5: compute the semantic fluency of adjacent sentences of the first corrected text with the semantic model from the Chinese text pre-training module, reselect candidate words to substitute characters one by one so that the semantics of adjacent sentences are unified, and output several semantically unified corrected texts as second corrected texts. S6: evaluate the total semantic score of each second corrected text, sort the scores in descending order, and output the highest-scoring second corrected text as the final corrected text.
Further, step S5 specifically comprises: S51: for each error-correction candidate word in the first corrected text, compute the score of each candidate with the semantic model; S52: accumulate the score of each word with those of its adjacent words to obtain the total semantic fluency score of the error-correction candidate; S53: sort the total semantic fluency scores of all first corrected texts in descending order and output the highest-scoring first corrected text as the second corrected text.
Further, step S6 specifically comprises: S61: compute the score of the semantics of each sentence in the second corrected text with the semantic model; S62: multiply the occurrence probability of each sentence in the second corrected text by those of its adjacent sentences to obtain the total semantic fluency score of the second corrected text; S63: sort the total semantic scores of all second corrected texts in descending order and output the highest-scoring second corrected text as the final corrected text.
Further, the semantic model scores the semantics within a sentence as follows. The semantic model is built by classifying and processing homophone words, confusable-sound words, reversed word order, incomplete words, visually similar characters, sensitive words, common-sense errors, and redundant characters, and computes

P(S) ≈ P(w1) · P(w2|w1) · P(w3|w2) · ... · P(wn|wn-1),

P(S) = p(w1…wn) = ∏ P(wi|wi-1),

P(wi|wi-1) = count(wi, wi-1) / count(wi-1);

where p(w1…wn) is the probability of the sentence, P(S) is the sentence score, n is the length of the sentence, P(wi|wi-1) is the conditional probability of two words co-occurring, w denotes a word, count(wi-1) is the number of occurrences of word wi-1 in the corpus, and count(wi, wi-1) is the number of times the two words occur together. For each sentence in the second corrected text, the occurrence probability p(w1…wn) is computed, and the probabilities of all sentences in the second corrected text are multiplied together to obtain the total semantic fluency score P(S) of the second corrected text.
Further, step S4 specifically comprises: S41: obtain a candidate set of replacement characters for each suspected wrong character; after all suspected errors are located by error detection, obtain similar-sound, similar-shape, and common-error candidate words for all suspected wrong characters from the confusion set; S42: substitute candidate characters at the character positions, enumerating every character of the confusion set as a replacement for each replaceable character, thereby obtaining a candidate set of short texts with the suspected wrong characters replaced; S43: obtain a fluency ranking of the candidate short texts based on the n-gram language model of S13, and select the sentence with the highest fluency score as the final candidate text.
Further, S43 specifically comprises: S431: with the word as the minimum computation unit, segment the text with an existing Chinese word segmentation model; S432: compute the occurrence frequency of common words in a given corpus with a given language model to obtain the fluency; S433: if the fluency of the candidate text is greater than a predefined threshold, replace the original text; S434: if the fluency of the final candidate text is less than the predefined threshold, the original text is correct and is retained.
Further, S3 specifically comprises: S31: remove special symbols from the training corpus and replace invalid characters in the text, where invalid characters are those other than Chinese, English, digits, and common punctuation; S32: split long text into short texts according to specific punctuation marks and whitespace; S33: return the positions of suspected incorrect characters by computing a likelihood probability value for each character from the perplexity and the occurrence probability of the word; if a character's likelihood probability value is lower than the average probability value of the text, judge it a suspected wrongly written character and return its position in the text.
A Chinese text correction apparatus, the apparatus comprising a memory, a processor and a Chinese text correction processing program stored in the memory and operable on the processor, the Chinese text correction processing program when executed by the processor implementing the steps of the Chinese text correction method.
A computer readable storage medium having stored thereon a Chinese text error correction processing program, which when executed by a processor implements the steps of the Chinese text error correction method.
The Chinese text error correction system, method, device, and computer readable storage medium based on a machine learning model support multithreading, process multiple text sentences concurrently, and correct efficiently; they check and correct various errors in Chinese text with several machine learning methods, turn disfluent text into fluent, readable Chinese, locate erroneous characters via perplexity, select a correct modification using a confusion set and a language model to replace them, and finally return a correct Chinese expression. In particular, this approach attends to coherence between adjacent sentences, avoids cases where a single phrase is correct in isolation but the wrong homophone is chosen, and ensures that the semantics remain fluent and unbiased when the several phrases of a whole sentence are read together.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
Fig. 1 is a block diagram of a Chinese text error correction system according to an embodiment of the present invention.
Fig. 2 is a flowchart of a Chinese text error correction method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As shown in FIG. 1, the present invention provides a Chinese text error correction system 10 based on a machine learning model. The system 10 includes a Chinese text pre-training module 1, a Chinese text input module 2, a Chinese spell checking module 3, a Chinese spell correction module 4, a semantic correction module 5, and a grammar evaluation module 6. The Chinese text pre-training module 1 pre-trains Chinese text and obtains its perplexity, confusion set, language model, and semantic model. The Chinese text input module 2 preprocesses the input text, deletes invalid punctuation and abnormal-length whitespace, and converts between Chinese and English punctuation and between encoding formats. The Chinese spell checking module 3 automatically returns the position of an incorrect character when a character in the Chinese text is misspelled. The Chinese spell correction module 4 locates wrong characters through the Chinese text pre-training module and the Chinese spell checking module, substitutes candidate words one by one, computes a fluency result with the language model, and outputs the best spelling-corrected text as a first corrected text. The semantic correction module 5 computes the semantic fluency of adjacent sentences of the first corrected text with the semantic model from the Chinese text pre-training module, reselects candidate words to substitute characters one by one so that the semantics of adjacent sentences are unified, and outputs several semantically unified corrected texts as second corrected texts. The grammar evaluation module 6 takes the second corrected texts as input, evaluates the total semantic score of each, sorts the scores in descending order, and outputs the highest-scoring second corrected text as the final corrected text.
The implementation principle is as follows: a deep-learning BERT model and a language model judge whether a sentence contains errors and determine the corresponding error types; the language model further detects the positions of wrongly written characters; and the errors are corrected using pinyin similar-sound features, stroke/Wubi edit-distance features, and language-model perplexity features.
In the Chinese text correction task, common error types include: (1) homophone words, e.g. "with eyes" for "with glasses"; (2) confusable-sound words, e.g. "wandering girl" for "cowherd girl"; (3) reversed word order, e.g. "Wudi Allen" for "Allen Wudi"; (4) incomplete words; (5) visually similar characters, e.g. "jowar" for "sorghum"; (6) sensitive words; (7) common-sense errors; (8) redundant characters, e.g. an extra character inserted into "the Third Ring auxiliary road at the shop village has a sharp turn".
A semantic model formed in the Chinese text pre-training module 1 scores the full-text semantics. This approach attends to coherence between adjacent sentences, avoids cases where a single phrase is correct in isolation but the wrong homophone is chosen, and keeps the semantics fluent and unbiased when the several phrases of the whole sentence are read together.
As shown in FIG. 2, the present invention provides a Chinese text error correction method including the above Chinese text error correction system, the Chinese text error correction method including the following steps S1-S6.
S1: and pre-training the Chinese text to obtain the confusion degree, the confusion set, the language model and the semantic model of the Chinese text.
S2: preprocessing the input text, deleting the abnormal punctuations and the spaces with abnormal length, and converting Chinese and English punctuations and coding formats.
S3: and positioning the position of the incorrect character, taking each character or punctuation as a position, performing residual processing by taking the character as a unit, and returning the position of the incorrect character by the system when the Chinese text has spelling errors.
In this embodiment, S3 specifically comprises: S31: remove special symbols from the training corpus and replace invalid characters in the text, where invalid characters are those other than Chinese, English, digits, and common punctuation; S32: split long text into short texts according to specific punctuation marks and whitespace; S33: return the positions of suspected incorrect characters by computing a likelihood probability value for each character from the perplexity and the occurrence probability of the word; if a character's likelihood probability value is lower than the average probability value of the text, judge it a suspected wrongly written character and return its position in the text.
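The splitting and suspect-detection of S31-S33 can be sketched as follows. This is a minimal, hedged illustration: the function names are assumptions, and the per-character probability table stands in for the patent's pre-trained language model.

```python
import re

def split_short_texts(text):
    # S32: split long text into short texts at sentence punctuation and whitespace.
    parts = re.split(r"[。！？；\s]+", text)
    return [p for p in parts if p]

def suspect_positions(chars, char_prob):
    # S33: compute a likelihood value per character; positions whose likelihood
    # falls below the text average are returned as suspected wrong characters.
    probs = [char_prob.get(c, 1e-6) for c in chars]
    avg = sum(probs) / len(probs)
    return [i for i, p in enumerate(probs) if p < avg]
```

In this toy table, the unusual character 汽 scores far below the text average and is flagged, while ordinary characters pass.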
S4: and outputting a first corrected text, positioning all suspected errors through error detection, replacing the characters one by using candidate words, obtaining a smoothness calculation result of the similar candidate short text set based on a language model, and finally selecting an optimal spelling corrected text to output to form the first corrected text.
In this embodiment, step S4 specifically comprises: S41: obtain a candidate set of replacement characters for each suspected wrong character; after all suspected errors are located by error detection, obtain similar-sound, similar-shape, and common-error candidate words for all suspected wrong characters from the confusion set; S42: substitute candidate characters at the character positions, enumerating every character of the confusion set as a replacement for each replaceable character, thereby obtaining a candidate set of short texts with the suspected wrong characters replaced; S43: obtain a fluency ranking of the candidate short texts based on the n-gram language model of S13, and select the sentence with the highest fluency score as the final candidate text.
In this embodiment, S43 specifically comprises: S431: with the word as the minimum computation unit, segment the text with an existing Chinese word segmentation model; S432: compute the occurrence frequency of common words in a given corpus with a given language model to obtain the fluency; S433: if the fluency of the candidate text is greater than a predefined threshold, replace the original text; S434: if the fluency of the final candidate text is less than the predefined threshold, the original text is correct and is retained.
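The candidate enumeration and threshold decision of S41-S43 and S431-S434 can be sketched as below. The confusion set and the fluency scorer are toy stand-ins for the pre-trained confusion set and n-gram language model; the names are illustrative, not from the patent.

```python
def candidate_texts(text, pos, confusion):
    # S42: enumerate every confusion-set character for the suspect position.
    return [text[:pos] + c + text[pos + 1:] for c in confusion.get(text[pos], [])]

def correct_position(text, pos, confusion, fluency, threshold):
    # S433/S434: adopt the best-scoring candidate only if its fluency clears
    # the predefined threshold; otherwise the original text is kept.
    candidates = candidate_texts(text, pos, confusion)
    if not candidates:
        return text
    best = max(candidates, key=fluency)
    return best if fluency(best) > threshold else text
```

With a high threshold the original survives, reflecting S434's "the original text is correct and is retained".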
S5: and forming a second corrected text according to the semantic compliance, calculating the semantic compliance result in upper and lower sentences of the first corrected text through a semantic model in a Chinese text pre-training module, reselecting candidate words for the first corrected text to replace characters one by one so that the semantics in the upper and lower sentences are unified, and outputting a plurality of corrected texts with unified semantics to form the second corrected text.
In this embodiment, step S5 specifically comprises: S51: for each error-correction candidate word in the first corrected text, compute the score of each candidate with the semantic model; S52: accumulate the score of each word with those of its adjacent words to obtain the total semantic fluency score of the error-correction candidate; S53: sort the total semantic fluency scores of all first corrected texts in descending order and output the highest-scoring first corrected text as the second corrected text.
S6: and outputting the final corrected text, evaluating the total semantic scores in all the second corrected texts, sorting the total semantic scores in all the second corrected texts from large to small, and outputting the second corrected text with the highest score as the final corrected text.
In this embodiment, step S6 specifically comprises: S61: compute the score of the semantics of each sentence in the second corrected text with the semantic model; S62: multiply the occurrence probability of each sentence in the second corrected text by those of its adjacent sentences to obtain the total semantic fluency score of the second corrected text; S63: sort the total semantic scores of all second corrected texts in descending order and output the highest-scoring second corrected text as the final corrected text.
In this embodiment, the semantic model scores the semantics within a sentence as follows. The semantic model is built by classifying and processing homophone words, confusable-sound words, reversed word order, incomplete words, visually similar characters, sensitive words, common-sense errors, and redundant characters, and computes

P(S) ≈ P(w1) · P(w2|w1) · P(w3|w2) · ... · P(wn|wn-1),

P(S) = p(w1…wn) = ∏ P(wi|wi-1),

P(wi|wi-1) = count(wi, wi-1) / count(wi-1);

where p(w1…wn) is the probability of the sentence, P(S) is the sentence score, n is the length of the sentence, P(wi|wi-1) is the conditional probability of two words co-occurring, w denotes a word, count(wi-1) is the number of occurrences of word wi-1 in the corpus, and count(wi, wi-1) is the number of times the two words occur together. For each sentence in the second corrected text, the occurrence probability p(w1…wn) is computed, and the probabilities of all sentences in the second corrected text are multiplied together to obtain the total semantic fluency score P(S) of the second corrected text.
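The bigram scoring above can be computed directly from co-occurrence counts. The sketch below is illustrative only: the toy corpus and the smoothing constant `eps` for unseen bigrams are assumptions, not part of the patent.

```python
from collections import Counter

corpus = list("我爱我的祖国")
unigram = Counter(corpus)                  # count(w)
bigram = Counter(zip(corpus, corpus[1:]))  # count of each adjacent pair (wi-1, wi)

def sentence_prob(words, eps=1e-9):
    # p(w1...wn) = P(w1) * product of P(wi|wi-1) = count(wi-1, wi)/count(wi-1)
    total = sum(unigram.values())
    p = unigram.get(words[0], 0) / total
    for prev, cur in zip(words, words[1:]):
        p *= bigram.get((prev, cur), eps) / max(unigram.get(prev, 1), 1)
    return p

def total_score(sentences):
    # Total semantic fluency score P(S): product over all sentences of the text.
    p = 1.0
    for s in sentences:
        p *= sentence_prob(s)
    return p
```

A bigram seen in the corpus scores far higher than an unseen one, which is how the model prefers the fluent candidate.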
In another embodiment of the present application, text correction is performed as follows.
(1) The error correction scheme comprises two steps, wherein the first step is error detection, and the second step is error correction;
(2) an error detection section:
firstly, dividing the language materials into nine categories according to the existing linguistic data, and using a bert pre-training model to finely adjust a classification model.
Second, the kenlm tool is used to train the n-gram model on roughly 20 GB of corpora from the People's Daily and Chinese Wikipedia.
Third, linguistic dictionaries are built for confusable words, homophones, visually similar characters, sensitive words, and a common-sense lexicon, together with a jieba word-segmentation user dictionary containing person names, place names, organization names, and the like.
Detection combines the BERT model and the n-gram language model. A Chinese word segmenter cuts the sentence into words; because the sentence contains wrongly written characters, the segmentation result is often mis-split, so errors are detected at both character granularity and word granularity. Character granularity: if the language-model perplexity (ppl) shows a character's likelihood probability value below the sentence's text average, that character is very likely a suspected wrongly written character. Word granularity: a segmented word absent from the dictionary is very likely a suspected wrong word. The suspected results at the two granularities are merged into a candidate set of suspected error positions.
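The two-granularity merge can be sketched as follows. The inputs are toy stand-ins: in the real embodiment the per-character scores come from a kenlm language model and the segments from a Chinese word segmenter such as jieba.

```python
def detect_suspects(char_ppl, segments, lexicon):
    # Character granularity: perplexity above the sentence average is suspect.
    avg = sum(char_ppl) / len(char_ppl)
    hits = {i for i, p in enumerate(char_ppl) if p > avg}
    # Word granularity: segmented words missing from the dictionary mark all
    # of their character positions as suspect.
    pos = 0
    for w in segments:
        if w not in lexicon:
            hits.update(range(pos, pos + len(w)))
        pos += len(w)
    return sorted(hits)  # merged candidate set of suspected error positions
```

Here the mis-split word 天汽 contributes positions 0-1 at word granularity, and position 1 is also flagged by its high perplexity.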
specifically, the confusion of the characters or words at different positions is determined according to the size of the confusion of the language model, the larger the confusion is, the more likely the characters or words are mistaken, a threshold value is set, and if the threshold value is exceeded, the error is determined.
(3) An error correction section:
firstly, traversing all suspected error positions, replacing words in the error positions by using similar dictionaries, then calculating sentence confusion degree through a language model, comparing and sequencing results of all candidate sets, and obtaining a combination with the minimum model confusion degree to obtain an optimal corrected word, wherein a corresponding sensitive word library is required to be established for sensitive word problems and common sense errors.
(4) Multi-word, missing word, out-of-order part:
aiming at the problems of multiple words, few words and disordered word sequences, only error prompts can be given, and error positions and modification suggestions cannot be effectively identified, a crf algorithm model is used for identifying specific error positions in the prior art, and the identification effect is poor due to the lack of linguistic data and the imbalance of labels.
(5) And (3) a model part:
BERT model; kenlm statistical language-model tool.
(6) And (3) semantic part:
semantic scoring is performed in the same manner as in steps S5 and S6.
A Chinese text error correction apparatus comprises a memory, a processor, and a Chinese text error correction processing program stored in the memory and executable on the processor, the program implementing the steps of the Chinese text error correction method when executed by the processor.
A computer-readable storage medium has stored thereon a Chinese text error correction processing program which, when executed by a processor, implements the steps of the Chinese text error correction method.
The Chinese text error correction system, method, apparatus, and computer-readable storage medium based on a machine learning model support multi-threaded processing: multiple text sentences can be corrected concurrently, so correction throughput is high. Several machine learning methods check and correct multiple kinds of errors in Chinese text, turning disordered text into fluent, readable Chinese: the positions of erroneous characters are located via perplexity, a confusion set and a language model select the correct replacement for each erroneous character, and a correct Chinese expression is finally returned. In particular, this approach attends to the coherence of adjacent sentences, avoiding the case where an individual phrase is correct in isolation but a homophone phrase is chosen wrongly, so that the semantics remain fluent and undistorted when several phrases are joined in a whole sentence.
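The multi-threaded, sentence-concurrent processing described above might be organized as below; this is a minimal sketch, and `correct_sentence` is a hypothetical stand-in for the full detect/replace/rank pipeline, not the patented implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def correct_sentence(sentence):
    # Hypothetical stand-in for the detect -> replace -> rank pipeline;
    # here it just fixes one hard-coded typo for illustration.
    return sentence.replace("teh", "the")

def correct_text(sentences, max_workers=4):
    """Correct many sentences concurrently; pool.map preserves input order,
    so results line up with the original sentence list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(correct_sentence, sentences))
```

Because each sentence is corrected independently, the work parallelizes cleanly across threads (or processes, for CPU-bound models).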
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only some exemplary embodiments of the present invention; their description is specific and detailed but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (8)

1. A Chinese text error correction system, the system being based on a machine learning model and comprising:
the Chinese text pre-training module is used for pre-training the Chinese text and acquiring the confusion degree, the confusion set, the language model and the semantic model of the Chinese text;
the Chinese text input module is used for preprocessing the input text, deleting the non-used punctuations and the spaces with abnormal length, and converting Chinese and English punctuations and coding formats;
the Chinese spelling check module is used for automatically returning the position of an incorrect character when the character in the Chinese text has misspelling;
the Chinese spelling correction module is used for positioning the positions of wrong characters through the Chinese text pre-training module and the Chinese spelling checking module, replacing the characters one by using candidate words, calculating a smoothness result through a language model, and selecting an optimal spelling correction text to output to form a first correction text;
the semantic correction module is used for calculating a semantic smoothness result across the upper and lower sentences of the first corrected text through a semantic model in the Chinese text pre-training module, reselecting candidate words for the first corrected text to replace characters one by one so that the semantics of the upper and lower sentences are unified, and outputting a plurality of corrected texts with unified semantics to form a second corrected text; the semantic correction module is specifically configured to: calculate, for each error correction candidate word in the first corrected text, the score of each candidate word by using the semantic model; accumulate the scores of each word and its adjacent words among the error correction candidate words to obtain the total semantic smoothness score of the error correction candidate words; and sort the total semantic smoothness scores of all the first corrected texts from large to small, and output the first corrected text with the highest score as the second corrected text;
the grammar language evaluation module is used for receiving the second corrected texts, evaluating the total semantic scores of all the second corrected texts, sorting them from large to small, and outputting the second corrected text with the highest score as the final corrected text; the grammar language evaluation module is specifically configured to: calculate, using the semantic model, the semantic score of each sentence in the second corrected text; multiply the occurrence probability of each sentence in the second corrected text by the occurrence probabilities of the adjacent sentences to obtain the total semantic smoothness score of the second corrected text; and sort the total semantic scores of all the second corrected texts from large to small, and output the second corrected text with the highest score as the final corrected text.
2. A Chinese text error correction method is characterized by comprising the following steps:
s1: pre-training a Chinese text to obtain a confusion degree, a confusion set, a language model and a semantic model of the Chinese text;
s2: preprocessing the input text, deleting the unused punctuations and the spaces with abnormal length, and converting Chinese and English punctuations and coding formats;
s3: taking each character or punctuation as a position, performing residual processing by taking the character as a unit, and returning the position of an incorrect character by the system when a spelling error exists in the Chinese text;
s4: after all suspected errors are positioned through error detection, the characters are replaced one by using candidate words, the smoothness calculation result of the similar candidate short text set is obtained on the basis of a language model, and finally, the optimal spelling correction text is selected and output to form a first correction text;
s5: calculating semantic smoothness results across the upper and lower sentences of the first corrected text through the semantic model in the Chinese text pre-training module, reselecting candidate words for the first corrected text to replace characters one by one so that the semantics of the upper and lower sentences are unified, and outputting a plurality of corrected texts with unified semantics to form a second corrected text; the step S5 specifically includes: S51: for each error correction candidate word in the first corrected text, calculating the score of each candidate word by using the semantic model; S52: accumulating the scores of each word and its adjacent words among the error correction candidate words to obtain the total semantic smoothness score of the error correction candidate words; S53: sorting the total semantic smoothness scores of all the first corrected texts from large to small, and outputting the first corrected text with the highest score as the second corrected text;
s6: evaluating the total semantic scores of all the second corrected texts, sorting them from large to small, and outputting the second corrected text with the highest score as the final corrected text; the step S6 specifically includes: S61: calculating the semantic score of each sentence in the second corrected text by using the semantic model; S62: multiplying the occurrence probability of each sentence in the second corrected text by the occurrence probabilities of the adjacent sentences to obtain the total semantic smoothness score of the second corrected text; S63: sorting the total semantic scores of all the second corrected texts from large to small, and outputting the second corrected text with the highest score as the final corrected text.
3. The Chinese text error correction method according to claim 2, wherein the semantic model calculates the semantic score of each sentence in the second corrected text according to the following formulas:
errors are classified according to homophone characters, confusable-sound characters, reversed word order, incomplete words, misshapen characters, sensitive words, common-sense errors, and extra characters to form the semantic model, wherein the semantic model is P(S) ≈ P(w1)*P(w2|w1)*P(w3|w2)*...*P(wn|wn-1),
p(w1…wn) = P(w1)*P(w2|w1)*P(w3|w1w2)*...*P(wn|w1…wn-1),
P(wi|wi-1) = count(wi-1, wi)/count(wi-1); wherein p(w1…wn) is the probability of a sentence, P(S) is the sentence probability under the model, n is the length of the sentence, P(wi|wi-1) is the conditional probability of two words co-occurring, w denotes a word, count(wi-1) is the number of occurrences of the word wi-1 in the corpus, and count(wi-1, wi) is the number of times the two words wi-1 and wi appear together;
the sentence occurrence probability p(w1…wn) is calculated for each sentence in the second corrected text, and the probabilities of all sentences in the second corrected text are multiplied together to obtain the total semantic smoothness score P(S) of the second corrected text.
4. The method for correcting errors of chinese text according to claim 2, wherein the step S4 specifically includes:
s41: obtaining a candidate set of replacement characters for the suspected wrong characters: after all suspected errors have been located by error detection, the similar-sound, similar-shape, and common-error candidate words of every suspected wrong character are obtained from the confusion set;
s42: replacing the positions of the characters by using the candidate characters, and enumerating each character of the confusion set to replace the original character for each replaceable character, thereby obtaining a short text candidate set for replacing suspected wrong characters;
s43: obtaining a smoothness ranking of the candidate short texts based on the n-gram language model in S13, and selecting the sentence with the highest smoothness score as the final candidate text.
5. The method for correcting errors of chinese texts according to claim 4, wherein the S43 specifically is:
s431: taking the word as a minimum calculation unit, and performing word segmentation by using the existing Chinese word segmentation model;
s432: calculating the corresponding occurrence frequency of common words in a specific language database based on a specific language model to obtain the smoothness;
s433: replacing the original text if the text smoothness is greater than a predefined threshold;
s434: if the final candidate text smoothness is less than the predefined threshold, the original text is correct, and the original text is retained.
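Steps S431–S434 amount to a threshold-gated accept/reject rule, sketched below; the fluency-scoring function and the threshold value are assumptions supplied by the caller, not fixed by the claim:

```python
def apply_if_fluent(original, candidate, fluency, threshold=0.5):
    """S433/S434 sketch: adopt the candidate text only when its fluency
    (smoothness) score clears the predefined threshold; otherwise the
    original text is judged correct and retained."""
    return candidate if fluency(candidate) > threshold else original
```

In the claimed method, `fluency` would be the smoothness computed from word frequencies under the specific language model of S432.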
6. The method for correcting errors of chinese text according to claim 2, wherein the S3 specifically includes:
s31: removing special symbols in the training corpus, and replacing invalid characters in the text, wherein the invalid characters are characters except Chinese, English, numbers and common punctuations;
s32: dividing the long text into short texts, and dividing the long text into the short texts according to the specific punctuation marks and the spaces;
s33: and returning the suspected incorrect character position, calculating the likelihood probability value of each character by using the confusion degree and the occurrence probability of the word, and if the likelihood probability value of the character is lower than the average probability value of the text, judging that the character is the suspected wrongly-written character and returning the position of the character in the text.
7. An apparatus for chinese text error correction, the apparatus comprising a memory, a processor and a chinese text error correction processing program stored in the memory and executable on the processor, the chinese text error correction processing program when executed by the processor implementing the steps of the chinese text error correction method according to any one of claims 2 to 6.
8. A computer-readable storage medium, characterized in that a chinese text error correction processing program is stored thereon, and when executed by a processor, implements the steps of the chinese text error correction method according to any one of claims 2 to 6.
CN202110675560.7A 2021-06-18 2021-06-18 Chinese text error correction system, method, device and computer readable storage medium Active CN113435186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110675560.7A CN113435186B (en) 2021-06-18 2021-06-18 Chinese text error correction system, method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113435186A CN113435186A (en) 2021-09-24
CN113435186B true CN113435186B (en) 2022-05-20

Family

ID=77756389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110675560.7A Active CN113435186B (en) 2021-06-18 2021-06-18 Chinese text error correction system, method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113435186B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328798B (en) * 2021-11-09 2024-02-23 腾讯科技(深圳)有限公司 Processing method, device, equipment, storage medium and program product for searching text
CN114078254B (en) * 2022-01-07 2022-04-29 华中科技大学同济医学院附属协和医院 Intelligent data acquisition system based on robot
CN114065738B (en) * 2022-01-11 2022-05-17 湖南达德曼宁信息技术有限公司 Chinese spelling error correction method based on multitask learning
CN114510925A (en) * 2022-01-25 2022-05-17 森纵艾数(北京)科技有限公司 Chinese text error correction method, system, terminal equipment and storage medium
CN114611524B (en) * 2022-02-08 2023-11-17 马上消费金融股份有限公司 Text error correction method and device, electronic equipment and storage medium
CN116090441B (en) * 2022-12-30 2023-10-20 永中软件股份有限公司 Chinese spelling error correction method integrating local semantic features and global semantic features

Citations (4)

Publication number Priority date Publication date Assignee Title
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN111639489A (en) * 2020-05-15 2020-09-08 民生科技有限责任公司 Chinese text error correction system, method, device and computer readable storage medium
CN111723791A (en) * 2020-06-11 2020-09-29 腾讯科技(深圳)有限公司 Character error correction method, device, equipment and storage medium
CN112232055A (en) * 2020-10-28 2021-01-15 中国电子科技集团公司第二十八研究所 Text detection and correction method based on pinyin similarity and language model

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9779080B2 (en) * 2012-07-09 2017-10-03 International Business Machines Corporation Text auto-correction via N-grams



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant