CN112149406B - Chinese text error correction method and system - Google Patents

Chinese text error correction method and system

Info

Publication number
CN112149406B
CN112149406B (application CN202011021044.4A)
Authority
CN
China
Prior art keywords
text
word
determining
candidate sentence
error
Prior art date
Legal status
Active
Application number
CN202011021044.4A
Other languages
Chinese (zh)
Other versions
CN112149406A (en)
Inventor
钱宝生
杨军
曾擂
王滨
干家东
Current Assignee
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202011021044.4A priority Critical patent/CN112149406B/en
Publication of CN112149406A publication Critical patent/CN112149406A/en
Application granted granted Critical
Publication of CN112149406B publication Critical patent/CN112149406B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a Chinese text error correction method and system. The method comprises the following steps: acquiring a text to be corrected; determining the erroneous words and their positions in the text according to a statistical N-gram language model; determining a first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous words and their positions; converting the text to be corrected into a pinyin sequence; determining a second candidate sentence by using the N-gram model based on the pinyin sequence; and comparing the perplexity of every first candidate sentence in the first candidate sentence set with that of the second candidate sentence, taking the sentence with the lowest perplexity as the corrected text. The invention improves the error detection and correction rate for Chinese text while reducing hardware configuration requirements.

Description

Chinese text error correction method and system
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a Chinese text error correction method and system.
Background
Chinese text often contains errors of various kinds, such as similar-character errors, homophone errors, terminology errors, semantic errors, idiom errors, and the like. In important settings an erroneous document can cause significant losses, and manual proofreading is inefficient and time-consuming for large volumes of text. The technical difficulties of Chinese text correction are:
(1) Accuracy of named entity recognition: for rule-based errors, a dictionary of the corresponding domain must be constructed. Proofreading leaders' names, for example, requires a mapping between names and official positions that can be updated in real time; but because this information changes frequently and positions turn over often, out-of-sync dictionaries cause false error reports.
(2) Complexity of Chinese grammar rules: the most distinctive feature of standard Chinese grammar is the near absence of morphological inflection. Nouns have no case and are not marked for gender or number; verbs are not conjugated for person or tense. This feature, so different from European languages, historically led many linguists to conclude that Chinese has neither grammar nor parts of speech. Precisely because Chinese grammar is hard to pin down in theory, error correction for Chinese text is more difficult and false alarms may occur.
(3) Polyphony of Chinese characters: a Chinese character often has multiple readings. The character 还, for example, can be read in the second tone as huán, meaning "to return", or as hái, meaning "still" or "yet". Across different contexts, such errors are difficult to correct successfully.
Current error correction methods fall into three classes: rule-based methods, methods based on an N-gram statistical language model, and methods based on deep neural networks. Rule-based methods execute quickly but have poor accuracy and adaptability; N-gram methods can only handle collocation errors between adjacent words and lack syntactic analysis capability; deep neural network methods place high demands on hardware configuration.
Disclosure of Invention
The invention aims to provide a Chinese text error correction method and system that address the problems of existing Chinese text error correction methods: low accuracy, the ability to handle only collocation errors between adjacent words, lack of syntactic analysis capability, and high hardware configuration requirements.
In order to achieve the above object, the present invention provides the following solutions:
A method for error correction of Chinese text, comprising:
acquiring a text to be corrected;
determining the error words and the positions of the error words in the text to be corrected according to a statistical language N-gram model;
determining a first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous word and its position;
converting the text to be corrected into a pinyin sequence;
determining a second candidate sentence by using the N-gram model based on the pinyin sequence;
comparing the perplexity of every first candidate sentence in the first candidate sentence set with the perplexity of the second candidate sentence, and taking the sentence with the lowest perplexity as the corrected text.
Optionally, the determining of the erroneous word and its position in the text to be corrected according to the statistical N-gram language model is preceded by:
collecting an original webpage, preprocessing the original webpage, determining a Chinese text corpus, and forming a corpus dictionary;
performing word segmentation on texts in the corpus dictionary by using a word segmentation device, and determining a plurality of segmented texts;
counting the number of the text after word segmentation and the co-occurrence frequency of any two words;
and constructing an N-gram model according to the co-occurrence frequency.
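The four construction steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `build_bigram_model` and the toy corpus are made-up names, and the sketch assumes the corpus has already been segmented into word lists (e.g. by jieba).

```python
from collections import Counter

def build_bigram_model(tokenized_sentences):
    """Count word frequencies and adjacent-pair co-occurrences, then derive
    the conditional probabilities P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in tokenized_sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))  # co-occurrence of any two adjacent words
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

# Toy segmented corpus standing in for the preprocessed web-page text.
model = build_bigram_model([["the", "court", "ruled"], ["the", "court", "closed"]])
```

Here `model[("the", "court")]` is 1.0, since "court" follows "the" in every toy sentence, while `model[("court", "ruled")]` is 0.5.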
Optionally, the determining of the first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous word and its position specifically includes:
converting the text after word segmentation into a word vector matrix by using a word vector tool;
taking the word vector matrix as the input of an LSTM model, training the LSTM model with the forward propagation and backpropagation through time (BPTT) algorithms, and constructing a trained LSTM model;
substituting words in the corpus dictionary into the error word positions in the text to be corrected one by one, and determining substituted text;
inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the erroneous word position, and sorting the substituted texts in descending order of that probability to determine a first candidate sentence list;
and determining a first candidate sentence set according to the first candidate sentence list based on the error word.
Optionally, the determining, based on the error word, a first candidate sentence set according to the first candidate sentence list specifically includes:
judging whether the error word exists in the first candidate sentence list or not to obtain a first judging result;
if the first judging result indicates that the erroneous word exists in the first candidate sentence list, determining that the text to be corrected is correct;
if the first judging result indicates that the erroneous word does not exist in the first candidate sentence list, screening homophones and near-homophones of the erroneous word from the first candidate sentence list, and determining a second candidate sentence list from them;
and substituting the words in the second candidate sentence list into the positions of the error words in the text to be corrected one by one, and determining a first candidate sentence set.
Optionally, the determining, based on the pinyin sequence, a second candidate sentence by using the N-gram model specifically includes:
based on the pinyin sequence, constructing a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary;
and determining the probability of the plurality of candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
A Chinese text error correction system, comprising:
the text to be corrected acquisition module is used for acquiring the text to be corrected;
the error word and error word position determining module is used for determining the error word and error word position in the text to be corrected according to a statistical language N-gram model;
the first candidate sentence set determining module is used for determining a first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous words and their positions;
the pinyin sequence conversion module is used for converting the text to be corrected into a pinyin sequence;
the second candidate sentence determining module is used for determining a second candidate sentence by utilizing the N-gram model based on the pinyin sequence;
and the corrected text determining module is used for comparing the perplexity of every first candidate sentence in the first candidate sentence set with the perplexity of the second candidate sentence, and taking the sentence with the lowest perplexity as the corrected text.
Optionally, the method further comprises:
the corpus dictionary generating module is used for collecting original webpages, preprocessing the original webpages, determining a Chinese text corpus and forming a corpus dictionary;
the word segmentation module is used for carrying out word segmentation processing on texts in the corpus dictionary by utilizing a word segmentation device and determining a plurality of segmented texts;
the co-occurrence frequency determining module is used for counting the number of segmented texts and the co-occurrence frequency of any two words;
and the N-gram model building module is used for building an N-gram model according to the co-occurrence frequency.
Optionally, the first candidate sentence set determining module specifically includes:
the word vector matrix conversion unit is used for converting the text subjected to word segmentation into a word vector matrix by using a word vector tool;
the trained LSTM model construction unit is used for taking the word vector matrix as the input of the LSTM model, training the LSTM model with the forward propagation and backpropagation through time algorithms, and constructing the trained LSTM model;
the substituted text determining unit is used for substituting the words in the corpus dictionary into the error word positions in the text to be corrected one by one to determine the substituted text;
the first candidate sentence list determining unit is used for inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the erroneous word position, and sorting the substituted texts in descending order of that probability to determine a first candidate sentence list;
and the first candidate sentence set determining unit is used for determining a first candidate sentence set according to the first candidate sentence list based on the error words.
Optionally, the first candidate sentence set determining unit specifically includes:
the first judging subunit is used for judging whether the error word exists in the first candidate sentence list or not to obtain a first judging result;
a text to be corrected correctly determining subunit, configured to determine that the text to be corrected is correct if the first determination indicates that the error word exists in the first candidate sentence list;
a second candidate sentence list determining subunit, configured to screen homophones and near-homophones of the erroneous word from the first candidate sentence list if the first judging result indicates that the erroneous word does not exist in that list, and to determine a second candidate sentence list from them;
and the first candidate sentence set determining subunit is used for substituting the words in the second candidate sentence list into the error word positions in the text to be corrected one by one to determine a first candidate sentence set.
Optionally, the second candidate sentence determination module specifically includes:
a plurality of candidate sentence construction units, configured to construct a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary based on the pinyin sequence;
and the second candidate sentence determining unit is used for determining the probability of the plurality of candidate sentences by utilizing the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
According to the specific embodiments provided, the invention discloses the following technical effects. The invention provides a Chinese text error correction method and system that locate erroneous words in a text with an N-gram statistical language model, generate a first candidate sentence set with a bidirectional LSTM deep neural network model and a second candidate sentence from the pinyin-sequence edit distance, and select suitable replacement words by computing the perplexity of the candidate sentences. This improves the error detection and correction rate for Chinese text while keeping hardware configuration requirements low; the method can be applied to proofreading manuscript content in scenarios such as daily office work and has high practical value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for correcting errors of Chinese text provided by the invention;
FIG. 2 is a flowchart of another method for correcting errors in Chinese text according to the present invention;
fig. 3 is a diagram of a chinese text error correction system according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a Chinese text error correction method and a Chinese text error correction system, which can improve the error correction and error correction rate of Chinese text and reduce the hardware configuration requirement.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Fig. 1 is a flowchart of the Chinese text error correction method provided by the invention. As shown in fig. 1, the method includes:
step 101: and acquiring a text to be corrected.
Step 102: and determining the error words and the positions of the error words in the text to be corrected according to the statistical language N-gram model.
Training the N-gram statistical language model
Original web pages are collected from public document websites on the Internet and preprocessed to form a corpus of plain document text and a corpus dictionary. The jieba tokenizer then performs Chinese word segmentation; the counts of all words and the co-occurrence frequency of any two words are tallied, and the co-occurrence probabilities of all 2-gram word pairs are computed according to the N-gram formula, yielding a 2-gram (bigram) statistical language model:
P(S) ≈ P(w_1) * P(w_2|w_1) * P(w_3|w_2) * ... * P(w_n|w_{n-1})
The trained N-gram language model then performs error localization on the input sentence based on the co-occurrence of words in the training corpus: if the co-occurrence probability of an n-gram is below a threshold, an error is assumed to exist at that n-gram.
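A minimal sketch of this thresholded error localization, assuming a bigram-probability table like the one built above; `locate_errors` and the threshold value are illustrative choices, not taken from the patent.

```python
def locate_errors(words, bigram_prob, threshold=1e-4):
    """Return indices of words whose bigram with the preceding word has a
    co-occurrence probability below the threshold (suspected error sites)."""
    suspects = []
    for i in range(1, len(words)):
        if bigram_prob.get((words[i - 1], words[i]), 0.0) < threshold:
            suspects.append(i)  # index of the right-hand word of the rare pair
    return suspects

bp = {("the", "court"): 0.5, ("court", "ruled"): 0.2}
flagged = locate_errors(["the", "court", "rued"], bp)  # unseen pair is flagged
```

An unseen pair such as ("court", "rued") gets probability 0 and is flagged, while a sentence whose pairs all clear the threshold yields an empty list.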
Step 103: determining a first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous word and its position.
Step 103 specifically includes: converting the segmented text into a word vector matrix with a word vector tool; taking the word vector matrix as the input of an LSTM model, training the LSTM model with the forward propagation and backpropagation through time algorithms, and constructing a trained LSTM model; substituting words of the corpus dictionary one by one into the erroneous word position of the text to be corrected to determine substituted texts; inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the erroneous word position, and sorting the substituted texts in descending order of that probability to determine a first candidate sentence list; and determining a first candidate sentence set from the first candidate sentence list based on the erroneous word.
Determining the first candidate sentence set from the first candidate sentence list based on the erroneous word specifically includes: judging whether the erroneous word exists in the first candidate sentence list; if so, the N-gram model is deemed to have misjudged, and the text to be corrected is determined to be correct; if not, homophones and near-homophones of the erroneous word are screened from the list, a second candidate sentence list is determined from them, and the words of the second candidate sentence list are substituted one by one into the erroneous word position of the text to be corrected to determine the first candidate sentence set.
Training a bidirectional LSTM model by using the corpus, wherein the model training steps are as follows:
a) Converting sentences in the preprocessed text corpus into a word vector matrix through word2vec, and using the word vector matrix as an input of an LSTM model;
b) The model is trained with the forward propagation and backpropagation through time algorithms.
Words of the dictionary are substituted one by one into the error position of the sentence, each substituted sentence is fed into the trained bidirectional LSTM model, the probability of each word of the output dictionary is computed, the probabilities are sorted in descending order, and the top-K words are retained as set A.
Based on set A, the following determination is made: if the word the N-gram model flagged as erroneous is in set A, the N-gram detection is deemed a false alarm, i.e. the sentence contains no error; if it is not in set A, homophones and near-homophones of the word are selected from set A as a new set A', the words of A' are substituted one by one into the error position of the sentence to obtain the first candidate sentence set S, and the PPL of every sentence in S is computed.
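The set-A screening logic can be sketched as below. The bidirectional LSTM's per-word output distribution is stubbed with a plain dict, and `pinyin_of` is a hypothetical lookup table (in practice a pinyin library or dictionary would supply it); both names are assumptions.

```python
def screen_candidates(flagged_word, position_probs, pinyin_of, k=5):
    """Keep the top-k words at the error position as set A. If the flagged word
    is already in A, the N-gram detection is a false alarm (return None);
    otherwise keep only homophones of the flagged word as set A'."""
    set_a = [w for w, _ in sorted(position_probs.items(), key=lambda kv: -kv[1])[:k]]
    if flagged_word in set_a:
        return None  # sentence judged correct after all
    return [w for w in set_a if pinyin_of.get(w) == pinyin_of.get(flagged_word)]

probs = {"return": 0.5, "ring": 0.3, "still": 0.1}                      # stubbed LSTM output
pinyin = {"return": "huan", "ring": "huan", "typo": "huan", "still": "hai"}
set_a_prime = screen_candidates("typo", probs, pinyin)                   # homophones of the flagged word
```

Here "typo" is not in set A, so only its homophones ("return", "ring") survive; flagging a word that is already in set A would instead return `None`.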
Calculation formula of PPL:
A sentence s is composed of words, where w denotes a word:
s = w_1 w_2 ... w_N
PPL(S) = P(w_1 w_2 ... w_N)^(-1/N)
where P is the probability of the sentence and N is the sentence length, i.e. the number of words.
In particular, for the 2-gram model:
PPL(S) = [p(w_1) * p(w_2|w_1) * ... * p(w_N|w_{N-1})]^(-1/N)
where p(w_1 ... w_N) is the probability of the sentence and p(w_i|w_{i-1}) is the conditional co-occurrence probability of two words, which the trained 2-gram model outputs directly via:
P(w_i|w_{i-1}) = count(w_i, w_{i-1}) / count(w_{i-1})
where count(w_{i-1}) is the number of occurrences of the word w_{i-1} in the corpus and count(w_i, w_{i-1}) is the number of times the two words appear together.
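Under these formulas, a bigram PPL can be computed as follows. The smoothing `floor` for unseen pairs is an assumption for the sketch; the patent does not specify how unseen n-grams are handled.

```python
def bigram_ppl(words, bigram_prob, unigram_prob, floor=1e-8):
    """PPL(S) = [P(w1) * prod_{i>=2} P(wi | wi-1)] ** (-1/N), N = len(words).
    Unseen words/pairs fall back to a small floor probability (assumption)."""
    p = unigram_prob.get(words[0], floor)
    for prev, cur in zip(words, words[1:]):
        p *= bigram_prob.get((prev, cur), floor)
    return p ** (-1.0 / len(words))

# P(S) = 0.5 * 0.5 = 0.25 over two words, so PPL = 0.25 ** (-1/2) = 2.0
ppl = bigram_ppl(["a", "b"], {("a", "b"): 0.5}, {"a": 0.5})
```

Lower PPL means the model finds the sentence more plausible, which is why the method keeps the candidate with the smallest PPL.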
Step 104: and converting the text to be corrected into a pinyin sequence.
Step 105: and determining a second candidate sentence by utilizing the N-gram model based on the pinyin sequence.
The step 105 specifically includes: based on the pinyin sequence, constructing a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary; and determining the probability of the plurality of candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
Error positioning and correction based on pinyin sequence dynamic programming algorithm:
all the input text to be corrected (sentence X) is converted into phonetic sequences, each phonetic corresponds to one or more Chinese characters, and all the candidate Chinese characters form L candidate sentences according to the positions of the phonetic characters in the original sentence. Based on the 2-gram language model, calculating the probability size of each sentence:
P(S) ≈ P(w_1) * P(w_2|w_1) * P(w_3|w_2) * ... * P(w_n|w_{n-1})
The candidate sentence with the highest probability is selected as the second candidate sentence (sentence Y).
The text to be corrected X is compared with sentence Y; if the two sentences differ in the word at position i, the PPL of X and Y is computed as above.
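The pinyin expansion and selection steps can be sketched as follows. `homophones` stands in for a hypothetical pinyin-to-characters dictionary, and the per-sentence scoring function would in practice be the 2-gram probability above; both are assumptions of the sketch.

```python
from itertools import product

def pinyin_candidates(pinyin_seq, homophones):
    """Expand each syllable to its candidate characters and form the L
    candidate sentences as the Cartesian product of the candidate pools."""
    pools = [homophones[p] for p in pinyin_seq]
    return ["".join(chars) for chars in product(*pools)]

def second_candidate(candidates, sentence_prob):
    """Sentence Y: the candidate with the highest model probability."""
    return max(candidates, key=sentence_prob)

cands = pinyin_candidates(["ma", "shang"], {"ma": ["A", "B"], "shang": ["C"]})
y = second_candidate(cands, {"AC": 0.2, "BC": 0.8}.get)
```

With two candidates for "ma" and one for "shang", L = 2 candidate sentences are formed, and the higher-probability one becomes sentence Y.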
Step 106: comparing the confusion degree of all the first candidate sentences in the first candidate sentence set with the confusion degree of the second candidate sentences, and determining the sentence with the lowest confusion degree as the text after error correction.
The PPL values of X, Y and all sentences in S are compared; the sentence with the smallest PPL is selected and output as the corrected sentence.
In the Chinese text error correction method provided by the invention, the correction process is divided into two stages: error detection and error correction. In the detection stage, erroneous words and their possible positions in a sentence are judged from the word co-occurrence probabilities computed by the N-gram model. In the correction stage, a candidate word list is generated for each detected error position from the deep neural network model's computation, the candidate words for each erroneous word are ranked and filtered, and the best result is recommended to the user. Fig. 2 is a flowchart of another Chinese text error correction method provided by the invention, as shown in fig. 2.
In short, the method determines the positions of erroneous words in the text, obtains a probability distribution over candidate words at each such position, and selects from the candidate sentences the one with the lowest perplexity.
In practical application, the invention is specifically applied as follows:
(1) N-gram language model calculation process:
For Chinese typo detection, whether a sentence is correct can be judged by computing its probability. Given a sentence S = {w_1, w_2, ..., w_n}, the problem can be converted into the following form:
P(S) = P(w_1, w_2, ..., w_n) = P(w_1) * P(w_2|w_1) * ... * P(w_n|w_1, w_2, ..., w_{n-1})
P(S) is called a language model, i.e. a model used to compute the probability that a sentence is well formed.
When this formula is used for actual computation, the parameter space is too large and data sparsity is severe, making it impractical. In practice an N-gram model is adopted: based on the Markov assumption, the occurrence probability of a word depends only on the preceding one word or few words, and the formula evolves into:
(1) The occurrence of a word depends only on the preceding word, i.e. the bigram (2-gram):
P(S) ≈ P(w_1) * P(w_2|w_1) * P(w_3|w_2) * ... * P(w_n|w_{n-1})
(2) The occurrence of a word depends only on the preceding two words, i.e. the trigram (3-gram):
P(S) ≈ P(w_1) * P(w_2|w_1) * P(w_3|w_1 w_2) * ... * P(w_n|w_{n-2} w_{n-1})
The larger n is, the stronger the constraint on the next word, since more context is available; but the model also becomes more complex and its problems (parameter count, sparsity) grow, so a bigram or trigram is generally adopted.
The specific use of an n-gram is described below as a simple example:
The N-gram model builds a language model by counting words. For the bigram, the calculation formula is:
P(w_i|w_{i-1}) = count(w_i, w_{i-1}) / count(w_{i-1})
where P is the conditional co-occurrence probability of the two words and the two counts are occurrence statistics over the corpus.
The bigram is a 2-gram language model: the co-occurrence probability of word pairs is computed by counting pairs of adjacent words in the corpus. The trigram is likewise a 3-gram language model.
Assume now the following corpus, where <s1> and <s2> are sentence-head labels and </s2> and </s1> are sentence-tail labels:
<s1> <s2> yes no no no no yes </s2> </s1>
<s1> <s2> no no no yes yes yes no </s2> </s1>
The task is to evaluate the probability of the following sentence:
<s1> <s2> yes no no yes </s2> </s1>
results of calculating probabilities using trigram model:
P(yes|<s1>,<s2>)=1/2,
P(no|yes,no)=1/2,
P(</s2>|no,yes)=1/2,
P(no|<s2>,yes)=1,
P(yes|no,no)=2/5,
P(</s1>|yes,</s2>)=1
the required probability is equal to:
1/2×1×1/2×2/5×1/2×1=0.05
if the probability is less than a defined threshold, this indicates that there is an error in the sentence or that the sentence is not reasonable.
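The toy trigram computation above can be reproduced directly; the helper names here are illustrative.

```python
from collections import Counter

def trigram_model(corpus):
    """Maximum-likelihood trigram estimates P(w3 | w1, w2) from token lists."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        tri.update(zip(sent, sent[1:], sent[2:]))
        bi.update(zip(sent, sent[1:]))
    return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)]

corpus = [
    "<s1> <s2> yes no no no no yes </s2> </s1>".split(),
    "<s1> <s2> no no no yes yes yes no </s2> </s1>".split(),
]
p = trigram_model(corpus)

sentence = "<s1> <s2> yes no no yes </s2> </s1>".split()
prob = 1.0
for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
    prob *= p(w1, w2, w3)
# prob = 1/2 * 1 * 1/2 * 2/5 * 1/2 * 1 = 0.05, matching the hand calculation
```

A probability this low against a suitable threshold would flag the sentence as erroneous or implausible.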
Typos in Chinese text are local, so it suffices to check for typos within a suitably sized sliding window. An example follows:
The input text is "this case has been transferred by the superior court to the inferior court for processing", in which the character for "transferred" (pinyin chuan) has been mistyped as another character with the same pinyin. When the model analyzes the sentence locally, the computed co-occurrence probability of the word string falls below the threshold, so the analyzer rejects it and judges it erroneous.
The n-gram model thus detects that the word is mistyped. The word is then converted to its pinyin "chuan", candidate words for "chuan" are retrieved from the dictionary and substituted one by one, and the n-gram model checks which substitution is reasonable. In this way the n-gram model combines the pinyin of Chinese characters to correct typos in Chinese text.
(2) Missing checking process
The error detection module comprises a Bigram subword co-occurrence and neural network model.
Bigram subword co-occurrence: the number of co-occurrences of two subwords within a window of length k is counted over the collected large-scale corpus. Order is taken into account, so (w1, w2) is distinct from (w2, w1); finally, frequently co-occurring word pairs are retained as the initial information fed into the neural network language model.
Neural network language model: the invention adopts a neural network language model based on a bidirectional LSTM to capture the context of the input text. It predicts the probability of the word at the current position from that context, modeling the conditional probability of the current word given each word before and after it in the sentence, and thereby produces the final error-detection result for the current Chinese character, together with suspected error words and a candidate set.
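The patent's model is a bidirectional LSTM; as a dependency-free illustration of the same idea — conditioning the current word on both its left and right neighbors — the sketch below substitutes simple conditional counts for the neural network. It is a toy stand-in, not the patented model:

```python
from collections import defaultdict

class BidirectionalScorer:
    """Toy stand-in for the BiLSTM language model: estimates
    P(w | previous word) and P(w | next word) from corpus counts
    and averages the two directions to score a position."""
    def __init__(self, sentences):
        self.fwd = defaultdict(lambda: defaultdict(int))  # prev -> w -> count
        self.bwd = defaultdict(lambda: defaultdict(int))  # next -> w -> count
        for s in sentences:
            for i, w in enumerate(s):
                if i > 0:
                    self.fwd[s[i - 1]][w] += 1
                if i < len(s) - 1:
                    self.bwd[s[i + 1]][w] += 1

    def _p(self, table, ctx, w):
        total = sum(table[ctx].values())
        return table[ctx][w] / total if total else 0.0

    def score(self, sent, i):
        """Average of the forward and backward conditional
        probabilities of the word at position i; a low score marks
        the position as a suspected error."""
        ps = []
        if i > 0:
            ps.append(self._p(self.fwd, sent[i - 1], sent[i]))
        if i < len(sent) - 1:
            ps.append(self._p(self.bwd, sent[i + 1], sent[i]))
        return sum(ps) / len(ps) if ps else 0.0
```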
(3) Error correction process
By locating the erroneous words in the input text, combinations of candidate correct texts for the input text are generated from the candidate set, and the correction of the input text is selected according to the ranking result. Let Y = the input text and Yi = a sequence in the set of correct-text combinations for the input text.
Ranking score calculation:
Score = a1*ppl(Yi) + a2*edit_distance(Y, Yi) + a3*WordCount(Yi)
where ppl(Yi) is the perplexity assigned to Yi by the language model, edit_distance(Y, Yi) is the edit distance between Y and Yi, and WordCount(Yi) is the number of words in Yi. The language model used to compute the perplexity is a unidirectional LSTM statistical language model.
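The ranking score can be implemented directly; the weights a1–a3, the perplexity callback, and the Levenshtein routine below are illustrative choices (the patent does not fix the weights):

```python
def edit_distance(a, b):
    """Levenshtein distance via the single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def ranking_score(y, yi, ppl_fn, a1=1.0, a2=1.0, a3=1.0):
    """Score = a1*ppl(Yi) + a2*edit_distance(Y, Yi) + a3*WordCount(Yi).
    ppl_fn is a placeholder for the unidirectional-LSTM perplexity;
    len(yi) stands in for WordCount when yi is a token sequence."""
    return a1 * ppl_fn(yi) + a2 * edit_distance(y, yi) + a3 * len(yi)
```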
Fig. 3 is a diagram of the Chinese text error correction system according to the present invention. As shown in Fig. 3, the Chinese text error correction system includes:
the text to be corrected obtaining module 301 is configured to obtain the text to be corrected.
The system further includes: the corpus dictionary generating module, used for collecting original webpages, preprocessing them, determining a Chinese text corpus, and forming a corpus dictionary; the word segmentation module, used for performing word segmentation on texts in the corpus dictionary with a word segmenter and determining a plurality of segmented texts; the co-occurrence frequency determining module, used for counting the number of segmented texts and the co-occurrence frequency of any two words; and the N-gram model building module, used for building an N-gram model from the co-occurrence frequencies.
The wrong word and wrong word position determining module 302 is configured to determine a wrong word and a wrong word position in the text to be corrected according to a statistical language N-gram model.
The first candidate sentence set determining module 303 is configured to determine a first candidate sentence set using a two-way long-short term memory LSTM model based on the erroneous word and the erroneous word position.
The first candidate sentence set determining module 303 specifically includes: the word vector matrix conversion unit, used for converting the segmented text into a word vector matrix with a word vector tool; the trained LSTM model construction unit, used for taking the word vector matrix as the input of the LSTM model, training the LSTM model with a forward propagation algorithm and a delayed backpropagation algorithm, and constructing the trained LSTM model; the substituted text determining unit, used for substituting the words in the corpus dictionary one by one into the error word positions in the text to be corrected to determine substituted texts; the first candidate sentence list determining unit, used for inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the error word position, and sorting the substituted texts in order of occurrence probability from small to large to determine a first candidate sentence list; and the first candidate sentence set determining unit, used for determining a first candidate sentence set from the first candidate sentence list based on the error word.
The first candidate sentence set determining unit specifically includes: the first judging subunit, used for judging whether the error word exists in the first candidate sentence list to obtain a first judgment result; the correct-text determining subunit, used for determining, if the first judgment indicates that the error word exists in the first candidate sentence list, that the N-gram detection was a false alarm and that the text to be corrected is correct; the second candidate sentence list determining subunit, used for screening, if the first judgment indicates that the error word does not exist in the first candidate sentence list, homophones and near-homophones of the error word from the first candidate sentence set and determining a second candidate sentence list from them; and the first candidate sentence set determining subunit, used for substituting the words in the second candidate sentence list one by one into the error word positions in the text to be corrected to determine the first candidate sentence set.
And the pinyin sequence conversion module 304 is configured to convert the text to be corrected into a pinyin sequence.
A second candidate sentence determination module 305, configured to determine a second candidate sentence using the N-gram model based on the pinyin sequence.
The second candidate sentence determination module 305 specifically includes: a plurality of candidate sentence construction units, configured to construct a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary based on the pinyin sequence; and the second candidate sentence determining unit is used for determining the probability of the plurality of candidate sentences by utilizing the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
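The pinyin-to-candidate step of modules 304–305 can be sketched as follows. The pinyin-indexed dictionary and the probability callback are hypothetical, and exhaustive enumeration is shown only for clarity — a real system would prune the combinatorial space (e.g. with beam search):

```python
from itertools import product

# Hypothetical pinyin-indexed corpus dictionary.
PY_DICT = {"shi": ["是", "事"], "jian": ["件", "间"]}

def second_candidate(pinyin_seq, prob_fn):
    """Enumerate every character combination for the pinyin sequence
    and return the candidate sentence that the N-gram model (here,
    the prob_fn callback) scores highest."""
    choices = [PY_DICT.get(p, ["?"]) for p in pinyin_seq]
    candidates = ["".join(c) for c in product(*choices)]
    return max(candidates, key=prob_fn)
```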
The corrected text determining module 306 is configured to compare the confusion degree (perplexity) of every first candidate sentence in the first candidate sentence set with that of the second candidate sentence, and to take the sentence with the lowest confusion degree as the corrected text.
By combining a statistical language model with a deep neural network model, the present invention provides a Chinese text error correction method and system that markedly improve the error detection and correction rates for Chinese text; they can be applied to proofreading manuscript content in scenarios such as daily office work and have high practical value.
In this specification, the embodiments are described progressively, each focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples; the description is intended only to assist in understanding the method of the present invention and its core ideas. Modifications made by those of ordinary skill in the art in light of these teachings also fall within the scope of the invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (10)

1. A method for correcting errors in Chinese text, comprising:
acquiring a text to be corrected;
determining the error words and the positions of the error words in the text to be corrected according to a statistical language N-gram model;
determining a first candidate sentence set by utilizing a two-way long-short-term memory LSTM model based on the error word and the error word position;
converting the text to be corrected into a pinyin sequence;
determining a second candidate sentence by using the N-gram model based on the pinyin sequence;
comparing the confusion degree of all the first candidate sentences in the first candidate sentence set with the confusion degree of the second candidate sentences, and determining the sentence with the lowest confusion degree as the text after error correction.
2. The method for Chinese text correction according to claim 1, wherein said determining the erroneous word and the position of the erroneous word in the text to be corrected according to a statistical language N-gram model further comprises:
collecting an original webpage, preprocessing the original webpage, determining a Chinese text corpus, and forming a corpus dictionary;
performing word segmentation on texts in the corpus dictionary by using a word segmentation device, and determining a plurality of segmented texts;
counting the number of the text after word segmentation and the co-occurrence frequency of any two words;
and constructing an N-gram model according to the co-occurrence frequency.
3. The method for correcting Chinese text according to claim 2, wherein said determining a first candidate sentence set using a two-way long-short term memory LSTM model based on said erroneous word and said erroneous word position specifically comprises:
converting the text after word segmentation into a word vector matrix by using a word vector tool;
taking the word vector matrix as the input of an LSTM model, training the LSTM model by utilizing a forward propagation algorithm and a delayed reverse propagation algorithm, and constructing a trained LSTM model;
substituting words in the corpus dictionary into the error word positions in the text to be corrected one by one, and determining substituted text;
inputting the substituted text into the trained LSTM model, outputting the occurrence probability of each word in the corpus dictionary at the position of the wrong word, and sequencing the substituted text according to the order of the occurrence probability from small to large to determine a first candidate sentence list;
and determining a first candidate sentence set according to the first candidate sentence list based on the error word.
4. A method of Chinese text correction as recited in claim 3, wherein said determining a first candidate sentence set from said first candidate sentence list based on said erroneous word comprises:
judging whether the error word exists in the first candidate sentence list or not to obtain a first judging result;
if the first judgment indicates that the error word exists in the first candidate sentence list, determining that the text to be corrected is correct;
if the first judgment indicates that the error word does not exist in the first candidate sentence list, screening homophones and near-homophones of the error word from the first candidate sentence set, and determining a second candidate sentence list according to the homophones and near-homophones;
and substituting the words in the second candidate sentence list into the positions of the error words in the text to be corrected one by one, and determining a first candidate sentence set.
5. The method for correcting Chinese text according to claim 2, wherein the determining a second candidate sentence by using the N-gram model based on the pinyin sequence specifically comprises:
based on the pinyin sequence, constructing a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary;
and determining the probability of the plurality of candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
6. A Chinese text error correction system, comprising:
the text to be corrected acquisition module is used for acquiring the text to be corrected;
the error word and error word position determining module is used for determining the error word and error word position in the text to be corrected according to a statistical language N-gram model;
the first candidate sentence set determining module is used for determining a first candidate sentence set by utilizing a two-way long-short-term memory LSTM model based on the error words and the error word positions;
the pinyin sequence conversion module is used for converting the text to be corrected into a pinyin sequence;
the second candidate sentence determining module is used for determining a second candidate sentence by utilizing the N-gram model based on the pinyin sequence;
and the corrected text determining module is used for comparing the confusion degree of all the first candidate sentences in the first candidate sentence set with the confusion degree of the second candidate sentences, and determining the sentence with the lowest confusion degree as the corrected text.
7. The Chinese text error correction system of claim 6, further comprising:
the corpus dictionary generating module is used for collecting original webpages, preprocessing the original webpages, determining a Chinese text corpus and forming a corpus dictionary;
the word segmentation module is used for carrying out word segmentation processing on texts in the corpus dictionary by utilizing a word segmentation device and determining a plurality of segmented texts;
the co-occurrence frequency rate determining module is used for counting the number of the text after word segmentation and the co-occurrence frequency of any two words;
and the N-gram model building module is used for building an N-gram model according to the co-occurrence frequency.
8. The Chinese text error correction system of claim 7, wherein said first candidate sentence set determination module specifically comprises:
the word vector matrix conversion unit is used for converting the text subjected to word segmentation into a word vector matrix by using a word vector tool;
the trained LSTM model construction unit is used for taking the word vector matrix as the input of the LSTM model, training the LSTM model by utilizing a forward propagation algorithm and a delay reverse propagation algorithm, and constructing the trained LSTM model;
the substituted text determining unit is used for substituting the words in the corpus dictionary into the error word positions in the text to be corrected one by one to determine the substituted text;
the first candidate sentence list determining unit is used for inputting the substituted text into the trained LSTM model, outputting the occurrence probability of each word in the corpus dictionary at the position of the wrong word, and sequencing the substituted text according to the order of the occurrence probability from small to large to determine a first candidate sentence list;
and the first candidate sentence set determining unit is used for determining a first candidate sentence set according to the first candidate sentence list based on the error words.
9. The Chinese text error correction system of claim 8, wherein said first candidate sentence set determining unit specifically comprises:
the first judging subunit is used for judging whether the error word exists in the first candidate sentence list or not to obtain a first judging result;
a text to be corrected correctly determining subunit, configured to determine that the text to be corrected is correct if the first determination indicates that the error word exists in the first candidate sentence list;
a second candidate sentence list determining subunit, configured to screen homophones and near-homophones of the erroneous word from the first candidate sentence set if the first determination indicates that the erroneous word does not exist in the first candidate sentence list, and determine a second candidate sentence list according to the homophones and near-homophones;
and the first candidate sentence set determining subunit is used for substituting the words in the second candidate sentence list into the error word positions in the text to be corrected one by one to determine a first candidate sentence set.
10. The Chinese text error correction system of claim 7, wherein said second candidate sentence determination module comprises:
a plurality of candidate sentence construction units, configured to construct a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary based on the pinyin sequence;
and the second candidate sentence determining unit is used for determining the probability of the plurality of candidate sentences by utilizing the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
CN202011021044.4A 2020-09-25 2020-09-25 Chinese text error correction method and system Active CN112149406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011021044.4A CN112149406B (en) 2020-09-25 2020-09-25 Chinese text error correction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011021044.4A CN112149406B (en) 2020-09-25 2020-09-25 Chinese text error correction method and system

Publications (2)

Publication Number Publication Date
CN112149406A CN112149406A (en) 2020-12-29
CN112149406B true CN112149406B (en) 2023-09-08

Family

ID=73896929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011021044.4A Active CN112149406B (en) 2020-09-25 2020-09-25 Chinese text error correction method and system

Country Status (1)

Country Link
CN (1) CN112149406B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800987B (en) * 2021-02-02 2023-07-21 中国联合网络通信集团有限公司 Chinese character processing method and device
CN112735396A (en) * 2021-02-05 2021-04-30 北京小米松果电子有限公司 Speech recognition error correction method, device and storage medium
CN112989806A (en) * 2021-04-07 2021-06-18 广州伟宏智能科技有限公司 Intelligent text error correction model training method
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN113096667A (en) * 2021-04-19 2021-07-09 上海云绅智能科技有限公司 Wrongly-written character recognition detection method and system
CN113051896B (en) * 2021-04-23 2023-08-18 百度在线网络技术(北京)有限公司 Method and device for correcting text, electronic equipment and storage medium
CN112883717A (en) * 2021-04-27 2021-06-01 北京嘉和海森健康科技有限公司 Wrongly written character detection method and device
CN113343671B (en) * 2021-06-07 2023-03-31 佳都科技集团股份有限公司 Statement error correction method, device and equipment after voice recognition and storage medium
CN117113978A (en) * 2021-06-24 2023-11-24 湖北大学 Text error correction system for debugging by using shielding language model
CN113361266B (en) * 2021-06-25 2022-12-06 达闼机器人股份有限公司 Text error correction method, electronic device and storage medium
CN113780418A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Data screening method, system, equipment and storage medium
CN114328798B (en) * 2021-11-09 2024-02-23 腾讯科技(深圳)有限公司 Processing method, device, equipment, storage medium and program product for searching text
CN114495910B (en) * 2022-04-07 2022-08-02 联通(广东)产业互联网有限公司 Text error correction method, system, device and storage medium
CN115310434B (en) * 2022-10-11 2023-01-06 深圳擎盾信息科技有限公司 Error correction method and device for grammars of contracting documents, computer equipment and storage medium
CN115719059B (en) * 2022-11-29 2023-08-08 北京中科智加科技有限公司 Morse grouping error correction method
CN116090441B (en) * 2022-12-30 2023-10-20 永中软件股份有限公司 Chinese spelling error correction method integrating local semantic features and global semantic features
CN116306600B (en) * 2023-05-25 2023-08-11 山东齐鲁壹点传媒有限公司 MacBert-based Chinese text error correction method

Citations (2)

Publication number Priority date Publication date Assignee Title
CN110765996A (en) * 2019-10-21 2020-02-07 北京百度网讯科技有限公司 Text information processing method and device
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10431210B1 (en) * 2018-04-16 2019-10-01 International Business Machines Corporation Implementing a whole sentence recurrent neural network language model for natural language processing
US20200125639A1 (en) * 2018-10-22 2020-04-23 Ca, Inc. Generating training data from a machine learning model to identify offensive language

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN110765996A (en) * 2019-10-21 2020-02-07 北京百度网讯科技有限公司 Text information processing method and device
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field

Non-Patent Citations (1)

Title
Context-semantics-based error correction method for person names in news; Yang Yue; Huang Ruizhang; Wei Qin; Chen Yanping; Qin Yongbin; Journal of University of Electronic Science and Technology of China (Issue 06); full text *

Also Published As

Publication number Publication date
CN112149406A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
CN112149406B (en) Chinese text error correction method and system
CN111369996B (en) Speech recognition text error correction method in specific field
US7383172B1 (en) Process and system for semantically recognizing, correcting, and suggesting domain specific speech
US7424675B2 (en) Language input architecture for converting one text form to another text form with tolerance to spelling typographical and conversion errors
US7165019B1 (en) Language input architecture for converting one text form to another text form with modeless entry
JP6675463B2 (en) Bidirectional stochastic rewriting and selection of natural language
Wilcox-O’Hearn et al. Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model
US6311152B1 (en) System for chinese tokenization and named entity recognition
US20030046078A1 (en) Supervised automatic text generation based on word classes for language modeling
CN109145287B (en) Indonesia word error detection and correction method and system
CN111753529B (en) Chinese text error correction method based on pinyin identity or similarity
JP6778655B2 (en) Word concatenation discriminative model learning device, word concatenation detection device, method, and program
Lee et al. Automatic word spacing using probabilistic models based on character n-grams
CN114564912A (en) Intelligent checking and correcting method and system for document format
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
KR102204395B1 (en) Method and system for automatic word spacing of voice recognition using named entity recognition
Mekki et al. COTA 2.0: An automatic corrector of tunisian Arabic social media texts
Dinarelli Spoken language understanding: from spoken utterances to semantic structures
CN114548075A (en) Text processing method, text processing device, storage medium and electronic equipment
Parveen et al. Clause Boundary Identification using Classifier and Clause Markers in Urdu Language
KR19990070636A (en) Tagging device and its method
Duan et al. Research on Chinese Text Error Correction Based on Sequence Model
Athanaselis et al. A corpus based technique for repairing ill-formed sentences with word order errors using co-occurrences of n-grams
CN113033188B (en) Tibetan grammar error correction method based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant