CN112149406A - Chinese text error correction method and system - Google Patents
Chinese text error correction method and system
- Publication number
- CN112149406A (application CN202011021044.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- candidate sentence
- candidate
- determining
- Prior art date
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a Chinese text error correction method and a Chinese text error correction system. The Chinese text error correction method comprises the following steps: acquiring a text to be corrected; determining error words and error word positions in the text to be corrected according to a statistical language N-gram model; determining a first candidate sentence set by utilizing a bidirectional long short-term memory (LSTM) model based on the error words and the error word positions; converting the text to be corrected into a pinyin sequence; determining a second candidate sentence by using the N-gram model based on the pinyin sequence; and comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence to determine the sentence with the lowest perplexity as the corrected text. The invention can improve the error detection and correction rates for Chinese text and reduce hardware configuration requirements.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese text error correction method and a Chinese text error correction system.
Background
Chinese text often contains various errors, such as visually similar character errors, homophone errors, terminology errors, semantic errors, and errors in idioms or xiehouyu (two-part allegorical sayings). In important settings, documents with errors can cause significant losses, and manual proofreading is inefficient and extremely time-consuming for large volumes of text. The main technical difficulties in Chinese text error correction are as follows:
(1) Accuracy of named entity recognition: some rule-based errors require dictionaries for the corresponding domains. Proofreading officials' names, for example, requires a mapping between each official's name and position that can be updated in real time.
(2) Chinese grammar rules are complex: the most salient feature of standard Chinese grammar is the near absence of morphological inflection in the strict sense. Nouns have no case inflection, nor distinctions of gender and number; verbs are not conjugated for person and carry no tense. This feature, so different from European languages, led many linguists to hold for a long period of history that Chinese has no grammar and no parts of speech. Because Chinese grammar is so weakly constrained, the search space for error correction is large, and false alarms may occur.
(3) Polyphony and polysemy of Chinese characters: a single character often has multiple readings and meanings. The character 还, for example, can be read huán, meaning "to return, to give back", or hái, meaning "still, yet". Such errors are harder to correct reliably across different contexts.
Current error correction methods mainly comprise rule-based methods, methods based on N-gram statistical models, and methods based on deep neural networks. Rule-based methods execute quickly but have poor accuracy and adaptability; N-gram-based methods can only handle collocation errors between adjacent words and have no syntactic analysis capability; deep-neural-network-based methods place high demands on hardware configuration.
Disclosure of Invention
The invention aims to provide a Chinese text error correction method and a Chinese text error correction system that solve the problems of existing methods: low accuracy, the ability to handle only collocation errors between adjacent words, the lack of syntactic analysis capability, and high hardware configuration requirements.
In order to achieve the purpose, the invention provides the following scheme:
a Chinese text error correction method comprises the following steps:
acquiring a text to be corrected;
determining error words and error word positions in the text to be corrected according to a statistical language N-gram model;
determining a first candidate sentence set by utilizing a bidirectional long-short term memory (LSTM) model based on the error words and the error word positions;
converting the text to be corrected into a pinyin sequence;
determining a second candidate sentence by using the N-gram model based on the pinyin sequence;
and comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence to determine the sentence with the lowest perplexity as the corrected text.
Optionally, the determining, according to a statistical language N-gram model, an erroneous word and an erroneous word position in the text to be corrected further includes:
collecting an original webpage, preprocessing the original webpage, determining a Chinese text corpus and forming a corpus dictionary;
performing word segmentation processing on the texts in the corpus dictionary by using a word segmentation device, and determining a plurality of segmented texts;
counting the number of all words in the segmented texts and the co-occurrence frequency of any two words;
and constructing an N-gram model according to the co-occurrence frequency.
Optionally, the determining, based on the erroneous term and the position of the erroneous term, a first candidate statement set by using a bidirectional long-short term memory LSTM model specifically includes:
converting the text after word segmentation into a word vector matrix by using a word vector tool;
taking the word vector matrix as the input of an LSTM model, training the LSTM model with forward propagation and back-propagation through time (BPTT), and constructing the trained LSTM model;
substituting the characters in the corpus dictionary into the positions of the wrong words in the text to be corrected one by one, and determining the substituted text;
inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word in the corpus dictionary at the position of the wrong word, sorting the substituted texts in descending order of occurrence probability, and determining a first candidate sentence list;
based on the erroneous word, a first set of candidate sentences is determined from the first list of candidate sentences.
Optionally, the determining, based on the erroneous word and according to the first candidate sentence list, a first candidate sentence set specifically includes:
judging whether the wrong words exist in the first candidate sentence list or not to obtain a first judgment result;
if the first judgment result shows that the wrong word exists in the first candidate sentence list, determining that the text to be corrected is correct;
if the first judgment result shows that the wrong word does not exist in the first candidate sentence list, screening homophones and near-sound words of the wrong word from the first candidate sentence list, and determining a second candidate sentence list according to the homophones and the near-sound words;
and substituting the words in the second candidate sentence list into the positions of the wrong words in the text to be corrected one by one to determine a first candidate sentence set.
Optionally, the determining, based on the pinyin sequence, a second candidate sentence by using the N-gram model specifically includes:
based on the pinyin sequence, constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of pinyin in the text to be corrected;
and determining the probabilities of the candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
A Chinese text error correction system, comprising:
the text to be corrected acquiring module is used for acquiring a text to be corrected;
the error word and error word position determining module is used for determining the error word and the error word position in the text to be corrected according to a statistical language N-gram model;
a first candidate sentence set determining module, configured to determine a first candidate sentence set by using a bidirectional long-short term memory (LSTM) model based on the erroneous word and the erroneous word position;
the pinyin sequence conversion module is used for converting the text to be corrected into a pinyin sequence;
a second candidate sentence determination module, configured to determine a second candidate sentence by using the N-gram model based on the pinyin sequence;
and the corrected text determining module is used for comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence and determining the sentence with the lowest perplexity as the corrected text.
Optionally, the method further includes:
the corpus dictionary generating module is used for collecting original webpages, preprocessing the original webpages, determining a Chinese text corpus and forming a corpus dictionary;
the word segmentation module is used for performing word segmentation processing on the texts in the corpus dictionary by using a word segmentation device and determining a plurality of segmented texts;
the co-occurrence frequency determining module is used for counting the number of all words in the segmented texts and the co-occurrence frequency of any two words;
and the N-gram model building module is used for building an N-gram model according to the co-occurrence frequency.
Optionally, the first candidate sentence set determining module specifically includes:
the word vector matrix conversion unit is used for converting the text after word segmentation into a word vector matrix by using a word vector tool;
the trained LSTM model building unit is used for taking the word vector matrix as the input of the LSTM model, training the LSTM model with forward propagation and back-propagation through time (BPTT), and building the trained LSTM model;
the substituted text determining unit is used for substituting the characters in the corpus dictionary into the positions of the wrong words in the text to be corrected one by one to determine the substituted text;
a first candidate sentence list determining unit, configured to input the substituted texts into the trained LSTM model, output the occurrence probability of each word in the corpus dictionary at the position of the wrong word, sort the substituted texts in descending order of occurrence probability, and determine a first candidate sentence list;
a first candidate sentence set determination unit, configured to determine a first candidate sentence set according to the first candidate sentence list based on the erroneous word.
Optionally, the first candidate sentence set determining unit specifically includes:
the first judgment subunit is configured to judge whether the wrong word exists in the first candidate sentence list, so as to obtain a first judgment result;
a text to be corrected correctness determining subunit, configured to determine that the text to be corrected is correct if the first determination indicates that the erroneous word exists in the first candidate sentence list;
a second candidate sentence list determining subunit, configured to, if the first judgment result indicates that the wrong word does not exist in the first candidate sentence list, screen homophones and near-sound characters of the wrong word from the first candidate sentence list, and determine a second candidate sentence list according to the homophones and the near-sound characters;
and the first candidate sentence set determining subunit is used for substituting the words in the second candidate sentence list into the positions of the error words in the text to be corrected one by one to determine a first candidate sentence set.
Optionally, the second candidate sentence determining module specifically includes:
the candidate sentence construction units are used for constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of the pinyin in the text to be corrected based on the pinyin sequence;
a second candidate sentence determination unit configured to determine probabilities of the plurality of candidate sentences using the N-gram model, and to take a candidate sentence with a largest probability as the second candidate sentence.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects: the invention provides a Chinese text error correction method and system that locate erroneous words in a text with an N-gram statistical language model, generate a first candidate sentence set and a second candidate sentence with a bidirectional LSTM deep neural network model and with pinyin-sequence edit distance respectively, and select suitable correct words for replacement by calculating the perplexity of the candidate sentences. This improves the error detection and correction rates for Chinese text while keeping hardware configuration requirements low; the method and system can be applied to proofreading manuscript content in daily office work and other scenarios and have high practical value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a method for correcting errors in a Chinese text according to the present invention;
FIG. 2 is a flow chart of another method for correcting errors in Chinese text according to the present invention;
FIG. 3 is a structural diagram of a Chinese text error correction system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a Chinese text error correction method and a Chinese text error correction system, which can improve the error checking and correction rate of Chinese texts and reduce the hardware configuration requirement.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of the Chinese text error correction method according to the present invention. As shown in Fig. 1, the method comprises:
step 101: and acquiring the text to be corrected.
Step 102: and determining the error words and the positions of the error words in the text to be corrected according to a statistical language N-gram model.
Training N-gram statistical language model
About 50,000 original web pages are collected from public document websites on the Internet and preprocessed to form a corpus of plain document text and a corpus dictionary. Chinese word segmentation is performed with the jieba segmenter, the counts of all words and the co-occurrence counts of any two words are tallied, and the co-occurrence probabilities of all 2-gram (bigram) word pairs are calculated according to the N-gram model formula, forming a 2-gram (bigram) statistical language model:
P(S)≈P(w1)*P(w2|w1)*P(w3|w2)*...*P(wn|wn-1)
The trained N-gram language model is then used to locate errors in the input sentence: based on the co-occurrence statistics of words in the training corpus, if the co-occurrence probability of an n-gram is below a threshold, an error is considered to exist at that n-gram.
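To make the procedure concrete, the following is a minimal Python sketch of bigram training and threshold-based error localization as just described. The add-one smoothing and the threshold value are illustrative assumptions, not values specified by the invention.

```python
from collections import Counter

def train_bigram(segmented_sentences):
    """Count single words and adjacent-word co-occurrences over a segmented corpus."""
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    """P(w | w_prev) = count(w_prev, w) / count(w_prev), with add-one smoothing."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(unigrams))

def locate_errors(words, unigrams, bigrams, threshold=1e-4):
    """Flag positions whose bigram co-occurrence probability falls below the threshold."""
    return [(i, words[i])
            for i in range(1, len(words))
            if bigram_prob(words[i - 1], words[i], unigrams, bigrams) < threshold]
```

In practice the segmented corpus would be the large-scale document collection described above, and the threshold would be tuned on held-out text.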
Step 103: based on the erroneous terms and the erroneous term locations, a first set of candidate sentences is determined using a two-way long-short term memory (LSTM) model.
The step 103 specifically includes: converting the segmented texts into a word vector matrix with a word vector tool; taking the word vector matrix as the input of an LSTM model, training the LSTM model with forward propagation and back-propagation through time (BPTT), and constructing the trained LSTM model; substituting the words in the corpus dictionary one by one into the position of the wrong word in the text to be corrected to determine the substituted texts; inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word in the corpus dictionary at the position of the wrong word, sorting the substituted texts in descending order of occurrence probability, and determining a first candidate sentence list; and, based on the wrong word, determining a first candidate sentence set from the first candidate sentence list.
Determining the first candidate sentence set from the first candidate sentence list based on the wrong word specifically includes: judging whether the wrong word exists in the first candidate sentence list; if so, concluding that the N-gram model raised a false alarm and that the text to be corrected is correct; if not, screening homophones and near-sound characters of the wrong word from the list and determining a second candidate sentence list from them; and substituting the words in the second candidate sentence list one by one into the position of the wrong word in the text to be corrected to determine the first candidate sentence set.
The two-way LSTM model is trained by utilizing the corpus, and the model training steps are as follows:
a) converting sentences in the preprocessed text corpus into a matrix of word vectors through word2vec, and using the matrix as the input of an LSTM model;
b) the model is trained with forward propagation and back-propagation through time (BPTT), as sketched below.
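A minimal PyTorch sketch of such a bidirectional LSTM language model follows. It is an illustration under stated assumptions: the patent feeds word2vec vectors into the LSTM, whereas this sketch uses a trainable embedding layer, and all dimensions and the random batch are placeholders. Predicting token i from the forward state at i−1 and the backward state at i+1 keeps the target token out of its own context.

```python
import torch
import torch.nn as nn

class BiLSTMLanguageModel(nn.Module):
    """Scores each vocabulary word at every position from left and right context."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fwd = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        e = self.embed(tokens)
        h_fwd, _ = self.fwd(e)                     # left-to-right context states
        h_bwd, _ = self.bwd(e.flip(1))
        h_bwd = h_bwd.flip(1)                      # right-to-left context states
        left = torch.roll(h_fwd, 1, dims=1)        # forward state at i-1
        right = torch.roll(h_bwd, -1, dims=1)      # backward state at i+1
        left[:, 0] = 0.0                           # no left context at the first token
        right[:, -1] = 0.0                         # no right context at the last token
        return self.out(torch.cat([left, right], dim=-1))

# Training step: cross-entropy at every position; loss.backward() performs BPTT.
model = BiLSTMLanguageModel(vocab_size=5000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randint(0, 5000, (8, 20))            # stand-in for indexed, segmented text
loss = nn.CrossEntropyLoss()(model(batch).reshape(-1, 5000), batch.reshape(-1))
loss.backward()
opt.step()
```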
The words in the dictionary are substituted one by one into the error position in the sentence, each substituted sentence is input into the trained bidirectional LSTM model, and the model outputs the probability of each word in the dictionary at that position; the probabilities are sorted in descending order and the top K words are retained as set A.
The following determination is made based on set A: if the word flagged by the N-gram model is in set A, the N-gram detection is treated as a false alarm, i.e., the sentence contains no error; if the flagged word is not in set A, homophones and near-sound characters of the word are screened from set A to form a new set A', the words in A' are substituted one by one into the erroneous position of the sentence to obtain the first candidate sentence set S, and the PPL of every sentence in S is calculated.
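A sketch of this screening logic follows; score_fn and homophones are assumed helpers (the BiLSTM probability of a candidate word at the error position, and a homophone/near-sound lookup table), not names given by the invention.

```python
def first_candidate_set(words, err_pos, flagged, dictionary, score_fn, homophones, k=10):
    """Set-A / set-A' screening: returns None if the N-gram flag was a false alarm,
    otherwise the first candidate sentence set S."""
    scored = sorted(((score_fn(words, err_pos, w), w) for w in dictionary), reverse=True)
    set_a = {w for _, w in scored[:k]}
    if flagged in set_a:                       # flagged word is plausible -> false alarm
        return None
    set_a_prime = set_a & set(homophones(flagged))
    return [words[:err_pos] + [w] + words[err_pos + 1:] for w in set_a_prime]
```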
Calculation formula of PPL:
A sentence s is composed of words, where w denotes a word, namely:
s = w1 w2 … wN
PPL(S) = P(w1 w2 … wN)^(−1/N)
where P is the probability of the sentence and N is the length of the sentence, i.e., the number of words.
In particular, for the 2-gram model:
PPL(S) = [P(w1) · P(w2|w1) · … · P(wN|wN−1)]^(−1/N)
where P(w1 … wn) is the probability of the sentence and P(wi|wi−1) is the conditional probability of the two-word co-occurrence, which can be output directly by the trained 2-gram model and is computed as:
P(wi|wi−1) = count(wi, wi−1) / count(wi−1)
where count(wi−1) is the number of times the word wi−1 occurs in the corpus and count(wi, wi−1) is the number of times the two words occur together.
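Continuing the bigram sketch above (reusing bigram_prob), PPL can be computed in log space to avoid numeric underflow; the smoothing of P(w1) is again an illustrative assumption.

```python
import math

def sentence_ppl(words, unigrams, bigrams):
    """PPL(S) = P(w1 w2 ... wN)^(-1/N) under the 2-gram model."""
    total, vocab = sum(unigrams.values()), len(unigrams)
    log_p = math.log((unigrams[words[0]] + 1) / (total + vocab))      # smoothed P(w1)
    for prev, cur in zip(words, words[1:]):
        log_p += math.log(bigram_prob(prev, cur, unigrams, bigrams))  # P(wi | wi-1)
    return math.exp(-log_p / len(words))
```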
Step 104: and converting the text to be corrected into a pinyin sequence.
Step 105: and determining a second candidate sentence by utilizing the N-gram model based on the pinyin sequence.
The step 105 specifically includes: based on the pinyin sequence, constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of pinyin in the text to be corrected; and determining the probabilities of the candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
Error positioning and correction based on a pinyin sequence dynamic programming algorithm:
all input texts (sentences X) to be corrected are converted into pinyin sequences, each pinyin corresponds to one or more Chinese characters, and all candidate Chinese characters form L candidate sentences according to the positions of the pinyin in the original sentences. Calculating the probability size of each statement based on a 2-gram language model:
P(S)≈P(w1)*P(w2|w1)*P(w3|w2)*...*P(wn|wn-1)
the second candidate sentence (sentence Y) with the highest probability is selected.
The text X to be corrected is compared with sentence Y; if the two sentences differ in the character at some position i, the PPL of X and the PPL of Y are calculated, the PPL calculation being the same as above.
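The sketch below illustrates the pinyin path. It assumes the pypinyin package for Hanzi-to-pinyin conversion; pinyin_to_chars (a syllable-to-candidate-characters table built from the corpus dictionary) and lm_score (the 2-gram sentence probability) are assumed helpers. The patent mentions a dynamic programming algorithm, and a Viterbi-style search over the 2-gram lattice would avoid enumerating every combination; the brute-force cap here is purely illustrative.

```python
from itertools import product
from pypinyin import lazy_pinyin   # assumed dependency: Hanzi -> pinyin syllables

def best_pinyin_candidate(sentence, pinyin_to_chars, lm_score, cap=10000):
    """Builds candidate sentences from the pinyin sequence and returns sentence Y,
    the candidate with the highest 2-gram probability."""
    syllables = lazy_pinyin(sentence)  # assumes a pure-Hanzi sentence, one syllable per char
    pools = [pinyin_to_chars.get(s, [c]) for s, c in zip(syllables, sentence)]
    candidates = []
    for combo in product(*pools):      # the L candidate sentences
        candidates.append("".join(combo))
        if len(candidates) >= cap:
            break
    return max(candidates, key=lm_score)
```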
Step 106: comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence to determine the sentence with the lowest perplexity as the corrected text.
The PPL values of X, Y, and all sentences in S are compared, and the sentence with the smallest PPL value is selected and output as the corrected sentence.
In the Chinese text error correction method provided by the invention, the correction process is divided into two stages: error detection and error correction. Error detection judges the possibly erroneous words and their positions in a sentence according to the word co-occurrence probabilities calculated by the N-gram model. The error correction stage first generates, from the detected error positions and erroneous words, a candidate word list for each position according to the deep neural network model, then ranks and screens the candidates for each erroneous word and recommends the best result to the user. Fig. 2 is another flowchart of the Chinese text error correction method provided by the invention, as shown in Fig. 2.
The method locates and corrects erroneous words in the text in two ways and takes perplexity (PPL) as the measuring index (perplexity is a measure of the quality of a probabilistic language model in natural language processing; a language model can be regarded as a probability distribution over whole sentences or text segments).
In practical application, the invention is specifically applied as follows:
(1) n-gram language model calculation process:
For Chinese typo detection, whether a sentence is correct can be determined by calculating its probability. Assuming a sentence S = {w1, w2, …, wn}, the problem can be converted into the following form:
P(s)=P(w1,w2,...,wn)=P(w1)*P(w2|w1)*…*P(wn|w1,w2,…,wn-1)
P(s) is called a language model, i.e., a model used to calculate the probability that a sentence is well-formed.
When this formula is used for actual calculation, the parameter space is too large and the count matrix is severely sparse, which makes it impractical. In practice an N-gram model is adopted. It rests on the Markov assumption that the occurrence probability of a word depends only on the previous word or the previous few words, so the formula evolves into:
(1) the occurrence of a word depends only on the first 1 word, i.e. Bigram (2-gram):
P(S)≈P(w1)*P(w2|w1)*P(w3|w2)*…*P(wn|wn-1)
(2) the appearance of a word depends only on the first 2 words, i.e. Trigram (3-gram):
P(S)≈P(w1)*P(w2|w1)*P(w3|w1w2)*…*P(wn|wn-2wn-1)
The larger the n of the n-gram, the stronger the constraint on the next word and the more information it provides; but the model also becomes more complex and its problems more severe, so bigram or trigram is generally adopted.
The specific use of n-grams is illustrated below with a simple example:
the N-gram model constructs a language model through the statistics of the number of words, and the calculation formula of the Bigram is as follows:
P(wi|wi-1)=count(wi,wi-1)/count(wi-1)
P is the conditional probability of the two-word co-occurrence, and the two counts are the numbers of times the word pair and the single word, respectively, occur in the corpus.
The Bigram is a 2-element language model, the co-occurrence probability of 2-element words is calculated by counting the number of 2 words in a corpus and the number of single words, and the Trigram is a 3-element language model in the same way.
Suppose there is a corpus as follows, where <s1> and <s2> are sentence-beginning tags and </s2> and </s1> are sentence-end tags:
<s1> <s2> yes no no no no yes </s2> </s1>
<s1> <s2> no no no yes yes yes no </s2> </s1>
The task is to evaluate the probability of the following sentence:
<s1> <s2> yes no no yes </s2> </s1>
calculating the result of the probability using a trigram model:
P(yes|<s1>,<s2>)=1/2,
P(no|yes,no)=1/2,
P(</s2>|no,yes)=1/2,
P(no|<s2>,yes)=1,
P(yes|no,no)=2/5,
P(</s1>|yes,</s2>)=1
the probability required is equal to:
1/2×1×1/2×2/5×1/2×1=0.05
If the probability is less than a defined threshold, the sentence contains an error or is otherwise unreasonable.
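The hand calculation above can be reproduced with a few lines of Python; the counts mirror the trigram definition P(w | ctx) = count(ctx, w) / count(ctx).

```python
from collections import Counter

corpus = [
    "<s1> <s2> yes no no no no yes </s2> </s1>".split(),
    "<s1> <s2> no no no yes yes yes no </s2> </s1>".split(),
]
tri, bi = Counter(), Counter()
for sent in corpus:
    tri.update(zip(sent, sent[1:], sent[2:]))
    bi.update(zip(sent, sent[1:]))

test = "<s1> <s2> yes no no yes </s2> </s1>".split()
prob = 1.0
for a, b, c in zip(test, test[1:], test[2:]):
    prob *= tri[(a, b, c)] / bi[(a, b)]        # P(c | a, b)
print(prob)                                    # 0.05 (up to float rounding)
```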
Wrongly written characters in Chinese text are local phenomena; it suffices to choose a reasonable sliding window and check whether a wrongly written character exists inside it. An example follows:
Suppose the input is a sentence meaning "this case has already been passed by the higher court to the lower court for handling", in which the character read chuán ("to pass on") has been mistyped as another character with the same pinyin. When the model analyzes the sentence locally, the co-occurrence probability of the computed word strings is below the threshold, the analyzer rejects the input, and the sentence is judged to be erroneous.
The mistyped character is thus detected with the n-gram model; it is converted into the pinyin "chuan", the candidate characters for "chuan" are looked up in the dictionary, and each candidate is trial-filled into the sentence and rechecked with the n-gram model for reasonableness. In this way the n-gram model is combined with the pinyin of Chinese characters to correct wrongly written characters in Chinese text.
(2) Error checking process
The error checking module comprises a Bigram subword co-occurrence module and a neural network model.
Bigram subword co-occurrence: the co-occurrence counts of pairs of subwords within a window of length k are gathered from the collected large-scale corpus. Order is taken into account, so (w1, w2) is distinct from (w2, w1); finally, the high-frequency co-occurring word pairs are retained as the initial information input to the neural network language model.
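A possible implementation of this ordered, windowed counting is sketched below; the window length k and the frequency cutoff are illustrative assumptions.

```python
from collections import Counter

def windowed_cooccurrence(segmented_sentences, k=5, min_count=2):
    """Ordered co-occurrence counts within a window of length k;
    (w1, w2) is counted separately from (w2, w1)."""
    pairs = Counter()
    for words in segmented_sentences:
        for i, w in enumerate(words):
            for j in range(i + 1, min(i + k, len(words))):
                pairs[(w, words[j])] += 1
    return Counter({p: c for p, c in pairs.items() if c >= min_count})
```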
Neural network language model: the invention adopts a neural network language model based on a bidirectional LSTM to capture the context of the input text, predicts the probability of the character at the current position from that context, and models the conditional probability of the current character with respect to each character of the surrounding sentence, thereby producing the final error-detection judgment for the current Chinese character along with the suspected erroneous character and a candidate set.
(3) Correction procedure
From the erroneous words located in the input text and their candidate sets, combinations of correct text corresponding to the input text are obtained, and the modification result for the input text is obtained from the ranking result. Let Y be the input text and Yi be a correct-text combination corresponding to the input text.
Calculating a ranking score:
Score=a1*ppl(Yi)+a2*edit_distance(Y,Yi)+a3*WordCount(Yi)
where ppl(Yi) is the perplexity under the language model, edit_distance(Y, Yi) is the edit distance, and WordCount is the number of words. The language model used to calculate ppl here is a unidirectional LSTM statistical language model.
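A sketch of the ranking computation follows. The weights a1–a3 are not specified by the invention, the direction of ranking (here, lower score preferred) is an assumption, edit_distance is the standard Levenshtein distance, and WordCount is taken as the candidate's length.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ranking_score(y, y_i, ppl, a1=1.0, a2=1.0, a3=0.1):
    """Score = a1*ppl(Yi) + a2*edit_distance(Y, Yi) + a3*WordCount(Yi)."""
    return a1 * ppl(y_i) + a2 * edit_distance(y, y_i) + a3 * len(y_i)
```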
Fig. 3 is a structural diagram of the Chinese text error correction system according to the present invention. As shown in Fig. 3, the Chinese text error correction system includes:
a text to be corrected obtaining module 301, configured to obtain a text to be corrected.
The invention also includes: the corpus dictionary generating module, used for collecting original web pages, preprocessing them, determining a Chinese text corpus, and forming a corpus dictionary; the word segmentation module, used for performing word segmentation on the texts in the corpus dictionary with a segmenter and determining a plurality of segmented texts; the co-occurrence frequency determining module, used for counting the number of all words in the segmented texts and the co-occurrence frequency of any two words; and the N-gram model building module, used for building an N-gram model according to the co-occurrence frequency.
And an erroneous word and erroneous word position determining module 302, configured to determine an erroneous word and an erroneous word position in the text to be corrected according to a statistical language N-gram model.
A first candidate sentence set determining module 303, configured to determine a first candidate sentence set by using a bidirectional long-short term memory LSTM model based on the erroneous word and the erroneous word position.
The first candidate sentence set determining module 303 specifically includes: the word vector matrix conversion unit, used for converting the segmented texts into a word vector matrix with a word vector tool; the trained LSTM model building unit, used for taking the word vector matrix as the input of the LSTM model, training the LSTM model with forward propagation and back-propagation through time (BPTT), and building the trained LSTM model; the substituted text determining unit, used for substituting the words in the corpus dictionary one by one into the position of the wrong word in the text to be corrected to determine the substituted texts; the first candidate sentence list determining unit, used for inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word in the corpus dictionary at the position of the wrong word, sorting the substituted texts in descending order of occurrence probability, and determining a first candidate sentence list; and the first candidate sentence set determining unit, used for determining a first candidate sentence set from the first candidate sentence list based on the wrong word.
The first candidate sentence set determining unit specifically includes: the first judgment subunit, configured to judge whether the wrong word exists in the first candidate sentence list and obtain a first judgment result; the text correctness determining subunit, configured to conclude, if the first judgment result indicates that the wrong word exists in the first candidate sentence list, that the N-gram model raised a false alarm and that the text to be corrected is correct; the second candidate sentence list determining subunit, configured to, if the first judgment result indicates that the wrong word does not exist in the first candidate sentence list, screen homophones and near-sound characters of the wrong word from the list and determine a second candidate sentence list from them; and the first candidate sentence set determining subunit, configured to substitute the words in the second candidate sentence list one by one into the position of the wrong word in the text to be corrected and determine a first candidate sentence set.
And a pinyin sequence conversion module 304, configured to convert the text to be corrected into a pinyin sequence.
A second candidate sentence determination module 305, configured to determine a second candidate sentence by using the N-gram model based on the pinyin sequence.
The second candidate sentence determining module 305 specifically includes: the candidate sentence construction units are used for constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of the pinyin in the text to be corrected based on the pinyin sequence; a second candidate sentence determination unit configured to determine probabilities of the plurality of candidate sentences using the N-gram model, and to take a candidate sentence with a largest probability as the second candidate sentence.
And an error-corrected text determining module 306, configured to compare the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence, and determine the sentence with the lowest perplexity as the error-corrected text.
By combining a statistical language model with a deep neural network model, the Chinese text error correction method and system provided by the invention significantly improve the error detection and correction rates for Chinese text, can be applied to proofreading manuscript content in daily office work and other scenarios, and have high practical value.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.
Claims (10)
1. A Chinese text error correction method is characterized by comprising the following steps:
acquiring a text to be corrected;
determining error words and error word positions in the text to be corrected according to a statistical language N-gram model;
determining a first candidate sentence set by utilizing a bidirectional long-short term memory (LSTM) model based on the error words and the error word positions;
converting the text to be corrected into a pinyin sequence;
determining a second candidate sentence by using the N-gram model based on the pinyin sequence;
and comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence to determine the sentence with the lowest perplexity as the corrected text.
2. The method for correcting Chinese text according to claim 1, wherein the determining of the erroneous word and the position of the erroneous word in the text to be corrected according to a statistical language N-gram model further comprises:
collecting an original webpage, preprocessing the original webpage, determining a Chinese text corpus and forming a corpus dictionary;
performing word segmentation processing on the texts in the corpus dictionary by using a word segmentation device, and determining a plurality of segmented texts;
counting the number of all words in the segmented texts and the co-occurrence frequency of any two words;
and constructing an N-gram model according to the co-occurrence frequency.
3. The Chinese text error correction method according to claim 1, wherein the determining a first candidate sentence set by using a bidirectional long short-term memory LSTM model based on the erroneous terms and the erroneous term positions specifically comprises:
converting the text after word segmentation into a word vector matrix by using a word vector tool;
taking the word vector matrix as the input of an LSTM model, training the LSTM model with forward propagation and back-propagation through time (BPTT), and constructing the trained LSTM model;
substituting the characters in the corpus dictionary into the positions of the wrong words in the text to be corrected one by one, and determining the substituted text;
inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word in the corpus dictionary at the position of the wrong word, sorting the substituted texts in descending order of occurrence probability, and determining a first candidate sentence list;
based on the erroneous word, a first set of candidate sentences is determined from the first list of candidate sentences.
4. The Chinese text error correction method according to claim 3, wherein the determining a first candidate sentence set according to the first candidate sentence list based on the erroneous word specifically comprises:
judging whether the wrong words exist in the first candidate sentence list or not to obtain a first judgment result;
if the first judgment result shows that the wrong word exists in the first candidate sentence list, determining that the text to be corrected is correct;
if the first judgment result shows that the wrong word does not exist in the first candidate sentence list, screening homophones and near-sound words of the wrong word from the first candidate sentence list, and determining a second candidate sentence list according to the homophones and the near-sound words;
and substituting the words in the second candidate sentence list into the positions of the wrong words in the text to be corrected one by one to determine a first candidate sentence set.
5. The Chinese text error correction method according to claim 1, wherein the determining a second candidate sentence using the N-gram model based on the pinyin sequence specifically comprises:
based on the pinyin sequence, constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of pinyin in the text to be corrected;
and determining the probabilities of the candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
6. A Chinese text error correction system, comprising:
the text to be corrected acquiring module is used for acquiring a text to be corrected;
the error word and error word position determining module is used for determining the error word and the error word position in the text to be corrected according to a statistical language N-gram model;
a first candidate sentence set determining module, configured to determine a first candidate sentence set by using a bidirectional long-short term memory (LSTM) model based on the erroneous word and the erroneous word position;
the pinyin sequence conversion module is used for converting the text to be corrected into a pinyin sequence;
a second candidate sentence determination module, configured to determine a second candidate sentence by using the N-gram model based on the pinyin sequence;
and the corrected text determining module is used for comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence and determining the sentence with the lowest perplexity as the corrected text.
7. The Chinese text error correction system of claim 6, further comprising:
the corpus dictionary generating module is used for collecting original webpages, preprocessing the original webpages, determining a Chinese text corpus and forming a corpus dictionary;
the word segmentation module is used for performing word segmentation processing on the texts in the corpus dictionary by using a word segmentation device and determining a plurality of segmented texts;
the co-occurrence frequency determining module is used for counting the number of all words in the segmented texts and the co-occurrence frequency of any two words;
and the N-gram model building module is used for building an N-gram model according to the co-occurrence frequency.
8. The Chinese text error correction system of claim 6, wherein the first candidate sentence set determining module specifically comprises:
the word vector matrix conversion unit is used for converting the text after word segmentation into a word vector matrix by using a word vector tool;
the trained LSTM model building unit is used for taking the word vector matrix as the input of the LSTM model, training the LSTM model with forward propagation and back-propagation through time (BPTT), and building the trained LSTM model;
the substituted text determining unit is used for substituting the characters in the corpus dictionary into the positions of the wrong words in the text to be corrected one by one to determine the substituted text;
a first candidate sentence list determining unit, configured to input the substituted texts into the trained LSTM model, output the occurrence probability of each word in the corpus dictionary at the position of the wrong word, sort the substituted texts in descending order of occurrence probability, and determine a first candidate sentence list;
a first candidate sentence set determination unit, configured to determine a first candidate sentence set according to the first candidate sentence list based on the erroneous word.
9. The Chinese text error correction system of claim 8, wherein the first candidate sentence set determining unit specifically comprises:
the first judgment subunit is configured to judge whether the wrong word exists in the first candidate sentence list, so as to obtain a first judgment result;
a text to be corrected correctness determining subunit, configured to determine that the text to be corrected is correct if the first determination indicates that the erroneous word exists in the first candidate sentence list;
a second candidate sentence list determining subunit, configured to, if the first judgment result indicates that the wrong word does not exist in the first candidate sentence list, screen homophones and near-sound characters of the wrong word from the first candidate sentence list, and determine a second candidate sentence list according to the homophones and the near-sound characters;
and the first candidate sentence set determining subunit is used for substituting the words in the second candidate sentence list into the positions of the error words in the text to be corrected one by one to determine a first candidate sentence set.
10. The Chinese text error correction system of claim 6, wherein the second candidate sentence determining module specifically comprises:
the candidate sentence construction units are used for constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of the pinyin in the text to be corrected based on the pinyin sequence;
a second candidate sentence determination unit configured to determine probabilities of the plurality of candidate sentences using the N-gram model, and to take a candidate sentence with a largest probability as the second candidate sentence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011021044.4A CN112149406B (en) | 2020-09-25 | 2020-09-25 | Chinese text error correction method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011021044.4A CN112149406B (en) | 2020-09-25 | 2020-09-25 | Chinese text error correction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112149406A true CN112149406A (en) | 2020-12-29 |
CN112149406B CN112149406B (en) | 2023-09-08 |
Family
ID=73896929
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011021044.4A Active CN112149406B (en) | 2020-09-25 | 2020-09-25 | Chinese text error correction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112149406B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190318732A1 (en) * | 2018-04-16 | 2019-10-17 | International Business Machines Corporation | Implementing a whole sentence recurrent neural network language model for natural language processing |
US20200125639A1 (en) * | 2018-10-22 | 2020-04-23 | Ca, Inc. | Generating training data from a machine learning model to identify offensive language |
CN110765996A (en) * | 2019-10-21 | 2020-02-07 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN111369996A (en) * | 2020-02-24 | 2020-07-03 | 网经科技(苏州)有限公司 | Method for correcting text error in speech recognition in specific field |
Non-Patent Citations (1)
Title |
---|
杨越; 黄瑞章; 魏琴; 陈艳平; 秦永彬: "基于上下文语义的新闻人名纠错方法" (A method for correcting person names in news based on contextual semantics), 电子科技大学学报 (Journal of University of Electronic Science and Technology of China), no. 06.
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112800987A (en) * | 2021-02-02 | 2021-05-14 | 中国联合网络通信集团有限公司 | Chinese character processing method and device |
CN112800987B (en) * | 2021-02-02 | 2023-07-21 | 中国联合网络通信集团有限公司 | Chinese character processing method and device |
CN112735396A (en) * | 2021-02-05 | 2021-04-30 | 北京小米松果电子有限公司 | Speech recognition error correction method, device and storage medium |
CN112989806A (en) * | 2021-04-07 | 2021-06-18 | 广州伟宏智能科技有限公司 | Intelligent text error correction model training method |
CN113076739A (en) * | 2021-04-09 | 2021-07-06 | 厦门快商通科技股份有限公司 | Method and system for realizing cross-domain Chinese text error correction |
CN113096667A (en) * | 2021-04-19 | 2021-07-09 | 上海云绅智能科技有限公司 | Wrongly-written character recognition detection method and system |
CN113051896A (en) * | 2021-04-23 | 2021-06-29 | 百度在线网络技术(北京)有限公司 | Method and device for correcting text, electronic equipment and storage medium |
CN113051896B (en) * | 2021-04-23 | 2023-08-18 | 百度在线网络技术(北京)有限公司 | Method and device for correcting text, electronic equipment and storage medium |
CN112883717A (en) * | 2021-04-27 | 2021-06-01 | 北京嘉和海森健康科技有限公司 | Wrongly written character detection method and device |
CN113343671B (en) * | 2021-06-07 | 2023-03-31 | 佳都科技集团股份有限公司 | Statement error correction method, device and equipment after voice recognition and storage medium |
CN113343671A (en) * | 2021-06-07 | 2021-09-03 | 佳都科技集团股份有限公司 | Statement error correction method, device and equipment after voice recognition and storage medium |
CN113435187A (en) * | 2021-06-24 | 2021-09-24 | 湖北大学 | Text error correction method and system for industrial alarm information |
CN113435187B (en) * | 2021-06-24 | 2023-07-07 | 湖北大学 | Text error correction method and system for industrial alarm information |
CN113361266A (en) * | 2021-06-25 | 2021-09-07 | 达闼机器人有限公司 | Text error correction method, electronic device and storage medium |
CN113361266B (en) * | 2021-06-25 | 2022-12-06 | 达闼机器人股份有限公司 | Text error correction method, electronic device and storage medium |
CN113780418A (en) * | 2021-09-10 | 2021-12-10 | 平安科技(深圳)有限公司 | Data screening method, system, equipment and storage medium |
CN113887203A (en) * | 2021-09-29 | 2022-01-04 | 平安普惠企业管理有限公司 | Text error correction method, device and equipment based on artificial intelligence and storage medium |
CN113887202A (en) * | 2021-09-29 | 2022-01-04 | 平安普惠企业管理有限公司 | Text error correction method and device, computer equipment and storage medium |
CN114328798A (en) * | 2021-11-09 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Processing method, device, equipment, storage medium and program product for searching text |
CN114328798B (en) * | 2021-11-09 | 2024-02-23 | 腾讯科技(深圳)有限公司 | Processing method, device, equipment, storage medium and program product for searching text |
CN114417834A (en) * | 2021-12-24 | 2022-04-29 | 深圳云天励飞技术股份有限公司 | Text processing method and device, electronic equipment and readable storage medium |
CN114528824A (en) * | 2021-12-24 | 2022-05-24 | 深圳云天励飞技术股份有限公司 | Text error correction method and device, electronic equipment and storage medium |
CN114386399A (en) * | 2021-12-30 | 2022-04-22 | 中国电信股份有限公司 | Text error correction method and device |
CN114492396A (en) * | 2022-02-17 | 2022-05-13 | 重庆长安汽车股份有限公司 | Text error correction method for automobile proper nouns and readable storage medium |
CN114707492A (en) * | 2022-03-22 | 2022-07-05 | 昆明理工大学 | Vietnamese grammar error correction method and device fusing multi-granularity characteristics |
CN115223588A (en) * | 2022-03-24 | 2022-10-21 | 华东师范大学 | Child voice phrase matching method based on pinyin distance and sliding window |
CN114495910A (en) * | 2022-04-07 | 2022-05-13 | 联通(广东)产业互联网有限公司 | Text error correction method, system, device and storage medium |
CN115310434A (en) * | 2022-10-11 | 2022-11-08 | 深圳擎盾信息科技有限公司 | Error correction method and device for grammars of contracting documents, computer equipment and storage medium |
CN115719059A (en) * | 2022-11-29 | 2023-02-28 | 北京中科智加科技有限公司 | Morse packet error correction method |
CN115719059B (en) * | 2022-11-29 | 2023-08-08 | 北京中科智加科技有限公司 | Morse grouping error correction method |
CN116090441A (en) * | 2022-12-30 | 2023-05-09 | 永中软件股份有限公司 | Chinese spelling error correction method integrating local semantic features and global semantic features |
CN116090441B (en) * | 2022-12-30 | 2023-10-20 | 永中软件股份有限公司 | Chinese spelling error correction method integrating local semantic features and global semantic features |
CN116306600A (en) * | 2023-05-25 | 2023-06-23 | 山东齐鲁壹点传媒有限公司 | MacBert-based Chinese text error correction method |
CN116306600B (en) * | 2023-05-25 | 2023-08-11 | 山东齐鲁壹点传媒有限公司 | MacBert-based Chinese text error correction method |
Also Published As
Publication number | Publication date |
---|---|
CN112149406B (en) | 2023-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112149406B (en) | Chinese text error correction method and system | |
CN110489760A (en) | Based on deep neural network text auto-collation and device | |
Kanakaraddi et al. | Survey on parts of speech tagger techniques | |
US7424675B2 (en) | Language input architecture for converting one text form to another text form with tolerance to spelling typographical and conversion errors | |
US7165019B1 (en) | Language input architecture for converting one text form to another text form with modeless entry | |
US7035789B2 (en) | Supervised automatic text generation based on word classes for language modeling | |
Wilcox-O’Hearn et al. | Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model | |
CN111611810A (en) | Polyphone pronunciation disambiguation device and method | |
WO2008059111A2 (en) | Natural language processing | |
JP6778655B2 (en) | Word concatenation discriminative model learning device, word concatenation detection device, method, and program | |
US7464024B2 (en) | Chinese character-based parser | |
CN102214238A (en) | Device and method for matching similarity of Chinese words | |
CN114564912B (en) | Intelligent document format checking and correcting method and system | |
CN114298010A (en) | Text generation method integrating dual-language model and sentence detection | |
Lee et al. | Automatic word spacing using probabilistic models based on character n-grams | |
Mundotiya et al. | Linguistic resources for Bhojpuri, Magahi, and Maithili: statistics about them, their similarity estimates, and baselines for three applications | |
JPH10326275A (en) | Method and device for morpheme analysis and method and device for japanese morpheme analysis | |
Motlani et al. | A finite-state morphological analyser for Sindhi | |
Mekki et al. | COTA 2.0: An automatic corrector of Tunisian Arabic social media texts | |
Zarnoufi et al. | MANorm: A normalization dictionary for Moroccan Arabic dialect written in Latin script | |
CN115034209A (en) | Text analysis method and device, electronic equipment and storage medium | |
CN114528861A (en) | Foreign language translation training method and device based on corpus | |
CN114896966A (en) | Method, system, equipment and medium for positioning grammar error of Chinese text | |
CN113468875A (en) | MNet method for semantic analysis of natural language interaction interface of SCADA system | |
Aliprandi et al. | An inflected-sensitive letter and word prediction system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |