CN112149406B - Chinese text error correction method and system - Google Patents

Chinese text error correction method and system

Info

Publication number
CN112149406B
CN112149406B (application CN202011021044.4A)
Authority
CN
China
Prior art keywords
text
word
determining
candidate sentence
error
Prior art date
Legal status
Active
Application number
CN202011021044.4A
Other languages
Chinese (zh)
Other versions
CN112149406A (en)
Inventor
钱宝生
杨军
曾擂
王滨
干家东
Current Assignee
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202011021044.4A priority Critical patent/CN112149406B/en
Publication of CN112149406A publication Critical patent/CN112149406A/en
Application granted granted Critical
Publication of CN112149406B publication Critical patent/CN112149406B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a Chinese text error correction method and system. The method comprises the following steps: acquiring a text to be corrected; determining the erroneous words and their positions in the text according to a statistical N-gram language model; determining a first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous words and their positions; converting the text to be corrected into a pinyin sequence; determining a second candidate sentence by using the N-gram model based on the pinyin sequence; and comparing the perplexity of every first candidate sentence in the first candidate sentence set with that of the second candidate sentence, taking the sentence with the lowest perplexity as the corrected text. The invention improves the error detection and correction rate for Chinese text while reducing hardware configuration requirements.

Description

Chinese text error correction method and system
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a Chinese text error correction method and system.
Background
Chinese text often contains errors of various kinds, such as similar-character errors, homophone errors, terminology errors, semantic errors, idiom errors, and the like. In important settings an erroneous document can cause significant losses, and manual proofreading is inefficient and time-consuming for large volumes of text. The technical difficulties of Chinese text correction are:
(1) Accuracy of named entity recognition: for rule-based errors, a dictionary of the corresponding domain must be constructed. Proofreading leaders' names, for example, requires a mapping between names and official positions that can be updated in real time; but because this information changes frequently and positions turn over often, out-of-sync dictionaries cause false error reports.
(2) Complexity of Chinese grammar rules: the most distinctive feature of standard Chinese grammar is the near absence of morphological inflection. Nouns have no case and are not marked for gender or number; verbs are not conjugated for person or tense. This feature, so different from European languages, historically led many linguists to conclude that Chinese has neither grammar nor parts of speech. Precisely because Chinese grammar is hard to pin down in theory, error correction for Chinese text is more difficult and false alarms may occur.
(3) Polyphony of Chinese characters: a Chinese character often has multiple readings. The character 还, for example, can be read in the second tone as huán, meaning "to return", or as hái, meaning "still" or "yet". Across different contexts, such errors are difficult to correct successfully.
Current error correction methods fall into three classes: rule-based methods, methods based on an N-gram statistical language model, and methods based on deep neural networks. Rule-based methods execute quickly but have poor accuracy and adaptability; N-gram methods can only handle collocation errors between adjacent words and lack syntactic analysis capability; deep neural network methods place high demands on hardware configuration.
Disclosure of Invention
The invention aims to provide a Chinese text error correction method and system that address the problems of existing Chinese text error correction methods: low accuracy, the ability to handle only collocation errors between adjacent words, lack of syntactic analysis capability, and high hardware configuration requirements.
In order to achieve the above object, the present invention provides the following solutions:
A method for error correction of Chinese text, comprising:
acquiring a text to be corrected;
determining the error words and the positions of the error words in the text to be corrected according to a statistical language N-gram model;
determining a first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous word and its position;
converting the text to be corrected into a pinyin sequence;
determining a second candidate sentence by using the N-gram model based on the pinyin sequence;
comparing the perplexity of every first candidate sentence in the first candidate sentence set with the perplexity of the second candidate sentence, and taking the sentence with the lowest perplexity as the corrected text.
Optionally, the determining of the erroneous word and its position in the text to be corrected according to the statistical N-gram language model is preceded by:
collecting an original webpage, preprocessing the original webpage, determining a Chinese text corpus, and forming a corpus dictionary;
performing word segmentation on texts in the corpus dictionary by using a word segmentation device, and determining a plurality of segmented texts;
counting the number of the text after word segmentation and the co-occurrence frequency of any two words;
and constructing an N-gram model according to the co-occurrence frequency.
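The four construction steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `build_bigram_model` and the toy corpus are made-up names, and the sketch assumes the corpus has already been segmented into word lists (e.g. by jieba).

```python
from collections import Counter

def build_bigram_model(tokenized_sentences):
    """Count word frequencies and adjacent-pair co-occurrences, then derive
    the conditional probabilities P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in tokenized_sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))  # co-occurrence of any two adjacent words
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

# Toy segmented corpus standing in for the preprocessed web-page text.
model = build_bigram_model([["the", "court", "ruled"], ["the", "court", "closed"]])
```

Here `model[("the", "court")]` is 1.0, since "court" follows "the" in every toy sentence, while `model[("court", "ruled")]` is 0.5.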
Optionally, the determining of the first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous word and its position specifically includes:
converting the text after word segmentation into a word vector matrix by using a word vector tool;
taking the word vector matrix as the input of an LSTM model, training the LSTM model with the forward propagation and backpropagation through time (BPTT) algorithms, and constructing a trained LSTM model;
substituting words in the corpus dictionary into the error word positions in the text to be corrected one by one, and determining substituted text;
inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the erroneous word position, and sorting the substituted texts in descending order of that probability to determine a first candidate sentence list;
and determining a first candidate sentence set according to the first candidate sentence list based on the error word.
Optionally, the determining, based on the error word, a first candidate sentence set according to the first candidate sentence list specifically includes:
judging whether the error word exists in the first candidate sentence list or not to obtain a first judging result;
if the first judging result indicates that the erroneous word exists in the first candidate sentence list, determining that the text to be corrected is correct;
if the first judging result indicates that the erroneous word does not exist in the first candidate sentence list, screening homophones and near-homophones of the erroneous word from the first candidate sentence list, and determining a second candidate sentence list from them;
and substituting the words in the second candidate sentence list into the positions of the error words in the text to be corrected one by one, and determining a first candidate sentence set.
Optionally, the determining, based on the pinyin sequence, a second candidate sentence by using the N-gram model specifically includes:
based on the pinyin sequence, constructing a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary;
and determining the probability of the plurality of candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
A Chinese text error correction system, comprising:
the text to be corrected acquisition module is used for acquiring the text to be corrected;
the error word and error word position determining module is used for determining the error word and error word position in the text to be corrected according to a statistical language N-gram model;
the first candidate sentence set determining module is used for determining a first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous words and their positions;
the pinyin sequence conversion module is used for converting the text to be corrected into a pinyin sequence;
the second candidate sentence determining module is used for determining a second candidate sentence by utilizing the N-gram model based on the pinyin sequence;
and the corrected text determining module is used for comparing the perplexity of every first candidate sentence in the first candidate sentence set with the perplexity of the second candidate sentence, and taking the sentence with the lowest perplexity as the corrected text.
Optionally, the method further comprises:
the corpus dictionary generating module is used for collecting original webpages, preprocessing the original webpages, determining a Chinese text corpus and forming a corpus dictionary;
the word segmentation module is used for carrying out word segmentation processing on texts in the corpus dictionary by utilizing a word segmentation device and determining a plurality of segmented texts;
the co-occurrence frequency determining module is used for counting the number of segmented texts and the co-occurrence frequency of any two words;
and the N-gram model building module is used for building an N-gram model according to the co-occurrence frequency.
Optionally, the first candidate sentence set determining module specifically includes:
the word vector matrix conversion unit is used for converting the text subjected to word segmentation into a word vector matrix by using a word vector tool;
the trained LSTM model construction unit is used for taking the word vector matrix as the input of the LSTM model, training the LSTM model with the forward propagation and backpropagation through time algorithms, and constructing the trained LSTM model;
the substituted text determining unit is used for substituting the words in the corpus dictionary into the error word positions in the text to be corrected one by one to determine the substituted text;
the first candidate sentence list determining unit is used for inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the erroneous word position, and sorting the substituted texts in descending order of that probability to determine a first candidate sentence list;
and the first candidate sentence set determining unit is used for determining a first candidate sentence set according to the first candidate sentence list based on the error words.
Optionally, the first candidate sentence set determining unit specifically includes:
the first judging subunit is used for judging whether the error word exists in the first candidate sentence list or not to obtain a first judging result;
a text to be corrected correctly determining subunit, configured to determine that the text to be corrected is correct if the first determination indicates that the error word exists in the first candidate sentence list;
a second candidate sentence list determining subunit, configured to screen homophones and near-homophones of the erroneous word from the first candidate sentence list if the first judging result indicates that the erroneous word does not exist in that list, and to determine a second candidate sentence list from them;
and the first candidate sentence set determining subunit is used for substituting the words in the second candidate sentence list into the error word positions in the text to be corrected one by one to determine a first candidate sentence set.
Optionally, the second candidate sentence determination module specifically includes:
a plurality of candidate sentence construction units, configured to construct a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary based on the pinyin sequence;
and the second candidate sentence determining unit is used for determining the probability of the plurality of candidate sentences by utilizing the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
According to the specific embodiments provided, the invention discloses the following technical effects. The invention provides a Chinese text error correction method and system that locate erroneous words in a text with an N-gram statistical language model, generate a first candidate sentence set with a bidirectional LSTM deep neural network model and a second candidate sentence from the pinyin-sequence edit distance, and select suitable replacement words by computing the perplexity of the candidate sentences. This improves the error detection and correction rate for Chinese text while keeping hardware configuration requirements low; the method can be applied to proofreading manuscript content in scenarios such as daily office work and has high practical value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for correcting errors of Chinese text provided by the invention;
FIG. 2 is a flowchart of another method for correcting errors in Chinese text according to the present invention;
fig. 3 is a diagram of a chinese text error correction system according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a Chinese text error correction method and a Chinese text error correction system, which can improve the error correction and error correction rate of Chinese text and reduce the hardware configuration requirement.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Fig. 1 is a flowchart of the Chinese text error correction method provided by the invention. As shown in fig. 1, the method includes:
step 101: and acquiring a text to be corrected.
Step 102: and determining the error words and the positions of the error words in the text to be corrected according to the statistical language N-gram model.
Training the N-gram statistical language model
Original web pages are collected from public document websites on the Internet and preprocessed to form a corpus of plain document text and a corpus dictionary. The jieba tokenizer then performs Chinese word segmentation; the counts of all words and the co-occurrence frequency of any two words are tallied, and the co-occurrence probabilities of all 2-gram word pairs are computed according to the N-gram formula, yielding a 2-gram (bigram) statistical language model:
P(S) ≈ P(w_1) * P(w_2|w_1) * P(w_3|w_2) * ... * P(w_n|w_{n-1})
The trained N-gram language model then performs error localization on the input sentence based on the co-occurrence of words in the training corpus: if the co-occurrence probability of an n-gram is below a threshold, an error is assumed to exist at that n-gram.
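A minimal sketch of this thresholded error localization, assuming a bigram-probability table like the one built above; `locate_errors` and the threshold value are illustrative choices, not taken from the patent.

```python
def locate_errors(words, bigram_prob, threshold=1e-4):
    """Return indices of words whose bigram with the preceding word has a
    co-occurrence probability below the threshold (suspected error sites)."""
    suspects = []
    for i in range(1, len(words)):
        if bigram_prob.get((words[i - 1], words[i]), 0.0) < threshold:
            suspects.append(i)  # index of the right-hand word of the rare pair
    return suspects

bp = {("the", "court"): 0.5, ("court", "ruled"): 0.2}
flagged = locate_errors(["the", "court", "rued"], bp)  # unseen pair is flagged
```

An unseen pair such as ("court", "rued") gets probability 0 and is flagged, while a sentence whose pairs all clear the threshold yields an empty list.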
Step 103: determining a first candidate sentence set by using a bidirectional long short-term memory (LSTM) model based on the erroneous word and its position.
Step 103 specifically includes: converting the segmented text into a word vector matrix with a word vector tool; taking the word vector matrix as the input of an LSTM model, training the LSTM model with the forward propagation and backpropagation through time algorithms, and constructing a trained LSTM model; substituting words of the corpus dictionary one by one into the erroneous word position of the text to be corrected to determine substituted texts; inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the erroneous word position, and sorting the substituted texts in descending order of that probability to determine a first candidate sentence list; and determining a first candidate sentence set from the first candidate sentence list based on the erroneous word.
Determining the first candidate sentence set from the first candidate sentence list based on the erroneous word specifically includes: judging whether the erroneous word exists in the first candidate sentence list; if so, the N-gram model is deemed to have misjudged, and the text to be corrected is determined to be correct; if not, homophones and near-homophones of the erroneous word are screened from the list, a second candidate sentence list is determined from them, and the words of the second candidate sentence list are substituted one by one into the erroneous word position of the text to be corrected to determine the first candidate sentence set.
Training a bidirectional LSTM model by using the corpus, wherein the model training steps are as follows:
a) Converting sentences in the preprocessed text corpus into a word vector matrix through word2vec, and using the word vector matrix as an input of an LSTM model;
b) The model is trained with the forward propagation and backpropagation through time algorithms.
Words of the dictionary are substituted one by one into the error position of the sentence, each substituted sentence is fed into the trained bidirectional LSTM model, the probability of each word of the output dictionary is computed, the probabilities are sorted in descending order, and the top-K words are retained as set A.
Based on set A, the following determination is made: if the word the N-gram model flagged as erroneous is in set A, the N-gram detection is deemed a false alarm, i.e. the sentence contains no error; if it is not in set A, homophones and near-homophones of the word are selected from set A as a new set A', the words of A' are substituted one by one into the error position of the sentence to obtain the first candidate sentence set S, and the PPL of every sentence in S is computed.
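The set-A screening logic can be sketched as below. The bidirectional LSTM's per-word output distribution is stubbed with a plain dict, and `pinyin_of` is a hypothetical lookup table (in practice a pinyin library or dictionary would supply it); both names are assumptions.

```python
def screen_candidates(flagged_word, position_probs, pinyin_of, k=5):
    """Keep the top-k words at the error position as set A. If the flagged word
    is already in A, the N-gram detection is a false alarm (return None);
    otherwise keep only homophones of the flagged word as set A'."""
    set_a = [w for w, _ in sorted(position_probs.items(), key=lambda kv: -kv[1])[:k]]
    if flagged_word in set_a:
        return None  # sentence judged correct after all
    return [w for w in set_a if pinyin_of.get(w) == pinyin_of.get(flagged_word)]

probs = {"return": 0.5, "ring": 0.3, "still": 0.1}                      # stubbed LSTM output
pinyin = {"return": "huan", "ring": "huan", "typo": "huan", "still": "hai"}
set_a_prime = screen_candidates("typo", probs, pinyin)                   # homophones of the flagged word
```

Here "typo" is not in set A, so only its homophones ("return", "ring") survive; flagging a word that is already in set A would instead return `None`.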
Calculation formula of PPL:
A sentence s is composed of words, where w denotes a word:
s = w_1 w_2 ... w_N
PPL(S) = P(w_1 w_2 ... w_N)^(-1/N)
where P is the probability of the sentence and N is the sentence length, i.e. the number of words.
In particular, for the 2-gram model:
PPL(S) = [p(w_1) * p(w_2|w_1) * ... * p(w_N|w_{N-1})]^(-1/N)
where p(w_1 ... w_N) is the probability of the sentence and p(w_i|w_{i-1}) is the conditional co-occurrence probability of two words, which the trained 2-gram model outputs directly via:
P(w_i|w_{i-1}) = count(w_i, w_{i-1}) / count(w_{i-1})
where count(w_{i-1}) is the number of occurrences of the word w_{i-1} in the corpus and count(w_i, w_{i-1}) is the number of times the two words appear together.
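Under these formulas, a bigram PPL can be computed as follows. The smoothing `floor` for unseen pairs is an assumption for the sketch; the patent does not specify how unseen n-grams are handled.

```python
def bigram_ppl(words, bigram_prob, unigram_prob, floor=1e-8):
    """PPL(S) = [P(w1) * prod_{i>=2} P(wi | wi-1)] ** (-1/N), N = len(words).
    Unseen words/pairs fall back to a small floor probability (assumption)."""
    p = unigram_prob.get(words[0], floor)
    for prev, cur in zip(words, words[1:]):
        p *= bigram_prob.get((prev, cur), floor)
    return p ** (-1.0 / len(words))

# P(S) = 0.5 * 0.5 = 0.25 over two words, so PPL = 0.25 ** (-1/2) = 2.0
ppl = bigram_ppl(["a", "b"], {("a", "b"): 0.5}, {"a": 0.5})
```

Lower PPL means the model finds the sentence more plausible, which is why the method keeps the candidate with the smallest PPL.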
Step 104: and converting the text to be corrected into a pinyin sequence.
Step 105: and determining a second candidate sentence by utilizing the N-gram model based on the pinyin sequence.
The step 105 specifically includes: based on the pinyin sequence, constructing a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary; and determining the probability of the plurality of candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
Error positioning and correction based on pinyin sequence dynamic programming algorithm:
all the input text to be corrected (sentence X) is converted into phonetic sequences, each phonetic corresponds to one or more Chinese characters, and all the candidate Chinese characters form L candidate sentences according to the positions of the phonetic characters in the original sentence. Based on the 2-gram language model, calculating the probability size of each sentence:
P(S) ≈ P(w_1) * P(w_2|w_1) * P(w_3|w_2) * ... * P(w_n|w_{n-1})
The candidate sentence with the highest probability is selected as the second candidate sentence (sentence Y).
The text to be corrected X is compared with sentence Y; if the two sentences differ in the word at position i, the PPL of X and Y is computed as above.
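The pinyin expansion and selection steps can be sketched as follows. `homophones` stands in for a hypothetical pinyin-to-characters dictionary, and the per-sentence scoring function would in practice be the 2-gram probability above; both are assumptions of the sketch.

```python
from itertools import product

def pinyin_candidates(pinyin_seq, homophones):
    """Expand each syllable to its candidate characters and form the L
    candidate sentences as the Cartesian product of the candidate pools."""
    pools = [homophones[p] for p in pinyin_seq]
    return ["".join(chars) for chars in product(*pools)]

def second_candidate(candidates, sentence_prob):
    """Sentence Y: the candidate with the highest model probability."""
    return max(candidates, key=sentence_prob)

cands = pinyin_candidates(["ma", "shang"], {"ma": ["A", "B"], "shang": ["C"]})
y = second_candidate(cands, {"AC": 0.2, "BC": 0.8}.get)
```

With two candidates for "ma" and one for "shang", L = 2 candidate sentences are formed, and the higher-probability one becomes sentence Y.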
Step 106: comparing the confusion degree of all the first candidate sentences in the first candidate sentence set with the confusion degree of the second candidate sentences, and determining the sentence with the lowest confusion degree as the text after error correction.
The PPL values of X, Y and all sentences in S are compared; the sentence with the smallest PPL is selected and output as the corrected sentence.
In the Chinese text error correction method provided by the invention, the correction process is divided into two stages: error detection and error correction. In the detection stage, erroneous words and their possible positions in a sentence are judged from the word co-occurrence probabilities computed by the N-gram model. In the correction stage, a candidate word list is generated for each detected error position from the deep neural network model's computation, the candidate words for each erroneous word are ranked and filtered, and the best result is recommended to the user. Fig. 2 is a flowchart of another Chinese text error correction method provided by the invention, as shown in fig. 2.
In short, the method determines the positions of erroneous words in the text, obtains a probability distribution over candidate words at each such position, and selects from the candidate sentences the one with the lowest perplexity.
In practical application, the invention is specifically applied as follows:
(1) N-gram language model calculation process:
For Chinese typo detection, whether a sentence is correct can be judged by computing its probability. Given a sentence S = {w_1, w_2, ..., w_n}, the problem can be converted into the following form:
P(S) = P(w_1, w_2, ..., w_n) = P(w_1) * P(w_2|w_1) * ... * P(w_n|w_1, w_2, ..., w_{n-1})
P(S) is called a language model, i.e. a model used to compute the probability that a sentence is well formed.
When this formula is used for actual computation, the parameter space is too large and data sparsity is severe, making it impractical. In practice an N-gram model is adopted: based on the Markov assumption, the occurrence probability of a word depends only on the preceding one word or few words, and the formula evolves into:
(1) The occurrence of a word depends only on the preceding word, i.e. the bigram (2-gram):
P(S) ≈ P(w_1) * P(w_2|w_1) * P(w_3|w_2) * ... * P(w_n|w_{n-1})
(2) The occurrence of a word depends only on the preceding two words, i.e. the trigram (3-gram):
P(S) ≈ P(w_1) * P(w_2|w_1) * P(w_3|w_1 w_2) * ... * P(w_n|w_{n-2} w_{n-1})
The larger n is, the stronger the constraint on the next word, since more context is available; but the model also becomes more complex and its problems (parameter count, sparsity) grow, so a bigram or trigram is generally adopted.
The specific use of an n-gram is described below as a simple example:
The N-gram model builds a language model by counting words. For the bigram, the calculation formula is:
P(w_i|w_{i-1}) = count(w_i, w_{i-1}) / count(w_{i-1})
where P is the conditional co-occurrence probability of the two words and the two counts are occurrence statistics over the corpus.
The bigram is a 2-gram language model: the co-occurrence probability of word pairs is computed by counting pairs of adjacent words in the corpus. The trigram is likewise a 3-gram language model.
Assume now the following corpus, where <s1> and <s2> are sentence-head labels and </s2> and </s1> are sentence-tail labels:
<s1> <s2> yes no no no no yes </s2> </s1>
<s1> <s2> no no no yes yes yes no </s2> </s1>
The task is to evaluate the probability of the following sentence:
<s1> <s2> yes no no yes </s2> </s1>
results of calculating probabilities using trigram model:
P(yes|<s1>,<s2>)=1/2,
P(no|yes,no)=1/2,
P(</s2>|no,yes)=1/2,
P(no|<s2>,yes)=1,
P(yes|no,no)=2/5,
P(</s1>|yes,</s2>)=1
the required probability is equal to:
1/2×1×1/2×2/5×1/2×1=0.05
if the probability is less than a defined threshold, this indicates that there is an error in the sentence or that the sentence is not reasonable.
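The toy trigram computation above can be reproduced directly; the helper names here are illustrative.

```python
from collections import Counter

def trigram_model(corpus):
    """Maximum-likelihood trigram estimates P(w3 | w1, w2) from token lists."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        tri.update(zip(sent, sent[1:], sent[2:]))
        bi.update(zip(sent, sent[1:]))
    return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)]

corpus = [
    "<s1> <s2> yes no no no no yes </s2> </s1>".split(),
    "<s1> <s2> no no no yes yes yes no </s2> </s1>".split(),
]
p = trigram_model(corpus)

sentence = "<s1> <s2> yes no no yes </s2> </s1>".split()
prob = 1.0
for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
    prob *= p(w1, w2, w3)
# prob = 1/2 * 1 * 1/2 * 2/5 * 1/2 * 1 = 0.05, matching the hand calculation
```

A probability this low against a suitable threshold would flag the sentence as erroneous or implausible.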
Typos in Chinese text are local, so it suffices to check for typos within a suitably sized sliding window. An example follows:
The input text is "this case has been transferred by the superior court to the inferior court for processing", in which the character for "transferred" (pinyin chuan) has been mistyped as another character with the same pinyin. When the model analyzes the sentence locally, the computed co-occurrence probability of the word string falls below the threshold, so the analyzer rejects it and judges it erroneous.
The n-gram model thus detects that the word is mistyped. The word is then converted to its pinyin "chuan", candidate words for "chuan" are retrieved from the dictionary and substituted one by one, and the n-gram model checks which substitution is reasonable. In this way the n-gram model combines the pinyin of Chinese characters to correct typos in Chinese text.
(2) Missing checking process
The error detection module comprises a Bigram subword co-occurrence and neural network model.
Bigram subword co-occurrence: the number of co-occurrences of two subwords within a window of length k is counted over the collected large-scale corpus. Order is taken into account, so (w1, w2) is distinct from (w2, w1); finally, frequently co-occurring word pairs are retained as the initial information fed into the neural network language model.
Neural network language model: the invention adopts a neural network language model based on a bidirectional LSTM to capture the context of the input text. It predicts the probability of the word at the current position from that context, modeling the conditional probability of the current word given each word before and after it in the sentence, and thereby produces the final error-detection result for the current Chinese character, together with suspected error words and a candidate set.
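The patent's model is a bidirectional LSTM; as a dependency-free illustration of the same idea — conditioning the current word on both its left and right neighbors — the sketch below substitutes simple conditional counts for the neural network. It is a toy stand-in, not the patented model:

```python
from collections import defaultdict

class BidirectionalScorer:
    """Toy stand-in for the BiLSTM language model: estimates
    P(w | previous word) and P(w | next word) from corpus counts
    and averages the two directions to score a position."""
    def __init__(self, sentences):
        self.fwd = defaultdict(lambda: defaultdict(int))  # prev -> w -> count
        self.bwd = defaultdict(lambda: defaultdict(int))  # next -> w -> count
        for s in sentences:
            for i, w in enumerate(s):
                if i > 0:
                    self.fwd[s[i - 1]][w] += 1
                if i < len(s) - 1:
                    self.bwd[s[i + 1]][w] += 1

    def _p(self, table, ctx, w):
        total = sum(table[ctx].values())
        return table[ctx][w] / total if total else 0.0

    def score(self, sent, i):
        """Average of the forward and backward conditional
        probabilities of the word at position i; a low score marks
        the position as a suspected error."""
        ps = []
        if i > 0:
            ps.append(self._p(self.fwd, sent[i - 1], sent[i]))
        if i < len(sent) - 1:
            ps.append(self._p(self.bwd, sent[i + 1], sent[i]))
        return sum(ps) / len(ps) if ps else 0.0
```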
(3) Error correction process
By locating the erroneous words in the input text, combinations of candidate correct texts for the input text are generated from the candidate set, and the correction of the input text is selected according to the ranking result. Let Y = the input text and Yi = a sequence in the set of correct-text combinations for the input text.
Ranking score calculation:
Score = a1*ppl(Yi) + a2*edit_distance(Y, Yi) + a3*WordCount(Yi)
where ppl(Yi) is the perplexity assigned to Yi by the language model, edit_distance(Y, Yi) is the edit distance between Y and Yi, and WordCount(Yi) is the number of words in Yi. The language model used to compute the perplexity is a unidirectional LSTM statistical language model.
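The ranking score can be implemented directly; the weights a1–a3, the perplexity callback, and the Levenshtein routine below are illustrative choices (the patent does not fix the weights):

```python
def edit_distance(a, b):
    """Levenshtein distance via the single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def ranking_score(y, yi, ppl_fn, a1=1.0, a2=1.0, a3=1.0):
    """Score = a1*ppl(Yi) + a2*edit_distance(Y, Yi) + a3*WordCount(Yi).
    ppl_fn is a placeholder for the unidirectional-LSTM perplexity;
    len(yi) stands in for WordCount when yi is a token sequence."""
    return a1 * ppl_fn(yi) + a2 * edit_distance(y, yi) + a3 * len(yi)
```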
Fig. 3 is a diagram of the Chinese text error correction system according to the present invention. As shown in Fig. 3, the Chinese text error correction system includes:
the text to be corrected obtaining module 301 is configured to obtain the text to be corrected.
The system further includes: the corpus dictionary generating module, used for collecting original webpages, preprocessing them, determining a Chinese text corpus, and forming a corpus dictionary; the word segmentation module, used for performing word segmentation on texts in the corpus dictionary with a word segmenter and determining a plurality of segmented texts; the co-occurrence frequency determining module, used for counting the number of segmented texts and the co-occurrence frequency of any two words; and the N-gram model building module, used for building an N-gram model from the co-occurrence frequencies.
The wrong word and wrong word position determining module 302 is configured to determine a wrong word and a wrong word position in the text to be corrected according to a statistical language N-gram model.
The first candidate sentence set determining module 303 is configured to determine a first candidate sentence set using a two-way long-short term memory LSTM model based on the erroneous word and the erroneous word position.
The first candidate sentence set determining module 303 specifically includes: the word vector matrix conversion unit, used for converting the segmented text into a word vector matrix with a word vector tool; the trained LSTM model construction unit, used for taking the word vector matrix as the input of the LSTM model, training the LSTM model with a forward propagation algorithm and a delayed backpropagation algorithm, and constructing the trained LSTM model; the substituted text determining unit, used for substituting the words in the corpus dictionary one by one into the error word positions in the text to be corrected to determine substituted texts; the first candidate sentence list determining unit, used for inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the error word position, and sorting the substituted texts in order of occurrence probability from small to large to determine a first candidate sentence list; and the first candidate sentence set determining unit, used for determining a first candidate sentence set from the first candidate sentence list based on the error word.
The first candidate sentence set determining unit specifically includes: the first judging subunit, used for judging whether the error word exists in the first candidate sentence list to obtain a first judgment result; the correct-text determining subunit, used for determining, if the first judgment indicates that the error word exists in the first candidate sentence list, that the N-gram detection was a false alarm and that the text to be corrected is correct; the second candidate sentence list determining subunit, used for screening, if the first judgment indicates that the error word does not exist in the first candidate sentence list, homophones and near-homophones of the error word from the first candidate sentence set and determining a second candidate sentence list from them; and the first candidate sentence set determining subunit, used for substituting the words in the second candidate sentence list one by one into the error word positions in the text to be corrected to determine the first candidate sentence set.
And the pinyin sequence conversion module 304 is configured to convert the text to be corrected into a pinyin sequence.
A second candidate sentence determination module 305, configured to determine a second candidate sentence using the N-gram model based on the pinyin sequence.
The second candidate sentence determination module 305 specifically includes: a plurality of candidate sentence construction units, configured to construct a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary based on the pinyin sequence; and the second candidate sentence determining unit is used for determining the probability of the plurality of candidate sentences by utilizing the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
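The pinyin-to-candidate step of modules 304–305 can be sketched as follows. The pinyin-indexed dictionary and the probability callback are hypothetical, and exhaustive enumeration is shown only for clarity — a real system would prune the combinatorial space (e.g. with beam search):

```python
from itertools import product

# Hypothetical pinyin-indexed corpus dictionary.
PY_DICT = {"shi": ["是", "事"], "jian": ["件", "间"]}

def second_candidate(pinyin_seq, prob_fn):
    """Enumerate every character combination for the pinyin sequence
    and return the candidate sentence that the N-gram model (here,
    the prob_fn callback) scores highest."""
    choices = [PY_DICT.get(p, ["?"]) for p in pinyin_seq]
    candidates = ["".join(c) for c in product(*choices)]
    return max(candidates, key=prob_fn)
```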
The corrected text determining module 306 is configured to compare the confusion degree (perplexity) of every first candidate sentence in the first candidate sentence set with that of the second candidate sentence, and to take the sentence with the lowest confusion degree as the corrected text.
By combining a statistical language model with a deep neural network model, the present invention provides a Chinese text error correction method and system that markedly improve the error detection and correction rates for Chinese text; they can be applied to proofreading manuscript content in scenarios such as daily office work and have high practical value.
In this specification, the embodiments are described progressively, each focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples; the description is intended only to assist in understanding the method of the present invention and its core ideas. Modifications made by those of ordinary skill in the art in light of these teachings also fall within the scope of the invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (10)

1. A method for correcting errors in Chinese text, comprising:
acquiring a text to be corrected;
determining the error words and the positions of the error words in the text to be corrected according to a statistical language N-gram model;
determining a first candidate sentence set by utilizing a two-way long-short-term memory LSTM model based on the error word and the error word position;
converting the text to be corrected into a pinyin sequence;
determining a second candidate sentence by using the N-gram model based on the pinyin sequence;
comparing the confusion degree of all the first candidate sentences in the first candidate sentence set with the confusion degree of the second candidate sentences, and determining the sentence with the lowest confusion degree as the text after error correction.
2. The method for Chinese text correction according to claim 1, wherein said determining the erroneous word and the position of the erroneous word in the text to be corrected according to a statistical language N-gram model further comprises:
collecting an original webpage, preprocessing the original webpage, determining a Chinese text corpus, and forming a corpus dictionary;
performing word segmentation on texts in the corpus dictionary by using a word segmentation device, and determining a plurality of segmented texts;
counting the number of the text after word segmentation and the co-occurrence frequency of any two words;
and constructing an N-gram model according to the co-occurrence frequency.
3. The method for correcting Chinese text according to claim 2, wherein said determining a first candidate sentence set using a two-way long-short term memory LSTM model based on said erroneous word and said erroneous word position specifically comprises:
converting the text after word segmentation into a word vector matrix by using a word vector tool;
taking the word vector matrix as the input of an LSTM model, training the LSTM model by utilizing a forward propagation algorithm and a delayed reverse propagation algorithm, and constructing a trained LSTM model;
substituting words in the corpus dictionary into the error word positions in the text to be corrected one by one, and determining substituted text;
inputting the substituted text into the trained LSTM model, outputting the occurrence probability of each word in the corpus dictionary at the position of the wrong word, and sequencing the substituted text according to the order of the occurrence probability from small to large to determine a first candidate sentence list;
and determining a first candidate sentence set according to the first candidate sentence list based on the error word.
4. A method of Chinese text correction as recited in claim 3, wherein said determining a first candidate sentence set from said first candidate sentence list based on said erroneous word comprises:
judging whether the error word exists in the first candidate sentence list or not to obtain a first judging result;
if the first judgment indicates that the error word exists in the first candidate sentence list, determining that the text to be corrected is correct;
if the first judgment indicates that the error word does not exist in the first candidate sentence list, screening homophones and near-homophones of the error word from the first candidate sentence set, and determining a second candidate sentence list according to the homophones and near-homophones;
and substituting the words in the second candidate sentence list into the positions of the error words in the text to be corrected one by one, and determining a first candidate sentence set.
5. The method for correcting Chinese text according to claim 2, wherein the determining a second candidate sentence by using the N-gram model based on the pinyin sequence specifically comprises:
based on the pinyin sequence, constructing a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary;
and determining the probability of the plurality of candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
6. A Chinese text error correction system, comprising:
the text to be corrected acquisition module is used for acquiring the text to be corrected;
the error word and error word position determining module is used for determining the error word and error word position in the text to be corrected according to a statistical language N-gram model;
the first candidate sentence set determining module is used for determining a first candidate sentence set by utilizing a two-way long-short-term memory LSTM model based on the error words and the error word positions;
the pinyin sequence conversion module is used for converting the text to be corrected into a pinyin sequence;
the second candidate sentence determining module is used for determining a second candidate sentence by utilizing the N-gram model based on the pinyin sequence;
and the corrected text determining module is used for comparing the confusion degree of all the first candidate sentences in the first candidate sentence set with the confusion degree of the second candidate sentences, and determining the sentence with the lowest confusion degree as the corrected text.
7. The Chinese text error correction system of claim 6, further comprising:
the corpus dictionary generating module is used for collecting original webpages, preprocessing the original webpages, determining a Chinese text corpus and forming a corpus dictionary;
the word segmentation module is used for carrying out word segmentation processing on texts in the corpus dictionary by utilizing a word segmentation device and determining a plurality of segmented texts;
the co-occurrence frequency rate determining module is used for counting the number of the text after word segmentation and the co-occurrence frequency of any two words;
and the N-gram model building module is used for building an N-gram model according to the co-occurrence frequency.
8. The Chinese text error correction system of claim 7, wherein said first candidate sentence set determination module specifically comprises:
the word vector matrix conversion unit is used for converting the text subjected to word segmentation into a word vector matrix by using a word vector tool;
the trained LSTM model construction unit is used for taking the word vector matrix as the input of the LSTM model, training the LSTM model by utilizing a forward propagation algorithm and a delay reverse propagation algorithm, and constructing the trained LSTM model;
the substituted text determining unit is used for substituting the words in the corpus dictionary into the error word positions in the text to be corrected one by one to determine the substituted text;
the first candidate sentence list determining unit is used for inputting the substituted text into the trained LSTM model, outputting the occurrence probability of each word in the corpus dictionary at the position of the wrong word, and sequencing the substituted text according to the order of the occurrence probability from small to large to determine a first candidate sentence list;
and the first candidate sentence set determining unit is used for determining a first candidate sentence set according to the first candidate sentence list based on the error words.
9. The Chinese text error correction system of claim 8, wherein said first candidate sentence set determining unit specifically comprises:
the first judging subunit is used for judging whether the error word exists in the first candidate sentence list or not to obtain a first judging result;
a text to be corrected correctly determining subunit, configured to determine that the text to be corrected is correct if the first determination indicates that the error word exists in the first candidate sentence list;
a second candidate sentence list determining subunit, configured to screen homophones and near-homophones of the erroneous word from the first candidate sentence set if the first determination indicates that the erroneous word does not exist in the first candidate sentence list, and determine a second candidate sentence list according to the homophones and near-homophones;
and the first candidate sentence set determining subunit is used for substituting the words in the second candidate sentence list into the error word positions in the text to be corrected one by one to determine a first candidate sentence set.
10. The Chinese text error correction system of claim 7, wherein said second candidate sentence determination module comprises:
a plurality of candidate sentence construction units, configured to construct a plurality of candidate sentences according to the positions of the pinyin in the text to be corrected in the corpus dictionary based on the pinyin sequence;
and the second candidate sentence determining unit is used for determining the probability of the plurality of candidate sentences by utilizing the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
CN202011021044.4A 2020-09-25 2020-09-25 Chinese text error correction method and system Active CN112149406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011021044.4A CN112149406B (en) 2020-09-25 2020-09-25 Chinese text error correction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011021044.4A CN112149406B (en) 2020-09-25 2020-09-25 Chinese text error correction method and system

Publications (2)

Publication Number Publication Date
CN112149406A CN112149406A (en) 2020-12-29
CN112149406B true CN112149406B (en) 2023-09-08

Family

ID=73896929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011021044.4A Active CN112149406B (en) 2020-09-25 2020-09-25 Chinese text error correction method and system

Country Status (1)

Country Link
CN (1) CN112149406B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800987B (en) * 2021-02-02 2023-07-21 中国联合网络通信集团有限公司 Chinese character processing method and device
CN112735396A (en) * 2021-02-05 2021-04-30 北京小米松果电子有限公司 Speech recognition error correction method, device and storage medium
CN112989806A (en) * 2021-04-07 2021-06-18 广州伟宏智能科技有限公司 Intelligent text error correction model training method
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN113096667A (en) * 2021-04-19 2021-07-09 上海云绅智能科技有限公司 Wrongly-written character recognition detection method and system
CN113051896B (en) * 2021-04-23 2023-08-18 百度在线网络技术(北京)有限公司 Method and device for correcting text, electronic equipment and storage medium
CN112883717A (en) * 2021-04-27 2021-06-01 北京嘉和海森健康科技有限公司 Wrongly written character detection method and device
CN113343671B (en) * 2021-06-07 2023-03-31 佳都科技集团股份有限公司 Statement error correction method, device and equipment after voice recognition and storage medium
CN117113978A (en) * 2021-06-24 2023-11-24 湖北大学 Text error correction system for debugging by using shielding language model
CN113361266B (en) * 2021-06-25 2022-12-06 达闼机器人股份有限公司 Text error correction method, electronic device and storage medium
CN113780418A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Data screening method, system, equipment and storage medium
CN114328798B (en) * 2021-11-09 2024-02-23 腾讯科技(深圳)有限公司 Processing method, device, equipment, storage medium and program product for searching text
CN114495910B (en) * 2022-04-07 2022-08-02 联通(广东)产业互联网有限公司 Text error correction method, system, device and storage medium
CN115310434B (en) * 2022-10-11 2023-01-06 深圳擎盾信息科技有限公司 Error correction method and device for grammars of contracting documents, computer equipment and storage medium
CN115719059B (en) * 2022-11-29 2023-08-08 北京中科智加科技有限公司 Morse grouping error correction method
CN116090441B (en) * 2022-12-30 2023-10-20 永中软件股份有限公司 Chinese spelling error correction method integrating local semantic features and global semantic features
CN116306600B (en) * 2023-05-25 2023-08-11 山东齐鲁壹点传媒有限公司 MacBert-based Chinese text error correction method

Citations (2)

Publication number Priority date Publication date Assignee Title
CN110765996A (en) * 2019-10-21 2020-02-07 北京百度网讯科技有限公司 Text information processing method and device
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10431210B1 (en) * 2018-04-16 2019-10-01 International Business Machines Corporation Implementing a whole sentence recurrent neural network language model for natural language processing
US20200125639A1 (en) * 2018-10-22 2020-04-23 Ca, Inc. Generating training data from a machine learning model to identify offensive language

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN110765996A (en) * 2019-10-21 2020-02-07 北京百度网讯科技有限公司 Text information processing method and device
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field

Non-Patent Citations (1)

Title
Context-semantics-based error correction method for person names in news; Yang Yue; Huang Ruizhang; Wei Qin; Chen Yanping; Qin Yongbin; Journal of University of Electronic Science and Technology of China (Issue 06); full text *

Also Published As

Publication number Publication date
CN112149406A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
CN112149406B (en) Chinese text error correction method and system
CN111369996B (en) Speech recognition text error correction method in specific field
US7383172B1 (en) Process and system for semantically recognizing, correcting, and suggesting domain specific speech
US7424675B2 (en) Language input architecture for converting one text form to another text form with tolerance to spelling typographical and conversion errors
US7165019B1 (en) Language input architecture for converting one text form to another text form with modeless entry
JP6675463B2 (en) Bidirectional stochastic rewriting and selection of natural language
Wilcox-O’Hearn et al. Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model
US6311152B1 (en) System for chinese tokenization and named entity recognition
US20030046078A1 (en) Supervised automatic text generation based on word classes for language modeling
CN109145287B (en) Indonesia word error detection and correction method and system
CN111753529B (en) Chinese text error correction method based on pinyin identity or similarity
JP6778655B2 (en) Word concatenation discriminative model learning device, word concatenation detection device, method, and program
Lee et al. Automatic word spacing using probabilistic models based on character n-grams
CN114564912A (en) Intelligent checking and correcting method and system for document format
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
KR102204395B1 (en) Method and system for automatic word spacing of voice recognition using named entity recognition
Mekki et al. COTA 2.0: An automatic corrector of tunisian Arabic social media texts
Dinarelli Spoken language understanding: from spoken utterances to semantic structures
CN114548075A (en) Text processing method, text processing device, storage medium and electronic equipment
Parveen et al. Clause Boundary Identification using Classifier and Clause Markers in Urdu Language
KR19990070636A (en) Tagging device and its method
Duan et al. Research on Chinese Text Error Correction Based on Sequence Model
Athanaselis et al. A corpus based technique for repairing ill-formed sentences with word order errors using co-occurrences of n-grams
CN113033188B (en) Tibetan grammar error correction method based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant