CN112149406A - Chinese text error correction method and system - Google Patents

Chinese text error correction method and system

Info

Publication number
CN112149406A
CN112149406A (application CN202011021044.4A; granted as CN112149406B)
Authority
CN
China
Prior art keywords
text
word
candidate sentence
candidate
determining
Prior art date
Legal status
Granted
Application number
CN202011021044.4A
Other languages
Chinese (zh)
Other versions
CN112149406B (en)
Inventor
钱宝生
杨军
曾擂
王滨
干家东
Current Assignee
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202011021044.4A priority Critical patent/CN112149406B/en
Publication of CN112149406A publication Critical patent/CN112149406A/en
Application granted granted Critical
Publication of CN112149406B publication Critical patent/CN112149406B/en
Legal status: Active


Classifications

    • G06F 40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G06F 17/18 - Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F 40/205 - Parsing
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to a Chinese text error correction method and system. The method comprises: acquiring a text to be corrected; determining erroneous words and their positions in the text to be corrected according to a statistical N-gram language model; determining a first candidate sentence set using a bidirectional long short-term memory (LSTM) model based on the erroneous words and their positions; converting the text to be corrected into a pinyin sequence; determining a second candidate sentence using the N-gram model based on the pinyin sequence; and comparing the perplexities of all first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence, taking the sentence with the lowest perplexity as the corrected text. The invention improves the error detection and correction rates for Chinese text while lowering the hardware configuration requirements.

Description

Chinese text error correction method and system
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese text error correction method and a Chinese text error correction system.
Background
Chinese text often contains errors of various kinds: visually similar characters, homophones, terminology errors, semantic errors, and misused idioms or two-part allegorical sayings (xiehouyu). In some important settings a document containing errors can cause significant loss, and manual proofreading is inefficient and time-consuming when facing large volumes of text. The technical difficulties of Chinese text error correction are as follows:
(1) Accuracy of named entity recognition: some rule-based errors require dictionaries for the corresponding domain. Proofreading leaders' names, for example, requires a mapping between names and posts that can be updated in real time.
(2) Chinese grammar rules are complex: the most salient feature of standard Chinese grammar is the absence of morphological inflection in the strict sense. Nouns are not declined for case and carry no gender or number; verbs are not conjugated for person and have no tense. This feature, so different from European languages, led many linguists to hold for a long period of history that Chinese has no grammar and no parts of speech. Because the rules are so loosely defined, Chinese error correction carries large uncertainty, and false alarms may occur.
(3) Polysemy of Chinese characters: a Chinese character often has several readings and meanings. The character 还, for example, is read huán when it means "to return", and hái when it means "still" or "yet". Such errors are harder to correct successfully across different contexts.
Current error correction methods fall mainly into rule-based methods, N-gram statistical-model methods, and deep-neural-network methods. Rule-based methods execute quickly but have poor accuracy and adaptability; N-gram statistical models can only handle collocation errors between adjacent words and have no syntactic analysis capability; deep-neural-network methods place high demands on hardware configuration.
Disclosure of Invention
The invention aims to provide a Chinese text error correction method and system that address the problems of existing methods: low accuracy, the ability to handle only collocation errors between adjacent words, the lack of syntactic analysis capability, and high hardware configuration requirements.
In order to achieve the purpose, the invention provides the following scheme:
a Chinese text error correction method comprises the following steps:
acquiring a text to be corrected;
determining error words and error word positions in the text to be corrected according to a statistical language N-gram model;
determining a first candidate sentence set by utilizing a bidirectional long-short term memory (LSTM) model based on the error words and the error word positions;
converting the text to be corrected into a pinyin sequence;
determining a second candidate sentence by using the N-gram model based on the pinyin sequence;
and comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence, to determine the sentence with the lowest perplexity as the corrected text.
Optionally, before the determining of the erroneous words and their positions in the text to be corrected according to the statistical language N-gram model, the method further includes:
collecting an original webpage, preprocessing the original webpage, determining a Chinese text corpus and forming a corpus dictionary;
performing word segmentation processing on the texts in the corpus dictionary by using a word segmentation device, and determining a plurality of segmented texts;
counting the occurrence counts of all words in the segmented texts and the co-occurrence frequencies of any two words;
and constructing an N-gram model according to the co-occurrence frequency.
Optionally, the determining, based on the erroneous term and the position of the erroneous term, a first candidate statement set by using a bidirectional long-short term memory LSTM model specifically includes:
converting the text after word segmentation into a word vector matrix by using a word vector tool;
taking the word vector matrix as the input of an LSTM model, training the LSTM model by using a forward propagation algorithm and the backpropagation-through-time (BPTT) algorithm, and constructing the trained LSTM model;
substituting the characters in the corpus dictionary into the positions of the wrong words in the text to be corrected one by one, and determining the substituted text;
inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the position of the erroneous word, sorting the substituted texts by occurrence probability from small to large, and determining a first candidate sentence list;
based on the erroneous word, a first set of candidate sentences is determined from the first list of candidate sentences.
Optionally, the determining, based on the erroneous word and according to the first candidate sentence list, a first candidate sentence set specifically includes:
judging whether the wrong words exist in the first candidate sentence list or not to obtain a first judgment result;
if the first judgment result indicates that the erroneous word exists in the first candidate sentence list, determining that the text to be corrected is correct;
if the first judgment result indicates that the erroneous word does not exist in the first candidate sentence list, screening homophones and near-phonetic characters of the erroneous word out of the first candidate sentence list, and determining a second candidate sentence list according to the homophones and near-phonetic characters;
and substituting the words in the second candidate sentence list into the positions of the wrong words in the text to be corrected one by one to determine a first candidate sentence set.
Optionally, the determining, based on the pinyin sequence, a second candidate sentence by using the N-gram model specifically includes:
based on the pinyin sequence, constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of pinyin in the text to be corrected;
and determining the probabilities of the candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
A chinese text correction system comprising:
the text to be corrected acquiring module is used for acquiring a text to be corrected;
the error word and error word position determining module is used for determining the error word and the error word position in the text to be corrected according to a statistical language N-gram model;
a first candidate sentence set determining module, configured to determine a first candidate sentence set by using a bidirectional long-short term memory (LSTM) model based on the erroneous word and the erroneous word position;
the pinyin sequence conversion module is used for converting the text to be corrected into a pinyin sequence;
a second candidate sentence determination module, configured to determine a second candidate sentence by using the N-gram model based on the pinyin sequence;
and the corrected-text determining module is used for comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence and determining the sentence with the lowest perplexity as the corrected text.
Optionally, the method further includes:
the corpus dictionary generating module is used for collecting original webpages, preprocessing the original webpages, determining a Chinese text corpus and forming a corpus dictionary;
the word segmentation module is used for performing word segmentation processing on the texts in the corpus dictionary by using a word segmentation device and determining a plurality of segmented texts;
the co-occurrence frequency determining module is used for counting the occurrence counts of all words in the segmented texts and the co-occurrence frequency of any two words;
and the N-gram model building module is used for building an N-gram model according to the co-occurrence frequency.
Optionally, the first candidate sentence set determining module specifically includes:
the word vector matrix conversion unit is used for converting the text after word segmentation into a word vector matrix by using a word vector tool;
the trained LSTM model building unit is used for taking the word vector matrix as the input of the LSTM model, training the LSTM model by using a forward propagation algorithm and the backpropagation-through-time (BPTT) algorithm, and building the trained LSTM model;
the substituted text determining unit is used for substituting the characters in the corpus dictionary into the positions of the wrong words in the text to be corrected one by one to determine the substituted text;
a first candidate sentence list determining unit, configured to input the substituted texts into the trained LSTM model, output the occurrence probability of each word of the corpus dictionary at the position of the erroneous word, sort the substituted texts by occurrence probability from small to large, and determine a first candidate sentence list;
a first candidate sentence set determination unit, configured to determine a first candidate sentence set according to the first candidate sentence list based on the erroneous word.
Optionally, the first candidate sentence set determining unit specifically includes:
the first judgment subunit is configured to judge whether the wrong word exists in the first candidate sentence list, so as to obtain a first judgment result;
a text to be corrected correctness determining subunit, configured to determine that the text to be corrected is correct if the first determination indicates that the erroneous word exists in the first candidate sentence list;
a second candidate sentence list determining subunit, configured to, if the first judgment indicates that the erroneous word does not exist in the first candidate sentence list, screen homophones and near-phonetic characters of the erroneous word out of the first candidate sentence list, and determine a second candidate sentence list according to the homophones and near-phonetic characters;
and the first candidate sentence set determining subunit is used for substituting the words in the second candidate sentence list into the positions of the error words in the text to be corrected one by one to determine a first candidate sentence set.
Optionally, the second candidate sentence determining module specifically includes:
the candidate sentence construction units are used for constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of the pinyin in the text to be corrected based on the pinyin sequence;
a second candidate sentence determination unit configured to determine probabilities of the plurality of candidate sentences using the N-gram model, and to take a candidate sentence with a largest probability as the second candidate sentence.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects: the invention provides a Chinese text error correction method and system in which an N-gram statistical language model locates erroneous words in the text, a bidirectional LSTM deep neural network model and a pinyin-sequence edit distance generate a first candidate sentence set and a second candidate sentence respectively, and suitable correct words are selected for replacement by computing the perplexity of the candidate sentences. This improves the error detection and correction rates for Chinese text while keeping hardware requirements low; the method can be applied to proofreading manuscript content in daily office work and similar scenarios, and has high practical value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for correcting errors in a Chinese text according to the present invention;
FIG. 2 is a flow chart of another method for correcting errors in Chinese text according to the present invention;
FIG. 3 is a structural diagram of a Chinese text error correction system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a Chinese text error correction method and a Chinese text error correction system, which can improve the error checking and correction rate of Chinese texts and reduce the hardware configuration requirement.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a flowchart of the Chinese text error correction method provided by the present invention. As shown in Fig. 1, the Chinese text error correction method includes:
step 101: and acquiring the text to be corrected.
Step 102: and determining the error words and the positions of the error words in the text to be corrected according to a statistical language N-gram model.
Training N-gram statistical language model
About 50,000 original web pages are collected from public document websites on the Internet and preprocessed into a plain-text document corpus, from which a corpus dictionary is formed. Chinese word segmentation is performed with the jieba segmenter, the occurrence counts of all words and the co-occurrence frequencies of any two words are counted, and the co-occurrence probabilities of all 2-gram (bigram) word pairs are computed according to the N-gram model formula, yielding a 2-gram (bigram) statistical language model:
P(S)≈P(w1)*P(w2|w1)*P(w3|w2)*...*P(wn|wn-1)
The trained N-gram language model is then used to locate errors in an input sentence: based on the co-occurrence statistics of words in the training corpus, if the co-occurrence probability of an n-gram of characters is lower than a threshold, an error is considered to exist at that n-gram.
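As an illustration rather than part of the claimed method, the following Python sketch shows this bigram training and error-locating step. It assumes the jieba segmenter named above; the corpus iterable and the probability threshold are placeholders.

```python
from collections import Counter

import jieba  # the segmenter named in the description


def train_bigram(sentences):
    """Count unigrams and adjacent-pair co-occurrences over segmented text."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = list(jieba.cut(sent))
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def locate_errors(sentence, unigrams, bigrams, threshold=1e-4):
    """Flag positions whose co-occurrence probability
    P(w_i | w_i-1) = count(w_i-1, w_i) / count(w_i-1) is below the threshold."""
    words = list(jieba.cut(sentence))
    suspects = []
    for i in range(1, len(words)):
        prev = unigrams[words[i - 1]]
        p = bigrams[(words[i - 1], words[i])] / prev if prev else 0.0
        if p < threshold:
            suspects.append((i, words[i]))
    return suspects
```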
Step 103: based on the erroneous terms and the erroneous term locations, a first set of candidate sentences is determined using a two-way long-short term memory (LSTM) model.
The step 103 specifically includes: converting the segmented text into a word vector matrix with a word vector tool; taking the word vector matrix as the input of an LSTM model, training the LSTM model with forward propagation and backpropagation through time (BPTT), and obtaining the trained LSTM model; substituting the characters of the corpus dictionary one by one into the position of the erroneous word in the text to be corrected to obtain the substituted texts; inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the position of the erroneous word, sorting the substituted texts by occurrence probability from small to large, and determining a first candidate sentence list; and, based on the erroneous word, determining a first candidate sentence set from the first candidate sentence list.
Determining the first candidate sentence set from the first candidate sentence list based on the erroneous word specifically includes: judging whether the erroneous word exists in the first candidate sentence list; if so, the N-gram model misjudged and the text to be corrected is deemed correct; if not, homophones and near-phonetic characters of the erroneous word are screened out of the list, a second candidate sentence list is determined from them, and the words of the second candidate sentence list are substituted one by one into the position of the erroneous word in the text to be corrected to obtain the first candidate sentence set.
The bidirectional LSTM model is trained on the corpus in the following steps:
a) sentences in the preprocessed text corpus are converted into word vector matrices with word2vec and used as the input of the LSTM model;
b) the model is trained with the forward propagation and backpropagation-through-time algorithms.
The words of the dictionary are substituted one by one into the error position of the sentence, each substituted new sentence is input into the trained bidirectional LSTM model, the model computes and outputs the probability of each word of the dictionary, the probabilities are sorted, and the top K entries are kept as set A.
The following decision is made on set A: if the word the N-gram model flagged as erroneous is in set A, the N-gram model misjudged, i.e. the sentence contains no error. If the flagged word is not in set A, the homophones and near-phonetic characters of that word are screened out of set A as a new set A', the words of A' are substituted one by one into the erroneous part of the sentence to obtain the first candidate sentence set S, and the PPL of every sentence in S is computed.
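A minimal Python sketch of this set-A screening follows, under stated assumptions: `char_probs` stands in for the bidirectional LSTM's output distribution over the dictionary at the error position, and `same_or_near_pinyin` is a hypothetical helper (it could be built on a pinyin library) implementing the homophone / near-phonetic test.

```python
def build_first_candidates(sentence, err_pos, err_char, char_probs,
                           same_or_near_pinyin, topk=10):
    """Return the first candidate sentence set S, or [] if the N-gram
    flag is judged a false positive."""
    # set A: the topK dictionary characters by model probability
    top_chars = [c for c, _ in sorted(char_probs.items(),
                                      key=lambda kv: kv[1],
                                      reverse=True)[:topk]]
    if err_char in top_chars:
        return []  # the flagged character is plausible here; no error
    # set A': keep only homophones / near-phonetic characters of err_char
    a_prime = [c for c in top_chars if same_or_near_pinyin(c, err_char)]
    # substitute each surviving character into the error position
    return [sentence[:err_pos] + c + sentence[err_pos + 1:] for c in a_prime]
```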
Calculation formula of PPL:
A sentence s consists of words, where w denotes a word:
s = w1 w2 … wN
PPL(S) = P(w1 w2 … wN)^(-1/N)
where P is the probability of the sentence and N is its length, i.e. the number of words.
In particular, for the 2-gram model:

PPL(S) = p(w1 w2 … wn)^(-1/n), with p(w1 w2 … wn) ≈ ∏ p(wi|wi-1)

where p(w1…wn) is the probability of the sentence and p(wi|wi-1) is the conditional probability of the two words co-occurring, which can be output directly by the trained 2-gram model and is computed as:

P(wi|wi-1) = count(wi, wi-1) / count(wi-1)

where count(wi-1) is the number of occurrences of the word wi-1 in the corpus and count(wi, wi-1) is the number of times the two words wi, wi-1 occur together.
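Under the same assumptions as the training sketch above (unigram and bigram Counters from a segmented corpus), the 2-gram PPL can be computed as follows; the probability floor guarding against zero counts is an assumption, since the patent does not name a smoothing scheme.

```python
import math


def bigram_ppl(words, unigrams, bigrams, floor=1e-8):
    """PPL(S) = P(w1..wN)^(-1/N), with P(w1..wN) factored into
    bigram conditional probabilities as in the formulas above."""
    log_p = 0.0
    for w_prev, w in zip(words, words[1:]):
        p = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
        log_p += math.log(max(p, floor))
    return math.exp(-log_p / len(words))
```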
Step 104: converting the text to be corrected into a pinyin sequence.
Step 105: determining a second candidate sentence by using the N-gram model based on the pinyin sequence.
The step 105 specifically includes: based on the pinyin sequence, constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of pinyin in the text to be corrected; and determining the probabilities of the candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
Error positioning and correction based on a pinyin sequence dynamic programming algorithm:
The entire input text to be corrected (sentence X) is converted into a pinyin sequence. Each pinyin syllable corresponds to one or more Chinese characters, and all the candidate characters, arranged according to the positions of their pinyin in the original sentence, form L candidate sentences. The probability of each sentence is computed with the 2-gram language model:
P(S)≈P(w1)*P(w2|w1)*P(w3|w2)*...*P(wn|wn-1)
the second candidate sentence (sentence Y) with the highest probability is selected.
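A sketch of this pinyin channel is given below. It assumes the pypinyin package for the character-to-pinyin direction and a pinyin-to-candidate-characters dictionary `py2chars` built offline from the corpus dictionary; `sentence_prob` stands in for the 2-gram sentence probability, and the beam cap bounding the L candidate sentences is an assumption.

```python
from itertools import product

from pypinyin import lazy_pinyin  # assumed pinyin conversion package


def second_candidate(sentence, py2chars, sentence_prob, beam=1000):
    """Build candidate sentences from the pinyin sequence of `sentence`
    and return the one the 2-gram model scores highest (sentence Y)."""
    pinyins = lazy_pinyin(sentence)           # one syllable per character
    choices = [py2chars.get(py, [ch])         # fall back to the original char
               for py, ch in zip(pinyins, sentence)]
    candidates = []
    for combo in product(*choices):
        candidates.append("".join(combo))
        if len(candidates) >= beam:           # cap the combinatorial blow-up
            break
    return max(candidates, key=sentence_prob)
```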
The text to be corrected X is compared with sentence Y; if the characters of the two sentences differ at some position i, the PPL of X and of Y is computed, with the PPL computed as above.
Step 106: comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence, and determining the sentence with the lowest perplexity as the corrected text.
The PPL values of X, Y and all sentences in S are compared; the sentence with the smallest PPL value is selected as the corrected sentence and output.
In the Chinese text error correction method provided by the invention, the correction process is divided into two stages, error detection and error correction. Error detection judges the possibly erroneous words and their positions in a sentence from the word co-occurrence probabilities computed by the N-gram model. The error correction stage first generates, from the detected error positions and erroneous words, a candidate word list for each position according to the deep neural network model, then ranks and screens the candidate words for each erroneous word and recommends the best result to the user. Fig. 2 is another flowchart of the Chinese text error correction method provided by the invention, as shown in Fig. 2.
The method locates and corrects erroneous words in the text through two channels, using perplexity (PPL) as the measurement index. (Perplexity is a measure of the quality of a probabilistic language model in natural language processing; a language model can be regarded as a probability distribution over whole sentences or text segments.)
In practical application, the invention is specifically applied as follows:
(1) n-gram language model calculation process:
In the case of Chinese typos, whether a sentence is correct can be determined by computing its probability. For a sentence S = {w1, w2, …, wn}, the problem can be converted into the following form:
P(s)=P(w1,w2,...,wn)=P(w1)*P(w2|w1)*…*P(wn|w1,w2,…,wn-1)
P(s) is called a language model, i.e. a model used to compute the probability that a sentence is well formed.
When this formula is used for actual computation, the parameter space is too large and the data are severely sparse, so it is impractical. In practice an N-gram model is used: based on the Markov assumption, the occurrence probability of a word depends only on its previous word or previous few words, and the formula becomes:
(1) the occurrence of a word depends only on the first 1 word, i.e. Bigram (2-gram):
P(S)≈P(w1)*P(w2|w1)*P(w3|w2)*…*P(wn|wn-1)
(2) the appearance of a word depends only on the first 2 words, i.e. Trigram (3-gram):
P(S)≈P(w1)*P(w2|w1)*P(w3|w1w2)*…*P(wn|wn-2wn-1)
The larger the n of the n-gram, the stronger the constraint on the next word and the more information it provides, but also the more complex the model and the more severe the sparsity problems; in practice a bigram or trigram is generally adopted.
The specific use of n-grams is illustrated below with a simple example:
the N-gram model constructs a language model through the statistics of the number of words, and the calculation formula of the Bigram is as follows:
P(wi|wi-1)=count(wi,wi-1)/count(wi-1)
P is the conditional probability of the two words co-occurring, and the two counts are, respectively, the number of times the word pair and the single word occur in the corpus.
The Bigram is a 2-gram language model: the co-occurrence probabilities of word pairs are computed from the counts of 2-word sequences and of single words in the corpus; the Trigram is the analogous 3-gram language model.
Suppose there is the following corpus, where <s1> and <s2> are sentence-start tags and </s2> and </s1> are the corresponding end tags:
<s1><s2>yes no no no no yes</s2></s1>
<s1><s2>no no no yes yes yes no</s2></s1>
The task is now to evaluate the probability of the following sentence:
<s1><s2>yes no no yes</s2></s1>
Calculating the probability with a trigram model, the component conditional probabilities are:
P(yes|<s1>,<s2>)=1/2,
P(no|yes,no)=1/2,
P(</s2>|no,yes)=1/2,
P(no|<s2>,yes)=1,
P(yes|no,no)=2/5,
P(</s1>|yes,</s2>)=1
The required probability is therefore:
1/2×1×1/2×2/5×1/2×1=0.05
If the probability is less than a defined threshold, the sentence contains an error or is otherwise unreasonable.
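The worked example can be checked in code; the sketch below reproduces the toy corpus and the trigram estimate count(a, b, c) / count(a, b) used above.

```python
from collections import Counter

corpus = [
    "<s1> <s2> yes no no no no yes </s2> </s1>".split(),
    "<s1> <s2> no no no yes yes yes no </s2> </s1>".split(),
]
tri, bi = Counter(), Counter()
for sent in corpus:
    tri.update(zip(sent, sent[1:], sent[2:]))
    bi.update(zip(sent, sent[1:]))

query = "<s1> <s2> yes no no yes </s2> </s1>".split()
p = 1.0
for a, b, c in zip(query, query[1:], query[2:]):
    p *= tri[(a, b, c)] / bi[(a, b)]
print(p)  # 0.05, matching the hand computation above
```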
Typos in Chinese text are local: it suffices to choose a reasonable sliding window and check whether it contains a typo. An example follows:
the text "this case was already handled by the upper court to the lower court" is entered, wherein "give" wrongly written "is" give ". When the model carries out local analysis on the sentence, the co-occurrence probability of the calculated word strings is lower than a threshold value, the analyzer refuses to accept, and the sentence is judged to be wrong.
The 'through' word is checked to be wrongly typed by using an n-gram model, the 'through' word is converted into the pinyin 'chuan', the candidate word of the 'chuan' is searched from the dictionary, and the candidate word is checked to see whether the word is reasonable or not by using the n-gram after one trial filling. The method is characterized in that the n-gram model is combined with the pinyin of the Chinese characters to correct wrongly written characters of the Chinese text.
(2) Error checking process
The error checking module comprises a Bigram subword co-occurrence module and a neural network model.
Bigram sub-word co-occurrence: the co-occurrence counts of two sub-words within a window of length k are gathered over the collected large-scale corpus. Order is taken into account, so (w1, w2) is counted separately from (w2, w1); finally the high-frequency co-occurring word pairs are retained as initial information for the neural network language model.
Neural network language model: the invention adopts a bidirectional-LSTM-based neural network language model to capture the context of the input text, so that the probability distribution of the character at the current position is predicted from its context. The current character is modeled by its conditional probability given each character of the surrounding sentence, which yields the final error-detection result for the current Chinese character together with the suspected erroneous character and a candidate set.
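A minimal PyTorch sketch of such a bidirectional-LSTM character language model is given below; the dimensions are illustrative, and in the full system the softmax over these logits at the error position would supply the per-character probabilities used in the set-A screening sketched earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMCharLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.fwd = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, ids):  # ids: (batch, seq) character indices
        x = self.emb(ids)
        h_f, _ = self.fwd(x)                        # state at i encodes w_1..w_i
        h_b, _ = self.bwd(torch.flip(x, dims=[1]))
        h_b = torch.flip(h_b, dims=[1])             # state at i encodes w_i..w_n
        # shift one step so position i sees only its left and right context
        left = F.pad(h_f[:, :-1, :], (0, 0, 1, 0))
        right = F.pad(h_b[:, 1:, :], (0, 0, 0, 1))
        return self.out(torch.cat([left, right], dim=-1))  # (batch, seq, vocab)
```

Training (not shown) would minimize the cross-entropy between these position-wise logits and the true character at each position, via the backpropagation-through-time procedure described earlier.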
(3) Correction procedure
For the located erroneous words in the input text, the combinations of corrected texts corresponding to the input text are obtained from the candidate set, and the correction result for the input text is obtained from the ranking result. Y denotes the input text and Yi the i-th candidate corrected text.
Calculating a ranking score:
Score=a1*ppl(Yi)+a2*edit_distance(Y,Yi)+a3*WordCount(Yi)
where ppl(Yi) is the language-model perplexity of Yi, edit_distance(Y, Yi) is the edit distance between Y and Yi, and WordCount(Yi) is the number of words in Yi. The language model used to compute ppl here is a unidirectional LSTM statistical language model.
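A sketch of this ranking score follows; the weights a1, a2, a3 are not specified in the patent and appear here as placeholders, `ppl` stands in for the unidirectional LSTM language model's perplexity, and the character count is used as a simple stand-in for WordCount.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]


def ranking_score(y, y_i, ppl, a1=1.0, a2=1.0, a3=1.0):
    """Score = a1*ppl(Yi) + a2*edit_distance(Y, Yi) + a3*WordCount(Yi)."""
    return a1 * ppl(y_i) + a2 * edit_distance(y, y_i) + a3 * len(y_i)
```

Candidates are then ranked by this score and the best result is recommended to the user.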
Fig. 3 is a structural diagram of the Chinese text error correction system according to the present invention. As shown in Fig. 3, the Chinese text error correction system includes:
a text to be corrected obtaining module 301, configured to obtain a text to be corrected.
The system also includes: a corpus dictionary generating module, used for collecting original web pages, preprocessing them, determining a Chinese text corpus and forming a corpus dictionary; a word segmentation module, used for segmenting the texts in the corpus dictionary with a segmenter and determining a plurality of segmented texts; a co-occurrence frequency determining module, used for counting the occurrence counts of all words in the segmented texts and the co-occurrence frequency of any two words; and an N-gram model building module, used for building an N-gram model from the co-occurrence frequencies.
And an erroneous word and erroneous word position determining module 302, configured to determine an erroneous word and an erroneous word position in the text to be corrected according to a statistical language N-gram model.
A first candidate sentence set determining module 303, configured to determine a first candidate sentence set by using a bidirectional long-short term memory LSTM model based on the erroneous word and the erroneous word position.
The first candidate sentence set determining module 303 specifically includes: a word vector matrix conversion unit, used for converting the segmented text into a word vector matrix with a word vector tool; a trained LSTM model building unit, used for taking the word vector matrix as the input of the LSTM model, training the LSTM model with forward propagation and backpropagation through time (BPTT), and building the trained LSTM model; a substituted-text determining unit, used for substituting the characters of the corpus dictionary one by one into the position of the erroneous word in the text to be corrected to determine the substituted texts; a first candidate sentence list determining unit, used for inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the position of the erroneous word, sorting the substituted texts by occurrence probability from small to large, and determining a first candidate sentence list; and a first candidate sentence set determining unit, used for determining a first candidate sentence set from the first candidate sentence list based on the erroneous word.
The first candidate sentence set determining unit specifically includes: a first judgment subunit, used for judging whether the erroneous word exists in the first candidate sentence list to obtain a first judgment result; a correctness determining subunit, used for determining, if the first judgment result indicates that the erroneous word exists in the first candidate sentence list, that the N-gram model misjudged and the text to be corrected is correct; a second candidate sentence list determining subunit, used for screening out, if the first judgment result indicates that the erroneous word does not exist in the first candidate sentence list, the homophones and near-phonetic characters of the erroneous word from the first candidate sentence list and determining a second candidate sentence list from them; and a first candidate sentence set determining subunit, used for substituting the words of the second candidate sentence list one by one into the position of the erroneous word in the text to be corrected to determine the first candidate sentence set.
And a pinyin sequence conversion module 304, configured to convert the text to be corrected into a pinyin sequence.
A second candidate sentence determination module 305, configured to determine a second candidate sentence by using the N-gram model based on the pinyin sequence.
The second candidate sentence determining module 305 specifically includes: the candidate sentence construction units are used for constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of the pinyin in the text to be corrected based on the pinyin sequence; a second candidate sentence determination unit configured to determine probabilities of the plurality of candidate sentences using the N-gram model, and to take a candidate sentence with a largest probability as the second candidate sentence.
And an error-corrected text determining module 306, configured to compare the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence, and determine the sentence with the lowest perplexity as the corrected text.
By combining a statistical language model with a deep neural network model, the Chinese text error correction method and system provided by the invention markedly improve the error detection and correction rates for Chinese text, can be applied to proofreading manuscript content in daily office work and similar scenarios, and have high practical value.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A Chinese text error correction method is characterized by comprising the following steps:
acquiring a text to be corrected;
determining error words and error word positions in the text to be corrected according to a statistical language N-gram model;
determining a first candidate sentence set by utilizing a bidirectional long-short term memory (LSTM) model based on the error words and the error word positions;
converting the text to be corrected into a pinyin sequence;
determining a second candidate sentence by using the N-gram model based on the pinyin sequence;
and comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence, to determine the sentence with the lowest perplexity as the corrected text.
2. The Chinese text error correction method according to claim 1, wherein, before the determining of the erroneous words and their positions in the text to be corrected according to the statistical language N-gram model, the method further comprises:
collecting an original webpage, preprocessing the original webpage, determining a Chinese text corpus and forming a corpus dictionary;
performing word segmentation processing on the texts in the corpus dictionary by using a word segmentation device, and determining a plurality of segmented texts;
counting the occurrence counts of all words in the segmented texts and the co-occurrence frequencies of any two words;
and constructing an N-gram model according to the co-occurrence frequency.
3. The method of correcting chinese text according to claim 1, wherein the determining a first set of candidate sentences using a two-way long-short term memory LSTM model based on the erroneous terms and the erroneous term positions comprises:
converting the text after word segmentation into a word vector matrix by using a word vector tool;
taking the word vector matrix as the input of an LSTM model, training the LSTM model by using a forward propagation algorithm and the backpropagation-through-time (BPTT) algorithm, and constructing the trained LSTM model;
substituting the characters in the corpus dictionary into the positions of the wrong words in the text to be corrected one by one, and determining the substituted text;
inputting the substituted texts into the trained LSTM model, outputting the occurrence probability of each word of the corpus dictionary at the position of the erroneous word, sorting the substituted texts by occurrence probability from small to large, and determining a first candidate sentence list;
based on the erroneous word, a first set of candidate sentences is determined from the first list of candidate sentences.
4. The method for correcting chinese text according to claim 3, wherein the determining a first set of candidate sentences according to the first list of candidate sentences based on the erroneous word comprises:
judging whether the wrong words exist in the first candidate sentence list or not to obtain a first judgment result;
if the first judgment result indicates that the erroneous word exists in the first candidate sentence list, determining that the text to be corrected is correct;
if the first judgment result indicates that the erroneous word does not exist in the first candidate sentence list, screening homophones and near-phonetic characters of the erroneous word out of the first candidate sentence list, and determining a second candidate sentence list according to the homophones and near-phonetic characters;
and substituting the words in the second candidate sentence list into the positions of the wrong words in the text to be corrected one by one to determine a first candidate sentence set.
5. The method for correcting chinese text according to claim 1, wherein the determining a second candidate sentence using the N-gram model based on the pinyin sequence specifically includes:
based on the pinyin sequence, constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of pinyin in the text to be corrected;
and determining the probabilities of the candidate sentences by using the N-gram model, and taking the candidate sentence with the highest probability as the second candidate sentence.
6. A chinese text correction system, comprising:
the text to be corrected acquiring module is used for acquiring a text to be corrected;
the error word and error word position determining module is used for determining the error word and the error word position in the text to be corrected according to a statistical language N-gram model;
a first candidate sentence set determining module, configured to determine a first candidate sentence set by using a bidirectional long-short term memory (LSTM) model based on the erroneous word and the erroneous word position;
the pinyin sequence conversion module is used for converting the text to be corrected into a pinyin sequence;
a second candidate sentence determination module, configured to determine a second candidate sentence by using the N-gram model based on the pinyin sequence;
and the corrected-text determining module is used for comparing the perplexities of all the first candidate sentences in the first candidate sentence set with the perplexity of the second candidate sentence and determining the sentence with the lowest perplexity as the corrected text.
7. The chinese text correction system of claim 6, further comprising:
the corpus dictionary generating module is used for collecting original webpages, preprocessing the original webpages, determining a Chinese text corpus and forming a corpus dictionary;
the word segmentation module is used for performing word segmentation processing on the texts in the corpus dictionary by using a word segmentation device and determining a plurality of segmented texts;
the co-occurrence frequency determining module is used for counting the occurrence counts of all words in the segmented texts and the co-occurrence frequency of any two words;
and the N-gram model building module is used for building an N-gram model according to the co-occurrence frequency.
8. The chinese text correction system of claim 6, wherein the first candidate sentence set determining module specifically comprises:
the word vector matrix conversion unit is used for converting the text after word segmentation into a word vector matrix by using a word vector tool;
the trained LSTM model building unit is used for taking the word vector matrix as the input of the LSTM model, training the LSTM model by using a forward propagation algorithm and the backpropagation-through-time (BPTT) algorithm, and building the trained LSTM model;
the substituted text determining unit is used for substituting the characters in the corpus dictionary into the positions of the wrong words in the text to be corrected one by one to determine the substituted text;
a first candidate sentence list determining unit, configured to input the substituted texts into the trained LSTM model, output the occurrence probability of each word of the corpus dictionary at the position of the erroneous word, sort the substituted texts by occurrence probability from small to large, and determine a first candidate sentence list;
a first candidate sentence set determination unit, configured to determine a first candidate sentence set according to the first candidate sentence list based on the erroneous word.
9. The chinese text correction system of claim 8, wherein the first candidate sentence set determining unit specifically includes:
the first judgment subunit is configured to judge whether the wrong word exists in the first candidate sentence list, so as to obtain a first judgment result;
a text to be corrected correctness determining subunit, configured to determine that the text to be corrected is correct if the first determination indicates that the erroneous word exists in the first candidate sentence list;
a second candidate sentence list determining subunit, configured to, if the first judgment indicates that the erroneous word does not exist in the first candidate sentence list, screen homophones and near-phonetic characters of the erroneous word out of the first candidate sentence list, and determine a second candidate sentence list according to the homophones and near-phonetic characters;
and the first candidate sentence set determining subunit is used for substituting the words in the second candidate sentence list into the positions of the error words in the text to be corrected one by one to determine a first candidate sentence set.
10. The chinese text correction system of claim 6, wherein the second candidate sentence determination module specifically comprises:
the candidate sentence construction units are used for constructing a plurality of candidate sentences for the text in the corpus dictionary according to the position of the pinyin in the text to be corrected based on the pinyin sequence;
a second candidate sentence determination unit configured to determine probabilities of the plurality of candidate sentences using the N-gram model, and to take a candidate sentence with a largest probability as the second candidate sentence.
CN202011021044.4A 2020-09-25 2020-09-25 Chinese text error correction method and system Active CN112149406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011021044.4A CN112149406B (en) 2020-09-25 2020-09-25 Chinese text error correction method and system


Publications (2)

Publication Number Publication Date
CN112149406A true CN112149406A (en) 2020-12-29
CN112149406B CN112149406B (en) 2023-09-08

Family

ID=73896929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011021044.4A Active CN112149406B (en) 2020-09-25 2020-09-25 Chinese text error correction method and system

Country Status (1)

Country Link
CN (1) CN112149406B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190318732A1 (en) * 2018-04-16 2019-10-17 International Business Machines Corporation Implementing a whole sentence recurrent neural network language model for natural language processing
US20200125639A1 (en) * 2018-10-22 2020-04-23 Ca, Inc. Generating training data from a machine learning model to identify offensive language
CN110765996A (en) * 2019-10-21 2020-02-07 北京百度网讯科技有限公司 Text information processing method and device
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨越; 黄瑞章; 魏琴; 陈艳平; 秦永彬: "基于上下文语义的新闻人名纠错方法" (A method for correcting person names in news based on contextual semantics), Journal of University of Electronic Science and Technology of China, no. 06 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800987A (en) * 2021-02-02 2021-05-14 中国联合网络通信集团有限公司 Chinese character processing method and device
CN112800987B (en) * 2021-02-02 2023-07-21 中国联合网络通信集团有限公司 Chinese character processing method and device
CN112735396A (en) * 2021-02-05 2021-04-30 北京小米松果电子有限公司 Speech recognition error correction method, device and storage medium
CN112989806A (en) * 2021-04-07 2021-06-18 广州伟宏智能科技有限公司 Intelligent text error correction model training method
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN113096667A (en) * 2021-04-19 2021-07-09 上海云绅智能科技有限公司 Wrongly-written character recognition detection method and system
CN113051896A (en) * 2021-04-23 2021-06-29 百度在线网络技术(北京)有限公司 Method and device for correcting text, electronic equipment and storage medium
CN113051896B (en) * 2021-04-23 2023-08-18 百度在线网络技术(北京)有限公司 Method and device for correcting text, electronic equipment and storage medium
CN112883717A (en) * 2021-04-27 2021-06-01 北京嘉和海森健康科技有限公司 Wrongly written character detection method and device
CN113343671B (en) * 2021-06-07 2023-03-31 佳都科技集团股份有限公司 Statement error correction method, device and equipment after voice recognition and storage medium
CN113343671A (en) * 2021-06-07 2021-09-03 佳都科技集团股份有限公司 Statement error correction method, device and equipment after voice recognition and storage medium
CN113435187A (en) * 2021-06-24 2021-09-24 湖北大学 Text error correction method and system for industrial alarm information
CN113435187B (en) * 2021-06-24 2023-07-07 湖北大学 Text error correction method and system for industrial alarm information
CN113361266A (en) * 2021-06-25 2021-09-07 达闼机器人有限公司 Text error correction method, electronic device and storage medium
CN113361266B (en) * 2021-06-25 2022-12-06 达闼机器人股份有限公司 Text error correction method, electronic device and storage medium
CN113780418A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Data screening method, system, equipment and storage medium
CN113887203A (en) * 2021-09-29 2022-01-04 平安普惠企业管理有限公司 Text error correction method, device and equipment based on artificial intelligence and storage medium
CN113887202A (en) * 2021-09-29 2022-01-04 平安普惠企业管理有限公司 Text error correction method and device, computer equipment and storage medium
CN114328798A (en) * 2021-11-09 2022-04-12 腾讯科技(深圳)有限公司 Processing method, device, equipment, storage medium and program product for searching text
CN114328798B (en) * 2021-11-09 2024-02-23 腾讯科技(深圳)有限公司 Processing method, device, equipment, storage medium and program product for searching text
CN114417834A (en) * 2021-12-24 2022-04-29 深圳云天励飞技术股份有限公司 Text processing method and device, electronic equipment and readable storage medium
CN114528824A (en) * 2021-12-24 2022-05-24 深圳云天励飞技术股份有限公司 Text error correction method and device, electronic equipment and storage medium
CN114386399A (en) * 2021-12-30 2022-04-22 中国电信股份有限公司 Text error correction method and device
CN114492396A (en) * 2022-02-17 2022-05-13 重庆长安汽车股份有限公司 Text error correction method for automobile proper nouns and readable storage medium
CN114707492A (en) * 2022-03-22 2022-07-05 昆明理工大学 Vietnamese grammar error correction method and device fusing multi-granularity characteristics
CN115223588A (en) * 2022-03-24 2022-10-21 华东师范大学 Child voice phrase matching method based on pinyin distance and sliding window
CN114495910A (en) * 2022-04-07 2022-05-13 联通(广东)产业互联网有限公司 Text error correction method, system, device and storage medium
CN115310434A (en) * 2022-10-11 2022-11-08 深圳擎盾信息科技有限公司 Error correction method and device for grammars of contracting documents, computer equipment and storage medium
CN115719059A (en) * 2022-11-29 2023-02-28 北京中科智加科技有限公司 Morse packet error correction method
CN115719059B (en) * 2022-11-29 2023-08-08 北京中科智加科技有限公司 Morse grouping error correction method
CN116090441A (en) * 2022-12-30 2023-05-09 永中软件股份有限公司 Chinese spelling error correction method integrating local semantic features and global semantic features
CN116090441B (en) * 2022-12-30 2023-10-20 永中软件股份有限公司 Chinese spelling error correction method integrating local semantic features and global semantic features
CN116306600A (en) * 2023-05-25 2023-06-23 山东齐鲁壹点传媒有限公司 MacBert-based Chinese text error correction method
CN116306600B (en) * 2023-05-25 2023-08-11 山东齐鲁壹点传媒有限公司 MacBert-based Chinese text error correction method

Also Published As

Publication number Publication date
CN112149406B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN112149406B (en) Chinese text error correction method and system
CN110489760A (en) Based on deep neural network text auto-collation and device
Kanakaraddi et al. Survey on parts of speech tagger techniques
US7424675B2 (en) Language input architecture for converting one text form to another text form with tolerance to spelling typographical and conversion errors
US7165019B1 (en) Language input architecture for converting one text form to another text form with modeless entry
US7035789B2 (en) Supervised automatic text generation based on word classes for language modeling
Wilcox-O’Hearn et al. Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model
CN111611810A (en) Polyphone pronunciation disambiguation device and method
WO2008059111A2 (en) Natural language processing
JP6778655B2 (en) Word concatenation discriminative model learning device, word concatenation detection device, method, and program
US7464024B2 (en) Chinese character-based parser
CN102214238A (en) Device and method for matching similarity of Chinese words
CN114564912B (en) Intelligent document format checking and correcting method and system
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
Lee et al. Automatic word spacing using probabilistic models based on character n-grams
Mundotiya et al. Linguistic resources for Bhojpuri, Magahi, and Maithili: statistics about them, their similarity estimates, and baselines for three applications
JPH10326275A (en) Method and device for morpheme analysis and method and device for japanese morpheme analysis
Motlani et al. A finite-state morphological analyser for Sindhi
Mekki et al. COTA 2.0: An automatic corrector of Tunisian Arabic social media texts
Zarnoufi et al. MANorm: A normalization dictionary for Moroccan Arabic dialect written in Latin script
CN115034209A (en) Text analysis method and device, electronic equipment and storage medium
CN114528861A (en) Foreign language translation training method and device based on corpus
CN114896966A (en) Method, system, equipment and medium for positioning grammar error of Chinese text
CN113468875A (en) MNet method for semantic analysis of natural language interaction interface of SCADA system
Aliprandi et al. An inflected-sensitive letter and word prediction system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant