CN111639488A - English word correction system, method, application, device and readable storage medium - Google Patents

Info

Publication number
CN111639488A
Authority
CN
China
Prior art keywords
word
words
english
probability
candidate set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010414063.7A
Other languages
Chinese (zh)
Inventor
李振
张刚
鲍东岳
尹正
徐超
刘昊霖
周圣文
孙梅
吕亚波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minsheng Science And Technology Co ltd
Original Assignee
Minsheng Science And Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minsheng Science And Technology Co ltd filed Critical Minsheng Science And Technology Co ltd
Priority to CN202010414063.7A priority Critical patent/CN111639488A/en
Publication of CN111639488A publication Critical patent/CN111639488A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

The invention provides an English word correction system, method, device, application and readable storage medium. A dictionary for a particular domain is first constructed, and the words of a text are then corrected one by one. Each word in a sentence is scanned; if the word is in the constructed dictionary, it is considered correct and scanning continues with the next word. If not, a candidate set is generated for the erroneous word: the set of all possible correct words obtained from the dictionary according to the four possible error types (a replaced letter, a missing letter, an extra letter, a missing space), limited to an edit distance of 1. The best candidate is then selected (calculated from a noisy channel model and the prior probability) and replaces the original erroneous word.

Description

English word correction system, method, application, device and readable storage medium
[ technical field ]
The invention relates to the technical field of computer character processing, in particular to an English word correction system, method, device, application and readable storage medium based on an OCR recognition technology.
[ background of the invention ]
OCR is an abbreviation for Optical Character Recognition. It is a computer input technology that converts the characters of bills, newspapers, books, manuscripts and other printed matter into image information by an optical input method such as scanning, and then converts the image information into usable computer text by character recognition. With current technology, the accuracy of text recognized by OCR rarely reaches 100%, and recognition-induced errors of various kinds often occur. The invention aims to correct some of these errors by NLP (natural language processing) error correction, so that the text recognized by OCR matches the original text more closely.
Traditional word correction technology is usually designed for conventional text, in which most errors are letter replacements, letter omissions or extra letters caused by handwriting input. Text recognized by OCR contains these common errors, but its error types differ from those of traditional input in the following respects:
text recognized by OCR often contains two words run together as one;
text recognized by OCR does not contain letter or word transposition errors;
text recognized by OCR does not contain similar-word errors.
Accordingly, there is a need for an English word correction system, method, device, application and readable storage medium based on OCR recognition technology that addresses the deficiencies of the prior art and solves or alleviates one or more of the above problems.
[ summary of the invention ]
In view of this, the invention provides an English word correction system, method, device, application and readable storage medium, which correct the words of a text one by one with higher accuracy.
In one aspect, the present invention provides an English word correction system, which performs correction on the result of OCR text recognition, the system comprising:
a dictionary, used for looking up a word to be corrected and judging whether it belongs to the dictionary;
a word candidate set processing module, used for performing edit distance processing on the word to be corrected and generating a first word candidate set according to the result of the edit distance processing;
a prior probability calculation module, used for calculating the prior probability of the words in the first word candidate set;
a noise model calculation module, used for calculating the likelihood probability of the words in the first word candidate set;
a correction module, used for combining the results of the noise model calculation module and the prior probability calculation module to obtain a second word, and replacing the word to be corrected with the second word as the corrected word;
wherein one end of the word candidate set processing module is connected with the dictionary, and the other end is connected with the correction module through the prior probability calculation module and the noise model calculation module.
The above-mentioned aspect and any possible implementation manner further provide an English word correction method using the English word correction system, the method including the steps of:
S1: constructing a dictionary from the existing corpus and the corpus obtained by crawling;
S2: inputting a word to be corrected and judging, by querying the dictionary constructed in S1, whether the word is in the dictionary; if so, proceeding to S3, otherwise proceeding to S4;
S3: outputting the word to be corrected unchanged;
S4: generating a first word candidate set from the dictionary according to the error types of the word to be corrected, with the edit distance limited;
S5: treating the words in the first word candidate set as having passed through a noisy channel, and obtaining a prior probability calculation method and a noise model for those words by Bayesian inference;
S6: applying the prior probability calculation method and the noisy channel model to each word in the first word candidate set to obtain its prior probability and likelihood probability;
S7: multiplying the prior probability and the likelihood probability of each word obtained in S6, and selecting the word with the maximum product as the second word;
S8: outputting the second word as the final correction result.
As for the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where S1 specifically is: determining the domain to which the dictionary belongs, obtaining a corpus of the domain through crawler technology, combining it with the corpus obtained from the training data of the domain, splitting the text on spaces, and finally counting the frequency of each word to complete the dictionary construction.
As to the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where S4 specifically includes:
S41: determining the four error cases that may occur in a word: a missing letter, an extra letter, a replaced letter, and two words not separated by a space;
S42: calculating the edit distance of the erroneous word by an edit distance algorithm;
S43: generating a candidate set for the erroneous word, using the error types and an edit distance of 1 as the criteria, and keeping the candidates that appear in the dictionary to generate the first word candidate set.
As for the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where S42 specifically is: quantitatively measuring the difference between two character strings to obtain the edit distance, the measure being the minimum number of operations required to change one character string into the other, where one operation replaces, deletes or adds one letter; the edit distance of two character strings is solved by dynamic programming.
The above-described aspect and any possible implementation manner further provide an implementation manner, in which the edit distance is calculated as follows:
the edit distance of two character strings a, b is denoted $\operatorname{lev}_{a,b}(|a|,|b|)$, where $|a|$ and $|b|$ are the lengths of a and b respectively, and $\operatorname{lev}_{a,b}$ is defined as follows:
$$
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min \begin{cases}
\operatorname{lev}_{a,b}(i-1,j) + 1 \\
\operatorname{lev}_{a,b}(i,j-1) + 1 \\
\operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$
$\operatorname{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b, and the final edit distance is the value $\operatorname{lev}_{a,b}(|a|,|b|)$ obtained when $i = |a|$ and $j = |b|$.
As to the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where S5 specifically includes:
S51: treating the words in the first word candidate set as having passed through a noisy channel;
S52: taking $P(w \mid x)$ as the probability that the correct word corresponding to the erroneous word x is w, $P(x \mid w)$ as the probability that the correct word w is corrupted into the erroneous word x, $P(w)$ as the probability that the correct word w appears, and $P(x)$ as the probability that the erroneous word x appears;
$$P(w \mid x) = \frac{P(x \mid w)\,P(w)}{P(x)}$$
V is the dictionary; each w is taken from V, and the w with the maximum probability is:
$$\hat{w} = \arg\max_{w \in V} P(w \mid x)$$
the w with the maximum probability is the optimal solution $\hat{w}$:
$$\hat{w} = \arg\max_{w \in V} \frac{P(x \mid w)\,P(w)}{P(x)}$$
S53: acquiring the noise model $P(x \mid w)$ and the prior probability $P(w)$ according to the calculation result in S52;
wherein the prior probability of a word is: $P(w_i) = P(w_i \mid w_{i-1}) \cdot P(w_{i+1} \mid w_i)$;
where $w_i$ occupies the position of the erroneous word, $w_{i-1}$ is the preceding word in the test text and $w_{i+1}$ is the following word.
The noise model $P(x \mid w)$ is obtained by constructing a confusion matrix and counting the probability of each error occurring in the erroneous-word data set.
The above aspect and any possible implementation manner further provide an application of the english word correction method, where the application specifically is: a section of text to be corrected is identified through OCR, and words are modified one by the English word correction method so as to improve the accuracy rate of the English words in the text.
The above-mentioned aspect and any possible implementation manner further provide an English word correction device, where the device includes a memory, a processor, and an English word correction processing program stored in the memory and executable on the processor, and the English word correction processing program, when executed by the processor, implements the steps of the English word correction method.
The above-mentioned aspects and any possible implementation manners further provide a computer-readable storage medium, on which an English word correction processing program is stored, and when executed by a processor, the processing program implements the steps of the English word correction method.
Compared with the prior art, the invention can obtain the following technical effects:
1. aiming at the English text recognized by the OCR, the corrected word has higher accuracy;
2. the invention corrects some errors in a NLP (natural language processing) error correction mode and improves the accuracy of comparing the text recognized by OCR with the original text.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of an english word correction system according to an embodiment of the present invention.
[ detailed description ]
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The invention provides an English word correction system, method, application, device and readable storage medium based on a machine learning model, wherein the system performs correction on the result of OCR text recognition and comprises:
a dictionary, used for looking up a word to be corrected and judging whether it belongs to the dictionary;
a word candidate set processing module, used for performing edit distance processing on the word to be corrected and generating a first word candidate set according to the result of the edit distance processing;
a prior probability calculation module, used for calculating the prior probability of the words in the first word candidate set;
a noise model calculation module, used for calculating the likelihood probability of the words in the first word candidate set;
a correction module, used for combining the results of the noise model calculation module and the prior probability calculation module to obtain a second word, and replacing the word to be corrected with the second word as the corrected word;
wherein one end of the word candidate set processing module is connected with the dictionary, and the other end is connected with the correction module through the prior probability calculation module and the noise model calculation module.
An English word correction method, comprising the steps of:
S1: constructing a dictionary from the existing corpus and the corpus obtained by crawling;
S2: inputting a word to be corrected and judging, by querying the dictionary constructed in S1, whether the word is in the dictionary; if so, proceeding to S3, otherwise proceeding to S4;
S3: outputting the word to be corrected unchanged;
S4: generating a first word candidate set from the dictionary according to the error types of the word to be corrected, with the edit distance limited;
S5: treating the words in the first word candidate set as having passed through a noisy channel, and obtaining a prior probability calculation method and a noise model for those words by Bayesian inference;
S6: applying the prior probability calculation method and the noisy channel model to each word in the first word candidate set to obtain its prior probability and likelihood probability;
S7: multiplying the prior probability and the likelihood probability of each word obtained in S6, and selecting the word with the maximum product as the second word;
S8: outputting the second word as the final correction result.
S1 specifically includes: determining the domain to which the dictionary belongs, obtaining a corpus of the domain through crawler technology, combining it with the corpus obtained from the training data of the domain, splitting the text on spaces, and finally counting the frequency of each word to complete the dictionary construction.
S4 specifically includes:
S41: determining the four error cases that may occur in a word: a missing letter, an extra letter, a replaced letter, and two words not separated by a space;
S42: calculating the edit distance of the erroneous word by an edit distance algorithm;
S43: generating a candidate set for the erroneous word, using the error types and an edit distance of 1 as the criteria, and keeping the candidates that appear in the dictionary to generate the first word candidate set.
S42 specifically includes: quantitatively measuring the difference between two character strings to obtain the edit distance, the measure being the minimum number of operations required to change one character string into the other, where one operation replaces, deletes or adds one letter; the edit distance of two character strings is solved by dynamic programming.
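The edit distance is calculated as follows:
the edit distance of two character strings a, b is denoted $\operatorname{lev}_{a,b}(|a|,|b|)$, where $|a|$ and $|b|$ are the lengths of a and b respectively, and $\operatorname{lev}_{a,b}$ is defined as follows:
$$
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min \begin{cases}
\operatorname{lev}_{a,b}(i-1,j) + 1 \\
\operatorname{lev}_{a,b}(i,j-1) + 1 \\
\operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$
$\operatorname{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b, and the final edit distance is the value $\operatorname{lev}_{a,b}(|a|,|b|)$ obtained when $i = |a|$ and $j = |b|$.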
S5 specifically includes:
S51: treating the words in the first word candidate set as having passed through a noisy channel;
S52: taking $P(w \mid x)$ as the probability that the correct word corresponding to the erroneous word x is w, $P(x \mid w)$ as the probability that the correct word w is corrupted into the erroneous word x, $P(w)$ as the probability that the correct word w appears, and $P(x)$ as the probability that the erroneous word x appears;
$$P(w \mid x) = \frac{P(x \mid w)\,P(w)}{P(x)}$$
V is the dictionary; each w is taken from V, and the w with the maximum probability is:
$$\hat{w} = \arg\max_{w \in V} P(w \mid x)$$
the w with the maximum probability is the optimal solution $\hat{w}$:
$$\hat{w} = \arg\max_{w \in V} \frac{P(x \mid w)\,P(w)}{P(x)}$$
S53: acquiring the noise model $P(x \mid w)$ and the prior probability $P(w)$ according to the calculation result in S52;
wherein the prior probability of a word is: $P(w_i) = P(w_i \mid w_{i-1}) \cdot P(w_{i+1} \mid w_i)$;
where $w_i$ occupies the position of the erroneous word, $w_{i-1}$ is the preceding word in the test text and $w_{i+1}$ is the following word.
The noise model $P(x \mid w)$ is obtained by constructing a confusion matrix and counting the probability of each error occurring in the erroneous-word data set.
An application of the English word correction method is specifically as follows: a piece of text to be corrected is recognized through OCR, and its words are corrected one by one by the English word correction method described above, so that the accuracy of the English words in the text is improved.
An English word correction apparatus, the apparatus comprising a memory, a processor, and an English word correction processing program stored on the memory and executable on the processor, the English word correction processing program, when executed by the processor, implementing the steps of the English word correction method as described.
A computer-readable storage medium, on which an English word correction processing program is stored, which, when executed by a processor, implements the steps of the English word correction method as described.
Example 1:
The English word correction method is as follows. A dictionary for a particular domain, such as the financial domain, is first constructed. To obtain a sufficiently large corpus, crawler technology can be used to collect text from the web. The existing corpus and the crawled corpus are combined, the text is split on spaces, the frequency of each word is counted, and the dictionary is constructed. For a piece of text to be corrected, the method then corrects the words one by one. Each word in a sentence is scanned; if the word is in the constructed dictionary, it is considered correct and scanning continues with the next word. If not, a candidate set is generated for the erroneous word: the set of all possible correct words obtained from the dictionary according to the four possible error types (a replaced letter, a missing letter, an extra letter, a missing space), limited to an edit distance of 1. The best candidate is then selected (calculated using the noisy channel model and the prior probability) and replaces the original erroneous word.
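The dictionary construction and lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `build_dictionary` and `is_known` and the toy corpus are hypothetical, and the text is split on any whitespace rather than on spaces only.

```python
from collections import Counter


def build_dictionary(texts):
    """Count word frequencies over an iterable of text strings.

    Each text is split on whitespace, as in the dictionary construction step.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def is_known(word, dictionary):
    """A word is treated as correct when it appears in the dictionary."""
    return word in dictionary


# Hypothetical toy corpus standing in for the existing and crawled corpora.
corpus = ["the bank raised the interest rate", "the interest rate fell"]
dictionary = build_dictionary(corpus)
```

Keeping the frequencies, rather than a plain set of words, lets the same structure feed the prior probability calculation later.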
As shown in fig. 1, the method of the present invention comprises the following steps:
1. Word correction method
For a piece of text to be corrected, each word is first scanned in turn, and a word that is not in the dictionary is considered misspelled. Second, the "most accurate" word is selected from a set of candidate correct words; this word is the result sought (the correct word corresponding to the erroneous word).
2. Generation of the candidate set
A word may be wrong in four ways: a letter is missing, an extra letter is present, a letter is replaced, or two words are not separated by a space.
A group of candidate words is found using an edit distance algorithm.
The edit distance is a quantitative measure of the difference between two strings: the minimum number of operations needed to change one string into the other. The edit distance adopted by the invention is the Levenshtein distance, in which one operation replaces, deletes or adds one letter. The edit distance of two strings can be computed by dynamic programming.
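The edit distance is calculated as follows:
the Levenshtein distance of two strings a, b is denoted $\operatorname{lev}_{a,b}(|a|,|b|)$, where $|a|$ and $|b|$ are the lengths of a and b respectively. $\operatorname{lev}_{a,b}$ may be expressed as follows:
$$
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min \begin{cases}
\operatorname{lev}_{a,b}(i-1,j) + 1 \\
\operatorname{lev}_{a,b}(i,j-1) + 1 \\
\operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$
$\operatorname{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b. The final edit distance is the value $\operatorname{lev}_{a,b}(|a|,|b|)$ obtained when $i = |a|$ and $j = |b|$.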
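The dynamic programming solution mentioned above can be sketched as a table in which entry (i, j) holds the distance between the first i characters of a and the first j characters of b. This is the standard Levenshtein algorithm; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Levenshtein distance between strings a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    # Turning a prefix into the empty string (or back) costs its length.
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a character of a
                dist[i][j - 1] + 1,         # add a character of b
                dist[i - 1][j - 1] + cost,  # replace, or keep when equal
            )
    return dist[-1][-1]
```

The table has (|a| + 1) × (|b| + 1) entries, so the cost is proportional to the product of the two string lengths.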
The invention first generates a candidate set for the erroneous word according to the error types and an edit distance of 1, and then keeps only those candidates that appear in the dictionary, which yields the candidate set of correct words.
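One way to realise this generate-then-filter step is sketched below. It is an illustrative assumption rather than the patent's code: the names `edits1` and `candidates` are hypothetical, the alphabet is restricted to lowercase ASCII letters, and the missing-space case is handled by trying every split of the word into two dictionary words.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at edit distance 1 from word (delete, replace, add)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + c + right[1:]
                for left, right in splits if right for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return (deletes | replaces | inserts) - {word}


def candidates(word, dictionary):
    """Dictionary words reachable by one edit, plus two-word splits."""
    single = {w for w in edits1(word) if w in dictionary}
    # Missing space: both halves of the split must be dictionary words.
    pairs = {word[:i] + " " + word[i:]
             for i in range(1, len(word))
             if word[:i] in dictionary and word[i:] in dictionary}
    return single | pairs
```

For example, with a dictionary containing "word", "work", "world" and "to", the erroneous word "wor" yields {"word", "work"}, and "wordto" yields {"word to"}.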
3. Bayesian inference
The original word passes through a noisy channel, which produces a noisy (erroneous) word. The invention obtains a guess of the original word by decoding the noisy word, which can be done by Bayesian inference.
$P(w \mid x)$ denotes the probability that the correct word corresponding to the erroneous word x is w, $P(x \mid w)$ denotes the probability that the correct word w is corrupted into the erroneous word x, $P(w)$ denotes the probability that the correct word w occurs, and $P(x)$ denotes the probability that the erroneous word x occurs.
According to Bayes' formula:
$$P(w \mid x) = \frac{P(x \mid w)\,P(w)}{P(x)}$$
V is the dictionary. The probability is calculated for each w taken from V, and the w with the maximum probability is the correct word corresponding to the erroneous word x, recorded as:
$$\hat{w} = \arg\max_{w \in V} P(w \mid x)$$
The w with the largest calculation result (the largest probability) is the optimal solution $\hat{w}$:
$$\hat{w} = \arg\max_{w \in V} \frac{P(x \mid w)\,P(w)}{P(x)}$$
The denominator $P(x)$ need not be calculated, because it is the same for every candidate w and can be treated as a constant. The formula can then be simplified as:
$$\hat{w} = \arg\max_{w \in V} P(x \mid w)\,P(w)$$
The factor $P(x \mid w)$ in this formula is referred to herein as the noise model; it is the likelihood function. $P(w)$ is referred to herein as the prior probability.
4. Method for solving the prior probability P(w)
A corpus (word list) is selected, and a series of candidate words is then generated from the erroneous word by the edit distance algorithm. The candidates are the words at an edit distance of 1 from the original word.
The invention employs a bigram language model. The prior probability of each candidate word depends on both the preceding and the following word. Let the erroneous word be $w_i$, the preceding word in the test text be $w_{i-1}$ and the following word be $w_{i+1}$. The prior probability of the word is:
$$P(w_i) = P(w_i \mid w_{i-1}) \cdot P(w_{i+1} \mid w_i)$$
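A minimal sketch of this context-dependent prior is given below. The conditional probabilities are estimated from bigram counts; the add-k smoothing (k = 1 by default) is an assumption introduced here so that unseen word pairs do not produce a zero probability, and the names `train_bigrams`, `cond_prob` and `context_prior` are hypothetical.

```python
from collections import Counter


def train_bigrams(sentences):
    """Collect unigram and bigram counts from whitespace-split sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def cond_prob(word, prev, unigrams, bigrams, k=1.0):
    """Estimate P(word | prev) with add-k smoothing (an assumption)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)


def context_prior(candidate, prev_word, next_word, unigrams, bigrams):
    """P(candidate | prev_word) * P(next_word | candidate)."""
    return (cond_prob(candidate, prev_word, unigrams, bigrams)
            * cond_prob(next_word, candidate, unigrams, bigrams))
```

With counts from a domain corpus, a candidate that commonly appears between the two neighbouring words receives a larger prior than one that does not.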
5. Method for solving the noisy channel model
$P(x \mid w)$ is the probability that the erroneous word x is produced, given a correct candidate word w.
An error may take one of several forms:
A correct word loses a letter and becomes an erroneous word. ('word' -> 'wor')
A correct word has one letter replaced and becomes an erroneous word. ('word' -> 'ward')
A correct word gains an extra letter and becomes an erroneous word. ('word' -> 'wordd')
Two correct words are run together, with no space between them, and become one erroneous word. ('word to' -> 'wordto')
In every case the correct form lies within an edit distance of 1 of the erroneous word, so it can be found from the erroneous word with an edit distance limit of 1.
For each kind of error, a confusion matrix is constructed, and the probability of each error occurring in the erroneous-word data set is counted.
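The counting step can be sketched as follows, assuming a data set of (correct, observed) pairs that each differ by a single edit. The representation of an error as a small tuple, the names `classify_edit` and `estimate_noise_model`, and the normalisation by the total number of counted errors are all assumptions made for illustration; the patent does not specify the layout of its confusion matrices.

```python
from collections import Counter


def classify_edit(correct, observed):
    """Describe the single edit turning `correct` into `observed`, else None."""
    if " " in correct and correct.replace(" ", "", 1) == observed:
        return ("join",)  # two words run together
    if len(correct) == len(observed):
        diffs = [(c, o) for c, o in zip(correct, observed) if c != o]
        if len(diffs) == 1:
            return ("sub", diffs[0][0], diffs[0][1])  # letter replaced
        return None
    if len(correct) == len(observed) + 1:  # a letter is missing
        for i in range(len(correct)):
            if correct[:i] + correct[i + 1:] == observed:
                return ("del", correct[i])
    if len(correct) + 1 == len(observed):  # an extra letter is present
        for i in range(len(observed)):
            if observed[:i] + observed[i + 1:] == correct:
                return ("ins", observed[i])
    return None


def estimate_noise_model(pairs):
    """Relative frequency of each error over (correct, observed) pairs."""
    counts = Counter()
    for correct, observed in pairs:
        key = classify_edit(correct, observed)
        if key is not None:
            counts[key] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}
```

To score a candidate w against an erroneous word x, one would look up `classify_edit(w, x)` in the estimated table; errors never seen in the data set would need a fallback value, which this sketch leaves out.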
Finally, the prior probability and the likelihood probability of each candidate are multiplied, and the candidate word with the maximum product is the final result.
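The final selection is an argmax over the product of the two probabilities. The sketch below assumes the prior and likelihood of each candidate have already been computed; the function name `best_candidate` and the numbers in `scores` are hypothetical values chosen only to show the mechanics.

```python
def best_candidate(scores):
    """Return the candidate whose prior * likelihood is largest.

    `scores` maps each candidate word to a (prior, likelihood) pair.
    """
    return max(scores, key=lambda w: scores[w][0] * scores[w][1])


# Hypothetical probabilities for candidates of the erroneous word "wor".
scores = {
    "word": (0.020, 0.30),
    "work": (0.050, 0.05),
    "worm": (0.001, 0.05),
}
best = best_candidate(scores)
```

Here "work" has the larger prior, but "word" wins because its likelihood under the noise model is much higher.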
The english word correction system, method, application, device and readable storage medium provided in the embodiments of the present application are described in detail above. The above description of the embodiments is only for the purpose of helping to understand the method of the present application and its core ideas; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
As used in the specification and claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to". "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical effect. The description which follows is a preferred embodiment of the present application, but is made for the purpose of illustrating the general principles of the application and not for the purpose of limiting the scope of the application. The protection scope of the present application shall be subject to the definitions of the appended claims.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a good or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a commodity or system that includes the element.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects, meaning that three relationships may exist, e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
The foregoing description shows and describes several preferred embodiments of the present application, but as aforementioned, it is to be understood that the application is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the application as described herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the application, which is to be protected by the claims appended hereto.

Claims (10)

1. An English word correction system, which corrects based on a result of OCR recognition of a text, comprising:
the dictionary is used for inquiring and judging whether the word to be corrected belongs to the dictionary or not;
a word candidate set processing module, used for performing edit distance processing on a word to be corrected and generating a first word candidate set according to an edit distance processing result;
the prior probability calculation module is used for carrying out prior probability calculation on the words in the first word candidate set;
the noise model calculation module is used for carrying out likelihood probability calculation on the words in the first word candidate set;
the correction module is used for calculating the results of the noise model calculation module and the prior probability calculation module to obtain a second word, and replacing the word to be corrected with the second word as a corrected word;
one end of the word candidate set processing module is connected with the dictionary, and the other end of the word candidate set processing module is connected with the correction module through the prior probability calculation module and the noise model calculation module.
2. An English word correction method using the English word correction system according to claim 1, characterized in that the method comprises the following steps:
S1: constructing a dictionary from an existing corpus and a corpus obtained by crawling;
S2: inputting a word to be corrected and querying the dictionary constructed in S1 to determine whether the word is in the dictionary; if so, proceeding to S3, otherwise proceeding to S4;
S3: outputting the word to be corrected;
S4: generating a first word candidate set from the dictionary according to the error type of the word to be corrected and a limit on the edit distance;
S5: passing the words in the first word candidate set through a noise channel, and obtaining a prior probability calculation method and a noise model for the words in the first word candidate set by Bayesian inference;
S6: applying the prior probability calculation method and the noise channel model to the words in the first word candidate set to obtain the prior probability and the likelihood probability of each word;
S7: multiplying the prior probability and the likelihood probability of each word obtained in S6, and selecting the word with the maximum resulting probability as a second word;
S8: outputting the second word as the final correction result.
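The control flow of steps S2 to S8 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `correct_word` and the three callables passed to it are hypothetical stand-ins for the candidate set processing, prior probability, and noise model modules, and the handling of an empty candidate set is an assumption, since the claim does not address that case.

```python
from typing import Callable, Collection, Iterable


def correct_word(
    word: str,
    dictionary: Collection[str],
    generate_candidates: Callable[[str], Iterable[str]],
    prior: Callable[[str], float],
    likelihood: Callable[[str, str], float],
) -> str:
    """Apply S2-S8 to one word; all callables are hypothetical stand-ins."""
    # S2/S3: a word found in the dictionary is output unchanged.
    if word in dictionary:
        return word
    # S4: keep only candidates that are dictionary words.
    candidates = [c for c in generate_candidates(word) if c in dictionary]
    if not candidates:
        # Assumption: the claim does not cover an empty candidate set,
        # so the input is returned as-is.
        return word
    # S5-S7: score each candidate w by P(w) * P(x | w) and take the maximum.
    # S8: the winning candidate is the final correction result.
    return max(candidates, key=lambda w: prior(w) * likelihood(word, w))
```

Passing the modules in as callables keeps the sketch independent of how the prior and the noise model are actually estimated.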
3. The English word correction method according to claim 2, wherein S1 specifically comprises: determining the domain to which the dictionary belongs, obtaining a corpus of that domain through crawler technology, combining it with the corpus obtained from the training data of the domain, splitting the text on spaces, and finally counting the frequency of each word to complete the dictionary construction.
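The dictionary construction of claim 3 amounts to whitespace tokenization followed by frequency counting. A minimal sketch is given below; the function name `build_dictionary` is hypothetical, and lowercasing the text is an assumption that the claim does not state. Crawling itself is outside the scope of the sketch, so the corpora are taken as already-loaded strings.

```python
from collections import Counter
from typing import Iterable


def build_dictionary(corpora: Iterable[str]) -> Counter:
    """Count word frequencies over crawled and existing corpus texts."""
    counts: Counter = Counter()
    for text in corpora:
        # Split on whitespace, as in the claim; lowercasing is an assumption.
        counts.update(text.lower().split())
    return counts
```

Membership tests such as `"word" in counts` then serve as the dictionary lookup, and the stored counts can be reused when estimating word probabilities.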
4. The English word correction method according to claim 2, wherein S4 specifically comprises:
S41: determining four error cases that may occur in a word: missing letters, extra letters, substituted letters, and words not separated by spaces;
S42: calculating the edit distance of the word exhibiting the error case by an edit distance algorithm;
S43: taking the error type and an edit distance of 1 as the criteria, generating a candidate set for the erroneous word, and keeping only those entries of the candidate set that appear in the dictionary to generate the first word candidate set.
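One common way to realize S41 to S43 is to enumerate every string at edit distance 1 from the erroneous word and then filter against the dictionary. The sketch below follows that approach under stated assumptions: the names `edits1`, `split_candidates`, and `first_candidate_set` are hypothetical, the alphabet is limited to lowercase ASCII letters, and the missing-space case is handled separately by splitting the input into two dictionary words.

```python
import string
from typing import Collection, Set, Tuple


def edits1(word: str, alphabet: str = string.ascii_lowercase) -> Set[str]:
    """All strings one insertion, deletion, or replacement away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Inserting a letter repairs the "missing letter" case.
    inserts = {l + c + r for l, r in splits for c in alphabet}
    # Deleting a letter repairs the "extra letter" case.
    deletes = {l + r[1:] for l, r in splits if r}
    # Replacing a letter repairs the "substituted letter" case.
    replaces = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    return (inserts | deletes | replaces) - {word}


def split_candidates(word: str, dictionary: Collection[str]) -> Set[Tuple[str, str]]:
    """Repairs the "not separated by spaces" case: two dictionary words."""
    return {
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in dictionary and word[i:] in dictionary
    }


def first_candidate_set(word: str, dictionary: Collection[str]) -> Set[str]:
    """S43: keep only edit-distance-1 strings that are dictionary words."""
    return {c for c in edits1(word, string.ascii_lowercase) if c in dictionary}
```

Enumerating neighbours and filtering avoids computing the edit distance against every dictionary entry, which matters when the dictionary is large.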
5. The English word correction method according to claim 4, wherein S42 specifically comprises: quantitatively measuring the degree of difference between two character strings to obtain the edit distance, the measure being the minimum number of operations required to change one character string into the other, where a single operation is the replacement, deletion, or addition of one letter; the edit distance of two or more character strings is solved by dynamic programming.
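The dynamic programming solution mentioned in claim 5 can be sketched as below. This is the standard Levenshtein algorithm with unit costs for replacement, deletion, and addition; the function name `edit_distance` and the two-row memory layout are illustrative choices, not details taken from the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-letter replacements, deletions, and additions."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the first i characters of a to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca from a
                    curr[j - 1] + 1,     # add cb to a
                    prev[j - 1] + cost,  # replace ca with cb (free if equal)
                )
            )
        prev = curr
    return prev[-1]
```

Keeping only two rows reduces memory from O(|a||b|) to O(|b|) while producing the same distance.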
6. The English word correction method according to claim 5, wherein the edit distance is calculated as follows:
the edit distance between two character strings a and b is denoted $\mathrm{lev}_{a,b}(|a|,|b|)$, where $|a|$ and $|b|$ are the lengths of a and b respectively, and $\mathrm{lev}_{a,b}(|a|,|b|)$ is expressed as follows:
Figure FDA0002494393060000031
$\mathrm{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b; the final edit distance is the value at $i=|a|$, $j=|b|$, namely $\mathrm{lev}_{a,b}(|a|,|b|)$.
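The figure referenced in claim 6 is not reproduced in this text. For orientation, the standard Levenshtein recurrence written in the same notation is given below; that the figure shows exactly this form is an assumption, not something the text confirms. Here $1_{(a_i \neq b_j)}$ equals 1 when the i-th character of a differs from the j-th character of b and 0 otherwise.

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0, \\[4pt]
  \min
  \begin{cases}
    \operatorname{lev}_{a,b}(i-1,j) + 1 \\
    \operatorname{lev}_{a,b}(i,j-1) + 1 \\
    \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
```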
7. The English word correction method according to claim 6, wherein S5 specifically comprises:
S51: passing the words in the first word candidate set through a noise channel;
S52: $P(w \mid x)$ denotes the probability that the correct word corresponding to the erroneous word x is w, $P(x \mid w)$ denotes the probability that the word w is mistakenly written as the erroneous word x, $P(w)$ denotes the probability that the correct word w appears, and $P(x)$ denotes the probability that the erroneous word x appears;
Figure FDA0002494393060000032
V is the dictionary; a word w is selected from V, and the w with the maximum probability is calculated as follows:
Figure FDA0002494393060000033
the w with the maximum probability is the optimal solution
Figure FDA0002494393060000041
Figure FDA0002494393060000042
S53: obtaining the noise model $P(x \mid w)$ and the prior probability $P(w)$ according to the calculation result in S52;
wherein the prior probability of a word is $P(w_i) = P(w_i \mid w_{i-1}) \cdot P(w_{i+1} \mid w_i)$,
where $w_i$ is the erroneous word, the word preceding it in the text under test is $w_{i-1}$, and the word following it is $w_{i+1}$;
the noise model $P(x \mid w)$ is obtained by constructing a confusion matrix and counting the probability of each error type occurring in an erroneous-word data set.
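The scoring in claim 7 follows the usual noisy channel reasoning: by Bayes' rule P(w | x) = P(x | w) P(w) / P(x), and since P(x) is the same for every candidate, the best w is the one maximizing P(x | w) P(w). A minimal sketch under stated assumptions follows. All function names are hypothetical, add-one smoothing is an assumption used to avoid zero probabilities, and the confusion matrix is simplified to a counter keyed by an edit label such as `("ins", "e")`.

```python
from collections import Counter
from typing import Dict, Hashable


def context_prior(w: str, prev_w: str, next_w: str,
                  unigrams: Counter, bigrams: Counter) -> float:
    """P(w | prev_w) * P(next_w | w), with add-one smoothing (an assumption)."""
    v = len(unigrams)
    left = (bigrams[(prev_w, w)] + 1) / (unigrams[prev_w] + v)
    right = (bigrams[(w, next_w)] + 1) / (unigrams[w] + v)
    return left * right


def channel_prob(edit: Hashable, confusion: Counter, n_edit_types: int) -> float:
    """P(x | w) estimated from how often this edit occurs in an error data set."""
    total = sum(confusion.values())
    return (confusion[edit] + 1) / (total + n_edit_types)


def best_candidate(candidates: Dict[str, Hashable], prev_w: str, next_w: str,
                   unigrams: Counter, bigrams: Counter,
                   confusion: Counter, n_edit_types: int) -> str:
    """candidates maps each candidate w to the edit that turns w into x."""
    return max(
        candidates,
        key=lambda w: context_prior(w, prev_w, next_w, unigrams, bigrams)
        * channel_prob(candidates[w], confusion, n_edit_types),
    )
```

In practice the products are often replaced by sums of log probabilities to avoid numerical underflow; the plain product is kept here to mirror the multiplication described in the claims.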
8. An application of the English word correction method, characterized in that the application specifically comprises: recognizing a piece of text to be corrected through OCR, and correcting its words one by one using the English word correction method according to any one of claims 2-7, thereby improving the accuracy of the English words in the text.
9. An English word correction device, comprising a memory, a processor, and an English word correction processing program stored in the memory and executable on the processor, wherein the processing program, when executed by the processor, implements the steps of the English word correction method according to any one of claims 2 to 7.
10. A computer-readable storage medium, wherein an English word correction processing program is stored on the computer-readable storage medium, and when the processing program is executed by a processor, the steps of the English word correction method according to any one of claims 2 to 7 are implemented.
CN202010414063.7A 2020-05-15 2020-05-15 English word correction system, method, application, device and readable storage medium Pending CN111639488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010414063.7A CN111639488A (en) 2020-05-15 2020-05-15 English word correction system, method, application, device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010414063.7A CN111639488A (en) 2020-05-15 2020-05-15 English word correction system, method, application, device and readable storage medium

Publications (1)

Publication Number Publication Date
CN111639488A true CN111639488A (en) 2020-09-08

Family

ID=72330829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010414063.7A Pending CN111639488A (en) 2020-05-15 2020-05-15 English word correction system, method, application, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111639488A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867040A (en) * 2012-08-31 2013-01-09 中国科学院计算技术研究所 Chinese search engine mixed speech-oriented query error correction method and system
CN108491392A (en) * 2018-03-29 2018-09-04 广州视源电子科技股份有限公司 Modification method, system, computer equipment and the storage medium of word misspelling
CN110348020A (en) * 2019-07-17 2019-10-18 杭州嘉云数据科技有限公司 English word spelling error correction method, device, equipment and readable storage medium
CN110532572A (en) * 2019-09-12 2019-12-03 四川长虹电器股份有限公司 Spell checking methods based on the tree-like naive Bayesian of TAN

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Dong (王冬): "Design and Implementation of an English Grammar Checking System Based on the Bayesian Method and Edit Distance" *

Similar Documents

Publication Publication Date Title
Kissos et al. OCR error correction using character correction and feature-based word classification
US9069753B2 (en) Determining proximity measurements indicating respective intended inputs
CN111639489A (en) Chinese text error correction system, method, device and computer readable storage medium
US7047493B1 (en) Spell checker with arbitrary length string-to-string transformations to improve noisy channel spelling correction
US10963717B1 (en) Auto-correction of pattern defined strings
CN113435186B (en) Chinese text error correction system, method, device and computer readable storage medium
CN111859921A (en) Text error correction method and device, computer equipment and storage medium
Li et al. Spelling error correction using a nested RNN model and pseudo training data
JPH0778165A (en) Method and computer system for detection of error string in text
CN111651978A (en) Entity-based lexical examination method and device, computer equipment and storage medium
Pal et al. OCR error correction of an inflectional indian language using morphological parsing
US8208685B2 (en) Word recognition method and word recognition program
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
US20110229036A1 (en) Method and apparatus for text and error profiling of historical documents
JP2000089786A (en) Method for correcting speech recognition result and apparatus therefor
CN114677689B (en) Text image recognition error correction method and electronic equipment
CN111639488A (en) English word correction system, method, application, device and readable storage medium
CN116306594A (en) Medical OCR recognition error correction method
Lund Ensemble Methods for Historical Machine-Printed Document Recognition
Mohapatra et al. Spell checker for OCR
JP3975825B2 (en) Character recognition error correction method, apparatus and program
Islam et al. A context-sensitive approach to find optimum language model for automatic Bangla spelling correction
Al-Zaydi et al. Multiple Outputs Techniques Evaluation for Arabic Character Recognition
JP3274014B2 (en) Character recognition device and character recognition method
JP6303508B2 (en) Document analysis apparatus, document analysis system, document analysis method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination