CN111639488A

CN111639488A - English word correction system, method, application, device and readable storage medium

Info

Publication number: CN111639488A
Application number: CN202010414063.7A
Authority: CN
Inventors: 李振; 张刚; 鲍东岳; 尹正; 徐超; 刘昊霖; 周圣文; 孙梅; 吕亚波
Original assignee: Minsheng Science And Technology Co ltd
Current assignee: Minsheng Science And Technology Co ltd
Priority date: 2020-05-15
Filing date: 2020-05-15
Publication date: 2020-09-08

Abstract

The invention provides an English word correction system, method, device, application and readable storage medium. A dictionary for a given domain is constructed first, and the words of a sentence are then corrected one by one: each word is scanned in turn, and if the word is in the constructed dictionary it is considered correct and scanning continues with the next word. If not, a candidate set is generated for the wrong word, namely the set of all possible correct words obtained from the dictionary according to the four possible error types (replaced letter, missing letter, extra letter, missing space) with the edit distance limited to 1. The best candidate is then selected (computed from a noise channel model and the prior probability) and replaces the original wrong word.

Description

English word correction system, method, application, device and readable storage medium

[ technical field ]

The invention relates to the technical field of computer character processing, in particular to an English word correction system, method, device, application and readable storage medium based on an OCR recognition technology.

[ background of the invention ]

OCR is the abbreviation of Optical Character Recognition, a computer input technology that converts the characters of bills, newspapers, books, manuscripts and other printed matter into image information by an optical input method such as scanning, and then converts the image information into usable computer text by character recognition. For current technical reasons, the accuracy of text recognized by OCR rarely reaches 100%, and various recognition-induced errors often occur. The invention aims to correct some of these errors by means of NLP (natural language processing) error correction and thereby improve the accuracy of the OCR-recognized text when compared with the original text.

Traditional word correction technology is usually based on conventional text, where most errors are letter replacements, omitted letters or extra letters caused by handwriting input. In addition to these common errors, the errors in OCR-recognized text differ from those of traditional input in the following respects:

OCR recognized text often suffers from the error of two words being recognized together;

the text recognized by the OCR does not have letter or word reversal errors;

the text recognized by the OCR does not have similar word errors;

Accordingly, there is a need to develop an English word correction system, method, apparatus, application and readable storage medium based on OCR recognition technology to address the deficiencies of the prior art and to solve or alleviate one or more of the above problems.

[ summary of the invention ]

In view of this, the invention provides an English word correction system, method, device, application and readable storage medium, which correct the words one by one with higher accuracy.

In one aspect, the present invention provides an English word correction system, which performs correction based on the result of OCR text recognition, the system comprising:

a dictionary, used for looking up a word to be corrected and judging whether it belongs to the dictionary;

a word candidate set processing module, used for performing edit distance processing on the word to be corrected and generating a first word candidate set according to the edit distance processing result;

a prior probability calculation module, used for calculating the prior probability of the words in the first word candidate set;

a noise model calculation module, used for calculating the likelihood probability of the words in the first word candidate set;

a correction module, used for combining the results of the noise model calculation module and the prior probability calculation module to obtain a second word, and replacing the word to be corrected with the second word as the corrected word;

wherein one end of the word candidate set processing module is connected with the dictionary, and the other end is connected with the correction module through the prior probability calculation module and the noise model calculation module.

The above-mentioned aspect and any possible implementation manner further provide an English word correction method using the English word correction system, the method including the steps of:

S1: constructing a dictionary from an existing corpus and a corpus obtained by crawling;

S2: inputting a word to be corrected and judging whether the word is in the dictionary constructed in S1; if so, proceeding to S3, otherwise proceeding to S4;

S3: outputting the word to be corrected unchanged;

S4: generating a first word candidate set from the dictionary according to the error types of the word to be corrected, with the edit distance limited;

S5: passing the words in the first word candidate set through a noise channel, and obtaining a prior probability calculation method and a noise model for the words in the first word candidate set by Bayesian inference;

S6: calculating the prior probability and the likelihood probability of each word in the first word candidate set by the prior probability calculation method and the noise channel model respectively;

S7: multiplying the prior probability and the likelihood probability of each word obtained in S6, and selecting the word with the maximum product as a second word;

S8: outputting the second word as the final correction result.
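The control flow of steps S2 to S8 can be sketched as a single function. This is a minimal illustration under stated assumptions, not the patented implementation: `correct_word` and its callable parameters `candidates_fn`, `prior_fn` and `likelihood_fn` are hypothetical names standing in for the word candidate set processing module, the prior probability calculation module and the noise model calculation module. Returning the input unchanged when no candidate exists is also an assumption.

```python
from typing import Callable, Iterable, Set


def correct_word(
    word: str,
    dictionary: Set[str],
    candidates_fn: Callable[[str], Iterable[str]],
    prior_fn: Callable[[str], float],
    likelihood_fn: Callable[[str, str], float],
) -> str:
    """Return the corrected form of `word` following steps S2 to S8."""
    # S2/S3: a word already in the dictionary is output unchanged.
    if word in dictionary:
        return word
    # S4: first word candidate set (edit distance limited).
    candidates = list(candidates_fn(word))
    if not candidates:
        # Assumption: with no candidate, the word is left as it is.
        return word
    # S6/S7: multiply prior P(w) and likelihood P(x|w), keep the maximum.
    return max(candidates, key=lambda w: prior_fn(w) * likelihood_fn(word, w))
```

Passing the three stages as callables keeps the sketch independent of how the candidate set, the prior and the noise model are actually computed.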

As for the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where the S1 specifically is: determining the domain to which the dictionary belongs, obtaining a corpus of the domain through crawler technology, combining it with the corpus obtained from the training data of the domain, splitting the text on spaces, and finally counting the frequency of each word to complete dictionary construction.
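The dictionary construction of S1 amounts to a word frequency count over the combined corpus. The sketch below assumes the corpus is already available as an iterable of strings; the function name `build_dictionary`, the lower-casing and the punctuation stripping are illustrative assumptions that the source does not specify.

```python
from collections import Counter
from typing import Iterable


def build_dictionary(texts: Iterable[str]) -> Counter:
    """Count word frequencies over an iterable of corpus texts."""
    freq: Counter = Counter()
    for text in texts:
        # S1: split the text on whitespace and count each word.
        for token in text.split():
            # Assumption: normalise case and strip surrounding punctuation.
            word = token.strip(".,;:!?\"'()").lower()
            if word:
                freq[word] += 1
    return freq
```

The resulting `Counter` serves both as the membership test of S2 (`word in freq`) and as a source of counts for the prior probability.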

As to the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where the S4 specifically includes:

S41: determining the four error cases that may occur in a word: a missing letter, an extra letter, a replaced letter, and two words not separated by a space;

S42: calculating the edit distance of the erroneous word by an edit distance algorithm;

S43: generating a candidate set for the erroneous word using the error types and an edit distance of 1 as criteria, and keeping only the candidates that appear in the dictionary to form the first word candidate set.
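Steps S41 to S43 can be sketched by enumerating every string one edit away from the erroneous word and keeping those found in the dictionary. This is an illustrative sketch, assuming a lowercase ASCII alphabet and a dictionary held as a set; `candidates_edit1` is a hypothetical name, and representing a split candidate as two words joined by a space is an assumption.

```python
import string
from typing import Set


def candidates_edit1(word: str, dictionary: Set[str]) -> Set[str]:
    """Dictionary words reachable from `word` by one edit of the four types."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # The wrong word has an extra letter: delete one letter.
    deletes = {l + r[1:] for l, r in splits if r}
    # The wrong word has a replaced letter: substitute one letter.
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    # The wrong word is missing a letter: insert one letter.
    inserts = {l + c + r for l, r in splits for c in letters}
    single = {w for w in deletes | replaces | inserts if w in dictionary}
    # Two words run together: insert one space, both halves must be words.
    joined = {f"{l} {r}" for l, r in splits
              if l in dictionary and r in dictionary}
    return (single | joined) - {word}
```

Generating the edits from the erroneous word and filtering against the dictionary avoids computing an edit distance to every dictionary entry.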

As for the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where the S42 specifically is: quantitatively measuring the degree of difference between two character strings to obtain the edit distance, the measure being the minimum number of operations required to change one character string into the other, where one operation replaces, deletes or adds one letter; the edit distance of two character strings is solved by dynamic programming.

The above-described aspect and any possible implementation manner further provide an implementation manner, and a specific calculation method of the edit distance is as follows:

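The edit distance of two character strings a, b is denoted $\mathrm{lev}_{a,b}(|a|,|b|)$, where $|a|$ and $|b|$ are the lengths of a and b respectively, and $\mathrm{lev}_{a,b}(i,j)$ is expressed as follows:

$$\mathrm{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\mathrm{lev}_{a,b}(i-1,j)+1\\ \mathrm{lev}_{a,b}(i,j-1)+1\\ \mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$

where $\mathrm{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b, and $1_{(a_i\neq b_j)}$ equals 1 when $a_i\neq b_j$ and 0 otherwise. The final edit distance is the value at $i=|a|$, $j=|b|$, namely $\mathrm{lev}_{a,b}(|a|,|b|)$.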
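The dynamic programming solution of the edit distance can be sketched as follows. The function name `levenshtein` is illustrative; the code keeps only two rows of the table, which is an implementation choice rather than something the source prescribes.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b with unit-cost replace, delete and add."""
    # prev[j] holds lev(i-1, j); the first row is lev(0, j) = j.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a letter of a
                curr[j - 1] + 1,     # add a letter to a
                prev[j - 1] + cost,  # replace a letter, or keep it if equal
            ))
        prev = curr
    return prev[-1]
```

For example, `levenshtein("word", "wor")` is 1, so "word" qualifies as a candidate for the erroneous word "wor" under the edit distance limit of 1.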

As to the above-mentioned aspect and any possible implementation manner, there is further provided an implementation manner, where the S5 specifically includes:

S51: passing the words in the first word candidate set through a noise channel;

S52: taking P(w | x) as the probability that the correct word corresponding to the erroneous word x is w, P(x | w) as the probability that the correct word w is erroneously changed into the word x, P(w) as the probability that the correct word w appears, and P(x) as the probability that the erroneous word x appears;

V is the dictionary; a word w is selected from V, and the w with the maximum probability is calculated as:

$$\hat{w}=\arg\max_{w\in V}P(w\mid x)=\arg\max_{w\in V}\frac{P(x\mid w)\,P(w)}{P(x)}$$

the w with the maximum probability is the optimal solution;

S53: obtaining the noise model P(x | w) and the prior probability P(w) from the calculation in S52;

wherein the prior probability of a word is: $P(w_i)=P(w_i\mid w_{i-1})\cdot P(w_{i+1}\mid w_i)$;

where $w_i$ is the word at the position of the erroneous word, $w_{i-1}$ is the word preceding it in the test text, and $w_{i+1}$ is the word following it;

the noise model P(x | w) is obtained by constructing a confusion matrix and counting the probability of each error occurring in an erroneous-word data set.

The above aspect and any possible implementation manner further provide an application of the English word correction method, where the application specifically is: a section of text to be corrected is recognized by OCR, and its words are corrected one by one by the English word correction method so as to improve the accuracy of the English words in the text.

The above-mentioned aspect and any possible implementation manner further provide an English word correction device, where the device includes a memory, a processor, and an English word correction processing program stored in the memory and executable on the processor, and the English word correction processing program, when executed by the processor, implements the steps of the English word correction method.

The above-mentioned aspects and any possible implementation manners further provide a computer-readable storage medium, on which an English word correction processing program is stored, and when executed by a processor, the processing program implements the steps of the English word correction method.

Compared with the prior art, the invention can obtain the following technical effects:

1. aiming at the English text recognized by the OCR, the corrected word has higher accuracy;

2. the invention corrects some errors in a NLP (natural language processing) error correction mode and improves the accuracy of comparing the text recognized by OCR with the original text.

[ description of the drawings ]

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a flowchart of an English word correction system according to an embodiment of the present invention.

[ detailed description of the embodiments ]

For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.

It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

The invention provides an English word correction system, method, application, device and readable storage medium based on a machine learning model, wherein the system performs correction based on the result of OCR text recognition.


Example 1:

The English word correction method is as follows. A dictionary for a given domain, such as the financial domain, is constructed first. To obtain a sufficient corpus, crawler technology can be used to collect text from the web. The existing corpus and the crawled corpus are combined, the text is split on spaces, the frequency of each word is counted, and the dictionary is built. For a section of text to be corrected, the method corrects the words one by one. Each word in a sentence is scanned; if the word is in the constructed dictionary it is considered correct and scanning continues with the next word. If not, a candidate set is generated for the wrong word, namely the set of all possible correct words obtained from the dictionary according to the four possible error types (replaced letter, missing letter, extra letter, missing space) with the edit distance limited to 1. The best candidate is then selected (computed from the noise channel and the prior probability) and replaces the original wrong word.

As shown in fig. 1, the method of the present invention comprises the following steps:

1. Word correction method

For a piece of text to be corrected, each word is first scanned in turn, and a word that is not in the dictionary is considered to be misspelled. Then the "most accurate" word is selected from a set of candidate correct words; this word is the desired result, i.e. the correct word corresponding to the wrong word.

2. Generation of the candidate set

A wrong word may arise in four ways: a letter is missing, an extra letter is present, a letter is replaced, or two words are not separated by a space.

A group of candidate words is found using an edit distance algorithm.

The edit distance is a quantitative measure of the difference between two strings: the minimum number of operations needed to change one string into the other. The edit distance adopted by the invention is the Levenshtein distance, in which one operation replaces, deletes or adds a letter. The edit distance of two character strings can be solved by dynamic programming.

The specific calculation method of the edit distance is as follows:

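The Levenshtein distance of two strings a, b is denoted $\mathrm{lev}_{a,b}(|a|,|b|)$, where $|a|$ and $|b|$ are the lengths of a and b respectively. $\mathrm{lev}_{a,b}(i,j)$ may be expressed as follows:

$$\mathrm{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\mathrm{lev}_{a,b}(i-1,j)+1\\ \mathrm{lev}_{a,b}(i,j-1)+1\\ \mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$

$\mathrm{lev}_{a,b}(i,j)$ is the distance between the first i characters of a and the first j characters of b, and $1_{(a_i\neq b_j)}$ equals 1 when $a_i\neq b_j$ and 0 otherwise. The final edit distance is the value at $i=|a|$, $j=|b|$, namely $\mathrm{lev}_{a,b}(|a|,|b|)$.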

The invention first generates a candidate set for the wrong word according to the error types and an edit distance of 1, and then keeps only the candidates that appear in the dictionary, producing the candidate set of correct words.

3. Bayesian inference

The original word passes through a noise channel and comes out as a noisy word. The invention obtains the guessed word by decoding the noisy word, which can be done by Bayesian inference.

P(w | x) represents the probability that the correct word corresponding to the wrong word x is w, P(x | w) represents the probability that the correct word w is erroneously changed into the word x, P(w) represents the probability that the correct word w appears, and P(x) represents the probability that the wrong word x appears.

According to Bayes' formula:

$$P(w\mid x)=\frac{P(x\mid w)\,P(w)}{P(x)}$$

V is the dictionary. A word w is selected from V, and the w with the maximum probability is the correct word corresponding to the wrong word x, recorded as:

$$\hat{w}=\arg\max_{w\in V}P(w\mid x)=\arg\max_{w\in V}\frac{P(x\mid w)\,P(w)}{P(x)}$$

That is, the w with the largest calculation result (the largest probability) is found, which is the optimal solution.

The denominator P(x) need not be calculated, because it is the same constant for every candidate w. The formula can then be simplified as:

$$\hat{w}=\arg\max_{w\in V}P(x\mid w)\,P(w)$$

The part P(x | w) of the formula is referred to herein as the noise model, which is the likelihood function, and P(w) is referred to herein as the prior probability.
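A small numeric sketch shows why dropping the denominator does not change the selected word. The probabilities in the usage example are made-up illustrative values, the function names are hypothetical, and approximating P(x) by summing over the candidate set only is an assumption made for the demonstration.

```python
from typing import Dict, Tuple

Scores = Dict[str, Tuple[float, float]]  # candidate w -> (P(x|w), P(w))


def best_candidate(scores: Scores) -> str:
    """Candidate maximising the unnormalised product P(x|w) * P(w)."""
    return max(scores, key=lambda w: scores[w][0] * scores[w][1])


def posterior(scores: Scores) -> Dict[str, float]:
    """P(w|x), with P(x) approximated by the sum over the candidates."""
    p_x = sum(lik * pri for lik, pri in scores.values())
    return {w: lik * pri / p_x for w, (lik, pri) in scores.items()}


# Illustrative values only, not taken from the source.
example = {"word": (0.02, 0.010), "ward": (0.05, 0.001), "work": (0.01, 0.004)}
post = posterior(example)
# Dividing every product by the same P(x) leaves the ranking unchanged.
same_winner = best_candidate(example) == max(post, key=post.get)
```

Here `same_winner` is true: normalisation rescales every score by the same factor, so only the numerator needs to be evaluated.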

4. Method for solving the prior probability P(w)

A corpus (word list) is selected, and a series of candidate words is then generated from the wrong word by the edit distance algorithm. The candidates are the words at an edit distance of 1 from the original word.

The invention employs a bigram language model. The prior probability of each candidate word depends on both its preceding and its following word. Assume the word at the position of the wrong word is $w_i$, the previous word in the test text is $w_{i-1}$, and the following word is $w_{i+1}$. The prior probability of the word is:

$$P(w_i)=P(w_i\mid w_{i-1})\cdot P(w_{i+1}\mid w_i)$$
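The prior above can be estimated from bigram counts over a tokenised corpus. The sketch below is illustrative: the function names are hypothetical, and add-one smoothing is an assumption introduced here so that unseen word pairs do not give a zero probability; the source does not state how the conditional probabilities are estimated.

```python
from collections import Counter
from typing import List, Tuple


def train_bigrams(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count single words and adjacent word pairs in tokenised sentences."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams


def cond_prob(word: str, prev: str, unigrams: Counter, bigrams: Counter) -> float:
    """P(word | prev) with add-one smoothing (an assumption, not from the source)."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)


def prior(cand: str, prev_word: str, next_word: str,
          unigrams: Counter, bigrams: Counter) -> float:
    """P(w_i) = P(w_i | w_{i-1}) * P(w_{i+1} | w_i) for candidate w_i."""
    return (cond_prob(cand, prev_word, unigrams, bigrams)
            * cond_prob(next_word, cand, unigrams, bigrams))
```

With this prior, a candidate that fits both its left and right neighbours is preferred over one that is merely frequent on its own.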

5. Method for solving the noise channel model

P(x | w) is the probability that the wrong word x is produced, given a correct candidate word w.

The error may take several forms:

The correct word loses a letter and becomes a wrong word. ('word' -> 'wor')

One letter of the correct word is replaced, giving a wrong word. ('word' -> 'ward')

The correct word gains an extra letter and becomes a wrong word. ('word' -> 'wordd')

Two correct words are joined together without a space between them and become a wrong word. ('word to' -> 'wordto')

All of these errors can be recovered from the wrong word within an edit distance of 1.

For each error, a confusion matrix is constructed, and the probability of each error occurring in the erroneous-word dataset is counted.
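The counting step can be sketched as follows, assuming the erroneous-word data set is a list of (correct, wrong) pairs that differ by exactly one edit. The function names and the error type labels are hypothetical. The result is the relative frequency of each edit among all observed errors, a simplification: a fuller noise model would also normalise by how often the intended letter occurs in the corpus.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

Edit = Tuple[str, str, str]  # (error type, intended text, observed text)


def classify_edit(correct: str, wrong: str) -> Optional[Edit]:
    """Describe the single edit turning `correct` into `wrong`, else None."""
    if " " in correct and correct.replace(" ", "", 1) == wrong:
        return ("no_space", " ", "")
    i = 0  # length of the common prefix
    while i < min(len(correct), len(wrong)) and correct[i] == wrong[i]:
        i += 1
    if len(wrong) == len(correct) - 1 and correct[i + 1:] == wrong[i:]:
        return ("missing_letter", correct[i], "")
    if len(wrong) == len(correct) + 1 and correct[i:] == wrong[i + 1:]:
        return ("extra_letter", "", wrong[i])
    if (len(wrong) == len(correct) and i < len(correct)
            and correct[i + 1:] == wrong[i + 1:]):
        return ("replaced_letter", correct[i], wrong[i])
    return None


def noise_model(pairs: Iterable[Tuple[str, str]]) -> Dict[Edit, float]:
    """Relative frequency of each observed edit in (correct, wrong) pairs."""
    counts: Counter = Counter()
    for correct, wrong in pairs:
        edit = classify_edit(correct, wrong)
        if edit is not None:  # pairs more than one edit apart are skipped
            counts[edit] += 1
    total = sum(counts.values())
    return {edit: n / total for edit, n in counts.items()}
```

Keying the counts by (error type, intended text, observed text) plays the role of one confusion matrix per error type.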

Finally, the prior probability and the likelihood probability are multiplied, and the candidate word with the maximum product is the final result.

The English word correction system, method, application, device and readable storage medium provided in the embodiments of the present application are described in detail above. The above description of the embodiments is only intended to help in understanding the method of the present application and its core ideas; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of the present specification should not be construed as a limitation to the present application.

As used in the specification and claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to". "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical effect. The description which follows is a preferred embodiment of the present application, but is made for the purpose of illustrating the general principles of the application and not for the purpose of limiting the scope of the application. The protection scope of the present application shall be subject to the definitions of the appended claims.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a good or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a commodity or system that includes the element.

It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

The foregoing description shows and describes several preferred embodiments of the present application, but as aforementioned, it is to be understood that the application is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the application as described herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the application, which is to be protected by the claims appended hereto.

Claims

1. An English word correction system, which corrects based on a result of OCR recognition of a text, comprising:

2. An english word correction method comprising the english word correction system according to claim 1, characterized in that the method comprises the steps of:

s3: outputting a word to be corrected;

s8: and outputting the second word as a final correction result.

3. The method for correcting an english word according to claim 2, wherein the S1 is specifically: determining the domain to which the dictionary belongs, obtaining a corpus of the domain through a crawler technology, separating the text according to spaces by combining the corpus obtained by the training data of the domain, and finally counting the word frequency of each word to complete dictionary construction.

4. The method for correcting an english word according to claim 2, wherein the S4 specifically includes:

5. The method for correcting an english word according to claim 4, wherein the S42 is specifically: the method comprises the steps of carrying out quantitative measurement on the difference degree of two character strings to obtain an editing distance, wherein the measuring mode is to calculate at least the processing times required for changing one character string into another character string, one time of processing comprises replacing, deleting and adding one letter, and the editing distance of two or more character strings is solved by a dynamic programming method.

6. The method for correcting English words according to claim 5, wherein the specific calculation method of the edit distance is as follows:

7. The method for correcting an english word according to claim 6, wherein the S5 specifically includes:

s51: passing the words in the first word candidate set through a noise channel;

s52: p (w | x) represents the probability that the correct word corresponding to the error word x is w, P (x | w) represents the probability that the error word x is changed into the word w in error, P (w) represents the probability that the correct word w appears, and P (x) represents the probability that the error word x appears;

the w with the maximum calculation probability is the optimal solution

Wherein w_iFor wrong words, wrong notesThe word preceding the test text is w_i-1The latter word is w_i+1；

8. An application of an English word correction method is characterized in that the application specifically comprises the following steps: a piece of text to be corrected is identified through OCR, and words are modified one by one in the English word correction method according to any one of claims 2-7, so that the accuracy rate of the English words in the text is improved.

9. An apparatus for correcting english words, comprising a memory, a processor, and a processing program for correcting english words stored in the memory and executable on the processor, wherein the processing program for correcting english words realizes the steps of the method for correcting english words according to any one of claims 2 to 7 when executed by the processor.

10. A computer-readable storage medium, wherein an english word correction processing program is stored on the computer-readable storage medium, and when the processing program is executed by a processor, the steps of the english word correction method according to any one of claims 2 to 7 are implemented.