CN113095067A - OCR error correction method, device, electronic equipment and storage medium


Info

Publication number: CN113095067A
Application number: CN202110235350.6A
Authority: CN (China)
Legal status: Pending
Prior art keywords: sample, error, text, corrected, characters
Other languages: Chinese (zh)
Inventors: 石峻宇, 乔媛媛
Assignee (current and original): Beijing University of Posts and Telecommunications
Application filed by: Beijing University of Posts and Telecommunications
Priority: CN202110235350.6A
Publication: CN113095067A

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F 40/00 Handling natural language data → G06F 40/20 Natural language analysis → G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS → G06N 20/00 Machine learning → G06N 20/20 Ensemble learning

Abstract

The embodiment of the invention provides an OCR error correction method, an OCR error correction device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring an OCR error text of a target field containing error characters; replacing the error characters with placeholders to obtain a text to be corrected; inputting the OCR error text and the text to be corrected into a pre-trained error correction model; acquiring a target character output by the error correction model, the target character being the correct character, predicted by the error correction model, at the position of the placeholder in the text to be corrected; and replacing the placeholder in the text to be corrected with the target character to obtain a corrected OCR text. By applying the embodiment of the invention, the correct character at the position of the placeholder in the text to be corrected can be obtained through the error correction model, and thus the corrected OCR text can be obtained without manual correction, which reduces the time spent correcting errors in OCR text and improves the efficiency of OCR error correction.

Description

OCR error correction method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a method and an apparatus for OCR error correction, an electronic device, and a storage medium.
Background
At present, by using OCR (Optical Character Recognition) technology, printed text can be scanned into computer text, which is convenient for a user to retrieve and analyze on a computer. However, the resulting computer text may contain character errors.
At present, in the related art, after an erroneous character is identified, the OCR error is corrected manually; however, manual correction of erroneous characters in OCR text takes a long time, and the efficiency of OCR error correction is low.
Disclosure of Invention
Embodiments of the present invention provide an OCR error correction method, an OCR error correction apparatus, an electronic device, and a storage medium, so as to reduce the time for correcting erroneous characters in OCR text and improve the efficiency of OCR error correction. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a method for OCR error correction, where the method includes:
acquiring an OCR error text of a target field containing error characters;
replacing the error characters with placeholders to obtain a text to be corrected;
inputting the OCR error text and the text to be corrected into a pre-trained error correction model; the error correction model is: a model obtained by training, in advance and for the target field, a preset language representation BERT model using a plurality of sample error texts containing error characters, corresponding sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts;
acquiring a target character output by the error correction model; the target characters are: the correct characters at the position of the placeholder in the text to be corrected, which are predicted by the error correction model;
and replacing the placeholder in the text to be corrected with the target character to obtain the corrected OCR text.
Optionally, the training process of the error correction model includes:
replacing a plurality of characters in a plurality of texts in a corpus with placeholders to obtain a plurality of first sample texts to be corrected, and taking the replaced characters as first sample annotation characters; the plurality of texts in the corpus are: texts containing correct characters;
training an initial language representation BERT model by using the plurality of first sample texts to be corrected and the first sample annotation characters to obtain an intermediate language representation BERT model;
and, for the target field, training the intermediate language representation BERT model by using a plurality of second sample error texts containing error characters, corresponding second sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts, to obtain the error correction model.
Optionally, the step of training the initial language representation BERT model by using the plurality of first sample texts to be corrected and the first sample annotation characters to obtain an intermediate language representation BERT model includes:
inputting each first sample text to be corrected into a current initial language representation BERT model, and acquiring each first sample target character output by the current initial language representation BERT model; the first sample target character is: correct characters at placeholder positions in the first sample text to be corrected, which are predicted by the current initial language representation BERT model;
calculating a first loss value based on each first sample target character, each first sample annotation character and a preset first loss function;
judging, according to the first loss value, whether the current initial language representation BERT model has converged;
if so, determining the current initial language representation BERT model to be the intermediate language representation BERT model; if not, adjusting and updating the network parameters of the current initial language representation BERT model, and returning to the step of inputting each first sample text to be corrected into the current initial language representation BERT model and acquiring each first sample target character output by the current initial language representation BERT model.
Optionally, for the target field, the step of training the intermediate language representation BERT model to obtain an error correction model by using a plurality of second sample error texts containing error characters, second to-be-corrected sample texts containing error character placeholders corresponding to the plurality of second sample error texts, and correct sample texts corresponding to the plurality of second sample error texts includes:
for the target field, acquiring a plurality of second sample error texts containing error characters and their corresponding correct sample texts;
based on the plurality of second sample error texts and their corresponding correct sample texts, replacing the error characters in each second sample error text with placeholders to obtain the second sample texts to be corrected, and taking the correct characters corresponding to the error characters in each second sample error text as second sample annotation characters;
inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model; the second sample target character is: the correct character, predicted by the current intermediate language representation BERT model, at the position of the placeholder in the second sample text to be corrected;
calculating a second loss value based on each second sample target character, each second sample annotation character and a preset second loss function;
judging, according to the second loss value, whether the current intermediate language representation BERT model has converged;
if so, determining the current intermediate language representation BERT model to be the trained error correction model; if not, adjusting and updating the network parameters of the current intermediate language representation BERT model, and returning to the step of inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model.
In a second aspect, an embodiment of the present invention provides an apparatus for OCR error correction, where the apparatus includes:
an OCR error text acquisition unit for acquiring an OCR error text of a target field containing error characters;
a to-be-corrected text obtaining unit, configured to replace the erroneous characters with placeholders, and obtain a to-be-corrected text;
the input unit is used for inputting the OCR error text and the text to be corrected into a pre-trained error correction model; the error correction model is: a model obtained by training, in advance and for the target field, a preset language representation BERT model using a plurality of sample error texts containing error characters, corresponding sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts;
a target character acquisition unit for acquiring a target character output by the error correction model; the target characters are: the correct characters at the position of the placeholder in the text to be corrected, which are predicted by the error correction model;
and the corrected OCR text obtaining unit is used for replacing the placeholder in the text to be corrected with the target character to obtain the corrected OCR text.
Optionally, the apparatus further includes: an error correction model training unit;
the error correction model training unit includes:
the system comprises a sample text and annotation character acquisition module in a corpus, a correction module and a correction module, wherein the sample text and annotation character acquisition module is used for replacing a plurality of characters in a plurality of texts in the corpus with placeholders as a plurality of first sample texts to be corrected and using the plurality of characters as first sample annotation characters; the plurality of texts in the corpus is: text containing the correct characters;
an intermediate language representation BERT model obtaining module, configured to train an initial language representation BERT model using the plurality of first sample texts to be corrected and the first sample annotation characters, so as to obtain an intermediate language representation BERT model;
and the error correction model obtaining module is used for training the intermediate language representation BERT model to obtain an error correction model by using a plurality of second sample error texts containing error characters, second to-be-corrected sample texts containing error character placeholders corresponding to the plurality of second sample error texts and correct sample texts corresponding to the plurality of second sample error texts aiming at the target field.
Optionally, the intermediate language representation BERT model obtaining module is specifically configured to:
inputting each first sample text to be corrected into a current initial language representation BERT model, and acquiring each first sample target character output by the current initial language representation BERT model; the first sample target character is: correct characters at placeholder positions in the first sample text to be corrected, which are predicted by the current initial language representation BERT model;
calculating a first loss value based on each first sample target character, each first sample annotation character and a preset first loss function;
judging, according to the first loss value, whether the current initial language representation BERT model has converged;
if so, determining the current initial language representation BERT model to be the intermediate language representation BERT model; if not, adjusting and updating the network parameters of the current initial language representation BERT model, and returning to the step of inputting each first sample text to be corrected into the current initial language representation BERT model and acquiring each first sample target character output by the current initial language representation BERT model.
Optionally, the error correction model obtaining module is specifically configured to:
for the target field, acquiring a plurality of second sample error texts containing error characters and their corresponding correct sample texts;
based on the plurality of second sample error texts and their corresponding correct sample texts, replacing the error characters in each second sample error text with placeholders to obtain the second sample texts to be corrected, and taking the correct characters corresponding to the error characters in each second sample error text as second sample annotation characters;
inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model; the second sample target character is: the correct character, predicted by the current intermediate language representation BERT model, at the position of the placeholder in the second sample text to be corrected;
calculating a second loss value based on each second sample target character, each second sample annotation character and a preset second loss function;
judging, according to the second loss value, whether the current intermediate language representation BERT model has converged;
if so, determining the current intermediate language representation BERT model to be the trained error correction model; if not, adjusting and updating the network parameters of the current intermediate language representation BERT model, and returning to the step of inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
and the processor is configured to implement the steps of any of the above OCR error correction methods when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium in which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any of the above OCR error correction methods.
In a fifth aspect, embodiments of the present invention also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform any of the OCR error correction methods described above.
The OCR error correction method, apparatus, electronic equipment and storage medium provided by the embodiments of the present invention can acquire an OCR error text of a target field containing error characters; replace the error characters with placeholders to obtain a text to be corrected; input the OCR error text and the text to be corrected into a pre-trained error correction model, the error correction model being a model obtained by training, in advance and for the target field, a preset language representation BERT model using a plurality of sample error texts containing error characters, corresponding sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts; acquire a target character output by the error correction model, the target character being the correct character, predicted by the error correction model, at the position of the placeholder in the text to be corrected; and replace the placeholder in the text to be corrected with the target character to obtain a corrected OCR text.
Therefore, by applying the embodiment of the invention, the correct character at the position of the placeholder in the text to be corrected can be obtained through the error correction model, and thus the corrected OCR text can be obtained without manual correction, which reduces the time spent correcting errors in OCR text and improves the efficiency of OCR error correction.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart illustrating a method for OCR error correction according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for training an error correction model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating a method for training an error correction model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for OCR error correction according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to reduce the time for correcting the error characters in the OCR text and improve the efficiency of correcting the OCR error, embodiments of the present invention provide a method, an apparatus, an electronic device, and a storage medium for OCR error correction.
The OCR error correction method provided by the embodiment of the present invention can be applied to any electronic device that needs to correct OCR errors, such as a computer or a mobile terminal, which is not limited herein. For convenience of description, such a device is hereinafter referred to simply as the electronic device.
Referring to fig. 1, the specific processing flow of the OCR error correction method according to an embodiment of the present invention may include:
step S101, obtaining an OCR error text of a target field containing error characters.
Step S102, replacing the error characters with placeholders to obtain the text to be corrected.
Step S103, inputting the OCR error text and the text to be corrected into a pre-trained error correction model.
The error correction model is: a model obtained by training, in advance and for the target field, a preset language representation BERT model using a plurality of sample error texts containing error characters, corresponding sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts.
step S104, acquiring a target character output by the error correction model; the target characters are: and the error correction model predicts the correct character at the position of the placeholder in the text to be corrected.
And step S105, replacing the placeholder in the text to be corrected with a target character to obtain the corrected OCR text.
Therefore, by applying the embodiment of the invention, the correct character at the position of the placeholder in the text to be corrected can be obtained through the error correction model, and thus the corrected OCR text can be obtained without manual correction, which reduces the time spent correcting errors in OCR text and improves the efficiency of OCR error correction.
Moreover, in this embodiment, an erroneous character may still carry part of the information of the correct character, so it is not simply discarded: both the OCR error text and the text to be corrected are input into the error correction model, which makes the correct character predicted at the placeholder position more accurate and thus makes the resulting corrected OCR text more accurate.
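As a concrete illustration of steps S101 to S105, the following is a minimal inference sketch. It assumes the error correction model is a fine-tuned BERT masked-language model used through the Hugging Face transformers library, that the OCR error text and the text to be corrected are fed in as a sentence pair, and that tokenization is character-level (as in Chinese BERT); the checkpoint path and function name are illustrative, and the error positions are assumed to already be known, as in the embodiment.

```python
# Minimal inference sketch for steps S101-S105 (illustrative assumptions:
# fine-tuned BertForMaskedLM checkpoint, sentence-pair input, and
# character-level tokenization such as Chinese BERT).
import torch
from transformers import BertForMaskedLM, BertTokenizer

MODEL_DIR = "path/to/fine-tuned-error-correction-model"  # hypothetical path
tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForMaskedLM.from_pretrained(MODEL_DIR)
model.eval()

def correct_ocr_text(ocr_error_text, error_positions):
    # Step S102: replace each erroneous character with the placeholder.
    chars = list(ocr_error_text)
    for pos in error_positions:
        chars[pos] = tokenizer.mask_token
    text_to_correct = "".join(chars)

    # Step S103: input the OCR error text and the text to be corrected
    # into the error correction model as a sentence pair.
    inputs = tokenizer(ocr_error_text, text_to_correct, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Step S104: take the most probable token at each [MASK] position
    # as the target character.
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    predicted_ids = logits[0, mask_idx].argmax(dim=-1).tolist()
    target_chars = tokenizer.convert_ids_to_tokens(predicted_ids)

    # Step S105: replace the placeholders with the target characters.
    for pos, ch in zip(error_positions, target_chars):
        chars[pos] = ch
    return "".join(chars)
```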
In one implementation, the training procedure of the error correction model in the above embodiment may be as shown in fig. 2 and fig. 3.
Referring to fig. 2, the specific processing flow of a method for training an error correction model according to an embodiment of the present invention may include:
step S201, replacing a plurality of characters in a plurality of texts in a corpus with placeholders as a plurality of first sample texts to be corrected, and taking the plurality of characters as first sample marking characters; the plurality of texts in the corpus is: text containing the correct characters.
In one implementation, the corpus may be text from Wikipedia, and the characters of the texts in the corpus may all be correct characters.
Step S202, training the initial language representation BERT model by using a plurality of first sample texts to be corrected and first sample annotation characters to obtain an intermediate language representation BERT model.
Step S203, for the target field, training the intermediate language representation BERT model by using a plurality of second sample error texts containing error characters, corresponding second sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts, to obtain the error correction model.
In one implementation, the second sample text to be corrected is a text in which the error characters in the second sample error text have been replaced by placeholders.
Thus, the initial language representation BERT model is trained on a plurality of texts in the corpus, so the resulting intermediate language representation BERT model has learned the semantic information of a large number of corpus texts and can already correct erroneous characters. To correct erroneous characters in target-field texts more accurately, the intermediate language representation BERT model is then trained with target-field texts, and the resulting error correction model corrects erroneous characters in target-field texts with better effect. Moreover, because the initial language representation BERT model is trained on the many texts in the corpus, the drawbacks of an insufficient number of target-field texts and insufficiently comprehensive samples are overcome.
In other embodiments, the language representation BERT model may be trained only on the corpus texts, or only on target-field texts, and the trained model used as the error correction model; this can also correct erroneous characters in target-field texts to a certain extent.
Referring to fig. 3, another specific processing flow of a method for training an error correction model according to an embodiment of the present invention may include:
step S301, replacing a plurality of characters in a plurality of texts in a corpus with placeholders as a plurality of first sample texts to be corrected, and taking the plurality of characters as first sample marking characters; the plurality of texts in the corpus is: text containing the correct characters.
That is, the first sample reference character is: the true correct character.
For example: one text in the corpus is: "You need to bath your dog"; the characters "need" and "You" may be replaced with a placeholder [ MASK ], and "You [ MASK ] to bath [ MASK ] dog" is obtained as the first sample text to be corrected, where the characters "need" and "You" are used as the first sample labeled characters.
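A minimal sketch of this first-stage sample construction is given below. Random character-level masking at a fixed rate is an assumption borrowed from common BERT pre-training practice; the embodiment does not fix the masking rate or selection strategy, and the function name is illustrative.

```python
# Sketch: build a first sample text to be corrected and its first sample
# annotation characters from one corpus text.
import random

MASK = "[MASK]"

def make_first_sample(text, mask_rate=0.15):
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if not c.isspace()]
    if not candidates:
        return text, {}
    k = min(len(candidates), max(1, int(len(candidates) * mask_rate)))
    chosen = sorted(random.sample(candidates, k))
    # The replaced characters serve as the first sample annotation characters.
    annotations = {i: chars[i] for i in chosen}
    for i in chosen:
        chars[i] = MASK
    return "".join(chars), annotations
```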
Step S302, inputting each first sample text to be corrected into the current initial language representation BERT model, and obtaining each first sample target character output by the current initial language representation BERT model.
The first sample target character is: the correct character, predicted by the current initial language representation BERT model, at the position of the placeholder in the first sample text to be corrected.
That is, the first sample target character is: the correct character predicted by the model.
Step S303, calculating a first loss value based on each first sample target character, each first sample annotation character, and a preset first loss function.
In the embodiment of the present invention, the loss value may be calculated using, without limitation, an L1-norm loss function, an L2-norm loss function, a smooth-L1 loss function, or the like.
Step S304, judging, according to the first loss value, whether the current initial language representation BERT model has converged.
If the judgment result is no, that is, the current initial language representation BERT model does not converge, executing step S305; if the result of the judgment is yes, that is, the current initial language representation BERT model converges, step S306 is executed.
Step S305, adjusting and updating the network parameters of the current initial language representation BERT model. The process returns to step S302.
Step S306, determining the current initial language representation BERT model to be the intermediate language representation BERT model.
In one implementation, this intermediate language representation BERT model is then used as the current intermediate language representation BERT model.
Step S307, for the target field, a plurality of second sample error texts containing the error characters and corresponding correct sample texts are obtained.
Step S308, based on the plurality of second sample error texts and their corresponding correct sample texts, replacing the error characters in each second sample error text with placeholders to obtain the second sample texts to be corrected, and taking the correct characters corresponding to the error characters in each second sample error text as second sample annotation characters.
One specific way of obtaining the second sample text to be corrected may be: according to the second sample error text and its corresponding correct sample text, obtaining the error characters in the second sample error text and their corresponding correct characters based on the longest common subsequence algorithm; and replacing the error characters in the second sample error text with placeholders to obtain the second sample text to be corrected, as in the worked example and the sketch that follow.
For example: the second sample error text is X = (x1, y2, x3, x4, x5), where X denotes the second sample error text and x1, y2, x3, x4, x5 denote the individual characters in the second sample error text;
the correct sample text corresponding to the second sample error text is Y = (x1, x2, x3, x4, x5), where Y denotes the correct sample text and x1, x2, x3, x4, x5 denote the individual characters in the correct sample text;
the longest common subsequence (LCS) algorithm determines that the error character is y2 and that the corresponding correct character is x2, so x2 is taken as the second sample annotation character;
replacing y2 in the second sample error text with the placeholder [MASK] yields C = (x1, [MASK], x3, x4, x5) as the second sample text to be corrected, where C denotes the second sample text to be corrected and x1, [MASK], x3, x4, x5 denote the individual characters in the second sample text to be corrected.
Step S309, inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model, and obtaining each second sample target character output by the current intermediate language representation BERT model.
The second sample target character is: the correct character, predicted by the current intermediate language representation BERT model, at the position of the placeholder in the second sample text to be corrected.
For example: the above X = (x1, y2, x3, x4, x5) and C = (x1, [MASK], x3, x4, x5) are input into the current intermediate language representation BERT model to obtain the second sample target character output by the current intermediate language representation BERT model. If the second sample target character is x2, the character prediction output by the current intermediate language representation BERT model is correct; if the second sample target character is not x2, the character prediction output by the current intermediate language representation BERT model is wrong.
Step S310, calculating a second loss value based on each second sample target character, each second sample annotation character, and a preset second loss function.
In one implementation, the second loss function may be the same as the first loss function.
Step S311, judging, according to the second loss value, whether the current intermediate language representation BERT model has converged.
If the judgment result is negative, that is, the current intermediate language representation BERT model does not converge, executing step S312; if the result of the determination is yes, that is, the current intermediate language representation BERT model converges, step S313 is executed.
Step S312, adjusting and updating the network parameters of the current intermediate language representation BERT model. The process returns to step S309.
And step S313, determining the current intermediate language representation BERT model as a trained error correction model.
In this embodiment, the initial language representation BERT model is trained on a plurality of texts in the corpus, so the resulting intermediate language representation BERT model has learned the semantic information of a large number of corpus texts and can correct erroneous characters. The intermediate language representation BERT model is then trained with target-field texts to obtain the error correction model, which corrects erroneous characters in target-field texts more accurately and with better effect. Moreover, because the initial language representation BERT model is trained on the many texts in the corpus, the drawbacks of an insufficient number of target-field texts and insufficiently comprehensive samples are overcome.
Moreover, an erroneous character may carry part of the information of the correct character, so it is not simply discarded: when the intermediate language representation BERT model is trained with target-field texts, both the second sample error text and the second sample text to be corrected are input into the current intermediate language representation BERT model. The resulting error correction model therefore has higher accuracy, and the correct characters it predicts at the placeholder positions are more accurate.
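The second training stage (steps S307 to S313) could then look like the sketch below. Note two stated substitutions: Hugging Face's BertForMaskedLM is used with its built-in cross-entropy loss over the vocabulary rather than the L1-family losses the text lists as examples, and convergence is checked by the average epoch loss falling below a threshold; the optimizer, learning rate, and threshold are illustrative assumptions.

```python
# Sketch of the second-stage fine-tuning loop (steps S307-S313). Assumes
# character-level tokenization, so the masked text and the correct text
# encode to token sequences of identical length.
import torch
from torch.optim import AdamW

def fine_tune(model, tokenizer, samples, epochs=3, converge_at=0.05):
    # samples: (second sample error text, second sample text to be
    # corrected, correct sample text) triples.
    optimizer = AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(epochs):
        total = 0.0
        for error_text, masked_text, correct_text in samples:
            # Step S309: input the error text and the masked text as a pair.
            enc = tokenizer(error_text, masked_text, return_tensors="pt")
            gold = tokenizer(error_text, correct_text,
                             return_tensors="pt")["input_ids"]
            # Supervise only the [MASK] slots with the second sample
            # annotation characters; label -100 is ignored by the loss.
            mask_pos = enc["input_ids"] == tokenizer.mask_token_id
            labels = torch.full_like(enc["input_ids"], -100)
            labels[mask_pos] = gold[mask_pos]
            loss = model(**enc, labels=labels).loss  # second loss value (S310)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # adjust and update network parameters (S312)
            total += loss.item()
        # Steps S311/S313: a simple convergence check on the average loss.
        if total / len(samples) < converge_at:
            break
    return model
```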
As shown in fig. 4, an OCR error correction apparatus provided in an embodiment of the present invention includes:
an OCR error text acquiring unit 401, configured to acquire an OCR error text of a target field containing an error character;
a to-be-corrected text obtaining unit 402, configured to replace the error character with a placeholder, and obtain a to-be-corrected text;
an input unit 403, configured to input the OCR error text and the text to be corrected into a pre-trained error correction model; the error correction model is: a model obtained by training, in advance and for the target field, a preset language representation BERT model using a plurality of sample error texts containing error characters, corresponding sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts;
a target character acquisition unit 404, configured to acquire a target character output by the error correction model; the target characters are: the correct characters at the position of the placeholder in the text to be corrected, which are predicted by the error correction model;
the corrected OCR text obtaining unit 405 is configured to replace placeholders in the text to be corrected with target characters to obtain a corrected OCR text.
Optionally, on the basis of the apparatus shown in fig. 4, the apparatus may further include: an error correction model training unit;
an error correction model training unit comprising:
the system comprises a sample text and annotation character acquisition module in a corpus, a correction module and a correction module, wherein the sample text and annotation character acquisition module is used for replacing a plurality of characters in a plurality of texts in the corpus with placeholders as a plurality of first sample texts to be corrected and using the plurality of characters as first sample annotation characters; the plurality of texts in the corpus is: text containing the correct characters;
the intermediate language representation BERT model obtaining module is used for training the initial language representation BERT model by using a plurality of first sample texts to be corrected and first sample marking characters to obtain an intermediate language representation BERT model;
and the error correction model obtaining module is used for training the intermediate language representation BERT model to obtain an error correction model by using a plurality of second sample error texts containing error characters, second to-be-corrected sample texts containing error character placeholders corresponding to the plurality of second sample error texts and correct sample texts corresponding to the plurality of second sample error texts aiming at the target field.
Optionally, the intermediate language representation BERT model obtaining module is specifically configured to:
inputting each first sample text to be corrected into a current initial language representation BERT model, and acquiring each first sample target character output by the current initial language representation BERT model; the first sample target character is: correct characters at placeholder positions in the first sample text to be corrected, which are predicted by the current initial language representation BERT model;
calculating a first loss value based on each first sample target character, each first sample annotation character and a preset first loss function;
judging, according to the first loss value, whether the current initial language representation BERT model has converged;
if so, determining the current initial language representation BERT model to be the intermediate language representation BERT model; if not, adjusting and updating the network parameters of the current initial language representation BERT model, and returning to the step of inputting each first sample text to be corrected into the current initial language representation BERT model and acquiring each first sample target character output by the current initial language representation BERT model.
Optionally, the error correction model obtaining module is specifically configured to:
for the target field, acquiring a plurality of second sample error texts containing error characters and their corresponding correct sample texts;
based on the plurality of second sample error texts and their corresponding correct sample texts, replacing the error characters in each second sample error text with placeholders to obtain the second sample texts to be corrected, and taking the correct characters corresponding to the error characters in each second sample error text as second sample annotation characters;
inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model; the second sample target character is: the correct character, predicted by the current intermediate language representation BERT model, at the position of the placeholder in the second sample text to be corrected;
calculating a second loss value based on each second sample target character, each second sample annotation character and a preset second loss function;
judging, according to the second loss value, whether the current intermediate language representation BERT model has converged;
if so, determining the current intermediate language representation BERT model to be the trained error correction model; if not, adjusting and updating the network parameters of the current intermediate language representation BERT model, and returning to the step of inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model.
Therefore, by applying the embodiment of the invention, the correct character at the position of the placeholder in the text to be corrected can be obtained through the error correction model, and thus the corrected OCR text can be obtained without manual correction, which reduces the time spent correcting errors in OCR text and improves the efficiency of OCR error correction.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, which includes a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 complete mutual communication through the communication bus 504,
a memory 503 for storing a computer program;
the processor 501, when executing the program stored in the memory 503, implements the following steps:
acquiring an OCR error text of a target field containing error characters;
replacing the error characters with placeholders to obtain a text to be corrected;
inputting the OCR error text and the text to be corrected into a pre-trained error correction model; the error correction model is: a model obtained by training, in advance and for the target field, a preset language representation BERT model using a plurality of sample error texts containing error characters, corresponding sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts;
acquiring a target character output by the error correction model; the target characters are: the correct characters at the position of the placeholder in the text to be corrected, which are predicted by the error correction model;
and replacing the placeholder in the text to be corrected with the target character to obtain the corrected OCR text.
Therefore, by applying the embodiment of the invention, the correct character at the position of the placeholder in the text to be corrected can be obtained through the error correction model, and thus the corrected OCR text can be obtained without manual correction, which reduces the time spent correcting errors in OCR text and improves the efficiency of OCR error correction.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the OCR error correction methods described above.
In yet another embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the OCR error correction methods of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the apparatus for OCR error correction, the electronic device, the computer readable storage medium, and the computer program product, etc., since they are substantially similar to the method embodiments of OCR error correction, the description is relatively simple, and relevant points can be referred to the partial description of the method embodiments of OCR error correction.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for OCR error correction, the method comprising:
acquiring an optical character recognition (OCR) error text of a target field containing error characters;
replacing the error characters with placeholders to obtain a text to be corrected;
inputting the OCR error text and the text to be corrected into a pre-trained error correction model; the error correction model is: a model obtained by training, in advance and for the target field, a preset language representation BERT model using a plurality of sample error texts containing error characters, corresponding sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts;
acquiring a target character output by the error correction model; the target characters are: the correct characters at the position of the placeholder in the text to be corrected, which are predicted by the error correction model;
and replacing the placeholder in the text to be corrected with the target character to obtain the corrected OCR text.
2. The method of claim 1, wherein the training process of the error correction model comprises:
replacing a plurality of characters in a plurality of texts in a corpus with placeholders to obtain a plurality of first sample texts to be corrected, and taking the replaced characters as first sample annotation characters; the plurality of texts in the corpus are: texts containing correct characters;
training an initial language representation BERT model by using the plurality of first sample texts to be corrected and the first sample annotation characters to obtain an intermediate language representation BERT model;
and, for the target field, training the intermediate language representation BERT model by using a plurality of second sample error texts containing error characters, corresponding second sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts, to obtain the error correction model.
3. The method of claim 2, wherein the step of training an initial language-characterization BERT model using the plurality of first sample texts to be corrected and the first sample annotation characters to obtain an intermediate language-characterization BERT model comprises:
inputting each first sample text to be corrected into a current initial language representation BERT model, and acquiring each first sample target character output by the current initial language representation BERT model; the first sample target character is: correct characters at placeholder positions in the first sample text to be corrected, which are predicted by the current initial language representation BERT model;
calculating a first loss value based on each first sample target character, each first sample annotation character and a preset first loss function;
judging, according to the first loss value, whether the current initial language representation BERT model has converged;
if so, determining the current initial language representation BERT model to be the intermediate language representation BERT model; if not, adjusting and updating the network parameters of the current initial language representation BERT model, and returning to the step of inputting each first sample text to be corrected into the current initial language representation BERT model and acquiring each first sample target character output by the current initial language representation BERT model.
4. The method according to claim 3, wherein the step of training the intermediate language representation BERT model to obtain an error correction model by using, for the target domain, a plurality of second sample error texts containing error characters, second to-be-corrected sample texts containing error character placeholders corresponding to the plurality of second sample error texts, and correct sample texts corresponding to the plurality of second sample error texts comprises:
for the target field, acquiring a plurality of second sample error texts containing error characters and their corresponding correct sample texts;
based on the plurality of second sample error texts and their corresponding correct sample texts, replacing the error characters in each second sample error text with placeholders to obtain the second sample texts to be corrected, and taking the correct characters corresponding to the error characters in each second sample error text as second sample annotation characters;
inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model; the second sample target character is: the correct character, predicted by the current intermediate language representation BERT model, at the position of the placeholder in the second sample text to be corrected;
calculating a second loss value based on each second sample target character, each second sample annotation character and a preset second loss function;
judging, according to the second loss value, whether the current intermediate language representation BERT model has converged;
if so, determining the current intermediate language representation BERT model to be the trained error correction model; if not, adjusting and updating the network parameters of the current intermediate language representation BERT model, and returning to the step of inputting each second sample error text and the corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model.
5. An apparatus for OCR error correction, the apparatus comprising:
an OCR error text acquisition unit for acquiring an optical character recognition (OCR) error text of a target field containing error characters;
a to-be-corrected text obtaining unit, configured to replace the erroneous characters with placeholders, and obtain a to-be-corrected text;
the input unit is used for inputting the OCR error text and the text to be corrected into a pre-trained error correction model; the error correction model is: a model obtained by training, in advance and for the target field, a preset language representation BERT model using a plurality of sample error texts containing error characters, corresponding sample texts to be corrected in which the error characters are replaced by placeholders, and corresponding correct sample texts;
a target character acquisition unit for acquiring a target character output by the error correction model; the target characters are: the correct characters at the position of the placeholder in the text to be corrected, which are predicted by the error correction model;
and the corrected OCR text obtaining unit is used for replacing the placeholder in the text to be corrected with the target character to obtain the corrected OCR text.
6. The apparatus of claim 5, further comprising: an error correction model training unit;
the error correction model training unit includes:
the system comprises a sample text and annotation character acquisition module in a corpus, a correction module and a correction module, wherein the sample text and annotation character acquisition module is used for replacing a plurality of characters in a plurality of texts in the corpus with placeholders as a plurality of first sample texts to be corrected and using the plurality of characters as first sample annotation characters; the plurality of texts in the corpus is: text containing the correct characters;
an intermediate language representation BERT model obtaining module, configured to train an initial language representation BERT model using the plurality of first sample texts to be corrected and the first sample annotation characters, so as to obtain an intermediate language representation BERT model;
and the error correction model obtaining module is used for training the intermediate language representation BERT model to obtain an error correction model by using a plurality of second sample error texts containing error characters, second to-be-corrected sample texts containing error character placeholders corresponding to the plurality of second sample error texts and correct sample texts corresponding to the plurality of second sample error texts aiming at the target field.
7. The apparatus according to claim 6, characterized in that said intermediate language representation BERT model obtaining module is specifically configured to:
inputting each first sample text to be corrected into a current initial language representation BERT model, and acquiring each first sample target character output by the current initial language representation BERT model; the first sample target characters are: the correct characters, predicted by the current initial language representation BERT model, at the placeholder positions in the first sample texts to be corrected;
calculating a first loss value based on each first sample target character, each first sample annotation character and a preset first loss function;
judging, according to the first loss value, whether the current initial language representation BERT model has converged;
if so, taking the current initial language representation BERT model as the intermediate language representation BERT model; if not, adjusting the network parameters of the current initial language representation BERT model and returning to the step of inputting each first sample text to be corrected into the current initial language representation BERT model and acquiring each first sample target character output by the current initial language representation BERT model.
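A minimal sketch of one such first-stage training step, assuming PyTorch and the HuggingFace transformers masked-LM interface; the checkpoint name, learning rate, and the use of cross-entropy at placeholder positions via the built-in labels mechanism are assumptions (the claim only recites a preset first loss function).

    # One first-stage step: feed a masked sentence to the BERT masked-LM and
    # compute the first loss only at the placeholder positions (-100 marks
    # positions that the built-in cross-entropy loss ignores). Assumes
    # len(label_chars) equals the number of placeholders in masked_text.
    import torch
    from transformers import BertForMaskedLM, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    def first_stage_step(masked_text, label_chars):
        enc = tokenizer(masked_text, return_tensors="pt")
        labels = torch.full_like(enc["input_ids"], -100)
        mask_pos = (enc["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
        labels[mask_pos] = torch.tensor(tokenizer.convert_tokens_to_ids(list(label_chars)))
        loss = model(**enc, labels=labels).loss      # first loss value
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()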
8. The apparatus of claim 7, wherein the error correction model obtaining module is specifically configured to:
acquiring, for the target field, a plurality of second sample error texts containing error characters and their corresponding correct sample texts;
based on the second sample error texts and their corresponding correct sample texts, replacing the error characters in each second sample error text with placeholders to obtain a second sample text to be corrected, and using the correct characters corresponding to the error characters in the second sample error text as second sample annotation characters;
inputting each second sample error text and its corresponding second sample text to be corrected into a current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model; the second sample target characters are: the correct characters, predicted by the current intermediate language representation BERT model, at the placeholder positions in the second sample texts to be corrected;
calculating a second loss value based on each second sample target character, each second sample annotation character and a preset second loss function;
judging, according to the second loss value, whether the current intermediate language representation BERT model has converged;
if so, taking the current intermediate language representation BERT model as the trained error correction model; if not, adjusting the network parameters of the current intermediate language representation BERT model and returning to the step of inputting each second sample error text and its corresponding second sample text to be corrected into the current intermediate language representation BERT model to obtain each second sample target character output by the current intermediate language representation BERT model.
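Again purely as a sketch, reusing the model, tokenizer, and optimizer from the claim-7 example above: the second-stage samples can be built by character-wise comparison of each error text with its correct text (assuming equal lengths, i.e. substitution-type OCR errors — an assumption, since the claim does not restrict the alignment method), and the resulting sentence pair is fed to the masked-LM.

    # Claim-8 sketch: derive placeholders and annotation characters by diffing
    # the OCR error text against its correct text, then fine-tune on the
    # (error text, text to be corrected) sentence pair.
    def make_second_stage_sample(error_text, correct_text, mask_token="[MASK]"):
        chars, labels = list(error_text), []
        for i, (e, c) in enumerate(zip(error_text, correct_text)):
            if e != c:                        # error character found
                chars[i] = mask_token         # placeholder
                labels.append(c)              # second sample annotation character
        return "".join(chars), labels

    def second_stage_step(error_text, masked_text, label_chars):
        # The error text is kept in the pair so the model still sees the
        # original (possibly misrecognized) context around each placeholder.
        enc = tokenizer(error_text, masked_text, return_tensors="pt")
        labels = torch.full_like(enc["input_ids"], -100)
        mask_pos = (enc["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
        labels[mask_pos] = torch.tensor(tokenizer.convert_tokens_to_ids(list(label_chars)))
        loss = model(**enc, labels=labels).loss      # second loss value
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()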
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 4 when executing the program stored in the memory.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1 to 4.
CN202110235350.6A 2021-03-03 2021-03-03 OCR error correction method, device, electronic equipment and storage medium Pending CN113095067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110235350.6A CN113095067A (en) 2021-03-03 2021-03-03 OCR error correction method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113095067A true CN113095067A (en) 2021-07-09

Family

ID=76666298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110235350.6A Pending CN113095067A (en) 2021-03-03 2021-03-03 OCR error correction method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113095067A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8553968B1 (en) * 2005-02-18 2013-10-08 Western Digital Technologies, Inc. Using optical character recognition augmented by an error correction code to detect serial numbers written on a wafer
US20200311459A1 (en) * 2019-03-29 2020-10-01 Abbyy Production Llc Training language models using text corpora comprising realistic optical character recognition (ocr) errors
CN111126045A (en) * 2019-11-25 2020-05-08 泰康保险集团股份有限公司 Text error correction method and device
CN110969012A (en) * 2019-11-29 2020-04-07 北京字节跳动网络技术有限公司 Text error correction method and device, storage medium and electronic equipment
CN111046652A (en) * 2019-12-10 2020-04-21 拉扎斯网络科技(上海)有限公司 Text error correction method, text error correction device, storage medium, and electronic apparatus
CN111460827A (en) * 2020-04-01 2020-07-28 北京爱咔咔信息技术有限公司 Text information processing method, system, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
THI TUYET HAI NGUYEN et al.: "Neural Machine Translation with BERT for Post-OCR Error Detection and Correction", The ACM/IEEE Joint Conference on Digital Libraries in 2020 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115525730A (en) * 2022-02-27 2022-12-27 博才汇(宁波)信息科技有限公司 Webpage content extraction method and device based on page empowerment and electronic equipment
CN115525730B (en) * 2022-02-27 2024-04-19 山东视角数字技术有限公司 Webpage content extraction method and device based on page weighting and electronic equipment
WO2023173560A1 (en) * 2022-03-16 2023-09-21 来也科技(北京)有限公司 Rpa and ai based text error correction method, training method and related device thereof

Similar Documents

Publication Publication Date Title
CN112016310A (en) Text error correction method, system, device and readable storage medium
CN113095067A (en) OCR error correction method, device, electronic equipment and storage medium
CN110718226A (en) Speech recognition result processing method and device, electronic equipment and medium
CN111222368A (en) Method and device for identifying document paragraph and electronic equipment
CN112417848A (en) Corpus generation method and device and computer equipment
CN111325031A (en) Resume parsing method and device
CN113569021B (en) Method for classifying users, computer device and readable storage medium
CN112233669A (en) Speech content prompting method and system
CN111243593A (en) Speech recognition error correction method, mobile terminal and computer-readable storage medium
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
CN112199500A (en) Emotional tendency identification method and device for comments and electronic equipment
CN112329470A (en) Intelligent address identification method and device based on end-to-end model training
CN109670040B (en) Writing assistance method and device, storage medium and computer equipment
CN115909386B (en) Method, equipment and storage medium for supplementing and correcting pipeline instrument flow chart
CN112163415A (en) User intention identification method and device for feedback content and electronic equipment
CN116013307A (en) Punctuation prediction method, punctuation prediction device, punctuation prediction equipment and computer storage medium
CN111797614A (en) Text processing method and device
CN114065858A (en) Model training method and device, electronic equipment and storage medium
CN110895924B (en) Method and device for reading document content aloud, electronic equipment and readable storage medium
CN113656669A (en) Label updating method and device
CN113434494A (en) Data cleaning method and system, electronic equipment and storage medium
CN110728137A (en) Method and device for word segmentation
CN113626587A (en) Text type identification method and device, electronic equipment and medium
CN110929018A (en) Text processing method and device, storage medium and electronic equipment
CN111814949B (en) Data labeling method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210709