CN116935414A - OCR recognition result correction method and device - Google Patents
- Publication number
- CN116935414A (application CN202310916950.8A)
- Authority
- CN
- China
- Prior art keywords
- initial
- matching
- target
- result
- rule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses a method and device for correcting an OCR (optical character recognition) result, relates to the technical fields of text recognition and financial technology, and mainly aims to solve the problem of low correction accuracy for OCR recognition results. The method mainly comprises: acquiring an initial recognition result to be corrected and information about the target recognition model that output the initial recognition result; identifying, from the correction strategies, a target correction strategy matched with the target recognition model information, wherein the target correction strategy is constructed based on historical recognition results of the target recognition model, and the correction strategy comprises a regular matching rule (a rule expressed as a regular expression) and a replacement rule corresponding to the regular matching rule; and correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result. The method is mainly used for correcting OCR recognition results.
Description
Technical Field
The invention relates to the technical fields of text recognition and financial technology, and in particular to a method and device for correcting an OCR recognition result.
Background
With the continuous development of the automobile finance leasing business, optical character recognition (OCR) technology has been widely introduced into this field, for example for verifying driver's licenses and vehicle licenses, verifying vehicle qualification certificates, and archiving. OCR refers to the process of analyzing and recognizing images of text material, especially images of flat paper documents, to obtain text and layout information, and is currently the mainstream character recognition method. Because OCR recognition is probabilistic, no model is 100% accurate and the recognition result cannot be guaranteed to be fully correct; accuracy therefore has to be improved through image preprocessing or through post-processing of the recognition result.
Existing post-processing of recognition results mainly corrects characters according to manually set character-type conditions, for example replacing English letters recognized inside a purely numeric character segment with similarly shaped digits. However, when a character segment mixes several character types, as in the driver's licenses, vehicle licenses, vehicle qualification certificates and vehicle insurance policies handled in the automobile finance leasing business, a replacement rule tied to a single character type cannot correct it accurately, so the correction accuracy of OCR recognition results is low.
Disclosure of Invention
In view of this, the invention provides a method and device for correcting an OCR recognition result, a storage medium and a computer device, mainly aiming to solve the problem that existing correction of OCR recognition results has low accuracy, especially for text with complex character types such as driver's licenses, vehicle licenses, vehicle qualification certificates and vehicle insurance policies in the automobile finance leasing business.
According to one aspect of the present invention, there is provided a method for correcting an OCR recognition result, including:
acquiring an initial recognition result to be corrected and information about the target recognition model that output the initial recognition result;
identifying, from the correction strategies, a target correction strategy matched with the target recognition model information, wherein the target correction strategy is constructed based on historical recognition results of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
Further, before the target correction strategy matched with the target recognition model information is identified from the correction strategies, the method further comprises:
for each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
for each repeated negative sample in the negative sample set, extracting initial matching content from the repeated negative sample, and updating the matching content based on the positive sample matching result of the initial matching content, to obtain a regular matching rule for each repeated negative sample in the negative sample set;
and determining a replacement rule corresponding to the regular matching rule based on a negative sample corresponding to the regular matching rule and a positive sample corresponding to the negative sample, and constructing and obtaining a correction strategy of each target recognition model based on the regular matching rule and the replacement rule.
Further, the updating of the matching content based on the positive sample matching result of the initial matching content to obtain a regular matching rule of each repeated negative sample in the negative sample set includes:
for each repeated negative sample, matching the initial matching content against the positive sample set to obtain a positive sample matching result;
if the positive sample matching result is non-empty, adding context content to the initial matching content to obtain updated matching content;
if the positive sample matching result of the updated matching content is still non-empty, continuing to add context content to the updated matching content until its positive sample matching result is empty, and generating a regular matching rule based on the updated matching content.
Further, the generating a regular matching rule based on the updated matching content includes:
masking the object to be corrected in the updated matching content to obtain a masking result;
and extracting the regular rule based on the mask result to obtain a regular matching rule.
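The masking step can be illustrated with a minimal sketch. The function name `mask_to_regex`, the choice to generalize the masked position to `\D` and each context digit to `\d`, and the sample string are all assumptions made for illustration; the text above does not specify how the regular rule is extracted from the mask result.

```python
import re

def mask_to_regex(matching_content: str, target_index: int) -> str:
    # Hypothetical generalization, not taken from the source:
    # - the masked (misrecognized) position becomes \D, any non-digit;
    # - each digit in the surrounding context becomes \d;
    # - every other context character is kept as an escaped literal.
    parts = []
    for i, ch in enumerate(matching_content):
        if i == target_index:
            parts.append(r"\D")
        elif ch.isdigit():
            parts.append(r"\d")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

# "0QV": the context digit, the misrecognized character (masked), then "V".
rule = mask_to_regex("0QV", target_index=1)
```

Under these assumptions the updated matching content "0QV" generalizes to a "digit + non-digit + V" pattern, so the same rule also covers cases where the masked character was misread as something other than Q.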
Further, the correction strategy is in the form of a regular matching list, and before the target correction strategy matched with the target recognition model information is identified from the correction strategies, the method further comprises:
1) For each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule of the first repeated negative sample;
3) Adding the first regular matching rule and the first replacement rule at the end of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of the repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule of the second repeated negative sample;
6) Adding the second regular matching rule and the second replacement rule at the end of the updated initial regular matching list to obtain a re-updated initial regular matching list;
7) Correcting the corrected negative sample set based on the re-updated initial regular matching list;
8) Repeating the steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set is equal to zero, and determining the last updated result of the initial regular matching list as a correction strategy.
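Steps 1) to 8) above can be sketched as a loop. This is a simplified illustration under stated assumptions: each negative sample is represented as a (recognized text, correct text) pair, a sample counts as "repeated" when the same pair occurs more than once, and `build_rule` is a hypothetical stand-in that matches the wrong text literally rather than deriving a generalized regular expression.

```python
import re
from collections import Counter

def build_rule(wrong: str, right: str):
    # Hypothetical rule builder: literal match of the wrong text,
    # replaced by the corresponding correct text.
    return re.escape(wrong), right

def build_match_list(negatives):
    """negatives: list of (recognized_text, correct_text) pairs that differ."""
    rules = []                    # the initial regular matching list
    remaining = list(negatives)   # the (corrected) negative sample set
    while True:
        counts = Counter(remaining)
        repeated = [pair for pair, n in counts.items() if n > 1]
        if not repeated:
            # No repeated negative samples are left: the list is the strategy.
            return rules
        wrong, right = repeated[0]
        pattern, replacement = build_rule(wrong, right)
        rules.append((pattern, replacement))  # add at the end of the list
        # Correct the negative set with the new rule; fixed samples drop out.
        corrected = []
        for recognized, truth in remaining:
            fixed = re.sub(pattern, lambda m: replacement, recognized)
            if fixed != truth:
                corrected.append((fixed, truth))
        remaining = corrected

negatives = [
    ("2000QV", "2000国V"), ("2000QV", "2000国V"),
    ("CB18352", "GB18352"), ("CB18352", "GB18352"),
    ("A1", "B1"),  # occurs only once, so no rule is built for it
]
rules = build_match_list(negatives)
```

Because every round removes at least the samples the new rule was built from, the loop terminates. Errors that occur only once never produce a rule, which is the behavior the text attributes to using repeated negative samples.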
Further, the target correction policy includes a target regular matching rule and a target replacement rule, and the correcting the initial recognition result based on the target correction policy includes:
identifying a character to be corrected from the initial identification result based on the target regular matching rule;
and correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
Further, before acquiring the initial recognition result to be corrected and the information about the target recognition model that output the initial recognition result, the method further includes:
acquiring a text image to be recognized and a business category of the text image to be recognized;
determining, from a recognition model mapping set, a target recognition model matched with the business category, wherein the recognition model mapping set comprises mappings between different business categories and the identifiers of different recognition models;
and recognizing the text image to be recognized based on the target recognition model to obtain the initial recognition result.
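The routing from business category to recognition model can be sketched as two lookups. All names here (`CATEGORY_TO_MODEL_ID`, `MODEL_REGISTRY`, the category strings and the stand-in recognizer functions) are hypothetical placeholders; a real system would call an actual OCR model at that point.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for real OCR models.
def recognize_certificate(image: bytes) -> str:
    return "certificate text"

def recognize_license(image: bytes) -> str:
    return "license text"

# Mapping set: business category -> recognition model identifier.
CATEGORY_TO_MODEL_ID: Dict[str, str] = {
    "vehicle_certificate": "model_A",
    "driver_license": "model_B",
}

# Recognition model identifier -> callable model.
MODEL_REGISTRY: Dict[str, Callable[[bytes], str]] = {
    "model_A": recognize_certificate,
    "model_B": recognize_license,
}

def recognize(image: bytes, business_category: str) -> Tuple[str, str]:
    """Return the initial recognition result and the id of the model that produced it."""
    model_id = CATEGORY_TO_MODEL_ID[business_category]
    text = MODEL_REGISTRY[model_id](image)
    return text, model_id

initial_result, model_id = recognize(b"", "vehicle_certificate")
```

Returning the model identifier together with the text keeps the information that the later correction step needs in order to select a strategy.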
According to another aspect of the present invention, there is provided an OCR recognition result correction apparatus comprising:
the acquisition module is used for acquiring an initial recognition result to be corrected and information about the target recognition model that output the initial recognition result;
the matching module is used for identifying a target correction strategy matched with the target recognition model information from correction strategies, the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and the correction module is used for correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
Further, the apparatus further comprises:
the acquisition module is further used for acquiring, for each recognition model, the historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set based on the historical recognition results;
the updating module is used for, for each repeated negative sample in the negative sample set, extracting initial matching content from the repeated negative sample, and updating the matching content based on the positive sample matching result of the initial matching content, to obtain a regular matching rule for each repeated negative sample in the negative sample set;
and the determining module is used for determining a replacement rule corresponding to the regular matching rule based on a negative sample corresponding to the regular matching rule and a positive sample corresponding to the negative sample, and constructing a correction strategy of each target recognition model based on the regular matching rule and the replacement rule.
Further, the updating module includes:
the matching unit is used for matching, for each repeated negative sample, the initial matching content against the positive sample set to obtain a positive sample matching result;
the first updating unit is used for adding context content to the initial matching content if the positive sample matching result is non-empty, so as to obtain updated matching content;
and the second updating unit is used for continuing to add context content to the updated matching content if its positive sample matching result is still non-empty, until the positive sample matching result of the updated matching content is empty, and generating a regular matching rule based on the updated matching content.
Further, in a specific application scenario, the second updating unit is specifically configured to mask the object to be corrected in the updated matching content to obtain a mask result,
and to extract the regular rule based on the mask result to obtain a regular matching rule.
Further, the apparatus further comprises a regular matching list construction module.
The regular matching list construction module is used for: 1) for each recognition model, acquiring the historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set based on the historical recognition results;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule of the first repeated negative sample;
3) Adding the first regular matching rule and the first replacement rule at the end of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of the repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule of the second repeated negative sample;
6) Adding the second regular matching rule and the second replacement rule at the end of the updated initial regular matching list to obtain a re-updated initial regular matching list;
7) Correcting the corrected negative sample set based on the re-updated initial regular matching list;
8) Repeating the steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set is equal to zero, and determining the last updated result of the initial regular matching list as a correction strategy.
Further, the correction module includes:
the identification unit is used for identifying characters to be corrected from the initial identification result based on the target regular matching rule;
and the correction unit is used for correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
Further, the apparatus further comprises:
the acquisition module is also used for acquiring the text image to be identified and the business category of the text image to be identified;
the first recognition module is used for determining, from a recognition model mapping set, a target recognition model matched with the business category, wherein the recognition model mapping set comprises mappings between different business categories and the identifiers of different recognition models;
and the second recognition module is used for recognizing the text image to be recognized based on the target recognition model to obtain an initial recognition result.
According to still another aspect of the present invention, there is provided a storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the above-described OCR recognition result correction method.
According to still another aspect of the present invention, there is provided a computer apparatus including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the OCR recognition result correction method.
By means of the technical scheme, the technical scheme provided by the embodiment of the invention has at least the following advantages:
The invention provides a method and device for correcting an OCR recognition result. First, an initial recognition result to be corrected and information about the target recognition model that output the initial recognition result are acquired; a target correction strategy matched with the target recognition model information is identified from the correction strategies, wherein the target correction strategy is constructed based on historical recognition results of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule; and the initial recognition result is corrected based on the target correction strategy to obtain a correction result of the initial recognition result. Compared with the prior art, the embodiments of the invention correct the recognition results of different recognition models with their corresponding correction strategies. Because each correction strategy is generated from the historical data of the corresponding recognition model, the recognition errors of that model can be corrected more comprehensively and accurately. Because the correction strategy includes regular matching rules, misrecognized characters can be located more precisely, which avoids the situation in which simple character replacement mistakenly modifies correct characters, and thereby effectively improves the correction accuracy of OCR recognition results.
The foregoing description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 shows a flowchart of a method for correcting OCR recognition results according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for correcting OCR recognition results according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the constitution of a correction device for OCR recognition results according to an embodiment of the present invention;
FIG. 4 shows a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a method for correcting an OCR recognition result which, as shown in FIG. 1, comprises the following steps:
101. Acquiring an initial recognition result to be corrected and information about the target recognition model that output the initial recognition result.
In the embodiment of the invention, the initial recognition result to be corrected is the result of recognizing a paper text or image text with an OCR target recognition model. The OCR target recognition model may be an optical character recognition device or a model used in an optical character recognition engine; the model may be an existing open-source model, such as PaddleOCR or EasyOCR, or an improved model based on an existing model, which is not specifically limited in the embodiment of the invention. The paper text or image text to be recognized may come from any business field, for example driver's licenses, vehicle licenses and vehicle qualification certificates in automobile financing and leasing, or text from other fields such as insurance and medical care, which is likewise not specifically limited. To determine which target recognition model produced the initial recognition result to be corrected, information about that model needs to be acquired; this information may be the name of the target recognition model or its unique identifier, such as a code or number configured for the model in advance.
102. Identifying, from the correction strategies, a target correction strategy matched with the target recognition model information.
In the embodiment of the invention, to improve correction accuracy, the correction strategy corresponding to a target recognition model is constructed from the historical recognition results of that model. That is, correction strategies are built in advance and in a targeted manner from the recognition errors that each target recognition model repeatedly makes, so that each target recognition model has a correction strategy matched to the characteristics of its recognition results. For example, when vehicle qualification certificates of leased vehicles are recognized in an automobile finance leasing scenario, the errors in the recognition results of model A include recognizing Chinese characters as other symbols, while the errors of model B include recognizing "GB" as "CB". Matching a dedicated strategy to the error profile of each model makes locating and correcting the content to be corrected more accurate. It should be noted that if several target recognition models have highly similar error types in their recognition results, those models may share the same correction strategy, which is not specifically limited in the embodiment of the invention. The correction strategy comprises regular matching rules and the replacement rules corresponding to them. A regular matching rule is a rule expressed as a regular expression and is used to find, in the initial recognition result, the content to be corrected that conforms to the logic of that regular expression; each regular matching rule corresponds to a replacement rule, which is used to correct the content found into the correct content.
It should be noted that different correction strategies are matched to different target recognition models. Because each pair of regular matching rule and replacement rule is constructed from the historical recognition results of the target recognition model that produced the initial recognition result, the regular matching rule can locate misrecognized content in the initial recognition result more accurately. The objects being corrected therefore match the actual errors in that model's recognition results more closely, errors are found more accurately and comprehensively, and the correction accuracy of OCR recognition results is effectively improved.
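Selecting a strategy by model information, including the case where several models share one strategy, might look like the following sketch. The registry name `STRATEGIES`, the model identifiers, the lookahead in the "CB" rule and the empty-list fallback for unknown models are assumptions for illustration only.

```python
# Each strategy is a list of (regular matching rule, replacement rule) pairs.
# Hypothetical rules: model A misreads the character before "V",
# models B and C both misread "GB" as "CB" and therefore share one strategy.
STRATEGY_A = [(r"\d\DV", lambda m: m[0][0] + "国V")]
SHARED_STRATEGY_BC = [(r"CB(?=\d)", "GB")]

STRATEGIES = {
    "model_A": STRATEGY_A,
    "model_B": SHARED_STRATEGY_BC,
    "model_C": SHARED_STRATEGY_BC,
}

def select_strategy(model_id: str):
    """Return the correction strategy for a model id, or no rules if unknown."""
    return STRATEGIES.get(model_id, [])

target_strategy = select_strategy("model_B")
```

Keying the registry on the model name or identifier mirrors the model information acquired in step 101.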
103. Correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
In the embodiment of the invention, the target correction strategy comprises multiple pairs of regular matching rules and replacement rules, and these pairs may or may not have a sequential relationship. Where a sequential relationship exists, for example, each pair of regular matching rule and replacement rule is one item in a list that manages the regular matching, and the items are arranged in order. All content of the initial recognition result is then searched pair by pair in that order, and the erroneous content found by each regular matching rule is replaced according to its corresponding replacement rule, which completes the correction. Where no sequential relationship exists, the pairs may be drawn in random order to search all content of the initial recognition result, and the erroneous content found is replaced according to the replacement rule corresponding to the regular matching rule, until every pair has completed its search and replacement. The result obtained is the correction result of the initial recognition result.
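The ordered case can be sketched with Python's `re.sub`, applying each pair in list order to the whole text. The function name `correct`, the two sample rules and the input string are illustrative assumptions, not rules taken from the source.

```python
import re

def correct(text: str, rule_pairs) -> str:
    """Apply each (regular matching rule, replacement rule) pair in list order."""
    for pattern, replacement in rule_pairs:
        # replacement may be a plain string or a callable taking the match.
        text = re.sub(pattern, replacement, text)
    return text

rule_pairs = [
    # digit + non-digit + V: keep the digit, rewrite the rest as 国V
    (r"\d\DV", lambda m: m[0][0] + "国V"),
    # "CB" directly before a digit is assumed to be a misread "GB"
    (r"CB(?=\d)", "GB"),
]

result = correct("2000QV CB18352", rule_pairs)
```

For the unordered case the same function could be called on a shuffled copy of the list, which is only safe when no rule depends on the output of another.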
It should be noted that searching for erroneous content in the initial recognition result with regular matching rules is matching based on textual patterns, not simple character comparison. For example, in the recognition results of vehicle qualification certificates of finance-leased vehicles, the character 国 ("Guo") is often recognized as another character such as Q or R. If every Q or R were simply replaced with 国, correctly recognized Q and R characters would also be turned into 国, causing miscorrection. A regular matching rule, however, can add a textual pattern around the character: because 国 in a vehicle qualification certificate is always preceded by numeric data such as a year and followed by the character V, the initial recognition result is searched with the regular matching rule "digit + non-digit + V", which avoids matching positions that are not 国. Searching for erroneous content with regular matching rules avoids locating the wrong content and improves correction accuracy. In addition, because a regular expression captures a regular, logical textual pattern, the search rules for content to be corrected can be generalized into classes, which greatly reduces the number of search rules and searches and improves the efficiency of correcting recognition results.
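The difference between simple character replacement and a context-aware regular matching rule can be seen on a small made-up string. The sample text is invented for illustration; the pattern follows the "digit + non-digit + V" rule described above.

```python
import re

# Made-up recognition result: the first Q is legitimate, the second is a misread 国.
text = "LQR2000 2000QV"

# Simple character replacement rewrites every Q, including the correct one.
naive = text.replace("Q", "国")

# Context-aware rule: only a non-digit between a digit and "V" is rewritten.
contextual = re.sub(r"\d\DV", lambda m: m[0][0] + "国V", text)
```

Here `naive` corrupts the legitimate "LQR" while `contextual` changes only the character in the "digit + non-digit + V" position.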
For further explanation and limitation, as shown in FIG. 2, before the target correction strategy matched with the target recognition model information is identified from the correction strategies, the method further includes:
201. For each recognition model, acquiring the historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set based on the historical recognition results.
202. For each repeated negative sample in the negative sample set, extracting initial matching content from the repeated negative sample, and updating the matching content based on the positive sample matching result of the initial matching content, to obtain a regular matching rule for each repeated negative sample in the negative sample set.
203. Determining a replacement rule corresponding to the regular matching rule based on a negative sample corresponding to the regular matching rule and a positive sample corresponding to the negative sample, and constructing a correction strategy of each target recognition model based on the regular matching rule and the replacement rule.
In the embodiment of the invention, in order to obtain an accurate correction strategy, historical recognition results of each target recognition model are respectively collected, negative samples containing error contents and positive samples are recognized from the historical recognition results, and the positive samples comprise correct contents in the historical recognition results and correct content samples corresponding to the error contents in the negative samples. The text to be identified corresponding to the history identification result may be the same type of text, for example, a vehicle qualification certificate of 50 vehicles, or may be different types of text, for example, 20 vehicle qualification certificates, 20 driving certificates, and 20 driving certificates, which is not particularly limited in the embodiment of the present invention. The initial matching content may be the misrecognized character itself or may be a word or sentence containing the misrecognized character, and the embodiment of the present invention is not limited specifically. In order to avoid the overfitting of the correction strategy, repeated negative samples in the negative samples are selected, namely repeated misidentification characters are used for generating the regular matching rules. After the regular matching rule is obtained, a replacement rule is generated based on a positive sample corresponding to the negative sample searched by the regular matching rule, wherein the replacement rule can be a positive sample, for example, the regular matching rule is "national VGB", the replacement rule is "national V, GB", and the replacement rule can also be a rule containing the positive sample, for example, the regular matching rule is: "number+non-number+V", denoted "\d\DV", substitution rules are: "Guo V" is replaced with "non-number +V" denoted as lambda x: "".join ([ x [0] [0], "Guo V" ]) ], where D is a number and D is a non-number.
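Building the sample sets and picking out repeated negative samples can be sketched as follows. The sketch assumes each historical result is available as a (recognized text, correct text) pair; how the correct text is obtained (for example by manual review) is not stated in the text above, and the function names and sample data are hypothetical.

```python
from collections import Counter

def split_samples(history):
    """history: list of (recognized_text, correct_text) pairs."""
    positives, negatives = [], []
    for recognized, truth in history:
        # The correct text is a positive sample either way: it is the
        # recognized content itself, or the counterpart of a negative sample.
        positives.append(truth)
        if recognized != truth:
            negatives.append((recognized, truth))
    return positives, negatives

def repeated_negatives(negatives):
    """Keep only errors seen more than once, to avoid fitting one-off mistakes."""
    counts = Counter(negatives)
    return [pair for pair, n in counts.items() if n > 1]

history = [
    ("2000国V", "2000国V"),   # recognized correctly
    ("2000QV", "2000国V"),    # same error twice: a repeated negative sample
    ("2000QV", "2000国V"),
    ("2O23", "2023"),         # one-off error, ignored for rule building
]
positives, negatives = split_samples(history)
repeated = repeated_negatives(negatives)
```

Only the pairs in `repeated` would go on to rule construction.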
In an embodiment of the present invention, for further explanation and limitation, the step of updating the matching content based on the positive sample matching result of the initial matching content to obtain a regular matching rule of each repeated negative sample in the negative sample set includes:
matching the initial matching content with the positive sample set aiming at each repeated negative sample to obtain a positive sample matching result;
if the positive sample matching result is non-empty, adding context content to the initial matching content to obtain updated initial matching content;
if the positive sample matching result of the updated initial matching content is non-null, continuing to add context content to the updated initial matching content until the positive sample matching result of the updated matching content is null, and generating a regular matching rule based on the updated matching content.
In the embodiment of the invention, the error samples in the negative sample set are searched to find the repeated error samples, and one of the repeated error samples is randomly extracted as the current object for which a regular matching rule and a replacement rule are to be constructed; the constructed result is written into the initial regular matching list. For each error sample, the positive sample set is searched with the initial matching content. If the search result is non-empty, that is, content is matched in the positive sample set, then correcting recognition results on the basis of this initial matching content could turn correct content into erroneous content, so the initial matching content needs further adjustment. Specifically, context information is added to the initial matching content. For example, if the initial matching content "Guo V" matches correct samples in the positive sample set, the preceding text "2000" is extracted from the sentence in which the error sample appears, these digit characters are placed in front of the initial matching content, and the positive sample set is searched again with "2000Guo V". If a correct sample can still be matched, further context from the sentence in which the error sample appears is added to "2000Guo V", until the adjusted initial matching content no longer matches any correct sample in the positive sample set. The adjusted initial matching content (the updated matching content) can then be used to generate the regular matching rule of the current error sample.
It should be noted that adjusting the matching content used to search for misrecognized content according to the positive sample set matching result greatly reduces the probability that correct content in a recognition result is modified, and at the same time greatly improves the efficiency of constructing the correction strategy. In addition, adjusting the matching content by adding context improves the accuracy with which misrecognized content is located, and therefore the accuracy of the correction of the recognition result.
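The context-expansion loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: matching is taken to be plain substring containment, context is added one character at a time (to the left first, then to the right), whereas the text adds a whole token such as "2000", and the function name `expand_until_unambiguous` is hypothetical.

```python
from typing import Sequence


def expand_until_unambiguous(sentence: str, start: int, end: int,
                             positive_samples: Sequence[str]) -> str:
    """Grow sentence[start:end] until no positive sample contains it.

    sentence is the text in which the error sample appears, and
    sentence[start:end] is the initial matching content.
    """
    def matches_positive(candidate: str) -> bool:
        # Assumption: a "match" is simple substring containment.
        return any(candidate in sample for sample in positive_samples)

    while matches_positive(sentence[start:end]):
        if start > 0:
            start -= 1          # add one character of left context
        elif end < len(sentence):
            end += 1            # then add right context
        else:
            break               # context exhausted, still ambiguous
    return sentence[start:end]
```

The returned string plays the role of the updated matching content from which a regular matching rule is then generated.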
In one embodiment of the present invention, for further explanation and limitation, the generating the regular matching rule based on the updated matching content includes:
masking the object to be corrected in the updated matching content to obtain a masking result;
and extracting the regular rule based on the mask result to obtain a regular matching rule.
In the embodiment of the invention, in order to extract a search rule for misrecognized content that can be applied generally to OCR recognition results, once updated matching content whose positive sample set matching result is empty has been obtained, the misrecognized character in that content is masked, and a regular expression is extracted from the masking result with a regular expression extraction tool, yielding a regular rule that reflects how the current misrecognized character occurs in the text. For example, in "2000@V", where "@" is the misrecognized form of the character "Guo", "@" is replaced with [MARK] to obtain "2000[MARK]V", and a regular expression is extracted from "2000[MARK]V" to obtain "\d\DV", that is, the combination "digit + non-digit + V". Whatever character "Guo" is misrecognized as, any character that satisfies the textual pattern in which "Guo" appears is identified as a character to be corrected. The regular expression extraction tool may be a Jmeter extractor or any other tool capable of extracting regular expressions, which is not specifically limited in the embodiment of the present invention.
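The text leaves the extraction tool open. Purely as an illustration, the Python sketch below uses a hand-written generalizer instead of any particular tool: the mask becomes `\D`, each run of digits is collapsed to a single `\d`, and every other character is kept as an escaped literal. These generalization choices, and the names `mask` and `extract_pattern`, are assumptions made so that the example from the text is reproduced.

```python
import re

MARK = "[MARK]"


def mask(content: str, start: int, end: int) -> str:
    """Replace the misrecognized span content[start:end] with the mask token."""
    return content[:start] + MARK + content[end:]


def extract_pattern(masked: str) -> str:
    """Turn a masked string into a regular expression (hypothetical generalizer)."""
    parts = []
    i = 0
    while i < len(masked):
        if masked.startswith(MARK, i):
            parts.append(r"\D")              # the masked character: any non-digit
            i += len(MARK)
        elif masked[i].isdecimal():
            if not parts or parts[-1] != r"\d":
                parts.append(r"\d")          # collapse a digit run to one \d
            i += 1
        else:
            parts.append(re.escape(masked[i]))  # keep other characters literally
            i += 1
    return "".join(parts)
```

With these assumptions, masking "@" in "2000@V" and extracting a pattern gives the same "\d\DV" rule as in the example above.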
In an embodiment of the present invention, for further explanation and limitation, the correction strategy is in the form of a regular matching list, and before the target correction strategy matching the target recognition model information is identified from the correction strategies, the method further comprises:
1) For each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule of the first repeated negative sample;
3) Adding the first regular matching rule and the first replacement rule to the last bit of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of the repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule of the second repeated negative sample;
6) Adding the second regular matching rule and the second replacement rule to the last bit of the initial regular matching list after the initial updating to obtain an initial regular matching list after the updating again;
7) Correcting the corrected negative sample set based on the updated initial regular matching list;
8) Repeating the steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set is equal to zero, and determining the last updated result of the initial regular matching list as a correction strategy.
In the embodiment of the invention, the correction strategy takes the form of a regular matching list, pattern_reply_pattern_list, which manages the regular matching patterns. The first element of each list item is a regular matching rule, pattern, and the second element is the corresponding replacement rule, reply. The groups of regular matching rule and replacement rule in this list have an execution order. During construction of the regular matching list, the error samples in the negative sample set are first corrected with the list items already constructed, that is, with the existing groups of regular matching rule and replacement rule, and new regular matching rules and replacement rules are then constructed from the corrected negative sample set. This preserves the dependency between successive rule pairs and avoids overlapping regular matching rules, which would cause the correction process to overfit, so the correction accuracy is further ensured.
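Steps 1) to 8) can be sketched as the loop below. It is a simplified illustration with several assumptions: each negative sample is represented as a (misrecognized text, correct text) pair, the first repeated sample is taken instead of a random one so that the result is deterministic, rules are built by a hypothetical `literal_rule` helper that matches the erroneous text literally, and a `max_rounds` guard is added to guarantee termination.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple

Rule = Tuple[str, Callable[[re.Match], str]]
Sample = Tuple[str, str]  # (misrecognized text, correct text)


def apply_rules(text: str, rules: List[Rule]) -> str:
    """Apply every (pattern, reply) item in list order."""
    for pattern, reply in rules:
        text = re.sub(pattern, reply, text)
    return text


def literal_rule(wrong: str, right: str) -> Rule:
    """Hypothetical rule builder: match the wrong text literally."""
    return (re.escape(wrong), lambda _match, right=right: right)


def build_rule_list(negatives: List[Sample],
                    make_rule: Callable[[str, str], Rule] = literal_rule,
                    max_rounds: int = 1000) -> List[Rule]:
    rules: List[Rule] = []
    remaining = list(negatives)
    for _ in range(max_rounds):
        counts = Counter(wrong for wrong, _right in remaining)
        repeated = [s for s in remaining if counts[s[0]] > 1]
        if not repeated:
            break                                   # step 8: nothing repeated is left
        rules.append(make_rule(*repeated[0]))       # steps 2/5 and 3/6
        corrected = [(apply_rules(w, rules), r) for w, r in remaining]
        remaining = [(w, r) for w, r in corrected if w != r]  # steps 4/7
    return rules
```

Each new rule is built only from samples that the earlier rules failed to fix, which is what gives the list its execution order.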
In an embodiment of the present invention, for further explanation and limitation, the step of correcting the initial recognition result based on the target correction strategy includes:
identifying a character to be corrected from the initial identification result based on the target regular matching rule;
and correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
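The two steps above can be sketched in Python as a single traversal of the rule pairs, either in list order or, when no order is defined between the rules, in random order. The function name `correct_result`, the `ordered` flag and the sample rules are hypothetical; the two rule pairs reuse the "\d\DV" and "Guo VGB" examples given earlier in the description.

```python
import random
import re
from typing import Optional


def correct_result(text: str, rules, ordered: bool = True,
                   rng: Optional[random.Random] = None) -> str:
    """Apply each (matching rule, replacement rule) pair exactly once."""
    sequence = list(rules)
    if not ordered:
        # No sequence relation: visit the rules in random order.
        (rng or random).shuffle(sequence)
    for pattern, reply in sequence:
        # re.sub locates the characters to be corrected and replaces them;
        # the next rule only runs after this correction is complete.
        text = re.sub(pattern, reply, text)
    return text


# Hypothetical ordered strategy: the second rule depends on the first.
RULES = [
    (r"\d\DV", lambda m: m[0][0] + "Guo V"),
    (r"Guo VGB", "Guo V, GB"),
]
```

In this example the second rule can only fire after the first has restored "Guo V", which illustrates why an ordered list has to be executed in sequence.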
In the embodiment of the invention, the target correction strategy comprises target regular matching rules and target replacement rules. When no sequence relation exists between the target regular matching rules, one rule can be randomly extracted from them and matched against the text of the initial recognition result; if a character to be corrected is matched, it is corrected with the target replacement rule corresponding to that target regular matching rule, and another rule is then extracted from the remaining target regular matching rules and matched against the text, until all target regular matching rules have been traversed once. When a sequence relation exists, the target regular matching rules are matched against the text of the initial recognition result in sequence, and the next regular matching rule is executed only after the characters matched by the previous target regular matching rule have been corrected, so that the correction order is ensured.
In an embodiment of the present invention, for further explanation and limitation, before the step of obtaining the initial recognition result to be corrected and outputting the target recognition model information of the initial recognition result, the method further includes:
acquiring a text image to be identified and the service category of the text image to be identified;
identifying a target identification model matched with the service category from the identification model mapping relation set;
and identifying the text image to be identified based on the target identification model to obtain an initial identification result.
In the embodiment of the invention, the recognition model mapping relation set comprises pre-established mapping relations between different service categories and the identifiers of different recognition models. Each recognition model is obtained by training on text image samples of the service category to which it is mapped. Different recognition models have different strengths: for example, some are good at recognizing certificate and table text images, while others are good at recognizing plain text images. A model that is good at certificates can therefore be used for certificate text images and a model that is good at plain text for plain text images, which improves the accuracy of the initial recognition results. Because each recognition model corresponds to a different service category, the service categories of the historical recognition results also differ. The samples in the positive and negative sample sets constructed from the historical recognition results of a given model therefore match the service category encountered during actual correction more closely, the correction strategy constructed from those sample sets is more targeted, and its correction accuracy is higher.
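A minimal sketch of such a mapping relation set is shown below, assuming it can be represented as a dictionary from service category to model identifier. The category names, model identifiers and the `recognize` helper are hypothetical, and the recognizers are stand-in callables, not real OCR models.

```python
from typing import Callable, Dict, Tuple

Recognizer = Callable[[bytes], str]

# Hypothetical mapping relation set: service category -> model identifier.
MODEL_BY_CATEGORY: Dict[str, str] = {
    "certificate": "cert_model",
    "plain_text": "text_model",
}


def recognize(image: bytes, category: str,
              model_by_category: Dict[str, str],
              models: Dict[str, Recognizer]) -> Tuple[str, str]:
    """Return the initial recognition result and the identifier of the model used."""
    model_id = model_by_category[category]   # look up the target recognition model
    initial_result = models[model_id](image)  # run that model on the image
    return initial_result, model_id
```

Returning the model identifier together with the text lets the later correction step select the strategy that was built from that model's own history.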
The invention provides a method for correcting OCR recognition results, in which, first, an initial recognition result to be corrected and the target recognition model information of the initial recognition result are acquired; a target correction strategy matched with the target recognition model information is identified from correction strategies, wherein the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule; and the initial recognition result is corrected based on the target correction strategy to obtain a correction result of the initial recognition result. Compared with the prior art, the embodiment of the invention corrects the recognition results of different recognition models with their corresponding correction strategies. Because each correction strategy is generated from the historical data of the corresponding recognition model, the recognition errors of that model can be corrected more comprehensively and accurately. Because the correction strategy contains regular matching rules, misrecognized characters can be located more accurately, which avoids correct characters being modified by mistake, as can happen with simple character replacement, and thus effectively improves the correction accuracy of OCR recognition results.
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a device for correcting an OCR recognition result, as shown in fig. 3, where the device includes:
the acquiring module 31 is configured to acquire an initial recognition result to be corrected, and output target recognition model information of the initial recognition result;
a matching module 32, configured to identify a target correction strategy that matches the target recognition model information from correction strategies, where the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy includes a regular matching rule and a replacement rule corresponding to the regular matching rule;
and the correction module 33 is configured to correct the initial recognition result based on the target correction policy, so as to obtain a corrected result of the initial recognition result.
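One way to picture how the three modules cooperate is the small Python class below. It is only a sketch: the class name `Corrector`, the dictionary keyed by model identifier, and the choice to return the text unchanged when no strategy exists for a model are all assumptions that the text does not specify.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Corrector:
    # Correction strategies keyed by recognition model identifier;
    # each strategy is an ordered list of (matching rule, replacement rule).
    strategies: Dict[str, List[Tuple[str, object]]] = field(default_factory=dict)

    def match(self, model_id: str) -> List[Tuple[str, object]]:
        """Matching module: select the target correction strategy."""
        return self.strategies.get(model_id, [])

    def correct(self, initial_result: str, model_id: str) -> str:
        """Correction module: apply the target strategy to the acquired result."""
        text = initial_result
        for pattern, reply in self.match(model_id):
            text = re.sub(pattern, reply, text)
        return text
```

For instance, a strategy registered only for a hypothetical "cert_model" would correct results produced by that model and leave results from any other model untouched.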
Further, the apparatus further comprises:
the acquiring module 31 is further configured to acquire, for each recognition model, a historical recognition result of the recognition model, and to construct a positive sample set and a negative sample set respectively based on the historical recognition result;
the updating module is used for extracting initial matching content from the repeated negative samples aiming at each repeated negative sample in the negative sample set, and updating the matching content based on positive sample matching results of the initial matching content to obtain regular matching rules of each repeated negative sample in the negative sample set;
and the determining module is used for determining a replacement rule corresponding to the regular matching rule based on the negative sample corresponding to the regular matching rule and the positive sample corresponding to that negative sample, and constructing the correction strategy of each target recognition model based on the regular matching rule and the replacement rule.
Further, the updating module includes:
the matching unit is used for matching the initial matching content with the positive sample set aiming at each repeated negative sample to obtain a positive sample matching result;
the first updating unit is used for adding context content to the initial matching content if the positive sample matching result is non-empty, so as to obtain updated initial matching content;
and the second updating unit is used for continuously adding context content to the updated initial matching content if the positive sample matching result of the updated initial matching content is non-null until the positive sample matching result of the updated matching content is null, and generating a regular matching rule based on the updated matching content.
Further, in a specific application scenario, the second updating unit is specifically configured to mask the object to be corrected in the updated matching content to obtain a masking result,
and to extract the regular rule based on the masking result to obtain a regular matching rule.
Further, the apparatus further comprises a regular matching list construction module, which is used for:
1) For each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule of the first repeated negative sample;
3) Adding the first regular matching rule and the first replacement rule to the last bit of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of the repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule of the second repeated negative sample;
6) Adding the second regular matching rule and the second replacement rule to the last bit of the initial regular matching list after the initial updating to obtain an initial regular matching list after the updating again;
7) Correcting the corrected negative sample set based on the updated initial regular matching list;
8) Repeating the steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set is equal to zero, and determining the last updated result of the initial regular matching list as a correction strategy.
Further, the correction module includes:
the identification unit is used for identifying characters to be corrected from the initial identification result based on the target regular matching rule;
and the correction unit is used for correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
Further, the apparatus further comprises:
the acquiring module 31 is further configured to acquire a text image to be identified, and a service class of the text image to be identified;
the first recognition module is used for recognizing a target recognition model matched with the service category from a recognition model mapping relation set, wherein the recognition model mapping relation set comprises mapping relations between different service categories and recognition identifications of different recognition models;
and the second recognition module is used for recognizing the text image to be recognized based on the target recognition model to obtain an initial recognition result.
The invention provides a device for correcting OCR recognition results, in which, first, an initial recognition result to be corrected and the target recognition model information of the initial recognition result are acquired; a target correction strategy matched with the target recognition model information is identified from correction strategies, wherein the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule; and the initial recognition result is corrected based on the target correction strategy to obtain a correction result of the initial recognition result. Compared with the prior art, the embodiment of the invention corrects the recognition results of different recognition models with their corresponding correction strategies. Because each correction strategy is generated from the historical data of the corresponding recognition model, the recognition errors of that model can be corrected more comprehensively and accurately. Because the correction strategy contains regular matching rules, misrecognized characters can be located more accurately, which avoids correct characters being modified by mistake, as can happen with simple character replacement, and thus effectively improves the correction accuracy of OCR recognition results.
According to an embodiment of the present invention, there is provided a storage medium storing at least one executable instruction for performing the method for correcting the OCR recognition result in any of the above-described method embodiments.
Based on the method, the device and the storage medium for correcting OCR recognition results described above, a schematic structure of a computer device according to one embodiment of the invention is shown; the specific embodiments of the invention do not limit the specific implementation of the computer device.
As shown in fig. 4, the computer device may include: a processor 402, a communication interface (Communications Interface) 404, a memory 406, and a communication bus 408.
Wherein: processor 402, communication interface 404, and memory 406 communicate with each other via communication bus 408.
A communication interface 404 for communicating with network elements of other devices, such as clients or other servers.
The processor 402 is configured to execute the program 410, and may specifically perform relevant steps in the above-described embodiments of the method for correcting the OCR recognition result.
In particular, program 410 may include program code including computer-operating instructions.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computer device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
Memory 406 is used for storing the program 410. Memory 406 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
Program 410 may be specifically operable to cause processor 402 to:
acquiring an initial recognition result to be corrected and outputting target recognition model information of the initial recognition result;
identifying a target correction strategy matched with the target recognition model information from correction strategies, wherein the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module for implementation. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method for correcting OCR recognition results, comprising:
acquiring an initial recognition result to be corrected and outputting target recognition model information of the initial recognition result;
identifying a target correction strategy matched with the target recognition model information from correction strategies, wherein the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
2. The method of claim 1, wherein prior to identifying a target correction strategy from the correction strategies that matches the target recognition model information, the method further comprises:
for each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
extracting initial matching content from the repeated negative samples aiming at each repeated negative sample in the negative sample set, and updating the matching content based on positive sample matching results of the initial matching content to obtain regular matching rules of each repeated negative sample in the negative sample set;
and determining a replacement rule corresponding to the regular matching rule based on the negative sample corresponding to the regular matching rule and the positive sample corresponding to that negative sample, and constructing the correction strategy of each target recognition model based on the regular matching rule and the replacement rule.
3. The method of claim 2, wherein the updating the matching content based on the positive sample matching result of the initial matching content to obtain a regular matching rule for each repeated negative sample in the negative sample set comprises:
matching the initial matching content with the positive sample set aiming at each repeated negative sample to obtain a positive sample matching result;
if the positive sample matching result is non-empty, adding context content to the initial matching content to obtain updated initial matching content;
if the positive sample matching result of the updated initial matching content is non-null, continuing to add context content to the updated initial matching content until the positive sample matching result of the updated matching content is null, and generating a regular matching rule based on the updated matching content.
4. The method of claim 3, wherein the generating a regular matching rule based on the updated matching content comprises:
Masking the object to be corrected in the updated matching content to obtain a masking result;
and extracting the regular rule based on the mask result to obtain a regular matching rule.
5. The method of claim 1, wherein the correction policy is in the form of a regular matching list, and wherein prior to identifying a target correction policy from the correction policies that matches the target identification model information, the method further comprises:
1) For each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule of the first repeated negative sample;
3) Adding the first regular matching rule and the first replacement rule to the last bit of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of the repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule of the second repeated negative sample;
6) Adding the second regular matching rule and the second replacement rule to the last bit of the initial regular matching list after the initial updating to obtain an initial regular matching list after the updating again;
7) Correcting the corrected negative sample set based on the updated initial regular matching list;
8) Repeating the steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set is equal to zero, and determining the last updated result of the initial regular matching list as a correction strategy.
6. The method of claim 1, wherein the target correction strategy includes a target regular matching rule and a target replacement rule, and wherein correcting the initial recognition result based on the target correction strategy includes:
identifying a character to be corrected from the initial identification result based on the target regular matching rule;
and correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
7. The method according to any one of claims 1 to 6, wherein before the initial recognition result to be corrected is obtained and the target recognition model information of the initial recognition result is output, the method further comprises:
Acquiring a text image to be identified and a business category of the text image to be identified;
identifying a target identification model matched with the service category from an identification model mapping relation set, wherein the identification model mapping relation set comprises mapping relations between different service categories and different identification model identification marks;
and identifying the text image to be identified based on the target identification model to obtain an initial identification result.
8. An OCR recognition result correction apparatus comprising:
the acquisition module is used for acquiring an initial identification result to be corrected and outputting target identification model information of the initial identification result;
the matching module is used for identifying a target correction strategy matched with the target recognition model information from correction strategies, the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and the correction module is used for correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
9. A storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the OCR recognition result correction method of any one of claims 1-7.
10. A computer device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform operations corresponding to the method for correcting the OCR recognition result according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310916950.8A CN116935414A (en) | 2023-07-24 | 2023-07-24 | OCR recognition result correction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310916950.8A CN116935414A (en) | 2023-07-24 | 2023-07-24 | OCR recognition result correction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116935414A (en) | 2023-10-24 |
Family
ID=88385899
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310916950.8A Pending CN116935414A (en) | 2023-07-24 | 2023-07-24 | OCR recognition result correction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116935414A (en) |
-
2023
- 2023-07-24 CN CN202310916950.8A patent/CN116935414A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109829155B (en) | Keyword determination method, automatic scoring method, device, equipment and medium | |
US20100257440A1 (en) | High precision web extraction using site knowledge | |
CN112257613B (en) | Physical examination report information structured extraction method and device and computer equipment | |
CN110929125A (en) | Search recall method, apparatus, device and storage medium thereof | |
CN110704719B (en) | Enterprise search text word segmentation method and device | |
CN112464845B (en) | Bill recognition method, equipment and computer storage medium | |
CN113033185B (en) | Standard text error correction method and device, electronic equipment and storage medium | |
WO2022134580A1 (en) | Method and apparatus for acquiring certificate information, and storage medium and computer device | |
CN111581346A (en) | Event extraction method and device | |
CN113642320A (en) | Method, device, equipment and medium for extracting document directory structure | |
CN111782892B (en) | Similar character recognition method, device, apparatus and storage medium based on prefix tree | |
EP2138959A1 (en) | Word recognizing method and word recognizing program | |
CN113283389A (en) | Handwritten character quality detection method, device, equipment and storage medium | |
CN114677689B (en) | Text image recognition error correction method and electronic equipment | |
CN116935414A (en) | OCR recognition result correction method and device | |
CN114861625A (en) | Method for obtaining target training sample, electronic device and medium | |
CN111985486A (en) | Image information identification method and device, storage medium and computer equipment | |
CN113177543A (en) | Certificate identification method, device, equipment and storage medium | |
CN112989820A (en) | Legal document positioning method, device, equipment and storage medium | |
CN111506756A (en) | Similar picture searching method and system, electronic device and storage medium | |
CN117370583B (en) | Knowledge-graph entity alignment method and system based on generation of countermeasure network | |
CN114328938B (en) | Image report structured extraction method | |
CN115905561B (en) | Body alignment method and device, electronic equipment and storage medium | |
CN113837129B (en) | Method, device, equipment and storage medium for identifying wrongly written characters of handwritten signature | |
CN114694152B (en) | Printed text credibility fusion method and device based on three-source OCR (optical character recognition) result |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||