CN116935414A - OCR recognition result correction method and device - Google Patents

Publication number
CN116935414A
Authority
CN
China
Prior art keywords
initial
matching
target
result
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310916950.8A
Other languages
Chinese (zh)
Inventor
张焱凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Financial Leasing Co Ltd
Original Assignee
Ping An International Financial Leasing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Financial Leasing Co Ltd filed Critical Ping An International Financial Leasing Co Ltd
Priority to CN202310916950.8A priority Critical patent/CN116935414A/en
Publication of CN116935414A publication Critical patent/CN116935414A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a method and a device for correcting an OCR (optical character recognition) result. It relates to the technical fields of text recognition and financial technology, and mainly aims to solve the problem of low accuracy when correcting OCR recognition results. The method mainly comprises: obtaining an initial recognition result to be corrected and information on the target recognition model that output the initial recognition result; identifying, from correction strategies, a target correction strategy matched with the target recognition model information, wherein the target correction strategy is constructed based on historical recognition results of the target recognition model, and the correction strategy comprises a regular matching rule (a rule based on a regular expression) and a replacement rule corresponding to the regular matching rule; and correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result. The method is mainly used for correcting OCR recognition results.

Description

OCR recognition result correction method and device
Technical Field
The invention relates to the technical fields of text recognition and financial technology, and in particular to a method and a device for correcting an OCR recognition result.
Background
With the continuous development of the automobile financial leasing business, optical character recognition (OCR) technology has been widely introduced into this field, for example for verifying driver's licenses and vehicle licenses, verifying vehicle qualification certificates, archiving, and the like. OCR refers to the process of analyzing, recognizing and processing images of text material, especially flat paper documents, to obtain text and layout information, and is currently the mainstream character recognition method. Because OCR recognition is probabilistic, no model is 100% accurate and the recognition result cannot be guaranteed to be completely correct. Recognition accuracy therefore needs to be improved either through image preprocessing or through post-processing of the recognition result.
Existing post-processing of recognition results mainly corrects characters according to manually set character-type conditions, for example replacing English letters recognized within a purely numeric character segment with digits of similar shape. However, when a character segment mixes several character types, as is common in documents with complex character types such as the driver's licenses, vehicle licenses, vehicle qualification certificates and vehicle insurance policies used in the automobile financial leasing business, a replacement rule based on a single character type cannot correct it accurately, so the correction accuracy of OCR recognition results is low.
Disclosure of Invention
In view of this, the invention provides a method and a device for correcting an OCR recognition result, a medium and a computer device, aiming to solve the problem that existing correction of OCR recognition results has low accuracy, especially for documents with complex character types such as driver's licenses, vehicle licenses, vehicle qualification certificates and vehicle insurance policies in the automobile financial leasing business.
According to one aspect of the present invention, there is provided a method for correcting an OCR recognition result, including:
acquiring an initial recognition result to be corrected and information on the target recognition model that output the initial recognition result;
identifying, from correction strategies, a target correction strategy matched with the target recognition model information, wherein the target correction strategy is constructed based on historical recognition results of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
Further, before the target correction strategy matched with the target identification model information is identified from the correction strategies, the method further comprises:
for each recognition model, acquiring historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set based on the historical recognition results;
for each repeated negative sample in the negative sample set, extracting initial matching content from the repeated negative sample, and updating the matching content based on the positive sample matching result of the initial matching content, to obtain a regular matching rule for each repeated negative sample in the negative sample set;
and determining a replacement rule corresponding to the regular matching rule based on the negative sample corresponding to the regular matching rule and the positive sample corresponding to that negative sample, and constructing a correction strategy for each target recognition model based on the regular matching rule and the replacement rule.
Further, the updating of the matching content based on the positive sample matching result of the initial matching content to obtain a regular matching rule of each repeated negative sample in the negative sample set includes:
for each repeated negative sample, matching the initial matching content against the positive sample set to obtain a positive sample matching result;
if the positive sample matching result is non-empty, adding context content to the initial matching content to obtain updated initial matching content;
if the positive sample matching result of the updated initial matching content is still non-empty, continuing to add context content to the updated initial matching content until the positive sample matching result of the updated matching content is empty, and generating a regular matching rule based on the updated matching content.
Further, the generating a regular matching rule based on the updated matching content includes:
masking the object to be corrected in the updated matching content to obtain a masking result;
and extracting the regular rule based on the mask result to obtain a regular matching rule.
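The masking step and the rule extraction can be illustrated with a minimal Python sketch. The source does not say how the mask is turned into a regular expression, so the mapping used here is an assumption: the masked position becomes `\D` (any non-digit), literal digits are generalized to `\d`, and every other context character is kept literally. The function name `rule_from_masked_content` is hypothetical.

```python
import re

def rule_from_masked_content(content, start, end):
    """Build a regex from `content`, masking content[start:end] (the object to be corrected).

    Assumed mapping (not specified in the source):
      masked position -> \\D, digit -> \\d, anything else -> literal character.
    """
    parts = []
    for i, ch in enumerate(content):
        if start <= i < end:
            parts.append(r"\D")          # masked position: any non-digit
        elif ch.isdigit():
            parts.append(r"\d")          # generalize literal digits
        else:
            parts.append(re.escape(ch))  # keep other context characters literally
    return "".join(parts)

# "3QV": the misrecognized character is "Q" at index 1.
pattern = rule_from_masked_content("3QV", 1, 2)
```

Under these assumptions the matching content "3QV" yields the pattern `\d\DV`, which also matches other misrecognitions in the same position, such as "3RV".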
Further, the correction strategy is in the form of a regular matching list, and before the target correction strategy matched with the target identification model information is identified from the correction strategies, the method further comprises:
1) For each recognition model, acquiring historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set based on the historical recognition results;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule for the first repeated negative sample;
3) Appending the first regular matching rule and the first replacement rule to the end of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule for the second repeated negative sample;
6) Appending the second regular matching rule and the second replacement rule to the end of the updated initial regular matching list to obtain a further updated initial regular matching list;
7) Correcting the corrected negative sample set based on the further updated initial regular matching list;
8) Repeating steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set equals zero, and determining the final updated initial regular matching list as the correction strategy.
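Steps 1) to 8) describe an iterative loop, which can be sketched in Python as follows. This is a sketch under stated assumptions, not the patented implementation: negative samples are modeled as (recognized text, expected text) pairs, a "repeated" negative sample is taken to be a pair that occurs more than once, and rule construction is delegated to a caller-supplied `build_rule` function. The names `apply_rules`, `build_rule_list` and `build_rule`, and the no-progress guard, are hypothetical additions.

```python
import re
from collections import Counter

def apply_rules(text, rules):
    """Apply (pattern, replacement) pairs in list order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

def build_rule_list(negatives, build_rule):
    """Grow an ordered rule list until no repeated negative sample remains.

    negatives:  list of (recognized_text, expected_text) pairs (assumed shape).
    build_rule: callable mapping one pair to a (pattern, replacement) rule.
    """
    rules = []
    remaining = list(negatives)
    while True:
        # Steps 2) and 5): pick a negative sample that occurs more than once.
        repeated = [pair for pair, n in Counter(remaining).items() if n > 1]
        if not repeated:
            return rules                      # step 8): nothing repeated is left
        # Steps 3) and 6): append the new rule pair to the end of the list.
        rules.append(build_rule(repeated[0]))
        # Steps 4) and 7): correct the remaining negatives with the updated list.
        corrected = [(apply_rules(rec, rules), exp) for rec, exp in remaining]
        still_wrong = [(rec, exp) for rec, exp in corrected if rec != exp]
        if len(still_wrong) == len(remaining):
            rules.pop()                       # assumed guard: the rule fixed nothing
            return rules
        remaining = still_wrong
```

Because every rule is appended to the end of the list, later rules are built against text that earlier rules have already corrected, which is why the resulting list is applied in order.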
Further, the target correction policy includes a target regular matching rule and a target replacement rule, and the correcting the initial recognition result based on the target correction policy includes:
identifying a character to be corrected from the initial identification result based on the target regular matching rule;
and correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
Further, before acquiring the initial recognition result to be corrected and the information on the target recognition model that output the initial recognition result, the method further includes:
acquiring a text image to be recognized and the business category of the text image to be recognized;
identifying, from a recognition model mapping relation set, a target recognition model matched with the business category, wherein the recognition model mapping relation set comprises mapping relations between different business categories and the identifiers of different recognition models;
and recognizing the text image to be recognized based on the target recognition model to obtain the initial recognition result.
According to another aspect of the present invention, there is provided an OCR recognition result correction apparatus comprising:
the acquisition module is used for acquiring an initial recognition result to be corrected and information on the target recognition model that output the initial recognition result;
the matching module is used for identifying a target correction strategy matched with the target recognition model information from correction strategies, the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and the correction module is used for correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
Further, the apparatus further comprises:
the acquisition module is further used for acquiring, for each recognition model, historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set based on the historical recognition results;
the updating module is used for extracting, for each repeated negative sample in the negative sample set, initial matching content from the repeated negative sample, and updating the matching content based on the positive sample matching result of the initial matching content, to obtain a regular matching rule for each repeated negative sample in the negative sample set;
and the determining module is used for determining a replacement rule corresponding to the regular matching rule based on the negative sample corresponding to the regular matching rule and the positive sample corresponding to that negative sample, and constructing a correction strategy for each target recognition model based on the regular matching rule and the replacement rule.
Further, the updating module includes:
the matching unit is used for matching, for each repeated negative sample, the initial matching content against the positive sample set to obtain a positive sample matching result;
the first updating unit is used for adding context content to the initial matching content if the positive sample matching result is non-empty, so as to obtain updated initial matching content;
and the second updating unit is used for continuously adding context content to the updated initial matching content if the positive sample matching result of the updated initial matching content is non-null until the positive sample matching result of the updated matching content is null, and generating a regular matching rule based on the updated matching content.
Further, in a specific application scenario, the second updating unit is specifically configured to mask the object to be corrected in the updated matching content to obtain a mask result,
and to extract a regular rule based on the mask result to obtain the regular matching rule.
Further, the apparatus further comprises a regular matching list construction module.
The regular matching list construction module is used for: 1) for each recognition model, acquiring historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set based on the historical recognition results;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule for the first repeated negative sample;
3) Appending the first regular matching rule and the first replacement rule to the end of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule for the second repeated negative sample;
6) Appending the second regular matching rule and the second replacement rule to the end of the updated initial regular matching list to obtain a further updated initial regular matching list;
7) Correcting the corrected negative sample set based on the further updated initial regular matching list;
8) Repeating steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set equals zero, and determining the final updated initial regular matching list as the correction strategy.
Further, the correction module includes:
the identification unit is used for identifying characters to be corrected from the initial identification result based on the target regular matching rule;
and the correction unit is used for correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
Further, the apparatus further comprises:
the acquisition module is also used for acquiring the text image to be identified and the business category of the text image to be identified;
the first recognition module is used for identifying, from a recognition model mapping relation set, a target recognition model matched with the business category, wherein the recognition model mapping relation set comprises mapping relations between different business categories and the identifiers of different recognition models;
and the second recognition module is used for recognizing the text image to be recognized based on the target recognition model to obtain an initial recognition result.
According to still another aspect of the present invention, there is provided a storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the above-described OCR recognition result correction method.
According to still another aspect of the present invention, there is provided a computer apparatus including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the OCR recognition result correction method.
By means of the technical scheme, the technical scheme provided by the embodiment of the invention has at least the following advantages:
The invention provides a method and a device for correcting OCR recognition results. First, an initial recognition result to be corrected and information on the target recognition model that output the initial recognition result are obtained; a target correction strategy matched with the target recognition model information is identified from correction strategies, wherein the target correction strategy is constructed based on historical recognition results of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule; and the initial recognition result is corrected based on the target correction strategy to obtain a correction result of the initial recognition result. Compared with the prior art, the embodiment of the invention corrects the recognition results of different recognition models with their corresponding correction strategies. Because each correction strategy is generated from the historical data of its recognition model, the recognition errors of that model can be corrected more comprehensively and accurately. Because the correction strategy comprises regular matching rules, misrecognized characters can be located more precisely, which avoids the situation in which simple character replacement mistakenly modifies correct characters. The correction accuracy of OCR recognition results is thereby effectively improved.
The foregoing description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 shows a flowchart of a method for correcting OCR recognition results according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for correcting OCR recognition results according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the constitution of a correction device for OCR recognition results according to an embodiment of the present invention;
FIG. 4 shows a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a method for correcting an OCR recognition result, which is shown in figure 1 and comprises the following steps:
101. Acquiring an initial recognition result to be corrected and information on the target recognition model that output the initial recognition result.
In the embodiment of the invention, the initial recognition result to be corrected is the result of recognizing a paper text or image text with an OCR target recognition model. The OCR target recognition model may be an optical character recognition device or a model used in an optical character recognition engine. The model may be an existing open-source model, such as a PaddleOCR model or an EasyOCR model, or an improved model based on an existing one, which is not specifically limited in the embodiment of the present invention. The paper text or image text to be recognized may come from any business field, for example driver's licenses, vehicle licenses and vehicle qualification certificates in automobile financing and leasing, or texts from other fields such as insurance and medical care; the embodiment of the invention is not specifically limited in this respect. To determine which target recognition model produced the initial recognition result to be corrected, information on that model needs to be acquired. This information may be the name of the target recognition model or its unique identifier, such as a code or number configured for the model in advance.
102. Identifying, from the correction strategies, the target correction strategy matched with the target recognition model information.
In the embodiment of the invention, to improve correction accuracy, the correction strategy corresponding to a target recognition model is constructed based on that model's historical recognition results. That is, correction strategies are built in advance, in a targeted manner, from the recognition errors that each target recognition model makes repeatedly, so that every target recognition model has a correction strategy matched to the characteristics of its recognition results. For example, when vehicle qualification certificates of leased vehicles are recognized in a vehicle financial leasing scenario, the errors of model A include recognizing Chinese characters as other symbols, while the errors of model B include recognizing "GB" as "CB". Matching a dedicated strategy to the error profile of each model makes the search for, and correction of, the content to be corrected more accurate. It should be noted that if several target recognition models have highly similar error types in their recognition results, they may share the same correction strategy; the embodiment of the present invention is not specifically limited in this respect. The correction strategy comprises regular matching rules and the replacement rules corresponding to them. A regular matching rule is a rule based on a regular expression and is used to find, in the initial recognition result, content to be corrected that conforms to the pattern of the regular expression. Each regular matching rule has a corresponding replacement rule, which is used to correct the content found into the correct content.
It should be noted that different target recognition models are matched with different correction strategies. Because each pair of regular matching rule and replacement rule is constructed from the historical recognition results of the target recognition model that produced the initial recognition result, the regular matching rule can locate misrecognized content in the initial recognition result more accurately. The objects selected for correction therefore match the actual errors in that model's recognition results more closely, errors are found more accurately and comprehensively, and the accuracy of OCR recognition result correction is effectively improved.
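As a minimal illustration of step 102, the lookup from target recognition model information to a correction strategy could be a plain dictionary. The model identifiers (`model_a`, `model_b`, `model_c`), the rule contents and the function name `find_target_strategy` below are all hypothetical; the sketch only shows that two models with similar error profiles can point at the same strategy object.

```python
# Hypothetical model identifiers and rules, for illustration only.
SHARED_STRATEGY = [(r"CB(?=\d)", "GB")]   # "GB" misread as "CB" before a digit

STRATEGY_BY_MODEL = {
    "model_a": [(r"(\d)\DV", r"\g<1>国V")],
    "model_b": SHARED_STRATEGY,
    "model_c": SHARED_STRATEGY,           # similar error profile, so it shares model_b's strategy
}

def find_target_strategy(model_info):
    """Return the rule list registered for this model, or an empty list if none exists."""
    return STRATEGY_BY_MODEL.get(model_info, [])
```

Returning an empty list for an unknown model is a design choice of this sketch: the recognition result then passes through uncorrected rather than being corrected with a strategy built for a different model.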
103. Correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
In the embodiment of the invention, the target correction strategy comprises several pairs of regular matching rules and replacement rules, and these pairs may or may not have a sequential relationship. Where a sequential relationship exists, for example, each pair is stored as one item of a list used to manage regular matching, and the items are arranged in order. The entire content of the initial recognition result is searched with each regular matching rule in list order, and the erroneous content found is replaced according to the replacement rule corresponding to that regular matching rule, which completes the correction. Where no sequential relationship exists, the pairs may be drawn in random order; each drawn pair is used to search the entire content of the initial recognition result and to replace the erroneous content found according to its replacement rule, until every pair has completed its search and replacement. The result obtained is the correction result of the initial recognition result.
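The ordered case can be sketched in a few lines of Python. The two rules below are hypothetical examples modeled on the errors the description mentions (a misread "国" in the emission mark "国V", and a missing separator before "GB"); the function name `correct` and the sample string are likewise assumptions. The second rule only matches text that the first rule has already repaired, which is one situation in which list order matters.

```python
import re

# Hypothetical ordered rule list: (regular matching rule, replacement rule).
RULES = [
    (r"(\d)\DV", r"\g<1>国V"),   # digit + non-digit + V  ->  digit + 国V
    (r"国VCB", "国V, GB"),        # depends on the output of the first rule
]

def correct(text, rules):
    """Search the whole text with each rule in list order and replace what is found."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

in_order = correct("2013QVCB18352", RULES)
reversed_order = correct("2013QVCB18352", list(reversed(RULES)))
```

Here `in_order` is "2013国V, GB18352", while `reversed_order` stops at "2013国VCB18352" because the second rule ran before its input existed. When no rule depends on another's output, the pairs can be applied in any order with the same result.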
It should be noted that searching the initial recognition result for erroneous content with a regular matching rule is matching based on a text pattern, not a simple character comparison. In the recognition results for vehicle qualification certificates of financially leased vehicles, the Chinese character "国" (guo, "country") is often recognized as another character, such as "Q" or "R". If every "Q" or "R" were simply replaced with "国", characters that really are "Q" or "R" would also be changed to "国", causing miscorrection. A regular matching rule, by contrast, can attach a text pattern to the character "国": in a vehicle qualification certificate, "国" is always preceded by digits such as a year and followed by the character "V", so the initial recognition result is searched with the regular matching rule "digit + non-digit + V", which avoids matching positions that are not "国". Searching for erroneous content with regular matching rules thus avoids mislocating the content to be corrected and improves correction accuracy. In addition, because a regular expression captures a regular, logical text pattern, search rules for content to be corrected can be grouped into classes, which greatly reduces the number of search rules and the number of searches and improves the efficiency of correcting the recognition result.
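The difference between simple character replacement and pattern-based matching can be shown with a short Python comparison. The sample string and the function names `naive_fix` and `regex_fix` are made up for illustration; the pattern is the "digit + non-digit + V" rule from the description, with a capture group added so the digit can be kept.

```python
import re

def naive_fix(text):
    """Simple character replacement: every Q or R becomes 国."""
    return text.replace("Q", "国").replace("R", "国")

def regex_fix(text):
    """Pattern-based replacement: only a non-digit between a digit and 'V' is rewritten."""
    return re.sub(r"(\d)\DV", r"\g<1>国V", text)

sample = "LSVQR123 2013QV"   # hypothetical text: a code containing real Q/R, then a misread mark
```

With this sample, `naive_fix` also rewrites the genuine "Q" and "R" inside the code and returns "LSV国国123 2013国V", whereas `regex_fix` touches only the misread position and returns "LSVQR123 2013国V".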
For further explanation and limitation, before the step of identifying the target correction strategy matching the target identification model information from the correction strategies, as shown in fig. 2, the method further includes:
201. For each recognition model, acquiring historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set based on the historical recognition results.
202. For each repeated negative sample in the negative sample set, extracting initial matching content from the repeated negative sample, and updating the matching content based on the positive sample matching result of the initial matching content, to obtain a regular matching rule for each repeated negative sample in the negative sample set.
203. Determining a replacement rule corresponding to the regular matching rule based on the negative sample corresponding to the regular matching rule and the positive sample corresponding to that negative sample, and constructing a correction strategy for each target recognition model based on the regular matching rule and the replacement rule.
In the embodiment of the invention, to obtain an accurate correction strategy, the historical recognition results of each target recognition model are collected separately, and negative samples, which contain erroneous content, and positive samples are identified from them. The positive samples comprise the correct content in the historical recognition results and the correct content corresponding to the erroneous content in the negative samples. The texts behind the historical recognition results may be of one type, for example the vehicle qualification certificates of 50 vehicles, or of different types, for example 20 vehicle qualification certificates, 20 driver's licenses and 20 vehicle licenses; the embodiment of the invention is not specifically limited in this respect. The initial matching content may be the misrecognized character itself, or a word or sentence containing the misrecognized character, which is likewise not specifically limited. To avoid overfitting the correction strategy, only repeated negative samples, that is, misrecognized characters that occur repeatedly, are used to generate regular matching rules. After a regular matching rule is obtained, its replacement rule is generated from the positive sample corresponding to the negative sample that the regular matching rule finds. The replacement rule may be the positive sample itself: for example, the regular matching rule is "国VGB" and the replacement rule is "国V, GB", where "国V" is the mark also rendered "Guo V". The replacement rule may also be a rule containing the positive sample: for example, the regular matching rule is "digit + non-digit + V", written "\d\DV", and the replacement rule is "replace the non-digit + V part with 国V", written lambda x: "".join([x[0][0], "国V"]), where \d denotes a digit and \D denotes a non-digit.
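A small Python sketch can make the sample sets and the callable replacement rule concrete. It assumes that each historical record is a (recognized text, verified text) pair, which the source does not state; the helper names `split_samples` and `repeated_negatives` and the sample records are hypothetical. The replacement rule is the lambda from the description: `re.sub` passes it a match object, whose `[0]` is the matched text and whose `[0][0]` is therefore the leading digit.

```python
import re
from collections import Counter

def split_samples(history):
    """history: list of (recognized_text, verified_text) pairs (assumed record shape)."""
    positives = [truth for _, truth in history]            # all correct content
    negatives = [(rec, truth) for rec, truth in history if rec != truth]
    return positives, negatives

def repeated_negatives(negatives):
    """Keep only errors seen more than once, to limit overfitting to one-off mistakes."""
    return [pair for pair, n in Counter(negatives).items() if n > 1]

PATTERN = r"\d\DV"                                        # digit + non-digit + V
REPLACEMENT = lambda x: "".join([x[0][0], "国V"])          # keep the digit, rewrite the rest

history = [
    ("2013QV", "2013国V"),
    ("2013QV", "2013国V"),
    ("GB18352", "GB18352"),
    ("2014RV", "2014国V"),
]
positives, negatives = split_samples(history)
fixed = re.sub(PATTERN, REPLACEMENT, "2014RV")
```

In this sketch only ("2013QV", "2013国V") counts as a repeated negative sample, yet the rule derived from it also corrects "2014RV" to "2014国V", because the pattern describes the position of the error rather than the specific wrong character.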
In an embodiment of the present invention, for further explanation and limitation, the step of updating the matching content based on the positive sample matching result of the initial matching content to obtain a regular matching rule of each repeated negative sample in the negative sample set includes:
matching the initial matching content with the positive sample set aiming at each repeated negative sample to obtain a positive sample matching result;
if the positive sample matching result is non-empty, adding context content to the initial matching content to obtain updated initial matching content;
if the positive sample matching result of the updated initial matching content is non-null, continuing to add context content to the updated initial matching content until the positive sample matching result of the updated matching content is null, and generating a regular matching rule based on the updated matching content.
In the embodiment of the invention, the error samples in the negative sample set are searched to find the repeated error samples, and one of them is randomly extracted as the current object for which a regular matching rule and a replacement rule are to be constructed; the constructed result is written into the initial regular matching list. For each error sample, the positive sample set is searched with the initial matching content. If the search result is non-empty, that is, the initial matching content also matches content in the positive sample set, then correcting recognition results based on this initial matching content could turn correct content into erroneous content, so the initial matching content needs further adjustment. Specifically, context information is added to the initial matching content. For example, if the initial matching content "国V" ("national V") matches a correct sample in the positive sample set, the preceding text "2000" is extracted from the sentence in which the error sample appears, these digit characters are prepended to the initial matching content, and the positive sample set is searched again with "2000国V". If a correct sample is still matched, further context is added to "2000国V" from the sentence in which the error sample appears, until the adjusted initial matching content no longer matches any correct sample in the positive sample set. The adjusted initial matching content (the updated matching content) is then used to generate the regular matching rule of the current error sample.
It should be noted that adjusting the matching content used to search for misrecognized content according to the positive sample set matching result greatly reduces the probability that correct content in a recognition result is wrongly corrected, and at the same time improves the efficiency of constructing the correction strategy. In addition, adjusting the matching content by adding context improves the accuracy with which misrecognized content is found, and thus the accuracy of correcting the recognition result.
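The context-expansion loop described above can be sketched as follows. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, a "match" is simplified to a substring test, and context is added one character at a time (preceding context first), whereas the text does not fix the granularity.

```python
def expand_until_unambiguous(error_sentence, start, end, positive_samples):
    """Grow the span [start, end) of error_sentence until no positive sample contains it."""
    def matched(candidate):
        # Simplified positive sample matching: plain substring search.
        return any(candidate in sample for sample in positive_samples)

    while matched(error_sentence[start:end]):
        if start > 0:
            start -= 1      # add one character of preceding context
        elif end < len(error_sentence):
            end += 1        # otherwise add following context
        else:
            return None     # the whole sentence still matches a positive sample
    return error_sentence[start:end]

# "@" alone also occurs in a correct sample, so one character of context is added.
print(expand_until_unambiguous("2000@V", 4, 5, ["mail a@b.com", "2000国V"]))
```

Returning `None` when the sentence is exhausted is a design choice of this sketch; it signals that no safe matching content could be derived for that error sample.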
In one embodiment of the present invention, for further explanation and limitation, the generating the regular matching rule based on the updated matching content includes:
masking the object to be corrected in the updated matching content to obtain a masking result;
and extracting the regular rule based on the mask result to obtain a regular matching rule.
In the embodiment of the invention, in order to extract a search rule for misrecognized content that applies generally to OCR recognition results, once the updated matching content whose positive sample set matching result is empty has been obtained, the misrecognized character in that content is masked, and a regular expression is extracted from the masking result with a regular expression extraction tool, giving a regular matching rule that reflects the pattern in which the current misrecognized character occurs in the text. For example, in "2000@V", where "@" is a misrecognition of the character "国" ("national"), "@" is replaced with [MARK] to obtain "2000[MARK]V", and a regular expression is extracted from "2000[MARK]V" to obtain "\d\DV", i.e. the combination "digit + non-digit + V". Whatever character "国" is misrecognized as, as long as that character satisfies the textual pattern in which "国" appears, it is identified as a character to be corrected. The regular expression extraction tool may be the JMeter extractor or another tool capable of extracting regular expressions, which is not specifically limited in the embodiment of the present invention.
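The masking and extraction steps can be illustrated with a small hand-written generalizer. This is a sketch under assumptions and does not reproduce any particular extraction tool: `mask_error` and `to_regex` are hypothetical names, the mask is assumed to stand for a single non-digit character, and each run of digits is collapsed to a single `\d`, which is sufficient when the rule is used for searching.

```python
import re

MARK = "[MARK]"

def mask_error(content, error_char):
    # Replace the misrecognized character with the mask token.
    return content.replace(error_char, MARK)

def to_regex(masked):
    # Assumed generalization: mask -> \D, each run of digits -> one \d,
    # any other character -> itself (escaped).
    parts = []
    for token in re.findall(r"\[MARK\]|\d+|.", masked):
        if token == MARK:
            parts.append(r"\D")
        elif token.isdigit():
            parts.append(r"\d")
        else:
            parts.append(re.escape(token))
    return "".join(parts)

masked = mask_error("2000@V", "@")
print(masked, to_regex(masked))
```

The resulting rule also finds other misrecognitions in the same position, for example "2000#V", because it describes where the character occurs rather than which character was produced.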
In an embodiment of the present invention, for further explanation and limitation, the correction strategy is in the form of a regular matching list, and before the target correction strategy matching the target recognition model information is identified from the correction strategies, the method further comprises:
1) For each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule of the first repeated negative sample;
3) Appending the first regular matching rule and the first replacement rule to the end of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of the repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule of the second repeated negative sample;
6) Appending the second regular matching rule and the second replacement rule to the end of the updated initial regular matching list to obtain a re-updated initial regular matching list;
7) Correcting the corrected negative sample set based on the updated initial regular matching list;
8) Repeating the steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set is equal to zero, and determining the last updated result of the initial regular matching list as a correction strategy.
In the embodiment of the invention, the correction strategy takes the form of a regular matching list, pattern_reply_pattern_list, which manages the regular matching patterns. The first element of each list item is a regular matching rule (pattern), and the second element is the corresponding replacement rule (reply). The items in this list, each a pair of a regular matching rule and a replacement rule, have an execution order. During construction of the regular matching list, the error samples in the negative sample set are first corrected with the list items already constructed, and new regular matching rules and replacement rules are then constructed from the corrected negative sample set. This preserves the dependency between successive rule pairs and avoids overlapping regular matching rules that would cause the correction process to overfit, which in turn safeguards correction accuracy.
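Steps 1) to 8) can be sketched as the loop below. It is a simplified illustration, not the patented procedure itself: negative samples are assumed to be (recognized text, correct text) pairs, a sample counts as "repeated" when the identical pair occurs more than once, the first repeated pair is taken instead of a random one, and `literal_rule` is a toy stand-in for the rule construction described earlier.

```python
import re
from collections import Counter

def apply_rules(text, rules):
    # Rules run in list order, so each rule sees the output of the previous one.
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

def literal_rule(recognized, correct):
    # Toy rule builder (assumption): match the recognized text literally.
    return re.escape(recognized), correct

def build_rule_list(negatives, build_rule=literal_rule):
    """negatives: (recognized_text, correct_text) pairs from one model's history."""
    rules = []
    remaining = [(rec, ok) for rec, ok in negatives if rec != ok]
    while True:
        repeated = [pair for pair, n in Counter(remaining).items() if n > 1]
        if not repeated:
            return rules  # step 8: no repeated negative samples remain
        recognized, correct = repeated[0]
        rules.append(build_rule(recognized, correct))  # steps 3 and 6: append at the end
        if apply_rules(recognized, rules) != correct:
            raise ValueError("new rule does not fix the sample it was built from")
        # steps 4 and 7: correct the negative set with the updated list
        corrected = [(apply_rules(rec, rules), ok) for rec, ok in remaining]
        remaining = [(rec, ok) for rec, ok in corrected if rec != ok]

history = [("2000@V", "2000国V"), ("2000@V", "2000国V"), ("GB1B", "GB18")]
print(len(build_rule_list(history)))
```

The check after each append guarantees that every iteration removes at least the repeated pair it was built from, so the loop terminates; the non-repeated error "GB1B" deliberately produces no rule, matching the overfitting concern stated above.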
In an embodiment of the present invention, for further explanation and limitation, the step of correcting the initial recognition result based on the target correction strategy includes:
identifying a character to be corrected from the initial identification result based on the target regular matching rule;
and correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
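The two steps above can be sketched as follows, assuming the target correction strategy is an ordered list of (pattern, replacement) pairs. The function names are hypothetical; `find_targets` makes the identification step explicit, and `correct_result` applies each replacement rule before moving to the next matching rule.

```python
import re

def find_targets(text, pattern):
    # Identification step: spans and text of the characters to be corrected.
    return [(m.start(), m.end(), m[0]) for m in re.finditer(pattern, text)]

def correct_result(initial_result, target_policy):
    # Correction step: rules are applied one after another, in list order.
    corrected = initial_result
    for pattern, replacement in target_policy:
        if find_targets(corrected, pattern):
            corrected = re.sub(pattern, replacement, corrected)
    return corrected

policy = [
    (r"\d\DV", lambda m: m[0][0] + "国V"),  # callable replacement rule
    (r"GB1B", "GB18"),                      # plain-string replacement rule
]
print(correct_result("2000@V GB1B", policy))
```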
In the embodiment of the invention, the target correction strategy comprises target regular matching rules and target replacement rules. When there is no order relation among the target regular matching rules, one rule may be randomly extracted and matched against the text in the initial recognition result; if a character to be corrected is matched, it is corrected with the target replacement rule corresponding to that target regular matching rule, and another rule is then extracted from the remaining target regular matching rules and matched against the text, until all target regular matching rules have been traversed once. When an order relation exists, the target regular matching rules are matched against the text in the initial recognition result in that order, and the next regular matching rule is executed only after the characters matched by the previous one have been corrected, which guarantees the correction order.
In an embodiment of the present invention, for further explanation and limitation, before the step of obtaining the initial recognition result to be corrected and the information of the target recognition model that output the initial recognition result, the method further includes:
acquiring a text image to be identified and a service category of the text image to be identified;
identifying a target identification model matched with the service category from the identification model mapping relation set;
and identifying the text image to be identified based on the target identification model to obtain an initial identification result.
In the embodiment of the invention, the recognition model mapping relation set comprises pre-established mapping relations between different service categories and the identifiers of different recognition models. Each recognition model is trained on text image samples under the service category to which it is mapped. Different recognition models have different strengths: for example, some are good at recognizing certificate and table text images, while others are good at recognizing plain text images. A model that is good at certificates can therefore be used for certificate text images and a model that is good at plain text for plain text images, which improves the accuracy of the initial recognition result. Because each recognition model corresponds to a different service category, the service categories of the historical recognition results also differ. The samples in the positive and negative sample sets constructed from the historical recognition results of a given model therefore match the service category encountered in actual correction more closely, the correction strategy constructed from those sample sets is more targeted, and its correction accuracy is higher.
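The mapping relation set can be pictured as a simple lookup table. The sketch below is illustrative only: the category names, model identifiers, and the `models` dictionary of callables are all hypothetical stand-ins for real OCR engines.

```python
# Hypothetical mapping: service category -> recognition model identifier.
MODEL_BY_CATEGORY = {
    "vehicle_certificate": "certificate_model",
    "contract": "plain_text_model",
}

def select_model(category, mapping=MODEL_BY_CATEGORY):
    if category not in mapping:
        raise KeyError(f"no recognition model mapped to category {category!r}")
    return mapping[category]

def recognize(image, category, models, mapping=MODEL_BY_CATEGORY):
    # Returns the initial recognition result together with the identifier of
    # the model that produced it, so a matching correction strategy can be found later.
    model_id = select_model(category, mapping)
    return models[model_id](image), model_id

# Stand-in "models" that only label their input.
models = {
    "certificate_model": lambda image: "cert:" + image,
    "plain_text_model": lambda image: "plain:" + image,
}
print(recognize("a.png", "contract", models))
```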
The invention provides a method for correcting OCR recognition results. First, an initial recognition result to be corrected and the information of the target recognition model that output the initial recognition result are obtained; a target correction strategy matching the target recognition model information is identified from the correction strategies, where the target correction strategy is constructed based on the historical recognition results of the target recognition model and comprises a regular matching rule and a replacement rule corresponding to the regular matching rule; and the initial recognition result is corrected based on the target correction strategy to obtain the correction result of the initial recognition result. Compared with the prior art, the embodiment of the invention corrects the recognition results of different recognition models with their own corresponding correction strategies. Because each correction strategy is generated from the historical data of the corresponding recognition model, the recognition errors of that model can be corrected more comprehensively and accurately; and because the correction strategy includes regular matching rules, misrecognized characters can be located more precisely, avoiding the mistaken modification of correct characters that simple character replacement can cause. The correction accuracy of OCR recognition results is thereby effectively improved.
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a device for correcting an OCR recognition result, as shown in fig. 3, where the device includes:
the acquiring module 31 is configured to acquire an initial recognition result to be corrected and the information of the target recognition model that output the initial recognition result;
a matching module 32, configured to identify a target modification policy that matches the target identification model information from modification policies, where the target modification policy is constructed based on a historical identification result of the target identification model, and the modification policy includes a regular matching rule and a replacement rule corresponding to the regular matching rule;
and the correction module 33 is configured to correct the initial recognition result based on the target correction policy, so as to obtain a corrected result of the initial recognition result.
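The three modules can be pictured as methods of one small class. This is a structural sketch under assumptions rather than the device itself: the class name, the dictionary shape of the OCR output, and the choice to apply no correction for an unknown model are all hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class CorrectionDevice:
    # model identifier -> ordered list of (pattern, replacement) pairs
    policies: dict = field(default_factory=dict)

    def acquire(self, ocr_output):
        # Acquiring module: initial result plus the model that produced it.
        return ocr_output["text"], ocr_output["model_id"]

    def match(self, model_id):
        # Matching module: an unknown model gets an empty strategy (no correction).
        return self.policies.get(model_id, [])

    def correct(self, text, policy):
        # Correction module: apply each rule pair in order.
        for pattern, replacement in policy:
            text = re.sub(pattern, replacement, text)
        return text

    def run(self, ocr_output):
        text, model_id = self.acquire(ocr_output)
        return self.correct(text, self.match(model_id))

device = CorrectionDevice({"certificate_model": [(r"GB1B", "GB18")]})
print(device.run({"text": "GB1B352", "model_id": "certificate_model"}))
```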
Further, the apparatus further comprises:
the obtaining module 31 is further configured to obtain, for each recognition model, a history recognition result of the recognition model, and construct a positive sample set and a negative sample set based on the history recognition result, respectively;
the updating module is used for extracting initial matching content from the repeated negative samples aiming at each repeated negative sample in the negative sample set, and updating the matching content based on positive sample matching results of the initial matching content to obtain regular matching rules of each repeated negative sample in the negative sample set;
and the determining module is used for determining a replacement rule corresponding to the regular matching rule based on the negative sample corresponding to the regular matching rule and the positive sample corresponding to that negative sample, and constructing a correction strategy for each target recognition model based on the regular matching rule and the replacement rule.
Further, the updating module includes:
the matching unit is used for matching the initial matching content with the positive sample set aiming at each repeated negative sample to obtain a positive sample matching result;
the first updating unit is used for adding context content to the initial matching content if the positive sample matching result is non-empty, so as to obtain updated initial matching content;
and the second updating unit is used for continuously adding context content to the updated initial matching content if the positive sample matching result of the updated initial matching content is non-null until the positive sample matching result of the updated matching content is null, and generating a regular matching rule based on the updated matching content.
Further, in a specific application scenario, the second updating unit is specifically configured to mask the object to be corrected in the updated matching content to obtain a masking result;
and to extract a regular expression based on the masking result to obtain a regular matching rule.
Further, the apparatus further comprises a regular matching list construction module.
The regular matching list construction module is used for: 1) for each recognition model, acquiring the historical recognition results of the recognition model, and constructing a positive sample set and a negative sample set respectively based on the historical recognition results;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule of the first repeated negative sample;
3) Appending the first regular matching rule and the first replacement rule to the end of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of the repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule of the second repeated negative sample;
6) Appending the second regular matching rule and the second replacement rule to the end of the updated initial regular matching list to obtain a re-updated initial regular matching list;
7) Correcting the corrected negative sample set based on the updated initial regular matching list;
8) Repeating the steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set is equal to zero, and determining the last updated result of the initial regular matching list as a correction strategy.
Further, the correction module includes:
the identification unit is used for identifying characters to be corrected from the initial identification result based on the target regular matching rule;
and the correction unit is used for correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
Further, the apparatus further comprises:
the acquiring module 31 is further configured to acquire a text image to be identified, and a service class of the text image to be identified;
the first recognition module is used for recognizing a target recognition model matched with the service category from a recognition model mapping relation set, wherein the recognition model mapping relation set comprises mapping relations between different service categories and the identifiers of different recognition models;
and the second recognition module is used for recognizing the text image to be recognized based on the target recognition model to obtain an initial recognition result.
The invention provides a device for correcting OCR recognition results. The device first obtains an initial recognition result to be corrected and the information of the target recognition model that output the initial recognition result; identifies, from the correction strategies, a target correction strategy matching the target recognition model information, where the target correction strategy is constructed based on the historical recognition results of the target recognition model and comprises a regular matching rule and a replacement rule corresponding to the regular matching rule; and corrects the initial recognition result based on the target correction strategy to obtain the correction result of the initial recognition result. Compared with the prior art, the embodiment of the invention corrects the recognition results of different recognition models with their own corresponding correction strategies. Because each correction strategy is generated from the historical data of the corresponding recognition model, the recognition errors of that model can be corrected more comprehensively and accurately; and because the correction strategy includes regular matching rules, misrecognized characters can be located more precisely, avoiding the mistaken modification of correct characters that simple character replacement can cause. The correction accuracy of OCR recognition results is thereby effectively improved.
According to an embodiment of the present invention, there is provided a storage medium storing at least one executable instruction for performing the method for correcting the OCR recognition result in any of the above-described method embodiments.
Fig. 4 shows a schematic structural diagram of a computer device according to an embodiment of the invention; the specific embodiments of the invention do not limit the specific implementation of the computer device.
As shown in fig. 4, the computer device may include: a processor 402, a communication interface (Communications Interface) 404, a memory 406, and a communication bus 408.
Wherein: processor 402, communication interface 404, and memory 406 communicate with each other via communication bus 408.
The communication interface 404 is used for communicating with network elements of other devices, such as clients or other servers.
The processor 402 is configured to execute the program 410, and may specifically perform relevant steps in the above-described embodiments of the method for correcting the OCR recognition result.
In particular, program 410 may include program code including computer-operating instructions.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computer device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 406 is used for storing the program 410. The memory 406 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
Program 410 may be specifically operable to cause processor 402 to:
acquiring an initial recognition result to be corrected and information of the target recognition model that output the initial recognition result;
identifying a target correction strategy matched with the target recognition model information from correction strategies, wherein the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
And correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that given here, or the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only the preferred embodiments of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for correcting OCR recognition results, comprising:
acquiring an initial recognition result to be corrected and information of the target recognition model that output the initial recognition result;
identifying a target correction strategy matched with the target recognition model information from correction strategies, wherein the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
2. The method of claim 1, wherein prior to identifying a target correction strategy from the correction strategies that matches the target recognition model information, the method further comprises:
for each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
extracting initial matching content from the repeated negative samples aiming at each repeated negative sample in the negative sample set, and updating the matching content based on positive sample matching results of the initial matching content to obtain regular matching rules of each repeated negative sample in the negative sample set;
And determining a replacement rule corresponding to the regular matching rule based on a negative sample corresponding to the regular matching rule and a positive sample corresponding to the negative sample, and constructing and obtaining a correction strategy of each target recognition model based on the regular matching rule and the replacement rule.
3. The method of claim 2, wherein the updating the matching content based on the positive sample matching result of the initial matching content to obtain a regular matching rule for each repeated negative sample in the negative sample set comprises:
matching the initial matching content with the positive sample set aiming at each repeated negative sample to obtain a positive sample matching result;
if the positive sample matching result is non-empty, adding context content to the initial matching content to obtain updated initial matching content;
if the positive sample matching result of the updated initial matching content is non-null, continuing to add context content to the updated initial matching content until the positive sample matching result of the updated matching content is null, and generating a regular matching rule based on the updated matching content.
4. The method of claim 3, wherein the generating a regular matching rule based on the updated matching content comprises:
Masking the object to be corrected in the updated matching content to obtain a masking result;
and extracting the regular rule based on the mask result to obtain a regular matching rule.
5. The method of claim 1, wherein the correction policy is in the form of a regular matching list, and wherein prior to identifying a target correction policy from the correction policies that matches the target identification model information, the method further comprises:
1) For each recognition model, acquiring a history recognition result of the recognition model, and respectively constructing a positive sample set and a negative sample set based on the history recognition result;
2) Extracting a first repeated negative sample from the negative sample set, and constructing a first regular matching rule and a first replacement rule of the first repeated negative sample;
3) Appending the first regular matching rule and the first replacement rule to the end of the initial regular matching list to obtain an updated initial regular matching list;
4) Correcting the negative sample set based on the updated initial regular matching list to obtain a corrected negative sample set;
5) If the number of the repeated negative samples in the corrected negative sample set is greater than zero, extracting a second repeated negative sample from the corrected negative sample set, and constructing a second regular matching rule and a second replacement rule of the second repeated negative sample;
6) Appending the second regular matching rule and the second replacement rule to the end of the updated initial regular matching list to obtain a re-updated initial regular matching list;
7) Correcting the corrected negative sample set based on the updated initial regular matching list;
8) Repeating the steps 5) to 7) until the number of repeated negative samples in the corrected negative sample set is equal to zero, and determining the last updated result of the initial regular matching list as a correction strategy.
6. The method of claim 1, wherein the target modification policy includes a target regular matching rule and a target replacement rule, and wherein modifying the initial recognition result based on the target modification policy includes:
identifying a character to be corrected from the initial identification result based on the target regular matching rule;
and correcting the character to be corrected based on the target replacement rule to obtain a correction result of the initial recognition result.
7. The method according to any one of claims 1 to 6, wherein before acquiring the initial recognition result to be corrected and the information of the target recognition model that output the initial recognition result, the method further comprises:
Acquiring a text image to be identified and a business category of the text image to be identified;
identifying a target recognition model matched with the service category from a recognition model mapping relation set, wherein the recognition model mapping relation set comprises mapping relations between different service categories and the identifiers of different recognition models;
and identifying the text image to be identified based on the target identification model to obtain an initial identification result.
8. An OCR recognition result correction apparatus comprising:
the acquisition module is used for acquiring an initial recognition result to be corrected and information of the target recognition model that output the initial recognition result;
the matching module is used for identifying a target correction strategy matched with the target recognition model information from correction strategies, the target correction strategy is constructed based on a historical recognition result of the target recognition model, and the correction strategy comprises a regular matching rule and a replacement rule corresponding to the regular matching rule;
and the correction module is used for correcting the initial recognition result based on the target correction strategy to obtain a correction result of the initial recognition result.
9. A storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the OCR recognition result correction method of any one of claims 1-7.
10. A computer device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction, wherein the executable instruction causes the processor to perform operations corresponding to the OCR recognition result correction method according to any one of claims 1 to 7.
CN202310916950.8A 2023-07-24 2023-07-24 OCR recognition result correction method and device Pending CN116935414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310916950.8A CN116935414A (en) 2023-07-24 2023-07-24 OCR recognition result correction method and device

Publications (1)

Publication Number Publication Date
CN116935414A true CN116935414A (en) 2023-10-24

Family

ID=88385899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310916950.8A Pending CN116935414A (en) 2023-07-24 2023-07-24 OCR recognition result correction method and device

Country Status (1)

Country Link
CN (1) CN116935414A (en)

Similar Documents

Publication Publication Date Title
CN109829155B (en) Keyword determination method, automatic scoring method, device, equipment and medium
US20100257440A1 (en) High precision web extraction using site knowledge
CN112257613B (en) Physical examination report information structured extraction method and device and computer equipment
CN110929125A (en) Search recall method, apparatus, device and storage medium thereof
CN110704719B (en) Enterprise search text word segmentation method and device
CN112464845B (en) Bill recognition method, equipment and computer storage medium
CN113033185B (en) Standard text error correction method and device, electronic equipment and storage medium
WO2022134580A1 (en) Method and apparatus for acquiring certificate information, and storage medium and computer device
CN111581346A (en) Event extraction method and device
CN113642320A (en) Method, device, equipment and medium for extracting document directory structure
CN111782892B (en) Similar character recognition method, device, apparatus and storage medium based on prefix tree
EP2138959A1 (en) Word recognizing method and word recognizing program
CN113283389A (en) Handwritten character quality detection method, device, equipment and storage medium
CN114677689B (en) Text image recognition error correction method and electronic equipment
CN116935414A (en) OCR recognition result correction method and device
CN114861625A (en) Method for obtaining target training sample, electronic device and medium
CN111985486A (en) Image information identification method and device, storage medium and computer equipment
CN113177543A (en) Certificate identification method, device, equipment and storage medium
CN112989820A (en) Legal document positioning method, device, equipment and storage medium
CN111506756A (en) Similar picture searching method and system, electronic device and storage medium
CN117370583B (en) Knowledge-graph entity alignment method and system based on generative adversarial network
CN114328938B (en) Image report structured extraction method
CN115905561B (en) Ontology alignment method and device, electronic equipment and storage medium
CN113837129B (en) Method, device, equipment and storage medium for identifying wrongly written characters of handwritten signature
CN114694152B (en) Printed text credibility fusion method and device based on three-source OCR (optical character recognition) result

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination