CN116758564A - Method and system for comparing OCR character recognition results - Google Patents

Method and system for comparing OCR character recognition results

Info

Publication number
CN116758564A
CN116758564A CN202311021186.4A CN202311021186A CN116758564A CN 116758564 A CN116758564 A CN 116758564A CN 202311021186 A CN202311021186 A CN 202311021186A CN 116758564 A CN116758564 A CN 116758564A
Authority
CN
China
Prior art keywords
code
character
ocr
article
character number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311021186.4A
Other languages
Chinese (zh)
Other versions
CN116758564B (en)
Inventor
张宏坤
李慧
刘子禛
于龙
姜建宁
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Lvxin Siyuan Counter Forgery Technology Co ltd
Original Assignee
Shandong Lvxin Siyuan Counter Forgery Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Lvxin Siyuan Counter Forgery Technology Co ltd
Priority to CN202311021186.4A
Publication of CN116758564A
Application granted
Publication of CN116758564B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/246 Division of the character sequences into groups prior to recognition; Selection of dictionaries using linguistic properties, e.g. specific for English or German language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a method and a system for comparing OCR character recognition results and relates to the technical field of character recognition. The method and system split both the original article character numbers and the OCR-recognized character string according to an agreed rule: either splitting by a character-length unit, or splitting letters and digits separately so that each run of consecutive letters or consecutive digits forms one segment. The split original article character numbers are stored in a data table as the reference records for comparison. The segments of the OCR-recognized string are used to query the corresponding columns of the data table, the Levenshtein distance between each query result and the OCR-recognized string is calculated in turn, and the record with the smallest distance is taken as the final result. In this way it can be effectively determined whether the OCR result for an article character number is correct, and if it is not, the correct character number is reported.

Description

Method and system for comparing OCR character recognition results
Technical Field
The invention relates to the technical field of OCR character recognition, in particular to a method and a system for comparing OCR character recognition results.
Background
In each production and processing field, in order to facilitate management and traceability of materials, accessories, products and other articles, the articles are uniquely numbered. The character number and graphic code (bar code or two-dimensional code) are usually attached to the article by printing or pasting. But in some cases, because of some conditional restrictions, only character numbers can be attached to items.
The character number of an article is an unordered mixture of letters and digits. When recognizing such a number, the prior art usually recognizes the characters one by one. Assume that the average recognition accuracy of a single character is p (p < 1) and that the length of the article character number is L (L > 1). The accuracy $P_0$ of recognizing the whole article character number is then the product of the individual character recognition rates, i.e. $P_0 = p^L$. Because p < 1, the longer the article character number, the lower the accuracy $P_0$ of recognizing the whole number.
Disclosure of Invention
To address the technical problem that the longer the article character number, the lower the recognition accuracy of the whole number, the invention provides a method and a system for comparing OCR character recognition results. The method and system split the original article character numbers and the OCR-recognized character string according to an agreed rule, query the split string segment by segment, and calculate the Levenshtein distance for the query results. This makes it possible to determine efficiently whether the recognized article character number is correct and whether it exists, and to report the correct character number if the recognition result is wrong.
To this end, the technical scheme of the invention is a method for comparing OCR character recognition results, comprising the following steps:
S1, collecting original article character number data to obtain a data set R;
S2, splitting each article character number CODE of character length L (L > 1) into N segments (N > 1) according to an agreed rule to obtain a set A = {CODE_1, CODE_2, …, CODE_n};
S3, importing each element in the set A into a data table as a group of records;
S4, denoting the character string obtained by OCR recognition of the article character number to be recognized as CODE_r, and checking the length of CODE_r: if the length is 0, directly returning a recognition-error result; if the length is greater than L, keeping only the first L characters;
S5, after CODE_r has been checked and truncated, splitting it into N segments according to the agreed rule to obtain a set B = {CODE_r1, CODE_r2, …, CODE_rn};
S6, querying the elements in the set B against the corresponding column records in the data table to obtain a query result set C (the length of the set C is greater than or equal to 0), wherein the set C is the set of original article character numbers returned by the query;
S7, calculating in turn the Levenshtein distance between each query result in the set C and CODE_r;
S8, after the Levenshtein distances are calculated, finding the record CODE_L with the smallest distance, CODE_L being the article character number most similar to CODE_r.
Preferably, the data set R is updated and extended in real time.
Preferably, the agreed rule for the article character number CODE is splitting by character-length unit.
Preferably, the agreed rule for the article character number CODE is that every 3 characters form one segment.
Preferably, the agreed rule for the article character number CODE is splitting letters and digits separately, with each run of consecutive letters or consecutive digits forming one segment.
Preferably, the data table is a single table or multiple tables, and multiple tables are associated by the same key value.
A system for comparing OCR character recognition results recognizes character numbers using the above method for comparing OCR character recognition results, and comprises a character number storage module, an OCR recognition module, a character number splitting module, a character number comparison module and a recognition result output module.
The beneficial effects of the method and system are as follows. The original article character numbers and the OCR-recognized character string are each split according to the agreed rule, which is either splitting by a character-length unit or splitting letters and digits separately so that each run of consecutive letters or consecutive digits forms one segment. The split original article character numbers are stored in a data table as the reference records for comparison. The segments of the OCR-recognized string are used to query the corresponding columns of the data table, the Levenshtein distance between each query result and the OCR-recognized string is calculated in turn, and the record with the smallest distance is taken as the final result. In this way it can be effectively determined whether the OCR result for an article character number is correct, and if it is not, the correct character number is reported.
Drawings
FIG. 1 is a flow chart of the method for comparing OCR character recognition results according to the present invention;
FIG. 2 is a schematic diagram of a system for comparing OCR character recognition results according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, the invention provides a method for comparing OCR character recognition results, which comprises the following steps of:
S1, collecting original article character number data to obtain a data set R;
S2, splitting each article character number CODE of character length L into N segments according to an agreed rule to obtain a set A = {CODE_1, CODE_2, …, CODE_n};
S3, importing each element in the set A into a data table as a group of records;
S4, denoting the character string obtained by OCR recognition of the article character number to be recognized as CODE_r, and checking the length of CODE_r: if the length is 0, directly returning a recognition-error result; if the length is greater than L, keeping only the first L characters;
S5, after CODE_r has been checked and truncated, splitting it into N segments according to the agreed rule to obtain a set B = {CODE_r1, CODE_r2, …, CODE_rn};
S6, querying the elements in the set B against the corresponding column records in the data table to obtain a query result set C, wherein the set C is the set of original article character numbers returned by the query;
S7, calculating in turn the Levenshtein distance between each query result in the set C and CODE_r;
S8, after the Levenshtein distances are calculated, finding the record CODE_L with the smallest distance, CODE_L being the article character number most similar to CODE_r.
Specifically, the invention gathers original article character number data to obtain a data set R. The data set R is a set of all original article character numbers, namely a set of already generated article character numbers, and is updated and expanded in real time according to the production progress.
The invention uses CODE to denote an article character number. Each article character number CODE is split into N segments (N > 1) according to the agreed rule to obtain the set A = {CODE_1, CODE_2, …, CODE_n}. The agreed rule for the article character number CODE is splitting by character-length unit, for example every 2 to 5 characters form one segment; the preferred scheme is one segment per 3 characters. Alternatively, the agreed rule is splitting letters and digits separately, with each run of consecutive letters or consecutive digits forming one segment; for example, the article character number AB1CD2473 is split into AB, 1, CD, 2473.
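The two agreed rules described above can be sketched in Python as follows. This is a minimal illustration only: the function names `split_fixed` and `split_alnum_runs` are hypothetical and do not appear in the patent, and the sketch assumes the codes contain only ASCII letters and digits.

```python
import re


def split_fixed(code: str, size: int = 3) -> list[str]:
    """Split a code into consecutive chunks of `size` characters.

    The last chunk is shorter when len(code) is not a multiple of `size`.
    """
    if size < 1:
        raise ValueError("size must be positive")
    return [code[i:i + size] for i in range(0, len(code), size)]


def split_alnum_runs(code: str) -> list[str]:
    """Split a code into runs of consecutive letters or consecutive digits.

    Assumption: any character that is neither an ASCII letter nor a digit
    is simply skipped by this sketch.
    """
    return re.findall(r"[A-Za-z]+|[0-9]+", code)


print(split_fixed("AB1CD2EF3"))       # ['AB1', 'CD2', 'EF3']
print(split_alnum_runs("AB1CD2473"))  # ['AB', '1', 'CD', '2473']
```

With the fixed-length rule every code of length L yields the same number of segments, which maps directly onto a fixed set of table columns; the letter/digit rule yields a variable number of segments per code.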
Based on the above, the invention splits the data set R into N columns according to the agreed rule, and the columns are then imported as a group of records into a single database table. They can also be imported into multiple tables, which use the same key value as the association.
The SQL pseudo code is as follows:
INSERT INTO table_code (code, code_1, code_2, …, code_n) VALUES (CODE, CODE_1, CODE_2, …, CODE_n);
Example (let L = 9 and N = 3; agreed split rule: one segment per three characters). Table structure and data:
sequence number | code      | code_1 | code_2 | code_3
1               | AB1CD2EF3 | AB1    | CD2    | EF3
2               | 67x78y89z | 67x    | 78y    | 89z
3               | KKK555www | KKK    | 555    | www
4               | W29CD2t36 | W29    | CD2    | t36
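As a rough sketch of how the example table might be populated, the following Python snippet uses the standard `sqlite3` module with an in-memory database. The choice of SQLite, the `id` column and the helper name `build_table` are assumptions made for illustration; the patent only gives the SQL pseudo code and the table contents. The sketch also assumes every code has exactly 9 characters, so that it yields exactly three segments.

```python
import sqlite3

SEGMENT_SIZE = 3  # agreed rule in the example: three characters per segment


def build_table(codes):
    """Create an in-memory table_code table and load the split codes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_code ("
        "id INTEGER PRIMARY KEY, code TEXT UNIQUE, "
        "code_1 TEXT, code_2 TEXT, code_3 TEXT)"
    )
    rows = []
    for code in codes:
        # Each 9-character code becomes three 3-character segments.
        segments = [code[i:i + SEGMENT_SIZE]
                    for i in range(0, len(code), SEGMENT_SIZE)]
        rows.append((code, *segments))
    conn.executemany(
        "INSERT INTO table_code (code, code_1, code_2, code_3) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


conn = build_table(["AB1CD2EF3", "67x78y89z", "KKK555www", "W29CD2t36"])
for row in conn.execute("SELECT * FROM table_code ORDER BY id"):
    print(row)
```

Parameter placeholders (`?`) are used instead of string formatting so that code values are never interpolated into the SQL text.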
The character string obtained by OCR recognition of the article character number to be recognized is denoted CODE_r.
First, the length of CODE_r is checked. If the length is 0, an error is returned directly. If the length is greater than L, only the first L characters are kept. CODE_r is then used as the query condition to check whether the database contains an exact match; if it does, the OCR recognition is accurate, otherwise the process continues.
Example: for CODE_r = 67x78y89z, the query matches the record with sequence number 2, indicating that the OCR recognition result is accurate.
CODE_r is split into N segments according to the agreed rule to obtain the set B = {CODE_r1, CODE_r2, …, CODE_rn}.
The elements of the set B are queried against the corresponding columns of the data table to obtain the query result set C (the length of C is greater than or equal to 0).
The SQL pseudo code is as follows:
SELECT * FROM table_code WHERE (code_1 = CODE_r1 OR code_2 = CODE_r2 OR … OR code_n = CODE_rn);
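The segment-wise query can be sketched with Python's `sqlite3` module as below. The snippet rebuilds the example table so that it runs on its own; the names `find_candidates`, `split_code`, `CODE_LENGTH` and the `id` column are illustrative assumptions, not part of the patent.

```python
import sqlite3

CODE_LENGTH = 9   # L in the worked example
SEGMENT_SIZE = 3  # three characters per segment
COLUMNS = ["code_1", "code_2", "code_3"]


def split_code(code):
    """Split a code into 3-character segments."""
    return [code[i:i + SEGMENT_SIZE] for i in range(0, len(code), SEGMENT_SIZE)]


def find_candidates(conn, code_r):
    """Return stored codes sharing at least one aligned segment with code_r."""
    segments = split_code(code_r[:CODE_LENGTH])  # keep only the first L characters
    if not segments:
        return []  # empty OCR string: nothing to query
    # Column names come from a fixed list, so building the WHERE clause
    # with string joins is safe; segment values are bound as parameters.
    columns = COLUMNS[:len(segments)]
    where = " OR ".join(f"{column} = ?" for column in columns)
    sql = f"SELECT code FROM table_code WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, segments)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_code (id INTEGER PRIMARY KEY, code TEXT, "
    "code_1 TEXT, code_2 TEXT, code_3 TEXT)"
)
for code in ["AB1CD2EF3", "67x78y89z", "KKK555www", "W29CD2t36"]:
    conn.execute(
        "INSERT INTO table_code (code, code_1, code_2, code_3) "
        "VALUES (?, ?, ?, ?)",
        (code, *split_code(code)),
    )

print(find_candidates(conn, "W26CD2t36"))  # ['AB1CD2EF3', 'W29CD2t36']
```

Because the conditions are joined with OR, a stored code becomes a candidate as soon as one of its segments equals the segment at the same position in the OCR string, even if the other segments were misread.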
The Levenshtein distance between CODE_r and each query result in the set C is calculated in turn. The Levenshtein distance is a type of edit distance: the minimum number of editing operations required to transform one string into the other. The allowed editing operations are replacing one character with another, inserting one character, and deleting one character.
After the Levenshtein distances are calculated, the record with the smallest distance, CODE_L, is found. CODE_L is the article character number most similar to CODE_r.
Example: for CODE_r = W26CD2t36, the query yields C = {AB1CD2EF3, W29CD2t36}. After the minimum-distance calculation, the number closest to CODE_r (W26CD2t36) is W29CD2t36.
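The distance calculation and the selection of the closest candidate can be illustrated with a short Python sketch. It uses the standard dynamic-programming formulation of the Levenshtein distance; the patent does not prescribe a particular implementation, and the function names `levenshtein` and `closest_code` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost substitution, insertion and deletion."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def closest_code(code_r, candidates):
    """Return (code, distance) for the candidate nearest to code_r, or None."""
    if not candidates:
        return None
    scored = ((c, levenshtein(code_r, c)) for c in candidates)
    return min(scored, key=lambda pair: pair[1])


candidates = ["AB1CD2EF3", "W29CD2t36"]
print(levenshtein("W26CD2t36", "AB1CD2EF3"))  # 6
print(closest_code("W26CD2t36", candidates))  # ('W29CD2t36', 1)
```

In the worked example the OCR string differs from W29CD2t36 by a single substitution (6 instead of 9), whereas six edits are needed to reach AB1CD2EF3, so W29CD2t36 is selected.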
The method improves the recognition accuracy and recognition efficiency of a given character string by splitting the character number of the article, and finds out the actual number which is most similar to the character number.
The principle by which the method improves recognition accuracy is that, for the same OCR, the accuracy of recognizing any single element of the set A, {P_1, P_2, …, P_n}, is greater than $P_0$. That is, $p^3 > p^L$ where $L > 3$.
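A quick numerical check makes the inequality concrete. The per-character accuracy of 0.98 used below is a hypothetical value chosen for illustration and is not taken from the patent; the sketch also assumes, as the background section does, that characters are recognized independently with the same accuracy p.

```python
def sequence_accuracy(p: float, length: int) -> float:
    """Probability that `length` independent characters are all read correctly."""
    return p ** length


p = 0.98  # hypothetical per-character accuracy
L = 9     # code length used in the worked example

whole = sequence_accuracy(p, L)    # P_0 = p^L for the whole code
segment = sequence_accuracy(p, 3)  # accuracy for one 3-character segment

print(f"whole code: {whole:.4f}, one segment: {segment:.4f}")
```

Under these assumptions a single 3-character segment is read correctly about 94% of the time, while the full 9-character code is read correctly only about 83% of the time, which is why matching on individual segments recovers candidates that a whole-string match would miss.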
Example 2
As shown in FIG. 2, a system for comparing OCR character recognition results comprises a character number acquisition module, a character number storage module, an OCR recognition module, a character number splitting module, a character number comparison module and a recognition result output module.
1. The character number acquisition module collects the original article character number data to obtain the data set R. The character number storage module stores the data of the data set R, and the data set R is updated and extended in real time as original article numbers are generated.
2. The character number splitting module splits each article character number CODE of character length L (L > 1) into N segments (N > 1) according to the agreed rule to obtain the set A = {CODE_1, CODE_2, …, CODE_n}.
3. The character number storage module stores each element in the set A as a group of records in a data table.
4. The OCR recognition module denotes the character string obtained by OCR recognition of the article character number to be recognized as CODE_r and checks the length of CODE_r. If the length is 0, a recognition-error result is returned directly; if the length is greater than L, only the first L characters are kept. CODE_r and the truncation result are stored in the character number storage module.
5. After CODE_r has been checked and truncated, the character number splitting module likewise splits it into N segments according to the agreed rule to obtain the set B = {CODE_r1, CODE_r2, …, CODE_rn}. The character number storage module stores the data of the set B.
6. The character number comparison module queries the elements in the set B against the corresponding column records in the data table to obtain the query result set C (the length of C is greater than or equal to 0), wherein the set C is the set of original article character numbers returned by the query.
7. The character number comparison module calculates in turn the Levenshtein distance between each query result in the set C and CODE_r.
8. After the Levenshtein distances are calculated, the record CODE_L with the smallest distance is found; CODE_L is the article character number most similar to CODE_r.
9. The recognition result output module outputs the recognition result. If the length of CODE_r is 0, it outputs an article character recognition error. If the Levenshtein distance calculation result is 0, it outputs that the recognition result is correct. If the Levenshtein distance calculation result is not 0, it outputs the most similar article character number.
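The interaction of the modules, and in particular the three outcomes produced by the recognition result output module, might look like the following end-to-end Python sketch. For brevity it keeps the original numbers in a plain list rather than a database table, and it uses the fixed 3-character rule with L = 9. The `Result` class, the status labels and the handling of an empty candidate set (`"no_match"`) are assumptions; the patent does not specify what is output when the set C is empty.

```python
from dataclasses import dataclass
from typing import Optional

CODE_LENGTH = 9   # L in the worked example
SEGMENT_SIZE = 3  # agreed split rule: three characters per segment


@dataclass
class Result:
    status: str          # "error", "correct", "corrected" or "no_match"
    code: Optional[str]  # best-matching original code, if any


def split_code(code):
    return [code[i:i + SEGMENT_SIZE] for i in range(0, len(code), SEGMENT_SIZE)]


def levenshtein(a, b):
    """Unit-cost edit distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def compare(code_r, originals):
    """Check an OCR string against the stored original codes."""
    if len(code_r) == 0:
        return Result("error", None)          # length 0: recognition error
    code_r = code_r[:CODE_LENGTH]             # keep only the first L characters
    parts = split_code(code_r)
    # Set C: originals sharing at least one segment at the same position.
    candidates = [c for c in originals
                  if any(x == y for x, y in zip(split_code(c), parts))]
    if not candidates:
        return Result("no_match", None)       # assumption: not covered by the source
    best = min(candidates, key=lambda c: levenshtein(code_r, c))
    status = "correct" if levenshtein(code_r, best) == 0 else "corrected"
    return Result(status, best)


originals = ["AB1CD2EF3", "67x78y89z", "KKK555www", "W29CD2t36"]
print(compare("67x78y89z", originals))  # Result(status='correct', code='67x78y89z')
print(compare("W26CD2t36", originals))  # Result(status='corrected', code='W29CD2t36')
```

A distance of 0 corresponds to the "correct recognition" output, and any other distance to reporting the most similar article character number.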
The method and system split both the original article character numbers and the OCR-recognized character string according to an agreed rule: either splitting by a character-length unit, or splitting letters and digits separately so that each run of consecutive letters or consecutive digits forms one segment. The split original article character numbers are stored in a data table as the reference records for comparison. The segments of the OCR-recognized string are used to query the corresponding columns of the data table, the Levenshtein distance between each query result and the OCR-recognized string is calculated in turn, and the record with the smallest distance is taken as the final result. In this way it can be effectively determined whether the OCR result for an article character number is correct, and if it is not, the correct character number is reported.
However, the foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, so that the substitution of equivalent elements or equivalent variations and modifications within the scope of the invention are intended to fall within the scope of the claims.

Claims (7)

1. A method of comparing OCR character recognition results, comprising:
S1, collecting original article character number data to obtain a data set R;
S2, splitting each article character number CODE of character length L into N segments according to an agreed rule to obtain a set A = {CODE_1, CODE_2, …, CODE_n};
S3, importing each element in the set A into a data table as a group of records;
S4, denoting the character string obtained by OCR recognition of the article character number to be recognized as CODE_r, and checking the length of CODE_r: if the length is 0, directly returning a recognition-error result; if the length is greater than L, keeping only the first L characters;
S5, after CODE_r has been checked and truncated, splitting it into N segments according to the agreed rule to obtain a set B = {CODE_r1, CODE_r2, …, CODE_rn};
S6, querying the elements in the set B against the corresponding column records in the data table to obtain a query result set C, wherein the set C is the set of original article character numbers returned by the query;
S7, calculating in turn the Levenshtein distance between each query result in the set C and CODE_r;
S8, after the Levenshtein distances are calculated, finding the record CODE_L with the smallest distance, CODE_L being the article character number most similar to CODE_r.
2. The method of comparing OCR character recognition results according to claim 1, wherein the data set R is updated and extended in real time.
3. The method of comparing OCR character recognition results according to claim 1, wherein the agreed rule for the article character number CODE is splitting by character-length unit.
4. The method of comparing OCR character recognition results according to claim 3, wherein the agreed rule for the article character number CODE is that every 3 characters form one segment.
5. The method of comparing OCR character recognition results according to claim 1, wherein the agreed rule for the article character number CODE is splitting letters and digits separately, with each run of consecutive letters or consecutive digits forming one segment.
6. The method of comparing OCR character recognition results according to claim 1, wherein the data table is a single table or multiple tables, the multiple tables being associated by the same key value.
7. A system for comparing OCR character recognition results, which recognizes character numbers using the method of comparing OCR character recognition results according to any one of claims 1-6, characterized by comprising a character number storage module, an OCR recognition module, a character number splitting module, a character number comparison module and a recognition result output module.
CN202311021186.4A 2023-08-15 2023-08-15 Method and system for comparing OCR character recognition results Active CN116758564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311021186.4A CN116758564B (en) 2023-08-15 2023-08-15 Method and system for comparing OCR character recognition results

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311021186.4A CN116758564B (en) 2023-08-15 2023-08-15 Method and system for comparing OCR character recognition results

Publications (2)

Publication Number Publication Date
CN116758564A (en) 2023-09-15
CN116758564B (en) 2023-11-10

Family

ID=87959388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311021186.4A Active CN116758564B (en) 2023-08-15 2023-08-15 Method and system for comparing OCR character recognition results

Country Status (1)

Country Link
CN (1) CN116758564B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050123203A1 (en) * 2003-12-04 2005-06-09 International Business Machines Corporation Correcting segmentation errors in OCR
US20120128250A1 (en) * 2009-12-02 2012-05-24 David Petrou Generating a Combination of a Visual Query and Matching Canonical Document
US20160019430A1 (en) * 2014-07-21 2016-01-21 Optum, Inc. Targeted optical character recognition (ocr) for medical terminology
US20160125275A1 (en) * 2014-10-31 2016-05-05 Kabushiki Kaisha Toshiba Character recognition device, image display device, image retrieval device, character recognition method, and computer program product
CN106650715A (en) * 2016-10-26 2017-05-10 西安电子科技大学 Method for detecting and correcting errors of OCR recognition results of character strings according to permission set
CN111259894A (en) * 2020-01-20 2020-06-09 普信恒业科技发展(北京)有限公司 Certificate information identification method and device and computer equipment
CN111680679A (en) * 2020-06-03 2020-09-18 重庆数道科技有限公司 Automatic document identification method based on OCR
JP2021022261A (en) * 2019-07-30 2021-02-18 富士通フロンテック株式会社 Correction candidate determination device, correction candidate determination method and program
CN113128504A (en) * 2021-04-25 2021-07-16 福州符号信息科技有限公司 OCR recognition result error correction method and device based on verification rule
CN113392833A (en) * 2021-06-10 2021-09-14 沈阳派得林科技有限责任公司 Method for identifying type number of industrial radiographic negative image
CN115600564A (en) * 2022-10-09 2023-01-13 上海纽酷信息科技有限公司(Cn) Form rapid construction method based on OCR recognition technology
CN115830618A (en) * 2022-12-06 2023-03-21 北京闪星科技有限公司 Text recognition method and device, computer equipment and storage medium
CN116229484A (en) * 2023-01-31 2023-06-06 支付宝(杭州)信息技术有限公司 Text recognition method, list scanning method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
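ROHIT SALUJA et al.: "Sub-Word Embeddings for OCR Corrections in Highly Fusional Indic Languages", IEEE, page 172 *
ZHAO Li: "OCR-based spelling correction system", Ordnance Industry Automation (兵工自动化), pages 96-98 *
GUO Zhi et al.: "Knowledge representation method fusing attribute information", Science Technology and Engineering (科学技术与工程), pages 262-268 *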

Also Published As

Publication number Publication date
CN116758564B (en) 2023-11-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant