CN116758564A

CN116758564A - Method and system for comparing OCR character recognition results

Info

Publication number: CN116758564A
Application number: CN202311021186.4A
Authority: CN
Inventors: 张宏坤; 李慧; 刘子禛; 于龙; 姜建宁; 王军
Original assignee: Shandong Lvxin Siyuan Counter Forgery Technology Co ltd
Current assignee: Shandong Lvxin Siyuan Counter Forgery Technology Co ltd
Priority date: 2023-08-15
Filing date: 2023-08-15
Publication date: 2023-09-15
Anticipated expiration: 2043-08-15
Also published as: CN116758564B

Abstract

The invention provides a method and a system for comparing OCR character recognition results and relates to the technical field of character recognition. The original article character numbers and the character string recognized by OCR are each split according to an agreed rule: either into units of a fixed character length, or into separate runs of letters and digits, where consecutive letters or consecutive digits form one segment. The split original article character numbers are stored in a data table as the reference records for comparison. The segments of the OCR-recognized string are used to query the corresponding columns of the data table, the Levenshtein distance between each query result and the OCR-recognized string is calculated in turn, and the record with the smallest distance is taken as the final result. In this way it is effectively determined whether the OCR recognition result for an article character number is correct and, if it is not, the correct character number is reported.

Description

Method and system for comparing OCR character recognition results

Technical Field

The invention relates to the technical field of OCR character recognition, in particular to a method and a system for comparing OCR character recognition results.

Background

In every field of production and processing, articles such as materials, accessories and products are given unique numbers to facilitate management and traceability. The character number and a graphic code (bar code or two-dimensional code) are usually attached to the article by printing or pasting. In some cases, however, practical constraints mean that only the character number can be attached to the article.

An article character number is an unordered mixture of letters and digits. In the prior art, such a number is usually recognized one character at a time. Suppose the average recognition accuracy of a single character is $p$ ($p < 1$) and the length of the article character number is $L$ ($L > 1$). The accuracy $P_0$ of recognizing the whole article character number is then the product of the individual character recognition rates, i.e. $P_0 = p^L$. Because $p < 1$, the longer the article character number, the lower the accuracy $P_0$ of recognizing it as a whole.
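As a rough numeric illustration of the relation $P_0 = p^L$, the following sketch evaluates it for a few lengths. The function name and the value p = 0.95 are assumptions chosen only for illustration; the source gives no measured accuracy.

```python
def whole_code_accuracy(p: float, length: int) -> float:
    """Accuracy of recognizing a whole code of `length` characters when
    every character is recognized with accuracy p (P0 = p ** L)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if length < 1:
        raise ValueError("length must be at least 1")
    return p ** length


if __name__ == "__main__":
    # p = 0.95 is an illustrative value, not taken from the source.
    for length in (3, 6, 9, 12):
        print(length, round(whole_code_accuracy(0.95, length), 3))
```

With these assumed values, the accuracy for the whole code falls steadily as the length grows, which is the problem the invention sets out to address.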

Disclosure of Invention

To address the technical problem that the longer an article character number is, the lower the accuracy of recognizing it as a whole, the invention provides a method and a system for comparing OCR character recognition results. The method and the system split the original article character numbers and the OCR-recognized character string according to an agreed rule, query the split character string segment by segment, and calculate the Levenshtein distance for the query results, thereby efficiently determining whether the recognized article character number is correct and exists and, if it is not correct, reporting the correct character number.

To this end, the technical scheme of the invention is a method for comparing OCR character recognition results, comprising the following steps:

S1, collecting original article character number data to obtain a data set R;

S2, splitting each article character number CODE of character length $L$ ($L > 1$) into $N$ segments ($N > 1$) according to an agreed rule to obtain the set $A = \{CODE_1, CODE_2, \ldots, CODE_n\}$;

S3, importing each element of set A into a data table as a group of records;

S4, denoting the character string obtained by OCR recognition of the article character number to be recognized as $CODE_r$, and checking the length of $CODE_r$: if the length is 0, directly returning a recognition-error result; if the length is greater than $L$, keeping only the first $L$ characters;

S5, after the check and truncation, splitting $CODE_r$ into $N$ segments according to the agreed rule to obtain the set $B = \{CODE_{r1}, CODE_{r2}, \ldots, CODE_{rn}\}$;

S6, querying the corresponding columns of the data table with the elements of set B to obtain a query result set C (the size of set C is greater than or equal to 0), set C being the set of original article character numbers found by the query;

S7, calculating in turn the Levenshtein distance between each query result in set C and $CODE_r$;

S8, after the Levenshtein distance calculation, finding the record $CODE_L$ with the smallest distance, $CODE_L$ being the article character number most similar to $CODE_r$.

Preferably, the data set R is updated and extended in real time.

Preferably, the agreed rule for the article character number CODE is to split by units of character length.

Preferably, the agreed rule for the article character number CODE is that every 3 characters are split into one segment.

Preferably, the agreed rule for the article character number CODE is to split letters and digits separately, with consecutive letters or consecutive digits forming one segment.

Preferably, the data table is a single table or multiple tables, and multiple tables are associated through the same key value.

A system for comparing OCR character recognition results adopts the method for comparing OCR character recognition results to recognize character numbers, and comprises a character number storage module, an OCR recognition module, a character number splitting module, a character number comparison module and a recognition result output module.

The beneficial effects of the method and the system are as follows. The original article character numbers and the character string recognized by OCR are each split according to the agreed rule, which is either to split by units of character length or to split letters and digits separately, with consecutive letters or consecutive digits forming one segment. The split original article character numbers are stored in a data table as the reference records for comparison. The segments of the OCR-recognized string are used to query the corresponding columns of the data table, the Levenshtein distance between each query result and the OCR-recognized string is calculated in turn, and the record with the smallest distance is taken as the final result. In this way it is effectively determined whether the OCR recognition result for an article character number is correct and, if it is not, the correct character number is reported.

Drawings

FIG. 1 is a flow chart of the method for comparing OCR character recognition results according to the present invention;

FIG. 2 is a schematic diagram of a system for comparing OCR character recognition results according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

As shown in FIG. 1, the invention provides a method for comparing OCR character recognition results, which comprises the following steps:

S1, collecting original article character number data to obtain a data set R;

S2, splitting each article character number CODE of character length $L$ into $N$ segments according to an agreed rule to obtain the set $A = \{CODE_1, CODE_2, \ldots, CODE_n\}$;

S6, querying the corresponding columns of the data table with the elements of set B to obtain a query result set C, set C being the set of original article character numbers found by the query;

Specifically, the invention gathers original article character number data to obtain a data set R. The data set R is a set of all original article character numbers, namely a set of already generated article character numbers, and is updated and expanded in real time according to the production progress.

The present invention uses CODE to denote an article character number. Each article character number CODE is split into $N$ segments ($N > 1$) according to the agreed rule to obtain the set $A = \{CODE_1, CODE_2, \ldots, CODE_n\}$. The agreed rule for the article character number CODE is to split by units of character length, for example one segment every 2 to 5 characters, preferably one segment every 3 characters. Alternatively, the agreed rule is to split letters and digits separately, with consecutive letters or consecutive digits forming one segment; for example, the article character number AB1CD2473 is split into AB, 1, CD, 2473.
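The two agreed rules can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the run-based splitter assumes codes consist only of ASCII letters and digits (any other character would be dropped by the pattern).

```python
import re
from typing import List


def split_fixed(code: str, size: int = 3) -> List[str]:
    """Split a code into consecutive segments of `size` characters.
    The last segment is shorter when len(code) is not a multiple of size."""
    if size < 1:
        raise ValueError("size must be positive")
    return [code[i:i + size] for i in range(0, len(code), size)]


def split_runs(code: str) -> List[str]:
    """Split a code into maximal runs of letters or of digits.
    Assumes ASCII letters and digits only."""
    return re.findall(r"[A-Za-z]+|[0-9]+", code)


if __name__ == "__main__":
    print(split_fixed("AB1CD2EF3"))  # fixed-length rule, 3 characters each
    print(split_runs("AB1CD2473"))   # letter/digit run rule
```

Note that the run-based rule yields a variable number of segments per code, so a table built on it would need enough segment columns for the longest case; the source does not say how this is handled.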

Based on the above, the invention splits each record of the data set R into N columns according to the agreed rule, and each column is then imported as a group of records into a single database table (the records may also be imported into multiple tables, associated through the same key value).

The SQL pseudocode is as follows:

INSERT INTO table_code (code, code_1, code_2, ..., code_n) VALUES (CODE, CODE_1, CODE_2, ..., CODE_n);

Example (let L = 9 and N = 3; agreed splitting rule: one segment every three characters). Table structure and data:

sequence number	code	code_1	code_2	code_3
1	AB1CD2EF3	AB1	CD2	EF3
2	67x78y89z	67x	78y	89z
3	KKK555www	KKK	555	www
4	W29CD2t36	W29	CD2	t36
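A minimal sketch of how such a table could be populated, using Python's built-in sqlite3 module with an in-memory database. The table and column names follow the pseudocode; the choice of SQLite, the `id` column and the function name are assumptions made for illustration, and the segment slicing is hard-coded for L = 9 and N = 3 as in the example.

```python
import sqlite3
from typing import Iterable


def build_code_table(codes: Iterable[str]) -> sqlite3.Connection:
    """Create an in-memory table_code and insert each 9-character code
    together with its three 3-character segments."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_code ("
        "id INTEGER PRIMARY KEY, code TEXT UNIQUE, "
        "code_1 TEXT, code_2 TEXT, code_3 TEXT)"
    )
    # One row per original code: the full code plus its segments.
    rows = [(c, c[0:3], c[3:6], c[6:9]) for c in codes]
    conn.executemany(
        "INSERT INTO table_code (code, code_1, code_2, code_3) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_code_table(["AB1CD2EF3", "67x78y89z", "KKK555www", "W29CD2t36"])
    for row in conn.execute("SELECT * FROM table_code ORDER BY id"):
        print(row)
```

In a real deployment, indexing the segment columns would probably help the later segment queries; the source does not discuss indexing.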

The invention denotes the character string obtained by OCR recognition of the article character number to be recognized as $CODE_r$.

First, the length of $CODE_r$ is checked. If the length is 0, an error is returned directly. If the length is greater than $L$, only the first $L$ characters are kept. Then $CODE_r$ is used as the query condition to check whether the database contains a matching record; if it does, the OCR recognition is accurate, otherwise processing continues.

Example ($CODE_r$ = 67x78y89z): the query matches the record with sequence number 2, indicating that the OCR recognition result is accurate.
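The length check, truncation and exact-match test can be sketched as below. The function names are hypothetical, and a plain Python set stands in for the database query described in the text, purely to keep the example short.

```python
from typing import Optional, Set


def normalize_ocr(code_r: str, length: int) -> Optional[str]:
    """Return None for an empty OCR string (recognition error),
    otherwise the string truncated to at most `length` characters."""
    if len(code_r) == 0:
        return None
    return code_r[:length]


def is_exact_match(code_r: str, known_codes: Set[str]) -> bool:
    """True when the OCR string equals a stored original code."""
    return code_r in known_codes


if __name__ == "__main__":
    known = {"AB1CD2EF3", "67x78y89z", "KKK555www", "W29CD2t36"}
    code_r = normalize_ocr("67x78y89z", 9)
    if code_r is None:
        print("recognition error")
    elif is_exact_match(code_r, known):
        print("OCR result is accurate:", code_r)
    else:
        print("no exact match, continue with the segment query")
```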

$CODE_r$ is then split into N segments according to the agreed rule to obtain the set $B = \{CODE_{r1}, CODE_{r2}, \ldots, CODE_{rn}\}$.

The elements of set B are used to query the corresponding columns of the data table, obtaining the query result set C (the size of C is greater than or equal to 0).

The SQL pseudocode is as follows:

SELECT * FROM table_code WHERE (code_1 = CODE_r1 OR code_2 = CODE_r2 OR ... OR code_n = CODE_rn);
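The pseudocode above can be turned into a parameterized query, sketched here with Python's sqlite3 module. The function name is hypothetical, and it assumes a `table_code` table with columns `code`, `code_1` … `code_n` already exists on the given connection; the `ORDER BY` clause is added only to make the output deterministic.

```python
import sqlite3
from typing import List, Sequence


def query_candidates(conn: sqlite3.Connection, segments: Sequence[str]) -> List[str]:
    """Return every stored code that shares at least one segment, in the
    same position, with the OCR segments (the result set C)."""
    if not segments:
        return []
    # Column names are generated from the position only; the segment
    # values are passed as bound parameters, never interpolated.
    where = " OR ".join(f"code_{i} = ?" for i in range(1, len(segments) + 1))
    sql = f"SELECT code FROM table_code WHERE {where} ORDER BY code"
    return [row[0] for row in conn.execute(sql, tuple(segments))]
```

Because the conditions are joined with OR, a stored code is returned as soon as any one of its segments was recognized correctly, which is what lets a partly misread string still reach its original record.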

The Levenshtein distance between each query result in set C and $CODE_r$ is calculated in turn. The Levenshtein distance is a kind of edit distance: it is the minimum number of editing operations required to transform one string into the other. The allowed editing operations are replacing one character with another, inserting a character, and deleting a character.

After the Levenshtein distances are calculated, the record with the smallest distance, $CODE_L$, is found. $CODE_L$ is the article character number most similar to $CODE_r$.

Example ($CODE_r$ = W26CD2t36): the query yields C = {AB1CD2EF3, W29CD2t36}. After the minimum-distance calculation, the number closest to $CODE_r$ (W26CD2t36) is W29CD2t36.
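The distance calculation and the minimum-distance selection can be sketched as follows. This uses the standard two-row dynamic-programming formulation of the Levenshtein distance; the function names are hypothetical, and the source does not state which implementation it uses or how ties between equally close candidates are resolved (here the first one encountered wins).

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_code(code_r: str, candidates: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (code, distance) for the candidate nearest to code_r,
    or None when the candidate set C is empty."""
    best: Optional[Tuple[str, int]] = None
    for code in candidates:
        distance = levenshtein(code, code_r)
        if best is None or distance < best[1]:
            best = (code, distance)
    return best


if __name__ == "__main__":
    print(closest_code("W26CD2t36", ["AB1CD2EF3", "W29CD2t36"]))
```

For the example in the text, W29CD2t36 differs from the OCR string by a single substitution (9 read as 6), so it is selected over AB1CD2EF3.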

By splitting the article character number, the method improves the accuracy and efficiency of recognizing a given character string and finds the actual number most similar to it.

The principle by which the method improves recognition accuracy is that, with the same OCR, the accuracies $\{P_1, P_2, \ldots, P_n\}$ of recognizing any single element of set A are all greater than $P_0$. That is, for segments of 3 characters, $p^3 > p^L$ where $L > 3$.
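The inequality follows directly from $0 < p < 1$, as the short derivation below shows. The numeric values p = 0.95 and L = 9 are assumptions chosen only for illustration.

```latex
\[
\frac{p^{3}}{p^{L}} = p^{\,3-L} = \left(\frac{1}{p}\right)^{L-3} > 1
\qquad \text{for } 0 < p < 1,\ L > 3 ,
\]
\[
\text{e.g. } p = 0.95,\ L = 9:\qquad
p^{3} \approx 0.857, \qquad p^{L} = p^{9} \approx 0.630 .
\]
```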

Example 2

As shown in FIG. 2, a system for comparing OCR character recognition results comprises a character number acquisition module, a character number storage module, an OCR recognition module, a character number splitting module, a character number comparison module and a recognition result output module.

1. The character number acquisition module collects the original article character number data to obtain a data set R, and the character number storage module stores the data of the data set R; the data set R is updated and extended in real time as original article numbers are generated.

2. The character number splitting module splits each article character number CODE of character length $L$ ($L > 1$) into $N$ segments ($N > 1$) according to the agreed rule to obtain the set $A = \{CODE_1, CODE_2, \ldots, CODE_n\}$.

3. The character number storage module stores each element of set A in a data table as a group of records.

4. The OCR recognition module denotes the character string obtained by OCR recognition of the article character number to be recognized as $CODE_r$ and checks the length of $CODE_r$: if the length is 0, a recognition-error result is returned directly; if the length is greater than $L$, only the first $L$ characters are kept. $CODE_r$ and the truncation result are stored in the character number storage module.

5. After $CODE_r$ has been checked and truncated, the character number splitting module likewise splits it into $N$ segments according to the agreed rule to obtain the set $B = \{CODE_{r1}, CODE_{r2}, \ldots, CODE_{rn}\}$, and the character number storage module stores the data of set B.

6. The character number comparison module queries the corresponding columns of the data table with the elements of set B to obtain a query result set C (the size of C is greater than or equal to 0), set C being the set of original article character numbers found by the query.

7. The character number comparison module calculates in turn the Levenshtein distance between each query result in set C and $CODE_r$.

8. After the Levenshtein distance calculation, the record $CODE_L$ with the smallest distance is found; $CODE_L$ is the article character number most similar to $CODE_r$.

9. The recognition result output module outputs the recognition result. If the length of $CODE_r$ is found to be 0, an article character recognition error is output; if the Levenshtein distance calculation result is 0, a correct recognition result is output; if the Levenshtein distance calculation result is not 0, the closest article character number is output.
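The three output cases of the recognition result output module can be sketched as a small decision function. The function name and the message strings are hypothetical; the branch for an empty candidate set is an assumption, since the source allows set C to be empty but does not say what is output in that case.

```python
from typing import Optional


def format_result(code_r: str, best_code: Optional[str], distance: Optional[int]) -> str:
    """Map the comparison outcome to one of the output cases."""
    if len(code_r) == 0:
        return "recognition error"
    if best_code is None or distance is None:
        # Assumption: the source does not specify this case.
        return "no candidate found"
    if distance == 0:
        return f"recognized correctly: {best_code}"
    return f"closest code: {best_code} (distance {distance})"


if __name__ == "__main__":
    print(format_result("", None, None))
    print(format_result("67x78y89z", "67x78y89z", 0))
    print(format_result("W26CD2t36", "W29CD2t36", 1))
```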

In summary, the method and the system split the original article character numbers and the character string recognized by OCR according to an agreed rule, which is either to split by units of character length or to split letters and digits separately, with consecutive letters or consecutive digits forming one segment. The split original article character numbers are stored in a data table as the reference records for comparison. The segments of the OCR-recognized string are used to query the corresponding columns of the data table, the Levenshtein distance between each query result and the OCR-recognized string is calculated in turn, and the record with the smallest distance is taken as the final result. In this way it is effectively determined whether the OCR recognition result for an article character number is correct and, if it is not, the correct character number is reported.

However, the foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, so that the substitution of equivalent elements or equivalent variations and modifications within the scope of the invention are intended to fall within the scope of the claims.

Claims

1. A method of comparing OCR character recognition results, comprising:

S1, collecting original article character number data to obtain a data set R;

S8, after the Levenshtein distance calculation, finding the record $CODE_L$ with the smallest distance, $CODE_L$ being the article character number most similar to $CODE_r$.

2. The method of comparing OCR character recognition results according to claim 1, characterized in that the data set R is updated and extended in real time.

3. The method of comparing OCR character recognition results according to claim 1, wherein the agreed rule for the article character number CODE is to split by units of character length.

4. The method of comparing OCR character recognition results according to claim 3, wherein the agreed rule for the article character number CODE is that every 3 characters are split into one segment.

5. The method of comparing OCR character recognition results according to claim 1, wherein the agreed rule for the article character number CODE is to split letters and digits separately, with consecutive letters or consecutive digits forming one segment.

6. The method of comparing OCR character recognition results according to claim 1, wherein the data table is a single table or multiple tables, and multiple tables are associated through the same key value.

7. A system for comparing OCR character recognition results, which is used for recognizing character numbers by adopting the method for comparing OCR character recognition results according to any one of claims 1-6, and is characterized by comprising a character number storage module, an OCR recognition module, a character number splitting module, a character number comparison module and a recognition result output module.