CN109447055B - OCR (optical character recognition)-based method for recognizing characters with similar glyphs

Info

Publication number: CN109447055B
Application number: CN201811211186.XA
Authority: CN
Inventors: 席敬; 焦勇; 伏虎
Original assignee: China Telecom Wanwei Information Technology Co Ltd
Current assignee: China Telecom Wanwei Information Technology Co Ltd
Priority date: 2018-10-17
Filing date: 2018-10-17
Publication date: 2022-05-03
Anticipated expiration: 2038-10-17
Also published as: CN109447055A

Abstract

The invention relates to the technical field of computers, in particular to pattern recognition and deep learning, and more particularly to an OCR (optical character recognition)-based method for recognizing characters with similar glyphs. It departs from the traditional glyph recognition approach in that both the text and the font of a character can be identified: multi-sample comparison and threshold screening greatly improve text recognition accuracy, and the font of the character is effectively identified. The method is particularly suitable for recognizing characters with similar glyphs and similar fonts, and achieves accurate recognition of both glyph and font. Characters are cut to 96 × 96 pixels by horizontal and vertical segmentation, which facilitates the extraction of pixel feature information, avoids mutual interference between adjacent characters, and effectively improves recognition efficiency.

Description

OCR (optical character recognition)-based method for recognizing characters with similar glyphs

Technical Field

The invention relates to the technical field of computers, in particular to pattern recognition and deep learning, and more particularly to an OCR (optical character recognition)-based method for recognizing characters with similar glyphs.

Background

Optical character recognition (OCR) combines optical and computer technology to convert an image of text printed on paper into a text file. OCR can be used for the automatic scanning and long-term storage of bank bills, large volumes of documents, archive files, tax receipts and the like.

OCR is usually assessed by technical measures such as recognition rate, recognition speed, layout understanding and layout reconstruction. The technology achieves a good recognition rate on common characters, but has problems with Chinese characters, whose structures and fonts are rich, particularly when glyphs are similar: for groups of characters such as ("noon", "dry") and ("run", "bubble", "cannon"), recognition efficiency and precision are low. In addition, the prior art cannot distinguish the same glyph rendered in different fonts; errors occur very easily in that case, repeated recognition of the same input gives different results, and manual intervention and error correction are sometimes needed, which greatly reduces recognition accuracy.

Disclosure of Invention

The invention provides an OCR (optical character recognition)-based method for recognizing characters with similar glyphs that has a high recognition rate, high recognition speed and high precision.

The technical solution adopted by the invention to solve the technical problem is as follows:

An OCR-based method for recognizing characters with similar glyphs comprises the following steps:

A. Raw OCR image pre-processing

Deskewing slanted text, removing noise from the picture, applying contrast adjustment and gamma correction, and converting the image to a grayscale image;

B. Image text detection

Extracting character pixel feature information from the preprocessed grayscale image: a CNN extracts the character pixel feature information and converts it into a feature vector in one-hot encoded form, which the character recognition module uses as the basis for recognizing the character;

C. Recognition computation

Using the different fonts of a standard character set as training samples n, each font being denoted n₁, n₂, …, and calculating the Euclidean distance Dₙ₁, Dₙ₂, … of each font of the training samples; the character recognition module adopts the Google Inception-v4 framework, takes the character recognized from the image to be recognized as the recognition sample p, and calculates the Euclidean distance Dₚ of the recognition sample p; the contrast thresholds a₁, a₂, … between the recognition sample and the training samples of the different fonts are then calculated by the following formulas,

[formulas for a₁, a₂, … not reproduced in this text]

D. Character text and font recognition

From the contrast thresholds a₁, a₂, …, selecting the training sample whose threshold lies within 0.4-0.6, and outputting the text and font of the corresponding recognized character.
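Step A above can be illustrated with a minimal Python sketch of two of its operations, gamma correction and grayscale conversion. The function names, the gamma value of 0.8 and the BT.601 luminance weights are assumptions made for illustration; the source does not specify them, and deskewing, denoising and contrast adjustment are omitted.

```python
def gamma_correct(rgb_image, gamma=0.8):
    """Apply out = 255 * (in / 255) ** gamma to every channel.

    The gamma value is an assumption; the source gives none.
    """
    return [
        [tuple(round(255 * (c / 255) ** gamma) for c in px) for px in row]
        for row in rgb_image
    ]


def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples in 0..255 to rows of gray levels.

    The 0.299/0.587/0.114 weights are the common BT.601 luma weights,
    assumed here; the source does not name a conversion.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


# A tiny 2 x 2 test picture: white, black, pure red, mid gray.
image = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (128, 128, 128)]]
plain_gray = to_gray(image)
gray = to_gray(gamma_correct(image))
```

With a gamma below 1 the mid-gray pixel becomes brighter, while pure black and pure white are unchanged.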

In step B, when character pixel feature information is extracted from the preprocessed grayscale image, each character is cut to 96 × 96 pixels by horizontal and vertical segmentation.
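The horizontal and vertical segmentation can be sketched with projection profiles: rows that contain ink delimit text lines, and columns that contain ink within a line delimit characters. This is a minimal sketch, assuming a binarized image (1 = ink, 0 = background) and nearest-neighbour resampling to 96 × 96; the source does not state how the cuts or the resizing are performed, and all function names are hypothetical.

```python
SIZE = 96  # target side length stated in the source


def runs(flags):
    """Return (start, end) index pairs for each maximal run of truthy values."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def resize_nearest(glyph, size=SIZE):
    """Resample a 2-D list to size x size with nearest-neighbour lookup."""
    h, w = len(glyph), len(glyph[0])
    return [
        [glyph[y * h // size][x * w // size] for x in range(size)]
        for y in range(size)
    ]


def segment_characters(binary):
    """Cut a binary image into per-character 96 x 96 tiles.

    Horizontal segmentation splits text lines on blank rows; vertical
    segmentation splits each line into characters on blank columns.
    """
    tiles = []
    for top, bottom in runs([any(row) for row in binary]):
        line = binary[top:bottom]
        col_has_ink = [any(row[x] for row in line) for x in range(len(line[0]))]
        for left, right in runs(col_has_ink):
            glyph = [row[left:right] for row in line]
            tiles.append(resize_nearest(glyph))
    return tiles


# Two 2 x 2 "characters" on one line, separated by a blank column.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
tiles = segment_characters(page)
```

A character whose components are separated by a blank column would be over-split by this naive rule; the source does not say how it handles that case.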

In step C, 16 fonts of the 3755 characters in the national-standard level-1 character set are used as the training samples n.

In step D, from the contrast thresholds a₁, a₂, …, the training sample whose threshold is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
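The selection rule of step D can be sketched as follows. The sketch assumes the contrast thresholds have already been computed for each (text, font) training sample; the formula for them is not reproduced in this text, so the values below are made-up inputs. The names `select_match`, `LOW`, `HIGH` and `TARGET` are hypothetical, and treating the 0.4-0.6 window as inclusive is an assumption.

```python
LOW, HIGH, TARGET = 0.4, 0.6, 0.5  # window and target value stated in the source


def select_match(thresholds):
    """Pick the training sample whose contrast threshold is closest to 0.5.

    `thresholds` maps a (text, font) pair to its contrast threshold.
    Only values inside [0.4, 0.6] are candidates (inclusive bounds are an
    assumption); None is returned when no value qualifies, so the caller
    can raise an error-correction prompt instead of guessing.
    """
    candidates = {k: a for k, a in thresholds.items() if LOW <= a <= HIGH}
    if not candidates:
        return None
    return min(candidates, key=lambda k: abs(candidates[k] - TARGET))


# Illustrative, made-up threshold values for one recognition sample.
example = {
    ("dry", "Song"): 0.52,
    ("dry", "Hei"): 0.58,
    ("noon", "Song"): 0.71,
}
best = select_match(example)

# Values outside the window, as for a seriously unclear picture.
unclear = select_match({("dry", "Song"): 0.15, ("noon", "Hei"): 0.85})
```

Returning a (text, font) pair means a single lookup yields both outputs of step D, the recognized text and its font.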

In step C, the character recognition module adopts the Google Inception-v4 framework and splits each 5 × 5 two-dimensional convolution kernel into 1 × 5 and 5 × 1 one-dimensional convolution kernels.
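The saving from splitting a 5 × 5 kernel can be checked with a small sketch. It counts weights (25 for one 5 × 5 kernel against 5 + 5 for a 1 × 5 and a 5 × 1 kernel) and verifies that a 1 × 5 pass followed by a 5 × 1 pass equals a single 5 × 5 pass whose kernel is their outer product. This pure-Python toy is not the module described in the source: it has no nonlinearity between the two passes, uses no padding, and its function names and kernel values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Valid (no padding) 2-D cross-correlation of nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


row_kernel = [[1, 2, 3, 2, 1]]            # 1 x 5, arbitrary example weights
col_kernel = [[1], [0], [-1], [0], [1]]   # 5 x 1, arbitrary example weights
# Outer product: the single 5 x 5 kernel equivalent to the two passes.
full_kernel = [[c[0] * r for r in row_kernel[0]] for c in col_kernel]

# An arbitrary 8 x 8 integer test image.
image = [[(3 * y + 5 * x) % 7 for x in range(8)] for y in range(8)]

two_pass = conv2d_valid(conv2d_valid(image, row_kernel), col_kernel)
one_pass = conv2d_valid(image, full_kernel)

weights_full = 5 * 5    # weights in one 5 x 5 kernel
weights_split = 5 + 5   # weights in the 1 x 5 and 5 x 1 pair
```

Only kernels that are an outer product factor exactly this way; in a network the two one-dimensional layers are learned directly and are typically separated by a nonlinearity.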

The beneficial effects of the invention are:

1. It departs from the traditional glyph recognition approach in that both the text and the font of a character can be identified: multi-sample comparison and threshold screening greatly improve text recognition accuracy, and the font of the character is effectively identified. The method is particularly suitable for recognizing characters with similar glyphs and similar fonts, and achieves accurate recognition of both glyph and font.

2. Characters are cut to 96 × 96 pixels by horizontal and vertical segmentation, which facilitates the extraction of pixel feature information, avoids mutual interference between adjacent characters, and effectively improves recognition efficiency.

3. Among the contrast thresholds a₁, a₂, …, the invention selects the training sample closest to 0.5 and outputs the text and font of the corresponding recognized character, which improves recognition accuracy and avoids manual intervention and error correction.

4. The character recognition module adopts the Google Inception-v4 framework and splits each 5 × 5 two-dimensional convolution kernel into 1 × 5 and 5 × 1 one-dimensional convolution kernels, which not only prevents overfitting but also increases the nonlinear expansion capability and preserves the diversity of character features.

Drawings

FIG. 1 is a schematic diagram of the recognition process of the invention.

Detailed Description

A. Raw OCR image pre-processing

B. Image text detection

C. Recognition computation

Using the different fonts of a standard character set as training samples n, each font being denoted n₁, n₂, …, and calculating the Euclidean distance Dₙ₁, Dₙ₂, … of each font of the training samples; the character recognition module adopts the Google Inception-v4 framework, takes the character recognized from the image to be recognized as the recognition sample p, and calculates the Euclidean distance Dₚ of the recognition sample p; the contrast thresholds a₁, a₂, … between the recognition sample and the training samples of the different fonts are then calculated by the following formulas,

[formulas for a₁, a₂, … not reproduced in this text]

D. Character text and font recognition

Comparative experiment 1

The test took the three similar-looking characters "dry", "in" and "noon" as cases:

glyph interference items: "in" and "noon";

font interference items: Hei (black) and FangSong (imitation Song);

the test method is as follows: 9 pictures were selected, namely "dry" in Song, Hei and FangSong; "in" in Song, Hei and FangSong; and "noon" in Song, Hei and FangSong; manual identification gives 3 cases of "dry", 3 cases of "in" and 3 cases of "noon";

Hanwang OCR free Chinese edition, from the ZOL software download site, and orc software v8.1, from the Starting Point software park site, were each run several times for comparison, with the following results:

Run	The invention	Hanwang OCR	orc software v8.1
First	dry 3, in 3, noon 3	dry 4, in 2, noon 3	dry 3, in 3, noon 3
Second	dry 3, in 3, noon 3	dry 5, in 3, noon 1	dry 1, in 4, noon 4
Third	dry 3, in 3, noon 3	dry 3, in 3, noon 3	dry 2, in 2, noon 5

Result analysis: for single-picture glyph recognition, the method of the invention, Hanwang OCR free Chinese edition and orc software v8.1 can all recognize the glyph text, but the prior-art tools are unstable: the interference items affect them, their results vary between runs, and manual intervention and error correction are needed. For each of the 9 pictures, the invention selects the contrast threshold closest to 0.5 within 0.4-0.6. If a picture is seriously unclear or cannot be effectively identified, the contrast threshold falls within 0.1-0.3 or 0.7-0.9, which triggers an automatic error-correction prompt.

Comparative experiment 2

The invention was compared with the technology of application CN201580029025.7, filed by Google for distributed optical character recognition and distributed machine language translation. The invention determines the font from how close the contrast threshold is to 0.5, whereas the technology of CN201580029025.7 cannot determine the font.

In conclusion, the scheme is particularly suitable for recognizing characters with similar glyphs and similar fonts, and achieves accurate recognition of both glyph and font. In addition, the method is easy to integrate into existing software and greatly reduces the difficulty of developing recognition software while maintaining recognition efficiency.

Claims

1. An OCR (optical character recognition)-based method for recognizing characters with similar glyphs, characterized by comprising the following steps:

A. Raw OCR image pre-processing

B. Image text detection

C. Recognition computation

[formulas for a₁, a₂, … not reproduced in this text]

D. Character text and font recognition

2. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein in step B, when character pixel feature information is extracted from the preprocessed grayscale image, each character is cut to 96 × 96 pixels by horizontal and vertical segmentation.

3. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein the training samples n in step C are 16 fonts of the 3755 characters in the national-standard level-1 character set.

4. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein in step D, from the contrast thresholds a₁, a₂, …, the training sample closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.

5. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein the character recognition module in step C adopts the Google Inception-v4 framework and splits each 5 × 5 two-dimensional convolution kernel into 1 × 5 and 5 × 1 one-dimensional convolution kernels.