CN109447055B - OCR (optical character recognition)-based character similarity recognition method


Publication number
CN109447055B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811211186.XA
Other languages
Chinese (zh)
Other versions
CN109447055A (en)
Inventor
席敬
焦勇
伏虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Wanwei Information Technology Co Ltd
Original Assignee
China Telecom Wanwei Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The invention relates to the field of computer technology, in particular pattern recognition and deep learning, and more specifically to an OCR (optical character recognition)-based method for recognizing characters with similar glyphs. Unlike traditional glyph recognition, the method identifies both the text of a character and its font: multi-sample comparison and threshold screening improve text recognition accuracy and allow the font to be identified. The method is particularly suited to characters with similar glyphs in similar fonts, and recognizes both glyph and font. Each character is cut to 96 × 96 pixels by horizontal and vertical segmentation, which facilitates the extraction of pixel feature information, avoids interference between adjacent characters, and improves recognition efficiency.

Description

OCR (optical character recognition)-based character similarity recognition method
Technical Field
The invention relates to the field of computer technology, in particular pattern recognition and deep learning, and more specifically to an OCR (optical character recognition)-based method for recognizing characters with similar glyphs.
Background
Optical character recognition (OCR) combines optical and computer technology to convert images of text printed on paper into text files. OCR can be used for the automatic scanning and long-term storage of material such as bank bills, large volumes of documents, archive files and tax receipts.
OCR technology is usually assessed by recognition rate, recognition speed, layout understanding and layout reconstruction. Existing technology achieves a good recognition rate on common characters but struggles with Chinese characters, whose structures and fonts are rich, particularly when glyphs are similar, for example 午 (noon) and 干 (dry), or 跑 (run), 泡 (bubble) and 炮 (cannon). For such characters recognition efficiency and precision are low. In addition, the prior art cannot distinguish the same glyph in different fonts: errors occur easily, repeated recognition of the same input gives different results, and manual intervention and error correction are sometimes needed, which greatly reduces recognition accuracy.
Disclosure of Invention
The invention provides an OCR (optical character recognition)-based method for recognizing characters with similar glyphs that offers a high recognition rate, high recognition speed and high precision.
The technical solution adopted by the invention to solve the technical problem is as follows:
An OCR-based method for recognizing characters with similar glyphs comprises the following steps:
A. Raw OCR image preprocessing
Correct skewed text, remove noise from the picture, apply image contrast adjustment and Gamma correction, and convert the image to a grayscale image;
B. Image text detection
Extract character pixel feature information from the preprocessed grayscale image; a CNN (convolutional neural network) extracts this information and converts it into a feature vector in one-hot encoded form, which the character recognition module uses as the basis for recognition;
C. Recognition computation
Use different fonts of a standard character library as training samples n, denote the individual fonts n_1, n_2, …, and compute the Euclidean distance D_n1, D_n2, … for each font of the training samples. The character recognition module, which uses the Google Inception-v4 architecture, recognizes the characters of the image to be recognized as a recognition sample p and computes the Euclidean distance D_p of the recognition sample p. The comparison thresholds a_1, a_2, … between the recognition sample and the training samples of the different fonts are then computed with the following formulas:
[Formula images for a_1, a_2, … are not preserved in this text.]
D. Character text and font recognition
Select the training sample whose comparison threshold a_1, a_2, … lies in the range 0.4 to 0.6, and output the text and font of the corresponding recognized character.
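The contrast adjustment, Gamma correction and grayscale conversion of step A can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the luma weights, the linear contrast stretch, the `gamma` value of 0.8 and the order of operations are all assumptions, and skew correction and noise removal are left out.

```python
from typing import List, Tuple

Gray = List[List[float]]  # grayscale rows, values in [0, 1]


def to_gray(rgb: List[List[Tuple[int, int, int]]]) -> Gray:
    """Convert rows of (r, g, b) pixels in 0..255 to grayscale in [0, 1]."""
    # BT.601 luma weights: an assumed choice, not stated in the source.
    return [[(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for r, g, b in row]
            for row in rgb]


def stretch_contrast(gray: Gray) -> Gray:
    """Linearly rescale intensities so that they span the full [0, 1] range."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image: nothing to stretch
        return [[0.0 for _ in row] for row in gray]
    return [[(v - lo) / (hi - lo) for v in row] for row in gray]


def gamma_correct(gray: Gray, gamma: float = 0.8) -> Gray:
    """Apply power-law (Gamma) correction; gamma=0.8 is a hypothetical setting."""
    return [[min(max(v, 0.0), 1.0) ** gamma for v in row] for row in gray]


def preprocess(rgb: List[List[Tuple[int, int, int]]], gamma: float = 0.8) -> Gray:
    """Grayscale conversion, contrast stretch, then Gamma correction."""
    return gamma_correct(stretch_contrast(to_gray(rgb)), gamma)
```

With a gamma below 1 the mid-tones are brightened, which may help faint strokes survive later binarization; whether the original method uses a value above or below 1 is not stated.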
In step B, when character pixel feature information is extracted from the preprocessed grayscale image, each character is cut to 96 × 96 pixels by horizontal and vertical segmentation.
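One common way to realize horizontal and vertical segmentation is with projection profiles: row sums separate text lines, and column sums within a line separate characters. The sketch below assumes a binarized image (1 = ink, 0 = background) and uses nearest-neighbour resizing to reach 96 × 96; both are illustrative choices, and the function names are hypothetical. It does not merge the parts of left-right structured characters such as 泡, which a practical system would need to handle.

```python
from typing import List, Tuple

Binary = List[List[int]]  # rows of 0/1 pixels, 1 = ink


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of consecutive non-zero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def resize_nearest(img: Binary, size: int = 96) -> Binary:
    """Nearest-neighbour resize to size x size (aspect ratio is not preserved)."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


def segment_characters(binary: Binary, size: int = 96) -> List[Binary]:
    """Cut a binarized page into size x size character images."""
    chars = []
    row_profile = [sum(row) for row in binary]            # horizontal projection
    for top, bottom in runs(row_profile):                 # one span per text line
        line = binary[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]    # vertical projection
        for left, right in runs(col_profile):             # one span per character
            glyph = [row[left:right] for row in line]
            chars.append(resize_nearest(glyph, size))
    return chars
```

Cutting at blank rows and columns is what keeps neighbouring characters from leaking into each other's 96 × 96 window.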
In step C, the training samples n are 16 fonts of the 3755 characters in the national standard level-1 character library.
In step D, the training sample whose comparison threshold a_1, a_2, … is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
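The selection rule of step D can be sketched as below, assuming the comparison thresholds have already been computed for each candidate font (the formulas themselves are not preserved in this text, so they are not reproduced). The function name `pick_font`, the font labels and the numeric values are hypothetical.

```python
from typing import Dict, Optional, Tuple


def pick_font(thresholds: Dict[str, float],
              low: float = 0.4,
              high: float = 0.6,
              target: float = 0.5) -> Optional[Tuple[str, float]]:
    """Return the (font, a) pair with a in [low, high] closest to target.

    Returns None when no comparison threshold falls inside the range.
    """
    candidates = {font: a for font, a in thresholds.items() if low <= a <= high}
    if not candidates:
        return None
    best = min(candidates, key=lambda font: abs(candidates[font] - target))
    return best, candidates[best]


# Hypothetical comparison thresholds for one recognized character.
example = {"Song": 0.52, "Hei": 0.43, "FangSong": 0.71}
chosen = pick_font(example)
```

Filtering to the 0.4 to 0.6 range first and then minimizing the distance to 0.5 combines the two statements of the rule; returning `None` leaves room for a caller to raise a prompt instead of guessing.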
In step C, the character recognition module uses the Google Inception-v4 architecture and splits the 5 × 5 two-dimensional convolution kernel into 1 × 5 and 5 × 1 one-dimensional convolution kernels.
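The effect of splitting a 5 × 5 kernel into 1 × 5 and 5 × 1 kernels can be illustrated with a small pure-Python sketch. The kernel weights and the test image are hypothetical. The exact equality shown here holds only when the 5 × 5 kernel is the outer product of the two one-dimensional kernels; in an Inception-style network the two one-dimensional kernels are learned directly, typically with a nonlinearity between them, so they act as a cheaper substitute rather than an exact decomposition.

```python
from typing import List

Matrix = List[List[int]]


def correlate2d(img: Matrix, kernel: Matrix) -> Matrix:
    """'Valid' 2-D cross-correlation of nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


row = [1, 2, 0, -2, -1]   # 1 x 5 kernel (hypothetical weights)
col = [1, 3, 5, 3, 1]     # 5 x 1 kernel (hypothetical weights)
full = [[c * r for r in row] for c in col]  # equivalent rank-1 5 x 5 kernel

img = [[(3 * y + 5 * x) % 7 for x in range(8)] for y in range(8)]

direct = correlate2d(img, full)                       # one 5 x 5 pass
factored = correlate2d(correlate2d(img, [row]),       # 1 x 5 pass ...
                       [[c] for c in col])            # ... then 5 x 1 pass

params_full = 5 * 5          # weights in the 5 x 5 kernel
params_factored = 5 + 5      # weights in the 1 x 5 and 5 x 1 kernels together
```

The two paths cover the same 5 × 5 receptive field, but the factored form needs 10 weights per channel pair instead of 25.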
The invention has the following beneficial effects:
1. Unlike traditional glyph recognition, the method identifies both the text of a character and its font. Multi-sample comparison and threshold screening improve text recognition accuracy and allow the font to be identified. The method is particularly suited to characters with similar glyphs in similar fonts, and recognizes both glyph and font accurately.
2. Each character is cut to 96 × 96 pixels by horizontal and vertical segmentation, which facilitates the extraction of pixel feature information, avoids interference between adjacent characters, and improves recognition efficiency.
3. Among the comparison thresholds a_1, a_2, …, the training sample closest to 0.5 is selected and the text and font of the corresponding recognized character are output, which improves recognition accuracy and avoids manual intervention and error correction.
4. The character recognition module uses the Google Inception-v4 architecture and splits the 5 × 5 two-dimensional convolution kernel into 1 × 5 and 5 × 1 one-dimensional convolution kernels, which helps prevent overfitting, adds nonlinear expressive capacity, and preserves the diversity of character features.
Drawings
FIG. 1 is a schematic diagram of the recognition process of the present invention.
Detailed Description
Comparative experiment 1
The test was carried out on three characters, with Song-typeface 干 (dry) as the case:
glyph interference items: 于 (in) and 午 (noon);
font interference items: Hei (black) typeface and FangSong (imitation Song) typeface.
The test method is as follows: nine pictures are screened, namely 干 in Song, Hei and FangSong; 于 in Song, Hei and FangSong; and 午 in Song, Hei and FangSong. Manual identification gives 3 cases of 干, 3 cases of 于 and 3 cases of 午.
Several comparison tests were run against Hanwang OCR free Chinese edition (from the ZOL software download site) and orc software v8.1 (from the Starting Point Software Park site). The results are as follows:
Run    | The invention    | Hanwang OCR      | orc software v8.1
First  | 干 3, 于 3, 午 3 | 干 4, 于 2, 午 3 | 干 3, 于 3, 午 3
Second | 干 3, 于 3, 午 3 | 干 5, 于 3, 午 1 | 干 1, 于 4, 午 4
Third  | 干 3, 于 3, 午 3 | 干 3, 于 3, 午 3 | 干 2, 于 2, 午 5
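The instability in the results above can be made concrete by measuring, for each run, how far the reported counts are from the manual result of 3 dry (干), 3 in (于) and 3 noon (午). The sketch below is only an illustrative tally: the counts are a reading of the results table in the order (干, 于, 午), and the `deviation` measure is an assumption introduced here, not part of the original experiment.

```python
from typing import Dict, List, Tuple

Counts = Tuple[int, int, int]  # recognized counts in the order (干, 于, 午)

MANUAL: Counts = (3, 3, 3)     # manual identification of the nine pictures

RESULTS: Dict[str, List[Counts]] = {
    "the invention": [(3, 3, 3), (3, 3, 3), (3, 3, 3)],
    "Hanwang OCR": [(4, 2, 3), (5, 3, 1), (3, 3, 3)],
    "orc software v8.1": [(3, 3, 3), (1, 4, 4), (2, 2, 5)],
}


def deviation(counts: Counts, truth: Counts = MANUAL) -> int:
    """Sum of absolute differences between recognized and manual counts."""
    return sum(abs(c - t) for c, t in zip(counts, truth))


# Per-tool list of deviations, one entry per run (first, second, third).
summary = {tool: [deviation(run) for run in runs_]
           for tool, runs_ in RESULTS.items()}
```

Under this reading the counts for the invention match the manual result in all three runs, while each of the other two tools deviates in two of its three runs.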
Result analysis: on single-picture glyph recognition, the method of the invention, Hanwang OCR free Chinese edition and orc software v8.1 can all recognize glyph text. The prior-art tools, however, are unstable: the interference items affect them, their results vary between runs, and manual intervention and error correction are needed. For the 9 pictures, the invention selects the comparison threshold closest to 0.5 within the range 0.4 to 0.6. If a picture is seriously unclear or cannot be recognized effectively, its comparison threshold falls within 0.1 to 0.3 or 0.7 to 0.9, which triggers an automatic error-correction prompt.
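The banding described in the result analysis can be sketched as a small classifier over a comparison threshold. The labels and the function name are hypothetical, and the text does not say how values outside the named bands (for example between 0.3 and 0.4) are treated, so the sketch reports them as "unspecified" rather than guessing.

```python
def classify_threshold(a: float) -> str:
    """Map a comparison threshold to an action, following the stated bands."""
    if 0.4 <= a <= 0.6:
        return "accept"        # candidate for text and font output
    if 0.1 <= a <= 0.3 or 0.7 <= a <= 0.9:
        return "prompt"        # unclear picture: raise an error-correction prompt
    return "unspecified"       # band not covered by the description


# Hypothetical thresholds for a batch of pictures.
batch = [0.5, 0.2, 0.8, 0.35]
actions = [classify_threshold(a) for a in batch]
```

Separating "prompt" from "unspecified" keeps the sketch faithful to what is stated while making the uncovered ranges visible.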
Comparative experiment 2
Compared with the technology of CN201580029025.7, filed by Google for distributed optical character recognition and distributed machine language translation, the invention determines the font from how close the comparison threshold is to 0.5, whereas the technology of reference CN201580029025.7 cannot determine the font.
In conclusion, the scheme is particularly suited to recognizing characters with similar glyphs and similar fonts, and achieves accurate recognition of both glyph and font. In addition, the method is easy to integrate into existing software and reduces the difficulty of developing recognition software while maintaining recognition efficiency.

Claims (5)

1. An OCR (optical character recognition)-based method for recognizing characters with similar glyphs, characterized by comprising the following steps:
A. Raw OCR image preprocessing
Correct skewed text, remove noise from the picture, apply image contrast adjustment and Gamma correction, and convert the image to a grayscale image;
B. Image text detection
Extract character pixel feature information from the preprocessed grayscale image; a CNN (convolutional neural network) extracts this information and converts it into a feature vector in one-hot encoded form, which the character recognition module uses as the basis for recognition;
C. Recognition computation
Use different fonts of a standard character library as training samples n, denote the individual fonts n_1, n_2, …, and compute the Euclidean distance D_n1, D_n2, … for each font of the training samples. The character recognition module, which uses the Google Inception-v4 architecture, recognizes the characters of the image to be recognized as a recognition sample p and computes the Euclidean distance D_p of the recognition sample p. The comparison thresholds a_1, a_2, … between the recognition sample and the training samples of the different fonts are then computed with the following formulas:
[Formula images for a_1, a_2, … are not preserved in this text.]
D. Character text and font recognition
Select the training sample whose comparison threshold a_1, a_2, … lies in the range 0.4 to 0.6, and output the text and font of the corresponding recognized character.
2. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein in step B, when character pixel feature information is extracted from the preprocessed grayscale image, each character is cut to 96 × 96 pixels by horizontal and vertical segmentation.
3. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein the training samples n in step C are 16 fonts of the 3755 characters in the national standard level-1 character library.
4. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein in step D the training sample whose comparison threshold a_1, a_2, … is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
5. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein the character recognition module in step C uses the Google Inception-v4 architecture and splits the 5 × 5 two-dimensional convolution kernel into 1 × 5 and 5 × 1 one-dimensional convolution kernels.
CN201811211186.XA 2018-10-17 2018-10-17 OCR (optical character recognition) -based character similarity recognition method Active CN109447055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811211186.XA CN109447055B (en) 2018-10-17 2018-10-17 OCR (optical character recognition) -based character similarity recognition method

Publications (2)

Publication Number Publication Date
CN109447055A (en) 2019-03-08
CN109447055B (en) 2022-05-03

Family

ID=65547338

Country Status (1)

Country Link
CN (1) CN109447055B (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
    Address after: 730000 Zhangsutan 553, Chengguan District, Lanzhou City, Gansu Province
    Applicant after: China Power World Wide Information Technology Co.,Ltd.
    Address before: 730000 Zhangsutan 553, Chengguan District, Lanzhou City, Gansu Province
    Applicant before: GANSU WANWEI CO.
GR01 Patent grant