CN109447055B - OCR (optical character recognition) -based character similarity recognition method
- Publication number
- CN109447055B (application CN201811211186.XA)
- Authority
- CN
- China
- Prior art keywords
- character
- font
- recognition
- ocr
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention relates to the technical field of computers, in particular to pattern recognition and deep learning, and more particularly to an OCR (optical character recognition)-based method for recognizing characters with similar glyphs. It departs from the traditional recognition approach in that both the character text and its font can be identified: multi-sample comparison and threshold screening greatly improve text recognition accuracy and allow the font to be identified effectively. The method is particularly suited to characters with similar glyphs and similar fonts, and achieves accurate recognition of both glyph and font. Each character is cut to 96 × 96 pixels by horizontal and vertical segmentation, which facilitates extraction of pixel feature information, avoids interference between adjacent characters, and improves recognition efficiency.
Description
Technical Field
The invention relates to the technical field of computers, in particular to pattern recognition and deep learning, and more particularly to an OCR (optical character recognition)-based method for recognizing characters with similar glyphs.
Background
Optical Character Recognition (OCR) combines optical and computer technology to convert images of text printed on paper into text files. OCR can be used for automatic scanning and long-term storage of documents such as bank bills, large volumes of paper records, archive files, and tax receipts.
OCR is usually evaluated by recognition rate, recognition speed, layout understanding, and layout reconstruction. Existing technology recognizes common characters well, but has difficulty with Chinese characters, whose structures and fonts are varied, particularly when glyphs are similar, for example (noon, dry) or (run, bubble, cannon); for such characters recognition efficiency and precision are low. In addition, the prior art cannot distinguish the same glyph set in different fonts: errors readily occur when one character is recognized across fonts, repeated recognition of the same input yields different results, and manual intervention and error correction are sometimes needed, which greatly reduces recognition accuracy.
Disclosure of Invention
The invention provides an OCR-based method for recognizing characters with similar glyphs, with a high recognition rate, high recognition speed, and high precision.
The technical scheme adopted by the invention to solve the technical problem is as follows:
An OCR-based method for recognizing characters with similar glyphs comprises the following steps:
A. Raw OCR image preprocessing
Correcting skewed text, removing noise from the picture, applying contrast adjustment and Gamma correction, and converting the image into a grayscale image;
B. Image text detection
Extracting character pixel feature information from the preprocessed grayscale image using a CNN and converting it into a one-hot encoded feature vector, which serves as the basis on which the character recognition module identifies the character pixel feature information;
C. Recognition calculation
Using the different fonts of a standard character library as training samples n, denoting the fonts $n_1, n_2, \dots$, and calculating the Euclidean distance $D_{n_1}, D_{n_2}, \dots$ of each font of the training sample; the character recognition module adopts a Google Inception-v4 framework, takes the characters of the image to be recognized as recognition sample p, and calculates the Euclidean distance $D_p$ of the recognition sample p; the comparison thresholds $a_1, a_2, \dots$ between the recognition sample and the training sample of each font are then calculated by formula (the formula images are not reproduced in this text);
D. Character text and font recognition
Selecting the training sample whose comparison threshold $a_1, a_2, \dots$ lies within 0.4-0.6, and outputting the text and font of the corresponding recognized character.
In step B, when character pixel feature information is extracted from the preprocessed grayscale image, each character is cut to 96 × 96 pixels by horizontal and vertical segmentation.
In step C, the training sample n consists of 16 fonts of the 3755 characters in the national standard level-1 (primary) character library.
In step D, the training sample whose comparison threshold $a_1, a_2, \dots$ is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
In step C, the character recognition module adopts a Google Inception-v4 framework in which the 5 × 5 two-dimensional convolution kernel is split into 1 × 5 and 5 × 1 one-dimensional convolution kernels.
The invention has the following beneficial effects:
1. It departs from the traditional recognition approach: both the character text and its font can be identified, multi-sample comparison and threshold screening greatly improve text recognition accuracy, and the font is identified effectively. The method is particularly suited to characters with similar glyphs and similar fonts, and achieves accurate recognition of both glyph and font.
2. Each character is cut to 96 × 96 pixels by horizontal and vertical segmentation, which facilitates extraction of pixel feature information, avoids interference between adjacent characters, and improves recognition efficiency.
3. Among the comparison thresholds $a_1, a_2, \dots$, the training sample closest to 0.5 is selected and the text and font of the corresponding recognized character are output, which improves recognition accuracy and avoids manual intervention and error correction.
4. The character recognition module adopts a Google Inception-v4 framework in which the 5 × 5 two-dimensional convolution kernel is split into 1 × 5 and 5 × 1 one-dimensional convolution kernels, which helps prevent overfitting, increases nonlinear expressive capacity, and preserves the diversity of character features.
Drawings
FIG. 1 is a schematic view of the recognition of the present invention.
Detailed Description
An OCR-based method for recognizing characters with similar glyphs comprises the following steps:
A. Raw OCR image preprocessing
Correcting skewed text, removing noise from the picture, applying contrast adjustment and Gamma correction, and converting the image into a grayscale image;
B. Image text detection
Extracting character pixel feature information from the preprocessed grayscale image using a CNN and converting it into a one-hot encoded feature vector, which serves as the basis on which the character recognition module identifies the character pixel feature information;
C. Recognition calculation
Using the different fonts of a standard character library as training samples n, denoting the fonts $n_1, n_2, \dots$, and calculating the Euclidean distance $D_{n_1}, D_{n_2}, \dots$ of each font of the training sample; the character recognition module adopts a Google Inception-v4 framework, takes the characters of the image to be recognized as recognition sample p, and calculates the Euclidean distance $D_p$ of the recognition sample p; the comparison thresholds $a_1, a_2, \dots$ between the recognition sample and the training sample of each font are then calculated by formula (the formula images are not reproduced in this text);
D. Character text and font recognition
Selecting the training sample whose comparison threshold $a_1, a_2, \dots$ lies within 0.4-0.6, and outputting the text and font of the corresponding recognized character.
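The grayscale, contrast, and Gamma parts of step A can be illustrated with a minimal sketch. It is not the patented implementation: the luma weights, the min-max contrast stretch, the Gamma value of 0.8, the order of operations, and all function names are assumptions chosen for illustration, and skew correction and noise removal are left out. Plain nested lists are used so the example has no dependencies.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples in 0..255 to grayscale floats in [0, 1]."""
    # Luma weights are an assumed choice; the text does not specify a conversion.
    return [[(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for r, g, b in row]
            for row in rgb_rows]


def stretch_contrast(gray):
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 1."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image: nothing to stretch
        return [[0.0 for _ in row] for row in gray]
    return [[(v - lo) / (hi - lo) for v in row] for row in gray]


def gamma_correct(gray, gamma=0.8):
    """Apply power-law (Gamma) correction to intensities in [0, 1]."""
    return [[v ** gamma for v in row] for row in gray]


def preprocess(rgb_rows, gamma=0.8):
    """Grayscale conversion, then contrast stretch, then Gamma (assumed order)."""
    return gamma_correct(stretch_contrast(to_gray(rgb_rows)), gamma)


image = [[(0, 0, 0), (255, 255, 255)],
         [(128, 128, 128), (64, 64, 64)]]
result = preprocess(image)
```

A Gamma value below 1 brightens mid-tones, which may help faint strokes; the appropriate value would depend on the scanned material.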
In step B, when character pixel feature information is extracted from the preprocessed grayscale image, each character is cut to 96 × 96 pixels by horizontal and vertical segmentation.
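One common way to realize horizontal and vertical segmentation is with projection profiles: row sums separate text lines, and column sums within each line separate characters. The sketch below assumes a binarized image (1 = ink) held as nested lists and uses nearest-neighbour scaling to reach 96 × 96; the function names and the scaling method are illustrative assumptions, not details from the text.

```python
def runs(profile):
    """Return (start, end) index pairs for consecutive non-zero profile entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment(binary):
    """Split a binary image (1 = ink) into character crops, line by line."""
    crops = []
    row_profile = [sum(row) for row in binary]            # horizontal segmentation
    for top, bottom in runs(row_profile):
        line = binary[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]    # vertical segmentation
        for left, right in runs(col_profile):
            crops.append([row[left:right] for row in line])
    return crops


def resize_nearest(crop, size=96):
    """Scale a crop to size x size pixels by nearest-neighbour sampling."""
    height, width = len(crop), len(crop[0])
    return [[crop[y * height // size][x * width // size] for x in range(size)]
            for y in range(size)]


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
characters = segment(page)
tiles = [resize_nearest(c) for c in characters]
```

A plain column projection can split a character whose left and right components are separated by a blank column, so a practical system would likely need a merging rule on top of this.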
In step C, the training sample n consists of 16 fonts of the 3755 characters in the national standard level-1 (primary) character library.
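With 3755 characters in 16 fonts, the training set covers 3755 × 16 = 60,080 character-font combinations. The sketch below shows one hypothetical way to index those combinations and one-hot encode them; the joint index layout and the function names are assumptions for illustration, since the text does not say how labels are arranged.

```python
NUM_CHARS = 3755   # characters in the level-1 character library (from the text)
NUM_FONTS = 16     # fonts per character (from the text)
NUM_CLASSES = NUM_CHARS * NUM_FONTS


def class_index(char_idx, font_idx):
    """Map a (character, font) pair to one class index (hypothetical layout)."""
    if not (0 <= char_idx < NUM_CHARS and 0 <= font_idx < NUM_FONTS):
        raise ValueError("index out of range")
    return char_idx * NUM_FONTS + font_idx


def decode(index):
    """Recover (character index, font index) from a class index."""
    return divmod(index, NUM_FONTS)


def one_hot(index, length=NUM_CLASSES):
    """Return a one-hot list with a single 1 at position `index`."""
    vector = [0] * length
    vector[index] = 1
    return vector
```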
In step D, the training sample whose comparison threshold $a_1, a_2, \dots$ is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
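The selection rule in step D can be sketched as follows, taking the comparison thresholds as given inputs because the formula that produces them is not reproduced in this text. The function name, the font labels, and the numeric values are hypothetical.

```python
def select_font(thresholds, low=0.4, high=0.6, target=0.5):
    """Return the (font, a) pair closest to `target` within [low, high], or None."""
    candidates = {font: a for font, a in thresholds.items() if low <= a <= high}
    if not candidates:
        return None  # no training sample passes the threshold screen
    best = min(candidates, key=lambda font: abs(candidates[font] - target))
    return best, candidates[best]


# Hypothetical comparison thresholds for one recognized character.
example = {"Song": 0.47, "Hei": 0.58, "FangSong": 0.31}
choice = select_font(example)
```

Returning `None` when nothing falls inside 0.4-0.6 leaves the caller free to decide how to handle an unrecognized character.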
In step C, the character recognition module adopts a Google Inception-v4 framework in which the 5 × 5 two-dimensional convolution kernel is split into 1 × 5 and 5 × 1 one-dimensional convolution kernels.
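The effect of splitting a 5 × 5 kernel can be checked numerically. The sketch below applies a 1 × 5 kernel followed by a 5 × 1 kernel and compares the result with a single 5 × 5 kernel formed as their outer product; the weights and the test image are hypothetical integers chosen so the comparison is exact. The equality holds only for this linear composition: a network would normally place an activation between the two layers, and a general 5 × 5 kernel cannot always be written as such a product.

```python
def correlate2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation of a nested-list image with a nested-list kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


row_kernel = [[1, 2, 0, -1, 3]]              # 1 x 5 (hypothetical weights)
col_kernel = [[2], [0], [1], [-1], [1]]      # 5 x 1 (hypothetical weights)
# Equivalent 5 x 5 kernel: outer product of the column and row kernels.
full_kernel = [[c[0] * r for r in row_kernel[0]] for c in col_kernel]

image = [[(3 * y + 5 * x) % 7 for x in range(8)] for y in range(8)]

two_step = correlate2d_valid(correlate2d_valid(image, row_kernel), col_kernel)
one_step = correlate2d_valid(image, full_kernel)

params_full = 5 * 5        # weights per input/output channel pair, 5 x 5 kernel
params_factored = 5 + 5    # weights for the 1 x 5 plus 5 x 1 pair
```

The factored pair needs 10 weights instead of 25 for each input/output channel pair while covering the same 5 × 5 receptive field.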
The test used the three look-alike characters "dry", "in", and "noon" as cases:
glyph distractors: "in" and "noon";
font distractors: Hei (boldface) and FangSong (imitation Song);
the test method is as follows: screenshots were taken of "dry" in Song, Hei, and FangSong; "in" in Song, Hei, and FangSong; and "noon" in Song, Hei, and FangSong, giving 9 pictures in total; manual identification gives 3 "dry", 3 "in", and 3 "noon";
Hanwang OCR (free Chinese edition, from the ZOL software download site) and orc software v8.1 (from the Starting Point software park site) were used for several comparison tests, with the following results:
Test | The invention | Hanwang OCR | orc software v8.1 |
---|---|---|---|
1 | 3 dry, 3 in, 3 noon | 4 dry, 2 in, 3 noon | 3 dry, 3 in, 3 noon |
2 | 3 dry, 3 in, 3 noon | 5 dry, 3 in, 1 noon | 1 dry, 4 in, 4 noon |
3 | 3 dry, 3 in, 3 noon | 3 dry, 3 in, 3 noon | 2 dry, 2 in, 5 noon |
Result analysis: in single-picture glyph recognition, the method of the invention, Hanwang OCR, and orc software v8.1 can all recognize the character text, but the existing software is unstable: the distractors affect its results, which vary between runs, so manual intervention and error correction are needed. For each of the 9 pictures, the invention selects the comparison threshold closest to 0.5 within the range 0.4-0.6. If a picture is severely blurred or cannot be recognized effectively, the comparison threshold falls within 0.1-0.3 or 0.7-0.9, which triggers an automatic error-correction prompt.
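The bands quoted above suggest a simple decision rule for each comparison threshold. The following sketch maps a threshold to an action; the function name and the returned labels are hypothetical, and values outside the quoted bands are marked as unspecified because the text does not say how they are handled.

```python
def classify_threshold(a):
    """Map a comparison threshold to an action, using the bands quoted in the text."""
    if 0.4 <= a <= 0.6:
        return "accept"            # candidate for the closest-to-0.5 selection
    if 0.1 <= a <= 0.3 or 0.7 <= a <= 0.9:
        return "error_prompt"      # unclear or unrecognizable picture
    return "unspecified"           # behaviour not described in the text


actions = [classify_threshold(a) for a in (0.5, 0.2, 0.8, 0.35)]
```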
Compared with the technology of application CN201580029025.7, filed by Google for distributed optical character recognition and distributed machine language translation, the invention determines the font by how close the comparison threshold is to 0.5, whereas the technology of CN201580029025.7 cannot determine the font.
In conclusion, the scheme is particularly suited to recognizing characters with similar glyphs and similar fonts, and achieves accurate recognition of both glyph and font. In addition, the method is easy to integrate into existing software and greatly reduces the difficulty of developing recognition software while maintaining recognition efficiency.
Claims (5)
1. An OCR-based method for recognizing characters with similar glyphs, characterized by comprising the following steps:
A. Raw OCR image preprocessing
Correcting skewed text, removing noise from the picture, applying contrast adjustment and Gamma correction, and converting the image into a grayscale image;
B. Image text detection
Extracting character pixel feature information from the preprocessed grayscale image using a CNN and converting it into a one-hot encoded feature vector, which serves as the basis on which the character recognition module identifies the character pixel feature information;
C. Recognition calculation
Using the different fonts of a standard character library as training samples n, denoting the fonts $n_1, n_2, \dots$, and calculating the Euclidean distance $D_{n_1}, D_{n_2}, \dots$ of each font of the training sample; the character recognition module adopts a Google Inception-v4 framework, takes the characters of the image to be recognized as recognition sample p, and calculates the Euclidean distance $D_p$ of the recognition sample p; the comparison thresholds $a_1, a_2, \dots$ between the recognition sample and the training sample of each font are then calculated by formula (the formula images are not reproduced in this text);
D. Character text and font recognition
Selecting the training sample whose comparison threshold $a_1, a_2, \dots$ lies within 0.4-0.6, and outputting the text and font of the corresponding recognized character.
2. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein in step B, when character pixel feature information is extracted from the preprocessed grayscale image, each character is cut to 96 × 96 pixels by horizontal and vertical segmentation.
3. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein the training sample n in step C consists of 16 fonts of the 3755 characters in the national standard level-1 (primary) character library.
4. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein in step D the training sample whose comparison threshold $a_1, a_2, \dots$ is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
5. The OCR-based method for recognizing characters with similar glyphs according to claim 1, wherein the character recognition module in step C adopts a Google Inception-v4 framework in which the 5 × 5 two-dimensional convolution kernel is split into 1 × 5 and 5 × 1 one-dimensional convolution kernels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811211186.XA CN109447055B (en) | 2018-10-17 | 2018-10-17 | OCR (optical character recognition) -based character similarity recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811211186.XA CN109447055B (en) | 2018-10-17 | 2018-10-17 | OCR (optical character recognition) -based character similarity recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109447055A (en) | 2019-03-08 |
CN109447055B (en) | 2022-05-03 |
Family
ID=65547338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811211186.XA Active CN109447055B (en) | 2018-10-17 | 2018-10-17 | OCR (optical character recognition) -based character similarity recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109447055B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110443269A (en) * | 2019-06-17 | 2019-11-12 | 平安信托有限责任公司 | A kind of document comparison method and device |
CN110781898A (en) * | 2019-10-21 | 2020-02-11 | 南京大学 | Unsupervised learning method for Chinese character OCR post-processing |
CN111626281B (en) * | 2020-04-27 | 2022-12-02 | 国家电网有限公司 | Chinese annotation information identification method and system for paper image map based on adaptive learning |
CN111860317A (en) * | 2020-07-20 | 2020-10-30 | 青岛特利尔环保集团股份有限公司 | Boiler operation data acquisition method, system, equipment and computer medium |
CN116597453A (en) * | 2023-05-16 | 2023-08-15 | 暗物智能科技(广州)有限公司 | Shape near word single word recognition method |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0498978A1 (en) * | 1991-02-13 | 1992-08-19 | International Business Machines Corporation | Mechanical recognition of characters in cursive script |
CN1979529A (en) * | 2005-12-09 | 2007-06-13 | 佳能株式会社 | Optical character recognization |
CN101331520A (en) * | 2005-12-19 | 2008-12-24 | 微软公司 | Stroke contrast in font hinting |
CN101782896A (en) * | 2009-01-21 | 2010-07-21 | 汉王科技股份有限公司 | PDF character extraction method combined with OCR technology |
CN102707222A (en) * | 2012-05-15 | 2012-10-03 | 中国电子科技集团公司第五十四研究所 | Abnormal frequency point identification method based on character string comparison |
CN104462068A (en) * | 2013-09-12 | 2015-03-25 | 北大方正集团有限公司 | Character conversion system and method |
CN105335689A (en) * | 2014-08-06 | 2016-02-17 | 阿里巴巴集团控股有限公司 | Character recognition method and apparatus |
CN106611174A (en) * | 2016-12-29 | 2017-05-03 | 成都数联铭品科技有限公司 | OCR recognition method for unusual fonts |
Non-Patent Citations (4)
Title |
---|
Language Indentification: How to Distinguish Similar Languages?;Nikola Ljubesic等;《2007 29th International Conference on Information Technology Interfaces》;20070808;541-546 * |
工业生产线标签字符识别系统的设计与实现;周凤香;《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》;20140115(第1期);I138-1973 * |
血袋字符高速识别系统的研究;杨富元;《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》;20141115(第11期);I138-372 * |
视频超分辨率重建技术在人脸识别中的应用;杨振罡;《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》;20120515(第5期);I138-1410 * |
Also Published As
Publication number | Publication date |
---|---|
CN109447055A (en) | 2019-03-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 730000 Zhangsutan 553, Chengguan District, Lanzhou City, Gansu Province; Applicant after: China Power World Wide Information Technology Co.,Ltd.; Address before: 730000 Zhangsutan 553, Chengguan District, Lanzhou City, Gansu Province; Applicant before: GANSU WANWEI CO. |
GR01 | Patent grant | ||