CN102592121B - Method and system for judging leakage recognition based on OCR (Optical Character Recognition) - Google Patents

Method and system for judging leakage recognition based on OCR (Optical Character Recognition)

Info

Publication number
CN102592121B
Authority
CN
China
Prior art keywords
word
ocr
image
rectangle frame
connected region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2011104463653A
Other languages
Chinese (zh)
Other versions
CN102592121A (en)
Inventor
兰荣春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Founder International Co Ltd
Original Assignee
Founder International Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Founder International Co Ltd filed Critical Founder International Co Ltd
Priority to CN2011104463653A priority Critical patent/CN102592121B/en
Publication of CN102592121A publication Critical patent/CN102592121A/en
Application granted granted Critical
Publication of CN102592121B publication Critical patent/CN102592121B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method and a system for judging leakage recognition (characters that the OCR engine fails to recognize) based on OCR (Optical Character Recognition), and relates to the technical field of computer character recognition and processing. The method comprises the following steps: first, using the character rectangle coordinates provided by the OCR, growing the image region outward from the four sides of each rectangle and treating black pixels adjacent to the rectangular box as pixels of the character, to obtain a list of all pixels of the character; second, deleting all pixels of the character, and repeating until all characters have been processed, to obtain the region missed by the OCR; then scanning for connected regions and filtering out images and noise; and finally outputting the characters missed by the OCR. The method and system reduce the heavy work of operators who search for missed regions of books manually: the region missed by the OCR is found by automatically searching the recognized image, and the missed characters are then output.

Description

An OCR missed-recognition judging method and system
Technical field
The present invention relates to the technical field of computer character recognition and processing, and in particular to a method and system, based on an image region growing algorithm, for judging OCR missed recognition (the "leakage recognition" of the title: characters that are present in the image but that the OCR engine fails to recognize).
Background art
With the development of computer and digitization technology, traditional paper books, documents, newspapers and the like need to be converted into electronic form. In converting these physical materials into electronic data (electronic files in formats such as TXT, WORD and PDF), character recognition technology (OCR, Optical Character Recognition) is inevitably used.
Book digitization requires that no effective image content be lost. Many OCR engines are currently available, such as Han Wang, ABBYY and Wen Tong. Although these OCR technologies are fairly mature, they cannot fully meet practical needs in application, particularly with respect to missed recognition.
The main causes of OCR missed recognition are:
1. Problems with the recognized document itself, such as ink problems in printing, or a document that is old or damaged, with blurred and unclear writing;
2. Problems arising when the document is scanned, such as scan quality and image resolution, which easily produce irregular character strokes and hinder subsequent correct recognition;
3. The character coordinates provided by the OCR technology itself may fail to enclose the whole character, which also leads to missed recognition.
Missed recognition causes effective page content to be lost. At present, missed-recognition regions are found by manually comparing the recognized text with the original image, which is inefficient.
Therefore, although the accuracy and efficiency of OCR are certainly important, finding the regions that the OCR missed is also very important. No technical solution or related literature addressing missed recognition in OCR has yet been seen.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a method and system, based on an image region growing algorithm, that can quickly find the missed-recognition regions in the OCR process.
To achieve the above object, the technical solution adopted by the present invention is as follows:
An OCR missed-recognition judging method, comprising the following steps:
(1) data input: input an original image, the original image being a binary image;
(2) perform OCR on the input original image and output the recognition result;
(3) missed-recognition judgment: determine the missed-recognition region according to the recognition result output in step (2);
(4) search for connected regions in the missed-recognition region, and filter out images and noise;
(5) output the missed characters.
Further, the recognition result output in step (2) comprises the recognized characters and their rectangular coordinates, that is, their rectangular boxes.
Further, in step (3), the missed-recognition region is determined as follows: the output characters are processed one by one, judging whether the strokes of each character all lie within its rectangular box. If not, then according to the output rectangular coordinates of the character, image growth is performed outward from the four sides of the rectangle, black pixels adjacent to the rectangular box are also regarded as pixels of the character, all pixels of the character are obtained, and all pixels of the character are deleted from the image. If so, processing continues with the next character. When all characters have been processed, the missed-recognition region of the image is obtained.
Further, the rectangular box of a character is grown as follows: starting from an edge of the rectangular box, whenever an effective pixel connected to the character is encountered, that edge is expanded outward, until no effective pixels remain; this yields the new boundary of the character.
Still further, the character rectangular box is grown outward pixel by pixel, and the growth ratio is kept within 50%.
Further, in step (4), the connected regions are searched by scanning the surroundings of the black pixels in all missed-recognition regions obtained in step (3), obtaining all connected regions;
Then, comparing the area of each connected region with the area of a character rectangular box, regions whose area is much larger than the character box area are regarded as images, regions whose area is much smaller than the character box area are regarded as noise, and these images and noise are filtered out.
Further, connected regions larger than 4 to 8 times the character box area are regarded as images, and connected regions smaller than 1/8 to 1/16 of the character box area are regarded as noise.
Further, in step (5), connected regions whose area is close to the character box area are regarded as missed characters and are output.
An OCR missed-recognition judging system, comprising the following devices:
a data input device, for inputting an original image, the original image being a binary image;
an OCR device, for performing OCR on the input original image and outputting the recognition result, the recognition result comprising the recognized characters and their rectangular coordinates, that is, their rectangular boxes;
a missed-recognition judging device, for processing the output characters one by one and judging whether the strokes of each character all lie within its rectangular box; if not, then according to the output rectangular coordinates of the character, image growth is performed outward from the four sides of the rectangle, black pixels adjacent to the rectangular box are also regarded as pixels of the character, all pixels of the character are obtained, and all pixels of the character are deleted from the image; if so, processing continues with the next character; when all characters have been processed, the missed-recognition region of the image is obtained;
a connected-region search device, for searching for connected regions in the missed-recognition region;
an image and noise filtering device, for filtering out images and noise in the missed-recognition region;
an output device, for outputting the missed characters.
The effect of the present invention is as follows: with the method and system of the present invention, the recognized image is searched automatically and the missed-recognition regions can be found quickly, in preparation for manual re-entry or re-recognition, replacing the heavy work of manually searching for missed-recognition regions in book digitization.
Brief description of the drawings
Fig. 1 is a schematic diagram of the character rectangular coordinates provided by existing OCR technology;
Fig. 2 is a structural diagram of the system of the present invention;
Fig. 3 is a flow chart of the method of the present invention;
Fig. 4a and Fig. 4b are schematic diagrams of the rectangular box of the character "A" before and after growth, respectively;
Fig. 5 is a schematic diagram of the characters and character rectangular boxes after OCR;
Fig. 6 shows the missed-recognition region obtained after growing the character rectangular boxes and deleting the pixels in the character boxes.
Detailed description of the embodiments
The core of the present invention is the technique for judging missed recognition in OCR. Using the character rectangular coordinates provided by the OCR, image growth is performed outward from the four sides of each rectangle, and black pixels adjacent to the rectangular box are also regarded as pixels of the character, giving a list of all pixels of the character. All pixels of the character are then deleted. When all characters have been processed, the region missed by the OCR is obtained; connected regions are then scanned, images and noise are filtered out, and finally the characters missed by the OCR are output.
The present invention is described below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the character rectangular coordinates provided by existing OCR technology. If processing follows the OCR rectangular coordinates exactly, two points lie outside the box. After the program removes the content of the rectangular box, these two points remain, and in the next recognition pass or in manual entry they are treated as a decimal point, punctuation mark, emphasis mark or the like.
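The problem described for Fig. 1 can be reproduced on a toy image. The following is a minimal sketch, not the patent's implementation: the 5x5 glyph, the box coordinates and the helper names `erase_box` and `black_pixels` are hypothetical, and the image is assumed to be a list of rows in which 1 means a black pixel.

```python
def erase_box(image, box):
    """Return a copy of a binary image (1 = black) with the box contents cleared."""
    left, top, right, bottom = box  # inclusive pixel coordinates (assumed convention)
    result = [row[:] for row in image]
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            result[y][x] = 0
    return result


def black_pixels(image):
    """List the (x, y) coordinates of all black pixels, row by row."""
    return [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v == 1]


# Hypothetical glyph whose top stroke extends one pixel above the OCR box.
IMAGE = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
OCR_BOX = (1, 1, 3, 3)

# Erasing only the OCR box leaves the stroke tip behind as a stray dot.
leftover = black_pixels(erase_box(IMAGE, OCR_BOX))
print(leftover)  # [(2, 0)]
```

The stray pixel at (2, 0) is the kind of residue that a later pass could mistake for punctuation, which is why the box is grown before the character is deleted.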
As shown in Fig. 2, an OCR missed-recognition judging system comprises the following devices:
(1) a data input device 11, for inputting an original image, the original image being a binary image;
(2) an OCR device 12, for performing OCR on the input original image and outputting the recognition result, the recognition result comprising the recognized characters (including symbols) and their rectangular coordinates (rectangular boxes);
(3) a missed-recognition judging device 13, for processing the output characters one by one and judging whether the strokes of each character all lie within its rectangular box; if not, then according to the output rectangular coordinates of the character, image growth is performed outward from the four sides of the rectangle, black pixels adjacent to the rectangular box are also regarded as pixels of the character, all pixels of the character are obtained, and all pixels of the character are deleted from the image; if so, processing continues with the next character; when all characters have been processed, the missed-recognition region of the image is obtained;
(4) a connected-region search device 14, for searching for connected regions in the missed-recognition region;
(5) an image and noise filtering device 15, for filtering out images and noise in the missed-recognition region;
(6) an output device 16, for outputting the missed characters.
As shown in steps S21 to S25 of Fig. 3, an OCR missed-recognition judging method comprises the following steps:
(1) data input: input an original image, the original image being a binary image;
The original image may be a binary image obtained by scanning with a scanning device, or a binary image of a picture captured by a camera device.
(2) perform OCR on the input original image and output the recognition result;
When OCR is performed on the input original image or text, existing OCR technology may be used. The output recognition result comprises the recognized characters and their rectangular coordinates (rectangular boxes).
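For illustration, the recognition result of step (2) can be modeled as a list of records, each holding a character and its rectangular box. This is only a sketch under assumptions: the class name `RecognizedChar`, its field names and the inclusive-coordinate convention are hypothetical and do not correspond to the output format of any particular OCR engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognizedChar:
    """One OCR result: the recognized text and its box in pixel coordinates."""
    text: str
    left: int
    top: int
    right: int   # inclusive (assumed convention)
    bottom: int  # inclusive (assumed convention)

    @property
    def box(self):
        """The rectangular box as a (left, top, right, bottom) tuple."""
        return (self.left, self.top, self.right, self.bottom)

    @property
    def area(self):
        """Box area in pixels, used later when comparing against connected regions."""
        return (self.right - self.left + 1) * (self.bottom - self.top + 1)


# Hypothetical recognition result for two characters on one text line.
result = [
    RecognizedChar("A", 10, 12, 29, 41),
    RecognizedChar("B", 33, 12, 52, 41),
]
```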
(3) missed-recognition judgment: for the recognition result of step (2), the output characters are processed one by one, judging whether the strokes of each character all lie within its rectangular box. If not, then according to the output rectangular coordinates of the character, image growth is performed outward from the four sides of the rectangle, black pixels adjacent to the rectangular box are also regarded as pixels of the character, all pixels of the character are obtained, and all pixels of the character are deleted from the image. If so, processing continues with the next character. When all characters have been processed, the missed-recognition region of the image is obtained.
The rectangular box of a character is grown as follows: starting from an edge of the rectangular box, whenever an effective pixel connected to the character is encountered, that edge is expanded outward, until no effective pixels remain; this yields the new boundary of the character.
The character rectangular box is grown outward pixel by pixel, and the growth ratio is usually kept within 50%.
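The growth step can be sketched as a bounded flood fill. This is one possible reading rather than the patent's exact procedure: it seeds from every black pixel inside the OCR box instead of walking the four edges, it uses 8-connectivity, and it interprets the 50% limit as "each side may move outward by at most half the box width or height". The function name `grow_character` and the toy image are hypothetical.

```python
from collections import deque


def grow_character(image, box, max_ratio=0.5):
    """Collect a character's pixels by growing outward from its OCR box.

    image: list of rows, 1 = black pixel. box: (left, top, right, bottom), inclusive.
    Returns (pixels, grown_box).
    """
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    # Assumed reading of the 50% limit: cap growth per side relative to box size.
    dx = int((right - left + 1) * max_ratio)
    dy = int((bottom - top + 1) * max_ratio)
    x_min, x_max = max(0, left - dx), min(width - 1, right + dx)
    y_min, y_max = max(0, top - dy), min(height - 1, bottom + dy)

    # Seeds: black pixels already inside the OCR box.
    pixels = {(x, y)
              for y in range(top, bottom + 1)
              for x in range(left, right + 1)
              if image[y][x] == 1}
    queue = deque(pixels)
    while queue:
        x, y = queue.popleft()
        for ny in (y - 1, y, y + 1):
            for nx in (x - 1, x, x + 1):
                if (x_min <= nx <= x_max and y_min <= ny <= y_max
                        and image[ny][nx] == 1 and (nx, ny) not in pixels):
                    pixels.add((nx, ny))
                    queue.append((nx, ny))

    # New boundary: the original box extended to cover every collected pixel.
    xs = [x for x, _ in pixels] + [left, right]
    ys = [y for _, y in pixels] + [top, bottom]
    return pixels, (min(xs), min(ys), max(xs), max(ys))


# Hypothetical glyph: a stroke leaves the 3x3 box at the top and runs on for two pixels.
IMAGE = [
    [0, 0, 0, 1, 0, 0, 0],  # y = 0: beyond the growth cap
    [0, 0, 0, 1, 0, 0, 0],  # y = 1: stroke tip just above the box
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
pixels, grown_box = grow_character(IMAGE, (2, 2, 4, 4))
```

In this example the pixel at (3, 1) is absorbed into the character, while the pixel at (3, 0) stays out because the cap allows only one row of growth for a box three pixels high.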
As shown in Fig. 4, Fig. 4a is a schematic diagram of the rectangular box of the character "A" before growth: the box intersects the strokes of the character, and some pixels outside the box still belong to the character "A". Fig. 4b is a schematic diagram after the box boundary of the character "A" has been grown upward, downward, leftward and rightward: the whole character "A" now lies within the grown box.
After the character rectangular boxes have been grown and the character pixels within all character boxes have been deleted from the image, the missed-recognition region of the image is obtained.
Fig. 5 shows the characters and character rectangular boxes after OCR; 51 and 52 are characters missed in the OCR process.
After the rectangular boxes of all characters in Fig. 5 have been grown and the pixels in the character boxes deleted, the resulting missed-recognition region is as shown in Fig. 6. The light-colored black dot above the character "entirely" would also be a missed-recognition pixel, but after region growing is applied to the box of "entirely", this dot can be judged to belong to that character and is not part of the missed-recognition region.
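The deletion step can be sketched as follows, assuming each recognized character has already been reduced to a collection of its pixel coordinates (for example by a region-growing pass). The function name `remove_characters` and the toy data are hypothetical; whatever black pixels survive form the missed-recognition region.

```python
def remove_characters(image, character_pixels):
    """Return a copy of a binary image (1 = black) with all character pixels cleared.

    character_pixels: an iterable of pixel collections, one per recognized
    character, each containing (x, y) coordinates.
    """
    residual = [row[:] for row in image]
    for pixels in character_pixels:
        for x, y in pixels:
            residual[y][x] = 0
    return residual


# Hypothetical page: a recognized 2x2 glyph on the left, an unrecognized stroke on the right.
PAGE = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
recognized = [{(0, 0), (1, 0), (0, 1), (1, 1)}]

# Only the unrecognized stroke in the last column remains black.
residual = remove_characters(PAGE, recognized)
```

Working on a copy keeps the original image available for the later manual comparison or re-recognition that the method prepares for.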
(4) search for connected regions in the missed-recognition region, and filter out images and noise;
The connected regions are searched by scanning the surroundings of the black pixels in all missed-recognition regions obtained in step (3), obtaining all connected regions;
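A connected-region search over the remaining black pixels can be sketched with a breadth-first traversal. The source does not say whether 4- or 8-connectivity is used; 8-connectivity is assumed here, and the function name `connected_regions` is hypothetical.

```python
from collections import deque


def connected_regions(image):
    """Group the black pixels (value 1) of a binary image into 8-connected regions.

    Returns a list of regions, each a list of (x, y) coordinates.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Start a new region and scan around each of its pixels in turn.
            seen[y][x] = True
            queue = deque([(x, y)])
            region = []
            while queue:
                cx, cy = queue.popleft()
                region.append((cx, cy))
                for ny in (cy - 1, cy, cy + 1):
                    for nx in (cx - 1, cx, cx + 1):
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            regions.append(region)
    return regions
```

For example, calling `connected_regions([[1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]])` yields two regions of two pixels each, while two diagonally touching pixels count as a single region under the 8-connectivity assumption.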
According to the area of each connected region relative to the area of a character rectangular box, regions whose area is much larger than the character box area are regarded as images; for example, connected regions larger than 4 to 8 times the character box area are usually regarded as images, the specific value being determined according to actual conditions. Regions whose area is much smaller than the character box area are regarded as noise; for example, connected regions smaller than 1/8 to 1/16 of the character box area are regarded as noise, the specific value likewise being determined according to actual conditions. These images and noise are then filtered out.
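The area-based filter can be sketched as a three-way classification. Several points are assumptions rather than statements from the source: the area of a connected region is taken to be the area of its bounding box, the reference character box area is supplied by the caller (for instance a typical box area on the page), and the defaults 4 and 1/8 are just one choice from the stated ranges. The function names are hypothetical.

```python
def bounding_box_area(region):
    """Area of the smallest axis-aligned box enclosing a list of (x, y) pixels."""
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


def classify_region(region, char_box_area, image_factor=4.0, noise_factor=1 / 8):
    """Label a connected region as 'image', 'noise' or 'missed_character'."""
    area = bounding_box_area(region)
    if area > image_factor * char_box_area:
        return "image"   # far larger than a character: picture or figure
    if area < noise_factor * char_box_area:
        return "noise"   # far smaller than a character: speckle
    return "missed_character"


def missed_characters(regions, char_box_area):
    """Keep only the regions whose size is comparable to a character box."""
    return [r for r in regions
            if classify_region(r, char_box_area) == "missed_character"]
```

With a reference box area of 100 pixels, a single stray pixel is classified as noise, a region spanning 10 by 9 pixels as a missed character, and a region spanning 30 by 30 pixels as an image. Both factors are parameters because the text notes that the specific values depend on actual conditions.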
(5) output the missed characters;
Among the connected regions of step (4), those close in size to a character rectangular box are regarded as missed characters and are output, in preparation for subsequent manual re-entry or re-recognition.
As the above embodiment shows, with the method of the present invention the recognized image is searched automatically and the regions missed by the OCR can be found quickly, in preparation for subsequent manual re-entry or re-recognition, replacing the heavy work of manually searching for missed-recognition regions in book digitization.
It should be noted that the above specific embodiment is only exemplary. Under the above teaching of the present invention, those skilled in the art can make various improvements and variations on the basis of the above embodiment, and these improvements or variations fall within the protection scope of the present invention. Those skilled in the art will understand that the specific description above is intended only to explain the purpose of the present invention and not to limit it. The protection scope of the present invention is defined by the claims and their equivalents.

Claims (8)

1. An OCR missed-recognition judging method, comprising the following steps:
(1) data input: input an original image, the original image being a binary image;
(2) perform OCR on the input original image and output the recognition result, the output recognition result comprising the recognized characters and their rectangular coordinates, that is, their rectangular boxes;
(3) missed-recognition judgment: determine the missed-recognition region according to the recognition result output in step (2); the missed-recognition region is determined as follows: the output characters are processed one by one, judging whether the strokes of each character all lie within its rectangular box; if not, then according to the output rectangular coordinates of the character, image growth is performed outward from the four sides of the rectangle, black pixels adjacent to the rectangular box are also regarded as pixels of the character, all pixels of the character are obtained, and all pixels of the character are deleted from the image; if so, processing continues with the next character; when all characters have been processed, the missed-recognition region of the image is obtained;
(4) search for connected regions in the missed-recognition region, and filter out images and noise;
(5) output the missed characters.
2. The OCR missed-recognition judging method according to claim 1, characterized in that the rectangular box of a character is grown as follows: starting from an edge of the rectangular box, whenever an effective pixel connected to the character is encountered, that edge is expanded outward, until no effective pixels remain; this yields the new boundary of the character.
3. The OCR missed-recognition judging method according to claim 2, characterized in that: the character rectangular box is grown outward pixel by pixel, and the growth ratio is kept within 50%.
4. The OCR missed-recognition judging method according to any one of claims 1 to 3, characterized in that: in step (4), the connected regions in the missed-recognition region are searched by scanning the surroundings of the black pixels in all missed-recognition regions obtained in step (3), obtaining all connected regions;
then, comparing the area of each connected region with the area of a character rectangular box, regions whose area is much larger than the character box area are regarded as images, regions whose area is much smaller than the character box area are regarded as noise, and these images and noise are filtered out.
5. The OCR missed-recognition judging method according to claim 4, characterized in that: connected regions larger than 4 to 8 times the character box area are regarded as images.
6. The OCR missed-recognition judging method according to claim 4, characterized in that: connected regions smaller than 1/8 to 1/16 of the character box area are regarded as noise.
7. The OCR missed-recognition judging method according to claim 4, characterized in that: in step (5), connected regions whose area is close to the character box area are regarded as missed characters and are output.
8. An OCR missed-recognition judging system, comprising the following devices:
a data input device, for inputting an original image, the original image being a binary image;
an OCR device, for performing OCR on the input original image and outputting the recognition result, the recognition result comprising the recognized characters and their rectangular coordinates, that is, their rectangular boxes;
a missed-recognition judging device, for processing the output characters one by one and judging whether the strokes of each character all lie within its rectangular box; if not, then according to the output rectangular coordinates of the character, image growth is performed outward from the four sides of the rectangle, black pixels adjacent to the rectangular box are also regarded as pixels of the character, all pixels of the character are obtained, and all pixels of the character are deleted from the image; if so, processing continues with the next character; when all characters have been processed, the missed-recognition region of the image is obtained;
a connected-region search device, for searching for connected regions in the missed-recognition region;
an image and noise filtering device, for filtering out images and noise in the missed-recognition region;
an output device, for outputting the missed characters.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104463653A CN102592121B (en) 2011-12-28 2011-12-28 Method and system for judging leakage recognition based on OCR (Optical Character Recognition)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104463653A CN102592121B (en) 2011-12-28 2011-12-28 Method and system for judging leakage recognition based on OCR (Optical Character Recognition)

Publications (2)

Publication Number Publication Date
CN102592121A CN102592121A (en) 2012-07-18
CN102592121B true CN102592121B (en) 2013-12-04

Family

ID=46480735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104463653A Expired - Fee Related CN102592121B (en) 2011-12-28 2011-12-28 Method and system for judging leakage recognition based on OCR (Optical Character Recognition)

Country Status (1)

Country Link
CN (1) CN102592121B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929843B (en) * 2012-09-14 2015-10-14 《中国学术期刊(光盘版)》电子杂志社有限公司 A kind of method that word is adapted system and adapted
CN104537026B (en) * 2014-12-22 2018-08-24 福建亿榕信息技术有限公司 Archives of paper quality document handling method based on local cache
CN104765815B (en) * 2015-04-03 2016-11-09 北京奇虎科技有限公司 A kind of method and apparatus identifying search keyword
CN106372632B (en) * 2016-08-23 2019-04-16 山西同方知网数字出版技术有限公司 A method of the leakage based on OCR is known text and is detected automatically
CN108875737B (en) * 2018-06-11 2022-06-21 四川骏逸富顿科技有限公司 Method and system for detecting whether check box is checked in paper prescription document

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL115971A (en) * 1995-11-14 1997-01-10 Razon Moshe Computer stereo vision system and method
DE19820353C2 (en) * 1998-05-07 2001-06-13 Ibm Method and device for recognizing a pattern on a template
CN101398894B (en) * 2008-06-17 2011-12-07 浙江师范大学 Automobile license plate automatic recognition method and implementing device thereof

Also Published As

Publication number Publication date
CN102592121A (en) 2012-07-18

Similar Documents

Publication Publication Date Title
CN102592121B (en) Method and system for judging leakage recognition based on OCR (Optical Character Recognition)
CN106960208B (en) Method and system for automatically segmenting and identifying instrument liquid crystal number
TWI536277B (en) Form identification method and device
US8693790B2 (en) Form template definition method and form template definition apparatus
CN103034856B (en) The method of character area and device in positioning image
CN109977723A (en) Big bill picture character recognition methods
CN105654072A (en) Automatic character extraction and recognition system and method for low-resolution medical bill image
CN108146093B (en) Method for removing bill seal
CN112183038A (en) Form identification and typing method, computer equipment and computer readable storage medium
CN109766749A (en) A kind of detection method of the bending table line for financial statement
CN110309806B (en) Gesture recognition system and method based on video image processing
CN104978576A (en) Character identification method and device thereof
CN109409378A (en) A kind of digitalized processing method of Nahsi Dongba Confucian classics
CN110516673A (en) Ancient Books in Yi Language character detection method based on connected component and regression equation character segmentation
CN109886257A (en) Using the method for deep learning correction invoice picture segmentation result in a kind of OCR system
CN115273115A (en) Document element labeling method and device, electronic equipment and storage medium
CN111881769A (en) Method and system for table labeling
CN111445402B (en) Image denoising method and device
CN113793264B (en) Archive image processing method and system based on convolution model and electronic equipment
CN108985287A (en) Notebook paper and classification icon-based programming method
CN106022246A (en) Difference-based patterned-background print character extraction system and method
WO2001093188A1 (en) Method for processing document, recorded medium on which document processing program is recorded and document processor
CN111814780A (en) Bill image processing method, device and equipment and storage medium
CN112560820B (en) Table detection method and device
CN116129456B (en) Method and system for identifying and inputting property rights and interests information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131204

Termination date: 20141228

EXPY Termination of patent right or utility model