CN102411707A - Method and device for identifying text in picture - Google Patents

Info

Publication number
CN102411707A
Authority
CN
China
Prior art keywords
picture
module
pixel
exist
chinese version
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103394942A
Other languages
Chinese (zh)
Inventor
张国威
陈晓鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CENTURY DRAGON INFORMATION NETWORK Co Ltd
Original Assignee
CENTURY DRAGON INFORMATION NETWORK Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-10-31
Filing date
2011-10-31
Publication date
2012-04-11
Application filed by CENTURY DRAGON INFORMATION NETWORK Co Ltd
Priority to CN2011103394942A
Publication of CN102411707A

Landscapes

  • Character Input (AREA)

Abstract

The invention provides a method and a device for identifying text in a picture. The method comprises the steps of: removing interference from the picture to be processed; counting the pixels of the de-interfered picture along the horizontal or vertical direction; and judging whether a periodic pattern exists, the picture being judged to contain text if it does. The invention greatly reduces the processor load, so that text detection in pictures can be applied to an email filtering system. It can be applied wherever it must be determined whether a picture contains a large amount of text, such as email scanning and the quick scanning of short messages and multimedia messages, so that such scanning can detect whether a picture contains text.

Description

A method and device for recognizing text in a picture
Technical field
The present invention relates to image processing technology, and in particular to technology for recognizing text in pictures.
Background art
Spam has become rampant on the Internet in recent years. Earlier spam, which was based on plain text, could be filtered well by text filtering or keyword marking. More recently, however, spammers have realized that plain text is easily intercepted by anti-spam systems and have switched to embedding the text in pictures, which leaves anti-spam systems based on text scanning helpless. The pictures in an email must therefore be examined to check whether they contain a large amount of text.
The traditional method of recognizing text in a picture is OCR (Optical Character Recognition). After long development, OCR now achieves a very high recognition rate. However, traditional OCR consumes a large amount of processor resources and needs a long time to recognize text. An anti-spam server handling many concurrent connections cannot afford this processor and time overhead.
Summary of the invention
The present invention provides a method and a device for recognizing text in a picture that reduce the processor load.
The method for recognizing text in a picture provided by the invention comprises the steps of:
Removing interference from the picture to be processed;
Counting the pixels of the de-interfered picture along the horizontal or vertical direction;
Judging whether a periodic pattern exists; if it does, the picture contains text.
Existing OCR technology can determine whether a picture contains text and can also recognize that text. However, OCR is slow, and spam filtering does not need the specific content of the text; it only needs to know whether the picture contains a large amount of text. Compared with OCR, the invention greatly reduces the processor load, so that text detection in pictures can be applied to mail filtering systems. The invention can be applied wherever it must be determined whether a picture contains a large amount of text, for example in mail scanning and the quick scanning of short messages and multimedia messages, so that such scanning can detect whether a picture contains text.
Description of drawings
Fig. 1 is the flow chart of embodiment 1;
Fig. 2 is an example picture after gray-scale processing;
Fig. 3 is the picture after binarization;
Fig. 4 is the distribution after pixel counting;
Fig. 5 is the difference distribution after difference processing;
Fig. 6 is another picture after gray-scale processing;
Fig. 7 is the picture after binarization;
Fig. 8 is the distribution after pixel counting;
Fig. 9 is the difference distribution after difference processing;
Fig. 10 is the block diagram of the device for recognizing text in a picture.
Embodiments
The aim of the invention is to judge quickly, while consuming as few resources as possible, whether a picture contains a large amount of text. The following sections set out the processing flow of the invention through examples.
Embodiment 1: Fig. 1 is the flow chart of this embodiment. First, interference is removed from the picture to be processed (step 1);
The pixels of the de-interfered picture are counted along the horizontal or vertical direction (step 2);
Whether the resulting distribution curve shows a periodic pattern is judged; if it does, the picture contains text (step 3).
Embodiment 2: As an optimization of embodiment 1, this embodiment provides a method for removing the interference caused by color changes in the picture: before step 1 is executed, the picture is first gray-scale processed to turn it into a gray-scale picture. As one implementation, the R, G and B values of each pixel of the whole picture can be added and averaged to produce the gray-scale picture.
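The channel averaging described in embodiment 2 can be sketched as follows. This is a minimal illustration, assuming the picture is held as rows of `(r, g, b)` tuples; the function name `to_grayscale` and the sample pixel values are hypothetical and do not come from the patent.

```python
def to_grayscale(rgb_image):
    """Average the R, G and B values of every pixel.

    rgb_image is assumed to be a list of rows, each row a list of
    (r, g, b) tuples. The result has the same shape with one gray
    value per pixel.
    """
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in rgb_image]


# Hypothetical 2x2 picture used only to exercise the function.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(30, 60, 90), (200, 100, 0)],
]
gray = to_grayscale(image)
print(gray)
```

A plain average is what the description states; weighted luminance formulas are a common alternative but are not mentioned in the source.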
Embodiment 3: The interference removal in step 1 of embodiments 1 and 2 can use any of several methods known in the prior art, for example mean binarization, largest connected region selection, or largest color block area judgment.
As a preferred embodiment, the invention provides a method of removing interference by mean binarization: the values of all pixels are averaged, and each pixel is then compared with the mean; pixels above the mean are set to 1 and pixels below the mean are set to 0.
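Mean binarization as described above can be sketched in a few lines. The sketch assumes a gray-scale picture stored as a list of rows of numbers; the name `mean_binarize` is hypothetical. The source does not say what happens to a pixel exactly equal to the mean, so mapping it to 0 here is an assumption.

```python
def mean_binarize(gray):
    """Set pixels above the global mean to 1 and all others to 0."""
    pixels = [v for row in gray for v in row]
    mean = sum(pixels) / len(pixels)
    # Assumption: a pixel equal to the mean is treated as "below" (0).
    return [[1 if v > mean else 0 for v in row] for row in gray]


# Hypothetical 2x2 gray-scale picture; its mean is 115.
gray = [
    [10, 200],
    [220, 30],
]
binary = mean_binarize(gray)
print(binary)
```

Because a single global threshold is used, the cost is two passes over the pixels, which fits the stated goal of keeping the processor load low.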
Embodiment 4: The periodic pattern of the distribution curve can already be seen in step 3 of the embodiments above. To make the periodic pattern easier to recognize, however, this embodiment further optimizes the embodiments above by additionally processing the distribution curve after step 3 so that the pattern becomes more obvious. Possible processing methods include difference processing, peak binarization, and squaring the original values.
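The source names these curve optimizations without defining them, so the following is only one plausible reading. Squaring is taken to mean squaring each value of the distribution, and peak binarization is taken to mean thresholding the distribution at a fraction of its maximum. The function names and the `fraction` parameter are hypothetical.

```python
def square_profile(profile):
    """Square every value so that large counts stand out from small ones."""
    return [v * v for v in profile]


def peak_binarize(profile, fraction=0.5):
    """Mark values that reach at least `fraction` of the peak with 1.

    Assumption: the threshold is relative to the maximum of the
    distribution; the source does not specify how it is chosen.
    """
    peak = max(profile, default=0)
    return [1 if peak > 0 and v >= fraction * peak else 0 for v in profile]


# Hypothetical per-row pixel counts with two text-like bands.
profile = [1, 2, 10, 9, 1, 2, 11, 10, 1]
squared = square_profile(profile)
marked = peak_binarize(profile)
print(squared)
print(marked)
```

Either transform leaves the spacing of the bands unchanged, which is the property the later periodicity judgment relies on.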
Embodiment 5: This is the most preferred embodiment of the invention. A color picture is obtained from a spam email; the picture contains text, and interference has been introduced by varying the colors. Because it is a color picture, it is first gray-scale processed to remove this interference component. The processing method is to add and average the R, G and B values of each pixel of the whole picture, producing a gray-scale picture. Fig. 2 is the picture after gray-scale processing.
The gray-scale picture has only lost its color; the text still has varying gray levels, so the picture cannot yet be used to determine whether text is present. Mean binarization is therefore needed. The processing method is to average the values of all pixels and then compare each pixel with the mean: pixels above the mean are set to 1 and pixels below the mean are set to 0. Mean binarization thus removes the interference of the horizontal background lines and also makes the text stand out more. Fig. 3 is the processed picture.
Next, for the mean-binarized picture, the pixels of each row are added up in the horizontal direction (this embodiment uses only the horizontal direction as an example); the sum is the number of points in each row whose value is 1. The per-row sums are plotted as the horizontal pixel count distribution shown in Fig. 4.
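The per-row counting can be sketched as below, assuming the binarized picture is a list of rows of 0 and 1 values. `row_projection` follows the horizontal example in the text; `column_projection` is the vertical counterpart that the method allows but the embodiment does not illustrate. Both names are hypothetical.

```python
def row_projection(binary):
    """Count the pixels equal to 1 in every row (horizontal direction)."""
    return [sum(row) for row in binary]


def column_projection(binary):
    """Count the pixels equal to 1 in every column (vertical direction)."""
    return [sum(col) for col in zip(*binary)]


# Hypothetical 4x4 binarized picture.
binary = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
rows = row_projection(binary)
cols = column_projection(binary)
print(rows)
print(cols)
```

For horizontally written text, rows that cross a text line give large counts and the gaps between lines give small ones, which produces the alternating distribution the description refers to.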
The horizontal distribution in Fig. 4 shows that the distribution curve has a fixed periodic pattern. If a picture processed as above yields a distribution curve that fluctuates with such a fixed period, the picture is highly likely to contain text. Although the periodic pattern is already visible in Fig. 4, to let a computer program recognize it more easily, the difference between every two adjacent points of the data in Fig. 4 is taken. The result is the difference distribution shown in Fig. 5.
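Taking the difference between adjacent points is a one-line operation. The sketch below assumes the distribution is a list of per-row counts; the function name `difference` and the sample values are hypothetical.

```python
def difference(profile):
    """Return profile[i + 1] - profile[i] for every adjacent pair."""
    return [b - a for a, b in zip(profile, profile[1:])]


# Hypothetical per-row counts: two text-like bands separated by empty rows.
profile = [0, 0, 12, 14, 13, 0, 0, 11, 15, 12, 0, 0]
diff = difference(profile)
print(diff)
```

Each band edge turns into a large positive or negative value while the interior of a band stays near zero, which is why the spikes are easier for a program to locate than the bands themselves.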
The difference distribution in Fig. 5 shows obvious periodic spikes, whose presence a program can detect easily. If the processing above yields a difference distribution with obvious periodic spikes, the picture very likely contains a large amount of text.
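The source does not say how a program should decide that the spikes are periodic. One simple possibility, offered only as a sketch, is to locate the large positive spikes and check that the gaps between them are roughly equal. The function names and the parameters `rel_threshold`, `min_spikes` and `tolerance` are all hypothetical choices, not values from the patent.

```python
def find_spikes(diff, rel_threshold=0.5):
    """Return the start index of every run of large positive values."""
    peak = max(diff, default=0)
    if peak <= 0:
        return []
    spikes, in_run = [], False
    for i, d in enumerate(diff):
        hit = d >= rel_threshold * peak
        if hit and not in_run:
            spikes.append(i)  # record only the first index of a run
        in_run = hit
    return spikes


def has_periodic_spikes(diff, rel_threshold=0.5, min_spikes=3, tolerance=0.25):
    """Judge periodicity by how evenly the spikes are spaced."""
    spikes = find_spikes(diff, rel_threshold)
    if len(spikes) < min_spikes:
        return False
    gaps = [b - a for a, b in zip(spikes, spikes[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return all(abs(g - mean_gap) <= tolerance * mean_gap for g in gaps)


# Hypothetical difference distributions: one text-like, one noise-like.
text_like = [0, 12, 2, -1, -13, 0, 11, 4, -3, -12, 0, 12, 1, 0, -13, 0]
noise_like = [3, -2, 5, -4, 1, 0, -3, 4, -1, 2, -5, 3, 5, -2, 0, -1]
print(has_periodic_spikes(text_like))
print(has_periodic_spikes(noise_like))
```

Other tests, such as autocorrelation of the distribution, would serve the same purpose; the spacing check is used here only because it is short and cheap.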
As a comparison, consider a picture that contains no text. The processing flow is the same as the steps above and is not repeated. Fig. 9 is the distribution after the final difference processing. The distribution shows no pattern and is close to white noise.
Corresponding to the method above, the invention also provides a device for recognizing text in a picture which, as shown in Fig. 10, comprises a de-interference module, a statistics module and a discrimination module. The de-interference module performs the interference removal of the embodiments above; the statistics module performs the pixel counting on the de-interfered picture; the discrimination module determines whether a periodic pattern exists and thereby judges whether a large amount of text is present.
In addition, as a preferred embodiment, the device can also comprise a gray-scale module, which gray-scale processes the picture before interference removal, and an optimization module, which performs optimization processing, for example difference processing, on the distribution curve obtained from the counting.
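The module structure can be sketched as a small class that chains pluggable callables in the order the description gives. Everything here is illustrative: the class name `TextDetector`, its parameter names, and the stand-in functions are hypothetical, and the `evenly_spaced_peaks` judgment in particular is a placeholder rather than the patent's criterion.

```python
class TextDetector:
    """Chains gray-scale, de-interference, statistics, optimization
    and discrimination modules. The gray-scale and optimization
    modules are optional, as in the description."""

    def __init__(self, deinterfere, count, discriminate,
                 grayscale=None, optimize=None):
        self.grayscale = grayscale
        self.deinterfere = deinterfere
        self.count = count
        self.optimize = optimize
        self.discriminate = discriminate

    def has_text(self, picture):
        if self.grayscale is not None:
            picture = self.grayscale(picture)
        profile = self.count(self.deinterfere(picture))
        if self.optimize is not None:
            profile = self.optimize(profile)
        return self.discriminate(profile)


def mean_binarize(gray):
    flat = [v for row in gray for v in row]
    mean = sum(flat) / len(flat)
    return [[1 if v > mean else 0 for v in row] for row in gray]


def evenly_spaced_peaks(diff):
    # Placeholder judgment: at least three maxima with identical spacing.
    top = max(diff, default=0)
    peaks = [i for i, d in enumerate(diff) if top > 0 and d == top]
    gaps = {b - a for a, b in zip(peaks, peaks[1:])}
    return len(peaks) >= 3 and len(gaps) == 1


detector = TextDetector(
    deinterfere=mean_binarize,
    count=lambda binary: [sum(row) for row in binary],
    discriminate=evenly_spaced_peaks,
    optimize=lambda p: [b - a for a, b in zip(p, p[1:])],
)

# Hypothetical gray-scale inputs: regular light/dark bands vs. a flat picture.
striped = [[200] * 4 if (y // 2) % 2 == 0 else [20] * 4 for y in range(16)]
flat = [[100] * 4 for _ in range(16)]
print(detector.has_text(striped))
print(detector.has_text(flat))
```

Passing the modules in as callables keeps each stage replaceable, for example swapping mean binarization for another de-interference method named in embodiment 3.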
The embodiments above are preferred implementations of the invention, but implementations of the invention are not limited to them. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the invention is an equivalent replacement and falls within the scope of protection of the invention.

Claims (12)

1. A method for recognizing text in a picture, characterized in that it comprises the steps of:
Removing interference from the picture to be processed;
Counting the pixels of the de-interfered picture along the horizontal or vertical direction;
Judging whether a periodic pattern exists; if it does, the picture contains text.
2. The method for recognizing text in a picture according to claim 1, characterized in that interference is removed through the following step:
Performing mean binarization on the picture.
3. The method for recognizing text in a picture according to claim 2, characterized in that the binarization proceeds as follows:
The values of all pixels of the picture are averaged, and each pixel is then compared with the mean; pixels above the mean are set to 1 and pixels below the mean are set to 0.
4. The method for recognizing text in a picture according to claim 1, characterized in that, before the conventional mean binarization is performed on the picture, the method further comprises the step of:
Performing gray-scale processing on the picture.
5. The method for recognizing text in a picture according to claim 1 or 2, characterized in that whether a periodic pattern exists is judged from the distribution obtained by the counting.
6. The method for recognizing text in a picture according to claim 1 or 2, characterized in that, between the step of counting the pixels and the step of judging whether a periodic pattern exists, the method further comprises the steps of:
Performing difference processing on the distribution obtained by the counting;
Detecting whether periodic spikes exist in the difference distribution.
7. The method for recognizing text in a picture according to claim 1, characterized in that, after the step of judging whether a periodic pattern exists, the method further comprises: if one exists, estimating from the periodic distribution the approximate proportion of the picture area occupied by text.
8. A device for recognizing text in a picture, characterized in that it comprises:
A de-interference module, for removing interference from a picture;
A statistics module, for counting the pixels of the de-interfered picture along the horizontal or vertical direction;
A discrimination module, for judging whether a periodic pattern exists and determining from the result whether the picture contains text.
9. The device for recognizing text in a picture according to claim 8, characterized in that the de-interference module removes interference by performing mean binarization on the picture, the mean binarization averaging the values of all pixels of the picture and then comparing each pixel with the mean; pixels above the mean are set to 1 and pixels below the mean are set to 0.
10. The device for recognizing text in a picture according to claim 8, characterized in that it further comprises a gray-scale processing module, for performing gray-scale processing on the picture processed by the de-interference module and sending the processed picture to the statistics module.
11. The device for recognizing text in a picture according to claim 8 or 9, characterized in that the discrimination module judges whether a periodic pattern exists from the distribution produced by the statistics module.
12. The device for recognizing text in a picture according to claim 8 or 9, characterized in that it further comprises an optimization module, for performing difference processing on the distribution obtained by the counting and sending the difference distribution to the discrimination module;
The discrimination module detects whether periodic spikes exist in the difference distribution in order to judge whether a periodic pattern exists.
CN2011103394942A 2011-10-31 2011-10-31 Method and device for identifying text in picture Pending CN102411707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103394942A CN102411707A (en) 2011-10-31 2011-10-31 Method and device for identifying text in picture

Publications (1)

Publication Number Publication Date
CN102411707A (en) 2012-04-11

Family

ID=45913774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103394942A Pending CN102411707A (en) 2011-10-31 2011-10-31 Method and device for identifying text in picture

Country Status (1)

Country Link
CN (1) CN102411707A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1685358A (en) * 2002-07-31 2005-10-19 里昂中央理工学院 Method and system for automatically locating text areas in an image
CN1790377A (en) * 2004-12-17 2006-06-21 佳能株式会社 Reverse character recognition method, quick and accurate block sorting method and text line generation method
CN101398894A (en) * 2008-06-17 2009-04-01 浙江师范大学 Automobile license plate automatic recognition method and implementing device thereof
CN101615252A (en) * 2008-06-25 2009-12-30 中国科学院自动化研究所 A kind of method for extracting text information from adaptive images
CN101436299A (en) * 2008-11-19 2009-05-20 哈尔滨工业大学 Method for detecting natural scene image words
CN102081731A (en) * 2009-11-26 2011-06-01 中国移动通信集团广东有限公司 Method and device for extracting text from image
CN102184390A (en) * 2011-05-17 2011-09-14 姜雨枫 Container number-orientated character image identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑瑞东: "利用网页特征识别不良图像网页", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9928225B2 (en) 2012-01-23 2018-03-27 Microsoft Technology Licensing, Llc Formula detection engine
CN105247509A (en) * 2013-03-11 2016-01-13 微软技术许可有限责任公司 Detection and reconstruction of east asian layout features in a fixed format document
US10127221B2 (en) 2013-03-11 2018-11-13 Microsoft Technology Licensing, Llc Detection and reconstruction of East Asian layout features in a fixed format document
CN105247509B (en) * 2013-03-11 2018-11-23 微软技术许可有限责任公司 It detects and reconstructs the East Asia spatial layout feature in fixed-format document
CN104766301A (en) * 2015-01-13 2015-07-08 台州职业技术学院 Monochrome collecting algorithm based on image
CN104766301B (en) * 2015-01-13 2017-07-28 金斯科 A kind of monochromatic gathering algorithm based on image
CN106022246A (en) * 2016-05-16 2016-10-12 浙江大学 Difference-based patterned-background print character extraction system and method
CN106022246B (en) * 2016-05-16 2019-05-21 浙江大学 A kind of decorative pattern background printed matter Word Input system and method based on difference

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120411