CN104715248B - A kind of recognition methods to email advertisement picture - Google Patents

Method for recognizing advertisement pictures in email

Info

Publication number
CN104715248B
Authority
CN
China
Prior art keywords
picture
coordinate system
text block
axis
virtual coordinate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510121822.XA
Other languages
Chinese (zh)
Other versions
CN104715248A (en)
Inventor
许广彬
徐慧灵
纪春来
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayun Industrial Internet Co ltd
Original Assignee
Wuxi Huayun Data Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Huayun Data Technology Service Co Ltd
Priority to CN201510121822.XA
Publication of CN104715248A
Application granted
Publication of CN104715248B
Legal status: Active
Anticipated expiration

Abstract

The present invention provides a method for recognizing advertisement pictures in email, comprising: S1, extracting a picture from an email, preprocessing it, and then determining the arrangement direction of its text blocks; S2, establishing a virtual coordinate system according to the arrangement direction of the text blocks; S3, calculating the binarized data of each text block of the picture in the virtual coordinate system; S4, counting the size and number of the text blocks in the picture; S5, judging whether the picture is an advertisement picture according to set thresholds. By projecting the text blocks of the picture onto the virtual coordinate system and calculating binarized data, the method can count the size and number of the text blocks in the picture and judge against the set thresholds whether the picture is an advertisement picture. This markedly improves the extraction of text from advertisement pictures in spam, gives strong resistance to interference, and reduces server load.

Description

Method for recognizing advertisement pictures in email
Technical Field
The present invention relates to the fields of spam processing and network security, and in particular to a method for recognizing advertisement pictures in email.
Background
Of the spam sent worldwide each year, picture-based spam accounts for more than 50% of the total. Recognition techniques for picture-based spam therefore urgently need upgrading and updating, so that such spam can be recognized more effectively and the spam filtering rate improved.
In the prior art, optical character recognition (OCR) is usually used to extract the text content of advertisement pictures, and the extracted content is then examined for advertising in order to recognize spam. OCR typically uses computer software, known as an OCR engine, to process digital images of text that was originally printed, typed, handwritten, or otherwise written on paper, microfilm, or other media, and to generate machine-recognizable and editable text from those images. The digital image of a document processed by an OCR engine may include images of multiple pages of written material. The images of text to be processed by the OCR engine can be obtained by various imaging methods, including capturing a digital image of the text with an image scanner. However, this approach is computationally expensive, extracts text from advertisement pictures poorly, and has a high misjudgment rate. It also performs poorly on spam that spammers have processed by adding interfering characters or by displaying content in vertical typesetting.
In view of this, the prior-art methods for recognizing advertisement pictures in email need to be improved to overcome the above defects.
Summary of the Invention
The object of the present invention is to disclose a method for recognizing advertisement pictures in email that improves text extraction from pictures containing text, so that spam containing advertisement pictures is recognized effectively, while reducing server load and improving the server's resistance to interference when filtering spam.
To achieve the above object, the present invention provides a method for recognizing advertisement pictures in email, comprising the following steps:
S1, extracting a picture from an email, preprocessing it, and then determining the arrangement direction of its text blocks;
S2, establishing a virtual coordinate system according to the arrangement direction of the text blocks;
S3, calculating the binarized data of each text block of the picture in the virtual coordinate system;
S4, counting the size and number of the text blocks in the picture;
S5, judging whether the picture is an advertisement picture according to set thresholds.
As a further improvement of the present invention, the preprocessing in step S1 includes border processing, color inversion processing, background removal processing, binarization processing, and noise reduction processing.
As a further improvement of the present invention, step S2 is specifically: establishing a matching virtual coordinate system for the picture according to the continuity of the projection result of the picture content on the virtual coordinate system.
As a further improvement of the present invention, step S3 is specifically: projecting each text block in the picture onto the polar axis of the virtual coordinate system; a coordinate point is marked black if it has a foreground pixel and white otherwise.
As a further improvement of the present invention, step S4 is specifically: performing independent projection processing on the binarized data in the picture relative to the polar axis of the virtual coordinate system, recording the width and height values of the character text blocks and non-character text blocks along the virtual coordinate system, counting their respective numbers, and saving the results to a server database.
As a further improvement of the present invention, the server database includes a MySQL database or an Oracle database.
As a further improvement of the present invention, the virtual coordinate system includes a one-axis virtual coordinate system and a two-axis virtual coordinate system.
As a further improvement of the present invention, the two-axis virtual coordinate system includes a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.
As a further improvement of the present invention, the set thresholds in step S5 are specifically: the number T of character text blocks ranges from 50 to 300; the total area of the character text blocks as a percentage of the picture area ranges from 50 to 100; and the number of non-character text blocks ranges from 0 to 2T.
Compared with the prior art, the beneficial effects of the present invention are as follows: by projecting the text blocks of the picture onto the virtual coordinate system and calculating binarized data, the method can count the size and number of the text blocks in the picture and judge against the set thresholds whether the picture is an advertisement picture. This markedly improves the extraction of text from advertisement pictures in spam, gives strong resistance to interference, and reduces server load.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the method for recognizing advertisement pictures in email according to the present invention;
Fig. 2 is one type of picture extracted from an email;
Fig. 3 is the picture generated after Fig. 2 undergoes the preprocessing of step S1;
Fig. 4 is another type of picture extracted from an email;
Fig. 5 is the picture generated after Fig. 4 undergoes the preprocessing of step S1;
Fig. 6 is a schematic diagram of determining the row and column directions by continuity analysis of the projection result of Fig. 3, in which foreground pixels are marked black;
Fig. 7 is a schematic diagram of determining the row and column directions by continuity analysis of the projection result of Fig. 5, in which foreground pixels are marked black;
Fig. 8 is a schematic diagram of independent projection processing applied to the first row of text blocks in the picture shown in Fig. 7;
Fig. 9 is a schematic diagram of recording the width and height values of the text blocks and the number of text blocks according to the projection result shown in Fig. 8.
Detailed Description
The present invention is described in detail below with reference to the embodiments shown in the accompanying drawings. It should be noted that these embodiments do not limit the present invention; equivalent transformations or substitutions in function, method, or structure made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the present invention.
In this embodiment, a method for recognizing advertisement pictures in email includes the following steps:
Step S1: extract the picture from the email, preprocess it, and then determine the arrangement direction of its text blocks. The preprocessing includes border processing, color inversion processing, background removal processing, binarization processing, and noise reduction processing.
Border processing judges whether the picture has a border; if it does, the outer and/or inner border of the picture is removed by cropping. Color inversion processing calculates the foreground and/or background color of the picture. Background removal processing obtains the background color of the picture by calculation and removes it, and at the same time swaps the foreground and background colors of a picture that has undergone color inversion processing. If the picture contains background interference such as scenery or people, this interference is removed according to the overall style or the pixel color value distribution of the picture extracted from the email in step S1. Binarization processing binarizes the whole picture extracted from the email in step S1, using an error compensation algorithm chosen according to the computer's configuration. A binarized picture file is very small, which makes it easier for the computer to judge later whether the picture is an advertisement picture. Noise reduction processing applies a double-background filtering method to the picture extracted by the computer, to reduce the adverse effect of noise in the picture on the later advertisement-picture recognition calculation.
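The binarization and foreground/background swap described above can be illustrated with a minimal Python sketch. It is not the method of the invention: it assumes a fixed gray threshold of 128 in place of the error compensation algorithm, and it assumes that the foreground (text) covers fewer pixels than the background. The function names `binarize` and `normalize_foreground` are hypothetical.

```python
def binarize(gray, threshold=128):
    """Map a grayscale picture (rows of 0-255 ints) to 0/1; 1 marks a dark pixel.

    Assumption: a fixed threshold stands in for the error compensation algorithm.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def normalize_foreground(binary):
    """Flip the picture if needed so that 1 always marks the foreground.

    Assumption: the foreground (text) covers fewer pixels than the background.
    """
    total = sum(len(row) for row in binary)
    ones = sum(sum(row) for row in binary)
    if ones * 2 > total:  # dark pixels dominate: light text on a dark background
        return [[1 - px for px in row] for row in binary]
    return binary


# One light "text" pixel in the middle of a dark background.
gray = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
foreground = normalize_foreground(binarize(gray))
```

After this step every later stage can treat 1 as a foreground pixel, whether the original picture used dark text on a light background or the reverse.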
Referring to Figs. 2 to 5, Fig. 2 yields the preprocessing result shown in Fig. 3 after color inversion processing, and Fig. 4 yields the preprocessing result shown in Fig. 5 after border processing.
Step S2: establish a virtual coordinate system according to the arrangement direction of the text blocks.
To determine the size and number of the text blocks in the picture, the arrangement direction of the text blocks contained in the picture content must be determined first. For example, the text blocks in Fig. 2 and Fig. 4 are arranged horizontally and vertically, respectively.
Referring to Fig. 6, step S2 is specifically: establish a matching virtual coordinate system for the picture according to the continuity of the projection result of the picture content on the virtual coordinate system. The virtual coordinate system includes a one-axis virtual coordinate system and a two-axis virtual coordinate system; the two-axis virtual coordinate system includes a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.
Specifically, if the text in the advertisement picture appears as a single horizontal line or a single vertical line, only a one-axis virtual coordinate system (horizontal) or a one-axis virtual coordinate system (vertical) is established, according to the arrangement direction of the text blocks in the picture.
If the text in the advertisement picture appears as multiple horizontal lines or multiple vertical lines, a two-axis orthogonal virtual coordinate system is established, with the polar axis in the horizontal direction defined as the X-axis and the polar axis in the vertical direction defined as the Y-axis.
If the text in the picture is arranged obliquely, the virtual coordinate system is established by rotating the picture that contains the text. This is done as follows.
Step 11: Establish coordinate axes along the natural width and height directions of the picture, labeling the vertical direction as the X-axis and the horizontal direction as the Y-axis. Calculate the highest and lowest points of the picture on the X-axis and the farthest and nearest points on the Y-axis, where:
The highest point is the point with the largest value in the X-axis direction;
The lowest point is the point with the smallest value in the X-axis direction;
The farthest point is the point with the largest value in the Y-axis direction;
The nearest point is the point with the smallest value in the Y-axis direction.
Step 12: Set the extreme-value deviation tdev = 20 px and calculate the high point set, low point set, far point set, and near point set as follows:
Points in the picture whose distance from the highest point in the X-axis direction is less than or equal to tdev are recorded as the high point set h;
Points in the picture whose distance from the lowest point in the X-axis direction is greater than or equal to tdev are recorded as the low point set l;
Points in the picture whose distance from the farthest point in the Y-axis direction is less than or equal to tdev are recorded as the far point set f;
Points in the picture whose distance from the nearest point in the Y-axis direction is greater than or equal to tdev are recorded as the near point set n.
Step 13: Calculate the widths of the high point set and the low point set, recorded as hw and lw respectively. Calculate the heights of the far point set and the near point set, recorded as fh and nh respectively.
Step 14: Judge whether the text content of the picture is a one-axis orthogonal picture. Set the one-axis orthogonal decision thresholds v11 = 20 and v12 = 80; the decision method is as follows:
If hw and lw are both less than or equal to v11, and fh or nh is greater than or equal to v12, the picture is judged to be one-axis orthogonal;
If fh and nh are both less than or equal to v11, and hw or lw is greater than or equal to v12, the picture is judged to be one-axis orthogonal.
If the picture is one-axis orthogonal, it can be used directly without further processing; otherwise, proceed to the next step.
Step 15: Judge whether the text content of the picture is a two-axis orthogonal picture. Set the two-axis orthogonal decision threshold v2 = 80; the decision method is as follows:
If hw or lw is greater than or equal to v2, the picture is judged to be two-axis orthogonal;
If fh or nh is greater than or equal to v2, the picture is judged to be two-axis orthogonal.
If the picture is a two-axis orthogonal picture, no further processing is needed; otherwise, proceed to the next step.
Step 16: Calculate the tilt angle of the text content of a two-axis non-orthogonal picture: the highest point and the farthest point are used to calculate the tilt angle of the picture's text content.
Step 17: Rotate the picture according to the tilt angle so that it becomes a two-axis orthogonal picture.
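Steps 11 to 16 can be sketched in Python as follows. This is an illustrative reading, not a definitive implementation. Step 12 as written uses "less than or equal" for the high and far sets but "greater than or equal" for the low and near sets; the sketch assumes instead that each of the four sets is the band of foreground points lying within tdev of its extreme. It also assumes that the tilt angle of Step 16 is the angle of the edge joining the highest point and the farthest point. The names `classify_layout`, `tilt_angle`, and `extent` are hypothetical.

```python
import math

TDEV = 20          # extreme-value deviation of Step 12, in pixels
V11, V12 = 20, 80  # one-axis decision thresholds of Step 14
V2 = 80            # two-axis decision threshold of Step 15


def extent(values):
    """Span covered by a list of coordinates."""
    return max(values) - min(values)


def classify_layout(points):
    """Classify foreground points given as (x, y): x vertical, y horizontal."""
    pts = list(points)
    if not pts:
        raise ValueError("no foreground points")
    x_high = max(x for x, _ in pts)
    x_low = min(x for x, _ in pts)
    y_far = max(y for _, y in pts)
    y_near = min(y for _, y in pts)
    # Step 12 (assumed reading): points within TDEV of each extreme.
    high = [p for p in pts if x_high - p[0] <= TDEV]
    low = [p for p in pts if p[0] - x_low <= TDEV]
    far = [p for p in pts if y_far - p[1] <= TDEV]
    near = [p for p in pts if p[1] - y_near <= TDEV]
    # Step 13: widths run along Y (horizontal), heights along X (vertical).
    hw = extent([y for _, y in high])
    lw = extent([y for _, y in low])
    fh = extent([x for x, _ in far])
    nh = extent([x for x, _ in near])
    # Step 14: a single line or a single column of text.
    if hw <= V11 and lw <= V11 and (fh >= V12 or nh >= V12):
        return "one-axis"
    if fh <= V11 and nh <= V11 and (hw >= V12 or lw >= V12):
        return "one-axis"
    # Step 15: an upright block with at least one long straight edge.
    if hw >= V2 or lw >= V2 or fh >= V2 or nh >= V2:
        return "two-axis orthogonal"
    return "two-axis non-orthogonal"


def tilt_angle(points):
    """Step 16 (assumed reading): angle, in degrees, of the edge that joins
    the highest point and the farthest point."""
    pts = list(points)
    hx, hy = max(pts, key=lambda p: p[0])  # highest point
    fx, fy = max(pts, key=lambda p: p[1])  # farthest point
    return math.degrees(math.atan2(hx - fx, fy - hy))
```

For an upright block the band near the top edge spans the full width, so hw is large. For a tilted block the top extreme is a corner, so all four bands stay short and the picture falls through to the tilt calculation.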
Step S3: calculate the binarized data of each text block of the picture in the virtual coordinate system, specifically: project each text block in the picture onto the polar axis of the virtual coordinate system; a coordinate point is marked black if it has a foreground pixel and white otherwise.
Referring to Figs. 6 and 7, after the preprocessed picture is projected in the two-axis orthogonal virtual coordinate system, a black region appears in the projection direction wherever a character text block occurs in the picture content, and a white region appears in the projection direction wherever a blank line, space, English letter, or digit (i.e., a "non-character text block") occurs.
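The projection rule of step S3 (black where a coordinate has any foreground pixel, white otherwise) can be sketched as below. The sketch assumes the picture is already a 0/1 grid with 1 as foreground and that the two axes are the pixel rows and columns; the function name `project` and its `axis` parameter are hypothetical.

```python
def project(binary, axis):
    """Project a 0/1 picture onto one axis.

    axis="rows" gives one flag per pixel row, axis="cols" one per pixel column.
    A flag is 1 (black) if that row or column holds any foreground pixel,
    otherwise 0 (white).
    """
    if axis == "rows":
        return [1 if any(row) else 0 for row in binary]
    if axis == "cols":
        return [1 if any(col) else 0 for col in zip(*binary)]
    raise ValueError("axis must be 'rows' or 'cols'")


picture = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
]
row_profile = project(picture, "rows")  # [0, 1, 0, 1]
col_profile = project(picture, "cols")  # [0, 1, 1, 0, 1]
```

The white entries in `row_profile` separate lines of text, and the white entries in `col_profile` separate blocks within a line.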
Step S4 is then executed to count the size and number of the text blocks in the picture, specifically: perform independent projection processing on the binarized data in the picture relative to the polar axis of the virtual coordinate system, record the width and height values of the character text blocks and non-character text blocks along the virtual coordinate system, count their respective numbers, and save the results to a server database. The server database includes a MySQL database or an Oracle database, and is preferably a MySQL database. Referring to Fig. 8, if a text block is a Chinese character, its projection width in the X-axis direction is usually greater than that of an English letter or digit, and its projection height in the Y-axis direction is greater than that of an English letter or digit. This allows the type of each text block in the picture to be judged and screened efficiently, with independent projection processing applied to the text blocks row by row or column by column.
In this embodiment, wider black regions are marked as character text block regions (i.e., the text block is Chinese), narrower black regions are non-character text block regions (i.e., the text block is English or a digit), and the remaining white regions are also non-character text block regions (i.e., regions without any Chinese characters).
With reference to Fig. 9, it should be noted that the present invention can project row by row along the X-axis either from top to bottom or from bottom to top. It can likewise project column by column along the Y-axis either from top to bottom or from bottom to top, in order to count the size and number of the text blocks in the picture.
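The counting in step S4 can be sketched by collapsing a single line's projection into runs and sorting the runs by width. The cutoff `min_text_width=20` is an assumption borrowed from the 20 px lower bound on text block width given in this embodiment; the source does not state the exact rule that separates wide from narrow black regions. The names `runs` and `count_blocks` are hypothetical.

```python
from itertools import groupby


def runs(profile):
    """Collapse a 0/1 projection into (value, start, length) runs."""
    out, pos = [], 0
    for value, group in groupby(profile):
        length = len(list(group))
        out.append((value, pos, length))
        pos += length
    return out


def count_blocks(profile, min_text_width=20):
    """Count wide black runs as character blocks and every other run
    (narrow black or white) as a non-character block."""
    text = non_text = 0
    for value, _, length in runs(profile):
        if value == 1 and length >= min_text_width:
            text += 1
        else:
            non_text += 1
    return text, non_text


# One wide black run (25 px), one narrow black run (10 px), three white gaps.
profile = [0] * 5 + [1] * 25 + [0] * 3 + [1] * 10 + [0] * 4
text_count, non_text_count = count_blocks(profile)
```

The run lengths double as the recorded widths, so the same pass could feed both the counts and the width values that are saved to the database.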
Referring to Fig. 9, when step S5 is executed, whether the picture is an advertisement picture is judged according to the set thresholds. The set thresholds in step S5 are specifically: the number T of character text blocks ranges from 50 to 300; the total area of the character text blocks as a percentage of the picture area ranges from 50 to 100; and the number of non-character text blocks ranges from 0 to 2T.
After all text blocks in the virtual coordinate system (both character text blocks and non-character text blocks) have been counted, whether the picture extracted from the email is an advertisement picture can be judged from the statistics. Specifically, in this embodiment the width of a text block ranges from 20 px to 40 px and its height from 35 px to 60 px.
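The threshold test of step S5 can be written as a small predicate. The three ranges come from the text; the assumption that all three must hold at the same time is an interpretation, since the source does not say how the conditions are combined. The function name `is_advertisement` and its parameters are hypothetical.

```python
def is_advertisement(text_blocks, text_area, picture_area, non_text_blocks):
    """Apply the step S5 thresholds.

    Assumption: the picture is an advertisement only if all three ranges hold.
    """
    if picture_area <= 0:
        raise ValueError("picture_area must be positive")
    t = text_blocks
    area_pct = 100.0 * text_area / picture_area
    return (
        50 <= t <= 300                    # number T of character text blocks
        and 50 <= area_pct <= 100         # share of picture area, in percent
        and 0 <= non_text_blocks <= 2 * t # number of non-character text blocks
    )


verdict = is_advertisement(
    text_blocks=120, text_area=60_000, picture_area=100_000, non_text_blocks=40
)
```

Here `verdict` is `True`: 120 blocks lie in the 50 to 300 range, the text covers 60% of the picture, and 40 is no more than 2T = 240.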
With the present invention, advertisement pictures contained in email can be recognized accurately, with a recognition rate of 99.99%, so that email containing such advertisement pictures is identified as spam. This recognition method can be applied in an anti-spam engine to improve the efficiency of recognizing, filtering, and intercepting spam.
The detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit its scope of protection. All equivalent embodiments or changes made without departing from the technical spirit of the present invention shall be included in its scope of protection.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments are therefore to be regarded in every respect as illustrative and not restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are intended to be included in the present invention. Any reference sign in the claims shall not be construed as limiting the claim concerned.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for the sake of clarity; those skilled in the art should consider the specification as a whole, and the technical solutions in the various embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (7)

1. A method for recognizing advertisement pictures in email, characterized in that the recognition method comprises the following steps:
S1, extracting a picture from an email, preprocessing it, and then determining the arrangement direction of its text blocks;
S2, establishing a virtual coordinate system according to the arrangement direction of the text blocks;
S3, calculating the binarized data of each text block of the picture in the virtual coordinate system;
S4, counting the size and number of the text blocks in the picture;
S5, judging whether the picture is an advertisement picture according to set thresholds;
the preprocessing in step S1 includes border processing, color inversion processing, background removal processing, binarization processing, and noise reduction processing;
step S4 is specifically: performing independent projection processing on the binarized data in the picture relative to the polar axis of the virtual coordinate system, recording the width and height values of the character text blocks and non-character text blocks along the virtual coordinate system, counting their respective numbers, and saving the results to a server database.
2. The recognition method according to claim 1, characterized in that step S2 is specifically: establishing a matching virtual coordinate system for the picture according to the continuity of the projection result of the picture content on the virtual coordinate system.
3. The recognition method according to claim 1, characterized in that step S3 is specifically: projecting each text block in the picture onto the polar axis of the virtual coordinate system; a coordinate point is marked black if it has a foreground pixel and white otherwise.
4. The recognition method according to claim 1, characterized in that the server database includes a MySQL database or an Oracle database.
5. The recognition method according to any one of claims 1 to 4, characterized in that the virtual coordinate system includes a one-axis virtual coordinate system and a two-axis virtual coordinate system.
6. The recognition method according to claim 5, characterized in that the two-axis virtual coordinate system includes a two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.
7. The recognition method according to claim 1, characterized in that the set thresholds in step S5 are specifically: the number T of character text blocks ranges from 50 to 300; the total area of the character text blocks as a percentage of the picture area ranges from 50 to 100; and the number of non-character text blocks ranges from 0 to 2T.
CN201510121822.XA 2015-03-19 2015-03-19 A kind of recognition methods to email advertisement picture Active CN104715248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510121822.XA CN104715248B (en) 2015-03-19 2015-03-19 A kind of recognition methods to email advertisement picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510121822.XA CN104715248B (en) 2015-03-19 2015-03-19 A kind of recognition methods to email advertisement picture

Publications (2)

Publication Number Publication Date
CN104715248A CN104715248A (en) 2015-06-17
CN104715248B true CN104715248B (en) 2018-10-23

Family

ID=53414559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510121822.XA Active CN104715248B (en) 2015-03-19 2015-03-19 A kind of recognition methods to email advertisement picture

Country Status (1)

Country Link
CN (1) CN104715248B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399161A (en) * 2018-03-06 2018-08-14 平安科技(深圳)有限公司 Advertising pictures identification method, electronic device and readable storage medium storing program for executing
CN111753675B (en) * 2020-06-08 2024-03-26 北京天空卫士网络安全技术有限公司 Picture type junk mail identification method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102542290A (en) * 2011-12-22 2012-07-04 国家计算机网络与信息安全管理中心 Junk mail image recognition method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8924484B2 (en) * 2002-07-16 2014-12-30 Sonicwall, Inc. Active e-mail filter with challenge-response

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"图像垃圾邮件中文本区域的自动提取方法";程红蓉等;《解放军理工大学学报(自然科学版)》;20090630;第10卷(第3期);论文第1.1-1.5节,图3 *

Also Published As

Publication number Publication date
CN104715248A (en) 2015-06-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 214000, science and software park, Binhu District, Jiangsu, Wuxi 6

Patentee after: Huayun data holding group Co.,Ltd.

Address before: 214000, science and software park, Binhu District, Jiangsu, Wuxi 6

Patentee before: WUXI CHINAC DATA TECHNICAL SERVICE Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20221109

Address after: Room 316, Government Affairs Service Center, No. 1, Renmin Road, Pingshang Town, Lingang Economic Development Zone, Linyi City, Shandong Province, 276000

Patentee after: Huayun Industrial Internet Co.,Ltd.

Address before: No. 6 Science and Education Software Park, Binhu District, Wuxi City, Jiangsu Province

Patentee before: Huayun data holding group Co.,Ltd.