Summary of the Invention
An object of the invention is to disclose a method for recognizing advertising pictures in e-mail that
improves the extraction of text from pictures containing text, so that spam containing advertising
pictures is identified effectively, while reducing the load on the server and improving the server's
resistance to interference when filtering spam.
To achieve the above object, the present invention provides a method for recognizing advertising
pictures in e-mail, comprising the following steps:
S1, extracting a picture from the mail, pre-processing it, and then determining the orientation of its text blocks;
S2, establishing a virtual coordinate system according to the text block orientation;
S3, calculating, for each text block in the picture, its binarized data in the virtual coordinate system;
S4, counting the size and number of the text blocks in the picture;
S5, judging whether the picture is an advertising picture according to set thresholds.
As a further improvement of the present invention, the pre-processing in step S1 includes border
processing, color inversion processing, background removal processing, binarization processing, and
noise reduction processing.
As a further improvement of the present invention, step S2 is specifically: establishing a matching
virtual coordinate system for the picture according to the continuity of the projection of the picture
content onto the virtual coordinate system.
As a further improvement of the present invention, step S3 is specifically: projecting each text block
in the picture onto the axes of the virtual coordinate system, and marking a coordinate point black if
it has a foreground pixel and white otherwise.
As a further improvement of the present invention, step S4 is specifically: independently projecting
the binarized data of the picture onto the axes of the virtual coordinate system, recording the width
and height of the written-text blocks and the non-written text blocks along the virtual coordinate
system, counting the number of each, and saving the results to a server database.
As a further improvement of the present invention, the server database includes a MySQL database or an Oracle database.
As a further improvement of the present invention, the virtual coordinate system includes a one-axis
virtual coordinate system and a two-axis virtual coordinate system.
As a further improvement of the present invention, the two-axis virtual coordinate system includes a
two-axis orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.
As a further improvement of the present invention, the set thresholds in step S5 are specifically: the
number T of written-text blocks ranges from 50 to 300; the total area of the written-text blocks, as a
percentage of the picture area, ranges from 50 to 100; and the number of non-written text blocks
ranges from 0 to 2T.
Compared with the prior art, the beneficial effects of the invention are as follows: by obtaining the
projection of the text blocks of a picture in a virtual coordinate system and calculating the binarized
data, and then counting the size and number of the text blocks in the picture and judging against set
thresholds whether the picture is an advertising picture, the invention significantly improves the
extraction of text from advertising pictures in spam, offers strong resistance to interference, and
reduces the load on the server.
Detailed Description of the Embodiments
The present invention is described in detail below with reference to the embodiments shown in the
accompanying drawings. It should be noted that these embodiments do not limit the invention;
equivalent transformations or substitutions in function, method, or structure made by those of
ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of
the present invention.
In this embodiment, a method for recognizing advertising pictures in e-mail comprises the following steps:
Step S1: a picture is extracted from the mail and pre-processed, and the orientation of its text blocks
is then determined. The pre-processing includes border processing, color inversion processing,
background removal processing, binarization processing, and noise reduction processing.
Border processing judges whether the picture has a border; if it does, the outer and/or inner border
of the picture is removed by cropping. Color inversion processing calculates the foreground and/or
background color of the picture. Background removal processing calculates the background color of the
picture and removes it; for a picture that has undergone color inversion processing, the foreground
and background colors are also swapped. If the picture contains background interference such as
scenery or people, this interference is removed according to the overall style or the pixel color
value distribution of the picture extracted from the mail in step S1. Binarization processing
binarizes the whole picture extracted from the mail in step S1 using an error compensation algorithm,
according to the configuration of the computer. A binarized picture file is very small, which makes it
easier for the computer to judge later whether the picture is an advertising picture. Noise reduction
processing applies a double-background filtering method to the extracted picture, so as to reduce the
adverse effect of noise in the picture on the later recognition of advertising pictures.
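The pre-processing chain described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the picture is taken to be a grayscale grid (a list of rows of 0 to 255 values), a fixed global threshold stands in for the error compensation algorithm named in the text, background removal and noise reduction are left out, and all function names are hypothetical.

```python
def binarize(gray, threshold=128):
    """Global threshold (stand-in for the error compensation algorithm):
    1 marks a dark pixel, 0 marks a light pixel."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def crop_outer_border(binary):
    """Border processing: strip outer rows and columns that consist
    entirely of 1-pixels, i.e. a solid frame around the picture."""
    rows = [row[:] for row in binary]
    while rows and all(rows[0]):
        rows.pop(0)
    while rows and all(rows[-1]):
        rows.pop()
    while rows and rows[0] and all(row[0] for row in rows):
        rows = [row[1:] for row in rows]
    while rows and rows[0] and all(row[-1] for row in rows):
        rows = [row[:-1] for row in rows]
    return rows


def invert_if_needed(binary):
    """Color inversion: if more than half of the pixels are 1, assume light
    text on a dark background and swap foreground and background."""
    total = sum(len(row) for row in binary)
    ones = sum(sum(row) for row in binary)
    if ones * 2 > total:
        return [[1 - px for px in row] for row in binary]
    return binary


def preprocess(gray):
    """Assumed order for this sketch: binarize, crop the frame, then invert."""
    return invert_if_needed(crop_outer_border(binarize(gray)))
```

The result is a grid in which 1 marks a foreground (text) pixel, which is the form the later projection steps operate on in this sketch.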
Referring to Fig. 2 through Fig. 5: the picture in Fig. 2, after pre-processing that includes color
inversion processing, yields the pre-processing result shown in Fig. 3; the picture in Fig. 4, after
pre-processing that includes border processing, yields the pre-processing result shown in Fig. 5.
Step S2: a virtual coordinate system is established according to the text block orientation.
To determine the size and number of the text blocks in a picture, the orientation of the text blocks
contained in the picture content must first be determined. For example, the text blocks in Fig. 2 are
arranged horizontally, and those in Fig. 4 are arranged vertically.
Referring to Fig. 6, step S2 is specifically: a matching virtual coordinate system is established for
the picture according to the continuity of the projection of the picture content onto the virtual
coordinate system. The virtual coordinate system includes a one-axis virtual coordinate system and a
two-axis virtual coordinate system; the two-axis virtual coordinate system includes a two-axis
orthogonal virtual coordinate system and a two-axis non-orthogonal virtual coordinate system.
Specifically, if the text in the advertising picture appears as a single horizontal line or a single
vertical column, only a one-axis virtual coordinate system (horizontal) or a one-axis virtual
coordinate system (vertical) is established, according to the text block orientation in the picture.
If the text in the advertising picture appears as multiple horizontal lines or multiple vertical
columns, a two-axis orthogonal virtual coordinate system is established, with the horizontal axis
defined as the X axis and the vertical axis defined as the Y axis.
If the text in the picture is arranged obliquely, the virtual coordinate system must be established
after rotating the text image. This is implemented by the following steps.
Step S11: Coordinate axes are established along the natural width and height directions of the
picture, with the vertical direction labeled the X axis and the horizontal direction labeled the Y
axis. The extreme high point and extreme low point of the picture along the X axis, and the extreme
far point and extreme near point along the Y axis, are calculated, where:
the extreme high point is the point with the largest value in the X direction;
the extreme low point is the point with the smallest value in the X direction;
the extreme far point is the point with the largest value in the Y direction;
the extreme near point is the point with the smallest value in the Y direction.
Step S12: An extreme-value deviation tdev = 20 px is set, and the high point set, low point set, far
point set, and near point set are calculated as follows:
points in the picture whose distance from the extreme high point in the X direction is less than or equal to tdev are recorded as the high point set h;
points in the picture whose distance from the extreme low point in the X direction is greater than or equal to tdev are recorded as the low point set l;
points in the picture whose distance from the extreme far point in the Y direction is less than or equal to tdev are recorded as the far point set f;
points in the picture whose distance from the extreme near point in the Y direction is greater than or equal to tdev are recorded as the near point set n.
Step S13: The widths of the high point set and the low point set are calculated and recorded as hw and
lw respectively. The heights of the far point set and the near point set are calculated and recorded
as fh and nh respectively.
Step S14: Judge whether the text content of the picture is one-axis orthogonal. The one-axis
orthogonality thresholds are set to v11 = 20 and v12 = 80, and the judgment is made as follows:
if hw and lw are both less than or equal to v11, and fh or nh is greater than or equal to v12, the picture is judged to be one-axis orthogonal;
if fh and nh are both less than or equal to v11, and hw or lw is greater than or equal to v12, the picture is judged to be one-axis orthogonal.
If the picture is one-axis orthogonal, it can be used directly and no further processing is needed; otherwise, proceed to the next step.
Step S15: Judge whether the text content of the picture is two-axis orthogonal. The two-axis
orthogonality threshold is set to v2 = 80, and the judgment is made as follows:
if hw or lw is greater than or equal to v2, the picture is judged to be two-axis orthogonal;
if fh or nh is greater than or equal to v2, the picture is judged to be two-axis orthogonal.
If the picture is two-axis orthogonal, no further processing is needed; otherwise, proceed to the next step.
Step S16: Calculate the tilt angle of the two-axis non-orthogonal text content of the picture: the
extreme high point and the extreme far point are used to calculate the tilt angle of the text content.
Step S17: The picture is rotated according to the tilt angle so that it becomes two-axis orthogonal.
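Steps S11 to S17 can be sketched roughly as follows. This is a minimal Python sketch under several assumptions, not the implementation of the embodiment: the picture is represented as a list of (x, y) foreground coordinates, the "width" and "height" of a point set are read as its extent along the Y axis and the X axis respectively, and all function names are hypothetical. The sketch also takes every one of the four sets to be the points lying within tdev of the corresponding extreme, which departs from the literal "greater than or equal to" wording given for the low and near point sets, and it computes the tilt angle from the line joining the extreme high point and the extreme far point, a formula the text does not spell out.

```python
import math

TDEV = 20                  # extreme-value deviation from step S12, in pixels
V11, V12, V2 = 20, 80, 80  # decision thresholds from steps S14 and S15


def extent(values):
    """Span of a list of coordinates (0 for an empty list)."""
    return max(values) - min(values) if values else 0


def classify_layout(points, tdev=TDEV):
    """Classify foreground points (x, y); X is vertical, Y is horizontal."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_max, x_min, y_max, y_min = max(xs), min(xs), max(ys), min(ys)

    # Assumption: each set holds the points within tdev of its extreme.
    h = [p for p in points if x_max - p[0] <= tdev]
    l = [p for p in points if p[0] - x_min <= tdev]
    f = [p for p in points if y_max - p[1] <= tdev]
    n = [p for p in points if p[1] - y_min <= tdev]

    # Assumption: "width" is the extent along Y, "height" the extent along X.
    hw, lw = extent([y for _, y in h]), extent([y for _, y in l])
    fh, nh = extent([x for x, _ in f]), extent([x for x, _ in n])

    # Step S14: one-axis orthogonal test.
    if (hw <= V11 and lw <= V11 and (fh >= V12 or nh >= V12)) or \
       (fh <= V11 and nh <= V11 and (hw >= V12 or lw >= V12)):
        return "one-axis orthogonal"
    # Step S15: two-axis orthogonal test.
    if hw >= V2 or lw >= V2 or fh >= V2 or nh >= V2:
        return "two-axis orthogonal"
    return "non-orthogonal"


def tilt_angle_degrees(high_point, far_point):
    """Step S16 (assumed formula): angle, measured from the Y axis, of the
    line joining the extreme high point and the extreme far point."""
    drop = high_point[0] - far_point[0]
    run = far_point[1] - high_point[1]
    return math.degrees(math.atan2(drop, run))


def rotate_points(points, angle_deg):
    """Step S17: rotate points about the origin by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(x * c - y * s, x * s + y * c) for x, y in points]
```

Under these assumptions, calling rotate_points(points, -tilt_angle_degrees(high, far)) brings the extreme high point and the extreme far point to the same X value, so the edge joining them becomes parallel to the Y axis.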
Step S3: the binarized data of each text block of the picture in the virtual coordinate system is
calculated. Specifically, each text block in the picture is projected onto the axes of the virtual
coordinate system, and a coordinate point is marked black if it has a foreground pixel and white
otherwise.
Referring to Fig. 6 and Fig. 7, after the pre-processed picture is projected in the two-axis
orthogonal virtual coordinate system, a black region appears in the projection direction wherever a
written-text block occurs in the picture content, and a white region appears in the projection
direction wherever the picture content contains a blank line, a space, English letters, or digits
(i.e. a "non-written text block").
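The projection in step S3 can be sketched in a few lines of Python. The sketch assumes the pre-processed picture is a grid of 0/1 values with 1 marking a foreground pixel; the function name and the "x"/"y" axis argument are hypothetical. A position on the axis is marked 1 (black) if any foreground pixel projects onto it, and 0 (white) otherwise.

```python
def project(binary, axis):
    """Project a 0/1 grid onto one axis of the coordinate system.

    axis == "x": one entry per column (horizontal axis).
    axis == "y": one entry per row (vertical axis).
    An entry is 1 (black) if the column or row contains a foreground pixel.
    """
    if axis == "y":
        return [1 if any(row) else 0 for row in binary]
    if axis == "x":
        return [1 if any(col) else 0 for col in zip(*binary)]
    raise ValueError("axis must be 'x' or 'y'")
```

For a one-axis virtual coordinate system only one of the two projections would be needed; for a two-axis orthogonal system both would be computed.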
Step S4 is then executed: the size and number of the text blocks in the picture are counted.
Specifically, the binarized data of the picture is independently projected onto the axes of the
virtual coordinate system, the width and height of the written-text blocks and the non-written text
blocks along the virtual coordinate system are recorded, the number of each is counted, and the
results are saved to a server database. The server database includes a MySQL database or an Oracle
database, and is more preferably a MySQL database. Referring to Fig. 8, if a text block is a Chinese
character, its projected width in the X direction is typically greater than that of an English letter
or a digit, and its projected height in the Y direction is greater than that of an English letter or
a digit. This allows the type of each text block in the picture to be judged and screened
efficiently, with independent projection applied to the text blocks row by row or column by column.
In this embodiment, a wider black region is marked as the region of a written-text block (i.e. the
text block is Chinese), a narrower black region is the region of a non-written text block (i.e. the
text block is English or a digit), and the remaining white regions are also regions of non-written
text blocks (i.e. regions without any Chinese character).
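The region labeling described above can be sketched as run-length segmentation of a projection profile. This is a minimal Python sketch under assumptions: the profile is a list of 0/1 values (1 for black), the function names are hypothetical, and the cutoff min_written_width = 20 that separates "wider" from "narrower" black regions is an illustrative value, not one the text assigns to this test.

```python
def runs(profile):
    """Split a 0/1 profile into maximal runs of (value, start, length)."""
    out = []
    start = 0
    for i in range(1, len(profile) + 1):
        if i == len(profile) or profile[i] != profile[start]:
            out.append((profile[start], start, i - start))
            start = i
    return out


def classify_runs(profile, min_written_width=20):
    """Label each run: a wide black run is a written-text block; a narrow
    black run or any white run is a non-written text block."""
    labels = []
    for value, _start, length in runs(profile):
        if value == 1 and length >= min_written_width:
            labels.append("written")
        else:
            labels.append("non-written")
    return labels


def count_blocks(profile, min_written_width=20):
    """Return (number of written-text blocks, number of non-written blocks)."""
    labels = classify_runs(profile, min_written_width)
    return labels.count("written"), labels.count("non-written")
```

The run lengths returned by runs() also give the widths or heights that step S4 records for each block.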
With reference to Fig. 9, it should be noted that the present invention may project row by row along
the X axis either from top to bottom or from bottom to top; likewise, it may project column by column
along the Y axis either from top to bottom or from bottom to top, so as to count the size and number
of the text blocks in the picture.
Referring to Fig. 9, in step S5 whether the picture is an advertising picture is judged according to
set thresholds. The set thresholds in step S5 are specifically: the number T of written-text blocks
ranges from 50 to 300; the total area of the written-text blocks, as a percentage of the picture
area, ranges from 50 to 100; and the number of non-written text blocks ranges from 0 to 2T.
After all text blocks in the virtual coordinate system (both written-text blocks and non-written text
blocks) have been counted, whether the picture extracted from the mail is an advertising picture can
be judged from the statistics. Specifically, in this embodiment, the width of a text block ranges
from 20 px to 40 px, and the height of a text block ranges from 35 px to 60 px.
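The threshold test of step S5 can be written as a short Python predicate. This is a sketch only: the function and parameter names are hypothetical, and it assumes that all three stated ranges must hold at the same time, which the text implies but does not state explicitly.

```python
def is_advertising_picture(written_count, written_area, picture_area,
                           non_written_count):
    """Apply the step S5 thresholds.

    written_count      number T of written-text blocks (50 to 300)
    written_area       total area of the written-text blocks, in pixels
    picture_area       area of the whole picture, in pixels
    non_written_count  number of non-written text blocks (0 to 2T)
    """
    if picture_area <= 0:
        return False
    area_percent = 100.0 * written_area / picture_area
    return (50 <= written_count <= 300
            and 50 <= area_percent <= 100
            and 0 <= non_written_count <= 2 * written_count)
```

For example, a picture with 120 written-text blocks covering 60 percent of its area and 100 non-written text blocks satisfies all three ranges, while the same picture with 300 non-written text blocks exceeds the 2T limit of 240.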
With the present invention, advertising pictures contained in mail can be recognized accurately, with
a recognition rate reaching 99.99%, so that mail containing such advertising pictures is identified
as spam. The recognition method can be applied in an anti-spam engine to improve the efficiency of
identifying, filtering, and intercepting spam.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of
the present invention and are not intended to limit its scope of protection; all equivalent
embodiments or modifications made without departing from the technical spirit of the present
invention shall be included within its scope of protection.
It is obvious to a person skilled in the art that the invention is not limited to the details of the
above exemplary embodiments, and that the invention can be realized in other specific forms without
departing from its spirit or essential attributes. The embodiments are therefore to be considered in
all respects as illustrative and not restrictive, and the scope of the invention is defined by the
appended claims rather than by the above description; all variations falling within the meaning and
range of equivalents of the claims are intended to be included within the invention. No reference
sign in the claims should be construed as limiting the claim concerned.
In addition, it should be understood that although this specification is described in terms of
embodiments, not every embodiment contains only one independent technical solution. The
specification is written this way merely for the sake of clarity, and those skilled in the art
should consider the specification as a whole; the technical solutions in the various embodiments
may also be suitably combined to form other embodiments that those skilled in the art can
understand.