Method for removing interference information from mail, and method for determining spam
Technical field
The present invention relates to the field of anti-spam technology, and more particularly to a method for removing interference information from mail and to a spam determination method based on that method.
Background
With the development of the Internet, spam does increasing harm to users. Spam typically comprises promotional mail and mail carrying pornographic or other harmful content. A variety of spam identification and filtering methods, and filtering mechanisms on back-end servers, have therefore appeared in the prior art.
Mainstream anti-spam methods currently include the following. (1) Optical character recognition (OCR): the content of an advertising image or of plain text is extracted and examined to decide whether it is advertising, and the mail is identified as spam accordingly; this technique, however, imposes a large computational overhead. (2) Mail detection based on MD5 checksums: a hash operation converts a character string of arbitrary length into a shorter value of fixed length. Because two different strings in practice almost never share an MD5 value, two strings can be compared by comparing their MD5 values. This technique, however, only matches mail whose content is strictly identical: any change to the content changes the MD5 value, which seriously impairs both the determination of whether the mail is spam and the subsequent filtering and interception.
Moreover, prior-art anti-spam techniques directly scan mail for preset text or images, so the same inspection or filtering must also be performed on normally sent mail, which increases the computational overhead of the back-end server or web search engine. It is therefore highly desirable to provide a method of preprocessing mail that may be identified as spam, so as to avoid blindly applying spam determination, interception and deletion to all mail, and to improve the efficiency with which spam is intercepted.
In addition, if a spammer inserts interference characters into the spam or rearranges the way its content is displayed, existing anti-spam systems have difficulty recognizing the mail as spam, which greatly reduces interception efficiency.
In view of this, the prior-art methods of preprocessing the interference information contained in spam need to be improved in order to solve the above problems.
Summary of the invention
One object of the invention is to disclose a method for removing interference information from mail, so as to avoid blindly applying spam determination, interception and deletion to all mail, and to improve the efficiency with which spam is intercepted. Another object of the invention is to disclose a spam determination method that improves the efficiency with which mail containing interference information is determined to be spam, and thereby improves the interception and filtering of spam.
To achieve the first object, the invention provides a method for removing interference information from mail, comprising:
S1, obtaining the html content contained in the mail;
S2, building a Document Object Model (DOM) from the html content, applying at least one of the following interference identification processes to the DOM, and then converting the html content into text information, the interference identification processes comprising: color block interference identification, font size interference identification, and table interference identification;
S3, recombining the content of the processed text information.
As a further improvement of the invention, the interference identification processing in step S2 further comprises sensitive-word interference identification.
As a further improvement of the invention, converting the html content into text information in step S2 specifically comprises: deleting the tags from the html content so as to extract the text information it contains.
As a further improvement of the invention, the sensitive-word interference identification comprises: conversion between uppercase and lowercase letters, conversion between standard and non-standard characters, and conversion between letters and digits.
As a further improvement of the invention, the conversion between uppercase and lowercase letters specifically comprises: checking the text information character by character and, where the ASCII code value of a character lies in [65, 90], increasing that value by 32.
As a further improvement of the invention, the conversion between standard and non-standard characters specifically comprises: checking the text information character by character and replacing each non-standard character it contains with the corresponding standard character according to the Unicode code table.
As a further improvement of the invention, the conversion between letters and digits specifically comprises: checking the text information character by character and substituting between digits and letters according to the ASCII code table.
As a further improvement of the invention, building the Document Object Model from the html content in step S2 comprises the following steps:
S21, taking the html content as input and parsing it into a plurality of tokens;
S22, constructing a DOM tree model from the tokens, the DOM tree model comprising a number of tag nodes, each of which carries the attribute information that belongs to it;
S23, traversing the DOM tree model and extracting the segments of the html content that match the attribute information of the tag nodes;
S24, extracting the text information in each segment and selecting the matching order of arrangement according to the attribute information of the tag nodes, so as to form continuous text information.
As a further improvement of the invention, step S21 specifically comprises: traversing and parsing the html content according to preset tokenization rules, the tokens then being recognized by a token generator and passed to the DOM tree model builder.
As a further improvement of the invention, the tokenization rules cover the start-position marker of the html content, the end-position marker of the html content, category attributes, attribute names and attribute values.
As a further improvement of the invention, the category attributes comprise a font size attribute, a font slant attribute, a horizontal font arrangement difference attribute, a vertical font arrangement difference attribute, an RGB difference attribute, a gray value difference attribute, a color saturation attribute and a contrast difference attribute.
As a further improvement of the invention, step S3 comprises the following steps:
S31, converting the encoding of the processed text information using an encoding converter;
S32, dividing the text, according to code sections of set length, into the header information, middle information and tail information of the recombined text information;
S33, arranging the header information, middle information and tail information in sequence to form continuous recombined text information.
As a further improvement of the invention, after step S33 the method further comprises performing one or more of the following operations on the recombined text information:
removing space characters;
removing carriage return characters;
removing line feed characters; wherein
removing space characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 32;
removing carriage return characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 13;
removing line feed characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 10.
As a further improvement of the invention, before step S1 is performed the method further comprises a step of converting the encoding of the html content to Unicode using an encoder.
To achieve the other object, the invention also provides a spam determination method, comprising the method for removing interference information from mail according to any of the above; and
comparing the recombined text information with a keyword library stored in a database, and determining whether the mail is spam.
As a further improvement of the invention, the database is an Oracle database, a DB2 database, a PostgreSQL database, a Microsoft SQL Server database, a Microsoft Access database or a MySQL database.
Compared with the prior art, the beneficial effects of the invention are as follows: the interference information contained in spam can be effectively separated from the text information, and the various kinds of interference information deliberately embedded in spam can be accurately identified. This provides an accurate basis for the subsequent determination of whether a mail is spam, and effectively improves the interception and filtering of spam.
Brief description of the drawings
Fig. 1 is a flowchart of the method for removing interference information from mail according to the invention;
Fig. 2 is a schematic diagram of html content containing interference such as English letters and digits, before processing;
Fig. 3 is a schematic diagram of the html content of Fig. 2 after sensitive-word interference identification has been applied to the letter and digit interference;
Fig. 4 is a schematic diagram of html content containing interference from fonts of different sizes, before processing;
Fig. 5 is the DOM tree model used for font size interference identification after the Document Object Model has been built from the html content in step S22;
Fig. 6 is a schematic diagram of html content containing interference from blocks of different colors, before processing;
Fig. 7 is the DOM tree model used for color block interference identification after the Document Object Model has been built from the html content in step S22;
Fig. 8 is a schematic diagram of html content containing table interference, before processing;
Fig. 9 is the DOM tree model used for table interference identification after the Document Object Model has been built from the html content in step S22;
Fig. 10 is a schematic diagram of the processed text information obtained in Fig. 5, Fig. 7 or Fig. 9 before content recombination;
Fig. 11 is a schematic diagram of the processed text information obtained in Fig. 5, Fig. 7 or Fig. 9 after content recombination;
Fig. 12 is a schematic diagram of text information containing meaningless-character interference before content recombination;
Fig. 13 is a schematic diagram of the text information of Fig. 12 after content recombination.
Detailed description of the embodiments
The invention is described in detail below with reference to the embodiments shown in the drawings. It should be noted that these embodiments do not limit the invention; functional, methodological or structural equivalents or substitutions made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the invention.
When prior-art spam interception systems or software determine whether a target mail is spam, they cannot remove the interference information that the spammer has added to the mail. This significantly impairs spam filtering, and some non-spam mail may even be identified as spam. The application scenarios and implementation processes shown in this detailed description merely illustrate the summary of the invention and therefore do not limit the scope of protection or the gist of the invention.
Referring to Fig. 1, in this embodiment the method for removing interference information from mail comprises the following steps: S1, obtaining the html content contained in the mail; S2, building a Document Object Model from the html content, applying at least one of the following interference identification processes to the Document Object Model, and then converting the html content into text information, the interference identification processes comprising color block interference identification, font size interference identification and table interference identification; S3, recombining the content of the processed text information.
If the mail contains HTML, the HTML content is obtained and converted into text information. The conversion deletes the tags from the HTML content, that is, it deletes html tag information such as "<tag name attribute name=attribute value></tag name>"; what remains is the mail content. If the mail contains no HTML, the plain-text part of the mail is used as the mail content.
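The tag-deletion conversion described above can be sketched with the HTML parser from the Python standard library. This is a minimal illustration rather than the implementation of the embodiment; the names `TextExtractor` and `mail_body_text` are hypothetical, and the mail is assumed to have already been split into an HTML part and a plain-text part.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are ignored.
        self.parts.append(data)


def mail_body_text(html_part, plain_part):
    """Return the tag-free text of the HTML part, else the plain-text part."""
    if html_part:
        extractor = TextExtractor()
        extractor.feed(html_part)
        extractor.close()
        return "".join(extractor.parts)
    return plain_part or ""
```

Calling `mail_body_text('<p style="color:red">Hello <b>world</b></p>', None)` yields the text with all tag information removed, while a mail without an HTML part falls back to its plain text.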
The kinds of interference that commonly occur in spam include: confusion between the English letter "I" and the Arabic numeral "1"; confusion between the English letter "O" (uppercase or lowercase) and the Arabic numeral "0"; background color interference; and interference through the arrangement of the text. If the English letter "I" and the Arabic numeral "1" have been deliberately substituted for one another in a mail and mixed into continuous Chinese or English text in order to evade interception by anti-spam measures, the mail is very likely spam. To improve the interception and filtering of such spam by existing anti-spam systems or software, this interference must be removed and the text that the mail actually conveys must be extracted; a prior-art anti-spam system or anti-spam software can then intercept and filter the mail and prevent the spam from being delivered to the server.
In this embodiment, in addition to the three kinds of interference processing above, the interference identification processing in step S2 further comprises sensitive-word interference identification. Converting the html content into text information in step S2 specifically comprises: deleting the tags from the html content so as to extract the text information it contains.
Referring to Fig. 2 and Fig. 3, in this embodiment the sensitive-word interference identification comprises: conversion between uppercase and lowercase letters, conversion between standard and non-standard characters, and conversion between letters and digits.
The conversion between uppercase and lowercase letters specifically comprises: checking the text information character by character and, where the ASCII code value of a character lies in [65, 90], increasing that value by 32 so that it falls in [97, 122]. This completes the conversion of English letters from uppercase to lowercase.
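The case conversion above amounts to a shift of 32 within the ASCII table. A minimal sketch follows; the function name `to_lowercase_ascii` is hypothetical, and characters outside [65, 90] are assumed to pass through unchanged.

```python
def to_lowercase_ascii(text):
    """Shift ASCII capitals (65..90) to lowercase (97..122) by adding 32."""
    out = []
    for ch in text:
        code = ord(ch)
        if 65 <= code <= 90:
            code += 32  # 'A' (65) becomes 'a' (97), ..., 'Z' (90) becomes 'z' (122)
        out.append(chr(code))
    return "".join(out)
```

Only the 26 ASCII capitals are touched, so Chinese characters, digits and punctuation in the mail text are left as they are.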
The conversion between standard and non-standard characters specifically comprises: checking the text information character by character and replacing each non-standard character it contains with the corresponding standard character according to the Unicode code table. For example, if the input text contains the code point 0x2776, i.e. the character "❶", it is converted to ASCII code 49, i.e. the digit 1.
In this embodiment, what counts as a non-standard character is determined by the meaning that the person adding the interference actually intends to express. For example, for the Arabic numeral "1" the non-standard characters include, but are not limited to, "(1)", "(1)", "1.", "Ⅰ", "ⅰ" and "1".
The conversion between letters and digits specifically comprises: checking the text information character by character and substituting between letters and digits according to the ASCII code table.
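The two conversions just described can be sketched as a single lookup table. The table contents below are illustrative assumptions: only the code point 0x2776 and the letters "i" and "o" come from the text, and the other variant glyphs are examples of what such a table might hold. The names `NONSTANDARD_TO_STANDARD`, `LETTER_TO_DIGIT` and `normalize_sensitive_text` are hypothetical.

```python
# Hypothetical table of variant glyphs that can stand in for the digit 1.
NONSTANDARD_TO_STANDARD = {
    "\u2776": "1",  # dingbat negative circled digit one (0x2776 in the text)
    "\u2460": "1",  # circled digit one
    "\u2474": "1",  # parenthesized digit one
    "\u2488": "1",  # digit one full stop
    "\u2160": "1",  # Roman numeral one
    "\u2170": "1",  # small Roman numeral one
    "\uff11": "1",  # full-width digit one
}

# Letter-to-digit substitutions: ASCII 105 (i) -> 49 (1), 111 (o) -> 48 (0).
LETTER_TO_DIGIT = {"i": "1", "o": "0"}

_TABLE = str.maketrans({**NONSTANDARD_TO_STANDARD, **LETTER_TO_DIGIT})


def normalize_sensitive_text(text):
    """Replace variant glyphs and look-alike letters character by character."""
    return text.translate(_TABLE)
```

A blanket replacement of "i" and "o" would also alter ordinary words, so a practical filter would presumably apply the letter mapping only where the neighboring characters are digits; that refinement is left out of this sketch.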
For example, the ASCII values of the letters to be converted are 105 (i) and 111 (o), and the ASCII values of the corresponding target digits are 49 (1) and 48 (0) respectively. The correspondence can thus be determined by looking up the values in the ASCII code table, and the letter "i" is replaced with the Arabic numeral "1" and the letter "o" with the Arabic numeral "0".
In this embodiment, building the Document Object Model from the html content in step S2 comprises the following steps:
S21, taking the html content as input and parsing it into a plurality of tokens. Step S21 specifically comprises: traversing and parsing the html content according to preset tokenization rules, the tokens then being recognized by a token generator and passed to the DOM tree model builder.
The tokenization rules cover the start-position marker of the html content, the end-position marker of the html content, category attributes, attribute names and attribute values.
The category attributes comprise a font size attribute, a font slant attribute, a horizontal font arrangement difference attribute, a vertical font arrangement difference attribute, an RGB difference attribute, a gray value difference attribute, a color saturation attribute and a contrast difference attribute.
S22, constructing a DOM tree model from the tokens, the DOM tree model comprising a number of tag nodes, each of which carries the attribute information that belongs to it;
S23, traversing the DOM tree model and extracting the segments of the html content that match the attribute information of the tag nodes;
S24, extracting the text information in each segment and selecting the matching order of arrangement according to the attribute information of the tag nodes, so as to form continuous text information.
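Steps S21 to S23 can be sketched as follows, using the standard-library HTML parser as the token generator. This is an illustration under stated assumptions, not the builder of the embodiment: the classes `Node` and `TreeBuilder` and the functions `build_dom` and `segments` are hypothetical names, text is stored in `#text` child nodes, and a child simply overrides the attributes it shares with its ancestors.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One tag node: tag name, attribute information, children and text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """S21/S22: receive tokens from the parser and assemble them into a tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            leaf = Node("#text", parent=self.current)
            leaf.text = data
            self.current.children.append(leaf)


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def segments(node, inherited=None):
    """S23: yield (text, attribute information) pairs in document order."""
    attrs = dict(inherited or {})
    attrs.update(node.attrs)
    if node.tag == "#text":
        yield node.text, attrs
    for child in node.children:
        yield from segments(child, attrs)
```

Each yielded pair corresponds to one segment together with the attribute information of the tag node that encloses it, which is the input that step S24 would need in order to choose an arrangement.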
Preferably, in this embodiment, step S3 comprises the following steps:
S31, converting the encoding of the processed text information using an encoding converter;
S32, dividing the text, according to code sections of set length, into the header information, middle information and tail information of the recombined text information;
S33, arranging the header information, middle information and tail information in sequence to form continuous recombined text information.
After step S33, the method further comprises performing one or more of the following operations on the recombined text information:
removing space characters;
removing carriage return characters;
removing line feed characters; wherein
removing space characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 32;
removing carriage return characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 13;
removing line feed characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 10.
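The three removal operations can be sketched with a translation table keyed by code value. The function name `strip_marks` and its keyword flags are hypothetical; the flags reflect the statement that one or more of the operations may be applied.

```python
def strip_marks(text, space=True, carriage_return=True, line_feed=True):
    """Delete characters with code 32, 13 and/or 10 from the text."""
    table = {}
    if space:
        table[32] = None  # space
    if carriage_return:
        table[13] = None  # carriage return (CR)
    if line_feed:
        table[10] = None  # line feed (LF)
    # str.translate deletes every character whose code maps to None.
    return text.translate(table)
```

For instance, `strip_marks("a b\r\nc")` removes all three kinds of character, whereas passing `space=False` keeps the spaces and removes only the line breaks.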
Before step S1 is performed, the method further comprises a step of converting the encoding of the html content to Unicode using an encoder.
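A minimal sketch of this encoding conversion follows, assuming the mail body arrives as raw bytes together with an optional declared charset. The function name `to_unicode` and the fallback order (UTF-8, then GB18030, then Latin-1) are assumptions made for illustration and are not stated in the text.

```python
def to_unicode(raw, declared_charset=None):
    """Decode mail bytes to a Unicode string, trying likely charsets in turn."""
    for name in (declared_charset, "utf-8", "gb18030"):
        if not name:
            continue
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes invalid in it: try the next one.
            continue
    return raw.decode("latin-1")  # maps every byte, so it cannot fail
```

Once the content is a Unicode string, the later steps can compare code values directly, which is what the code sections used during recombination rely on.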
Next, with reference to Fig. 4 to Fig. 9, the interference removal operations for three typical kinds of interference occurring in html content are described in detail.
Referring to Fig. 4 and Fig. 5, this case shows the process of extracting the text information from html content when the tokenization rule is based on the font size attribute.
As shown in Fig. 4, two rows of text appear. For simplicity, this specification uses "A, B, C, D, E, F, G" to stand for Chinese characters and "XX" to stand for Arabic numerals. The first row reads "ABCDF hair" and the second row reads "ticket Q1980021XX". The text "ABCD" and "EFG" is set in a smaller font (attribute font-size:6px), while "hair" and "ticket Q1980021XX" are set in a larger font (attribute font-size:20px).
As shown in Fig. 5, in this embodiment parsing the HTML comprises three steps: tokenization, tree construction, and extraction of the required information. The code for this process is as follows: [listing not reproduced]
Next, the DOM tree model is traversed and the segments of the html content that match the attribute information of the tag nodes are extracted. Four segments are obtained: "ABCD", "hair", "EFG" and "ticket Q1980021XX"; see Fig. 10.
Finally, step S24 is performed: the text information in each segment is extracted, and the matching order of arrangement is selected according to the attribute information of the tag nodes so as to form continuous text information; see Fig. 11.
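The font size case can be sketched as follows. The markup in `FIG4_HTML` is a hypothetical reconstruction of Fig. 4 from the attributes quoted above, and the rule used to choose among the segments (keep text of at least `min_px` pixels) is an assumption, since the text does not state how step S24 selects the arrangement. All names are illustrative.

```python
import re
from html.parser import HTMLParser

FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", re.IGNORECASE)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Hypothetical markup reproducing the two rows described for Fig. 4.
FIG4_HTML = (
    '<span style="font-size:6px">ABCD</span>'
    '<span style="font-size:20px">hair</span><br>'
    '<span style="font-size:6px">EFG</span>'
    '<span style="font-size:20px">ticket Q1980021XX</span>'
)


class FontSizeSegments(HTMLParser):
    """Record each text run together with the font size in effect."""

    def __init__(self, default_px=16.0):
        super().__init__(convert_charrefs=True)
        self.stack = [("#root", default_px)]  # open tags and their font size
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        match = FONT_SIZE.search(style)
        # Inherit the enclosing size when the tag declares none.
        size = float(match.group(1)) if match else self.stack[-1][1]
        self.stack.append((tag, size))

    def handle_endtag(self, tag):
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.segments.append((data.strip(), self.stack[-1][1]))


def font_size_segments(html):
    parser = FontSizeSegments()
    parser.feed(html)
    parser.close()
    return parser.segments


def visible_text(html, min_px=12.0):
    """Assumed selection rule: join the segments whose font is large enough."""
    return "".join(t for t, size in font_size_segments(html) if size >= min_px)
```

On `FIG4_HTML` the traversal yields the same four segments as in the text, and the assumed size rule then joins the two 20px segments and discards the 6px filler.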
Referring to Fig. 6 and Fig. 7, this case shows the process of extracting the text information from html content when the tokenization rule is based on the RGB difference attribute.
For this kind of interference, step S23 specifically comprises: traversing the DOM tree model and extracting the text information. The final attributes of each tag node are generated according to the rules of Cascading Style Sheets (CSS), and the text information of the tag nodes in the html content is extracted in order according to the value of the color attribute (default RGB: 000000) and the value of the background-color attribute (default RGB: FFFFFF) of each tag node. A decision rule determines whether a given piece of text is to be extracted: the Euclidean distance d12 between the foreground color and the background color is calculated, where x1, y1, z1 are the RGB values of the foreground color and x2, y2, z2 are the RGB values of the background color, and d12 is compared with a preset color threshold V. The distance is given by formula (1):
$$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2} \tag{1}$$
If d12 < V, the text information in the html content is extracted; otherwise it is not extracted.
Those skilled in the art will of course appreciate that the text information may also be extracted on the basis of one or more other tokenization rules, such as the gray value difference attribute, the color saturation attribute or the contrast difference attribute; this is not described further here.
The code for this process is as follows: [listing not reproduced]
With the distance threshold set to V = 100, the Euclidean distances are calculated as follows:
text "ABCD": Euclidean distance d12 = 147;
text "hair": Euclidean distance d12 = 0;
text "EFG": Euclidean distance d12 = 147;
text "Q1980021XX": Euclidean distance d12 = 0.
Next, the DOM tree model is traversed and the segments of the html content that match the attribute information of the tag nodes are extracted. Four segments are obtained: "ABCD", "hair", "EFG" and "ticket Q1980021XX"; see Fig. 10.
Finally, step S24 is performed: the text information in each segment is extracted, and the matching order of arrangement is selected according to the attribute information of the tag nodes so as to form continuous text information; see Fig. 11.
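The distance calculation and threshold test of the color case can be sketched as follows. The function names are hypothetical, and the color pair AAAAAA/FFFFFF used in the usage note is an assumed example chosen only because it gives a distance of about 147; the text does not state which colors the figures use. The decision rule is written exactly as the text states it.

```python
import math


def parse_rgb(hex_color):
    """Turn 'RRGGBB' (optionally prefixed by '#') into an (r, g, b) tuple."""
    value = hex_color.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(first, second):
    """Euclidean distance d12 between two RGB triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)))


def should_extract(d12, threshold):
    """Decision rule as stated in the text: extract when d12 < V."""
    return d12 < threshold
```

With V = 100, `color_distance(parse_rgb("AAAAAA"), parse_rgb("FFFFFF"))` is roughly 147.2, so a segment with that distance is not extracted, while a segment with distance 0 is.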
Referring to Fig. 8 and Fig. 9, this case shows the process of applying table interference identification to the html content to obtain the text information.
As shown in Fig. 8, "ABCD", "hair", "EFG" and "ticket Q1980021XX" are each enclosed in a table. Each "tr" tag inside the table tag denotes a row, and the row numbers of the table are recorded as 1 to M. Each "td" tag inside the table tag denotes a column within a row, and the column numbers of each row are recorded as 1 to N.
The code for this process is as follows: [listing not reproduced]
Next, the DOM tree model is traversed and the segments of the html content that match the attribute information of the tag nodes are extracted. Four segments are obtained: "ABCD", "hair", "EFG" and "ticket Q1980021XX"; see Fig. 10.
Finally, step S24 is performed: the text information in each segment is extracted, and the matching order of arrangement is selected according to the attribute information of the tag nodes so as to form continuous text information; see Fig. 11.
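The row and column bookkeeping of the table case can be sketched as follows. The class `TableCells` and the function `read_table` are hypothetical names, nested tables are not handled, and offering row-major and column-major reading orders is an assumption about what "selecting the matching order of arrangement" could mean for a table.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Record the text of each cell with its 1-based (row, column) index."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.row = 0
        self.col = 0
        self.in_cell = False
        self.cells = {}

    def handle_starttag(self, tag, attrs):
        if tag == "tr":  # a new row: rows are numbered 1..M
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):  # a new cell: columns are numbered 1..N
            self.col += 1
            self.in_cell = True
            self.cells[(self.row, self.col)] = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[(self.row, self.col)] += data.strip()


def read_table(html, column_major=False):
    """Return the cell texts row by row, or column by column if requested."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    if column_major:
        order = sorted(parser.cells, key=lambda rc: (rc[1], rc[0]))
    else:
        order = sorted(parser.cells)
    return [parser.cells[rc] for rc in order]
```

Because every cell keeps its (row, column) index, the same parsed table can be read in whichever order reconstructs the text that the sender intended the recipient to see.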
Referring to Fig. 12 and Fig. 13, these show the process of recombining the content of text information that contains meaningless-character interference.
For text information with meaningless-character interference, the html content needs to be recombined to facilitate analysis and processing. The specific steps are as follows:
Convert the encoding of the mail content, the target character set being Unicode;
Extract the characters whose Unicode code lies in the sections [48, 57], [65, 90] and [97, 122] (ASCII code values) as the header information of the recombined text information;
Extract the characters whose Unicode code lies in the section [13312, 40895] (Chinese characters) as the middle information of the recombined text information;
Extract the characters whose Unicode code lies in the section [40960, 55215] (characters of other languages) as the tail information of the recombined text information;
Replace each run of consecutive spaces in the recombined text information with a single space;
Remove the spaces between two Chinese characters (or English letters, Hiragana, Katakana, or German letters) in the recombined text information;
Delete the carriage return and line feed characters from the recombined text information.
Here, removing space characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 32.
Removing carriage return characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 13.
Removing line feed characters specifically comprises: checking the recombined text information character by character and deleting every character whose ASCII code value is 10. The header information, middle information and tail information are then arranged in sequence to form continuous recombined text.
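The recombination into header, middle and tail can be sketched directly from the code sections given above. The function name `recombine` is hypothetical, and dropping characters that fall outside all three sections is an assumption, since the text does not say what happens to them.

```python
HEADER_RANGES = ((48, 57), (65, 90), (97, 122))  # digits and ASCII letters
MIDDLE_RANGES = ((13312, 40895),)                # Chinese characters
TAIL_RANGES = ((40960, 55215),)                  # characters of other languages


def _in_ranges(code, ranges):
    return any(low <= code <= high for low, high in ranges)


def recombine(text):
    """Group characters by code section, then join header, middle and tail."""
    header, middle, tail = [], [], []
    for ch in text:
        code = ord(ch)
        if _in_ranges(code, HEADER_RANGES):
            header.append(ch)
        elif _in_ranges(code, MIDDLE_RANGES):
            middle.append(ch)
        elif _in_ranges(code, TAIL_RANGES):
            tail.append(ch)
        # Assumption: characters outside all three sections are dropped.
    return "".join(header) + "".join(middle) + "".join(tail)
```

Within each group the characters keep their original relative order, so meaningless Latin filler interleaved with Chinese text ends up in the header and the Chinese characters are brought together in the middle.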
The final result of processing the html content of Fig. 12 is:
"d!) ss as (v, v, v, v) b b z,.C c s c v c m c x Q Q:a e c n z n!2011 07 13 your the good existing invoices of our company can externally act on behalf of building advertisement and consult
The Liu Sheng in need that please contact that pays the bill after sale etc. can be tested is ask if any bothering see forgiving date Shao ".
The invention thus effectively separates the interference information contained in spam from the text information and accurately identifies the various kinds of interference information deliberately embedded in spam, providing an accurate basis for the subsequent determination of whether a mail is spam.
This specification also discloses a spam determination method, comprising the above method for removing interference information from mail; and comparing the recombined text information with a keyword library stored in a database, and determining whether the mail is spam.
Preferably, the database is an Oracle database, a DB2 database, a PostgreSQL database, a Microsoft SQL Server database, a Microsoft Access database or a MySQL database, and more preferably a MySQL database. The above method effectively improves the interception and filtering of spam.
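The comparison with the keyword library can be sketched as follows. An in-memory SQLite database from the Python standard library stands in for the MySQL database preferred above, purely so that the sketch is self-contained; the table name `keywords`, its contents and all function names are hypothetical, and a simple substring match is assumed as the comparison.

```python
import sqlite3


def demo_connection():
    """Build a small in-memory keyword library (stand-in for the database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keywords (word TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO keywords VALUES (?)", [("invoice",), ("\u53d1\u7968",)]
    )
    return conn


def load_keywords(conn):
    """Read every keyword from the library."""
    rows = conn.execute("SELECT word FROM keywords").fetchall()
    return [row[0] for row in rows]


def is_spam(recombined_text, keywords):
    """Assumed rule: the mail is spam if any keyword occurs in the text."""
    return any(word in recombined_text for word in keywords)
```

The check is only meaningful after the interference has been removed and the content recombined, because a keyword that is split by filler characters in the raw mail would not match.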
The detailed descriptions listed above are merely specific descriptions of feasible embodiments of the invention and are not intended to limit its scope of protection; all equivalent embodiments or modifications that do not depart from the technical spirit of the invention fall within its scope of protection.
It will be obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in all respects as illustrative and not restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the invention. No reference sign in the claims should be construed as limiting the claim concerned.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written in this way merely for the sake of clarity; those skilled in the art should treat the specification as a whole, and the technical solutions of the various embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.