CN106227808B

CN106227808B - Method for removing interference information from mail, and method for determining spam

Info

Publication number: CN106227808B
Application number: CN201610584290.8A
Authority: CN
Inventors: 徐慧灵; 纪春来
Original assignee: Xiamen Rong Neng Technology Co Ltd
Current assignee: Huayun Data Co ltd
Priority date: 2016-07-22
Filing date: 2016-07-22
Publication date: 2019-04-05
Anticipated expiration: 2036-07-22
Also published as: CN106227808A

Abstract

The present invention discloses a method for removing interference information from mail and a method for determining spam. The method for removing interference information from mail includes: obtaining the HTML content contained in a mail; building a Document Object Model (DOM) from the HTML content, performing at least one of the following interference identification processes on the Document Object Model (color-block interference identification, font-size interference identification, and table interference identification), and then converting the HTML content into text information; and recombining the content of the processed text information. The invention separates the interference information contained in spam from the text information and identifies the various kinds of interference deliberately embedded in spam, which provides an accurate basis for subsequently determining whether a mail is spam and improves the interception and filtering of spam.

Description

Method for removing interference information from mail, and method for determining spam

Technical field

The present invention relates to the field of anti-spam technology, and in particular to a method for removing interference information from mail and to a method for determining spam based on that method.

Background art

With the development of the Internet, spam causes users more and more harm. Spam typically includes promotional mail and mail carrying pornographic or other harmful information. For this reason, a variety of methods for identifying and filtering spam, as well as filtering mechanisms on back-end servers, have appeared in the prior art.

The mainstream anti-spam methods at present are mainly the following. (1) Optical character recognition (OCR): content containing advertising pictures or plain text is extracted and examined to decide whether it is advertising, thereby identifying spam; however, this technique places a heavy load on the computer. (2) Mail detection based on MD5 checksums: a hash operation converts a string of arbitrary length into a shorter value of fixed length. Because two different strings in practice yield different MD5 values, comparing the MD5 values of two strings shows whether the strings are identical. However, this technique only works when the mail contents are strictly identical: any change in the content changes the MD5 value, which seriously impairs both the determination of whether the mail is spam and the subsequent filtering and interception.

Meanwhile, prior-art anti-spam techniques directly scan the mail for preset text or pictures. This inevitably means that normally sent mail must undergo the same inspection or filtering, which increases the computational overhead of the back-end server or web search engine. A method for preprocessing mail that may be identified as spam is therefore highly desirable, both to avoid blindly performing spam determination, interception, and deletion on all mail and to improve the efficiency of spam interception.

In addition, if a spam sender adds interfering characters to the spam or rearranges the way its content is displayed, existing anti-spam systems have difficulty recognizing the mail as spam, which greatly reduces interception efficiency.

In view of this, the prior-art methods for preprocessing the interference information contained in spam need to be improved to solve the above problems.

Summary of the invention

One object of the present invention is to disclose a method for removing interference information from spam, so as to avoid blindly performing spam determination, interception, and deletion on all mail and to improve the efficiency of spam interception. Another object of the present invention is to disclose a method for determining spam that improves the efficiency with which mail containing interference information is determined to be spam, and thereby improves the interception and filtering of spam.

To achieve the first object, the present invention provides a method for removing interference information from mail, comprising:

S1, obtaining the HTML content contained in a mail;

S2, building a Document Object Model from the HTML content, performing at least one of the following interference identification processes on the Document Object Model, and then converting the HTML content into text information, the interference identification processes including: color-block interference identification, font-size interference identification, and table interference identification;

S3, recombining the content of the processed text information.

As a further improvement of the present invention, the interference identification processes in step S2 further include sensitive-word interference identification.

As a further improvement of the present invention, converting the HTML content into text information in step S2 is specifically: deleting the tags from the HTML content so as to extract the text information in the HTML content.

As a further improvement of the present invention, sensitive-word interference identification includes: conversion between uppercase and lowercase letters, conversion between standard and non-standard characters, and conversion between letters and digits.

As a further improvement of the present invention, the conversion between uppercase and lowercase letters is specifically: checking the text information character by character and, when the ASCII code value of a character lies in [65, 90], increasing that value by 32.

As a further improvement of the present invention, the conversion between standard and non-standard characters is specifically: checking the text information character by character and changing the data value of each non-standard character contained in the text information to the corresponding standard character according to the Unicode code table.

As a further improvement of the present invention, the conversion between letters and digits is specifically: checking the text information character by character and substituting letters and digits for one another according to the ASCII code table.

As a further improvement of the present invention, building the Document Object Model from the HTML content in step S2 comprises the following steps:

S21, taking the HTML content as input and parsing it into a plurality of tokens;

S22, constructing a DOM tree model from the tokens, the DOM tree model including a number of tag nodes, each tag node carrying the attribute information that matches it;

S23, traversing the DOM tree model and extracting the segments of the HTML content that match the attribute information of the tag nodes;

S24, extracting the text information in each segment and selecting the matching order according to the attribute information of the tag nodes, so as to form continuous text information.

As a further improvement of the present invention, step S21 is specifically: traversing and parsing the HTML content according to preset tokenization rules, after which the tokenizer recognizes the tokens and passes them to the DOM tree model builder.

As a further improvement of the present invention, the tokenization rules cover the start-position marker of the HTML content, the end-position marker of the HTML content, category attributes, attribute names, and attribute values.

As a further improvement of the present invention, the category attributes include a font-size attribute, a font-slant attribute, a horizontal font-arrangement difference attribute, a vertical font-arrangement difference attribute, an RGB difference attribute, a gray-value difference attribute, a color-saturation attribute, and a contrast difference attribute.

As a further improvement of the present invention, step S3 comprises the following steps:

S31, converting the encoding of the processed text information using an encoding converter;

S32, taking the characters that fall in code ranges of set length as, respectively, the header information, middle information, and tail information of the recombined text information;

S33, arranging the header information, middle information, and tail information in sequence to form the recombined, continuous text information.

As a further improvement of the present invention, after step S33 the method further comprises performing one or more of the following operations on the recombined text information:

removing space characters;

removing carriage-return characters;

removing line-feed characters; wherein

removing space characters is specifically: checking the recombined text information character by character and deleting every character whose ASCII code value is 32;

removing carriage-return characters is specifically: checking the recombined text information character by character and deleting every character whose ASCII code value is 13;

removing line-feed characters is specifically: checking the recombined text information character by character and deleting every character whose ASCII code value is 10.
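The removal operations above amount to a character-by-character filter on code values. A minimal Python sketch follows; the function name `strip_layout_chars` and its `remove` parameter are illustrative assumptions, not names from the patent.

```python
# Code points to delete: space (32), line feed (10), carriage return (13).
REMOVABLE = frozenset({32, 10, 13})

def strip_layout_chars(text: str, remove=REMOVABLE) -> str:
    """Check the text character by character and drop the listed code points."""
    return "".join(ch for ch in text if ord(ch) not in remove)

print(strip_layout_chars("a b\r\nc"))  # abc
```

Passing a smaller set, for example `remove={32}`, performs only one of the operations, which matches the "one or more" wording of the text.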

As a further improvement of the present invention, before step S1 is executed the method further comprises a step of using an encoder to convert the encoding of the HTML content into Unicode.

To achieve the other object, the present invention also provides a method for determining spam, comprising the method for removing interference information from mail according to any of the above; and

comparing the recombined text information with a keyword library stored in a database, and determining whether the mail is spam.

As a further improvement of the present invention, the database includes an Oracle database, a DB2 database, a PostgreSQL database, a Microsoft SQL Server database, a Microsoft Access database, or a MySQL database.

Compared with the prior art, the beneficial effects of the present invention are as follows: the invention separates the interference information contained in spam from the text information and identifies the various kinds of interference deliberately embedded in spam, which provides an accurate basis for subsequently determining whether a mail is spam and improves the interception and filtering of spam.

Brief description of the drawings

Fig. 1 is a flow chart of the method for removing interference information from mail according to the present invention;

Fig. 2 is a schematic diagram of unprocessed HTML content containing interference information such as English letters and digits;

Fig. 3 is a schematic diagram of the HTML content of Fig. 2 after sensitive-word interference identification has been applied to the interference information such as English letters and digits;

Fig. 4 is a schematic diagram of unprocessed HTML content containing interference information in fonts of different sizes;

Fig. 5 shows the DOM tree model used for font-size interference identification after the Document Object Model has been built from the HTML content in step S22;

Fig. 6 is a schematic diagram of unprocessed HTML content containing interference information in different color blocks;

Fig. 7 shows the DOM tree model used for color-block interference identification after the Document Object Model has been built from the HTML content in step S22;

Fig. 8 is a schematic diagram of unprocessed HTML content containing table interference information;

Fig. 9 shows the DOM tree model used for table interference identification after the Document Object Model has been built from the HTML content in step S22;

Fig. 10 is a schematic diagram of the processed text information obtained in Fig. 5, Fig. 7, or Fig. 9 before content recombination;

Fig. 11 is a schematic diagram of the processed text information obtained in Fig. 5, Fig. 7, or Fig. 9 after content recombination;

Fig. 12 is a schematic diagram of text information containing interference in the form of meaningless characters, before content recombination;

Fig. 13 is a schematic diagram of the text information of Fig. 12, containing interference in the form of meaningless characters, after content recombination.

Detailed description of the embodiments

The present invention is described in detail below with reference to the embodiments shown in the drawings. It should be noted that these embodiments do not limit the present invention; functional, methodological, or structural equivalents or substitutions made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the present invention.

When determining whether a target mail is spam, prior-art spam interception systems and software cannot remove the interference information that spammers add to mail. This significantly impairs spam filtering, and some non-spam mail may even be identified as spam. The application scenarios and implementation processes shown in this part of the specification merely illustrate the invention and therefore do not limit its scope of protection or its inventive purpose.

Referring to Fig. 1, in this embodiment the method for removing interference information from mail comprises the following steps: S1, obtaining the HTML content contained in a mail; S2, building a Document Object Model from the HTML content, performing at least one of the following interference identification processes on the Document Object Model, and then converting the HTML content into text information, the interference identification processes including color-block interference identification, font-size interference identification, and table interference identification; S3, recombining the content of the processed text information.

If the mail contains HTML, the HTML content is obtained and converted into text information. The conversion deletes the tags from the HTML content, that is, it deletes HTML tag information such as "<tag name attribute name=attribute value></tag name>"; what remains is the mail content. If the mail contains no HTML, the plain-text information of the mail is used as the mail content.
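The tag-deletion step and the plain-text fallback can be sketched in Python with the standard-library `html.parser` module. This is a minimal illustration under assumptions: the names `mail_text`, `html_body`, and `plain_body` are hypothetical, and the sketch keeps every text node, including any script or style content.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the text between tags; the tags themselves are discarded."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def mail_text(html_body, plain_body=""):
    """Return the mail content: tag-stripped HTML if present, else plain text."""
    if not html_body:
        return plain_body
    collector = _TextCollector()
    collector.feed(html_body)
    collector.close()  # flush any buffered text
    return "".join(collector.parts)

print(mail_text('<p class="x">Hello <b>world</b></p>'))  # Hello world
```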

The kinds of interference commonly found in spam include:

mutual substitution of the English letter "I" and the Arabic numeral "1"; mutual substitution of the English letter "O" (uppercase or lowercase) and the Arabic numeral "0"; interference through background color; and interference through the arrangement of the text information. If a mail contains the English letter "I" and the Arabic numeral "1" deliberately substituted for each other and mixed into continuous Chinese or English text in order to evade anti-spam interception, the mail is likely to be spam. To improve the interception and filtering of such spam by existing anti-spam systems or software, this interference must be removed and the text that the mail actually carries must be extracted; a prior-art anti-spam system or anti-spam software can then intercept and filter the mail and prevent the spam from reaching the server.

In this embodiment, in addition to the three kinds of interference processing above, the interference identification processes in step S2 further include sensitive-word interference identification. Converting the HTML content into text information in step S2 is specifically: deleting the tags from the HTML content so as to extract the text information in the HTML content.

Referring to Fig. 2 and Fig. 3, in this embodiment sensitive-word interference identification includes: conversion between uppercase and lowercase letters, conversion between standard and non-standard characters, and conversion between letters and digits.

The conversion between uppercase and lowercase letters is specifically: the text information is checked character by character, and when the ASCII code value of a character lies in [65, 90], that value is increased by 32, which moves it into [97, 122]. This completes the conversion of English letters from uppercase to lowercase.
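The case conversion described here can be written directly from the code ranges. The following Python sketch is illustrative only, and the function name `to_lowercase_ascii` is an assumption.

```python
def to_lowercase_ascii(text: str) -> str:
    """Add 32 to every code value in [65, 90]; leave other characters alone."""
    out = []
    for ch in text:
        code = ord(ch)
        if 65 <= code <= 90:   # 'A'..'Z'
            code += 32         # becomes 'a'..'z', i.e. [97, 122]
        out.append(chr(code))
    return "".join(out)

print(to_lowercase_ascii("FaPiao Q198"))  # fapiao q198
```

Unlike `str.lower()`, this touches only the 26 ASCII uppercase letters, which is what the text specifies.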

The conversion between standard and non-standard characters is specifically: the text information is checked character by character, and the data value of each non-standard character contained in the text information is changed to the corresponding standard character according to the Unicode code table. For example, if the input text is found to contain the code value 0x2776, i.e. the character ❶, it is converted to ASCII code 49, i.e. the digit 1.
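A lookup table keyed by code point is one way to sketch this conversion in Python. The table below is a deliberately tiny, hypothetical example: only the 0x2776 entry comes from the text, and the other entries are assumptions added to show how further variants of "1" could be listed.

```python
# Hypothetical mapping from non-standard code points to standard characters.
NONSTANDARD_TO_STANDARD = {
    0x2776: "1",  # DINGBAT NEGATIVE CIRCLED DIGIT ONE (the example in the text)
    0x2460: "1",  # CIRCLED DIGIT ONE
    0x2160: "1",  # ROMAN NUMERAL ONE
    0x2170: "1",  # SMALL ROMAN NUMERAL ONE
    0xFF11: "1",  # FULLWIDTH DIGIT ONE
}

def normalize_nonstandard(text: str) -> str:
    """Replace each listed non-standard character with its standard form."""
    return text.translate(NONSTANDARD_TO_STANDARD)

print(normalize_nonstandard("Q\u2776980"))  # Q1980
```

A production table would need an entry for every variant the filter should recognise; how that table is populated is not specified in the text.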

In this embodiment, a non-standard character is defined by the meaning that the person adding the interference information actually intends to express. For the Arabic numeral "1", for example, non-standard characters include but are not limited to "(1)", "(1)", "1. ", "Ⅰ", "ⅰ", and "1".

The conversion between letters and digits is specifically: the text information is checked character by character, and letters and digits are substituted for one another according to the ASCII code table.

For example, the ASCII values of the letters to be converted are 105 (i) and 111 (o), and the ASCII values of the corresponding target digits are 49 (1) and 48 (0). The correspondence can therefore be determined by looking up the values in the ASCII code table: the letter "i" is replaced with the Arabic numeral "1", and the letter "o" is replaced with the Arabic numeral "0".
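The two substitutions in this example can be sketched in Python as a code-value table. The name `letters_to_digits` is an assumption, and the sketch presumes that uppercase letters have already been lowered by the case conversion, so only 105 and 111 need entries.

```python
# ASCII code values from the example: 105 (i) -> 49 (1), 111 (o) -> 48 (0).
LETTER_TO_DIGIT = {105: 49, 111: 48}

def letters_to_digits(text: str) -> str:
    """Replace the letters i and o with the digits 1 and 0."""
    return text.translate(LETTER_TO_DIGIT)

print(letters_to_digits("q198oo2i"))  # q1980021
```

Because the substitution is applied to every "i" and "o", ordinary words are altered too; presumably the keyword library would be normalised in the same way before comparison, although the text does not say so.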

In this embodiment, building the Document Object Model from the HTML content in step S2 comprises the following steps:

S21, taking the HTML content as input and parsing it into a plurality of tokens. Step S21 is specifically: the text information is traversed and parsed according to preset tokenization rules, after which the tokenizer recognizes the tokens and passes them to the DOM tree model builder.

The tokenization rules cover the start-position marker of the HTML content, the end-position marker of the HTML content, category attributes, attribute names, and attribute values.

The category attributes include a font-size attribute, a font-slant attribute, a horizontal font-arrangement difference attribute, a vertical font-arrangement difference attribute, an RGB difference attribute, a gray-value difference attribute, a color-saturation attribute, and a contrast difference attribute.

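Preferably, in this embodiment, step S3 comprises the following steps: S31, converting the encoding of the processed text information using an encoding converter; S32, taking the characters that fall in code ranges of set length as, respectively, the header information, middle information, and tail information of the recombined text information; S33, arranging the header information, middle information, and tail information in sequence to form the recombined, continuous text information.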

After step S33, the method further comprises performing one or more of the following operations on the recombined text information:

removing space characters;

removing carriage-return characters;

removing line-feed characters; wherein

removing space characters is specifically: the recombined text information is checked character by character, and every character whose ASCII code value is 32 is deleted.

Before step S1 is executed, the method further comprises a step of using an encoder to convert the encoding of the HTML content into Unicode.
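In Python, converting a mail body to Unicode amounts to decoding its bytes with the declared character set. The sketch below is an assumption about how that step might look; the function name `to_unicode` and the fallback policy (UTF-8 with replacement characters) are not taken from the patent.

```python
def to_unicode(raw: bytes, charset: str = "utf-8") -> str:
    """Decode a mail body to a Unicode string using its declared charset."""
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: fall back leniently.
        return raw.decode("utf-8", errors="replace")

print(to_unicode(b"abc", "ascii"))  # abc
```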

Next, with reference to Fig. 4 to Fig. 9, the specific implementation of interference removal is described in detail for three kinds of interference that are typical in HTML content.

Referring to Fig. 4 and Fig. 5, this case shows the process of extracting the text information from the HTML content when the tokenization rule is based on the font-size attribute.

As shown in Fig. 4, two lines of text appear. To simplify the presentation, "A, B, C, D, E, F, G" stand for Chinese text in this specification and "XX" stands for Arabic numerals. The first line is "ABCDF hair" and the second line is "ticket Q1980021XX". Here "hair" and "ticket" are the literal renderings of the two Chinese characters that together form the word for "invoice". The font of "ABCD" and "EFG" is smaller (attribute font-size:6px), and the font of "hair" and "ticket Q1980021XX" is larger (attribute font-size:20px).

As shown in Fig. 5, in this embodiment parsing the HTML comprises three steps: tokenization, tree construction, and extraction of the required information. The code for this process was given as a listing that is not reproduced in this text.
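The following is not the listing from the original publication. It is a minimal Python sketch of the three steps named above (tokenization, tree construction, extraction), built on the standard-library `html.parser`. The class and function names, the sample markup, the inherited default of 16px, and the 10px cut-off are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

class Node:
    """One tag node of the tree; children are Node objects or text strings."""
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []

class TreeBuilder(HTMLParser):
    """Receives tokens from HTMLParser and assembles them into a tree."""
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in self.VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Only well-nested input is handled in this sketch.
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

def font_size(node, inherited):
    """Font size in px from the inline style, else the inherited value."""
    style = node.attrs.get("style") or ""
    match = re.search(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", style)
    return float(match.group(1)) if match else inherited

def segments(node, inherited=16.0):
    """Traverse the tree and return (text, font size) pairs in document order."""
    size = font_size(node, inherited)
    found = []
    for child in node.children:
        if isinstance(child, str):
            found.append((child, size))
        else:
            found.extend(segments(child, size))
    return found

# Hypothetical markup modelled on the description of Fig. 4.
SAMPLE = (
    '<div><span style="font-size:6px">ABCD</span>'
    '<span style="font-size:20px">hair</span>'
    '<span style="font-size:6px">EFG</span></div>'
    '<div style="font-size:20px">ticket Q1980021XX</div>'
)
builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()
all_segments = segments(builder.root)
large_only = [text for text, size in all_segments if size >= 10.0]
print(large_only)  # ['hair', 'ticket Q1980021XX']
```

The sketch reads only inline `style` attributes. Resolving sizes set through style sheets would require a CSS cascade, which is outside its scope.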

Next, the DOM tree model is traversed and the segments of the HTML content that match the attribute information of the tag nodes are extracted. This yields four segments, "ABCD", "hair", "EFG", and "ticket Q1980021XX"; see Fig. 10.

Finally, step S24 is executed: the text information in each segment is extracted and the matching order is selected according to the attribute information of the tag nodes, so as to form continuous text information; see Fig. 11.

Referring to Fig. 6 and Fig. 7, this case shows the process of extracting the text information from the HTML content when the tokenization rule is based on the RGB difference attribute.

For this kind of interference, step S23 is specifically: the DOM tree model is traversed to extract the text information. The final attributes of each tag node are generated according to the rules of Cascading Style Sheets (CSS), and the text information of the tag nodes in the HTML content is extracted in order according to the value of the node's color attribute (default RGB: 000000) and the value of its background-color attribute (default RGB: FFFFFF). A determination method decides whether a given piece of text information is to be extracted: the Euclidean distance d12 between the foreground color and the background color is calculated, where (x1, y1, z1) is the RGB value of the foreground color, (x2, y2, z2) is the RGB value of the background color, and d12 is the resulting Euclidean distance, and d12 is then compared with a preset color threshold V. The calculation is given by formula (1):

$$d_{12} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2} \qquad (1)$$

If d12 < V, the text information in the HTML content is extracted; otherwise it is not extracted.

Of course, those skilled in the art can reasonably foresee that the text information may also be extracted on the basis of one or more other rules, such as the gray-value difference attribute, the color-saturation attribute, or the contrast difference attribute; these are not described further here.

The code for this process was given as a listing that is not reproduced in this text.

With the distance threshold set to V = 100, the Euclidean distances are calculated as:

text "ABCD": Euclidean distance d12 = 147;

text "hair": Euclidean distance d12 = 0;

text "EFG": Euclidean distance d12 = 147;

text "Q1980021XX": Euclidean distance d12 = 0.
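The distance calculation and the threshold test can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, the decision rule simply copies the stated condition d12 < V, and the color pair 555555/000000 is chosen only because it happens to give a distance of about 147; the text does not state which colors were actually used in the figures.

```python
import math

def parse_rgb(hex_color: str):
    """Turn 'RRGGBB' (optionally prefixed with '#') into an (r, g, b) tuple."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def rgb_distance(c1, c2) -> float:
    """Euclidean distance between two RGB triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def should_extract(d12: float, v: float = 100.0) -> bool:
    """Decision rule as stated in the text: extract when d12 < V."""
    return d12 < v

d = rgb_distance(parse_rgb("555555"), parse_rgb("000000"))
print(round(d), should_extract(d), should_extract(0.0))
```

With V = 100, a distance of 147 fails the test and a distance of 0 passes, which agrees with the four values listed above.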

Finally, step S24 is executed: the text information in each segment is extracted and the matching order is selected according to the attribute information of the tag nodes, so as to form continuous text information; see Fig. 11.

Referring to Fig. 8 and Fig. 9, this case shows the process of applying table interference identification to the HTML content to obtain the text information.

As shown in Fig. 8, "ABCD", "hair", "EFG", and "ticket Q1980021XX" are each enclosed in a table. Each "tr" tag inside the table tag represents one row, and the rows of the table are numbered 1 to M. Each "td" tag inside the table tag represents one column within a row, and the columns of each row are numbered 1 to N.

The code for this process was given as a listing that is not reproduced in this text.
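The following is not the listing from the original publication. It is a minimal Python sketch of the row and column bookkeeping described above, in which each cell is stored under its (row, column) number and then read back in order. The class name `TableCells`, the function `table_text`, and the sample table layout are assumptions; nested tables are not handled.

```python
from html.parser import HTMLParser

class TableCells(HTMLParser):
    """Records the text of every td cell under its (row, column) number."""
    def __init__(self):
        super().__init__()
        self.row = 0          # rows are numbered from 1
        self.col = 0          # columns are numbered from 1 within each row
        self.in_cell = False
        self.cells = {}

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag == "td":
            self.col += 1
            self.in_cell = True
            self.cells[(self.row, self.col)] = ""

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[(self.row, self.col)] += data

def table_text(html: str):
    """Return the cell texts in row-major order."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return [parser.cells[key].strip() for key in sorted(parser.cells)]

# Hypothetical 2 x 2 layout; the actual layout of Fig. 8 is not given in the text.
SAMPLE_TABLE = (
    "<table><tr><td>ABCD</td><td>hair</td></tr>"
    "<tr><td>EFG</td><td>ticket Q1980021XX</td></tr></table>"
)
print(table_text(SAMPLE_TABLE))
```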

Finally, step S24 is executed: the text information in each segment is extracted and the matching order is selected according to the attribute information of the tag nodes, so as to form continuous text information; see Fig. 11.

Referring to Fig. 12 and Fig. 13, the process of recombining the content of text information that contains interference in the form of meaningless characters is disclosed.

For text information disturbed by meaningless characters, the HTML content needs to be recombined to facilitate analysis and processing. The specific steps are as follows:

convert the encoding of the mail content, the target character set being Unicode;

extract the characters whose Unicode code values lie in [48, 57], [65, 90], or [97, 122] (ASCII code values) as the header information of the recombined text information;

extract the characters whose Unicode code values lie in [13312, 40895] (Chinese characters) as the middle information of the recombined text information;

extract the characters whose Unicode code values lie in [40960, 55215] (characters of other languages) as the tail information of the recombined text information;

replace each run of consecutive spaces in the recombined text information with a single space;

remove the spaces between two Chinese characters (or English letters, Japanese hiragana, Japanese katakana, or German characters) in the recombined text information;

delete the carriage-return and line-feed characters from the recombined text information.

Here, removing space characters is specifically: the recombined text information is checked character by character, and every character whose ASCII code value is 32 is deleted.

Removing line-feed characters is specifically: the recombined text information is checked character by character, and every character whose ASCII code value is 10 is deleted. The header information, middle information, and tail information are then arranged in sequence to form the recombined, continuous text.
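The partition into header, middle, and tail information can be sketched in Python from the three code ranges given above. The function name `recombine` is hypothetical, and discarding characters that fall outside all three ranges is an assumption of this sketch; the text does not say what happens to them.

```python
# Code ranges taken from the steps above.
HEAD_RANGES = ((48, 57), (65, 90), (97, 122))   # digits and ASCII letters
MIDDLE_RANGES = ((13312, 40895),)               # Chinese characters
TAIL_RANGES = ((40960, 55215),)                 # characters of other languages

def _in_ranges(code, ranges):
    return any(low <= code <= high for low, high in ranges)

def recombine(text: str) -> str:
    """Group characters by code range and join header + middle + tail."""
    head, middle, tail = [], [], []
    for ch in text:
        code = ord(ch)
        if _in_ranges(code, HEAD_RANGES):
            head.append(ch)
        elif _in_ranges(code, MIDDLE_RANGES):
            middle.append(ch)
        elif _in_ranges(code, TAIL_RANGES):
            tail.append(ch)
        # Assumption: anything else (spaces, punctuation, line breaks) is dropped.
    return "".join(head) + "".join(middle) + "".join(tail)

print(recombine("\u53d1a1\u7968 b\r\n\ud55c"))
```

Within each group the characters keep their original relative order, so interleaved noise such as "a1" and "b" is moved to the front while the Chinese characters become contiguous.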

The final processing result for the HTML content in Fig. 12 is:

"d！) ss as (v, v, v, v) b b z,.C c s c v c m c x Q Q:a e c n z n！2011 07 13 your the good existing invoices of our company can externally act on behalf of building advertisement and consult The Liu Sheng in need that please contact that pays the bill after sale etc. can be tested is ask if any bothering see forgiving date Shao ".

The present invention separates the interference information contained in spam from the text information and identifies the various kinds of interference deliberately embedded in spam, which provides an accurate basis for subsequently determining whether a mail is spam.

This specification also discloses a method for determining spam, comprising the above method for removing interference information from mail; and comparing the recombined text information with a keyword library stored in a database, and determining whether the mail is spam.

Preferably, the database includes an Oracle database, a DB2 database, a PostgreSQL database, a Microsoft SQL Server database, a Microsoft Access database, or a MySQL database, and is more preferably a MySQL database. This method improves the interception and filtering of spam.
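The comparison against the keyword library can be sketched in Python as a substring search. This is an illustration under assumptions: the keyword library is shown as an in-memory list rather than a database table, and the function name `is_spam` and the `min_hits` threshold are hypothetical, since the text does not specify how many matches are required.

```python
def is_spam(recombined_text: str, keywords, min_hits: int = 1):
    """Return (verdict, matched keywords) for the recombined text."""
    hits = [kw for kw in keywords if kw and kw in recombined_text]
    return len(hits) >= min_hits, hits

# The keyword list would normally be loaded from the database.
KEYWORDS = ["invoice", "lottery"]
print(is_spam("q1980021 invoice agency", KEYWORDS))  # (True, ['invoice'])
```

Because the text has already been normalised and recombined, a plain substring test is enough here; matching on the raw HTML would miss keywords that were split by tags, tables, or inserted characters.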

The detailed descriptions listed above are merely specific illustrations of feasible embodiments of the present invention and are not intended to limit its scope of protection; all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall fall within its scope of protection.

It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in every respect as illustrative and not restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the invention. No reference sign in the claims shall be construed as limiting the claim concerned.

In addition, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written in this way merely for clarity; those skilled in the art should treat it as a whole, and the technical solutions of the individual embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A method for removing mail interference information, characterized by comprising the following steps:

S1: obtaining the HTML content contained in a mail;

S2: building a Document Object Model (DOM) from the HTML content, performing at least one of the following interference identification processes on the Document Object Model, and then converting the HTML content into text information, the interference identification processes comprising: color block interference identification processing, font size interference identification processing, and table interference identification processing;

S3: performing content recombination on the processed text information;

wherein building the Document Object Model from the HTML content in step S2 comprises the following steps:

S21: taking the HTML content as input and parsing it into a plurality of tags;

S22: constructing a DOM tree model from the tags, the DOM tree model comprising several marker nodes, wherein each marker node contains attribute information matching that marker node;

S23: traversing the DOM tree model and extracting the fragment information in the HTML content that matches the attribute information of the marker nodes;

S24: extracting the text information from each piece of fragment information, and selecting a matching arrangement order according to the attribute information of the marker nodes, so as to form continuous text information.
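As an illustration only, steps S21 to S24 can be sketched with the HTML parser in the Python standard library. The class and function names (`Node`, `DomBuilder`, `extract_text`, `html_to_text`) are hypothetical and do not appear in the patent. The sketch also assumes that the "matching arrangement order" is plain document order, which the claim does not state.

```python
from html.parser import HTMLParser


class Node:
    """One node of a simplified DOM tree (hypothetical structure)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])  # attribute information of the node
        self.parent = parent
        self.children = []              # child nodes and text strings, in order


class DomBuilder(HTMLParser):
    """Turns the parser's tag events into a simplified tree (S21, S22)."""

    # Elements that never have a closing tag.
    VOID = {"br", "img", "hr", "meta", "input", "link"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def extract_text(node):
    """Depth-first traversal joining text fragments in document order (S23, S24)."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.append(extract_text(child))
    return "".join(parts)


def html_to_text(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()  # flush any buffered text
    return extract_text(builder.root)
```

Because each node keeps its attributes, an interference identification pass could inspect `node.attrs` during the traversal and skip fragments it judges to be interference before the text is joined.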

2. The method for removing mail interference information according to claim 1, characterized in that the interference identification processing in step S2 further comprises: sensitive word interference identification processing.

3. The method for removing mail interference information according to claim 2, characterized in that converting the HTML content into text information in step S2 is specifically: deleting the tags from the HTML content so as to extract the text information in the HTML content.

4. The method for removing mail interference information according to claim 3, characterized in that the sensitive word interference identification processing comprises: conversion between uppercase and lowercase letters, conversion between standard and non-standard characters, and conversion between letters and digits.

5. The method for removing mail interference information according to claim 4, characterized in that the conversion between uppercase and lowercase letters is specifically: checking the text information character by character, and when the ASCII code value of a character lies in [65, 90], increasing the ASCII code value of that character by 32.
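The rule in claim 5 can be written out directly. The following is a minimal sketch; the function name `to_lowercase_ascii` is hypothetical.

```python
def to_lowercase_ascii(text):
    """Check the text character by character and add 32 to every code in [65, 90]."""
    out = []
    for ch in text:
        code = ord(ch)
        if 65 <= code <= 90:   # 'A'..'Z'
            code += 32         # 'a'..'z'
        out.append(chr(code))
    return "".join(out)
```

Characters outside the range [65, 90], including digits, punctuation, and non-ASCII characters, pass through unchanged.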

6. The method for removing mail interference information according to claim 4, characterized in that the conversion between standard and non-standard characters is specifically: checking the text information character by character, and changing the data value of each non-standard character contained in the text information to the corresponding standard character according to the Unicode code table.
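Claim 6 does not say how a non-standard character is mapped to its standard form. One plausible reading, shown here purely as an assumption, is Unicode compatibility normalization (NFKC), which folds full-width and similar variant characters into their ordinary forms. The function name `normalize_characters` is hypothetical.

```python
import unicodedata


def normalize_characters(text):
    """Fold compatibility variants (e.g. full-width letters) into standard characters.

    NFKC is an assumed stand-in for the mapping that claim 6 leaves unspecified.
    """
    return unicodedata.normalize("NFKC", text)
```

Applying this step before the case conversion of claim 5 would let a full-width "ＦＲＥＥ" be reduced to "free".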

7. The method for removing mail interference information according to claim 4, characterized in that the conversion between letters and digits is specifically: checking the text information character by character, and replacing digits with letters according to the ASCII code table.
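Claim 7 does not list which digit is replaced by which letter. The sketch below assumes a small look-alike table; the table `DIGIT_TO_LETTER` and the function name `digits_to_letters` are hypothetical and chosen only for illustration.

```python
# Hypothetical look-alike table; the patent does not specify the actual mapping.
DIGIT_TO_LETTER = str.maketrans({
    "0": "o",
    "1": "l",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
})


def digits_to_letters(text):
    """Replace digits that are commonly used to disguise letters."""
    return text.translate(DIGIT_TO_LETTER)
```

A blanket replacement like this also rewrites genuine numbers, so a practical filter would probably apply it only to a copy of the text used for keyword matching.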

8. The method for removing mail interference information according to claim 1, characterized in that step S21 is specifically: performing a traversal parsing operation on the HTML content according to preset marking rules, after which a tag generator identifies the markers and passes them to a DOM tree model builder.

9. The method for removing mail interference information according to claim 8, characterized in that the marking rules comprise an HTML content start position marker, an HTML content end position marker, a category attribute, an attribute name, and an attribute value.

10. The method for removing mail interference information according to claim 9, characterized in that the category attribute comprises a font size attribute, a font tilt attribute, a font horizontal arrangement difference attribute, a font vertical arrangement difference attribute, an RGB difference attribute, a gray value difference attribute, a color saturation attribute, and a contrast difference attribute.
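To make two of the category attributes concrete, the sketch below computes an RGB difference between a text color and its background and combines it with a font size check. The function names, the use of a summed per-channel difference, and both thresholds are assumptions for illustration; the claims do not define how these attributes are evaluated.

```python
def parse_hex_color(value):
    """Parse '#rgb' or '#rrggbb' into an (r, g, b) tuple of integers."""
    value = value.lstrip("#")
    if len(value) == 3:
        value = "".join(c * 2 for c in value)
    if len(value) != 6:
        raise ValueError("expected a 3- or 6-digit hex color")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_difference(foreground, background):
    """Sum of absolute per-channel differences (assumed metric)."""
    fg = parse_hex_color(foreground)
    bg = parse_hex_color(background)
    return sum(abs(a - b) for a, b in zip(fg, bg))


def looks_like_interference(foreground, background, font_size_px,
                            min_rgb_difference=48, min_font_size_px=4):
    """Flag text that is nearly invisible: low color difference or a tiny font.

    Both thresholds are hypothetical defaults, not values from the patent.
    """
    return (rgb_difference(foreground, background) < min_rgb_difference
            or font_size_px < min_font_size_px)
```

Text flagged this way could be dropped during the DOM traversal so that it never reaches the extracted text information.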

11. The method for removing mail interference information according to claim 1, characterized in that step S3 comprises the following steps:

S32: taking encoded segments of a set length as the header information, middle information, and tail information of the content-recombined text information;

S33: arranging the header information, middle information, and tail information in sequence to form continuous content-recombined text information.
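One possible reading of steps S32 and S33 is sketched below: a segment of a set length is taken from the start, the center, and the end of the text, and the three segments are concatenated. How the segments are actually selected is not stated in the claim, so the slicing, the handling of short texts, and the function name `recombine` are all assumptions.

```python
def recombine(text, segment_length):
    """Join a head, middle, and tail segment of the given length (assumed reading of S32, S33)."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    # Assumption: a text too short for three separate segments is kept whole.
    if len(text) <= 3 * segment_length:
        return text
    head = text[:segment_length]
    mid_start = (len(text) - segment_length) // 2
    middle = text[mid_start:mid_start + segment_length]
    tail = text[-segment_length:]
    return head + middle + tail
```

Under this reading the recombined text has a bounded length of at most three segments, which keeps later keyword comparison cheap for very long mails.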

12. The method for removing mail interference information according to claim 11, characterized in that, after step S33, the method further comprises performing one or more of the following operations on the content-recombined text information:

an operation of removing space markers;

an operation of removing carriage return markers;

an operation of removing line feed markers; wherein,

the operation of removing space markers is specifically: checking the content-recombined text information character by character, and deleting the characters in the content-recombined text information whose ASCII code value is 32;

the operation of removing carriage return markers is specifically: checking the content-recombined text information character by character, and deleting the characters in the content-recombined text information whose ASCII code value is 10;

the operation of removing line feed markers is specifically: checking the content-recombined text information character by character, and deleting the characters in the content-recombined text information whose ASCII code value is 13.
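The three removal operations share one pattern and can be sketched as a single function that takes the code values to delete. The function name `remove_markers` and its `codes` parameter are hypothetical. Note that in standard ASCII, code 10 is the line feed and code 13 is the carriage return, the reverse of the labels used in the claim; when both codes are removed, the result is the same either way.

```python
def remove_markers(text, codes=(32, 10, 13)):
    """Check the text character by character and delete characters with the given code values."""
    drop = set(codes)
    return "".join(ch for ch in text if ord(ch) not in drop)
```

Passing a shorter tuple, for example `codes=(32,)`, performs only one of the three operations.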

13. The method for removing mail interference information according to claim 1, characterized in that, before step S1 is executed, the method further comprises a step of using an encoder to perform code conversion processing on the HTML content so as to convert it into Unicode encoding.
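A minimal sketch of the code conversion step follows, assuming the mail body arrives as raw bytes together with an optional declared character set. The function name `to_unicode`, the fallback list, and the final Latin-1 fallback are assumptions, not details taken from the patent.

```python
def to_unicode(raw, declared_charset=None, fallbacks=("utf-8", "gb18030")):
    """Decode raw mail bytes into a Unicode string.

    Tries the declared charset first, then the assumed fallbacks in order.
    """
    candidates = ([declared_charset] if declared_charset else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # Latin-1 maps every byte to a character, so this never fails.
    return raw.decode("latin-1")
```

Converting everything to Unicode first means the later character-by-character checks operate on code points rather than on charset-specific bytes.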

14. A method for judging spam mail, characterized by comprising the method for removing mail interference information according to any one of the preceding claims; and

comparing the content-recombined text information with a keyword library set in a database, and determining whether the mail is spam.
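The comparison step can be sketched as a substring match against a keyword list. For simplicity the keyword library is an in-memory list rather than a database table, and the hit threshold is a hypothetical parameter; the claim states neither the matching rule nor a threshold.

```python
def is_spam(text, keywords, threshold=1):
    """Return (verdict, hits), where hits lists the keywords found in the text.

    The threshold on the number of hits is an assumed decision rule.
    """
    hits = [kw for kw in keywords if kw and kw in text]
    return len(hits) >= threshold, hits
```

For example, `is_spam("freeloannow", ["loan", "invoice"])` reports the mail as spam with the single hit `"loan"`.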

15. The method for judging spam mail according to claim 14, characterized in that the database comprises an Oracle database, a DB2 database, a PostgreSQL database, a Microsoft SQL Server database, a Microsoft Access database, or a MySQL database.