CN106817297B

CN106817297B - Method for identifying spam by HTML tags

Info

Publication number: CN106817297B
Application number: CN201710043772.7A
Authority: CN
Inventors: 徐慧灵; 纪春来
Original assignee: Huayun Data (xiamen) Network Co Ltd
Current assignee: Huayun Data Co ltd
Priority date: 2017-01-19
Filing date: 2017-01-19
Publication date: 2019-11-26
Anticipated expiration: 2037-01-19
Also published as: CN106817297A

Abstract

The present invention provides a method for identifying spam by HTML tags, comprising the following steps: S1, building a tag description table in which the tags in HTML code are described by characters; S2, extracting, in order, the tags in the HTML code of a spam email, and deriving verification data comprising multiple characters according to the tag description table; S3, after a new mail is received, extracting the HTML code of the new mail, and translating the tags in the HTML code of the new mail into description data according to the tag description table; S4, comparing the description data with the verification data, and determining as spam at least the new mail whose description data hits the verification data. In the invention, it is only necessary to compare the description data formed from the HTML code of the new mail with the verification data formed from the HTML code of a previously entered spam email, which reduces the computing overhead of the background server or web search engine and significantly simplifies the steps of identifying spam.

Description

Method for identifying spam by HTML tags

Technical field

The present invention relates to the field of anti-spam technology, and more particularly to a method for identifying spam by HTML tags.

Background art

With the development of the internet, spam causes more and more harm to users. Spam usually includes promotional mail or mail carrying pornographic or other harmful information. For this reason, a variety of methods for identifying and filtering spam, as well as background-server filtering mechanisms, have appeared in the prior art.

At present, the mainstream anti-spam methods mainly include:

(1) Optical character recognition (OCR): the content of advertising pictures or plain text is extracted and examined to judge whether it is advertising, thereby identifying spam. However, this technique imposes a large overhead on the computer.

(2) Mail detection based on MD5 checksums: a hash operation converts a character string of arbitrary length into a shorter value of fixed length. Since the MD5 values of any two different character strings are not the same, whether two character strings are identical can be judged by comparing their MD5 values. However, with this technique, whenever the mail contents are not strictly identical, any change at all leads to a different MD5 value, which seriously impairs the judgment of whether the mail is spam and the subsequent filtering and interception operations.

(3) Filtering spam with a Bayesian classifier: for related patents, see Chinese invention patents CN200510135603.3, CN200410063953.9, CN200510087762.0, CN200510082282.5, and so on. However, when a Bayesian classifier is used to classify mail, spam must be modeled in advance and subsequent mail classified according to the model, so the existing anti-spam technology suffers from cumbersome steps and low reliability.

Meanwhile anti-spam technologies in the prior art are directly to packet in mail (it is mainly the mail of html format) The preset text or picture contained is scanned detection, certainly will cause to be also required to execute to the mail normally sent in this way Above-mentioned inspection perhaps filter operation therefore will increase the computing cost of background server or web page search engine.

In view of this, it is necessary to the recognition methods in the prior art to spam is improved, it is above-mentioned to solve Problem.

Summary of the invention

The object of the present invention is to disclose a method for identifying spam by HTML tags, so as to effectively identify spam in HTML format, reduce the computing overhead of the background server or web search engine, and simplify the steps of identifying spam.

To achieve the above object, the present invention provides a method for identifying spam by HTML tags, comprising the following steps:

S1, building a tag description table in which the tags in HTML code are described by characters;

S2, extracting, in order, the tags in the HTML code of a spam email, and deriving verification data comprising multiple characters according to the tag description table;

S3, after a new mail is received, extracting the HTML code of the new mail, and translating the tags in the HTML code of the new mail into description data according to the tag description table;

S4, comparing the description data with the verification data, and determining as spam at least the new mail whose description data hits the verification data.

As a further improvement of the present invention, a new mail whose description data matches at least the head ordering or the tail ordering of the verification data is determined as spam.

As a further improvement of the present invention, the tag description table includes several records, each record consisting of verification data and the length information of the verification data;

if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table is equal to the length information of the verification data, the new mail is determined as spam.

As a further improvement of the present invention, the method further includes: the length information of the verification data has a value range [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table falls within the value range of the length information of the verification data, the new mail is determined as spam.

As a further improvement of the present invention, the upper limit n of the value range of the length information of the verification data is 100, and the lower limit m of the value range is 20.

As a further improvement of the present invention, the character is machine-readable data consisting of digits, letters, or ASCII codes, or any combination of two or more of these.

As a further improvement of the present invention, the byte length of the character is fixed.

As a further improvement of the present invention, the method further includes determining as spam a new mail whose description data overlaps the verification data by more than 80%.

As a further improvement of the present invention, the method further includes scanning the description data of the new mail against the verification data; if the number of characters contained in the segments where the description data and the verification data coincide is higher than 80% of the number of characters contained in the description data, the new mail is determined as spam; wherein the length of each segment where the description data and the verification data coincide is greater than or equal to 10 characters.

Compared with the prior art, the beneficial effects of the present invention are as follows. In the invention, it is only necessary to compare the description data formed from the HTML code of the new mail with the verification data formed from the HTML code of a previously entered spam email in order to determine whether the new mail is spam. Since only characters with a very small number of bytes need to be compared throughout the process, the computing overhead of the background server or web search engine is reduced and the steps of identifying spam are significantly simplified. Meanwhile, no matter what interference the spammer applies to the text content of the spam, the HTML code finally extracted from the spam remains consistent, so whether a mail is spam can be judged efficiently and accurately.

Brief description of the drawings

Fig. 1 is a flow chart of the method for identifying spam by HTML tags according to the present invention;

Fig. 2 is a sample of a spam email entered into the computer in advance;

Fig. 3 is the HTML code corresponding to the spam email shown in Fig. 2;

Fig. 4 is a sample of a new mail, to be judged as spam, sent by the spammer of Fig. 2 to another mail recipient;

Fig. 5 is the HTML code of the new mail to be judged as spam shown in Fig. 4;

Fig. 6 is a sample of a normal mail;

Fig. 7 is the HTML code of the normal mail shown in Fig. 6;

Fig. 8 is a schematic diagram in which the head, middle and tail of the description data, formed by translating the tags in the HTML code of another new mail to be judged with the tag description table, are each similar to the verification data of spam entered in advance;

Fig. 9 is a schematic diagram in which the head, middle and tail of the description data, formed by translating the tags in the HTML code of yet another new mail to be judged with the tag description table, are as a whole similar to the verification data of spam entered in advance.

Detailed description of the embodiments

The present invention is described in detail below with reference to the embodiments shown in the accompanying drawings. It should be noted that these embodiments do not limit the present invention; equivalent transformations or substitutions in function, method or structure made by those of ordinary skill in the art according to these embodiments all fall within the protection scope of the present invention.

Please refer to Fig. 1, which shows a specific embodiment of the method for identifying spam by HTML tags according to the present invention. The method is mainly used to judge whether mail in HTML format is spam. In this specification, the term "mail" is equivalent to the term "email". The user receives mail sent by the remote party in a web page or mail application on a computer or other internet-connected device, and the new mail received is judged on the local computer or on the mail server.

The method for identifying spam by HTML tags according to the present invention includes the following steps.

First, a description table is built that describes the tags in HTML code using characters.

HTML (HyperText Markup Language) is the most widely used language on the network today and the main language of which web documents are composed. An HTML text is a descriptive text made up of HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. The structure of HTML comprises two major parts, the head (Head) and the body (Body): the head describes the information the browser needs, and the body contains the specific content to be presented.

To judge spam, spam must first be defined. In contrast to traditional spam judgment methods, the present invention does not extract the text information or graphic data of the mail; instead, it counts the number of occurrences, the order, or the arrangement pattern of the tags in the HTML code of a mail in HTML format, and from these forms the spam attributes used to characterize and describe that spam. The method can run on a mail server, and the mail server is connected to a database for storing information such as the description table. A graphical mail consists of text (Text) and HTML code, and the HTML code describes the text with its own distinctive conventions and order. After the mail server, or the local server that receives the mail, decodes the mail, the HTML code of the mail can be displayed.

In this embodiment, the character is machine-readable data consisting of digits, letters, or ASCII codes, or any combination of two or more of these. Preferably, in this embodiment, the byte length of the character is fixed and is less than or equal to 3 standard bytes. The tag description table shown in Table 1 below is thus formed. The tag description table can be stored in the database associated with the local mail server, and the administrator can modify the correspondence between tags and characters in the table as well as the character types. If the HTML code of the spam entered in the database in advance contains too many tags, the characters corresponding to the tags are unified to a length of 2 standard bytes, and never more than 3 standard bytes.

Tag       Character  Tag       Character  Tag     Character  Tag       Character
A/AREA    a          FRAME     k          OBJECT  u          SUP       5
B         b          FRAMESET  l          OL      v          TABLE     6
BR        c          H1-H6     m          P       w          TD        7
center    d          HR        n          PRE     x          TEXTAREA  8
DD        e          IFRAME    o          SCRIPT  y          TR        9
DIV       f          IMG       p          SELECT  z          UL        0
DL        g          INPUT     q          SPAN    1
DT        h          LABEL     r          STRONG  2
FONT      i          LEGEND    s          STYLE   3
FORM      j          MAP       t          SUB     4

Table 1 - Tag description table
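Table 1 can be held in memory as a plain mapping. The sketch below is an illustration rather than the implementation described in the source: the names `TAG_TABLE` and `translate` are hypothetical, the grouped entries A/AREA and H1-H6 are expanded into individual tags, and tags that do not appear in the table are simply skipped, a behavior the source does not specify.

```python
# Table 1 as a mapping from lower-cased tag name to its one-character code.
# Grouped rows ("A/AREA", "H1-H6") are expanded so each tag has its own key.
TAG_TABLE = {
    "a": "a", "area": "a", "b": "b", "br": "c", "center": "d", "dd": "e",
    "div": "f", "dl": "g", "dt": "h", "font": "i", "form": "j",
    "frame": "k", "frameset": "l", "hr": "n", "iframe": "o", "img": "p",
    "input": "q", "label": "r", "legend": "s", "map": "t",
    "object": "u", "ol": "v", "p": "w", "pre": "x", "script": "y",
    "select": "z", "span": "1", "strong": "2", "style": "3", "sub": "4",
    "sup": "5", "table": "6", "td": "7", "textarea": "8", "tr": "9", "ul": "0",
}
TAG_TABLE.update({f"h{level}": "m" for level in range(1, 7)})


def translate(tags):
    """Translate an ordered list of tag names into a character string.

    Assumption: tags missing from the table are skipped.
    """
    return "".join(TAG_TABLE[t.lower()] for t in tags if t.lower() in TAG_TABLE)


print(translate(["div", "span", "div", "a", "div"]))
```

Because every code is exactly one character long, the resulting string can be compared position by position without any separator, which is in line with the fixed byte length the embodiment prefers.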

The sending behavior of spam usually shows a certain regularity. For example, in mails with the same or different contents sent by the same sender to different recipients, the tags in the HTML code of the mails are often identical or remain substantially the same. Therefore, a description of the tags makes it possible to judge quickly and accurately whether a mail is spam. In the subsequent description of spam and the subsequent judgment of new mail, the corresponding characters in the tag description table, their order, and their degree of matching are compared to judge whether a new mail is spam.

Next, the tags in the HTML code of the spam are extracted in order, and verification data comprising multiple characters is derived according to the tag description table.

Refer to Fig. 2, which shows a spam email entered in advance; the HTML code of the spam email of Fig. 2 is shown in Fig. 3. The content enclosed in "<>" is a tag. The tags in the HTML code of the spam email of Fig. 2 are therefore extracted one by one, in the order in which they appear in the HTML code (see Fig. 3), and recorded. The tags of the spam email in Fig. 3 are "div, span, div, a, div". Then, according to the correspondence between tags and characters in the tag description table of Table 1, the tags of the spam email are replaced with "f, 1, f, a, f". In this way, verification data composed of multiple characters in a definite order is derived according to the tag description table, and the verification data can be stored in the database associated with the local mail server. The database may be a MySQL database, an Oracle database or a DB2 database.
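The extraction step can be sketched with Python's standard `html.parser` module. Several things here are assumptions and not taken from the source: `SPAM_HTML` is a hypothetical stand-in for the HTML of Fig. 3 (the figure is not reproduced in the text), only opening tags are counted, tags outside the table are skipped, and `TagCollector`, `extract_tags` and `to_check_string` are names introduced for illustration.

```python
from html.parser import HTMLParser

# Excerpt of Table 1 covering only the tags used in this example.
TAG_TABLE = {"div": "f", "span": "1", "a": "a"}


class TagCollector(HTMLParser):
    """Collects opening-tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already converted to lower case.
        self.tags.append(tag)


def extract_tags(html):
    """Return the opening tags of an HTML string, in order of appearance."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def to_check_string(html):
    """Ordered tags -> character string; tags outside the table are skipped."""
    return "".join(TAG_TABLE[t] for t in extract_tags(html) if t in TAG_TABLE)


# Hypothetical stand-in for the HTML code of Fig. 3.
SPAM_HTML = (
    '<div>User <span>test1</span>, our company issues invoices.</div>'
    '<div>URL: <a href="http://example.invalid/">click</a></div>'
    '<div>2017-1-1 12:00:00</div>'
)

verification_data = to_check_string(SPAM_HTML)
print(extract_tags(SPAM_HTML))
print(verification_data)
```

The same two functions can be applied unchanged to a newly received mail in step S3, since verification data and description data are produced by the same translation.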

As shown in Fig. 2, the sender address of the spam email entered in advance is "jichunlai@chinac.com", the recipient address is "jichunlai@chinac.com", and the mail content is "User test1, our company issues invoices on behalf of others, URL: click", 2017-1-1 12:00:00. Here "click" is a hyperlink that may lead to a malicious website, a pornographic website or a sales website.

Refer to Fig. 4 and Fig. 5. If the sender of the spam email entered in advance sends a mail with the same content to the same recipient at a different time, that mail can be regarded as spam. In the present invention, however, the text content of the suspected spam does not need to be judged. Instead, the description data obtained from the tags in the HTML code of the new mail via the tag description table is compared with the verification data obtained from the tags in the HTML code of the spam via the tag description table, in order to determine whether the new mail (see Fig. 4) is spam.

As shown in Fig. 4, the sender address of the new mail is "jichunlai@chinac.com", the recipient address is "jichunlai@chinac.com", and the mail content is "User zhangsan, our company issues invoices on behalf of others, URL: click", 2017-1-2 18:00:00. Here "click" is a hyperlink that may lead to a malicious website, a pornographic website or a sales website.

Referring to Fig. 5, the tags of the HTML code of the new mail are still "div, span, div, a, div". Next, according to the correspondence between tags and characters in the tag description table of Table 1, the tags of the new mail are translated and replaced with "f, 1, f, a, f", which yields the description data. Spam sent by a spammer has a fixed format and follows certain patterns. Therefore, when the same spammer sends spam with the same content at different times, or sends spam with the same content to different recipients at the same or different times, the description data formed by translating the tags of its HTML code with the tag description table of Table 1 is identical, or remains substantially identical, to the verification data of the spam entered in advance.

Therefore, in this embodiment, after a new mail is received, its HTML code is extracted and the tags in that HTML code are translated into description data according to the tag description table. The description data is then compared with the verification data, and at least the new mail whose description data hits the verification data is determined as spam. In this embodiment, "hit" means that the character types and the character order in the verification data and in the description data are identical.

Refer to Fig. 6 and Fig. 7, which show a non-spam mail (another new mail) and its HTML code. In Fig. 7, the tags in the HTML code of this new mail are "div, div, div"; replacing or translating them according to Table 1 gives the description data "f, f, f". This description data differs from the verification data "f, 1, f, a, f" of the spam email of Fig. 2 in both the number of characters and the order of the characters, so the new mail shown in Fig. 6 can be determined as non-spam.
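The strict form of a "hit" amounts to an equality test on the two character strings. The sketch below replays the two worked examples from the text; `is_hit`, `classify` and the `known_spam` set are hypothetical names, and storing the verification data in a Python set stands in for the database mentioned in the source.

```python
def is_hit(description_data: str, verification_data: str) -> bool:
    """Strict 'hit': the same characters in the same order."""
    return description_data == verification_data


def classify(description_data: str, known_spam: set) -> str:
    """Return 'spam' if the description data hits any stored verification data."""
    if any(is_hit(description_data, v) for v in known_spam):
        return "spam"
    return "not spam"


known_spam = {"f1faf"}   # verification data of the spam in Fig. 2 and Fig. 3

print(classify("f1faf", known_spam))  # new mail of Fig. 4 and Fig. 5
print(classify("fff", known_spam))    # normal mail of Fig. 6 and Fig. 7
```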

A spammer usually disguises the content of spam, and differences in sending time cause the time attribute or the sending time in the mail content to differ. For example, interference such as pictures or audio files unrelated to the promotional content may be inserted into the spam, or the sending time in the mail content may differ as in Fig. 2 and Fig. 4. However, the promotional content that the spam needs to deliver to the recipient is the same or similar. Therefore, the description data formed by translating the tags in the HTML code of spam according to the tag description table is identical in whole or similar in part. As long as one continuous segment of the description data, formed by translating the tags of the HTML code of the new mail to be judged with the tag description table of Table 1, substantially matches the verification data of a spam email entered in the database in advance, the new mail can be determined as spam. The similarity, or partial similarity, between the description data of the new mail and the verification data of predefined spam is described in detail below.

The partial similarity may be that the part of the description data located at the head matches the verification data (see Fig. 8), that the part located in the middle matches the verification data (see Fig. 8), or that the part located at the tail matches the verification data (see Fig. 8). In this way, a new mail whose description data matches at least the head ordering or the tail ordering of the verification data is determined as spam.

Specifically, suppose the tags of the HTML code of a new mail, translated according to the tag description table, form the description data "a, b, ..., h, i, ..., e, q, q, ..., q, 6, 0, 0, ..., f, ..., f, 5" shown in Fig. 8. If the head "a, b, ..., h, i" of this description data is identical or similar to the verification data of one preset spam email, the new mail can be regarded as spam. Similarly, if the middle part "e, q, q, ..., q, 6, 0, 0" of the description data is identical or similar to the verification data of another spam email, the new mail can be regarded as spam; and if the tail "f, ..., f, 5" of the description data is identical or similar to the verification data of yet another preset spam email, the new mail can be regarded as spam.

Of course, as long as a piece of sub-description data (a unit relative to the whole description data) made up of consecutive characters within the description data of the new mail to be judged shown in Fig. 8 is identical or similar to the verification data of a spam email entered in the database or server in advance, the new mail can be determined as spam.
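The head, middle and tail cases can be sketched as a substring search. This is a simplification under stated assumptions: the source allows "identical or similar" segments, whereas the sketch only handles identical ones; the description string below is hypothetical, since the data of Fig. 8 is elided in the text; and `match_region` is a name introduced here.

```python
def match_region(description: str, verification: str):
    """Report where the verification data occurs inside the description data.

    Returns 'head', 'tail', 'middle', or None when there is no exact occurrence.
    """
    if not verification:
        return None
    if description.startswith(verification):
        return "head"
    if description.endswith(verification):
        return "tail"
    if verification in description:
        return "middle"
    return None


# Hypothetical description data loosely modelled on the pattern of Fig. 8.
description = "abhi" + "eqqq600" + "ff5"

print(match_region(description, "abhi"))     # verification data of one spam
print(match_region(description, "eqqq600"))  # verification data of another spam
print(match_region(description, "ff5"))      # verification data of a third spam
print(match_region(description, "zzz"))      # unrelated verification data
```

Any non-None result corresponds to the rule that a continuous piece of sub-description data matching stored verification data is enough to flag the new mail.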

In this way, the various kinds of interference that spammers deliberately add to disrupt the judgment process of the method can be effectively overcome. While judgment accuracy is maintained, spam that disguises itself or wraps interference information is identified and judged as accurately as possible, and the computing overhead of the background server or web search engine is markedly reduced.

In this embodiment, the tag description table includes several records, each consisting of verification data and the length information of the verification data. If the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table equals the length information of the verification data, the new mail is determined as spam.

Specifically, the length information of the verification data of the preset spam email (i.e. "f, 1, f, a, f", see Fig. 2 to Fig. 5) is 5, and the length information of the description data of the new mail of Fig. 4 and Fig. 5 (i.e. "f, 1, f, a, f") is also 5. Therefore, by comparing the length information of the verification data with that of the description data, the new mail shown in Fig. 4 can be determined as spam. In this comparison the computer only needs to compare description data and verification data with very few bytes, which greatly reduces the computing overhead and improves the efficiency of spam judgment.

In this embodiment, the method for identifying spam by HTML tags further includes: the length information of the verification data has a value range [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table falls within that value range, the new mail is determined as spam. The upper limit n of the value range is 100 and the lower limit m is 20. It should be noted that, if a mail is non-spam, the description data formed by translating its HTML code with the tag description table of Table 1 cannot be too long. Therefore, in view of practical conditions, the upper limit n of the length information of the verification data is set to 100 in this embodiment; if it were too long, the computing overhead of scanning the verification data with the description data would increase. Likewise, the lower limit m of the length information of the verification data cannot be too short; if it were, new mails would be misjudged and non-spam would be determined as spam.
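The two length criteria can be written as simple predicates. The sketch assumes one character per tag, so that `len()` gives the length information; the source does not spell out how these criteria combine with the character comparison, so the sketch only exposes the predicates themselves. The names `same_length`, `length_in_range`, `M_LOWER` and `N_UPPER` are hypothetical.

```python
M_LOWER = 20    # lower limit m of the value range given in the text
N_UPPER = 100   # upper limit n of the value range given in the text


def same_length(description: str, verification: str) -> bool:
    """Length information of description data equals that of verification data."""
    return len(description) == len(verification)


def length_in_range(description: str, m: int = M_LOWER, n: int = N_UPPER) -> bool:
    """Length information of description data lies in the closed range [m, n]."""
    return m <= len(description) <= n


print(same_length("f1faf", "f1faf"))   # Fig. 4/5 mail against Fig. 2/3 spam
print(same_length("fff", "f1faf"))     # Fig. 6/7 mail against Fig. 2/3 spam
print(length_in_range("f" * 20), length_in_range("f" * 101))
```

Both checks cost a single integer comparison, which is in keeping with the stated aim of keeping the computing overhead low.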

Referring again to Fig. 8, to further improve the efficiency of identifying new mail, the method of the present invention further includes determining as spam a new mail whose description data overlaps the verification data by more than 80%. Specifically, the description data formed by translating the tags in the HTML code of the new mail can be compared as a whole with the whole of the verification data of the spam entered in advance, and a new mail whose description data has an overall overlap higher than 80% is determined as spam. Alternatively, the verification data of the spam entered in advance can be used as the scanning unit to scan longer description data; if a segment of the description data, or 80% of the characters of a segment of the description data, matches or hits the verification data of the spam entered in advance, the new mail is determined as spam.

Referring to Fig. 9, this embodiment further includes scanning the description data of the new mail against the verification data; if the number of characters contained in the segments where the description data and the verification data coincide is higher than 80% of the number of characters contained in the description data, the new mail is determined as spam. The length of each segment where the description data and the verification data coincide is greater than or equal to 10 characters.

In this case, the three dashed boxes in Fig. 9 mark the three segments (i.e. "sub-description data") where the description data coincides with the verification data. Each segment contains at least ten characters, and all characters have the same number of bytes. In this specification, to simplify the presentation, the byte length of a character is uniformly set to 1 standard byte. Within the three segments, the length and the order of the characters are consistent with those in the verification data of the spam entered in advance.

Suppose the verification data contains 100 characters in a specific order, and that when the local mail server receives a new mail to be judged, the description data translated by the tag description table is 90 characters long. The description data is scanned sequentially against the verification data. If a run of characters in the description data (with 10 or more characters) hits a continuous segment of the verification data, and the number of characters contained in the segments of the description data that hit the verification data overlaps by 80% or more with the number of characters contained in the verification data of the spam entered in advance, the new mail to be judged can be determined as spam. In this way, the accuracy of spam judgment is improved and misjudgment is prevented while the computing overhead is kept low.
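One possible reading of the segment rule is sketched below. The source does not say how coinciding segments are to be located; the sketch borrows `difflib.SequenceMatcher` from the Python standard library for that purpose, which is a substitution made here and not the method of the source. The ratio is taken against the length of the description data, as in the rule illustrated by Fig. 9. The names `coinciding_chars` and `is_spam_by_overlap` and the short test strings are hypothetical.

```python
from difflib import SequenceMatcher

MIN_SEGMENT = 10   # minimum length of a coinciding segment, per the text
THRESHOLD = 0.80   # share of the description data that must coincide


def coinciding_chars(description: str, verification: str) -> int:
    """Total length of common contiguous segments of at least MIN_SEGMENT chars."""
    matcher = SequenceMatcher(None, description, verification, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks()
               if block.size >= MIN_SEGMENT)


def is_spam_by_overlap(description: str, verification: str) -> bool:
    """True if more than 80% of the description data lies in coinciding segments."""
    if not description:
        return False
    share = coinciding_chars(description, verification) / len(description)
    return share > THRESHOLD


verification = "abcdefghijklmnopqrstuvwxyz0123456789"
# Two long runs copied from the verification data, split by two foreign characters.
suspect = "abcdefghijklmno" + "--" + "qrstuvwxyz0123456789"
# Runs of only nine characters each: too short to count as coinciding segments.
harmless = "abcdefghi" + "-" + "klmnopqrs" + "-" + "uvwxyz012"

print(coinciding_chars(suspect, verification), is_spam_by_overlap(suspect, verification))
print(coinciding_chars(harmless, verification), is_spam_by_overlap(harmless, verification))
```

The matching blocks returned by `SequenceMatcher` appear in increasing order in both strings, which fits the requirement that the order of characters inside the segments agrees with the verification data.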

The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its protection scope; all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall be included within its protection scope.

It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, the embodiments should in every respect be regarded as illustrative and not restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are intended to be embraced by the invention. No reference sign in the claims should be construed as limiting the claim concerned.

In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the various embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A method for identifying spam by HTML tags, characterized by comprising the following steps:

S2, extracting, in order, the tags in the HTML code of a spam email, and deriving verification data comprising multiple characters according to the tag description table;

S3, after a new mail is received, extracting the HTML code of the new mail, and translating the tags in the HTML code of the new mail into description data according to the tag description table;

S4, comparing the description data with the verification data, and determining as spam at least the new mail whose description data hits the verification data.

2. The method according to claim 1, characterized in that a new mail whose description data matches at least the head ordering or the tail ordering of the verification data is determined as spam.

3. The method according to claim 1 or claim 2, characterized in that the tag description table includes several records, each record consisting of verification data and the length information of the verification data;

if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table is equal to the length information of the verification data, the new mail is determined as spam.

4. The method according to claim 3, characterized in that the method further includes: the length information of the verification data has a value range [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table falls within the value range of the length information of the verification data, the new mail is determined as spam.

5. The method according to claim 4, characterized in that the upper limit n of the value range of the length information of the verification data is 100, and the lower limit m of the value range is 20.

6. The method according to claim 1, characterized in that the character is machine-readable data consisting of digits, letters, or ASCII codes, or any combination of two or more of these.

7. The method according to claim 1, characterized in that the byte length of the character is fixed.

8. The method according to claim 1, characterized in that the method further includes determining as spam a new mail whose description data overlaps the verification data by more than 80%.

9. The method according to claim 8, characterized in that the method further includes scanning the description data of the new mail against the verification data; if the number of characters contained in the segments where the description data and the verification data coincide is higher than 80% of the number of characters contained in the description data, the new mail is determined as spam; wherein the length of each segment where the description data and the verification data coincide is greater than or equal to 10 characters.