CN106817297A

CN106817297A - A method for identifying spam by HTML tags

Info

Publication number: CN106817297A
Application number: CN201710043772.7A
Authority: CN
Inventors: 徐慧灵; 纪春来
Original assignee: Wuxi Shangtong Cloud Technology Co Ltd
Current assignee: Huayun Data Co ltd
Priority date: 2017-01-19
Filing date: 2017-01-19
Publication date: 2017-06-09
Anticipated expiration: 2037-01-19
Also published as: CN106817297B

Abstract

The invention provides a method for identifying spam by HTML tags, comprising the following steps: S1, building a tag description table that uses characters to describe the tags in HTML code; S2, extracting, in order, the tags in the HTML code of a spam mail, and deriving from the tag description table verification data comprising multiple characters; S3, after a new mail is received, extracting the HTML code of the new mail and translating the tags in that HTML code into description data according to the tag description table; S4, comparing the description data with the verification data, and judging as spam at least any new mail whose description data hits the verification data. The invention only needs to compare the description data formed from the HTML code of the new mail with the verification data formed from the HTML code of previously configured spam, which significantly reduces the computing cost of the background server or web search engine and simplifies the steps of identifying spam.

Description

A method for identifying spam by HTML tags

Technical field

The present invention relates to the field of anti-spam technology, and in particular to a method for identifying spam by HTML tags.

Background art

With the development of the Internet, spam does more and more harm to users. Spam typically includes promotional mail and mail carrying pornographic or other harmful content. Accordingly, the prior art offers a variety of methods for identifying and filtering spam, as well as filtering mechanisms on background servers.

The current mainstream anti-spam methods mainly include:

(1) Optical character recognition (OCR), which extracts the content of advertising pictures or plain text and judges from that content whether it is advertising, thereby identifying spam. This technique, however, imposes a relatively large overhead on the computer.

(2) Mail detection based on MD5 checksums, which performs a hash operation that converts a string of arbitrary length into a shorter value of fixed length. Because two different strings will, in practice, yield different MD5 values, whether two strings are identical can be judged by comparing their MD5 values. However, this technique requires the mail contents to be strictly identical: any change to the content produces a different MD5 value, which seriously impairs the judgment of whether the mail is spam and the filtering and interception that depend on it.
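The sensitivity of an MD5 checksum to small content changes can be illustrated with a short sketch. The two mail bodies below are hypothetical strings loosely modeled on the sample mails discussed later in this description, and `md5_hex` is a helper name introduced only for this illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two hypothetical spam bodies that differ only in the recipient's name.
original = "User test1, our company issues invoices on your behalf, website: click"
altered = "User zhangsan, our company issues invoices on your behalf, website: click"

# The digests differ even though the promotional content is the same,
# so an MD5 blacklist built from `original` would miss `altered`.
digests_equal = md5_hex(original) == md5_hex(altered)
print(digests_equal)
```

This is the weakness the tag-based method is meant to avoid: the text changes, while the tag structure of the HTML code does not.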

(3) Spam filtering based on Bayes classifiers; related patents include Chinese invention patents CN200510135603.3, CN200410063953.9, CN200510087762.0 and CN200510082282.5. However, classifying mail with a Bayes classifier requires modeling spam in advance and then classifying subsequent mail according to the model, so the existing anti-spam technology suffers from complicated steps and relatively low reliability.

Meanwhile, prior-art anti-spam techniques directly scan the mail (mainly mail in HTML format) for the preset words or pictures it contains. Normally sent mail must therefore also undergo the above inspection or filtering, which increases the computing cost of the background server or web search engine.

In view of this, it is necessary to improve the prior-art methods of identifying spam so as to solve the above problems.

Summary of the invention

The object of the invention is to disclose a method for identifying spam by HTML tags that effectively identifies spam in HTML format, reduces the computing cost of the background server or web search engine, and simplifies the steps of identifying spam.

To achieve the above object, the invention provides a method for identifying spam by HTML tags, comprising the following steps:

S1, building a tag description table that uses characters to describe the tags in HTML code;

S2, extracting, in order, the tags in the HTML code of a spam mail, and deriving from the tag description table verification data comprising multiple characters;

S3, after a new mail is received, extracting the HTML code of the new mail and translating the tags in that HTML code into description data according to the tag description table;

S4, comparing the description data with the verification data, and judging as spam at least any new mail whose description data hits the verification data.

As a further improvement of the invention, at least any new mail whose description data matches the head ordering or the tail ordering of the verification data is judged to be spam.

As a further improvement of the invention, the tag description table includes several records, each record consisting of verification data and the length information of that verification data;

if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table equals the length information of the verification data, the new mail is judged to be spam.

As a further improvement of the invention, the method further includes: the length information of the verification data has a value range [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table falls within the value range of the length information of the verification data, the new mail is judged to be spam.

As a further improvement of the invention, the upper limit n of the value range of the length information of the verification data is 100, and the lower limit m is 20.

As a further improvement of the invention, the characters are computer-readable data consisting of one of, or any combination of two or more of, digits, letters and ASCII codes.

As a further improvement of the invention, the byte length of the characters is fixed.

As a further improvement of the invention, the method further includes judging as spam any new mail whose description data overlaps the verification data by more than 80%.

As a further improvement of the invention, the method further includes scanning the verification data with the description data of the new mail; if the number of characters contained in the fragments where the description data coincides with the verification data exceeds 80% of the number of characters contained in the description data, the new mail is judged to be spam; each fragment where the description data coincides with the verification data is at least 10 characters long.

Compared with the prior art, the beneficial effects of the invention are as follows. Whether a new mail is spam is judged simply by comparing the description data formed from the HTML code of the new mail with the verification data formed from the HTML code of previously configured spam. Because the whole process only compares characters occupying very few bytes, the computing cost of the background server or web search engine is significantly reduced and the steps of identifying spam are simplified. Moreover, whatever interference a spammer introduces into the text content of the spam, the HTML code finally extracted from the spam remains consistent, so whether a mail is spam can be judged efficiently and accurately.

Brief description of the drawings

Fig. 1 is a flowchart of the method of the invention for identifying spam by HTML tags;

Fig. 2 is a sample of a spam mail entered into the computer in advance;

Fig. 3 is the HTML code corresponding to the spam shown in Fig. 2;

Fig. 4 is a sample of a new mail, to be judged as spam, sent by the spammer of Fig. 2 to another mail recipient;

Fig. 5 is the HTML code of the new mail of Fig. 4 that is to be judged as spam;

Fig. 6 is a sample of a normal mail;

Fig. 7 is the HTML code of the normal mail shown in Fig. 6;

Fig. 8 is a schematic diagram in which the head, the middle and the tail of the description data, formed by translating the tags in the HTML code of another new mail to be judged with the tag description table, are each similar to the verification data of pre-entered spam;

Fig. 9 is a schematic diagram in which the head, the middle and the tail of the description data, formed by translating the tags in the HTML code of yet another new mail to be judged with the tag description table, are all similar to the verification data of the pre-entered spam.

Detailed description of the embodiments

The invention is described in detail below with reference to the embodiments shown in the drawings. It should be noted that these embodiments do not limit the invention; equivalent transformations or substitutions in function, method or structure made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the invention.

Fig. 1 shows one embodiment of the method of the invention for identifying spam by HTML tags. The method is mainly used to judge whether mail in HTML format is spam. In this specification, the term "mail" is equivalent to the term "email". A user receives mail sent by the remote end in a web page or mail application on a computer or other Internet-connected device, and the newly received mail is judged on the local computer or on the mail server.

The method of the invention for identifying spam by HTML tags comprises the following steps.

First, a tag description table that uses characters to describe the tags in HTML code is built.

HTML (HyperText Markup Language) is the most widely used language on the network today and the main language of web documents. An HTML text is descriptive text made up of HTML commands, which can describe text, graphics, animation, sound, tables, links and so on. An HTML document consists of two main parts, the head (Head) and the body (Body): the head describes the information the browser needs, and the body contains the content to be presented.

To judge spam, spam must first be defined. Unlike traditional spam judgment methods, the invention does not extract the text or image data of the mail. Instead, it takes statistics on the number of occurrences, the order or the arrangement pattern of the tags in the HTML code of HTML-format mail, and from them forms the spam attributes used to characterize and describe the spam. The method can run on a mail server that is connected to a database storing information such as the description table. A graphical mail consists of text (Text) and HTML code, and the HTML code follows its own specification and order in describing the text. After the mail is decoded on the mail server or on the local server that receives it, the HTML code of the mail can be displayed.

In this embodiment, the characters are computer-readable data consisting of one of, or any combination of two or more of, digits, letters and ASCII codes. Preferably, the byte length of the characters is fixed and is less than or equal to 3 standard bytes. This yields the tag description table shown in Table 1 below. The tag description table can be stored in the database associated with the local mail server, and an administrator can modify the tag-to-character correspondence and the kinds of characters in it. If the HTML code of the spam pre-entered in the database contains too many distinct tags, the characters corresponding to the tags are uniformly set to 2 standard bytes in length, and to no more than 3 standard bytes.

Tag      | Character | Tag      | Character | Tag    | Character | Tag      | Character
A/AREA   | a         | FRAME    | k         | OBJECT | u         | SUP      | 5
B        | b         | FRAMESET | l         | OL     | v         | TABLE    | 6
BR       | c         | H1-H6    | m         | P      | w         | TD       | 7
center   | d         | HR       | n         | PRE    | x         | TEXTAREA | 8
DD       | e         | IFRAME   | o         | SCRIPT | y         | TR       | 9
DIV      | f         | IMG      | p         | SELECT | z         | UL       | 0
DL       | g         | INPUT    | q         | SPAN   | 1         |          |
DT       | h         | LABEL    | r         | STRONG | 2         |          |
FONT     | i         | LEGEND   | s         | STYLE  | 3         |          |
FORM     | j         | MAP      | t         | SUB    | 4         |          |

Table 1 - Tag description table
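As a minimal sketch, Table 1 can be held in memory as a dictionary from lowercase tag name to character. The names `TAG_TO_CHAR` and `translate` are hypothetical, and silently skipping tags that are absent from the table is an assumption of this sketch; the description does not say how such tags are handled.

```python
# Table 1 as a mapping from lowercase tag name to its character.
# Grouped entries (A/AREA, H1-H6) share a single character.
TAG_TO_CHAR = {
    "a": "a", "area": "a", "b": "b", "br": "c", "center": "d",
    "dd": "e", "div": "f", "dl": "g", "dt": "h", "font": "i", "form": "j",
    "frame": "k", "frameset": "l",
    "h1": "m", "h2": "m", "h3": "m", "h4": "m", "h5": "m", "h6": "m",
    "hr": "n", "iframe": "o", "img": "p", "input": "q", "label": "r",
    "legend": "s", "map": "t",
    "object": "u", "ol": "v", "p": "w", "pre": "x", "script": "y",
    "select": "z", "span": "1", "strong": "2", "style": "3", "sub": "4",
    "sup": "5", "table": "6", "td": "7", "textarea": "8", "tr": "9",
    "ul": "0",
}


def translate(tags):
    """Translate an ordered list of tag names into a character string.

    Assumption: tags missing from the table are skipped.
    """
    chars = []
    for tag in tags:
        char = TAG_TO_CHAR.get(tag.lower())
        if char is not None:
            chars.append(char)
    return "".join(chars)


print(translate(["div", "span", "div", "a", "div"]))  # prints f1faf
```

The comma-separated sequences used in this description, such as "f, 1, f, a, f", are written here as a plain string, "f1faf".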

The sending of spam usually shows a certain regularity. For example, in mail that the same sender sends to different recipients, with identical or differing content, the tags in the HTML code tend to be the same or remain essentially the same. A description of the tags therefore allows a fast and accurate judgment of whether a mail is spam. In the subsequent description of spam and judgment of new mail, the corresponding characters, character order and degree of matching obtained from the tag description table are compared to judge whether a new mail is spam.

Next, the tags in the HTML code of the spam are extracted in order, and verification data comprising multiple characters is derived from the tag description table.

Fig. 2 shows a pre-entered spam mail, and the HTML code of that spam is shown in Fig. 3. The content inside "<>" is a tag. The tags in the HTML code are extracted one by one in the order in which they appear in the code (see Fig. 3) and recorded. The tags of the spam in Fig. 3 are "div, span, div, a, div". Then, using the tag-to-character relation in the tag description table of Table 1, the tags of the spam are replaced with "f, 1, f, a, f", which yields verification data composed of multiple characters in a definite order. The verification data can be stored in the database associated with the local mail server. The database may be a MySQL database, an Oracle database or a DB2 database.
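The extraction and replacement step can be sketched with the standard-library HTML parser. The HTML string below is a hypothetical stand-in for the code in Fig. 3, which is not reproduced in the text; it is built only so that its opening tags appear in the order "div, span, div, a, div". Counting opening tags only, and the names `TagCollector`, `html_to_data` and `SUBSET`, are assumptions of this sketch.

```python
from html.parser import HTMLParser

# The three rows of Table 1 that this example needs.
SUBSET = {"div": "f", "span": "1", "a": "a"}


class TagCollector(HTMLParser):
    """Record the name of every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names.
        self.tags.append(tag)


def html_to_data(html, table):
    """Translate the opening tags of an HTML string into characters."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return "".join(table[t] for t in collector.tags if t in table)


# Hypothetical HTML with the tag order described for Fig. 3.
spam_html = (
    "<div><span>User test1</span></div>"
    '<div><a href="http://example.invalid/">click</a></div>'
    "<div>2017-1-1 12:00:00</div>"
)

verification_data = html_to_data(spam_html, SUBSET)
print(verification_data)  # prints f1faf
```

The same function serves for step S3: applied to the HTML code of a new mail, its result is the description data.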

As shown in Fig. 2, the sender address of the pre-entered spam is "jichunlai@chinac.com", the recipient address is "jichunlai@chinac.com", and the mail content is "User test1, our company issues invoices on your behalf, website: click", 2017-1-1 12:00:00. Here "click" is a hyperlink that may lead to a malicious, pornographic or sales website.

Referring to Figs. 4 and 5, if the sender of the pre-entered spam sends mail with the same content to the same recipient at a different time, that mail can be identified as spam. In the invention, however, the text content of the suspected spam is not examined. Instead, the description data obtained from the tags in the HTML code of the new mail and the tag description table is compared with the verification data obtained from the tags in the HTML code of the spam and the tag description table, to determine whether the new mail (see Fig. 4) is spam.

As shown in Fig. 4, the sender address of the new mail is "jichunlai@chinac.com", the recipient address is "jichunlai@chinac.com", and the mail content is "User zhangsan, our company issues invoices on your behalf, website: click", 2017-1-2 18:00:00. Here "click" is a hyperlink that may lead to a malicious, pornographic or sales website.

Referring to Fig. 5, the tags of the HTML code of the new mail are again "div, span, div, a, div". Using the tag-to-character relation in the tag description table of Table 1, these tags are translated into "f, 1, f, a, f", which gives the description data. Spam sent by a spammer has a fixed format and follows certain patterns. Therefore, for spam with the same content sent by the same sender at different times, or sent by the same spammer to different recipients at the same or different times, the description data formed by translating the tags of its HTML code with the tag description table of Table 1 is identical, or essentially identical, to the verification data of the pre-entered spam.

Therefore, in this embodiment, after a new mail is received, its HTML code can be extracted and the tags in that HTML code translated into description data according to the tag description table. The description data is then compared with the verification data, and at least any new mail whose description data hits the verification data is judged to be spam. In this embodiment, "hit" means that the verification data and the description data have the same character types and the same character order.

Figs. 6 and 7 show a non-spam mail (another new mail) and its HTML code. In Fig. 7, the tags in the HTML code of the new mail are "div, div, div", and the description data formed by translating them with Table 1 is "f, f, f". This description data differs from the verification data "f, 1, f, a, f" of the pre-entered spam of Fig. 2 in both length and character order, so the new mail shown in Fig. 6 can be judged to be non-spam.
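The "hit" test of this embodiment reduces to an exact comparison of two character strings. A minimal sketch follows, assuming the verification data of all pre-entered spam is kept in a set; the name `is_hit` and the in-memory set (in place of the database) are hypothetical.

```python
def is_hit(description, verification_records):
    """Return True when the description data equals some stored
    verification data, i.e. same characters in the same order."""
    return description in verification_records


# Verification data of the pre-entered spam of Fig. 2.
known_spam = {"f1faf"}

print(is_hit("f1faf", known_spam))  # new mail of Fig. 4: True
print(is_hit("fff", known_spam))    # normal mail of Fig. 6: False
```

A set lookup keeps the comparison cost independent of the number of stored records, which is consistent with the low-overhead aim stated above.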

Spammers usually disguise the content of spam, and differences in sending time cause the time attribute or sending time in the mail content to differ. For example, interference such as pictures or audio files unrelated to the promotional content may be inserted into the spam, or the sending time may differ as in the mail contents of Figs. 2 and 4. The promotional content that the spam is meant to deliver to the recipient, however, is the same or similar. The description data formed by translating the tags in the HTML code of the spam with the tag description table will therefore be wholly or partly similar. As long as a continuous segment of the description data of the new mail to be judged essentially matches the verification data of a spam pre-entered in the database, the new mail can be judged to be spam. The similarity, or partial similarity, between the description data of the new mail and the verification data of predefined spam is explained below.

The partial similarity may be that the part of the description data at its head matches the verification data (see Fig. 8), that the part at its middle matches the verification data (see Fig. 8), or that the part at its tail matches the verification data (see Fig. 8). In this way, at least any new mail whose description data matches the head ordering or the tail ordering of the verification data is judged to be spam.

Specifically, suppose the tags of the HTML code of a new mail, translated with the tag description table, form the description data shown in Fig. 8: "a, b ..., h, i ..., e, q, q ..., q, 6, 0, 0 ..., f ..., f, 5". If the head "a, b ..., h, i" of this description data is the same as or similar to the verification data of a preset spam, the new mail can be judged to be spam. Likewise, if the middle "e, q, q ..., q, 6, 0, 0" of the description data is the same as or similar to the verification data of another spam, the new mail can be judged to be spam; and if the tail "f ..., f, 5" is the same as or similar to the verification data of yet another preset spam, the new mail can be judged to be spam.

Of course, as long as some sub-description data made up of continuous characters (a unit of the whole description data shown in Fig. 8) is the same as or similar to the verification data of a spam entered in the database or server in advance, the new mail to be judged can be judged to be spam.
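For the case where a continuous run of the description data is exactly the same as stored verification data, the head, middle and tail matches can be sketched as a substring search. The strings below are hypothetical and do not reproduce the elided sequences of Fig. 8; the "similar" (inexact) case is not covered by this sketch, and `match_position` is a name introduced for illustration.

```python
def match_position(description, verification):
    """Locate verification data inside description data.

    Returns "head", "middle" or "tail" for an exact contiguous match,
    or None when the verification data does not occur at all.
    """
    if not verification:
        return None
    index = description.find(verification)
    if index < 0:
        return None
    if index == 0:
        return "head"
    if index + len(verification) == len(description):
        return "tail"
    return "middle"


# Hypothetical description data built from three known fragments.
description = "abhi" + "eqq6" + "ff5"

print(match_position(description, "abhi"))  # head
print(match_position(description, "eqq6"))  # middle
print(match_position(description, "ff5"))   # tail
print(match_position(description, "xyz"))   # None
```

Any non-None result would mark the new mail as spam under the rule described above.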

In this way, the various kinds of interference that spammers deliberately add can be effectively overcome when judging spam with the method of the invention. While ensuring the accuracy of judgment, the method improves, as far as possible, the accurate identification and judgment of spam that is disguised or that wraps interference information, and significantly reduces the computing cost of the background server or web search engine.

In this embodiment, the tag description table includes several records, each record consisting of verification data and the length information of that verification data. If the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table equals the length information of the verification data, the new mail is judged to be spam.

Specifically, referring to Figs. 2 to 5, the verification data of the tags of the HTML code of the preset spam (i.e. "f, 1, f, a, f") has a length of 5, and the description data of the tags of the HTML code of the new mail shown in Figs. 4 and 5 (i.e. "f, 1, f, a, f") also has a length of 5. By comparing the length information of the verification data with that of the description data, the new mail shown in Fig. 4 can be judged to be spam. In this comparison the computer only needs to compare description data and verification data of very few bytes, which greatly reduces the computing cost and improves the efficiency of judging spam.

In this embodiment, the method for identifying spam by HTML tags further includes: the length information of the verification data has a value range [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table falls within the value range of the length information of the verification data, the new mail is judged to be spam. The upper limit n of the value range is 100 and the lower limit m is 20. It should be noted that if a mail is non-spam, the description data obtained by translating its HTML code with the tag description table of Table 1 cannot be very long. In view of practical conditions, the upper limit n of the length information of the verification data is therefore set to 100 in this embodiment; if the data is too long, the computing cost of scanning the verification data with the description data increases. Likewise, the lower limit m of the length information of the verification data cannot be too small; if it is, new mail may be misjudged and non-spam may be judged to be spam.
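The two length tests can be sketched as follows, with lengths counted in characters (one standard byte per character, as in Table 1). The function names are hypothetical, and how these tests combine with the character comparison is not spelled out in the description, so the sketch only computes the two conditions.

```python
M_LOWER = 20   # lower limit m of the length range
N_UPPER = 100  # upper limit n of the length range


def same_length(description, verification):
    """Length information of the two data strings is equal."""
    return len(description) == len(verification)


def length_in_range(description, m=M_LOWER, n=N_UPPER):
    """Length of the description data lies in the closed range [m, n]."""
    return m <= len(description) <= n


print(same_length("f1faf", "f1faf"))  # both have length 5
print(length_in_range("f1faf"))       # 5 is below the lower limit 20
```

Both checks are constant-time, which matches the stated aim of keeping the comparison cheap.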

Referring again to Fig. 8, to further improve the efficiency of identifying new mail, the method of the invention also includes judging as spam any new mail whose description data overlaps the verification data by more than 80%. Specifically, the description data formed by translating the tags in the HTML code of the new mail can be compared as a whole with the whole of the verification data of the pre-entered spam, and a new mail whose description data has an overall overlap above 80% is judged to be spam. Alternatively, the verification data of the pre-entered spam can be used as a scanning unit to scan longer description data; if 80% of the characters of a segment of the description data match or hit the verification data of the pre-entered spam, the new mail is judged to be spam.
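The description does not define how the overall overlap is computed. One possible reading, used here purely as an assumption, is the similarity ratio of Python's `difflib.SequenceMatcher`, which is twice the number of matched characters divided by the combined length of both strings. The names `overlap_ratio` and `is_spam_by_overlap` are hypothetical.

```python
from difflib import SequenceMatcher


def overlap_ratio(description, verification):
    """Similarity in [0, 1]: 2 * matched characters / total characters.

    This metric is a stand-in chosen for the sketch; the patent text
    only speaks of an overlap "higher than 80%".
    """
    matcher = SequenceMatcher(None, description, verification,
                              autojunk=False)
    return matcher.ratio()


def is_spam_by_overlap(description, verification, threshold=0.8):
    """Judge as spam when the overall overlap exceeds the threshold."""
    return overlap_ratio(description, verification) > threshold


# Nine of ten characters agree: ratio = 2 * 9 / 20 = 0.9.
print(is_spam_by_overlap("abcdefghij", "abcdefghiX"))
# No characters in common: ratio = 0.0.
print(is_spam_by_overlap("abc", "xyz"))
```

A different overlap measure, for example one normalized by the verification data length alone, could be substituted without changing the threshold logic.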

Referring to Fig. 9, this embodiment further includes scanning the verification data with the description data of the new mail; if the number of characters contained in the fragments where the description data coincides with the verification data exceeds 80% of the number of characters contained in the description data, the new mail is judged to be spam. Each fragment where the description data coincides with the verification data is at least 10 characters long.

In this case, the three dashed boxes in Fig. 9 mark three fragments ("sub-description data") in which the description data coincides with the verification data. Each fragment contains at least 10 characters, and all characters have the same number of bytes. To simplify the presentation, this specification uniformly sets the byte length of a character to 1 standard byte. The length and order of the characters in the three fragments are consistent with those in the verification data of the pre-entered spam.

Assume the verification data contains 100 characters in a specific order, and that when the local mail server receives a new mail to be judged, the description data obtained by translation with the tag description table is 90 characters long. The description data is scanned sequentially against the verification data. If a run of characters in the description data (of 10 or more characters) hits a continuous fragment of the verification data, and the number of characters in the fragments of the description data that hit the verification data overlaps the number of characters in the verification data of the pre-entered spam by 80% or more, the new mail to be judged can be judged to be spam. This improves the accuracy of spam judgment and prevents misjudgment while keeping the computing cost low.
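The fragment scan can be sketched by collecting the contiguous matching blocks of at least 10 characters and summing their lengths. The sketch measures the covered characters against the length of the description data, following the rule stated with Fig. 9; the worked example above measures against the verification data instead, and that reading only changes the denominator. Using `difflib.SequenceMatcher` to find the fragments, and all names and sample strings, are assumptions of this sketch.

```python
from difflib import SequenceMatcher


def covered_chars(description, verification, min_len=10):
    """Total length of the in-order matching fragments that are at
    least `min_len` characters long."""
    matcher = SequenceMatcher(None, description, verification,
                              autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks()
               if block.size >= min_len)


def is_spam_by_fragments(description, verification,
                         min_len=10, threshold=0.8):
    """Judge as spam when the long fragments cover more than
    `threshold` of the description data."""
    if not description:
        return False
    covered = covered_chars(description, verification, min_len)
    return covered / len(description) > threshold


# Hypothetical verification data of 36 distinct characters.
verification = "abcdefghijkl" + "mnopqrstuvwx" + "0123456789yz"

# Two interfering characters inserted: fragments of 12 and 24
# characters cover 36 of 38 characters.
disguised = "abcdefghijkl" + "ZZ" + "mnopqrstuvwx" + "0123456789yz"
print(is_spam_by_fragments(disguised, verification))

# Every matching fragment is only 9 characters long, so none counts.
chopped = "abcdefghi" + "Z" + "mnopqrstu" + "Z" + "012345678"
print(is_spam_by_fragments(chopped, verification))
```

The 10-character minimum keeps short coincidental matches, which are common when the alphabet has only 36 symbols, from contributing to the overlap.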

The detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention. They are not intended to limit the scope of protection of the invention, and all equivalent embodiments or modifications that do not depart from the technical spirit of the invention shall fall within that scope.

It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalents of the claims are intended to be embraced by the invention. No reference sign in a claim should be construed as limiting the claim concerned.

Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity. Those skilled in the art should treat the specification as a whole, and the technical solutions in the various embodiments may be appropriately combined to form other embodiments that those skilled in the art can understand.

Claims

1. A method for identifying spam by HTML tags, characterized by comprising the following steps:

S1, building a tag description table that uses characters to describe the tags in HTML code;

S2, extracting, in order, the tags in the HTML code of a spam mail, and deriving from the tag description table verification data comprising multiple characters;

S3, after a new mail is received, extracting the HTML code of the new mail and translating the tags in that HTML code into description data according to the tag description table;

S4, comparing the description data with the verification data, and judging as spam at least any new mail whose description data hits the verification data.

2. The method according to claim 1, characterized in that at least any new mail whose description data matches the head ordering or the tail ordering of the verification data is judged to be spam.

3. The method according to claim 1 or claim 2, characterized in that the tag description table includes several records, each record consisting of verification data and the length information of that verification data;

if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table equals the length information of the verification data, the new mail is judged to be spam.

4. The method according to claim 3, characterized in that the method further includes: the length information of the verification data has a value range [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table falls within the value range of the length information of the verification data, the new mail is judged to be spam.

5. The method according to claim 4, characterized in that the upper limit n of the value range of the length information of the verification data is 100, and the lower limit m of the value range is 20.

6. The method according to claim 1, characterized in that the characters are computer-readable data consisting of one of, or any combination of two or more of, digits, letters and ASCII codes.

7. The method according to claim 1, characterized in that the byte length of the characters is fixed.

8. The method according to claim 1, characterized in that the method further includes judging as spam any new mail whose description data overlaps the verification data by more than 80%.

9. The method according to claim 8, characterized in that the method further includes scanning the verification data with the description data of the new mail; if the number of characters contained in the fragments where the description data coincides with the verification data exceeds 80% of the number of characters contained in the description data, the new mail is judged to be spam; wherein each fragment where the description data coincides with the verification data is at least 10 characters long.