Summary of the Invention
An object of the invention is to disclose a method for identifying spam by HTML tags, which effectively identifies spam in HTML format, reduces the computing overhead of the background server or web search engine, and simplifies the steps of identifying spam.
To achieve the above object, the invention provides a method for identifying spam by HTML tags, comprising the following steps:
S1. Build a tag description table that uses characters to describe the tags in HTML code;
S2. Extract, in order, the tags in the HTML code of a spam message, and derive from the tag description table verification data comprising a plurality of characters;
S3. After a new mail is received, extract the HTML code of the new mail and translate the tags in that HTML code into description data according to the tag description table;
S4. Compare the description data with the verification data, and judge as spam at least the new mail whose description data hits the verification data.
As a further improvement of the invention, the new mail whose description data matches at least the head character order or the tail character order of the verification data is judged to be spam.
As a further improvement of the invention, the tag description table comprises several records, each record consisting of verification data and the length information of that verification data;
if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table is equal to the length information of the verification data, the new mail is judged to be spam.
As a further improvement of the invention, the method further comprises: the value range of the length information of the verification data is [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table lies within the value range of the length information of the verification data, the new mail is judged to be spam.
As a further improvement of the invention, the upper limit n of the value range of the length information of the verification data is 100, and the lower limit m of the value range is 20.
As a further improvement of the invention, the character is computer-readable data consisting of one of, or any combination of two or more of, digits, letters, and ASCII codes.
As a further improvement on the present invention, the byte length of the character is fixed.
As a further improvement of the invention, the method further comprises judging as spam the new mail whose description data has a degree of coincidence with the verification data higher than 80%.
As a further improvement of the invention, the method further comprises scanning the verification data with the description data of the new mail; if the number of characters contained in the fragments in which the description data coincides with the verification data is higher than 80% of the number of characters contained in the description data, the new mail is judged to be spam; wherein the length of each fragment in which the description data coincides with the verification data is greater than or equal to 10 characters.
Compared with the prior art, the beneficial effects of the invention are as follows. In the invention, whether a new mail is spam is judged simply by comparing the description data derived from the HTML code of the new mail with the verification data derived from the HTML code of a previously entered spam message. Because only characters with a very small number of bytes need to be compared in the whole process, the computing overhead of the background server or web search engine is significantly reduced and the steps of identifying spam are simplified. At the same time, whatever interference a spammer introduces into the text content of the spam, the HTML code finally extracted from the spam remains consistent, so whether a mail is spam can be judged efficiently and accurately.
Detailed Description of the Embodiments
The invention is described in detail below with reference to the embodiments shown in the drawings. It should be noted that these embodiments do not limit the invention; equivalent transformations or substitutions in function, method, or structure made by those of ordinary skill in the art according to these embodiments all fall within the protection scope of the invention.
Fig. 1 shows a specific embodiment of the method of the invention for identifying spam by HTML tags. The method is mainly used to judge whether a mail in HTML format is spam. In this specification, the term "mail" is equivalent to the term "e-mail". A user receives mail sent by a peer in a web page or mail application on a computer or other Internet-connected device, and the received new mail is judged on the local computer or on the mail server.
The method of the invention for identifying spam by HTML tags comprises the following steps.
First, a description table that uses characters to describe the tags in HTML code is built.
HTML (HyperText Markup Language) is the most widely used language on the network today and the main language of web documents. An HTML text is descriptive text made up of HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. The structure of an HTML document comprises two parts, the head (Head) and the body (Body): the head describes the information needed by the browser, and the body contains the content to be displayed.
To judge spam, spam must first be defined. Unlike traditional spam judgment methods, the invention does not extract the text or image data of the mail. Instead, it counts the number of occurrences, the order, or the arrangement pattern of the tags in the HTML code of an HTML-format mail, and uses them to form a spam attribute that characterizes and describes the spam. The method can run on a mail server; the mail server is connected to a database that stores information such as the description table. A graphical mail consists of text (Text) and HTML code, and the HTML code describes the text with its own characteristic specification and order. After the mail is decoded on the mail server or on the local server that receives the mail, the HTML code of the mail can be displayed.
In this embodiment, the character is computer-readable data consisting of one of, or any combination of two or more of, digits, letters, and ASCII codes. Preferably, in this embodiment, the byte length of the character is fixed and is less than or equal to 3 standard byte lengths. The tag description table shown in Table 1 below is thereby formed. The tag description table can be saved in the database associated with the local mail server, and an administrator can modify the correspondence between tags and characters, as well as the kinds of characters, in the tag description table. If the HTML code of the spam pre-entered in the database contains too many kinds of tags, the characters corresponding to the tags are uniformly set to 2 standard byte lengths, and to no more than 3 standard byte lengths.
| Tag | Character | Tag | Character | Tag | Character | Tag | Character |
|---|---|---|---|---|---|---|---|
| A/AREA | a | FRAME | k | OBJECT | u | SUP | 5 |
| B | b | FRAMESET | l | OL | v | TABLE | 6 |
| BR | c | H1-H6 | m | P | w | TD | 7 |
| CENTER | d | HR | n | PRE | x | TEXTAREA | 8 |
| DD | e | IFRAME | o | SCRIPT | y | TR | 9 |
| DIV | f | IMG | p | SELECT | z | UL | 0 |
| DL | g | INPUT | q | SPAN | 1 | | |
| DT | h | LABEL | r | STRONG | 2 | | |
| FONT | i | LEGEND | s | STYLE | 3 | | |
| FORM | j | MAP | t | SUB | 4 | | |

Table 1: Tag description table
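As an illustration only, Table 1 could be held in memory as a simple lookup from tag name to character. The following Python sketch assumes lower-case tag names as keys; the names TAG_TABLE and tag_to_char are hypothetical and do not appear in the specification, which stores the table in a database.

```python
# Hypothetical in-memory form of Table 1 (tag name -> single character).
TAG_TABLE = {
    "a": "a", "area": "a", "b": "b", "br": "c", "center": "d", "dd": "e",
    "div": "f", "dl": "g", "dt": "h", "font": "i", "form": "j",
    "frame": "k", "frameset": "l", "hr": "n", "iframe": "o", "img": "p",
    "input": "q", "label": "r", "legend": "s", "map": "t",
    "object": "u", "ol": "v", "p": "w", "pre": "x", "script": "y",
    "select": "z", "span": "1", "strong": "2", "style": "3", "sub": "4",
    "sup": "5", "table": "6", "td": "7", "textarea": "8", "tr": "9", "ul": "0",
}
# H1 to H6 all share the character "m" in Table 1.
TAG_TABLE.update({f"h{level}": "m" for level in range(1, 7)})


def tag_to_char(tag):
    """Return the Table 1 character for a tag name, or None if the tag is not listed."""
    return TAG_TABLE.get(tag.lower())
```

Because every character here is one byte long, a sequence of tags translates into a string whose length equals the number of listed tags.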
The sending behavior of spam usually has a certain regularity. For example, in mails with identical or different content sent by the same sender to different recipients, the tags in the HTML code are often identical or substantially the same. Describing the tags therefore allows a fast and accurate judgment of whether a mail is spam. In the subsequent description of spam and the subsequent judgment of new mail, the corresponding characters in the tag description table, their order, and their degree of matching are compared to judge whether a new mail is spam.
Next, the tags in the HTML code of the spam are extracted in order, and verification data comprising a plurality of characters is derived according to the tag description table.
Referring to Fig. 2, Fig. 2 shows a pre-entered spam message together with its HTML code. The content inside "<>" is a tag. The tags in the HTML code of Fig. 2 are extracted one by one in the order in which they appear in the HTML code (see Fig. 3) and then recorded. The tags of the spam in Fig. 3 are "div, span, div, a, div". Then, according to the correspondence between tags and characters in the tag description table of Table 1, the tags of the spam are replaced with "f, 1, f, a, f". This yields, from the tag description table, verification data consisting of a plurality of characters in a definite order. The verification data can be stored in the database associated with the local mail server. The database may be a MySQL database, an Oracle database, or a DB2 database.
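The extraction and translation just described can be sketched with Python's standard html.parser module. This is a minimal sketch under stated assumptions: only opening tags are counted, tags not listed in Table 1 are skipped, and SAMPLE_SPAM_HTML is a made-up snippet (the actual HTML of Fig. 2 is not reproduced here) chosen so that its tags are div, span, div, a, div. The names TagCollector and describe are hypothetical.

```python
from html.parser import HTMLParser

# Subset of Table 1 that is sufficient for this example.
TAG_TABLE = {"div": "f", "span": "1", "a": "a"}


class TagCollector(HTMLParser):
    """Collects the Table 1 character of each opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.chars = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        char = TAG_TABLE.get(tag)
        if char is not None:  # assumption: unlisted tags are ignored
            self.chars.append(char)


def describe(html_code):
    """Translate the tags of an HTML document into a character string."""
    collector = TagCollector()
    collector.feed(html_code)
    collector.close()
    return "".join(collector.chars)


# Hypothetical HTML whose tag order is div, span, div, a, div.
SAMPLE_SPAM_HTML = (
    '<div>User test1, <span>our company issues invoices</span></div>'
    '<div>Website: <a href="http://example.invalid/">Click</a></div>'
    '<div>2017-1-1 12:00:00</div>'
)

verification_data = describe(SAMPLE_SPAM_HTML)
print(verification_data)  # f1faf
```

The same describe function would serve for step S3, where the tags of a new mail are translated into description data.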
As shown in Fig. 2, the sender address of the pre-entered spam is "jichunlai@chinac.com", the recipient address is "jichunlai@chinac.com", and the mail content is "User test1, our company issues invoices on behalf of others, website: Click", 2017-1-1 12:00:00. Here, "Click" is a hyperlink that may lead to a malicious website, a pornographic website, or a sales website.
Referring to Figs. 4 and 5, if the sender of the pre-entered spam sends a mail with the same content to the same recipient at a different time, that mail can be identified as spam. In the invention, however, the text content of the suspected spam need not be judged. Instead, the description data obtained from the tags in the HTML code of the mail and the tag description table is compared with the verification data obtained from the tags in the HTML code of the spam and the tag description table, in order to determine whether the new mail (see Fig. 4) is spam.
As shown in Fig. 4, the sender address of the new mail is "jichunlai@chinac.com", the recipient address is "jichunlai@chinac.com", and the mail content is "User zhangsan, our company issues invoices on behalf of others, website: Click", 2017-1-2 18:00:00. Here, "Click" is a hyperlink that may lead to a malicious website, a pornographic website, or a sales website.
Referring to Fig. 5, the tags of the HTML code of the new mail are still "div, span, div, a, div". Next, according to the correspondence between tags and characters in the tag description table of Table 1, the tags of the HTML code of this mail are translated and replaced with "f, 1, f, a, f", which gives the description data. Spam sent by a spammer has a fixed format and follows certain rules. Therefore, when the same sender sends spam with the same content at different times, or when the same spammer sends spam with the same content to different recipients at the same or different times, the description data formed by translating the tags of the HTML code with the tag description table of Table 1 is identical, or substantially identical, to the verification data of the pre-entered spam.
Therefore, in this embodiment, after a new mail is received, the HTML code of the new mail can be extracted and the tags in that HTML code translated into description data according to the tag description table. The description data is then compared with the verification data, and at least the new mail whose description data hits the verification data is judged to be spam. In this embodiment, "hit" means that the verification data and the description data have the same character types and the same character order.
Figs. 6 and 7 show a non-spam mail (another new mail) and its HTML code. In Fig. 7, the tags in the HTML code of the new mail are "div, div, div", and the description data formed by replacing or translating them with Table 1 is "f, f, f". This description data differs from the verification data "f, 1, f, a, f" of the pre-entered spam shown in Fig. 2 in both length and character order, so the new mail shown in Fig. 6 can be judged to be non-spam.
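Since a "hit" requires the same characters in the same order, the comparison reduces to string equality once the characters are joined into a string. A minimal sketch, assuming the stored verification data is available as a set of strings (VERIFICATION_DATA and is_hit are hypothetical names; the specification keeps the data in a database):

```python
# Hypothetical store holding the verification data of the spam in Fig. 2.
VERIFICATION_DATA = {"f1faf"}


def is_hit(description, verification_data=VERIFICATION_DATA):
    """True when the description data equals some stored verification data."""
    return description in verification_data


print(is_hit("f1faf"))  # new mail of Fig. 4 -> True
print(is_hit("fff"))    # new mail of Fig. 6 -> False
```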
Spammers usually disguise the content of spam, and differences in sending time cause the time attribute or sending time in the mail content to differ. For example, interference information such as pictures or audio files unrelated to the promotional content may be inserted into the spam, or the sending times in the mail content may differ, as in Figs. 2 and 4. However, the promotional content that the spam needs to deliver to recipients is the same or similar. Therefore, the description data formed by translating the tags in the HTML code of the spam according to the tag description table exhibits full or partial similarity. As long as a continuous data segment in the description data, formed by translating the tags of the HTML code of the new mail to be judged with the tag description table of Table 1, substantially matches the verification data of a spam message pre-entered in the database, the new mail can be judged to be spam. The similarity or partial similarity between the description data of the new mail and the verification data of the predefined spam is described below.
The partial similarity may be that the part of the description data located at its head matches the verification data (see Fig. 8), that the part located in its middle matches the verification data (see Fig. 8), or that the part located at its tail matches the verification data (see Fig. 8). In this way, the new mail whose description data matches at least the head character order or the tail character order of the verification data is judged to be spam.
Specifically, suppose the tags of the HTML code of a new mail, translated according to the tag description table, form the description data shown in Fig. 8: "a, b, ..., h, i, ..., e, q, q, ..., q, 6, 0, 0, ..., f, ..., f, 5". If the head "a, b, ..., h, i" of this description data is the same as or similar to the verification data of a preset spam message, the new mail can be identified as spam. Likewise, if the middle "e, q, q, ..., q, 6, 0, 0" of the description data of the new mail is the same as or similar to the verification data of another spam message, the new mail can be identified as spam; and if the tail "f, ..., f, 5" of the description data of the new mail is the same as or similar to the verification data of another preset spam message, the new mail can be identified as spam.
More generally, as long as sub-description data made up of continuous characters (a unit relative to the whole description data) within the description data of the new mail to be judged, as shown in Fig. 8, is the same as or similar to the verification data of a spam message entered in advance in the database or server, the new mail can be judged to be spam.
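The head, middle, and tail cases can be sketched as follows. This sketch covers only the "same" case, that is, the whole verification data appearing as one continuous run inside the description data; the text also allows "similar" matches, which are not modelled here. The function name match_position and the sample strings are hypothetical.

```python
def match_position(description, verification):
    """Report where the verification data occurs inside the description data.

    Returns "head", "tail", "middle", or None when there is no continuous match.
    """
    if not verification:
        return None
    if description.startswith(verification):
        return "head"
    if description.endswith(verification):
        return "tail"
    if verification in description:
        return "middle"
    return None


# Hypothetical description data surrounding the verification data "f1faf".
print(match_position("f1fafqq6", "f1faf"))   # head
print(match_position("abf1fafq6", "f1faf"))  # middle
print(match_position("abf1faf", "f1faf"))    # tail
print(match_position("fff", "f1faf"))        # None
```

Any non-None result corresponds to the sub-description data case above, in which the new mail would be judged to be spam.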
In this manner, the interference that spammers cause by artificially adding various kinds of interference information can be effectively overcome when spam is judged with the method of the invention. While judgment accuracy is ensured, spam that is disguised or that encapsulates interference information is identified and judged as accurately as possible, and the computing overhead of the background server or web search engine is significantly reduced.
In this embodiment, the tag description table comprises several records, each record consisting of verification data and the length information of that verification data. If the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table is equal to the length information of the verification data, the new mail is judged to be spam.
Specifically, referring to Figs. 2 to 5, the length information of the verification data of the tags of the HTML code of the preset spam (that is, "f, 1, f, a, f") is 5, and the length information of the description data of the tags of the HTML code of the new mail shown in Figs. 4 and 5 (that is, "f, 1, f, a, f") is also 5. Therefore, by comparing the length information of the verification data with that of the description data, the new mail shown in Fig. 4 can be judged to be spam. In this comparison, the computer only needs to compare description data and verification data containing very few bytes, which greatly reduces the computing overhead of the computer and improves the efficiency of spam judgment.
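A record of verification data plus length information, and the length comparison, might look like the sketch below. It assumes one-byte characters, so the length information equals the string length; RECORDS and same_length_as_record are hypothetical names.

```python
# Hypothetical records: (verification data, length information).
RECORDS = [("f1faf", 5)]


def same_length_as_record(description, records=RECORDS):
    """True when the description data has the same length as some record."""
    return any(len(description) == length for _data, length in records)


print(same_length_as_record("f1faf"))  # description data of Fig. 5 -> True
print(same_length_as_record("fff"))    # description data of Fig. 7 -> False
```

In this sketch, the length test alone would also accept any other five-character description data, so in practice it would presumably be combined with the character comparison described earlier.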
In this embodiment, the method for identifying spam by HTML tags further comprises: the value range of the length information of the verification data is [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table lies within the value range of the length information of the verification data, the new mail is judged to be spam. The upper limit n of the value range of the length information of the verification data is 100, and the lower limit m of the value range is 20.
It should be noted that if a mail is non-spam, the description data obtained by translating its HTML code with the tag description table of Table 1 cannot be too long. Therefore, in view of actual conditions, the upper limit n of the length information of the verification data is 100 in this embodiment; a longer value would add computing overhead to the process of scanning the verification data with the description data. Likewise, the lower limit m of the length information of the verification data cannot be too short. If it is too short, new mail may be misjudged, so that non-spam is judged to be spam.
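The range condition is a closed-interval test on the length of the description data. A minimal sketch with the stated limits m = 20 and n = 100, again assuming one-byte characters (length_in_range is a hypothetical name):

```python
M, N = 20, 100  # lower and upper limits of the length information


def length_in_range(description, m=M, n=N):
    """True when the length of the description data lies in the closed range [m, n]."""
    return m <= len(description) <= n


print(length_in_range("f" * 20))   # True, the lower limit is included
print(length_in_range("f" * 101))  # False, longer than the upper limit
print(length_in_range("f1faf"))    # False, shorter than the lower limit
```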
Referring again to Fig. 8, in order to further improve the efficiency of identifying new mail, the method of the invention also comprises judging as spam the new mail whose description data has a degree of coincidence with the verification data higher than 80%. Specifically, the description data formed by translating the tags in the HTML code of the new mail can be compared as a whole with the whole of the verification data of the pre-entered spam, and the new mail whose overall degree of coincidence is higher than 80% is judged to be spam. Alternatively, the verification data of the pre-entered spam can be used as a scanning unit to scan longer description data; if 80% of the characters of a segment of the description data match or hit the verification data of the pre-entered spam, the new mail is judged to be spam.
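The text does not define how the degree of coincidence is computed. The sketch below is one possible reading, not the method of the specification: it uses the matching blocks found by Python's difflib.SequenceMatcher and divides the number of matched characters by the length of the longer string. The names coincidence and is_spam_by_coincidence and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def coincidence(description, verification):
    """Share of characters that lie in matching blocks, relative to the longer string.

    Assumption: this ratio stands in for the undefined "degree of coincidence".
    """
    if not description or not verification:
        return 0.0
    matcher = SequenceMatcher(None, description, verification, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(description), len(verification))


def is_spam_by_coincidence(description, verification, threshold=0.8):
    """Judge spam when the degree of coincidence is strictly higher than the threshold."""
    return coincidence(description, verification) > threshold


print(is_spam_by_coincidence("f1faff1fa", "f1faff1faf"))  # True (9 of 10 characters)
print(is_spam_by_coincidence("fff", "f1faf"))             # False
```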
Referring to Fig. 9, this embodiment further comprises scanning the verification data with the description data of the new mail; if the number of characters contained in the fragments in which the description data coincides with the verification data is higher than 80% of the number of characters contained in the description data, the new mail is judged to be spam. The length of each fragment in which the description data coincides with the verification data is greater than or equal to 10 characters.
In this case, the three dashed boxes in Fig. 9 mark three fragments in which the description data coincides with the verification data (that is, "sub-description data"). Each fragment contains at least 10 characters, and every character has the same number of bytes. In this specification, to simplify the presentation, the byte length of a character is uniformly set to 1 standard byte length. The length and order of the characters in the three fragments are consistent with those in the verification data of the pre-entered spam.
Assume that the verification data comprises 100 characters arranged in a specific order. When the local mail server receives a new mail to be judged, the description data obtained by translation with the tag description table is 90 characters long. The description data is scanned against the verification data in order. If a run of characters in the description data (10 characters or more) hits a continuous fragment of the verification data, and the number of characters contained in the fragments of the description data that hit the verification data coincides to 80% or more with the number of characters contained in the verification data of the pre-entered spam, the new mail to be judged can be judged to be spam. The accuracy of spam judgment is thereby improved and misjudgment prevented, while the computing overhead is kept low.
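The fragment rule can be sketched by keeping only coinciding fragments of at least 10 characters and comparing their total length with the length of the description data. Two assumptions are made here: the fragments are approximated by the matching blocks of difflib.SequenceMatcher, which need not be the only way to find them, and the 80% is taken relative to the description data, as in the statement of the rule for Fig. 9. The sample strings and function names are hypothetical.

```python
from difflib import SequenceMatcher

MIN_FRAGMENT = 10  # minimum fragment length in characters
THRESHOLD = 0.8    # required share of the description data


def fragment_coverage(description, verification, min_fragment=MIN_FRAGMENT):
    """Share of description characters lying in coinciding fragments of sufficient length."""
    if not description:
        return 0.0
    matcher = SequenceMatcher(None, description, verification, autojunk=False)
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_fragment
    )
    return covered / len(description)


def is_spam_by_fragments(description, verification):
    """Judge spam when long fragments cover more than 80% of the description data."""
    return fragment_coverage(description, verification) > THRESHOLD


# Hypothetical 30-character verification data with distinct characters.
VERIFICATION = "abcdefghijklmnopqrstuvwxyz0123"

# Two long fragments (10 and 20 characters) separated by two foreign characters.
print(is_spam_by_fragments(VERIFICATION[:10] + "55" + VERIFICATION[10:], VERIFICATION))  # True
# Only one 10-character fragment in 20 characters of description data.
print(is_spam_by_fragments(VERIFICATION[:10] + "5" * 10, VERIFICATION))                  # False
# Many short fragments, none of which reaches 10 characters.
print(is_spam_by_fragments("abcde5fghij5klmno5pqrst", VERIFICATION))                     # False
```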
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the invention. They are not intended to limit the protection scope of the invention, and all equivalent embodiments or modifications made without departing from the technical spirit of the invention shall fall within the protection scope of the invention.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced in the invention. No reference sign in a claim should be construed as limiting the claim concerned.
Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written in this way only for the sake of clarity. Those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments can be appropriately combined to form other embodiments that those skilled in the art can understand.