Summary of the invention
An object of the invention is to disclose a method for identifying spam by HTML tags, so that spam in HTML format is identified effectively, the computing overhead of the back-end server or web search engine is reduced, and the steps for identifying spam are simplified.
To achieve the above object, the present invention provides a method for identifying spam by HTML tags, comprising the following steps:
S1: building a tag description table that uses characters to describe the tags in HTML code;
S2: extracting, in order, the tags in the HTML code of a spam mail, and extracting verification data comprising multiple characters according to the tag description table;
S3: after a new mail is received, extracting the HTML code of the new mail, and translating the tags in the HTML code of the new mail into description data according to the tag description table;
S4: comparing the description data with the verification data, and determining as spam at least the new mail whose description data hits the verification data.
As a further improvement of the present invention, a new mail whose description data matches at least the character order at the head or at the tail of the verification data is determined as spam.
As a further improvement of the present invention, the tag description table comprises several records, each record consisting of verification data and the length information of that verification data;
if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table is equal to the length information of the verification data, the new mail is determined as spam.
As a further improvement of the present invention, the method further comprises: the length information of the verification data has a value range [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table lies within the value range of the length information of the verification data, the new mail is determined as spam.
As a further improvement of the present invention, the upper limit n of the value range of the length information of the verification data is 100, and the lower limit m of the value range is 20.
As a further improvement of the present invention, the character is computer-readable data consisting of one of digits, letters, and ASCII codes, or any combination of two or more of them.
As a further improvement of the present invention, the byte length of the character is fixed.
As a further improvement of the present invention, the method further comprises determining as spam the new mail whose description data has a degree of coincidence with the verification data higher than 80%.
As a further improvement of the present invention, the method further comprises scanning the verification data with the description data of the new mail; if the number of characters contained in the segments where the description data and the verification data coincide is higher than 80% of the number of characters contained in the description data, the new mail is determined as spam; wherein the length of each segment where the description data and the verification data coincide is greater than or equal to 10 characters.
Compared with the prior art, the beneficial effects of the present invention are as follows. It is only necessary to compare the description data formed from the HTML code of the new mail with the verification data formed from the HTML code of a previously registered spam mail in order to determine whether the new mail is spam. Because only character strings of very few bytes are compared in the whole process, the computing overhead of the back-end server or web search engine is reduced and the steps for identifying spam are significantly simplified. Meanwhile, no matter how the spammer disturbs the text content of the spam, the HTML code finally extracted from the spam remains consistent, so whether a mail is spam can be judged efficiently and accurately.
Specific embodiment
The present invention is described in detail below with reference to the embodiments shown in the accompanying drawings. It should be noted, however, that these embodiments do not limit the present invention; any equivalent transformation or substitution in function, method, or structure made by those of ordinary skill in the art on the basis of these embodiments falls within the scope of protection of the present invention.
Refer to FIG. 1, which shows a specific embodiment of the method for identifying spam by HTML tags according to the present invention. The method is mainly used to determine whether a mail in HTML format is spam. In this specification, the term "mail" is equivalent to the term "Email". A user receives mail sent by the peer in a web page or mail application on a computer or other internet-connected device, and the newly received mail is evaluated on the local computer or on the mail server.
The method for identifying spam by HTML tags according to the present invention includes the following steps.
First, a tag description table is built that uses characters to describe the tags in HTML code.
HTML (HyperText Markup Language) is the most widely used language on the web today and the main language in which web documents are written. An HTML text is descriptive text made up of HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. The structure of HTML comprises two major parts, the head (Head) and the body (Body): the head describes the information needed by the browser, and the body contains the specific content to be presented.
To determine whether a mail is spam, spam must first be defined. Unlike traditional spam determination methods, the present invention does not extract the text or image data of the mail. Instead, it counts the number of occurrences, the order, or the arrangement pattern of the tags in the HTML code of an HTML-format mail, and from these forms the spam attributes that characterize and describe the spam. The method can run on a mail server that is connected to a database in which information such as the description table is stored. A graphical mail is made up of text (Text) and HTML code, and the HTML code describes the text with its own particular conventions and order. After the mail server, or the local server that receives the mail, decodes the mail, the HTML code of the mail can be displayed.
In the embodiments, the character is computer-readable data consisting of one of digits, letters, and ASCII codes, or any combination of two or more of them. Preferably, in this embodiment, the byte length of the character is fixed and is less than or equal to 3 standard bytes. The tag description table shown in Table 1 below is formed accordingly. The tag description table can be stored in the database associated with the local mail server, and an administrator can modify the correspondence between tags and characters, as well as the character types, in the table. If the HTML code of the spam pre-entered in the database contains too many tags, the characters corresponding to the tags are unified to a length of 2 standard bytes, and at most 3 standard bytes.
| Tag    | Character | Tag      | Character | Tag    | Character | Tag      | Character |
|--------|-----------|----------|-----------|--------|-----------|----------|-----------|
| A/AREA | a         | FRAME    | k         | OBJECT | u         | SUP      | 5         |
| B      | b         | FRAMESET | l         | OL     | v         | TABLE    | 6         |
| BR     | c         | H1-H6    | m         | P      | w         | TD       | 7         |
| CENTER | d         | HR       | n         | PRE    | x         | TEXTAREA | 8         |
| DD     | e         | IFRAME   | o         | SCRIPT | y         | TR       | 9         |
| DIV    | f         | IMG      | p         | SELECT | z         | UL       | 0         |
| DL     | g         | INPUT    | q         | SPAN   | 1         |          |           |
| DT     | h         | LABEL    | r         | STRONG | 2         |          |           |
| FONT   | i         | LEGEND   | s         | STYLE  | 3         |          |           |
| FORM   | j         | MAP      | t         | SUB    | 4         |          |           |

Table 1: Tag description table
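The table can be held as a simple lookup structure. The following Python sketch is illustrative only: the names `TAG_TABLE` and `translate_tags` are hypothetical, "A/AREA" and "H1-H6" are read as several tags sharing one character, and skipping tags that are absent from the table is an assumption, since the text does not say how such tags are handled.

```python
# Tag description table (Table 1): each HTML tag maps to one single-byte character.
# "A/AREA" and "H1-H6" are interpreted as several tags sharing one character.
TAG_TABLE = {
    "a": "a", "area": "a", "b": "b", "br": "c", "center": "d", "dd": "e",
    "div": "f", "dl": "g", "dt": "h", "font": "i", "form": "j",
    "frame": "k", "frameset": "l",
    "h1": "m", "h2": "m", "h3": "m", "h4": "m", "h5": "m", "h6": "m",
    "hr": "n", "iframe": "o", "img": "p", "input": "q", "label": "r",
    "legend": "s", "map": "t", "object": "u", "ol": "v", "p": "w",
    "pre": "x", "script": "y", "select": "z", "span": "1", "strong": "2",
    "style": "3", "sub": "4", "sup": "5", "table": "6", "td": "7",
    "textarea": "8", "tr": "9", "ul": "0",
}


def translate_tags(tags):
    """Translate an ordered list of tag names into a character string.

    Tags missing from the table are skipped (an assumption of this sketch).
    """
    return "".join(TAG_TABLE[t.lower()] for t in tags if t.lower() in TAG_TABLE)


print(translate_tags(["div", "span", "div", "a", "div"]))  # f1faf
```

A dictionary keeps the correspondence easy for an administrator to edit, which matches the statement that the tag-to-character relationship can be modified.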
The sending behavior of spam usually shows a certain regularity. For example, when the same sender sends mails with identical or different content to different recipients, the tags in the HTML code of those mails are often identical or essentially the same. A description of the tags can therefore be used to determine quickly and accurately whether a mail is spam. In the subsequent description of spam and the subsequent determination of new mail, the corresponding characters in the tag description table, their order, and their degree of matching are compared to judge whether a new mail is spam.
Next, the tags in the HTML code of the spam are extracted in order, and verification data comprising multiple characters is extracted according to the tag description table.
Refer to FIG. 2, which shows a pre-entered spam mail; the HTML code of this spam is shown in FIG. 3. The content inside "<>" is a tag. The tags in the HTML code of the spam of FIG. 2 are extracted one after another in the order in which they appear in the HTML code (see FIG. 3) and recorded. The tags of the spam in FIG. 3 are "div, span, div, a, div". Then, according to the relationship between tags and characters in the tag description table of Table 1, the tags of the spam are replaced with "f, 1, f, a, f". Verification data consisting of multiple characters in a fixed order is thus extracted according to the tag description table, and can be stored in the database associated with the local mail server. The database includes a MySQL database, an Oracle database, or a DB2 database.
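The extraction step can be sketched with the HTML parser from the Python standard library. This is a minimal illustration under stated assumptions: only opening tags are counted, the table is reduced to the three entries needed here, the markup in `SAMPLE_HTML` is invented to reproduce the tag order "div, span, div, a, div" (the figures are not available), and the names `TagCollector` and `verification_data` are hypothetical.

```python
from html.parser import HTMLParser

# Subset of Table 1 that is needed for this example.
TAG_TABLE = {"div": "f", "span": "1", "a": "a"}


class TagCollector(HTMLParser):
    """Collect the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        self.tags.append(tag)


def verification_data(html):
    """Extract the tags of an HTML body in order and translate them via the table."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return "".join(TAG_TABLE[t] for t in collector.tags if t in TAG_TABLE)


# Invented markup with the tag order div, span, div, a, div.
SAMPLE_HTML = (
    "<div><span>User test1</span></div>"
    '<div><a href="http://example.invalid/">click</a></div>'
    "<div>2017-1-1 12:00:00</div>"
)

print(verification_data(SAMPLE_HTML))  # f1faf
```

The resulting string is what would be stored as verification data; the same function applied to a new mail would yield its description data.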
As shown in FIG. 2, the sender address of the pre-entered spam is "jichunlai@chinac.com", the recipient address is "jichunlai@chinac.com", and the mail content is "User test1, our company issues invoices on your behalf, URL: click", 2017-1-1 12:00:00. Here "click" is a hyperlink that may lead to a malicious website, a pornographic website, or a sales website.
Refer to FIG. 4 and FIG. 5. If the sender of the pre-entered spam sends mail with the same content to the same recipient at a different time, that mail can be regarded as spam. In the present invention, however, the text content of the suspected spam is not examined. Instead, the description data obtained from the tags in the HTML code of the mail via the tag description table is compared with the verification data obtained from the tags in the HTML code of the spam via the same table, to determine whether the new mail (see FIG. 4) is spam.
As shown in FIG. 4, the sender address of the new mail is "jichunlai@chinac.com", the recipient address is "jichunlai@chinac.com", and the mail content is "User zhangsan, our company issues invoices on your behalf, URL: click", 2017-1-2 18:00:00. Here "click" is a hyperlink that may lead to a malicious website, a pornographic website, or a sales website.
Refer to FIG. 5. The tags of the HTML code of the new mail are still "div, span, div, a, div". Next, according to the relationship between tags and characters in the tag description table of Table 1, the tags of this mail are translated and replaced with "f, 1, f, a, f", which gives the description data. The spam sent by a spammer has a fixed format and a certain regularity. Therefore, when the same spammer sends spam with the same content at different times, or sends spam with the same content to different recipients at the same or different times, the description data formed by translating the tags of the HTML code with the tag description table of Table 1 is identical, or essentially identical, to the verification data of the pre-entered spam.
Therefore, in this embodiment, after a new mail is received, its HTML code is extracted and the tags in that HTML code are translated into description data according to the tag description table. The description data is then compared with the verification data, and at least the new mail whose description data hits the verification data is determined as spam. In this embodiment, "hit" means that the verification data and the description data have the same character types and the same character order.
Refer to FIG. 6 and FIG. 7, which show a non-spam mail (another new mail) and its HTML code. In FIG. 7, the tags in the HTML code of this new mail are "div, div, div"; replacing or translating them by means of Table 1 yields the description data "f, f, f". This description data differs from the verification data "f, 1, f, a, f" of the pre-entered spam of FIG. 2 in both length and character order, so the new mail shown in FIG. 6 can be determined as non-spam.
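With both strings available, the "hit" test defined above reduces to an equality check. The sketch below replays the two examples from the text; the function name `is_hit` is hypothetical, and the comma-separated notation of the text is written as a plain string.

```python
def is_hit(description, verification):
    """A hit means the same characters in the same order, i.e. identical strings."""
    return description == verification


VERIFICATION = "f1faf"  # pre-entered spam: div, span, div, a, div

print(is_hit("f1faf", VERIFICATION))  # new mail of FIG. 4 and FIG. 5 -> True
print(is_hit("fff", VERIFICATION))    # new mail of FIG. 6 and FIG. 7 -> False
```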
Spammers usually disguise the content of spam, and the time attribute or sending time in the mail content differs because spam is sent at different times. For example, interference such as pictures or audio files unrelated to the promotional content may be inserted into the spam, or the sending time in the mail content may differ, as in FIG. 2 and FIG. 4. However, the promotional content that the spam needs to deliver to the recipient is the same or similar. Therefore, the description data formed by translating the tags in the HTML code of the spam according to the tag description table will be entirely identical or partially similar. As long as a continuous data segment in the description data, formed by translating the tags of the HTML code of the new mail with the tag description table of Table 1, essentially matches the verification data of a spam pre-entered in the database, the new mail can be determined as spam. The identity or partial similarity between the description data of the new mail and the verification data of the predefined spam is described in detail below.
The partial similarity may be that the part of the description data located at the head matches the verification data, that a part located in the middle matches the verification data, or that the part located at the tail matches the verification data (see FIG. 8 in each case). In this way, at least the new mail whose description data matches the character order at the head or at the tail of the verification data is determined as spam.
Specifically, suppose the tags of the HTML code of a new mail, translated according to the tag description table, form the description data shown in FIG. 8: "a, b, ..., h, i, ..., e, q, q, ..., q, 6, 0, 0, ..., f, ..., f, 5". If the head "a, b, ..., h, i" of this description data is identical or similar to the verification data of a preset spam, the new mail can be regarded as spam. Similarly, if the middle "e, q, q, ..., q, 6, 0, 0" of the description data is identical or similar to the verification data of another spam, the new mail can be regarded as spam; and if the tail "f, ..., f, 5" is identical or similar to the verification data of yet another preset spam, the new mail can be regarded as spam.
Of course, as long as a piece of sub-description data composed of consecutive characters (a unit relative to the entire description data) in the description data of FIG. 8 is identical or similar to the verification data of a spam pre-entered in the database or server, the new mail to be determined can be determined as spam.
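The head, middle, and tail cases can be sketched as prefix, substring, and suffix tests. This is a simplified illustration: it covers only the "identical" case, since the text does not define "similar" at this point, the example strings are invented because FIG. 8 elides most characters, and the name `partial_match` is hypothetical.

```python
def partial_match(description, verification):
    """Report where the verification data occurs inside the description data.

    Returns "head", "tail", "middle", or None when there is no exact occurrence.
    """
    if not verification:
        return None
    if description.startswith(verification):
        return "head"
    if description.endswith(verification):
        return "tail"
    if verification in description:
        return "middle"
    return None


# Invented description data; the real sequence of FIG. 8 is not fully given.
DESCRIPTION = "abhieqq600ff5"

print(partial_match(DESCRIPTION, "abhi"))  # head
print(partial_match(DESCRIPTION, "qq60"))  # middle
print(partial_match(DESCRIPTION, "ff5"))   # tail
```

Checking the head and tail first keeps the common cases cheap; the substring test then covers any continuous sub-description data.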
In this way, the various kinds of interference that spammers deliberately add can be overcome effectively when spam is determined with the method of the present invention. While determination accuracy is maintained, spam that disguises or encapsulates interference information is identified and determined as accurately as possible, and the computing overhead of the back-end server or web search engine is significantly reduced.
In this embodiment, the tag description table comprises several records, each record consisting of verification data and the length information of that verification data. If the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table is equal to the length information of the verification data, the new mail is determined as spam.
Specifically, referring to FIG. 2 to FIG. 5, the verification data of the tags of the HTML code of the preset spam (i.e. "f, 1, f, a, f") has a length of 5, and the description data of the tags of the HTML code of the new mail shown in FIG. 4 and FIG. 5 (i.e. "f, 1, f, a, f") also has a length of 5. By comparing the length information of the verification data and the description data, the new mail shown in FIG. 4 can therefore be determined as spam. In this comparison, the computer only needs to compare description data and verification data of very few bytes, which greatly reduces the computing overhead and improves the efficiency of spam determination.
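A record that pairs verification data with its length can be sketched as follows. The names `Record`, `make_record`, and `length_hits` are hypothetical. The text does not spell out how the length test is combined with the character comparison, so the sketch shows only the length test itself.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One record: verification data plus its length information."""
    verification: str
    length: int


def make_record(verification):
    """Build a record, storing the length so it need not be recomputed per mail."""
    return Record(verification, len(verification))


def length_hits(description, records):
    """Return the records whose stored length equals the description data length."""
    return [r for r in records if r.length == len(description)]


RECORDS = [make_record("f1faf"), make_record("fff")]

print(length_hits("f1faf", RECORDS))  # only the record of length 5 is returned
```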
In this embodiment, the method for identifying spam by HTML tags further comprises: the length information of the verification data has a value range [m, n]; if the length information of the description data obtained by translating the tags in the HTML code of the new mail with the tag description table lies within the value range of the length information of the verification data, the new mail is determined as spam. The upper limit n of the value range of the length information of the verification data is 100, and the lower limit m is 20.
It should be noted that if a mail is non-spam, the description data obtained by translating its HTML code with the tag description table of Table 1 cannot be too long. Therefore, in view of practical conditions, the upper limit n of the length information of the verification data is 100 in this embodiment; if it were longer, the process of scanning the verification data with the description data would incur excessive computing overhead. Likewise, the lower limit m of the length information of the verification data cannot be too short; if it were, new mail would be misjudged and non-spam would be determined as spam.
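The range test with the stated bounds m = 20 and n = 100 can be sketched as a closed-interval check. The function name `length_in_range` is hypothetical, and treating both bounds as inclusive follows the bracket notation [m, n].

```python
M_LOWER = 20   # lower limit m from the text
N_UPPER = 100  # upper limit n from the text


def length_in_range(description, m=M_LOWER, n=N_UPPER):
    """Return True when the description data length lies in the closed range [m, n]."""
    return m <= len(description) <= n


print(length_in_range("f" * 20))   # True, lower bound is inclusive
print(length_in_range("f" * 101))  # False, longer than the upper limit
```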
Referring again to FIG. 8, to further improve the efficiency of identifying new mail, the method of the present invention further comprises determining as spam the new mail whose description data has a degree of coincidence with the verification data higher than 80%. Specifically, the description data formed by translating the tags in the HTML code of the new mail can be compared as a whole with the whole of the verification data of the pre-entered spam, and the new mail whose description data has an overall degree of coincidence higher than 80% is determined as spam. Alternatively, the verification data of the pre-entered spam can be used as a scanning unit to scan longer description data; if a segment of the description data, or 80% of the characters of a segment of the description data, matches or hits the verification data of the pre-entered spam, the new mail is determined as spam.
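The text does not define how the overall degree of coincidence is computed. As one possible stand-in, the sketch below uses the similarity ratio of `difflib.SequenceMatcher` from the Python standard library; this choice of metric is an assumption of the example, not the method of the specification, and the name `overall_coincidence_hit` is hypothetical.

```python
from difflib import SequenceMatcher


def overall_coincidence_hit(description, verification, threshold=0.8):
    """Compare both strings as a whole and test the ratio against the threshold.

    SequenceMatcher.ratio() returns 2 * M / T, where M is the number of matched
    characters and T is the total length of both strings. It stands in for the
    undefined "degree of coincidence" (an assumption of this sketch).
    """
    ratio = SequenceMatcher(None, description, verification).ratio()
    return ratio > threshold


print(overall_coincidence_hit("f1faf", "f1faf"))  # identical strings -> True
print(overall_coincidence_hit("fff", "f1faf"))    # three of eight characters pair up -> False
```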
Refer to FIG. 9. In this embodiment, the method further comprises scanning the verification data with the description data of the new mail; if the number of characters contained in the segments where the description data and the verification data coincide is higher than 80% of the number of characters contained in the description data, the new mail is determined as spam; wherein the length of each segment where the description data and the verification data coincide is greater than or equal to 10 characters.
In this case, the three parts of the description data marked with dashed boxes in FIG. 9 are the segments (i.e. "sub-description data") that coincide with the verification data. Each segment contains at least 10 characters, and all characters have the same number of bytes. In this specification, to simplify the presentation, the byte length of a character is uniformly set to 1 standard byte. The length and order of the characters in each of the three segments are consistent with those of the corresponding characters in the verification data of the pre-entered spam.
Assume that the verification data contains 100 characters in a specific order. When the local mail server receives a new mail to be determined, the description data obtained by translation with the tag description table has a length of 90 characters. The verification data is scanned sequentially with the description data. If a run of characters in the description data (at least 10 characters) hits a continuous segment in the verification data, and the number of characters in the segments of the description data that hit the verification data coincides to 80% or more with the number of characters in the verification data of the pre-entered spam, the new mail to be determined can be determined as spam. This improves the accuracy of spam determination and prevents misjudgment while keeping the computing overhead low.
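The scan can be sketched as a greedy search for coinciding segments. This is one possible reading, not the definitive procedure: the sketch measures the covered share against the length of the description data, as in the statement accompanying FIG. 9, it applies the threshold as "80% or more" as in the paragraph above, and the greedy strategy and the names `coincident_coverage` and `is_spam_by_scan` are hypothetical.

```python
def coincident_coverage(description, verification, min_len=10):
    """Share of description characters lying in segments that occur in verification.

    Scans the description data from left to right. At each position the longest
    run that also occurs contiguously in the verification data is taken; runs
    shorter than min_len characters are ignored.
    """
    total = len(description)
    covered = 0
    i = 0
    while i < total:
        length = 0
        # Extend the run while the candidate still occurs in the verification data.
        while i + length < total and description[i:i + length + 1] in verification:
            length += 1
        if length >= min_len:
            covered += length
            i += length
        else:
            i += 1
    return covered / total if total else 0.0


def is_spam_by_scan(description, verification, threshold=0.8, min_len=10):
    """Apply the threshold to the covered share ("80% or more" in this sketch)."""
    return coincident_coverage(description, verification, min_len) >= threshold


# Invented data: two coinciding segments (12 and 10 characters) in 24 characters.
VERIFICATION = "abcdefghijklmnopqrstuvwxyz0123456789"
DESCRIPTION = "abcdefghijkl" + "Z" + "0123456789" + "Z"

print(is_spam_by_scan(DESCRIPTION, VERIFICATION))  # 22 of 24 characters covered -> True
```

The substring test inside the loop is simple rather than fast; with description data capped at around 100 single-byte characters, as the text suggests, the cost stays small.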
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its scope of protection; all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall be included in its scope of protection.
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments are therefore to be regarded in all respects as illustrative and not restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are intended to be embraced by the present invention. No reference sign in the claims should be construed as limiting the claim concerned.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity; those skilled in the art should consider the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.