CN105323153A

CN105323153A - Spam mail filtering method and device

Info

Publication number: CN105323153A
Application number: CN201510794358.0A
Authority: CN
Inventors: 周鑫
Original assignee: TCL Corp
Current assignee: TCL Corp
Priority date: 2015-11-18
Filing date: 2015-11-18
Publication date: 2016-02-10

Abstract

The invention belongs to the field of information filtering and provides a spam mail filtering method and device. The method comprises: after a new mail is received, obtaining the mail content of the new mail; processing the obtained mail content into character strings of preset categories; determining the text similarity between the mail content and a preset initial cluster center according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center; and judging whether the new mail is spam according to the determined text similarity and a preset threshold, and deciding whether to filter the new mail according to the judgment result. Embodiments of the invention can improve the accuracy of spam mail filtering.

Description

Spam mail filtering method and device

Technical field

The embodiments of the present invention belong to the field of information filtering, and in particular relate to a spam mail filtering method and device.

Background

Text clustering groups semantically similar texts together. Before processing text data, a traditional data mining method must first represent the text in a form that a computer can process and that reflects the essential features of the text. It then uses term frequency-inverse document frequency (TF-IDF) to convert each document into a vector, and finally calculates text similarity in the vector space model by a text clustering method. A TF-IDF-based vector space model does not consider conceptual similarity between words, which reduces the accuracy of data clustering. Furthermore, existing methods have difficulty identifying the normal mail information or keywords that spammers mix into the mail content, and therefore have difficulty filtering out spam accurately.

Summary of the invention

Embodiments of the present invention provide a spam mail filtering method and device, intended to solve the problem that existing methods have difficulty filtering out spam accurately.

The embodiments of the present invention are implemented as follows. A spam mail filtering method comprises:

after a new mail is received, obtaining the mail content of the new mail;

processing the obtained mail content into character strings of preset categories;

determining the text similarity between the mail content and a preset initial cluster center according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center;

judging whether the new mail is spam according to the determined text similarity and a preset threshold, and deciding whether to filter the new mail according to the judgment result.

Another object of the embodiments of the present invention is to provide a spam mail filtering device, comprising:

a mail content acquiring unit, configured to obtain the mail content of a new mail after the new mail is received;

a mail content preprocessing unit, configured to process the obtained mail content into character strings of preset categories;

a text similarity determining unit, configured to determine the text similarity between the mail content and a preset initial cluster center according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center;

a spam judging unit, configured to judge whether the new mail is spam according to the determined text similarity and a preset threshold, and to decide whether to filter the new mail according to the judgment result.

In the embodiments of the present invention, because the obtained mail content is processed into character strings of preset categories, the length of the mail content is shortened and the number of comparisons on the mail content is reduced, which increases the speed of mail filtering. Furthermore, because the complete mail content is retained, the guidance for clustering is preserved, which improves the accuracy of spam filtering.

Brief description of the drawings

Fig. 1 is a flow chart of a spam mail filtering method provided by the first embodiment of the present invention;

Fig. 2 is a structure diagram of a spam mail filtering device provided by the second embodiment of the present invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.

In the embodiments of the present invention, after a new mail is received, the mail content of the new mail is obtained; the obtained mail content is processed into character strings of preset categories; the text similarity between the mail content and a preset initial cluster center is determined according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center; and whether the new mail is spam is judged according to the determined text similarity and a preset threshold, so that whether to filter the new mail is decided according to the judgment result.

The technical solutions of the present invention are described below through specific embodiments.

Embodiment one:

Fig. 1 shows a flow chart of a spam mail filtering method provided by the first embodiment of the present invention, detailed as follows:

Step S11: after a new mail is received, obtain the mail content of the new mail.

In this step, when a new mail is received, the new mail is decoded into normal text content, and the mail content is then obtained from the decoded mail. The mail content includes the body text, keywords, attachments, and so on.

Step S12: process the obtained mail content into character strings of preset categories.

The character strings of preset categories comprise Chinese character strings, English character strings, and strings of other characters. Note that when the mail content contains digits, the digits are assigned to the "English character string" category.

In this step, suppose the mail content is a Chinese message rendered in translation as "⊙ is multiple: 55 please excuse me if any bothering! 2". After processing, the mail content becomes the sequence "⊙", "answering", ":", "55", "as", "having", "beating", "disturbing", "asking", "opinion", "forgiving", "!", "2", where each English word is the gloss of a single Chinese character. Here "⊙", ":", and "!" are assigned to the "other characters" category; "answering", "as", "having", "beating", "disturbing", "asking", "opinion", and "forgiving" are assigned to the "Chinese character" category; and "55" and "2" are assigned to the "English character" category.
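The categorization in step S12 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described by the source: the names `classify` and `tokenize`, the category labels, the use of the CJK Unified Ideographs range U+4E00 to U+9FFF to detect Chinese characters, the grouping of consecutive ASCII letters and digits into one token, and the dropping of whitespace are all assumptions chosen to match the shape of the example above (digits such as "55" form one token, each Chinese character is its own token).

```python
# Hypothetical category labels; the source only names the three classes.
CHINESE, ENGLISH, OTHER = "chinese", "english", "other"


def classify(ch: str) -> str:
    """Assign a single character to one of the three preset categories."""
    if "\u4e00" <= ch <= "\u9fff":          # assumption: basic CJK block only
        return CHINESE
    if ch.isascii() and ch.isalnum():       # digits count as "English"
        return ENGLISH
    return OTHER


def tokenize(content: str) -> list[tuple[str, str]]:
    """Split mail content into (token, category) pairs."""
    tokens: list[tuple[str, str]] = []
    run = ""                                # current run of ASCII letters/digits
    for ch in content:
        cat = None if ch.isspace() else classify(ch)
        if cat == ENGLISH:
            run += ch                       # keep extending the current run
            continue
        if run:                             # any other character ends the run
            tokens.append((run, ENGLISH))
            run = ""
        if cat is not None:                 # whitespace is dropped (assumption)
            tokens.append((ch, cat))
    if run:
        tokens.append((run, ENGLISH))
    return tokens
```

Calling `tokenize` on a mixed string returns the tokens in their original order, so the sequence can be passed directly to the similarity step.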

Step S13: determine the text similarity between the mail content and the preset initial cluster center according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center.

The preset gap penalty value is negative. Its specific value is set as required, for example -1 or -2; other values may also be used, and no limitation is imposed here.

The data of the initial cluster center comprise a character string and its length. Specifically, determining the text similarity between the mail content and the preset initial cluster center according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center comprises:

A1. Determine the highest score between the character string obtained by processing and the character string of the preset initial cluster center, according to the preset gap penalty value and the character similarity value. Specifically:

A11. Initialize the first row and first column of the backtracking matrix according to the following formulas: $F_{0,j} = d \times j$, where $d$ is the preset gap penalty value and $0 \le j \le (\text{length of the mail content} - 1)$ or $0 \le j \le (\text{length of the preset initial cluster center} - 1)$; and $F_{i,0} = d \times i$, where $0 \le i \le (\text{length of the mail content} - 1)$ or $0 \le i \le (\text{length of the preset initial cluster center} - 1)$. Note that if $j$ is bounded by (length of the preset initial cluster center - 1), then $i$ is bounded by (length of the mail content - 1). The character string of the preset initial cluster center here is a manually selected character string that serves as spam.

A12. Determine the other rows and columns of the backtracking matrix according to the following formula: $F_{i,j} = \max\left(F_{i-1,j-1} + \mathrm{sim}(T_i, P_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d\right)$, where $\mathrm{sim}(T_i, P_j)$ is the character similarity value of $T_i$ and $P_j$, and take the maximum $F_{i,j}$ as the highest score between the character string obtained by processing and the character string of the preset initial cluster center. Note that $T_i$ and $P_j$ may belong to the same category or to different categories. When both belong to the same category, $\mathrm{sim}(T_i, P_j)$ may be defined as 1 (or another value greater than 0) if they match, and as 0 (or another value less than 0) if they do not. When $T_i$ and $P_j$ belong to different categories, they necessarily do not match. In this step, the maximum $F_{i,j}$ is the value of the last cell of the backtracking matrix; to save work, the value of the last cell can be used directly as the highest score once it has been computed.
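The recurrence in A11 and A12 has the same form as the Needleman-Wunsch global alignment recurrence, and a minimal Python sketch of it follows. Several details are assumptions rather than statements from the source: the function names `char_similarity` and `alignment_score`, the defaults (gap penalty -1, match 1, mismatch 0), and the use of a matrix with one extra border row and column (size (m+1) by (n+1)), which is the conventional layout and differs from the 0 to length-1 index range given in the text.

```python
def char_similarity(t, p, match=1, mismatch=0):
    """sim(T_i, P_j): tokens of different categories never compare equal."""
    return match if t == p else mismatch


def alignment_score(mail, center, gap=-1, sim=char_similarity):
    """Fill the backtracking matrix F and return its last cell.

    mail, center: sequences of tokens (plain characters or
    (token, category) pairs); gap: the negative gap penalty d.
    """
    rows, cols = len(mail) + 1, len(center) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(rows):                   # first column: F[i][0] = d * i
        F[i][0] = gap * i
    for j in range(cols):                   # first row: F[0][j] = d * j
        F[0][j] = gap * j
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + sim(mail[i - 1], center[j - 1]),  # align
                F[i][j - 1] + gap,          # gap in the mail content
                F[i - 1][j] + gap,          # gap in the cluster center
            )
    return F[-1][-1]                        # last cell, used as the score
```

Only the last cell is returned because step A12 uses that value directly; a full traceback of the alignment is not needed for the similarity calculation.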

A2. Calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score, the length of the mail content, and the length of the preset initial cluster center. Specifically:

A21. Determine the larger of the length of the mail content and the length of the preset initial cluster center.

A22. Calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score and the determined larger value. Specifically, when $\mathrm{sim}(T_i, P_j)$ is defined as 1 if $T_i$ and $P_j$ match and as 0 if they do not, the text similarity is calculated as $\mathrm{SIM} = \text{highest score} / \text{larger value}$. This normalizes the text similarity (SIM) between the mail content and the preset initial cluster center so that its value lies in [0, 1]. The closer SIM is to 1, the more similar the mail content is to the preset initial cluster center; conversely, the further SIM is from 1, the less similar they are. If $\mathrm{sim}(T_i, P_j)$ for a match is defined as a value other than 1, determine the multiple of that value relative to 1, denoted "M"; then $\mathrm{SIM} = \text{highest score} / (M \times \text{larger value})$, which ensures that the value of SIM lies in [0, 1].
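The normalization in A21 and A22 can be sketched as a small Python function. The name `text_similarity` and its parameters are hypothetical. Two behaviors are assumptions that the source does not state: because the gap penalty is negative, the raw score can be negative when the two strings share little, so the sketch clamps the result to [0, 1]; and when both lengths are zero it returns 0.0 rather than dividing by zero.

```python
def text_similarity(score, mail_len, center_len, match_value=1.0):
    """SIM = score / (M * max(len(mail), len(center))).

    match_value plays the role of M, the value sim() returns for a match.
    """
    if match_value <= 0:
        raise ValueError("match_value must be positive")
    longest = max(mail_len, center_len)     # step A21: the larger length
    if longest == 0:                        # assumption: empty vs empty
        return 0.0
    sim = score / (match_value * longest)   # step A22: normalize the score
    return max(0.0, min(1.0, sim))          # assumption: clamp to [0, 1]
```

With a match value of 1, two identical strings of length 3 score 3 and give a similarity of 1.0, which is the upper bound described in the text.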

Step S14: judge whether the new mail is spam according to the determined text similarity and the preset threshold, and decide whether to filter the new mail according to the judgment result.

Specifically, judging whether the new mail is spam according to the determined text similarity and the preset threshold, and deciding whether to filter the new mail according to the judgment result, comprises:

B1. Judge whether the determined text similarity is greater than the preset threshold. Supposing the preset threshold is M, judge whether SIM is greater than M.

B2. When the determined text similarity is greater than the preset threshold, judge that the new mail is spam and filter the new mail. Specifically, filtering the new mail means refusing to store it in the "inbox". The new mail can be deleted directly, or it can be stored in a spam folder so that the user can still view the mail if it was misjudged, which reduces the user's loss.

B3. When the determined text similarity is less than or equal to the preset threshold, judge that the new mail is not spam, and use the new mail as a new initial cluster center.
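Steps B1 to B3 can be sketched as follows. The `FilterState` class, its field names, and the `decide` function are hypothetical names introduced for illustration. The sketch stores spam in a spam folder instead of deleting it, which is one of the two options the text allows, and it adds non-spam mail to the list of cluster centers as B3 states.

```python
from dataclasses import dataclass, field


@dataclass
class FilterState:
    """Hypothetical container for the threshold and mail folders."""
    threshold: float
    centers: list = field(default_factory=list)      # initial cluster centers
    inbox: list = field(default_factory=list)
    spam_folder: list = field(default_factory=list)


def decide(state: FilterState, mail, sim: float) -> bool:
    """Return True if the mail is judged to be spam and was filtered."""
    if sim > state.threshold:               # B1 and B2: strictly greater
        state.spam_folder.append(mail)      # kept so the user can review it
        return True
    state.inbox.append(mail)                # B3: less than or equal
    state.centers.append(mail)              # becomes a new cluster center
    return False
```

A similarity exactly equal to the threshold is treated as non-spam, matching the "less than or equal" condition in B3.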

Optionally, to reduce the load of mail filtering and speed it up, before obtaining the mail content of the new mail, the method comprises:

judging whether the received new mail is spam by means of a white list and/or a black list, and deciding whether to filter the new mail according to the judgment result.

Specifically, the white list stores the mail addresses of some users; when the mail address of a received mail is identical to a mail address in the white list, the mail is judged not to be spam. The black list stores IP addresses or mail addresses; when the mail address of a received mail is identical to a mail address in the black list, the mail is judged to be spam. Before judging by the white list and/or black list whether the received new mail is spam, the header of the received mail can also be analyzed to check the sender address and the recipient address; if the sender address or the recipient address does not exist, the mail is judged to be spam. Combining these checks with the mail filtering method above can improve the speed and accuracy of spam filtering.
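The optional pre-check can be sketched as a function that either reaches a verdict or defers to the content-based filter. The name `precheck`, its parameters, and the return values `"spam"`, `"ham"`, and `None` are hypothetical. Checking the white list before the black list is an assumption; the source does not say which list takes precedence when an address appears in both. The addresses in the usage note are reserved documentation values.

```python
from typing import Optional


def precheck(sender: Optional[str], recipient: Optional[str],
             whitelist: set, blacklist: set,
             sender_ip: Optional[str] = None) -> Optional[str]:
    """Return "spam", "ham", or None when the lists give no verdict."""
    if not sender or not recipient:         # header check: missing address
        return "spam"
    if sender in whitelist:                 # assumption: white list wins
        return "ham"
    if sender in blacklist:
        return "spam"
    if sender_ip is not None and sender_ip in blacklist:
        return "spam"                       # black list may hold IP addresses
    return None                             # fall through to steps S11 to S14
```

For example, `precheck("alice@example.com", "bob@example.com", {"alice@example.com"}, set())` yields a non-spam verdict without running the similarity calculation, which is where the speed gain described above comes from.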

In the first embodiment of the present invention, after a new mail is received, the mail content of the new mail is obtained; the obtained mail content is processed into character strings of preset categories; the text similarity between the mail content and the preset initial cluster center is determined according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center; and whether the new mail is spam is judged according to the determined text similarity and the preset threshold, so that whether to filter the new mail is decided according to the judgment result. Because the obtained mail content is processed into character strings of preset categories, the length of the mail content is shortened and the number of comparisons on the mail content is reduced, which increases the speed of mail filtering. Furthermore, because the complete mail content is retained, the guidance for clustering is preserved, which improves the accuracy of spam filtering.

It should be understood that, in the embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution. The order of execution of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present invention in any way.

Embodiment two:

Fig. 2 shows a structure diagram of a spam mail filtering device provided by the second embodiment of the present invention. The device can be applied to various terminals. A terminal may include user equipment that communicates with one or more core networks via a radio access network (RAN). The user equipment may be a mobile phone (also called a "cellular" phone), a computer with a mobile device, and so on; for example, the user equipment may also be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device that exchanges voice and/or data with the radio access network. As another example, the mobile device may include a smartphone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, or a vehicle-mounted computer. For ease of description, only the parts relevant to the embodiment of the present invention are shown.

The spam mail filtering device comprises: a mail content acquiring unit 21, a mail content preprocessing unit 22, a text similarity determining unit 23, and a spam judging unit 24. Specifically:

The mail content acquiring unit 21 is configured to obtain the mail content of a new mail after the new mail is received.

Specifically, when a new mail is received, the new mail is decoded into normal text content, and the mail content is then obtained from the decoded mail. The mail content includes the body text, keywords, attachments, and so on.

The mail content preprocessing unit 22 is configured to process the obtained mail content into character strings of preset categories.

The text similarity determining unit 23 is configured to determine the text similarity between the mail content and the preset initial cluster center according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center.

Optionally, the data of the initial cluster center comprise a character string and its length, and the text similarity determining unit 23 comprises:

A mail content matching score determining unit, configured to determine the highest score between the character string obtained by processing and the character string of the preset initial cluster center according to the preset gap penalty value and the character similarity value. Specifically, the mail content matching score determining unit comprises a backtracking matrix initialization module and a module for determining the other rows and columns of the backtracking matrix.

The backtracking matrix initialization module is configured to initialize the first row and first column of the backtracking matrix according to the following formulas: $F_{0,j} = d \times j$, where $d$ is the preset gap penalty value and $0 \le j \le (\text{length of the mail content} - 1)$ or $0 \le j \le (\text{length of the preset initial cluster center} - 1)$; and $F_{i,0} = d \times i$, where $0 \le i \le (\text{length of the mail content} - 1)$ or $0 \le i \le (\text{length of the preset initial cluster center} - 1)$. Note that if $j$ is bounded by (length of the preset initial cluster center - 1), then $i$ is bounded by (length of the mail content - 1). The character string of the preset initial cluster center here is a manually selected character string that serves as spam.

The module for determining the other rows and columns of the backtracking matrix is configured to determine them according to the following formula: $F_{i,j} = \max\left(F_{i-1,j-1} + \mathrm{sim}(T_i, P_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d\right)$, where $\mathrm{sim}(T_i, P_j)$ is the character similarity value of $T_i$ and $P_j$, and to take the maximum $F_{i,j}$ as the highest score between the character string obtained by processing and the character string of the preset initial cluster center. Note that $T_i$ and $P_j$ may belong to the same category or to different categories. When both belong to the same category, $\mathrm{sim}(T_i, P_j)$ may be defined as 1 (or another value greater than 0) if they match, and as 0 (or another value less than 0) if they do not.

A mail content similarity calculating unit, configured to calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score, the length of the mail content, and the length of the preset initial cluster center. Specifically, the mail content similarity calculating unit comprises a mail content length comparison module and a text similarity calculating module. The mail content length comparison module is configured to determine the larger of the length of the mail content and the length of the preset initial cluster center. The text similarity calculating module is configured to calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score and the determined larger value. Specifically, when $\mathrm{sim}(T_i, P_j)$ is defined as 1 if $T_i$ and $P_j$ match and as 0 if they do not, the text similarity is calculated as $\mathrm{SIM} = \text{highest score} / \text{larger value}$. This normalizes the text similarity (SIM) between the mail content and the preset initial cluster center so that its value lies in [0, 1]. The closer SIM is to 1, the more similar the mail content is to the preset initial cluster center; conversely, the further SIM is from 1, the less similar they are. If $\mathrm{sim}(T_i, P_j)$ for a match is defined as a value other than 1, determine the multiple of that value relative to 1, denoted "M"; then $\mathrm{SIM} = \text{highest score} / (M \times \text{larger value})$, which ensures that the value of SIM lies in [0, 1].

The spam judging unit 24 is configured to judge whether the new mail is spam according to the determined text similarity and the preset threshold, and to decide whether to filter the new mail according to the judgment result.

Optionally, the spam judging unit 24 comprises:

A text similarity comparison module, configured to judge whether the determined text similarity is greater than the preset threshold.

A spam determination module, configured to judge that the new mail is spam and filter the new mail when the determined text similarity is greater than the preset threshold. Specifically, filtering the new mail means refusing to store it in the "inbox". The new mail can be deleted directly, or it can be stored in a spam folder so that the user can still view the mail if it was misjudged, which reduces the user's loss.

A non-spam processing module, configured to judge that the new mail is not spam, and to use the new mail as a new initial cluster center, when the determined text similarity is less than or equal to the preset threshold.

Optionally, to reduce the load of mail filtering and speed it up, the spam mail filtering device comprises:

In the second embodiment of the present invention, because the obtained mail content is processed into character strings of preset categories, the length of the mail content is shortened and the number of comparisons on the mail content is reduced, which increases the speed of mail filtering. Furthermore, because the complete mail content is retained, the guidance for clustering is preserved, which improves the accuracy of spam filtering.

A person of ordinary skill in the art will recognize that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may implement the described functions differently for each specific application, but such implementations should not be considered beyond the scope of the present invention.

A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiment and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiment described above is only illustrative. The division into units is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a portable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited to them. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A spam mail filtering method, characterized in that the method comprises:

after a new mail is received, obtaining the mail content of the new mail;

2. The method according to claim 1, characterized in that the data of the initial cluster center comprise a character string and its length, and determining the text similarity between the mail content and the preset initial cluster center according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center specifically comprises:

determining the highest score between the character string obtained by processing and the character string of the preset initial cluster center according to the preset gap penalty value and the character similarity value;

calculating the text similarity between the mail content and the preset initial cluster center according to the determined highest score, the length of the mail content, and the length of the preset initial cluster center.

3. The method according to claim 2, characterized in that determining the highest score between the character string obtained by processing and the character string of the preset initial cluster center according to the preset gap penalty value and the character similarity value specifically comprises:

initializing the first row and first column of the backtracking matrix according to the following formulas: $F_{0,j} = d \times j$, where $d$ is the preset gap penalty value and $0 \le j \le (\text{length of the mail content} - 1)$ or $0 \le j \le (\text{length of the preset initial cluster center} - 1)$; and $F_{i,0} = d \times i$, where $0 \le i \le (\text{length of the mail content} - 1)$ or $0 \le i \le (\text{length of the preset initial cluster center} - 1)$;

determining the other rows and columns of the backtracking matrix according to the following formula: $F_{i,j} = \max\left(F_{i-1,j-1} + \mathrm{sim}(T_i, P_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d\right)$, where $\mathrm{sim}(T_i, P_j)$ is the character similarity value of $T_i$ and $P_j$, and taking the maximum $F_{i,j}$ as the highest score between the character string obtained by processing and the character string of the preset initial cluster center.

4. The method according to claim 2, characterized in that calculating the text similarity between the mail content and the preset initial cluster center according to the determined highest score, the length of the mail content, and the length of the preset initial cluster center specifically comprises:

determining the larger of the length of the mail content and the length of the preset initial cluster center;

calculating the text similarity between the mail content and the preset initial cluster center according to the determined highest score and the determined larger value.

5. The method according to claim 1, characterized in that judging whether the new mail is spam according to the determined text similarity and the preset threshold, and deciding whether to filter the new mail according to the judgment result, specifically comprises:

judging whether the determined text similarity is greater than the preset threshold;

when the determined text similarity is greater than the preset threshold, judging that the new mail is spam, and filtering the new mail;

when the determined text similarity is less than or equal to the preset threshold, judging that the new mail is not spam, and using the new mail as a new initial cluster center.

6. The method according to claim 1, characterized in that, before obtaining the mail content of the new mail, the method comprises:

7. A spam mail filtering device, characterized in that the device comprises:

8. The device according to claim 7, characterized in that the data of the initial cluster center comprise a character string and its length, and the text similarity determining unit comprises:

a mail content matching score determining unit, configured to determine the highest score between the character string obtained by processing and the character string of the preset initial cluster center according to the preset gap penalty value and the character similarity value;

a mail content similarity calculating unit, configured to calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score, the length of the mail content, and the length of the preset initial cluster center.

9. The device according to claim 8, characterized in that the mail content matching score determining unit specifically comprises:

a backtracking matrix initialization module, configured to initialize the first row and first column of the backtracking matrix according to the following formulas: $F_{0,j} = d \times j$, where $d$ is the preset gap penalty value and $0 \le j \le (\text{length of the mail content} - 1)$ or $0 \le j \le (\text{length of the preset initial cluster center} - 1)$; and $F_{i,0} = d \times i$, where $0 \le i \le (\text{length of the mail content} - 1)$ or $0 \le i \le (\text{length of the preset initial cluster center} - 1)$;

a module for determining the other rows and columns of the backtracking matrix, configured to determine them according to the following formula: $F_{i,j} = \max\left(F_{i-1,j-1} + \mathrm{sim}(T_i, P_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d\right)$, where $\mathrm{sim}(T_i, P_j)$ is the character similarity value of $T_i$ and $P_j$, and to take the maximum $F_{i,j}$ as the highest score between the character string obtained by processing and the character string of the preset initial cluster center.

10. The device according to claim 7, characterized in that the mail content similarity calculating unit comprises:

a mail content length comparison module, configured to determine the larger of the length of the mail content and the length of the preset initial cluster center;

a text similarity calculating module, configured to calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score and the determined larger value.

11. The device according to claim 7, characterized in that the spam judging unit comprises:

a text similarity comparison module, configured to judge whether the determined text similarity is greater than the preset threshold;

a spam determination module, configured to judge that the new mail is spam and filter the new mail when the determined text similarity is greater than the preset threshold;

12. The device according to claim 7, characterized in that the device comprises: