CN105323153A - Spam mail filtering method and device - Google Patents

Spam mail filtering method and device

Info

Publication number
CN105323153A
Authority
CN
China
Prior art keywords
mail
cluster center
initial cluster
length
mail contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510794358.0A
Other languages
Chinese (zh)
Inventor
周鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Corp
Original Assignee
TCL Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TCL Corp
Priority to CN201510794358.0A
Publication of CN105323153A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention belongs to the field of information filtering and provides a spam mail filtering method and device. The method comprises: after receiving a new mail, obtaining the mail content of the new mail; processing the obtained mail content into character strings of preset categories; determining the text similarity between the mail content and a preset initial cluster center according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center; and judging whether the new mail is spam according to the determined text similarity and a preset threshold, then deciding whether to filter the new mail according to the judgment result. The embodiments of the invention can improve the accuracy of spam mail filtering.

Description

Spam mail filtering method and device
Technical field
The embodiments of the present invention belong to the field of information filtering, and in particular relate to a spam mail filtering method and device.
Background
Text clustering groups semantically similar texts together. Before processing text data, traditional data mining methods must first represent the text in a form that a computer can process and that reflects the essential features of the text. They then use term frequency-inverse document frequency (TF-IDF) to convert each document into a vector, and finally compute text similarity in the vector space model with a text clustering method. Because a TF-IDF-based vector space model does not take conceptual similarity between words into account, the accuracy of the clustering suffers. Furthermore, existing methods have difficulty recognizing the normal mail information or keywords that spammers blend into the mail content, and therefore have difficulty filtering out spam accurately.
Summary of the invention
The embodiments of the present invention provide a spam mail filtering method and device, intended to solve the problem that existing methods have difficulty filtering out spam accurately.
An embodiment of the present invention is realized as a spam mail filtering method, the method comprising:
after a new mail is received, obtaining the mail content of the new mail;
processing the obtained mail content into character strings of preset categories;
determining the text similarity between the mail content and a preset initial cluster center according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center;
judging whether the new mail is spam according to the determined text similarity and a preset threshold, and deciding whether to filter the new mail according to the judgment result.
Another object of the embodiments of the present invention is to provide a spam mail filtering device, the device comprising:
a mail content acquiring unit, configured to obtain the mail content of a new mail after the new mail is received;
a mail content preprocessing unit, configured to process the obtained mail content into character strings of preset categories;
a text similarity determining unit, configured to determine the text similarity between the mail content and a preset initial cluster center according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center;
a spam judging unit, configured to judge whether the new mail is spam according to the determined text similarity and a preset threshold, and to decide whether to filter the new mail according to the judgment result.
In the embodiments of the present invention, because the obtained mail content is processed into character strings of preset categories, the length of the mail content is shortened and the number of comparisons over the mail content is reduced, which speeds up mail filtering. Furthermore, because the complete mail content is retained, the information needed for clustering is preserved, which improves the accuracy of spam filtering.
Brief description of the drawings
Fig. 1 is a flow chart of a spam mail filtering method provided by the first embodiment of the present invention;
Fig. 2 is a structural diagram of a spam mail filtering device provided by the second embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
In the embodiments of the present invention, after a new mail is received, the mail content of the new mail is obtained; the obtained mail content is processed into character strings of preset categories; the text similarity between the mail content and a preset initial cluster center is determined according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center; and whether the new mail is spam is judged according to the determined text similarity and a preset threshold, so as to decide whether to filter the new mail according to the judgment result.
The technical solutions of the present invention are illustrated below through specific embodiments.
Embodiment one:
Fig. 1 shows a flow chart of a spam mail filtering method provided by the first embodiment of the present invention, detailed as follows:
Step S11: after a new mail is received, obtain the mail content of the new mail.
In this step, when a new mail is received, the new mail is decoded into normal text content, and the mail content is then obtained from the decoded mail. The mail content includes the body text, keywords, attachments, and so on.
Step S12: process the obtained mail content into character strings of preset categories.
The character strings of preset categories comprise Chinese character strings, English character strings, and strings of other characters. Note that when the mail content contains digits, the digits are classified into the "English character string" category.
In this step, suppose the mail content is a short Chinese message, given here in English gloss as "⊙ multiple: 55 please excuse me if any bothering! 2". After processing, the mail content becomes the sequence "⊙", "answering", ":", "55", "as", "having", "beating", "disturbing", "asking", "opinion", "forgiving", "!", "2", where each quoted English word stands for a single Chinese character of the original message. Here "⊙", ":", and "!" are classified as "other characters"; "answering", "as", "having", "beating", "disturbing", "asking", "opinion", and "forgiving" are classified as "Chinese characters"; and "55" and "2" are classified as "English characters".
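The categorization in step S12 can be sketched roughly as follows. This is a minimal sketch under assumptions that the source does not state: the function names `classify` and `tokenize`, the rule that a run of ASCII letters or digits forms one "english" token, and the Unicode range used to recognize a Chinese character are all hypothetical choices made for illustration.

```python
import re

# Assumption: a run of ASCII letters/digits is one token; any other
# non-space character (Chinese character or symbol) is its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|\S")


def classify(token):
    """Return the preset category of a single token."""
    if re.fullmatch(r"[A-Za-z0-9]+", token):
        return "english"   # digits are grouped with English, as in step S12
    if re.fullmatch(r"[\u4e00-\u9fff]", token):
        return "chinese"   # assumed range: CJK Unified Ideographs
    return "other"


def tokenize(content):
    """Split mail content into (token, category) pairs."""
    return [(m.group(), classify(m.group()))
            for m in TOKEN_PATTERN.finditer(content)]


if __name__ == "__main__":
    for token, category in tokenize("⊙复:55好!2"):
        print(token, category)
```

Grouping digit runs into a single token mirrors the way "55" is kept as one element in the example of step S12; whether English words are grouped the same way is an assumption.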
Step S13: determine the text similarity between the mail content and the preset initial cluster center according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center.
The preset gap penalty value is negative. Its specific value is set as required, for example -1 or -2; other values can also be used, and no limitation is imposed here.
The data of the initial cluster center comprise a character string and its length. Specifically, determining the text similarity between the mail content and the preset initial cluster center according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center comprises:
A1. Determine the highest score between the character string obtained by the processing and the character string of the preset initial cluster center, according to the preset gap penalty value and the character similarity value. Specifically:
A11. Initialize the first row and the first column of the backtracking matrix according to the following formulas: $F_{0,j} = d \times j$, where $d$ is the preset gap penalty value and $0 \le j \le (\text{length of the mail content} - 1)$ or $0 \le j \le (\text{length of the preset initial cluster center} - 1)$; and $F_{i,0} = d \times i$, where $0 \le i \le (\text{length of the mail content} - 1)$ or $0 \le i \le (\text{length of the preset initial cluster center} - 1)$. Note that if $j$ is bounded by (length of the preset initial cluster center - 1), then $i$ is bounded by (length of the mail content - 1). The character string of the preset initial cluster center is a manually selected character string that serves as a spam sample.
A12. Determine the remaining rows and columns of the backtracking matrix according to the following formula: $F_{i,j} = \max\bigl(F_{i-1,j-1} + \mathrm{sim}(T_i, P_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d\bigr)$, where $\mathrm{sim}(T_i, P_j)$ is the character similarity value of $T_i$ and $P_j$, and take the maximum $F_{i,j}$ as the highest score between the character string obtained by the processing and the character string of the preset initial cluster center. Note that $T_i$ and $P_j$ may belong to the same category or to different categories. When both belong to the same category, $\mathrm{sim}(T_i, P_j)$ may be defined as 1 (or another value greater than 0) if they match, and as 0 (or another value less than 0) if they do not match. When $T_i$ and $P_j$ belong to different categories, they necessarily do not match. In this step, the maximum $F_{i,j}$ is the value of the last cell of the backtracking matrix; to save work, once the value of the last cell has been computed, it can be taken directly as the highest score between the character string obtained by the processing and the character string of the preset initial cluster center.
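The recurrence in A11 and A12 has the shape of a global sequence alignment score. The sketch below is one way to compute it, not the patent's reference implementation. It assumes the common layout in which the matrix has one extra row and column for the empty prefix, so the indices run from 0 to the full string length; the function name `alignment_score` and the parameter names are hypothetical, and the defaults (gap penalty -1, match 1, mismatch 0) are taken from the example values mentioned in the text.

```python
def alignment_score(t, p, d=-1, match=1, mismatch=0):
    """Return the last cell of the score matrix for sequences t and p.

    t and p are sequences of tokens (for example the processed mail
    content and the cluster-center string); d is the negative gap penalty.
    """
    n, m = len(t), len(p)
    # F[i][j] holds the best score for the first i tokens of t
    # against the first j tokens of p.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        F[i][0] = d * i          # first column: F[i][0] = d * i
    for j in range(m + 1):
        F[0][j] = d * j          # first row: F[0][j] = d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = match if t[i - 1] == p[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + sim,   # align t[i-1] with p[j-1]
                F[i][j - 1] + d,         # gap in t
                F[i - 1][j] + d,         # gap in p
            )
    return F[n][m]


if __name__ == "__main__":
    print(alignment_score("abc", "abc"))
    print(alignment_score("abc", "ac"))
```

Tokens of different categories simply compare as unequal here, which matches the statement that characters of different categories never match.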
A2. Calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score, the length of the mail content, and the length of the preset initial cluster center. Specifically:
A21. Determine the larger of the length of the mail content and the length of the preset initial cluster center.
A22. Calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score and the determined larger value. Specifically, when $\mathrm{sim}(T_i, P_j)$ is defined as 1 if $T_i$ and $P_j$ match and as 0 if they do not, the text similarity is calculated as $\mathrm{SIM} = \text{highest score} / \text{larger value}$. This normalizes the text similarity (SIM) between the mail content and the preset initial cluster center so that its value lies in $[0, 1]$. The closer SIM is to 1, the more similar the mail content is to the preset initial cluster center; conversely, the closer SIM is to 0, the less similar they are. When the value of $\mathrm{sim}(T_i, P_j)$ for a match is defined as something other than 1, determine the multiple of that value relative to 1, denoted M; then $\mathrm{SIM} = \text{highest score} / (M \times \text{larger value})$, which keeps the value of SIM within $[0, 1]$.
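The normalization in A2 can be sketched as a small helper. The function name `text_similarity` and the parameter `match_value` (playing the role of the multiple M) are hypothetical. One further assumption is labeled in the code: because the gap penalty is negative, the raw ratio can drop below 0 when the two lengths differ greatly, so this sketch clamps the result at 0 to stay in the stated range [0, 1]. The source does not describe that clamp.

```python
def text_similarity(score, mail_length, center_length, match_value=1.0):
    """Normalize an alignment score by the longer of the two lengths.

    match_value is the score awarded to one matching pair (the multiple M
    in the text); with match_value=1 this is SIM = score / larger length.
    """
    larger = max(mail_length, center_length)
    if larger == 0:
        return 0.0  # assumption: two empty strings are treated as dissimilar
    sim = score / (match_value * larger)
    # Assumption, not from the source: clamp negative ratios to 0.
    return max(0.0, sim)


if __name__ == "__main__":
    print(text_similarity(3, 3, 3))
    print(text_similarity(6, 3, 3, match_value=2))
```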
Step S14: judge whether the new mail is spam according to the determined text similarity and the preset threshold, and decide whether to filter the new mail according to the judgment result.
Specifically, this step comprises:
B1. Judge whether the determined text similarity is greater than the preset threshold. Supposing the preset threshold is M (unrelated to the multiple M used for normalization in A2), judge whether SIM is greater than M.
B2. When the determined text similarity is greater than the preset threshold, judge the new mail to be spam and filter it. Specifically, filtering the new mail means refusing to store it in the inbox. The new mail can be deleted directly, or it can be stored in a spam folder so that, if the mail was misjudged, the user can still read it, which reduces the user's loss.
B3. When the determined text similarity is less than or equal to the preset threshold, judge that the new mail is not spam, and take the new mail as a new initial cluster center.
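Steps B1 to B3 amount to a single comparison followed by one of two actions. The sketch below illustrates that flow; the function name `judge_mail`, the string labels it returns, and the list `new_centers` used to record mails that become new initial cluster centers are hypothetical names introduced only for this example.

```python
def judge_mail(sim, threshold, mail_tokens, new_centers):
    """Apply the threshold test of steps B1 to B3.

    Returns "spam" when sim is strictly greater than the threshold.
    Otherwise the mail is recorded as a new initial cluster center,
    as B3 describes, and "not_spam" is returned.
    """
    if sim > threshold:
        # B2: the caller may delete the mail or move it to a spam folder.
        return "spam"
    # B3: similarity is less than or equal to the threshold.
    new_centers.append(mail_tokens)
    return "not_spam"


if __name__ == "__main__":
    centers = []
    print(judge_mail(0.9, 0.8, ["buy", "now"], centers))
    print(judge_mail(0.8, 0.8, ["hello"], centers))
    print(centers)
```

Returning a label rather than deleting the mail inside the function leaves the choice between deletion and a spam folder to the caller, which matches the two options listed in B2.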
Optionally, to lighten the filtering workload and speed up mail filtering, the method comprises, before obtaining the mail content of the new mail:
judging by means of a whitelist and/or a blacklist whether the received new mail is spam, and deciding whether to filter the new mail according to the judgment result.
Specifically, the whitelist stores the mail addresses of certain users; when the mail address corresponding to a received mail is identical to a mail address in the whitelist, the mail is judged not to be spam. The blacklist likewise stores IP addresses or mail addresses; when the mail address corresponding to a received mail is identical to a mail address in the blacklist, the mail is judged to be spam. Before the whitelist and/or blacklist check, the header of the received mail can also be analyzed to check the sender address and the recipient address; if the sender address or the recipient address does not exist, the mail is judged to be spam. Combining these checks with the mail filtering method described above can improve both the speed and the accuracy of spam filtering.
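The whitelist and blacklist check can be sketched as a pre-filter that runs before any content comparison. Everything named here is an assumption made for illustration: the function name `prefilter`, its return labels, the order of the checks (whitelist first), and the sample addresses, which use reserved example domains and a documentation IP range.

```python
def prefilter(sender, sender_ip, whitelist, blacklist):
    """Return "not_spam", "spam", or None when the lists give no verdict.

    whitelist holds trusted mail addresses; blacklist holds mail
    addresses and/or IP addresses.
    """
    if not sender:
        # Header check: a missing sender address is treated as spam.
        return "spam"
    if sender in whitelist:
        return "not_spam"
    if sender in blacklist or sender_ip in blacklist:
        return "spam"
    return None  # fall through to the content-based similarity check


if __name__ == "__main__":
    whitelist = {"alice@example.com"}
    blacklist = {"promo@example.net", "203.0.113.7"}
    print(prefilter("alice@example.com", "198.51.100.2", whitelist, blacklist))
    print(prefilter("bob@example.org", "203.0.113.7", whitelist, blacklist))
    print(prefilter("carol@example.org", "198.51.100.9", whitelist, blacklist))
```

A `None` result signals that the mail should continue to steps S11 to S14, so the more expensive alignment is only run for senders that neither list recognizes.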
In the first embodiment of the present invention, after a new mail is received, the mail content of the new mail is obtained; the obtained mail content is processed into character strings of preset categories; the text similarity between the mail content and the preset initial cluster center is determined according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center; and whether the new mail is spam is judged according to the determined text similarity and the preset threshold, so as to decide whether to filter the new mail according to the judgment result. Because the obtained mail content is processed into character strings of preset categories, the length of the mail content is shortened and the number of comparisons over the mail content is reduced, which speeds up mail filtering. Furthermore, because the complete mail content is retained, the information needed for clustering is preserved, which improves the accuracy of spam filtering.
It should be understood that, in the embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and the numbering should not impose any limitation on the implementation of the embodiments of the present invention.
Embodiment two:
Fig. 2 shows a structural diagram of a spam mail filtering device provided by the second embodiment of the present invention. The spam mail filtering device can be applied to various terminals. Such a terminal may include user equipment that communicates with one or more core networks through a radio access network (RAN). The user equipment may be a mobile phone (also called a "cellular" phone), a computer with a mobile terminal, and so on; for example, the user equipment may also be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device that exchanges voice and/or data with the radio access network. As further examples, the mobile device may include a smartphone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, or a vehicle-mounted computer. For ease of description, only the parts relevant to the embodiment of the present invention are shown.
The spam mail filtering device comprises a mail content acquiring unit 21, a mail content preprocessing unit 22, a text similarity determining unit 23, and a spam judging unit 24, wherein:
The mail content acquiring unit 21 is configured to obtain the mail content of a new mail after the new mail is received.
Specifically, when a new mail is received, the new mail is decoded into normal text content, and the mail content is then obtained from the decoded mail. The mail content includes the body text, keywords, attachments, and so on.
The mail content preprocessing unit 22 is configured to process the obtained mail content into character strings of preset categories.
The character strings of preset categories comprise Chinese character strings, English character strings, and strings of other characters. Note that when the mail content contains digits, the digits are classified into the "English character string" category.
The text similarity determining unit 23 is configured to determine the text similarity between the mail content and the preset initial cluster center according to the preset gap penalty value, the character similarity value, and the data of the preset initial cluster center.
The preset gap penalty value is negative. Its specific value is set as required, for example -1 or -2; other values can also be used, and no limitation is imposed here.
Optionally, the data of the initial cluster center comprise a character string and its length, and the text similarity determining unit 23 comprises:
A mail content matching score determining unit, configured to determine the highest score between the character string obtained by the processing and the character string of the preset initial cluster center according to the preset gap penalty value and the character similarity value. Specifically, the mail content matching score determining unit comprises a backtracking matrix initialization module and a module for determining the remaining rows and columns of the backtracking matrix. The backtracking matrix initialization module is configured to initialize the first row and the first column of the backtracking matrix according to the following formulas: $F_{0,j} = d \times j$, where $d$ is the preset gap penalty value and $0 \le j \le (\text{length of the mail content} - 1)$ or $0 \le j \le (\text{length of the preset initial cluster center} - 1)$; and $F_{i,0} = d \times i$, where $0 \le i \le (\text{length of the mail content} - 1)$ or $0 \le i \le (\text{length of the preset initial cluster center} - 1)$. Note that if $j$ is bounded by (length of the preset initial cluster center - 1), then $i$ is bounded by (length of the mail content - 1). The character string of the preset initial cluster center is a manually selected character string that serves as a spam sample. The module for determining the remaining rows and columns of the backtracking matrix is configured to determine them according to the following formula: $F_{i,j} = \max\bigl(F_{i-1,j-1} + \mathrm{sim}(T_i, P_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d\bigr)$, where $\mathrm{sim}(T_i, P_j)$ is the character similarity value of $T_i$ and $P_j$, and to take the maximum $F_{i,j}$ as the highest score between the character string obtained by the processing and the character string of the preset initial cluster center. Note that $T_i$ and $P_j$ may belong to the same category or to different categories. When both belong to the same category, $\mathrm{sim}(T_i, P_j)$ may be defined as 1 (or another value greater than 0) if they match, and as 0 (or another value less than 0) if they do not match.
A mail content similarity calculating unit, configured to calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score, the length of the mail content, and the length of the preset initial cluster center. Specifically, the mail content similarity calculating unit comprises a mail content length comparison module and a text similarity calculation module. The mail content length comparison module is configured to determine the larger of the length of the mail content and the length of the preset initial cluster center. The text similarity calculation module is configured to calculate the text similarity between the mail content and the preset initial cluster center according to the determined highest score and the determined larger value. Specifically, when $\mathrm{sim}(T_i, P_j)$ is defined as 1 if $T_i$ and $P_j$ match and as 0 if they do not, the text similarity is calculated as $\mathrm{SIM} = \text{highest score} / \text{larger value}$. This normalizes the text similarity (SIM) between the mail content and the preset initial cluster center so that its value lies in $[0, 1]$. The closer SIM is to 1, the more similar the mail content is to the preset initial cluster center; conversely, the closer SIM is to 0, the less similar they are. When the value of $\mathrm{sim}(T_i, P_j)$ for a match is defined as something other than 1, determine the multiple of that value relative to 1, denoted M; then $\mathrm{SIM} = \text{highest score} / (M \times \text{larger value})$, which keeps the value of SIM within $[0, 1]$.
The spam judging unit 24 is configured to judge whether the new mail is spam according to the determined text similarity and the preset threshold, and to decide whether to filter the new mail according to the judgment result.
Optionally, the spam judging unit 24 comprises:
A text similarity comparison module, configured to judge whether the determined text similarity is greater than the preset threshold.
A spam determination module, configured to judge the new mail to be spam and filter it when the determined text similarity is greater than the preset threshold. Specifically, filtering the new mail means refusing to store it in the inbox. The new mail can be deleted directly, or it can be stored in a spam folder so that, if the mail was misjudged, the user can still read it, which reduces the user's loss.
A non-spam mail processing module, configured to judge that the new mail is not spam and to take the new mail as a new initial cluster center when the determined text similarity is less than or equal to the preset threshold.
Optionally, to lighten the filtering workload and speed up mail filtering, the spam mail filtering device is further configured to:
judge by means of a whitelist and/or a blacklist whether the received new mail is spam, and decide whether to filter the new mail according to the judgment result.
Specifically, the whitelist stores the mail addresses of certain users; when the mail address corresponding to a received mail is identical to a mail address in the whitelist, the mail is judged not to be spam. The blacklist likewise stores IP addresses or mail addresses; when the mail address corresponding to a received mail is identical to a mail address in the blacklist, the mail is judged to be spam. Before the whitelist and/or blacklist check, the header of the received mail can also be analyzed to check the sender address and the recipient address; if the sender address or the recipient address does not exist, the mail is judged to be spam. Combining these checks with the mail filtering method described above can improve both the speed and the accuracy of spam filtering.
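The division of the device into units 21 to 24 can be pictured as one object with a method per unit. The following is only a sketch under stated assumptions: the class name `SpamFilterDevice`, its method names, the default threshold of 0.8, the tokenization rule, and the choice to compare a mail against every stored center and flag it if any comparison exceeds the threshold are all hypothetical and are not specified by the source.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SpamFilterDevice:
    centers: list                     # token lists of manually chosen spam samples
    threshold: float = 0.8            # preset threshold (assumed value)
    gap: int = -1                     # preset gap penalty d
    new_centers: list = field(default_factory=list)

    def acquire(self, raw_mail):
        # Unit 21: decoding is out of scope here; the text is passed through.
        return raw_mail

    def preprocess(self, content):
        # Unit 22: ASCII letter/digit runs form one token; any other
        # non-space character is its own token (assumed rule).
        return re.findall(r"[A-Za-z0-9]+|\S", content)

    def similarity(self, tokens, center):
        # Unit 23: alignment score (match 1, mismatch 0) divided by the
        # longer length, clamped at 0 as an assumption.
        n, m = len(tokens), len(center)
        if max(n, m) == 0:
            return 0.0
        prev = [self.gap * j for j in range(m + 1)]
        for i in range(1, n + 1):
            row = [self.gap * i]
            for j in range(1, m + 1):
                s = 1 if tokens[i - 1] == center[j - 1] else 0
                row.append(max(prev[j - 1] + s,
                               row[j - 1] + self.gap,
                               prev[j] + self.gap))
            prev = row
        return max(0.0, prev[m] / max(n, m))

    def judge(self, raw_mail):
        # Unit 24: True means the mail is treated as spam and filtered.
        tokens = self.preprocess(self.acquire(raw_mail))
        if any(self.similarity(tokens, c) > self.threshold for c in self.centers):
            return True
        self.new_centers.append(tokens)  # recorded as a new initial cluster center
        return False


if __name__ == "__main__":
    device = SpamFilterDevice(centers=[["buy", "cheap", "pills", "now"]])
    print(device.judge("buy cheap pills now"))
    print(device.judge("meeting at 10 tomorrow"))
```

Keeping only two rows of the matrix at a time is a memory-saving choice for the sketch; it yields the same last-cell value as the full matrix.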
In the second embodiment of the present invention, because the obtained mail content is processed into character strings of preset categories, the length of the mail content is shortened and the number of comparisons over the mail content is reduced, which speeds up mail filtering. Furthermore, because the complete mail content is retained, the information needed for clustering is preserved, which improves the accuracy of spam filtering.
Those of ordinary skill in the art will recognize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiment and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the device embodiment described above is only illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be realized through certain interfaces, and the indirect coupling or communication connections between devices or units may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or may be integrated two or more to a unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (12)

1. a rubbish mail filtering method, is characterized in that, described method comprises:
After receiving new mail, obtain the Mail Contents of described new mail;
The Mail Contents of acquisition is treated to the character string of pre-set categories;
The text similarity at described Mail Contents and described default initial cluster center is determined according to the data at the space penalty value preset, character Similarity value and default initial cluster center;
According to the text similarity determined and default threshold decision, whether new mail is spam, to judge whether according to judged result to filter described new mail.
2. method according to claim 1, it is characterized in that, the data at described initial cluster center comprise character string and length, the space penalty value that described basis is preset, the data at character Similarity value and default initial cluster center determine the text similarity at described Mail Contents and described default initial cluster center, specifically comprise:
The top score processing character string and the character string at the initial cluster center of presetting obtained is determined according to the space penalty value preset and character Similarity value;
The text similarity at Mail Contents and default initial cluster center according to the length computation at the length of the top score determined, Mail Contents, default initial cluster center.
3. The method according to claim 2, characterized in that determining, according to the preset gap penalty value and the character similarity value, the top score between the character string obtained by the processing and the character string of the preset initial cluster center specifically comprises:
initializing the first row and the first column of a backtracking matrix according to the following formulas: $F_{0,j} = d \times j$, where $d$ is the preset gap penalty value and $0 \le j \le$ (length of the mail content $- 1$), or $0 \le j \le$ (length of the preset initial cluster center $- 1$); $F_{i,0} = d \times i$, where $0 \le i \le$ (length of the mail content $- 1$), or $0 \le i \le$ (length of the preset initial cluster center $- 1$);
determining the remaining rows and columns of the backtracking matrix according to the following formula: $F_{i,j} = \max\left(F_{i-1,j-1} + \mathrm{sim}(T_i, P_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d\right)$, where $\mathrm{sim}(T_i, P_j)$ is the character similarity value of $T_i$ and $P_j$, and taking the maximum $F_{i,j}$ as the top score between the character string obtained by the processing and the character string of the preset initial cluster center.
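The recurrence in claim 3 has the same shape as a global sequence-alignment dynamic program. The following Python is a minimal sketch of it, not the applicant's implementation. The function name `top_score`, the default gap (space) penalty of -1, and the match/mismatch character similarity of +1/-1 are all illustrative assumptions. The sketch also uses the conventional matrix with one extra row and column for the empty prefix, which is a choice made here rather than something the claim states.

```python
def top_score(t, p, gap=-1, sim=None):
    """Fill the backtracking matrix F and return its largest entry.

    t, p : the processed mail string and the cluster-center string.
    gap  : gap (space) penalty d; -1 is an assumed value.
    sim  : character similarity function; the default below is assumed.
    """
    if sim is None:
        sim = lambda a, b: 1 if a == b else -1  # assumed scoring

    rows, cols = len(t) + 1, len(p) + 1
    F = [[0] * cols for _ in range(rows)]

    # First row and first column: F[0][j] = d * j, F[i][0] = d * i.
    for j in range(cols):
        F[0][j] = gap * j
    for i in range(rows):
        F[i][0] = gap * i

    # Remaining cells: best of diagonal (match/mismatch), left gap, up gap.
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + sim(t[i - 1], p[j - 1]),
                F[i][j - 1] + gap,
                F[i - 1][j] + gap,
            )

    # The claim takes the maximum F[i][j] as the top score.
    return max(max(row) for row in F)
```

With these assumed scores, two identical three-character strings score 3, and the score falls as the strings diverge.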
4. The method according to claim 2, characterized in that calculating the text similarity between the mail content and the preset initial cluster center according to the determined top score, the length of the mail content, and the length of the preset initial cluster center specifically comprises:
determining the larger of the length of the mail content and the length of the preset initial cluster center;
calculating the text similarity between the mail content and the preset initial cluster center according to the determined top score and the determined larger value.
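Claim 4 names the two inputs to the similarity (the top score and the larger of the two lengths) without spelling out how they are combined. The sketch below assumes the simplest combination, dividing the score by the larger length. That division, the function name `text_similarity`, and the handling of two empty strings are assumptions made for illustration.

```python
def text_similarity(score, mail_len, center_len):
    """Normalize an alignment score by the longer of the two strings.

    Dividing by the larger length is an assumed reading of claim 4.
    """
    longer = max(mail_len, center_len)
    if longer == 0:
        # Two empty strings: return 0.0 rather than divide by zero (assumed).
        return 0.0
    return score / longer
```

Normalizing by the longer string keeps a short mail from scoring as highly similar to a long cluster center merely because it matches a small part of it.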
5. The method according to claim 1, characterized in that judging whether the new mail is spam according to the determined text similarity and the preset threshold, so as to decide according to the judgment result whether to filter the new mail, specifically comprises:
judging whether the determined text similarity is greater than the preset threshold;
when the determined text similarity is greater than the preset threshold, judging that the new mail is spam, and filtering the new mail;
when the determined text similarity is less than or equal to the preset threshold, judging that the new mail is not spam, and taking the new mail as a new initial cluster center.
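The decision rule of claim 5 can be sketched as follows. The function name `handle_mail`, the threshold value 0.8, the use of a list holding several cluster centers, and the `similarity` callable passed in by the caller are hypothetical. The claim itself only fixes the comparison (strictly greater than the threshold means spam) and the promotion of a non-spam mail to a new initial cluster center.

```python
def handle_mail(content, centers, similarity, threshold=0.8):
    """Return True if the mail is filtered as spam, False otherwise.

    content    : processed mail string.
    centers    : mutable list of cluster-center strings (assumed structure).
    similarity : callable(content, center) -> float, supplied by the caller.
    threshold  : preset threshold; 0.8 is an assumed value.
    """
    for center in centers:
        # Strictly greater than the threshold: judged spam and filtered.
        if similarity(content, center) > threshold:
            return True
    # Less than or equal to the threshold for every center: not spam,
    # and the mail becomes a new initial cluster center.
    centers.append(content)
    return False
```

Because the comparison is strict, a similarity exactly equal to the threshold falls on the non-spam side.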
6. The method according to claim 1, characterized in that, before obtaining the mail content of the new mail, the method comprises:
judging, by means of a whitelist and/or a blacklist, whether the received new mail is spam, so as to decide according to the judgment result whether to filter the new mail.
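A list-based pre-check of the kind described in claim 6 might look like the sketch below. Keying the lists by sender address, giving the whitelist priority over the blacklist, and the three return labels are assumptions. The claim only says that a whitelist and/or blacklist is consulted before the mail content is obtained.

```python
def precheck(sender, whitelist, blacklist):
    """Classify a mail by sender before any content analysis.

    Returns "accept", "filter", or "unknown" (labels are illustrative).
    """
    if sender in whitelist:
        return "accept"   # trusted sender: skip content analysis
    if sender in blacklist:
        return "filter"   # known spam sender: filter immediately
    return "unknown"      # fall through to the similarity check
```

Only mails that come back as "unknown" would need the more expensive text-similarity comparison.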
7. A spam mail filtering device, characterized in that the device comprises:
a mail content acquiring unit, configured to obtain the mail content of a new mail after the new mail is received;
a mail content preprocessing unit, configured to process the obtained mail content into a character string of a preset category;
a text similarity determining unit, configured to determine the text similarity between the mail content and a preset initial cluster center according to a preset gap penalty value, a character similarity value, and the data of the preset initial cluster center;
a spam judging unit, configured to judge whether the new mail is spam according to the determined text similarity and a preset threshold, so as to decide according to the judgment result whether to filter the new mail.
8. The device according to claim 7, characterized in that the data of the initial cluster center comprises a character string and a length, and the text similarity determining unit comprises:
a mail content matching score determining unit, configured to determine, according to the preset gap penalty value and the character similarity value, the top score between the character string obtained by the processing and the character string of the preset initial cluster center;
a mail content similarity calculating unit, configured to calculate the text similarity between the mail content and the preset initial cluster center according to the determined top score, the length of the mail content, and the length of the preset initial cluster center.
9. The device according to claim 8, characterized in that the mail content matching score determining unit specifically comprises:
a backtracking matrix initialization module, configured to initialize the first row and the first column of a backtracking matrix according to the following formulas: $F_{0,j} = d \times j$, where $d$ is the preset gap penalty value and $0 \le j \le$ (length of the mail content $- 1$), or $0 \le j \le$ (length of the preset initial cluster center $- 1$); $F_{i,0} = d \times i$, where $0 \le i \le$ (length of the mail content $- 1$), or $0 \le i \le$ (length of the preset initial cluster center $- 1$);
a backtracking matrix remaining-value determining module, configured to determine the remaining rows and columns of the backtracking matrix according to the following formula: $F_{i,j} = \max\left(F_{i-1,j-1} + \mathrm{sim}(T_i, P_j),\ F_{i,j-1} + d,\ F_{i-1,j} + d\right)$, where $\mathrm{sim}(T_i, P_j)$ is the character similarity value of $T_i$ and $P_j$, and to take the maximum $F_{i,j}$ as the top score between the character string obtained by the processing and the character string of the preset initial cluster center.
10. The device according to claim 7, characterized in that the mail content similarity calculating unit comprises:
a mail content length comparison module, configured to determine the larger of the length of the mail content and the length of the preset initial cluster center;
a text similarity calculating module, configured to calculate the text similarity between the mail content and the preset initial cluster center according to the determined top score and the determined larger value.
11. The device according to claim 7, characterized in that the spam judging unit comprises:
a text similarity comparison module, configured to judge whether the determined text similarity is greater than the preset threshold;
a spam judging module, configured to judge, when the determined text similarity is greater than the preset threshold, that the new mail is spam, and to filter the new mail;
a non-spam processing module, configured to judge, when the determined text similarity is less than or equal to the preset threshold, that the new mail is not spam, and to take the new mail as a new initial cluster center.
12. The device according to claim 7, characterized in that the device is further configured to:
judge, by means of a whitelist and/or a blacklist, whether the received new mail is spam, so as to decide according to the judgment result whether to filter the new mail.
CN201510794358.0A 2015-11-18 2015-11-18 Spam mail filtering method and device Pending CN105323153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510794358.0A CN105323153A (en) 2015-11-18 2015-11-18 Spam mail filtering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510794358.0A CN105323153A (en) 2015-11-18 2015-11-18 Spam mail filtering method and device

Publications (1)

Publication Number Publication Date
CN105323153A true CN105323153A (en) 2016-02-10

Family

ID=55249783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510794358.0A Pending CN105323153A (en) 2015-11-18 2015-11-18 Spam mail filtering method and device

Country Status (1)

Country Link
CN (1) CN105323153A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109428946A (en) * 2017-08-31 2019-03-05 Abb瑞士股份有限公司 Method and system for Data Stream Processing
CN110661750A (en) * 2018-06-28 2020-01-07 深信服科技股份有限公司 Mail sender identity detection method, system, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1922837A (en) * 2004-05-14 2007-02-28 布赖特梅有限公司 Method and device for filtrating rubbish E-mail based on similarity measurement
CN101159704A (en) * 2007-10-23 2008-04-09 浙江大学 Microcontent similarity based antirubbish method
CN103488689A (en) * 2013-09-02 2014-01-01 新浪网技术(中国)有限公司 Mail classification method and mail classification system based on clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
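周鑫 (Zhou Xin): "带噪声的文本聚类及其在反垃圾邮件中的应用" (Noisy text clustering and its application in anti-spam), 《中国优秀硕士论文全文数据库信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *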

Similar Documents

Publication Publication Date Title
CN105389400B (en) Voice interaction method and device
EP2849474A1 (en) Information processing method and terminal
CN105141496A (en) Instant communication message playback method and device
EP2219355A2 (en) Voice recognition server, telephone equipment, voice recognition system, and voice recognition method
CN103067896A (en) Junk short message filtering method and device
CN102368842B (en) Detection method of abnormal behavior of mobile terminal and detection system thereof
CN102970402A (en) Method and device for updating contact information of mobile terminal address book
CN107798143A Information search method, device, terminal and readable storage medium
CN103020807A (en) Information display method and system
CN102946474B (en) Method and device for automatically sharing contact information of contacts and mobile terminal
CN103365834A (en) System and method for eliminating language ambiguity
US20090075681A1 (en) Short Message Service Message Compactor and Uncompactor
CN103906012A (en) Information sending method and device
CN101873180A (en) Automatic voice storage method and terminal
CN101212739A (en) Information processing device for mobile communication terminal
CN105207881A (en) Message sending method and equipment
CN106453062A (en) Application notification management method and terminal
CN104615923A (en) Unlocking method and unlocking device of terminal equipment
CN105323153A (en) Spam mail filtering method and device
CN106791036A Information processing method, device and mobile terminal
CN103220211A (en) Processing method, processing device and mobile terminal for social network site (SNS) messages
CN104346151B Information processing method and electronic equipment
CN105574112A (en) Comment information processing method and system of communication process
CN101345966A (en) Method and device for automatically matching menu
CN104065617A (en) Harassing-email processing method, device and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160210