WO2010037292A1 - Method and system for determining a suspected spam range - Google Patents

Method and system for determining a suspected spam range

Info

Publication number
WO2010037292A1
Authority
WO
WIPO (PCT)
Prior art keywords
spam
string
repetitions
feature
suspected
Prior art date
Application number
PCT/CN2009/073563
Other languages
English (en)
French (fr)
Inventor
王晖
陈志强
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2010037292A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21: Monitoring or handling of messages
    • H04L51/212: Monitoring or handling of messages using filtering or selective blocking

Definitions

  • The present invention relates to the field of electronic mail technologies, and in particular to a method and system for determining a suspected spam range.
  • E-mail has become an important tool for everyday communication; as a result, how to [...] has become a problem that urgently needs to be solved.
  • FIG. 1 is a flow chart of a method for filtering spam by using a full-text search method in the prior art. As shown in FIG. 1, the method includes:
  • Step 101: Search the subject and entire body of the current email, and cut a sample of fixed length from the full text of the email as the key feature information that represents the original email.
  • Step 102: Determine, according to the key feature information, whether the stored emails include one whose content is similar to that of the current email; if so, perform step 103; otherwise, return to step 101.
  • Step 103: Determine whether the number of emails similar in content to the current email has reached a predefined spam threshold; if so, perform step 104; otherwise, return to step 101.
  • Step 104: Mark the current email and the emails similar in content to it as spam, and end the process.
  • The method shown in FIG. 1 takes the subject and entire body of every email as the search object, determines whether the stored emails include one similar in content to the current email, and then filters spam according to the number of emails with similar content.
  • This method requires a full-text search for every email, so the amount of data to process is huge and judging whether an email is spam is inefficient.
  • A method for determining a suspected spam range, comprising: intercepting a first predetermined number of characters from each received email; counting the number of repetitions of each intercepted string among all the intercepted strings, and determining as suspected spam features the strings that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions; and determining the mail having such a feature to be suspected spam.
  • A system for determining a suspected spam range, comprising a string intercepting device, a statistical device, and a suspected spam determining device.
  • The string intercepting device is configured to intercept a first predetermined number of characters from each received email and to send the intercepted string to the statistical device.
  • The statistical device is configured to receive the strings, count the number of repetitions of each received string among all the received strings, and send to the suspected spam determining device the strings that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions.
  • The suspected spam determining device is configured to determine each received string as a suspected spam feature, and to determine the mail having such a feature to be suspected spam.
  • It can be seen that the present invention intercepts the first predetermined number of characters from each received email as a candidate suspected spam feature, counts the repetitions of each candidate among all the candidates, determines the top-ranked candidates (up to the second predetermined number) as suspected spam features, and determines the mail having such a feature to be suspected spam. The suspected spam range can therefore be determined before any mail is judged to be spam or not; afterwards only the suspected spam needs to be judged, rather than every email, which improves the efficiency of judging whether a mail is spam.
  • FIG. 1 is a flowchart of a prior-art method for filtering spam by full-text search.
  • FIG. 2 is a flowchart of a method for determining a suspected spam range in an embodiment of the present invention.
  • FIG. 3 is a structural diagram of a first embodiment of a system for determining a suspected spam range.
  • FIG. 4 is a structural diagram of a second embodiment of a system for determining a suspected spam range.
  • FIG. 5 is a structural diagram of a third embodiment of a system for determining a suspected spam range.
  • FIG. 2 is a flowchart of a method for determining a suspected spam range according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
  • Step 201: Intercept a candidate suspected spam feature from each received email.
  • When the total number of characters in the subject and the entire body of the email is greater than the first predetermined number, the first predetermined number of characters are intercepted from the subject and from a fixed position of the entire body as the candidate suspected spam feature; when the total is less than the first predetermined number, the subject and the entire body are intercepted as the candidate. The entire body does not include the subject.
  • The candidate suspected spam feature here is simply a string intercepted from the mail.
  • The fixed position of the entire body refers to a certain part of the body: it may be the beginning of the body, or another part such as the middle or the end.
  • For example, suppose the fixed position is the beginning of the body and the first predetermined number is 60, the first email has a 10-character subject and a 100-character body, and the second email has a 12-character subject and an 18-character body. The candidate intercepted from the first email is then the string formed, in order, by the 10 subject characters and the first 50 characters of the body, and the candidate intercepted from the second email is the string formed, in order, by all of its characters.
  • Most of the spam content in a spam mail usually appears in the subject and at the beginning of the body, for example in the first paragraph, so when the fixed position is the beginning of the body, the amount of information to process is reduced while missed detection of spam content is still avoided.
  • If the spam content tends to appear later in the mail, the candidate may instead be intercepted from the middle or the end, again to avoid missed detection.
  • Where spam content usually appears in a mail can be determined from statistics by the person skilled in the art who designs the program for judging whether a mail is suspected spam; when the program or device for determining the suspected spam range is designed according to the method of FIG. 2, the fixed position is then set to the beginning, the middle, or the end of the mail.
  • The program or device subsequently only needs to process the subject and the body at the fixed position when determining the suspected spam range, without searching and processing the full text of the email.
  • The statistics can be obtained by counting the probability with which spam content appears at each position in mails that have already been judged to be spam.
  • Step 202: Count the number of repetitions of each intercepted candidate suspected spam feature among all the intercepted candidates.
  • The number of repetitions may be counted in either of two ways.
  • Method one: count the repetitions of each candidate among all candidates of the same length as that candidate, and take this count as the number of repetitions of the candidate among all the intercepted candidates.
  • Method two: count the repetitions of each candidate among all candidates whose length is greater than or equal to that of the candidate, and take this count as the number of repetitions of the candidate among all the intercepted candidates. Specifically, search the characters of each candidate of greater or equal length to see whether the characters of the candidate being counted all appear in the same order as in that candidate; if so, add 1 to the number of repetitions.
  • For example, if the intercepted candidates are "123456", "12345", "12345", "13589" and "1~2~3~4~5", the number of repetitions of the candidate "12345" is 2 according to method one and 4 according to method two.
  • Counting by method two removes the effect of interference characters inserted into spam, such as the character "~", and so avoids missing suspected spam because of them.
  • Step 203: Determine as suspected spam features the candidates that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions.
  • The second predetermined number is a preset natural number.
  • The strings may be sorted by number of repetitions in descending or ascending order; the strings in the first positions (descending) or the last positions (ascending), up to the second predetermined number, are then determined to be suspected spam features.
  • For example, the strings are sorted in descending order of repetitions, and the list of mails in which each string appears is recorded for later use in determining suspected spam; see Table 1, in which EML denotes a mail.
  • If the second predetermined number is 2, then string A, string B, and string C are the suspected spam features.
  • The specific value of the second predetermined number is likewise fixed when the program for determining the suspected spam range is designed.
  • "A", "B", and "C" are code names for strings, not actual strings; for example, string A may stand for "12345" and string B for "6789".
  • A person skilled in the art first presets a threshold range and selects a specific value for the first predetermined number. The threshold range means: if the number of repetitions of a string falls within the range, the string is a suspected spam feature; otherwise it is not.
  • The threshold range can be set from experience. For example, if manual statistics over a period of time show that spam accounts for 10% to 50% of all email, then for a batch of 10000 mails the threshold range can be set to (1000, 5000).
  • Suppose the threshold range is (1000, 5000) and the first predetermined number is 5. If the method of FIG. 2 finds a string whose number of repetitions is greater than or equal to 5000, the first predetermined number is too small: such a string may appear not only in spam but also, in large quantities, in non-spam. The designer then increases the first predetermined number, for example to 7, and counts the repetitions of each string again by the method of FIG. 2. If the count now falls within (1000, 5000), the value is reasonable, and the first predetermined number can be set to 7.
  • Once the suspected spam features are determined, they can be stored in a feature library; emails having a feature in the library are later judged to be suspected spam, and only the suspected spam then needs to be judged as spam or not.
  • The feature library may take the form of Table 1, storing the suspected spam features, the number of repetitions of each feature, and the list of mails in which each feature appears; it may also take other forms, such as storing only the features and their numbers of repetitions.
  • The feature library occupies little storage space, so using it to determine the suspected spam range reduces the storage occupied by the anti-spam system; processing the full text of mails as in the prior art requires storing the full text of every mail to be processed, which occupies much more space.
  • FIG. 3 is a structural diagram of a first embodiment of a system for determining a suspected spam range. As shown in FIG. 3, the system includes a string intercepting device 301, a statistical device 302, and a suspected spam determining device 303.
  • The string intercepting device 301 is configured to intercept a first predetermined number of characters from each received email as a candidate suspected spam feature, and to send the intercepted candidate to the statistical device 302.
  • The statistical device 302 is configured to receive the candidates, count the number of repetitions of each received candidate among all the received candidates, and send to the suspected spam determining device 303 the candidates that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions.
  • The suspected spam determining device 303 is configured to determine the received candidates as suspected spam features, and to treat the mail having such a feature as suspected spam.
  • The string intercepting device 301 may further be configured to intercept the first predetermined number of characters from the subject and from a fixed position of the entire body when the total number of characters in the subject and entire body is greater than the first predetermined number, to intercept the subject and the entire body when the total is less than the first predetermined number, and to send the intercepted candidate to the statistical device 302.
  • FIG. 4 is a structural diagram of a second embodiment of a system for determining a suspected spam range. The system of FIG. 4 differs from that of FIG. 3 only in that the suspected spam determining device 303 includes a feature library 3031 and a suspected spam determining module 3032.
  • The feature library 3031 is configured to store the received candidates as suspected spam features.
  • The suspected spam determining module 3032 is configured to receive an email, determine whether the email has a feature in the feature library 3031, and determine an email having such a feature to be suspected spam.
  • FIG. 5 is a structural diagram of a third embodiment of a system for determining a suspected spam range. The system of FIG. 5 differs from that of FIG. 3 or FIG. 4 only in that it further includes a spam determining device 504.
  • The spam determining device 504 is configured to judge whether the suspected spam determined by the suspected spam determining device 303 is spam.
  • The spam determining device may use artificial intelligence (AI), Bayesian methods, neural networks, or support vector machines to judge whether the suspected spam is spam.
  • In the embodiments, the first predetermined number of characters are intercepted from each received email as a candidate suspected spam feature, the repetitions of each candidate among all the candidates are counted, the top-ranked candidates (up to the second predetermined number) are determined to be suspected spam features, and the mail having such a feature is treated as suspected spam. The suspected spam range is thus determined in advance, and afterwards only the suspected spam needs to be judged as spam or not, rather than every email, which improves the efficiency of judging whether a mail is spam.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Description

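Method and System for Determining a Suspected Spam Range
Technical Field
The present invention relates to the field of electronic mail technologies, and in particular to a method and system for determining a suspected spam range.
Background of the Invention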
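E-mail has become an important tool for everyday communication; as a result, how to [...] has become a problem that urgently needs to be solved.
At present, to protect e-mail users from spam as far as possible, a method has appeared that filters spam by full-text search. This method is described below with reference to FIG. 1.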
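FIG. 1 is a flowchart of a prior-art method for filtering spam by full-text search. As shown in FIG. 1, the method includes:
Step 101: Search the subject and entire body of the current email, and cut a sample of fixed length from the full text of the email as the key feature information that represents the original email.
Step 102: Determine, according to the key feature information, whether the stored emails include one whose content is similar to that of the current email; if so, perform step 103; otherwise, return to step 101.
Step 103: Determine whether the number of emails similar in content to the current email has reached a predefined spam threshold; if so, perform step 104; otherwise, return to step 101.
Step 104: Mark the current email and the emails similar in content to it as spam, and end the process.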
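It can be seen that the method of FIG. 1 takes the subject and entire body of every email as the search object, determines whether the stored emails include one similar in content to the current email, and then filters spam according to the number of emails with similar content. This method requires a full-text search for every email, so the amount of data to process is huge and judging whether an email is spam is inefficient.
Summary of the Invention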
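In view of this, the object of the present invention is to provide a method and system for determining a suspected spam range, so that the range of suspected spam is determined in advance and the efficiency of judging whether a mail is spam is improved.
To achieve this object, the technical solution of the present invention is implemented as follows:
A method for determining a suspected spam range, the method comprising:
intercepting a first predetermined number of characters from each received email;
counting the number of repetitions of each intercepted string among all the intercepted strings, and determining as suspected spam features the strings that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions;
determining the mail having such a feature to be suspected spam.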
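A system for determining a suspected spam range, the system comprising a string intercepting device, a statistical device, and a suspected spam determining device;
the string intercepting device is configured to intercept a first predetermined number of characters from each received email and to send the intercepted string to the statistical device;
the statistical device is configured to receive the strings, count the number of repetitions of each received string among all the received strings, and send to the suspected spam determining device the strings that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions;
the suspected spam determining device is configured to determine each received string as a suspected spam feature, and to determine the mail having such a feature to be suspected spam.
It can be seen that the present invention intercepts the first predetermined number of characters from each received email as a candidate suspected spam feature, counts the repetitions of each candidate among all the candidates, determines the top-ranked candidates (up to the second predetermined number) as suspected spam features, and determines the mail having such a feature to be suspected spam. The suspected spam range can therefore be determined before any mail is judged to be spam or not; afterwards only the suspected spam needs to be judged, rather than every email, which improves the efficiency of judging whether a mail is spam.
Brief Description of the Drawings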
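FIG. 1 is a flowchart of a prior-art method for filtering spam by full-text search;
FIG. 2 is a flowchart of a method for determining a suspected spam range in an embodiment of the present invention;
FIG. 3 is a structural diagram of a first embodiment of a system for determining a suspected spam range;
FIG. 4 is a structural diagram of a second embodiment of a system for determining a suspected spam range;
FIG. 5 is a structural diagram of a third embodiment of a system for determining a suspected spam range.
Mode for Carrying Out the Invention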
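To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments.
FIG. 2 is a flowchart of a method for determining a suspected spam range in an embodiment of the present invention. As shown in FIG. 2, the method includes: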
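Step 201: Intercept a candidate suspected spam feature from each received email.
In this step, when the total number of characters in the subject and the entire body of the email is greater than the first predetermined number, the first predetermined number of characters are intercepted from the subject and from a fixed position of the entire body as the candidate suspected spam feature; when the total is less than the first predetermined number, the subject and the entire body of the mail are intercepted as the candidate. The entire body does not include the subject. The candidate suspected spam feature here is simply a string intercepted from the mail.
The fixed position of the entire body refers to a certain part of the body: it may be the beginning of the body, or another part such as the middle or the end.
For example, suppose the fixed position is the beginning of the entire body and the first predetermined number is 60, the first email has a 10-character subject and a 100-character body, and the second email has a 12-character subject and an 18-character body. The candidate intercepted from the first email is then the string formed, in order, by the 10 subject characters and the first 50 characters of the body, and the candidate intercepted from the second email is the string formed, in order, by all of its characters.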
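The interception rule of step 201 can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: the function name `extract_candidate`, the `position` parameter, the centred window used for "middle", and the handling of a subject longer than the first predetermined number are all assumptions introduced for illustration.

```python
def extract_candidate(subject: str, body: str, first_n: int = 60,
                      position: str = "start") -> str:
    """Return the candidate suspected spam feature for one email."""
    if len(subject) + len(body) <= first_n:
        # Short mail: the subject and the entire body form the candidate.
        return subject + body
    if len(subject) >= first_n:
        # Case not covered by the source; assumed here: truncate the subject.
        return subject[:first_n]
    k = first_n - len(subject)  # characters still to take from the body
    if position == "start":
        part = body[:k]
    elif position == "end":
        part = body[-k:]
    elif position == "middle":
        begin = (len(body) - k) // 2  # assumed: a window centred in the body
        part = body[begin:begin + k]
    else:
        raise ValueError("position must be 'start', 'middle' or 'end'")
    return subject + part


# The two emails from the example above (placeholder characters).
first = extract_candidate("S" * 10, "B" * 100)   # 10 subject + 50 body chars
second = extract_candidate("T" * 12, "c" * 18)   # whole mail, 30 chars
```

With the default settings the first call returns 60 characters and the second returns all 30 characters of the short mail, matching the worked example.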
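In this step, for an email with more characters than the first predetermined number, only the subject and part of the body are processed rather than the entire body. The amount of information to process is therefore smaller, and each email can be processed faster.
In addition, most of the spam content in a spam mail usually appears in the subject and at the beginning of the body, for example in the first paragraph. When the fixed position is the beginning of the body, the amount of information to process is therefore reduced while missed detection of spam content is still avoided. Of course, if the spam content tends to appear later, for example in the middle or at the end of the mail, the candidate may be intercepted from the middle or the end instead, again to avoid missed detection. Where spam content usually appears in a mail can be determined from statistics by the person skilled in the art who designs the program for judging whether a mail is suspected spam. When the program or device for determining the suspected spam range is then designed according to the method of FIG. 2, the fixed position is set to the beginning, the middle, or the end of the mail, so that the program or device only needs to process the subject and the body at that fixed position, without searching and processing the full text of the email. The statistics can be obtained by counting the probability with which spam content appears at each position in mails that have already been judged to be spam.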
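Step 202: Count the number of repetitions of each intercepted candidate suspected spam feature among all the intercepted candidates.
In this step, the number of repetitions may be counted in either of two ways:
Method one: count the repetitions of each candidate among all candidates of the same length as that candidate, and take this count as the number of repetitions of the candidate among all the intercepted candidates.
Method two: count the repetitions of each candidate among all candidates whose length is greater than or equal to that of the candidate, and take this count as the number of repetitions of the candidate among all the intercepted candidates. Specifically, the characters of each candidate of greater or equal length can be searched to see whether the characters of the candidate being counted all appear in the same order as in that candidate; if so, 1 is added to the number of repetitions.
For example, if the currently intercepted candidates are "123456", "12345", "12345", "13589" and "1~2~3~4~5", then the number of repetitions of the candidate "12345" is 2 according to method one and 4 according to method two.
Counting by method two removes the effect of interference characters inserted into spam, for example the character "~", and so avoids missing suspected spam because of them.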
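The two counting methods can be illustrated with a short Python sketch. The function names are hypothetical, and method two is read here as an in-order (subsequence) match, which is how the description and claim 5 word it.

```python
def is_subsequence(needle: str, haystack: str) -> bool:
    """True if the characters of needle appear in haystack in the same order."""
    it = iter(haystack)
    # `ch in it` advances the iterator, so matches must occur left to right.
    return all(ch in it for ch in needle)


def count_method_one(target: str, candidates: list[str]) -> int:
    """Method one: count candidates of the same length that equal the target."""
    return sum(1 for c in candidates
               if len(c) == len(target) and c == target)


def count_method_two(target: str, candidates: list[str]) -> int:
    """Method two: count candidates at least as long as the target that
    contain the target's characters in order."""
    return sum(1 for c in candidates
               if len(c) >= len(target) and is_subsequence(target, c))


candidates = ["123456", "12345", "12345", "13589", "1~2~3~4~5"]
print(count_method_one("12345", candidates))  # 2
print(count_method_two("12345", candidates))  # 4
```

Method two counts "123456" and "1~2~3~4~5" as repetitions of "12345" because the five characters occur in order in both, while "13589" is rejected; this reproduces the counts 2 and 4 given in the example.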
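Step 203: Determine as suspected spam features the candidates that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions. The second predetermined number is a preset natural number.
In this step, the strings can be sorted by number of repetitions, in either descending or ascending order; the strings in the first positions (descending) or the last positions (ascending), up to the second predetermined number, are then determined to be suspected spam features.
For example, the strings are sorted from the highest number of repetitions to the lowest (descending order), and the list of mails in which each string appears is recorded for later use in determining suspected spam; see Table 1, in which EML denotes a mail.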
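[Table 1 (image not reproduced): strings in descending order of number of repetitions, with the list of mails (EML) in which each string appears]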
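If the second predetermined number is 2, then string A, string B, and string C are the suspected spam features. The specific value of the second predetermined number is likewise fixed when the program for determining the suspected spam range is designed. "A", "B", and "C" are code names for strings, not actual strings; for example, string A may stand for "12345" and string B for "6789".
Step 204: Treat the mail having such a feature as suspected spam, and end the process.
When the second predetermined number is 2, according to Table 1, any mail in which string A, string B, or string C appears is determined to be suspected spam.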
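Steps 203 and 204 amount to a top-N selection over the repetition counts. The sketch below uses method-one (exact match) counting via `collections.Counter`; the function name `select_features` is hypothetical. The example in the text selects three strings when the second predetermined number is 2, and Table 1 is not reproduced here, so the tie rule used below (strings whose count equals the count at the cut-off rank are also kept) is only an assumption that would be consistent with that example.

```python
from collections import Counter


def select_features(candidates: list[str], second_n: int) -> dict[str, int]:
    """Return the top-ranked candidates and their repetition counts."""
    if second_n < 1:
        raise ValueError("second_n must be a natural number >= 1")
    ranked = Counter(candidates).most_common()  # descending by count
    if len(ranked) <= second_n:
        return dict(ranked)
    cutoff = ranked[second_n - 1][1]  # count at the last selected rank
    # Assumed tie rule: keep every string whose count reaches the cut-off.
    return {s: n for s, n in ranked if n >= cutoff}


# "A".."D" are code names for strings, as in the description.
candidates = ["A"] * 5 + ["B"] * 3 + ["C"] * 3 + ["D"]
features = select_features(candidates, second_n=2)
print(features)  # {'A': 5, 'B': 3, 'C': 3}
```

A mail whose candidate string is a key of `features` would then be treated as suspected spam in step 204.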
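After the suspected spam range is determined, it can be handed to the anti-spam system, which then only needs to judge whether the suspected spam is spam, without judging every received mail. Whether suspected spam is spam can be judged manually or by artificial intelligence (AI). In practice, a mail can be checked by the method of FIG. 2 immediately after it is received; alternatively, received mails can be stored first and the stored mails checked at fixed intervals or in batches of a fixed size. [...] The method for selecting the specific value of the first predetermined number is described below.
A person skilled in the art first presets a threshold range and selects a specific value for the first predetermined number. The threshold range means: if the number of repetitions of a string falls within the range, the string is a suspected spam feature; otherwise it is not. The threshold range can be set from experience. For example, if manual statistics over a period of time show that spam accounts for 10% to 50% of all email, then when the suspected spam range is being determined for 10000 mails, the threshold range can be set to (1000, 5000).
Suppose the threshold range is (1000, 5000) and the first predetermined number is 5. If the method of FIG. 2 finds a string whose number of repetitions is greater than or equal to 5000, the first predetermined number is too small: such a string may appear not only in spam but also, in large quantities, in non-spam. The designer then increases the first predetermined number, for example to 7, and counts the repetitions of each string again by the method of FIG. 2. If the count now falls within (1000, 5000), the value is reasonable, and the first predetermined number can be set to 7.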
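The tuning procedure above can be sketched as a small loop. Everything named here is hypothetical (`threshold_range`, `tune_first_n`, the `max_n` bound), and the stopping rule is simplified: the sketch only checks that the most frequent string no longer reaches the upper bound, whereas the description leaves the final judgement to the designer.

```python
from collections import Counter


def threshold_range(total_mails: int, low_pct: int, high_pct: int) -> tuple[int, int]:
    """Derive the threshold range from the observed spam share (in percent)."""
    return (total_mails * low_pct // 100, total_mails * high_pct // 100)


def tune_first_n(mails: list[str], start_n: int, max_n: int, upper: int):
    """Increase the first predetermined number until no intercepted string
    repeats `upper` times or more; return None if max_n is exceeded."""
    for n in range(start_n, max_n + 1):
        counts = Counter(text[:n] for text in mails)
        if max(counts.values(), default=0) < upper:
            return n
    return None


print(threshold_range(10000, 10, 50))  # (1000, 5000)

# Toy data: a 5-character prefix is shared by all six mails, so it is too short.
mails = ["hello world"] * 3 + ["hello there"] * 3
print(tune_first_n(mails, start_n=5, max_n=20, upper=5))  # 7
```

In the toy data the prefixes of length 5 and 6 are identical across all six mails, so the loop moves on until length 7 separates the two groups.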
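In FIG. 2, once the suspected spam features are determined, they can be stored in a feature library. Emails having a feature in the library are later judged to be suspected spam, and only the suspected spam then needs to be judged as spam or not. The feature library may take the form of Table 1, storing the suspected spam features, the number of repetitions of each feature in the mails, and the list of mails in which each feature appears; it may also take other forms, for example storing only the features and their numbers of repetitions.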
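A feature library in the form described (feature, number of repetitions, mail list) might look like the following Python sketch. The class and method names are hypothetical, and "having the feature" is interpreted here as an exact match between a stored feature and the string intercepted from the start of the new mail; the source does not fix the matching rule.

```python
class FeatureLibrary:
    """Stores suspected spam features with their counts and mail lists."""

    def __init__(self) -> None:
        # feature -> {"count": repetitions, "mails": ids of mails containing it}
        self._entries: dict[str, dict] = {}

    def add(self, feature: str, count: int, mail_ids=()) -> None:
        self._entries[feature] = {"count": count, "mails": list(mail_ids)}

    def is_suspected(self, subject: str, body: str, first_n: int) -> bool:
        # Assumed rule: intercept from the start of subject + body and look
        # the resulting string up in the library.
        candidate = (subject + body)[:first_n]
        return candidate in self._entries


library = FeatureLibrary()
library.add("12345", 3, ["EML1", "EML2", "EML3"])
print(library.is_suspected("123", "45678", first_n=5))  # True
print(library.is_suspected("abc", "defgh", first_n=5))  # False
```

Only short strings and small counters are stored, which reflects the storage advantage the description claims over keeping the full text of every mail.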
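The feature library occupies little storage space, so using it to determine the suspected spam range reduces the storage occupied by the anti-spam system. By contrast, processing the full text of mails as in the prior art requires storing the full text of every mail to be processed, which occupies much more space.
Embodiments of a system for determining a suspected spam range are given below.
FIG. 3 is a structural diagram of a first embodiment of a system for determining a suspected spam range. As shown in FIG. 3, the system includes a string intercepting device 301, a statistical device 302, and a suspected spam determining device 303.
The string intercepting device 301 is configured to intercept a first predetermined number of characters from each received email as a candidate suspected spam feature, and to send the intercepted candidate to the statistical device 302.
The statistical device 302 is configured to receive the candidates, count the number of repetitions of each received candidate among all the received candidates, and send to the suspected spam determining device 303 the candidates that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions.
The suspected spam determining device 303 is configured to determine the received candidates as suspected spam features, and to treat the mail having such a feature as suspected spam.
The string intercepting device 301 may further be configured to intercept the first predetermined number of characters from the subject and from a fixed position of the entire body when the total number of characters in the subject and entire body of the email is greater than the first predetermined number, to intercept the subject and the entire body when the total is less than the first predetermined number, and to send the intercepted candidate to the statistical device 302.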
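FIG. 4 is a structural diagram of a second embodiment of a system for determining a suspected spam range. The system of FIG. 4 differs from that of FIG. 3 only in the following:
The suspected spam determining device 303 includes a feature library 3031 and a suspected spam determining module 3032.
The feature library 3031 is configured to store the received candidates as suspected spam features.
The suspected spam determining module 3032 is configured to receive an email, determine whether the email has a feature in the feature library 3031, and determine an email having such a feature to be suspected spam.
FIG. 5 is a structural diagram of a third embodiment of a system for determining a suspected spam range. The system of FIG. 5 differs from that of FIG. 3 or FIG. 4 only in that it further includes a spam determining device 504.
The spam determining device 504 is configured to judge whether the suspected spam determined by the suspected spam determining device 303 is spam. Specifically, the spam determining device may use artificial intelligence (AI), Bayesian methods, neural networks, or support vector machines to make this judgement.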
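The division of labour between the devices of FIG. 3 to FIG. 5 can be sketched as a two-stage filter in Python. This is a simplified illustration under stated assumptions: `two_stage_filter` and its parameters are hypothetical names, candidates are taken from the start of the mail text, counting uses exact matches, ties are not treated specially, and `classify` stands in for whatever Bayesian, neural network, or SVM judge plays the role of device 504.

```python
from collections import Counter
from typing import Callable


def two_stage_filter(mails: list[str], first_n: int, second_n: int,
                     classify: Callable[[str], bool]) -> list[str]:
    """Narrow the mails to suspected spam, then classify only those."""
    # Role of device 301: intercept the candidate string of each mail.
    candidates = [m[:first_n] for m in mails]
    # Role of device 302: count repetitions and keep the top-ranked strings.
    top = {s for s, _ in Counter(candidates).most_common(second_n)}
    # Role of device 303: mails having a selected feature are suspected spam.
    suspected = [m for m, c in zip(mails, candidates) if c in top]
    # Role of device 504: the expensive judge runs on suspected mails only.
    return [m for m in suspected if classify(m)]


mails = ["buy now 1", "buy now 2", "buy now 3", "hi bob", "meeting at 3"]
judged = []  # records which mails reach the classifier


def toy_classifier(mail: str) -> bool:
    judged.append(mail)
    return True


spam = two_stage_filter(mails, first_n=7, second_n=1, classify=toy_classifier)
print(spam)         # ['buy now 1', 'buy now 2', 'buy now 3']
print(len(judged))  # 3 of the 5 mails were classified
```

The point of the design is visible in `judged`: the classifier is invoked for the three suspected mails only, not for all five.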
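It can be seen that, in the embodiments of the present invention, the first predetermined number of characters are intercepted from each received email as a candidate suspected spam feature, the repetitions of each candidate among all the candidates are counted, the top-ranked candidates (up to the second predetermined number) are determined to be suspected spam features, and the mail having such a feature is treated as suspected spam. The suspected spam range can thus be determined in advance, and afterwards only the suspected spam needs to be judged as spam or not, rather than every email, which improves the efficiency of judging whether a mail is spam.
Moreover, when the suspected spam range is determined, only the subject and the body at the fixed position are processed, not the full text of the email. This reduces the amount of information to process and further improves the efficiency of judging whether a mail is spam.
In addition, because the feature library occupies little storage space, storage is saved compared with the prior art, which must keep the full text of each mail when judging whether it is spam.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.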

Claims

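Claims
1. A method for determining a suspected spam range, characterized in that the method comprises:
intercepting a first predetermined number of characters from each received email;
counting the number of repetitions of each intercepted string among all the intercepted strings, and determining as suspected spam features the strings that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions;
determining the mail having such a feature to be suspected spam.
2. The method according to claim 1, characterized in that intercepting a first predetermined number of characters from each received email comprises:
when the total number of characters in the subject and the entire body of the email is greater than the first predetermined number, intercepting the first predetermined number of characters from the subject and from a fixed position of the entire body; when the total number of characters in the subject and the entire body of the mail is less than the first predetermined number, intercepting the subject and the entire body of the mail.
3. The method according to claim 1, characterized in that counting the number of repetitions of each intercepted string among all the intercepted strings comprises:
counting the repetitions of each string among all strings of the same length as that string, and determining this count as the number of repetitions of the string among all the intercepted strings.
4. The method according to claim 1, characterized in that counting the number of repetitions of each intercepted string among all the intercepted strings comprises:
counting the repetitions of each string among all strings whose length is greater than or equal to that of the string, and taking this count as the number of repetitions of the string among all the intercepted strings.
5. The method according to claim 4, characterized in that counting the repetitions of each string among all strings whose length is greater than or equal to that of the string comprises:
searching the characters of each string whose length is greater than or equal to that of the string being counted to see whether the characters of the string being counted all appear in the same order as in that string, and if so, adding 1 to the number of repetitions.
6. The method according to claim 1, characterized in that
the method further comprises: storing the strings determined to be suspected spam features in a suspected spam feature library;
and determining the mail having such a feature to be suspected spam is:
determining the mail having a string in the feature library to be suspected spam.
7. The method according to claim 2, characterized in that the fixed position of the entire body is the beginning, the middle, or the end of the entire body.
8. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
judging whether the suspected spam is spam.
9. A system for determining a suspected spam range, characterized in that the system comprises a string intercepting device, a statistical device, and a suspected spam determining device;
the string intercepting device is configured to intercept a first predetermined number of characters from each received email and to send the intercepted string to the statistical device;
the statistical device is configured to receive the strings, count the number of repetitions of each received string among all the received strings, and send to the suspected spam determining device the strings that rank within the top positions, up to the second predetermined number, when ordered from most to fewest repetitions;
the suspected spam determining device is configured to determine each received string as a suspected spam feature, and to determine the mail having such a feature to be suspected spam.
10. The system according to claim 9, characterized in that
the string intercepting device is configured to intercept the first predetermined number of characters from the subject and from a fixed position of the entire body when the total number of characters in the subject and the entire body of the email is greater than the first predetermined number, to intercept the subject and the entire body of the mail when the total is less than the first predetermined number, and to send the intercepted string to the statistical device.
11. The system according to claim 9, characterized in that the suspected spam determining device comprises a feature library and a suspected spam determining module;
the feature library is configured to determine each received string as a suspected spam feature and to store it;
the suspected spam determining module is configured to receive an email, determine whether the received email has a feature in the feature library, and determine an email having such a feature to be suspected spam.
12. The system according to claim 9, 10, or 11, characterized in that the system further comprises a spam determining device;
the spam determining device is configured to judge whether the suspected spam determined by the suspected spam determining device is spam.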
PCT/CN2009/073563 2008-09-27 2009-08-27 Method and system for determining a suspected spam range WO2010037292A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200810167115.4 2008-09-27
CN2008101671154A CN101360074B (zh) 2008-09-27 2008-09-27 Method and system for determining a suspected spam range

Publications (1)

Publication Number Publication Date
WO2010037292A1 true WO2010037292A1 (zh) 2010-04-08

Family

ID=40332415

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/073563 WO2010037292A1 (zh) 2008-09-27 2009-08-27 Method and system for determining a suspected spam range

Country Status (2)

Country Link
CN (1) CN101360074B (zh)
WO (1) WO2010037292A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9348912B2 (en) 2007-10-18 2016-05-24 Microsoft Technology Licensing, Llc Document length as a static relevance feature for ranking search results
CN101360074B (zh) * 2008-09-27 2011-09-21 腾讯科技(深圳)有限公司 Method and system for determining a suspected spam range
US8738635B2 (en) * 2010-06-01 2014-05-27 Microsoft Corporation Detection of junk in search result ranking
US9495462B2 (en) 2012-01-27 2016-11-15 Microsoft Technology Licensing, Llc Re-ranking search results
CN104283855A (zh) * 2013-07-08 2015-01-14 北京思普崚技术有限公司 Method for intercepting spam
CN105279238B (zh) * 2015-09-28 2018-11-06 北京国双科技有限公司 String processing method and device
CN114040409B (zh) * 2021-11-11 2023-06-06 中国联合网络通信集团有限公司 Short message identification method, apparatus, device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006293573A (ja) * 2005-04-08 2006-10-26 Yaskawa Information Systems Co Ltd E-mail processing device, e-mail filtering method, and e-mail filtering program
CN101106539A (zh) * 2007-08-03 2008-01-16 浙江大学 Spam filtering method based on support vector machine
CN101360074A (zh) * 2008-09-27 2009-02-04 腾讯科技(深圳)有限公司 Method and system for determining a suspected spam range


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU, JIE ET AL.: "Fast Spam Detecting Method Under High-speed Network.", COMPUTER ENGINEERING, vol. 32, no. 4, February 2006 (2006-02-01), pages 139 *

Also Published As

Publication number Publication date
CN101360074B (zh) 2011-09-21
CN101360074A (zh) 2009-02-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09817206

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 5958/CHENP/2010

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04/08/2011)

122 Ep: pct application non-entry in european phase

Ref document number: 09817206

Country of ref document: EP

Kind code of ref document: A1