WO2010037292A1 - Method and system for determining a suspected spam range - Google Patents
Method and system for determining a suspected spam range
- Publication number
- WO2010037292A1 (application PCT/CN2009/073563)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- spam
- string
- repetitions
- feature
- suspected
- Prior art date
Links
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
Definitions
- The present invention relates to the field of electronic mail technologies, and in particular to a method and system for determining a suspected spam range.
Background of the invention
- E-mail has become an important tool for communication, and handling spam is a problem that needs to be solved.
- FIG. 1 is a flow chart of a prior-art method for filtering spam by full-text search. As shown in FIG. 1, the method includes:
- Step 101: Search the subject and full body of the current email, and extract a fixed-length sample from the full text as the key feature information of the email, representing the original email.
- Step 102: Determine, according to the key feature information, whether the stored emails include an email whose content is similar to that of the current email. If yes, perform step 103; otherwise, return to step 101.
- Step 103: Determine whether the number of emails similar in content to the current email has reached a predefined spam threshold. If yes, go to step 104; otherwise, return to step 101.
- Step 104: Mark the current email and the emails with similar content as spam, and end the process.
- The method shown in FIG. 1 searches the subject and full body of each email, determines on that basis whether the stored emails include an email similar in content to the current email, and then filters spam according to the number of emails with similar content.
- This method requires full-text search processing for every email, so the amount of data to process is huge and determining whether an email is spam is inefficient.
Summary of the invention
- A method for determining a suspected spam range, comprising:
- determining mail having the feature to be suspected spam.
- A system for determining a suspected spam range, comprising a character string intercepting device, a statistical device, and a suspected spam determining device;
- the character string intercepting device is configured to intercept a first predetermined number of characters from each received email, and send the intercepted character string to the statistical device;
- the statistical device is configured to receive the character strings, count the number of repetitions of each received string among all the received strings, and send the strings ranked within the top second predetermined number by number of repetitions to the suspected spam determining device;
- the suspected spam determining device is configured to determine the received character strings as suspected spam features, and determine mail having such a feature as suspected spam.
- It can be seen that the present invention intercepts a first predetermined number of characters from each received email as a suspected spam feature to be determined, counts the number of repetitions of each intercepted feature among all the intercepted features, determines the features ranked within the top second predetermined number by number of repetitions as suspected spam features, and determines mail having such a feature as suspected spam. The scope of suspected spam can therefore be determined before judging whether any email is spam, and only the suspected spam then needs to be judged rather than every email, which improves the efficiency of determining whether an email is spam.
- FIG. 1 is a flow chart of a prior-art method for filtering spam by full-text search;
- FIG. 2 is a flowchart of a method for determining a suspected spam range in an embodiment of the present invention;
- FIG. 3 is a structural diagram of a first embodiment of a system for determining a suspected spam range;
- FIG. 4 is a structural diagram of a second embodiment of a system for determining a suspected spam range;
- FIG. 5 is a structural diagram of a third embodiment of a system for determining a suspected spam range.
Mode for carrying out the invention
- FIG. 2 is a flowchart of a method for determining a suspected spam range according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
- Step 201: Intercept a suspected spam feature to be determined from each received email.
- If the total number of characters in the subject and full body of the email is greater than the first predetermined number, the first predetermined number of characters are intercepted from a fixed position of the subject and full body as the suspected spam feature to be determined. If the total is less than the first predetermined number, the subject and full body of the email are intercepted as the suspected spam feature to be determined. The body referred to here does not include the subject.
- The suspected spam feature to be determined is in fact a character string intercepted from the email.
- The fixed position of the body refers to a particular part of the body: it may be the beginning of the body, or another part such as the middle or the tail.
- For example, suppose the first predetermined number is 60, the subject of the first email has 10 characters and its body has 100 characters, and the subject of the second email has 12 characters and its body has 18 characters.
- The suspected spam feature to be determined from the first email is then the string consisting of the 10 characters of its subject followed by the first 50 characters of its body.
- The suspected spam feature to be determined from the second email is the string consisting of all of its characters in sequence.
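The interception rule of step 201 can be sketched in Python as follows. This is a minimal sketch rather than the patent's implementation: the function name `intercept_feature`, the `body_start` parameter used to model the fixed position, and the handling of a subject that by itself reaches the first predetermined number are all assumptions made for illustration.

```python
def intercept_feature(subject: str, body: str, first_number: int,
                      body_start: int = 0) -> str:
    """Return the suspected spam feature to be determined for one email.

    body_start models the "fixed position" in the body (0 = beginning);
    it is a hypothetical parameter, not named in the source.
    """
    if len(subject) + len(body) <= first_number:
        # Short email: the whole subject and body form the feature.
        return subject + body
    if len(subject) >= first_number:
        # Assumption: the source does not cover this case; truncate the subject.
        return subject[:first_number]
    remaining = first_number - len(subject)
    # Subject first, then characters taken from the fixed position of the body.
    return subject + body[body_start:body_start + remaining]


# Worked example from the text: the first predetermined number is 60.
first = intercept_feature("s" * 10, "b" * 100, 60)   # 10 subject + 50 body characters
second = intercept_feature("t" * 12, "c" * 18, 60)   # all 30 characters of the email
```

If `body_start` is set close to the end of the body, the slice may yield fewer characters than requested; a real design would have to decide how to handle that case.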
- Much of the spam content in spam emails appears at the beginning of the subject and body, for example in the first paragraph of the email. Setting the fixed position to the beginning of the body therefore reduces the amount of information that needs to be processed while avoiding missed detection of spam.
- The suspected spam feature to be determined may also be intercepted from the middle or the tail of the body to avoid missed detection of spam.
- Which position of an email the spam content usually appears in can be determined from statistical information by the person skilled in the art who designs the program for judging whether mail is suspected spam. When a program or device for determining the suspected spam range is then designed according to the method shown in FIG. 2, the fixed position is set specifically to the beginning, the middle, or the tail of the mail.
- When determining the suspected spam range, the resulting program or device only needs to process the subject and the part of the body at the fixed position, rather than the full text of the email.
- The statistical information can be obtained by counting, over emails already determined to be spam, the probability that spam content appears at each position of the mail.
- Step 202: Count the number of repetitions of each intercepted suspected spam feature to be determined among all the intercepted features.
- The number of repetitions may be counted by either of the following methods:
- Method one: for each suspected spam feature to be determined, count the intercepted features that have the same length as that feature and are identical to it, and use this count as the number of repetitions of the feature among all the intercepted features.
- Method two: for each suspected spam feature to be determined, count the intercepted features whose length is greater than or equal to that of the feature and which contain it, and use this count as the number of repetitions. Specifically, for each feature of equal or greater length, check whether the characters of the feature being counted all appear in it in the same order as in the feature being counted; if so, add 1 to the number of repetitions.
- For example, suppose the intercepted suspected spam features to be determined are "123456", "12345", "12345", "13589" and "1⁇2⁇3⁇4⁇5". According to method one, the number of repetitions of "12345" is 2; according to method two, it is 4.
- With method two, the effect of interference characters in spam, such as the character "⁇", is removed, which avoids missed detection of suspected spam caused by interference characters.
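The two counting methods can be illustrated with a short Python sketch. The function names are hypothetical, and `~` is used only as a stand-in for the interference character, which is not legible in the source text. Method two is modelled here as an ordered-subsequence test, which is one reading of the description above.

```python
from typing import Sequence


def is_subsequence(feature: str, candidate: str) -> bool:
    """True if every character of feature appears in candidate in the same order."""
    remaining = iter(candidate)
    # "ch in remaining" advances the iterator, so order is enforced.
    return all(ch in remaining for ch in feature)


def count_method_one(feature: str, features: Sequence[str]) -> int:
    """Method one: count intercepted features identical to the given feature."""
    return sum(1 for f in features if f == feature)


def count_method_two(feature: str, features: Sequence[str]) -> int:
    """Method two: count features at least as long that contain it in order."""
    return sum(
        1 for f in features
        if len(f) >= len(feature) and is_subsequence(feature, f)
    )


# Example from the text, with "~" standing in for the interference character.
features = ["123456", "12345", "12345", "13589", "1~2~3~4~5"]
exact = count_method_one("12345", features)    # 2
tolerant = count_method_two("12345", features)  # 4
```

Method two counts "123456" and "1~2~3~4~5" as repetitions of "12345" because the five characters appear in both in order, while "13589" is rejected because no "2" follows its "1".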
- Step 203: Determine the suspected spam features to be determined that rank within the top second predetermined number by number of repetitions as the features of suspected spam.
- The second predetermined number is a preset natural number.
- The character strings may be sorted by number of repetitions, in either descending or ascending order, and the strings in the first second-predetermined-number positions (descending order) or the last second-predetermined-number positions (ascending order) are then determined to be features of suspected spam.
- For example, the strings are sorted in descending order of their number of repetitions, and the list of mails in which each string appears is recorded for subsequent handling of the suspected spam. For details, see Table 1.
- EML in Table 1 denotes a mail.
- The second predetermined number has a value of 2, and character string A, character string B, and character string C are features of suspected spam.
- The specific value of the second predetermined number is also determined when the corresponding program for determining the suspected spam range is designed.
- "A", "B", and "C" are code names for strings, not the actual strings; for example, string A may represent the string "12345" and string B the string "6789".
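Step 203 and the layout described for Table 1 (feature, number of repetitions, list of mails) can be sketched as below. This is an illustrative sketch only: the function name, the mail identifiers, and the use of exact-match counting (method one) are assumptions, and the source does not say how ties at the cut-off are handled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_feature_table(features_by_mail: Dict[str, str],
                        second_number: int) -> List[Tuple[str, int, List[str]]]:
    """Return the top rows as (feature, repetitions, mails having the feature)."""
    mails_by_feature: Dict[str, List[str]] = defaultdict(list)
    for mail_id, feature in features_by_mail.items():
        mails_by_feature[feature].append(mail_id)
    rows = [(feature, len(mails), mails)
            for feature, mails in mails_by_feature.items()]
    # Descending order of repetitions; the sort is stable, so ties keep
    # the order in which the features were first seen.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:second_number]


# Hypothetical data: "A", "B", "C" are code names for strings, as in the text.
features_by_mail = {
    "EML1": "A", "EML2": "A", "EML3": "B",
    "EML4": "A", "EML5": "B", "EML6": "C",
}
table = build_feature_table(features_by_mail, second_number=2)
```

With this sample data and a second predetermined number of 2, the rows for "A" (3 repetitions) and "B" (2 repetitions) are kept and "C" is dropped.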
- To choose a specific value for the first predetermined number, a person skilled in the art first presets a threshold range. The meaning of the threshold range is: if the number of repetitions of a string falls within the threshold range, the string is a feature of suspected spam; otherwise it is not.
- For example, when determining the suspected spam range, the threshold range may be set to (1000, 5000).
- Suppose the first predetermined number is 5 and the number of repetitions of some string, counted according to the method shown in FIG. 2, is greater than or equal to 5000. This indicates that the first predetermined number is too small: a string repeated 5000 times or more may appear not only in spam but also in large quantities in non-spam. The designer then increases the first predetermined number, for example to 7, and counts the number of repetitions of each string again according to the method shown in FIG. 2. If the number of repetitions now lies within (1000, 5000), the value is reasonable, and the first predetermined number can be taken as 7.
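The tuning procedure described above could be automated roughly as follows. This sketch makes several assumptions that the source does not state: features are taken from the beginning of the text, only the highest repetition count is compared with the threshold range, the value is increased one step at a time, and the search gives up (returns `None`) if the count falls to or below the lower bound. All names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def top_repetitions(texts: Sequence[str], first_number: int) -> int:
    """Highest repetition count among the features of length first_number."""
    counts = Counter(text[:first_number] for text in texts)
    return max(counts.values(), default=0)


def tune_first_number(texts: Sequence[str], start: int, lower: int,
                      upper: int, max_number: int = 200) -> Optional[int]:
    """Increase the first predetermined number until counts fall in (lower, upper)."""
    for first_number in range(start, max_number + 1):
        top = top_repetitions(texts, first_number)
        if top >= upper:
            # Too many repetitions: the feature is too short to be specific.
            continue
        # Below the upper bound: accept only if also above the lower bound.
        return first_number if top > lower else None
    return None


# Toy data with a scaled-down threshold range (2, 5) instead of (1000, 5000).
texts = ["hello world"] * 3 + ["hello there"] * 3
chosen = tune_first_number(texts, start=5, lower=2, upper=5)
```

In the toy data, lengths 5 and 6 give a single feature repeated 6 times, which is at or above the upper bound, while length 7 splits the texts into two features repeated 3 times each, so 7 is chosen.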
- The features of suspected spam can be stored in a feature library. Any email having a feature in the feature library can later be judged to be suspected spam, and only the suspected spam then needs to be judged as spam or not.
- The feature library may take the form of Table 1, storing the features of suspected spam, the number of repetitions of each feature, and the list of mails in which each feature appears. It may also take other forms, such as storing only the features and their numbers of repetitions.
- The feature library occupies little storage space, so using it to determine the suspected spam range reduces the storage space occupied by the anti-spam system. By contrast, prior-art spam processing based on the full text of messages must store the full text of every message to be processed, which occupies a large amount of storage space.
- FIG. 3 is a structural diagram of a first embodiment of a system for determining a suspected spam range. As shown in FIG. 3, the system includes a character string intercepting device 301, a statistical device 302, and a suspected spam determining device 303.
- The character string intercepting device 301 is configured to intercept a first predetermined number of characters from each received email as the suspected spam feature to be determined, and send the intercepted feature to the statistical device 302.
- The statistical device 302 is configured to receive the suspected spam features to be determined, count the number of repetitions of each received feature among all the received features, and send the features ranked within the top second predetermined number by number of repetitions to the suspected spam determining device 303.
- The suspected spam determining device 303 is configured to determine the received features as features of suspected spam, and to treat mail having such a feature as suspected spam.
- The character string intercepting device 301 is further configured to: when the total number of characters in the subject and full body of the email is greater than the first predetermined number, intercept the first predetermined number of characters from a fixed position of the subject and full body as the suspected spam feature to be determined; when the total is less than the first predetermined number, intercept the subject and full body of the email as the suspected spam feature to be determined; and send the intercepted feature to the statistical device 302.
- FIG. 4 is a structural diagram of a second embodiment of a system for determining a suspected spam range, and the system shown in FIG. 4 differs from the system shown in FIG. 3 only in that:
- The suspected spam determining device 303 includes a feature library 3031 and a suspected spam determining module 3032.
- The feature library 3031 is configured to store the received suspected spam features to be determined as suspected spam features.
- The suspected spam determining module 3032 is configured to receive an email, determine whether the received email has a feature in the feature library 3031, and determine an email having such a feature to be suspected spam.
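The division of labour between the feature library 3031 and the determining module 3032 might look like the following sketch. The class and method names are hypothetical, the library stores only features and repetition counts (one of the forms mentioned earlier), and "having the feature" is assumed here to mean that the string intercepted from the beginning of the incoming email matches a stored feature exactly.

```python
from typing import Dict


class FeatureLibrary:
    """Stores suspected spam features and their numbers of repetitions."""

    def __init__(self, first_number: int) -> None:
        self.first_number = first_number
        self.repetitions: Dict[str, int] = {}

    def store(self, feature: str, repetitions: int) -> None:
        """Record a feature received from the statistical device."""
        self.repetitions[feature] = repetitions

    def is_suspected(self, subject: str, body: str) -> bool:
        """Check an incoming email against the stored features."""
        # Assumption: intercept from the beginning, then do an exact lookup.
        candidate = (subject + body)[: self.first_number]
        return candidate in self.repetitions


library = FeatureLibrary(first_number=5)
library.store("12345", 1200)
flagged = library.is_suspected("123", "45 special offer")  # feature "12345" matches
clean = library.is_suspected("Hello", " team")             # feature "Hello" not stored
```

Because the library holds short strings and counts rather than full messages, its size grows with the number of distinct features, not with the volume of mail.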
- FIG. 5 is a structural diagram of a third embodiment of a system for determining a suspected spam range, and the system shown in FIG. 5 differs from the system shown in FIG. 3 or FIG. 4 only in that it further includes a spam determining device 504.
- The spam determining device 504 is configured to determine whether the suspected spam identified by the suspected spam determining device 303 is spam.
- The spam determining device may use artificial intelligence (AI), Bayesian, neural network, or support vector machine techniques to determine whether the suspected spam is spam.
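The efficiency argument of the third embodiment is that the expensive classifier runs only on suspected spam. A minimal two-stage sketch is shown below; the function names are hypothetical, and `keyword_classifier` is a toy stand-in for the Bayesian, neural network, or support vector machine classifier, not an implementation of any of them.

```python
from typing import Callable, List, Sequence, Set


def filter_spam(emails: Sequence[str], suspected_features: Set[str],
                first_number: int,
                classifier: Callable[[str], bool]) -> List[str]:
    """Run the classifier only on emails whose feature is in the suspected set."""
    spam = []
    for email in emails:
        feature = email[:first_number]
        if feature not in suspected_features:
            continue  # outside the suspected range: never reaches the classifier
        if classifier(email):
            spam.append(email)
    return spam


def keyword_classifier(email: str) -> bool:
    """Toy stand-in for a real spam classifier (assumption, for illustration)."""
    return "prize" in email


emails = [
    "Buy now! win a prize",
    "Buy now! limited offer",
    "Hi Bob, lunch today?",
]
confirmed = filter_spam(emails, {"Buy now!"}, 8, keyword_classifier)
```

Here only the two emails starting with "Buy now!" are passed to the classifier, and only the first of them is confirmed as spam.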
- In summary, a first predetermined number of characters are intercepted from each received email as the suspected spam feature to be determined; the number of repetitions of each intercepted feature among all the intercepted features is counted; the features ranked within the top second predetermined number by number of repetitions are determined as features of suspected spam; and mail having such a feature is treated as suspected spam. The range of suspected spam can thus be determined before judging whether an email is spam, and only the suspected spam then needs to be judged, not every email, which improves the efficiency of judging whether email is spam.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Transfer Between Computers (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200810167115.4 | 2008-09-27 | ||
CN2008101671154A CN101360074B (zh) | 2008-09-27 | 2008-09-27 | Method and system for determining a suspected spam range |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010037292A1 (zh) | 2010-04-08 |
Family
ID=40332415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2009/073563 WO2010037292A1 (zh) | Method and system for determining a suspected spam range | 2008-09-27 | 2009-08-27 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN101360074B (zh) |
WO (1) | WO2010037292A1 (zh) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9348912B2 (en) | 2007-10-18 | 2016-05-24 | Microsoft Technology Licensing, Llc | Document length as a static relevance feature for ranking search results |
CN101360074B (zh) * | 2008-09-27 | 2011-09-21 | 腾讯科技(深圳)有限公司 | Method and system for determining a suspected spam range |
US8738635B2 (en) * | 2010-06-01 | 2014-05-27 | Microsoft Corporation | Detection of junk in search result ranking |
US9495462B2 (en) | 2012-01-27 | 2016-11-15 | Microsoft Technology Licensing, Llc | Re-ranking search results |
CN104283855A (zh) * | 2013-07-08 | 2015-01-14 | 北京思普崚技术有限公司 | Method for intercepting spam |
CN105279238B (zh) * | 2015-09-28 | 2018-11-06 | 北京国双科技有限公司 | Character string processing method and device |
CN114040409B (zh) * | 2021-11-11 | 2023-06-06 | 中国联合网络通信集团有限公司 | Short message identification method, device, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006293573A (ja) * | 2005-04-08 | 2006-10-26 | Yaskawa Information Systems Co Ltd | E-mail processing device, e-mail filtering method, and e-mail filtering program |
CN101106539A (zh) * | 2007-08-03 | 2008-01-16 | 浙江大学 | Spam filtering method based on support vector machine |
CN101360074A (zh) * | 2008-09-27 | 2009-02-04 | 腾讯科技(深圳)有限公司 | Method and system for determining a suspected spam range |
-
2008
- 2008-09-27 CN CN2008101671154A patent/CN101360074B/zh active Active
-
2009
- 2009-08-27 WO PCT/CN2009/073563 patent/WO2010037292A1/zh active Application Filing
Non-Patent Citations (1)
Title |
---|
LIU, JIE ET AL.: "Fast Spam Detecting Method Under High-speed Network.", COMPUTER ENGINEERING, vol. 32, no. 4, February 2006 (2006-02-01), pages 139 * |
Also Published As
Publication number | Publication date |
---|---|
CN101360074B (zh) | 2011-09-21 |
CN101360074A (zh) | 2009-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2010037292A1 (zh) | 一种确定可疑垃圾邮件范围的方法和系统 | |
US8032602B2 (en) | Prioritization of recipient email messages | |
US7321922B2 (en) | Automated solicited message detection | |
US8429178B2 (en) | Reliability of duplicate document detection algorithms | |
JP4742618B2 (ja) | 情報処理システム、プログラム及び情報処理方法 | |
US7769815B2 (en) | System and method for determining that an email message is spam based on a comparison with other potential spam messages | |
US11704583B2 (en) | Machine learning and validation of account names, addresses, and/or identifiers | |
JP4742619B2 (ja) | 情報処理システム、プログラム及び情報処理方法 | |
RU2710739C1 (ru) | Система и способ формирования эвристических правил для выявления писем, содержащих спам | |
JP2008502998A (ja) | サーバーへの電子メッセージのコンテンツについての通信情報 | |
WO2009117966A1 (zh) | 电子文件列表的显示处理方法和系统 | |
CN110213152B (zh) | 识别垃圾邮件的方法、装置、服务器及存储介质 | |
CN104717120A (zh) | 确定信息发送时间的方法和装置 | |
US20070106738A1 (en) | Message value indicator system and method | |
CN101464975A (zh) | 电子邮件的排序方法及系统 | |
CN111010336A (zh) | 一种海量邮件解析方法及装置 | |
CN101795273B (zh) | 一种垃圾邮件过滤方法及装置 | |
US8843574B2 (en) | Electronic mail system, user terminal apparatus, information providing apparatus, and computer readable medium | |
CN102118383A (zh) | 识别电子邮件的方法及识别电子邮件服务器的方法 | |
JP5366204B2 (ja) | メールフィルタリングシステム、そのコンピュータプログラム、情報生成方法 | |
JP6059559B2 (ja) | 受信メールの優先度別自動振分け装置および方法 | |
CN1987909B (zh) | 一种提纯贝叶斯垃圾邮件的方法、系统及装置 | |
CN109828957A (zh) | 信息显示方法、装置、电子设备及存储介质 | |
CN106713108B (zh) | 一种结合用户关系与贝叶斯理论的邮件分类方法 | |
Yamakawa et al. | Analysis of spam mail sent to Japanese mail addresses in the long term |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09817206 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 5958/CHENP/2010 Country of ref document: IN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04/08/2011) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09817206 Country of ref document: EP Kind code of ref document: A1 |