WO2014036787A1 - Mail process method and system

Info

Publication number
WO2014036787A1
Authority
WO
WIPO (PCT)
Prior art keywords
mail
feature information
server
central server
spam
Prior art date
Application number
PCT/CN2012/085093
Other languages
French (fr)
Chinese (zh)
Inventor
林延中
潘庆峰
Original Assignee
盈世信息科技(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 盈世信息科技(北京)有限公司
Publication of WO2014036787A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21: Monitoring or handling of messages
    • H04L51/212: Monitoring or handling of messages using filtering or selective blocking

Definitions

  • the present invention relates to the field of communications, and in particular, to a mail processing method and system.
  • the technical problem to be solved by the embodiments of the present invention is to provide a mail processing method and system that extract from each mail a small number of features that best represent it. These features enable the central server to cluster and classify mails, and the amount of data that needs to be transferred is very small, which greatly reduces the traffic between the mail server and the central server and allows the mail server to handle very large volumes of mail while achieving a high level of spam filtering.
  • an embodiment of the present invention provides a mail processing method, including:
  • the mail server obtains the corresponding feature information of the mail according to different types of the mail, and sends the feature information to the central server;
  • the central server determines, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feeds back the result of the judgment to the mail server.
  • the method further includes:
  • the central server further performs a clustering operation on the mail according to the mail feature information, and the clustering operation classifies the mails with similar mail feature information into one category.
  • the step of the mail server acquiring the corresponding feature information of the mail according to different types of mails includes:
  • when the mail is text data, the text data is processed according to the Nilsimsa algorithm to obtain 64-byte sequence feature information representative of the mail;
  • when the mail is picture data, feature information of the picture is extracted according to the compression rate distribution characteristic of the picture in the mail;
  • when the mail is other data that is neither text nor picture, the other data is calculated according to the MD5 algorithm to obtain 32-byte sequence feature information.
  • the step of determining, by the central server, whether the email corresponding to the feature information is a spam email according to the feature information comprises:
  • the step of the central server further performing a clustering operation on the mail according to the feature information includes:
  • the similar messages are classified into one category for storage.
  • the present invention further provides a mail processing system, including:
  • the mail server is configured to receive the mail sent by the user, obtain the mail feature information of the mail, send it to the central server, and perform the corresponding operation according to the result of the central server's judgment of the mail based on that feature information;
  • a central server configured to receive the mail feature information sent by the mail server, determine, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feed back the result of the judgment to the mail server;
  • the central server further performs a clustering operation on the mail according to the mail feature information, and the clustering operation classifies the mails with similar mail feature information into one category.
  • the mail server comprises:
  • a text message feature extraction unit configured to process the text data according to a Nilsimsa algorithm when the mail is text data, to obtain 64-byte sequence feature information representative of the mail;
  • a picture mail feature extraction unit configured to: when the mail is picture data, extract feature information of the picture according to a compression rate distribution characteristic of the picture in the mail;
  • an other-data mail feature extraction unit configured to, when the mail is other data that is neither text nor picture, calculate the other data according to the MD5 algorithm to obtain 32-byte sequence feature information.
  • the central server comprises:
  • a receiving unit configured to receive feature information sent by the mail server;
  • a determining unit configured to compare the feature information with the spam feature information already held by the central server, and to determine that the mail corresponding to the feature information is spam when the similarity between the feature information and the spam feature information exceeds a preset criterion, and otherwise that the mail corresponding to the feature information is normal mail;
  • a sending unit configured to send result information determined by the determining unit to the mail server.
  • the central server further includes:
  • a comparing unit configured to compare, pair by pair, the mail feature information sent to the central server and, when the similarity between two pieces of mail feature information exceeds a preset criterion, to determine that the mails they correspond to are similar mails; and a categorization storage unit configured to group the similar mails into one category for storage.
  • the mail server can upload the features of the mail to be checked in real time and immediately obtain the judgment result returned by the central server, so timeliness is high.
  • the present invention extracts different features based on information in different formats within the mail, and these features retain sufficient information for the central server to cluster and classify, and determine whether the mail is spam.
  • the central server can also process the query requests sent by a large number of mail servers, obtain sufficient data volume, perform clustering and classification, and greatly enhance the filtering effect of the central server.
  • FIG. 1 is a schematic flow chart of a mail processing method according to the present invention.
  • FIG. 2 is a schematic flow chart of still another method for processing a mail according to the present invention.
  • FIG. 3 is a schematic structural diagram of a mail processing system according to the present invention.
  • FIG. 4 is a schematic structural diagram of a mail server of the mail processing system of the present invention.
  • FIG. 5 is a schematic structural diagram of a central server of the mail processing system of the present invention.
  • FIG. 6 is another schematic structural diagram of a central server of the mail processing system of the present invention.
  • FIG. 1 is a schematic flow chart of a mail processing method according to an embodiment of the present invention.
  • the mail server acquires corresponding feature information of the mail according to different types of mails, and sends the feature information to the central server.
  • the central server determines, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feeds back a result of the judgment to the mail server.
  • FIG. 2 is a schematic flow chart of another method for processing a mail according to an embodiment of the present invention.
  • the mail server receives the mail sent by the user.
  • 201 When the mail is text data, the text data is processed according to a Nilsimsa algorithm, and 64-byte sequence feature information representative of the mail is obtained.
  • the Nilsimsa algorithm first splits the content of the mail and extracts each run of 4 adjacent bytes (4 bytes is an empirical value: a Chinese character generally takes two bytes, and a typical Chinese phrase contains two characters).
  • for example, for the text "这是一个测试" ("this is a test"), the raw features extracted are "这是", "是一", "一个", "个测", "测试". If one character is inserted at random so that the text becomes "这只是一个测试" ("this is just a test"), the raw features extracted are "这只", "只是", "是一", "一个", "个测", "测试". As the example shows, a small change to the original text affects only two of the extracted raw features ("这是" → "这只", "只是").
  • each raw feature is then mapped to an integer by a mapping function. There is no particular requirement on the mapping function, as long as it maps a string to an integer.
  • one of the simplest mapping functions treats the binary number formed by the four bytes that represent two Chinese characters as a four-byte integer. This integer is then taken modulo 512 and saved into a histogram of 512 buckets.
  • the picture is scanned to obtain the compression ratio of each sub-block of the picture, and the compression ratios of every N consecutive sub-blocks are combined into a new compression-rate change element, where N is a natural number greater than 1 that can be set as required; each compression-rate change element is then combined with the code of its position in the picture to obtain the feature information of the picture.
  • 206: the feature information is compared with the spam feature information already held by the central server; when the similarity between the feature information and the spam feature information exceeds a preset criterion, the mail corresponding to the feature information is determined to be spam; otherwise, it is determined to be normal mail.
  • a honeypot mailbox is an email account that we register ourselves and publish on the Internet so that spammers collect it. Since these accounts are not actually used, the mail sent to them is essentially all spam. If mails received by several honeypot mailboxes have similar features, those features can essentially be regarded as spam features.
  • 208: the mail feature information sent to the central server is compared pair by pair; when the similarity between two pieces of mail feature information exceeds a preset criterion, the mails they correspond to are determined to be similar mails.
  • FIG. 3 is a schematic structural diagram of a mail processing system 1 according to an embodiment of the present invention, including:
  • the mail server 11 is configured to receive the mail sent by the user, obtain the mail feature information of the mail, send it to the central server 12, and perform the corresponding operation according to the result of the central server 12's judgment of the mail based on that feature information; the central server 12 is configured to receive the mail feature information sent by the mail server 11, determine from it whether the corresponding mail is spam, and feed the result of the judgment back to the mail server 11.
  • the central server 12 further performs a clustering operation on the mail according to the mail feature information, the clustering operation being to group mails with similar mail feature information into one category.
  • FIG. 4 is a schematic structural diagram of a mail server 11 in a mail processing system 1 according to an embodiment of the present invention, including: a text mail feature extraction unit 111, configured to process the text data according to the Nilsimsa algorithm when the mail is text data, to obtain 64-byte sequence feature information representative of the mail.
  • the Nilsimsa algorithm first splits the content of the mail and extracts each run of 4 adjacent bytes (4 bytes is an empirical value: a Chinese character generally takes two bytes, and a typical Chinese phrase contains two characters). For example, for the text "这是一个测试" ("this is a test"), the raw features extracted are "这是", "是一", "一个", "个测", "测试". If one character is inserted at random so that the text becomes "这只是一个测试" ("this is just a test"), the raw features extracted are "这只", "只是", "是一", "一个", "个测", "测试". As the example shows, a small change to the original text affects only two of the extracted raw features ("这是" → "这只", "只是").
  • each raw feature is then mapped to an integer by a mapping function. There is no particular requirement on the mapping function, as long as it maps a string to an integer.
  • one example of such a function treats the binary number formed by the four bytes that represent two Chinese characters as a four-byte integer. This integer is then taken modulo 512 and saved into a histogram of 512 buckets.
  • the picture mail feature extraction unit 112 is configured to extract feature information of the picture according to a compression rate distribution characteristic of the picture in the mail when the mail is picture data.
  • specifically, the picture is scanned to obtain the compression ratio of each sub-block, and the compression ratios of every N consecutive sub-blocks are combined into a new compression-rate change element, where N is a natural number greater than 1 that can be set as required; each compression-rate change element is then combined with the code of its position in the picture to obtain the feature information of the picture.
  • the other data mail feature extraction unit 113 is configured to, when the mail is other data that is neither text nor picture, calculate the other data according to the MD5 algorithm to obtain 32-byte sequence feature information.
  • the sending unit 114 is configured to send the extracted mail feature information to the central server 12.
  • FIG. 5 is a schematic structural diagram of a central server 12 in a mail processing system 1 according to an embodiment of the present invention, including: a receiving unit 121, configured to receive feature information sent by the mail server 11;
  • the determining unit 122 is configured to compare the feature information with the spam feature information already held by the central server 12 and, when the similarity between the feature information and the spam feature information exceeds a preset criterion, to determine that the mail corresponding to the feature information is spam; otherwise, the mail corresponding to the feature information is determined to be normal mail;
  • the sending unit 123 is configured to send the result information determined by the determining unit 122 to the mail server.
  • FIG. 6 is another schematic structural diagram of the central server 12 in the mail processing system 1 according to the embodiment of the present invention. Unlike FIG. 5, the central server 12 further includes:
  • the comparing unit 124 is configured to perform the pairwise matching of the mail feature information sent to the central server.
  • the categorization storage unit 125 is configured to classify the similar mails into one category for storage.
  • the mail server can upload the features of the mail to be checked in real time and immediately obtain the judgment result returned by the central server, so timeliness is high.
  • the present invention extracts different features based on information in different formats within the mail, and these features retain sufficient information for the central server to cluster and classify, and determine whether the mail is spam.
  • the central server can also process the query requests sent by a large number of mail servers, obtain sufficient data volume, perform clustering and classification, and greatly enhance the filtering effect of the central server.
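
The pairwise comparison and categorization attributed to the comparing unit 124 and the categorization storage unit 125 might be sketched as follows. This is only an illustration under stated assumptions: the text does not define the similarity measure or the preset criterion, so the bit-agreement measure `bit_similarity`, the `threshold` of 0.9, and the choice to merge transitively similar mails with a union-find structure are all hypothetical.

```python
from typing import List, Sequence


def bit_similarity(a: bytes, b: bytes) -> float:
    """Fraction of bit positions on which two equal-length features agree (assumed measure)."""
    if len(a) != len(b) or not a:
        raise ValueError("features must be non-empty and of equal length")
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))


def cluster_features(features: Sequence[bytes], threshold: float = 0.9) -> List[List[int]]:
    """Group indices of features whose pairwise similarity exceeds the threshold."""
    parent = list(range(len(features)))

    def find(i: int) -> int:
        # Follow parent links to the representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison of every feature with every other feature.
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if bit_similarity(features[i], features[j]) > threshold:
                parent[find(i)] = find(j)

    # Collect the members of each category.
    clusters = {}
    for i in range(len(features)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The comparison is quadratic in the number of features, which is acceptable for a sketch; a central server that receives features from many mail servers would presumably need an index or bucketing scheme, which the text does not describe.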

Abstract

Disclosed is a mail processing method in which a mail server obtains feature information of a mail according to the type of the mail and sends the feature information to a central server; the central server determines from the feature information whether the corresponding mail is spam and returns the determination result to the mail server. Correspondingly, the invention also provides a mail processing system. With embodiments of the invention, the mail server can upload the features of a mail to be checked in real time and immediately obtain the determination result returned by the central server, so timeliness is high. The invention extracts different features according to the differently formatted information within a mail and supplies them to the central server for clustering and classification and for determining whether the mail is spam. In addition, by processing the query requests sent by a large number of mail servers, the central server can obtain a sufficient volume of data for clustering and classification, which greatly improves its filtering effect.

Description

Mail processing method and system
Technical field
[0001] The present invention relates to the field of communications, and in particular to a mail processing method and system.
Background
[0002] With the development of communication technology, mail has become an important tool for everyday communication, but it has brought with it a huge volume of spam, which seriously affects users' normal use of mail. Existing anti-spam filtering devices periodically download the filtering rule base from a central server and update it on a schedule in order to be able to filter spam. The timeliness of this approach is limited: between two updates, a batch of newly emerged types of spam may be missed.
[0003] One solution to this timeliness problem is to forward all mail to the central server for filtering. Its drawbacks are that it consumes a large amount of bandwidth and that, if the central server must handle forwarding requests from dozens or even hundreds of mail servers at the same time, the hardware requirements become very high and a large number of servers may be needed.
Summary of the invention
[0004] The technical problem to be solved by the embodiments of the present invention is to provide a mail processing method and system that extract from each mail a small number of features that best represent it. These features enable the central server to cluster and classify mails, and the amount of data that needs to be transferred is very small, which greatly reduces the traffic between the mail server and the central server and allows the mail server to handle very large volumes of mail while achieving a high level of spam filtering.
[0005] To achieve the above technical effect, an embodiment of the present invention provides a mail processing method, including:
the mail server obtains the corresponding feature information of a mail according to the type of the mail, and sends the feature information to the central server;
the central server determines, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feeds the result of the judgment back to the mail server.
[0006] Further, the method also includes:
the central server also performs a clustering operation on the mail according to the mail feature information, the clustering operation being to group mails with similar mail feature information into one category.
[0007] Preferably, the step in which the mail server obtains the corresponding feature information of the mail according to the type of the mail includes:
when the mail is text data, processing the text data according to the Nilsimsa algorithm to obtain 64-byte sequence feature information representative of the mail;
when the mail is picture data, extracting feature information of the picture according to the compression rate distribution characteristic of the picture in the mail;
when the mail is other data that is neither text nor picture, calculating the other data according to the MD5 algorithm to obtain 32-byte sequence feature information.
[0008] Preferably, the step in which the central server determines, according to the feature information, whether the mail corresponding to the feature information is spam includes:
receiving the feature information sent by the mail server;
comparing the feature information with the spam feature information already held by the central server and, when the similarity between the feature information and the spam feature information exceeds a preset criterion, determining that the mail corresponding to the feature information is spam; otherwise, determining that the mail corresponding to the feature information is normal mail.
[0009] Preferably, the step in which the central server also performs a clustering operation on the mail according to the feature information includes:
comparing, pair by pair, the mail feature information sent to the central server and, when the similarity between two pieces of mail feature information exceeds a preset criterion, determining that the mails they correspond to are similar mails;
grouping the similar mails into one category for storage.
[0010] Correspondingly, the present invention also provides a mail processing system, including:
a mail server, configured to receive the mail sent by the user, obtain the mail feature information of the mail, send it to the central server, and perform the corresponding operation according to the result of the central server's judgment of the mail based on that feature information;
a central server, configured to receive the mail feature information sent by the mail server, determine, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feed the result of the judgment back to the mail server;
the central server also performs a clustering operation on the mail according to the mail feature information, the clustering operation being to group mails with similar mail feature information into one category.
[0011] Preferably, the mail server includes:
a text mail feature extraction unit, configured to process the text data according to the Nilsimsa algorithm when the mail is text data, to obtain 64-byte sequence feature information representative of the mail;
a picture mail feature extraction unit, configured to extract feature information of the picture according to the compression rate distribution characteristic of the picture in the mail when the mail is picture data;
an other-data mail feature extraction unit, configured to calculate the other data according to the MD5 algorithm when the mail is other data that is neither text nor picture, to obtain 32-byte sequence feature information.
[0012] Preferably, the central server includes:
a receiving unit, configured to receive the feature information sent by the mail server;
a determining unit, configured to compare the feature information with the spam feature information already held by the central server and, when the similarity between the feature information and the spam feature information exceeds a preset criterion, to determine that the mail corresponding to the feature information is spam; otherwise, the mail corresponding to the feature information is determined to be normal mail;
a sending unit, configured to send the result determined by the determining unit to the mail server.
[0013] Preferably, the central server further includes:
a comparing unit, configured to compare, pair by pair, the mail feature information sent to the central server and, when the similarity between two pieces of mail feature information exceeds a preset criterion, to determine that the mails they correspond to are similar mails;
a categorization storage unit, configured to group the similar mails into one category for storage.
[0014] Implementing the present invention has the following beneficial effects:
The mail server can upload the features of the mail to be checked in real time and immediately obtain the judgment result returned by the central server, so timeliness is high. The invention extracts different features according to the differently formatted information within a mail, and these features retain enough information for the central server to cluster and classify mails and to determine whether a mail is spam. In addition, by processing the query requests sent by a large number of mail servers, the central server can obtain a sufficient volume of data for clustering and classification, which greatly strengthens its filtering effect.
Brief description of the drawings
[0015] FIG. 1 is a schematic flow chart of a mail processing method according to the present invention;
FIG. 2 is another schematic flow chart of a mail processing method according to the present invention;
FIG. 3 is a schematic structural diagram of a mail processing system according to the present invention;
FIG. 4 is a schematic structural diagram of the mail server of the mail processing system of the present invention;
FIG. 5 is a schematic structural diagram of the central server of the mail processing system of the present invention;
FIG. 6 is another schematic structural diagram of the central server of the mail processing system of the present invention.
Detailed description
[0016] To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
[0017] FIG. 1 is a schematic flow chart of a mail processing method according to an embodiment of the present invention.
[0018] 100. The mail server obtains the corresponding feature information of a mail according to the type of the mail, and sends the feature information to the central server.
[0019] 101. The central server determines, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feeds the result of the judgment back to the mail server.
[0020] FIG. 2 is another schematic flow chart of a mail processing method according to an embodiment of the present invention.
[0021] 200. The mail server receives the mail sent by the user.
[0022] 201. When the mail is text data, the text data is processed according to the Nilsimsa algorithm to obtain 64-byte sequence feature information representative of the mail.
[0023] Nilsimsa算法首先对邮件内容进行分拆, 把相邻 4个字节提取出来 (4个字节是经验 值, 考虑到一般汉字需要两个字节表示, 一般汉语词组包含两个汉字)。 比如对于文本 "这 是一个测试", 则提取的原始特征为: "这是", "是一", "一个", "个测", "测试"。 假 如在这个文本中随机添加一个字符信息, 文本变成 "这只是一个测试" 则提取的原始特征 为 "这只", "只是", "是一", "一个", "个测", "测试"。 从上面的例子看, 如果稍微改变 原文一点, 则最终也只是影响了提取的原始特征中的两个 ("这是" → "这只", "只 是 ")。 所以只要判定两个文本序列生成的原始特征相似的比例, 即可间接获得两个文本序 列的相似比例。 再对每个原始特征映射通过映射函数映射出一个整数 (映射函数没有特别要 求, 只要能将一个字符串映射成一个整数的函数都可以。 一个最简单的映射函数例子就是将 代表两个汉字的四个字节对应的二进制数字看成一个四字节的整数)。 然后将这个整数对 512取模, 保存到一个 512个桶的直方图中。  [0023] The Nilsimsa algorithm first splits the content of the mail and extracts the adjacent 4 bytes (4 bytes is the empirical value, considering that the general Chinese character requires two bytes, and the general Chinese phrase contains two Chinese characters) . For example, for the text "This is a test", the original features extracted are: "This is", "Yes", "One", "Measure", "Test". If a character information is randomly added to this text, the text becomes "this is just a test" and the original features extracted are "this only", "just", "is one", "one", "a", " test". From the above example, if you change the original text a little, it will only affect the two of the original features extracted ("This is "→" this one, "only"). Therefore, as long as the proportion of the original features generated by the two text sequences is determined to be similar, the similar proportions of the two text sequences can be obtained indirectly. Then, for each original feature map, an integer is mapped by the mapping function (the mapping function is not particularly required, as long as a string can be mapped to an integer function. One of the simplest mapping function examples is to represent two Chinese characters. The binary number corresponding to four bytes is treated as a four-byte integer). This integer pair 512 is then modulo and saved into a histogram of 512 buckets.
[0024] 对直方图再做一次 0/1化处理。 首先计算这个直方图的平均高度, 然后把高于直方图 平均高度的桶设置为 1, 低于平均高度的桶设置为 0。 于是就可以获得了一个 512bit (即 64字 节) 的一个特征序列了。  [0024] Do another 0/1 processing on the histogram. First calculate the average height of this histogram, then set the bucket above the average height of the histogram to 1, and set the bucket below the average height to 0. A 512-bit (ie, 64-byte) sequence of features is then obtained.
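The feature extraction described in paragraphs [0023] and [0024] can be sketched as follows. This is a minimal illustration that follows the description in this document, not a reference Nilsimsa implementation. The function names, the choice of GBK as the two-byte encoding, and the treatment of buckets exactly equal to the average (set to 0) are assumptions made for the example.

```python
def raw_features(text):
    """Return every pair of adjacent characters (4 bytes for Chinese text)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def text_feature(text, encoding="gbk", buckets=512):
    """Build the 512-bit (64-byte) feature sequence for a text.

    The encoding and the tie handling are assumptions, not from the source.
    """
    histogram = [0] * buckets
    for feature in raw_features(text):
        # Simplest mapping named in the text: read the bytes as one integer.
        value = int.from_bytes(feature.encode(encoding, errors="replace"), "big")
        histogram[value % buckets] += 1

    mean = sum(histogram) / buckets
    bits = 0
    for height in histogram:
        # Buckets above the mean become 1; all others become 0.
        bits = (bits << 1) | (1 if height > mean else 0)
    return bits.to_bytes(buckets // 8, "big")


if __name__ == "__main__":
    print(raw_features("这是一个测试"))
    print(text_feature("这是一个测试").hex())
```

Working on character pairs rather than on a sliding byte window keeps the example aligned with the worked example in the text; a byte-oriented variant would need a rule for mixed one-byte and two-byte characters, which the document does not give.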
[0025] 202. When the mail is picture data, feature information of the picture is extracted according to the compression-ratio distribution characteristics of the picture in the mail.
[0026] Specifically, the picture is scanned to obtain the compression ratio of each sub-block of the picture. The compression ratios of every N consecutive sub-blocks are merged into a new compression-ratio variation element, where N is a natural number greater than 1 that can be set as required. Each compression-ratio variation element is then combined with the code of its position in the picture to obtain the feature information of the picture.
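The document does not state how the compression ratio of a sub-block is measured or how N ratios are merged, so the following is only a rough sketch under stated assumptions: sub-blocks are fixed-size slices of raw pixel bytes, the ratio is taken from `zlib`, N ratios are merged by averaging, and the position code is simply the index of the group. All names and parameters are hypothetical.

```python
import zlib


def block_ratios(pixels, block_size):
    """Compression ratio (compressed size / raw size) of each sub-block."""
    ratios = []
    for start in range(0, len(pixels), block_size):
        block = pixels[start:start + block_size]
        ratios.append(len(zlib.compress(block)) / len(block))
    return ratios


def picture_feature(pixels, block_size=64, n=4):
    """Merge every n consecutive ratios and pair each with its position."""
    ratios = block_ratios(pixels, block_size)
    feature = []
    for start in range(0, len(ratios), n):
        group = ratios[start:start + n]
        # Assumption: the merged element is the rounded average of the group.
        element = round(sum(group) / len(group), 2)
        feature.append((start // n, element))
    return feature


if __name__ == "__main__":
    flat_image = bytes(1024)  # 1024 identical bytes standing in for pixels
    print(picture_feature(flat_image))
```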
[0027] 203. When the mail is other data that is neither text nor a picture, the other data is processed with the MD5 algorithm to obtain a 32-byte feature sequence.
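A short sketch of step 203 follows. An MD5 digest is 128 bits (16 bytes), so the example assumes that the 32-byte sequence mentioned above is the hexadecimal text form of the digest; the document does not say this explicitly. The function name `other_feature` is hypothetical.

```python
import hashlib


def other_feature(data):
    """Return the MD5 digest of the data as 32 ASCII hex characters."""
    # Assumption: "32-byte sequence" means the hex-encoded 16-byte digest.
    return hashlib.md5(data).hexdigest().encode("ascii")


if __name__ == "__main__":
    feature = other_feature(b"example attachment bytes")
    print(len(feature), feature.decode("ascii"))
```

Unlike the text and picture features, an MD5 digest changes completely when a single byte of the input changes, so this feature supports exact-match comparison only.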
[0028] It should be noted that steps 201, 202, and 203 have no required order; exactly one of them is executed, depending on the type of data in the mail.
[0029] 204. The mail server sends the mail feature information extracted from the mail to the central server.
[0030] 205. The central server receives the feature information sent by the mail server.
[0031] 206. The feature information is compared with the spam feature information already held by the central server. When the similarity between the feature information and the spam feature information exceeds a preset threshold, the mail corresponding to the feature information is determined to be spam; otherwise, the mail corresponding to the feature information is determined to be normal mail.
[0032] It should be noted that honeypot mailboxes and user reports can establish that a given mail is spam. A mail to be judged is then compared with the known spam, and the unknown sample is determined to be spam if it is similar to a known spam mail. A honeypot mailbox is a mailbox account that we register ourselves and publish on the Internet so that spammers collect it. Because these accounts are not actually used for correspondence, the mail sent to them is essentially all spam. If mails with similar features are received by several honeypot mailboxes, those features can essentially be regarded as spam features.
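The comparison in step 206 can be illustrated with a small sketch. The document does not define the similarity measure or the preset threshold; the example assumes the fraction of matching bits between two equal-length feature sequences and a threshold of 0.9. The names `bit_similarity` and `is_spam` are hypothetical.

```python
def bit_similarity(a, b):
    """Fraction of bit positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("feature sequences must have the same length")
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))


def is_spam(feature, known_spam, threshold=0.9):
    """True if the feature is similar enough to any known spam feature."""
    # Assumption: one sufficiently similar known spam feature is enough.
    return any(bit_similarity(feature, spam) > threshold for spam in known_spam)


if __name__ == "__main__":
    known = [bytes(64)]                  # one known spam feature (all zero bits)
    near = bytes([1]) + bytes(63)        # differs from it in a single bit
    far = b"\xff" * 64                   # differs from it in every bit
    print(is_spam(near, known), is_spam(far, known))
```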
[0033] 207. The central server feeds the result back to the mail server.
[0034] 208. The mail feature information sent to the central server is compared pairwise. When the similarity between two pieces of mail feature information exceeds a preset threshold, the mails corresponding to them are determined to be similar mails.
[0035] 209. The similar mails are grouped into one class and stored.
[0036] It should be noted that steps 208 and 209 have no required order relative to the other steps.
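Steps 208 and 209 can be sketched as a pairwise comparison followed by grouping. The example below assumes the same bit-agreement similarity and 0.9 threshold as an illustration, and it assumes that similarity is treated as transitive (if A is similar to B and B to C, all three land in one class), which the document does not specify. The names are hypothetical.

```python
def bit_similarity(a, b):
    """Fraction of bit positions at which two equal-length sequences agree."""
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))


def cluster(features, threshold=0.9):
    """Group indices of features whose pairwise similarity exceeds threshold."""
    parent = list(range(len(features)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if bit_similarity(features[i], features[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(features)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    mails = [bytes(64), bytes([1]) + bytes(63), b"\xff" * 64]
    print(cluster(mails))
```

The pairwise loop is quadratic in the number of features, which is acceptable for a sketch; a production system would likely need an index to avoid comparing every pair.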
[0037] FIG. 3 is a schematic structural diagram of a mail processing system 1 according to an embodiment of the present invention, including:
a mail server 11, configured to receive mail sent by a user, obtain the mail feature information of the mail and send it to a central server 12, and perform a corresponding operation according to the result of the central server 12's determination on the mail based on the mail feature information; and
a central server 12, configured to receive the mail feature information sent by the mail server 11, determine according to the mail feature information whether the corresponding mail is spam, and feed the determination result back to the mail server.
[0038] The central server 12 further performs a clustering operation on the mail according to the mail feature information, where the clustering operation groups mails with similar mail feature information into one class.
[0039] FIG. 4 is a schematic structural diagram of the mail server 11 in the mail processing system 1 according to an embodiment of the present invention, including:
a text mail feature extraction unit 111, configured to, when the mail is text data, process the text data according to the Nilsimsa algorithm to obtain a 64-byte feature sequence that represents the mail.
[0040] It should be noted that the Nilsimsa algorithm first splits the mail content by extracting each run of 4 adjacent bytes. (The value of 4 bytes is empirical: a Chinese character generally occupies two bytes, and a Chinese word generally consists of two characters.) For example, for the text "这是一个测试" ("This is a test"), the extracted raw features are "这是", "是一", "一个", "个测", and "测试". If one character is inserted at random so that the text becomes "这只是一个测试" ("This is just a test"), the extracted raw features are "这只", "只是", "是一", "一个", "个测", and "测试". As the example shows, a small change to the original text affects only two of the extracted raw features ("这是" → "这只", "只是"). Therefore, the similarity of two text sequences can be obtained indirectly by determining the proportion of raw features they share. Each raw feature is then mapped to an integer by a mapping function. (There is no special requirement on the mapping function; any function that maps a string to an integer will do. The simplest example treats the four bytes that represent two Chinese characters as a four-byte integer.) This integer is then taken modulo 512 and recorded in a histogram with 512 buckets.
[0041] The histogram is then binarized. First, the average height of the histogram is calculated; buckets higher than the average are set to 1, and buckets lower than the average are set to 0. This yields a feature sequence of 512 bits (that is, 64 bytes).
[0042] A picture mail feature extraction unit 112, configured to, when the mail is picture data, extract feature information of the picture according to the compression-ratio distribution characteristics of the picture in the mail.
[0043] Specifically, the picture is scanned to obtain the compression ratio of each sub-block of the picture. The compression ratios of every N consecutive sub-blocks are merged into a new compression-ratio variation element, where N is a natural number greater than 1 that can be set as required. Each compression-ratio variation element is then combined with the code of its position in the picture to obtain the feature information of the picture.
[0044] An other-data mail feature extraction unit 113, configured to, when the mail is other data that is neither text nor a picture, process the other data with the MD5 algorithm to obtain a 32-byte feature sequence.
[0045] A sending unit 114, configured to send the extracted mail feature information to the central server 12.
[0046] FIG. 5 is a schematic structural diagram of the central server 12 in the mail processing system 1 according to an embodiment of the present invention, including:
a receiving unit 121, configured to receive the feature information sent by the mail server 11;
a determining unit 122, configured to compare the feature information with the spam feature information already held by the central server 12, and, when the similarity between the feature information and the spam feature information exceeds a preset threshold, determine that the mail corresponding to the feature information is spam, and otherwise determine that the mail corresponding to the feature information is normal mail; and
a sending unit 123, configured to send the result determined by the determining unit 122 to the mail server.
[0047] FIG. 6 is another schematic structural diagram of the central server 12 in the mail processing system 1 according to an embodiment of the present invention. Unlike FIG. 5, it further includes:
a comparing unit 124, configured to compare pairwise the mail feature information sent to the central server, and, when the similarity between two pieces of mail feature information exceeds a preset threshold, determine that the mails corresponding to them are similar mails; and
a classification storage unit 125, configured to group the similar mails into one class and store them.
[0048] By implementing the present invention, the mail server can upload the features of a mail to be checked in real time and immediately obtain the determination result returned by the central server, so the check is timely. The invention extracts different features according to the different formats of information in the mail, and these features retain enough information for the central server to cluster and classify the mail and to determine whether it is spam. In addition, by processing the query requests sent by a large number of mail servers, the central server obtains enough data for clustering and classification, which greatly strengthens its filtering effect.
[0049] The above is a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements are also regarded as falling within the scope of protection of the present invention.

Claims

1. A mail processing method, comprising:
a mail server obtaining feature information corresponding to a mail according to the type of the mail, and sending the feature information to a central server; and
the central server determining, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feeding the determination result back to the mail server.
2. The mail processing method according to claim 1, further comprising:
the central server performing a clustering operation on the mail according to the mail feature information, wherein the clustering operation groups mails with similar mail feature information into one class.
3. The mail processing method according to claim 2, wherein the step of the mail server obtaining feature information corresponding to the mail according to the type of the mail comprises:
when the mail is text data, processing the text data according to the Nilsimsa algorithm to obtain a 64-byte feature sequence that represents the mail;
when the mail is picture data, extracting feature information of the picture according to the compression-ratio distribution characteristics of the picture in the mail; and
when the mail is other data that is neither text nor a picture, processing the other data with the MD5 algorithm to obtain a 32-byte feature sequence.
4. The mail processing method according to claim 3, wherein the step of the central server determining, according to the feature information, whether the mail corresponding to the feature information is spam comprises:
receiving the feature information sent by the mail server; and
comparing the feature information with the spam feature information already held by the central server, and, when the similarity between the feature information and the spam feature information exceeds a preset threshold, determining that the mail corresponding to the feature information is spam, and otherwise determining that the mail corresponding to the feature information is normal mail.
5. The mail processing method according to claim 4, wherein the step of the central server performing a clustering operation on the mail according to the feature information comprises:
comparing pairwise the mail feature information sent to the central server, and, when the similarity between two pieces of mail feature information exceeds a preset threshold, determining that the mails corresponding to them are similar mails; and
grouping the similar mails into one class and storing them.
6. A mail processing system, comprising:
a mail server, configured to receive mail sent by a user, obtain the mail feature information of the mail and send it to a central server, and perform a corresponding operation according to the result of the central server's determination on the mail based on the mail feature information; and
a central server, configured to receive the mail feature information sent by the mail server, determine according to the mail feature information whether the corresponding mail is spam, and feed the determination result back to the mail server;
wherein the central server further performs a clustering operation on the mail according to the mail feature information, and the clustering operation groups mails with similar mail feature information into one class.
7. The mail processing system according to claim 6, wherein the mail server comprises:
a text mail feature extraction unit, configured to, when the mail is text data, process the text data according to the Nilsimsa algorithm to obtain a 64-byte feature sequence that represents the mail;
a picture mail feature extraction unit, configured to, when the mail is picture data, extract feature information of the picture according to the compression-ratio distribution characteristics of the picture in the mail; and
an other-data mail feature extraction unit, configured to, when the mail is other data that is neither text nor a picture, process the other data with the MD5 algorithm to obtain a 32-byte feature sequence.
8. The mail processing system according to claim 7, wherein the central server comprises:
a receiving unit, configured to receive the feature information sent by the mail server;
a determining unit, configured to compare the feature information with the spam feature information already held by the central server, and, when the similarity between the feature information and the spam feature information exceeds a preset threshold, determine that the mail corresponding to the feature information is spam, and otherwise determine that the mail corresponding to the feature information is normal mail; and
a sending unit, configured to send the result determined by the determining unit to the mail server.
9. The mail processing system according to claim 8, wherein the central server further comprises:
a comparing unit, configured to compare pairwise the mail feature information sent to the central server, and, when the similarity between two pieces of mail feature information exceeds a preset threshold, determine that the mails corresponding to them are similar mails; and
a classification storage unit, configured to group the similar mails into one class and store them.
PCT/CN2012/085093 2012-09-07 2012-11-23 Mail process method and system WO2014036787A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210327916.9 2012-09-07
CN201210327916.9A CN103684971B (en) 2012-09-07 2012-09-07 Method and system for processing mails

Publications (1)

Publication Number Publication Date
WO2014036787A1 true WO2014036787A1 (en) 2014-03-13

Family

ID=50236487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/085093 WO2014036787A1 (en) 2012-09-07 2012-11-23 Mail process method and system

Country Status (2)

Country Link
CN (1) CN103684971B (en)
WO (1) WO2014036787A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224569B (en) * 2014-06-30 2018-09-07 华为技术有限公司 A kind of data filtering, the method and device for constructing data filter

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204989B (en) 2017-06-30 2020-11-10 腾讯科技(深圳)有限公司 Advertisement blocking method, terminal, server and storage medium
CN110048936B (en) * 2019-04-18 2021-09-10 宁波青年优品信息科技有限公司 Method for judging junk mail by semantic associated words
CN113630302B (en) * 2020-05-09 2023-07-11 阿里巴巴集团控股有限公司 Junk mail identification method and device and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1909520A (en) * 2006-08-04 2007-02-07 华南理工大学 Rubbish mail filtration system and method based on email server
CN1941746A (en) * 2005-09-27 2007-04-04 腾讯科技(深圳)有限公司 Method and system against rubbish e-mails
CN101094197A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and mail server of anti garbage mail

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101917352B (en) * 2010-06-12 2012-07-25 盈世信息科技(北京)有限公司 Method for recognizing picture spam mails and system thereof

Also Published As

Publication number Publication date
CN103684971A (en) 2014-03-26
CN103684971B (en) 2017-02-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12884318

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12884318

Country of ref document: EP

Kind code of ref document: A1