WO2014036787A1 - Mail process method and system

Info

Publication number: WO2014036787A1
Application number: PCT/CN2012/085093
Authority: WO
Inventors: 林延中; 潘庆峰
Original assignee: 盈世信息科技(北京)有限公司
Priority date: 2012-09-07
Filing date: 2012-11-23
Publication date: 2014-03-13
Also published as: CN103684971A; CN103684971B

Abstract

Disclosed is a mail processing method in which a mail server obtains mail feature information according to the type of the mail and sends the feature information to a central server; the central server determines from the feature information whether the corresponding mail is spam and returns the result to the mail server. Correspondingly, the invention also provides a mail processing system. With embodiments of the invention, the mail server can upload the features of a mail awaiting checking in real time and immediately obtain the result returned by the central server, so the method is highly timely. The invention extracts different features according to the different formats of information within a mail and provides them to the central server for clustering and classification, in order to determine whether the mail is spam. In addition, the central server obtains a sufficient volume of data from the query requests sent by mail servers that process large numbers of mails, and performs clustering and classification on that data, which greatly improves its filtering effect.

Description

Mail processing method and system

[0001] The present invention relates to the field of communications, and in particular, to a mail processing method and system.

Background technique

[0002] With the development of communication technology, mail has become an important tool for everyday communication, but it has brought with it a huge volume of spam, which seriously interferes with users' normal use of mail. Existing anti-spam filtering devices all periodically download the filtering rule base from a central server and rely on these periodic updates to gain the ability to filter spam. This approach suffers from a time lag: between two updates, a new batch of new types of spam may be missed.

[0003] One solution to this timeliness problem is to forward mail to the central server for filtering. The disadvantage of this scheme is that it consumes a large amount of bandwidth, and the central server must handle forwarding requests from dozens or even hundreds of mail servers at the same time. The hardware requirements are therefore very high and may even call for a large number of servers.

Summary of the invention

[0004] The technical problem to be solved by the embodiments of the present invention is to provide a mail processing method and system that extract a small number of features that best represent a mail. These features enable the central server to cluster and classify mails, and the amount of data that needs to be transferred is very small. This greatly reduces the traffic between the mail server and the central server, allowing the mail server to handle very large volumes of mail while achieving a very high level of spam filtering.

[0005] In order to achieve the above technical effects, an embodiment of the present invention provides a mail processing method, including:

The mail server obtains the corresponding feature information of the mail according to different types of the mail, and sends the feature information to the central server;

The central server determines, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feeds back the result of the judgment to the mail server.

[0006] Further, the method further includes:

The central server further performs a clustering operation on the mail according to the mail feature information, and the clustering operation classifies the mails with similar mail feature information into one category.

[0007] Preferably, the step of the mail server acquiring the corresponding feature information of the mail according to different types of mails includes:

When the mail is text data, processing the text data according to the Nilsimsa algorithm to obtain 64-byte sequence feature information representative of the mail;

When the mail is picture data, extracting feature information of the picture according to a compression rate distribution characteristic of the picture in the mail;

When the mail is other data that is neither text nor a picture, calculating on the other data according to the MD5 algorithm to obtain 32-byte sequence feature information.

[0008] Preferably, the step of determining, by the central server, whether the email corresponding to the feature information is a spam email according to the feature information comprises:

Receiving feature information sent by the mail server;

Comparing the feature information with the spam feature information already held by the central server; when the similarity between the feature information and the spam feature information exceeds a preset criterion, determining that the mail corresponding to the feature information is spam; otherwise, determining that the mail corresponding to the feature information is normal mail.

[0009] Preferably, the step of the central server further performing a clustering operation on the mail according to the feature information includes:

Performing pairwise comparison on the mail feature information sent to the central server, and, when the similarity between two pieces of mail feature information exceeds a preset criterion, determining that the corresponding mails are similar mails;

The similar messages are classified into one category for storage.

[0010] Correspondingly, the present invention further provides a mail processing system, including:

a mail server, configured to receive the mail sent by the user, obtain the mail feature information of the mail, send the mail feature information to the central server, and perform corresponding operations according to the result of the judgment made by the central server on the basis of the mail feature information;

a central server, configured to receive the mail feature information sent by the mail server, determine, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feed back the result of the judgment to the mail server. The central server further performs a clustering operation on the mails according to the mail feature information; the clustering operation classifies mails with similar mail feature information into one category.

[0011] Preferably, the mail server comprises:

a text mail feature extraction unit, configured to process the text data according to the Nilsimsa algorithm when the mail is text data, to obtain 64-byte sequence feature information representative of the mail;

a picture mail feature extraction unit, configured to extract, when the mail is picture data, feature information of the picture according to a compression rate distribution characteristic of the picture in the mail;

an other-data mail feature extraction unit, configured to calculate on the other data according to the MD5 algorithm when the mail is other data that is neither text nor a picture, to obtain 32-byte sequence feature information.

[0012] Preferably, the central server comprises:

a receiving unit, configured to receive feature information sent by the mail server;

a determining unit, configured to compare the feature information with the spam feature information already held by the central server, to determine that the mail corresponding to the feature information is spam when the similarity between the feature information and the spam feature information exceeds a preset criterion, and otherwise to determine that the mail corresponding to the feature information is normal mail;

a sending unit, configured to send the result determined by the determining unit to the mail server.

[0013] Preferably, the central server further includes:

a comparing unit, configured to perform pairwise comparison on the mail feature information sent to the central server and, when the similarity between two pieces of mail feature information exceeds a preset criterion, to determine that the corresponding mails are similar mails;

a categorization storage unit, configured to classify the similar mails into one category for storage.

[0014] The implementation of the present invention has the following beneficial effects:

By implementing the invention, the mail server can upload the features of a mail to be checked in real time and immediately obtain the judgment result returned by the central server, which gives high timeliness. The present invention extracts different features based on information in different formats within the mail, and these features retain sufficient information for the central server to cluster and classify mails and determine whether a mail is spam. In addition, because the central server processes the query requests sent by a large number of mail servers, it obtains a sufficient volume of data for clustering and classification, which greatly enhances its filtering effect.

DRAWINGS

FIG. 1 is a schematic flow chart of a mail processing method according to the present invention;

FIG. 2 is a schematic flow chart of another mail processing method according to the present invention;

FIG. 3 is a schematic structural diagram of a mail processing system according to the present invention;

FIG. 4 is a schematic structural diagram of a mail server of the mail processing system of the present invention;

FIG. 5 is a schematic structural diagram of a central server of the mail processing system of the present invention;

FIG. 6 is another schematic structural diagram of a central server of the mail processing system of the present invention.

Detailed description

[0016] The present invention will be further described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic flow chart of a mail processing method according to an embodiment of the present invention.

[0018] 100. The mail server acquires corresponding feature information of the mail according to different types of mails, and sends the feature information to the central server.

[0019] The central server determines, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feeds back a result of the judgment to the mail server.

FIG. 2 is a schematic flow chart of another mail processing method according to an embodiment of the present invention.

[0021] 200. The mail server receives the mail sent by the user.

[0022] 201. When the mail is text data, the text data is processed according to the Nilsimsa algorithm, and 64-byte sequence feature information representative of the mail is obtained.

[0023] The Nilsimsa algorithm first splits the content of the mail into features by extracting each run of 4 adjacent bytes (4 bytes is an empirical value, chosen because a Chinese character generally occupies two bytes and a Chinese phrase generally contains two characters). For example, for the text "这是一个测试" ("This is a test"), the original features extracted are "这是", "是一", "一个", "个测", and "测试". If one character is inserted at random, so that the text becomes "这只是一个测试" ("This is just a test"), the original features extracted are "这只", "只是", "是一", "一个", "个测", and "测试". As the example shows, a small change to the original text affects only two of the extracted features ("这是" becomes "这只" and "只是"). Therefore, once the proportion of original features that two text sequences have in common is determined, the degree of similarity of the two text sequences is obtained indirectly. Next, each original feature is mapped to an integer by a mapping function. There is no particular requirement on the mapping function, as long as it maps a string to an integer; one of the simplest examples is to treat the binary value of the four bytes representing two Chinese characters as a four-byte integer. This integer is then taken modulo 512 and counted in a histogram of 512 buckets.

[0024] The histogram is then reduced to 0/1 values. First the average height of the histogram is calculated; each bucket above the average height is set to 1 and each bucket below it is set to 0. This yields a 512-bit (i.e., 64-byte) feature sequence.
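The feature extraction described in the two paragraphs above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the GBK encoding, and the 2-byte stride (so that windows stay aligned to two-byte characters, as in the example) are assumptions, and a bucket exactly equal to the average is treated as 0 because the text does not say how ties are handled.

```python
def text_feature(data: bytes, window: int = 4, stride: int = 2,
                 buckets: int = 512) -> bytes:
    """Reduce encoded text to a 512-bit (64-byte) feature sequence."""
    histogram = [0] * buckets
    # Slide a 4-byte window over the text; stride=2 assumes two bytes per character.
    for start in range(0, len(data) - window + 1, stride):
        chunk = data[start:start + window]
        # Simplest mapping named in the text: read the 4 bytes as one integer.
        value = int.from_bytes(chunk, "big")
        histogram[value % buckets] += 1
    # Binarise: buckets above the average height become 1, all others 0.
    mean = sum(histogram) / buckets
    bits = 0
    for count in histogram:
        bits = (bits << 1) | (1 if count > mean else 0)
    return bits.to_bytes(buckets // 8, "big")


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions on which two features differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = text_feature("这是一个测试".encode("gbk"))
edited = text_feature("这只是一个测试".encode("gbk"))
print(len(original), differing_bits(original, edited))
```

Because the two example texts share four of their windows, their features can differ in at most three bits, which is what makes the feature usable for similarity comparison.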

[0025] 202. When the mail is picture data, extracting feature information of the picture according to a compression rate distribution characteristic of the picture in the mail.

Specifically, the picture is scanned to obtain the compression ratio of each sub-block of the picture, and the compression ratios of every N consecutive sub-blocks are combined into a new compression rate change element, where N is a natural number greater than 1 that can be set as required. Each compression rate change element is combined with the code of its position in the picture to obtain the feature information of the picture.
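One possible reading of this step is sketched below. The text does not say how a sub-block's compression ratio is measured, how sub-blocks are delimited, or whether the groups of N overlap, so everything concrete here is an assumption: fixed-size blocks of raw bytes, `zlib` as a stand-in compressor, ratios quantised to a few levels, and non-overlapping groups. The name `picture_feature` and its parameters are hypothetical.

```python
import zlib
from typing import List, Tuple


def picture_feature(pixels: bytes, block_size: int = 256, n: int = 2,
                    levels: int = 8) -> List[Tuple[int, Tuple[int, ...]]]:
    """Pair each group of n quantised block compression ratios with its position."""
    ratios = []
    for start in range(0, len(pixels), block_size):
        block = pixels[start:start + block_size]
        # Compressed size over original size, capped at 1.0 (assumed measure).
        ratio = min(1.0, len(zlib.compress(block)) / len(block))
        # Quantise so that small differences in ratio map to the same level.
        ratios.append(min(levels - 1, int(ratio * levels)))
    feature = []
    # Combine every n consecutive sub-blocks into one change element and
    # attach the index of the group as its position code.
    for pos in range(0, len(ratios) - n + 1, n):
        feature.append((pos // n, tuple(ratios[pos:pos + n])))
    return feature


flat = bytes(1024)  # a uniform region compresses very well
print(picture_feature(flat))
```

A real implementation would more likely take the per-block figures from the image codec itself; the sketch only shows the grouping and position-coding structure.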

[0027] 203. When the mail is other data that is neither text nor a picture, calculate on the other data according to the MD5 algorithm to obtain 32-byte sequence feature information.
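For this step a short sketch with Python's standard `hashlib` module is enough. An MD5 digest is 128 bits long and is conventionally written as 32 hexadecimal characters; the sketch assumes that this hexadecimal form is what the text means by a 32-byte sequence. The function name `other_data_feature` is hypothetical.

```python
import hashlib


def other_data_feature(data: bytes) -> str:
    """Return the MD5 digest of the data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


print(other_data_feature(b"attachment bytes"))
```

Unlike the text and picture features, this fingerprint only matches data that is byte-for-byte identical, so it supports exact lookup rather than similarity comparison.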

[0028] It should be noted that there is no necessary order among steps 201, 202, and 203; only the step that matches the data type of the mail is executed.

[0029] 204. Send the mail feature information extracted by the mail server from the mail to the central server.

[0030] 205. Receive feature information sent by the mail server.

[0031] 206. Compare the feature information with the spam feature information already held by the central server. When the similarity between the feature information and the spam feature information exceeds a preset criterion, determine that the mail corresponding to the feature information is spam; otherwise, determine that it is normal mail.

[0032] It should be noted that honeypot mailboxes and user reports can be used to establish that a mail is spam. Whether an unknown sample is spam is then determined by checking whether the mail to be judged is similar to known spam. A honeypot mailbox is an email account that the operator registers itself and publishes on the Internet so that it is harvested by spammers. Since these accounts are not actually used, the mail sent to them is essentially all spam. If mails received by multiple honeypot mailboxes have similar features, those features can essentially be regarded as spam features.
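The comparison in step 206 can be illustrated for the 64-byte text features. The text does not name a similarity measure or a threshold, so the sketch assumes the fraction of matching bits as the measure and a hypothetical threshold of 0.95; `similarity` and `is_spam` are illustrative names.

```python
from typing import Iterable


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of bit positions on which two equal-length features agree."""
    if len(a) != len(b) or not a:
        raise ValueError("features must be non-empty and of equal length")
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))


def is_spam(feature: bytes, known_spam: Iterable[bytes],
            threshold: float = 0.95) -> bool:
    """Judge a mail as spam if it is similar enough to any known spam feature."""
    return any(similarity(feature, spam) > threshold for spam in known_spam)


known = [bytes(64)]               # e.g. a feature collected from honeypot mailboxes
near = bytes([0x01]) + bytes(63)  # differs from the known feature in one bit
far = bytes([0xFF]) * 64          # differs in every bit
print(is_spam(near, known), is_spam(far, known))  # prints: True False
```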

[0033] 207. The central server feeds back the result to the mail server.
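Steps 204 to 207 amount to a small query and reply exchange. The text does not define a wire format, so the JSON messages and the field names `mail_id`, `kind`, `feature`, and `verdict` below are purely hypothetical; the sketch only shows that the mail server sends a short feature rather than the mail itself and receives a verdict back.

```python
import json
from typing import Callable


def build_query(mail_id: str, kind: str, feature: bytes) -> str:
    """Mail server side: package the extracted feature for the central server."""
    return json.dumps({"mail_id": mail_id, "kind": kind, "feature": feature.hex()})


def handle_query(query: str, judge: Callable[[str, bytes], bool]) -> str:
    """Central server side: judge the feature and build the reply."""
    request = json.loads(query)
    feature = bytes.fromhex(request["feature"])
    verdict = "spam" if judge(request["kind"], feature) else "normal"
    return json.dumps({"mail_id": request["mail_id"], "verdict": verdict})


# A stand-in judge that flags one specific feature; a real one would compare
# the feature against known spam features.
reply = handle_query(build_query("m1", "text", b"\x01\x02"),
                     lambda kind, feature: feature == b"\x01\x02")
print(reply)
```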

[0034] 208. Perform pairwise comparison on the mail feature information sent to the central server. When the similarity between two pieces of mail feature information exceeds a preset criterion, determine that the corresponding mails are similar mails.

[0035] 209. Sort the similar mails into one category for storage.
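Steps 208 and 209 can be sketched as a pairwise comparison followed by grouping. The text only says that similar mails are placed in one category; merging categories transitively with a union-find structure, the bit-agreement similarity measure, and the 0.95 threshold are all assumptions of this sketch, as are the function names.

```python
from typing import Dict, List, Sequence


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of bit positions on which two equal-length features agree."""
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))


def cluster(features: Sequence[bytes], threshold: float = 0.95) -> List[List[int]]:
    """Group feature indices so that similar features share a category."""
    parent = list(range(len(features)))

    def find(i: int) -> int:
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 208: compare every pair of features.
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if similarity(features[i], features[j]) > threshold:
                parent[find(j)] = find(i)

    # Step 209: collect the members of each category for storage.
    groups: Dict[int, List[int]] = {}
    for i in range(len(features)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


samples = [bytes(64), bytes([0x01]) + bytes(63), bytes([0xFF]) * 64]
print(cluster(samples))  # prints: [[0, 1], [2]]
```

The pairwise loop is quadratic in the number of features, so a production system would need indexing or batching; the sketch is only meant to show the grouping logic.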

[0036] It should be noted that there is no necessary order between steps 208 and 209 and the other steps.

FIG. 3 is a schematic structural diagram of a mail processing system 1 according to an embodiment of the present invention, including:

The mail server 11 is configured to receive the mail sent by the user, obtain the mail feature information of the mail, send it to the central server 12, and perform corresponding operations according to the result of the judgment made by the central server 12 on the basis of the mail feature information;

The central server 12 is configured to receive the mail feature information sent by the mail server 11, determine, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feed back the result of the judgment to the mail server 11.

[0038] The central server 12 further performs a clustering operation on the mails according to the mail feature information; the clustering operation classifies mails with similar mail feature information into one category.

FIG. 4 is a schematic structural diagram of a mail server 11 in a mail processing system 1 according to an embodiment of the present invention, including:

The text mail feature extraction unit 111 is configured to process the text data according to the Nilsimsa algorithm when the mail is text data, to obtain 64-byte sequence feature information representative of the mail.

[0040] It should be noted that the Nilsimsa algorithm first splits the content of the mail into features by extracting each run of 4 adjacent bytes (4 bytes is an empirical value, chosen because a Chinese character generally occupies two bytes and a Chinese phrase generally contains two characters). For example, for the text "这是一个测试" ("This is a test"), the original features extracted are "这是", "是一", "一个", "个测", and "测试". If one character is inserted at random, so that the text becomes "这只是一个测试" ("This is just a test"), the original features extracted are "这只", "只是", "是一", "一个", "个测", and "测试". As the example shows, a small change to the original text affects only two of the extracted features ("这是" becomes "这只" and "只是"). Therefore, once the proportion of original features that two text sequences have in common is determined, the degree of similarity of the two text sequences is obtained indirectly. Next, each original feature is mapped to an integer by a mapping function. There is no particular requirement on the mapping function, as long as it maps a string to an integer; one of the simplest examples is to treat the binary value of the four bytes representing two Chinese characters as a four-byte integer. This integer is then taken modulo 512 and counted in a histogram of 512 buckets.

[0041] The histogram is then reduced to 0/1 values. First the average height of the histogram is calculated; each bucket above the average height is set to 1 and each bucket below it is set to 0. This yields a 512-bit (i.e., 64-byte) feature sequence.

[0042] The picture mail feature extraction unit 112 is configured to extract feature information of the picture according to a compression rate distribution characteristic of the picture in the mail when the mail is picture data.

[0043] Specifically, the picture is scanned to obtain the compression ratio of each sub-block of the picture, and the compression ratios of every N consecutive sub-blocks are combined into a new compression rate change element, where N is a natural number greater than 1 that can be set as required. Each compression rate change element is combined with the code of its position in the picture to obtain the feature information of the picture.

[0044] The other-data mail feature extraction unit 113 is configured to calculate on the other data according to the MD5 algorithm when the mail is other data that is neither text nor a picture, to obtain 32-byte sequence feature information.

[0045] The sending unit 114 is configured to send the extracted mail feature information to the central server 12.

FIG. 5 is a schematic structural diagram of a central server 12 in a mail processing system 1 according to an embodiment of the present invention, including:

The receiving unit 121 is configured to receive feature information sent by the mail server 11;

The determining unit 122 is configured to compare the feature information with the spam feature information already held by the central server 12, to determine that the mail corresponding to the feature information is spam when the similarity between the feature information and the spam feature information exceeds a preset criterion, and otherwise to determine that the mail corresponding to the feature information is normal mail;

The sending unit 123 is configured to send the result information determined by the determining unit 122 to the mail server.

FIG. 6 is another schematic structural diagram of the central server 12 in the mail processing system 1 according to the embodiment of the present invention. Unlike FIG. 5, the central server further includes:

The comparing unit 124 is configured to perform pairwise comparison on the mail feature information sent to the central server and, when the similarity between two pieces of mail feature information exceeds a preset criterion, to determine that the corresponding mails are similar mails;

The categorization storage unit 125 is configured to classify the similar mails into one category for storage.

[0048] By implementing the present invention, the mail server can upload the features of a mail to be checked in real time and immediately obtain the judgment result returned by the central server, which gives high timeliness. The present invention extracts different features based on information in different formats within the mail, and these features retain sufficient information for the central server to cluster and classify mails and determine whether a mail is spam. In addition, because the central server processes the query requests sent by a large number of mail servers, it obtains a sufficient volume of data for clustering and classification, which greatly enhances its filtering effect.

The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make a number of modifications and refinements without departing from the principles of the invention, and such modifications and refinements are also considered to be within the scope of the invention.

Claims

1. A mail processing method, comprising:

2. The mail processing method according to claim 1, further comprising:

3. The mail processing method according to claim 2, wherein the step of the mail server acquiring the corresponding feature information of the mail according to different types of mails comprises:

When the mail is text data, the text data is processed according to the Nilsimsa algorithm to obtain 64-byte sequence feature information representative of the mail;

When the mail is picture data, extracting feature information of the picture according to a compression rate distribution characteristic of the picture in the mail;

4. The mail processing method according to claim 3, wherein the step of determining, by the central server, whether the mail corresponding to the feature information is spam according to the feature information comprises:

Receiving feature information sent by the mail server;

5. The mail processing method according to claim 4, wherein the step of the central server performing the clustering operation on the mail according to the feature information comprises:

The similar messages are classified into one category for storage.

6. A mail processing system, comprising:

a mail server, configured to receive the mail sent by the user, obtain the mail feature information of the mail, send the mail feature information to the central server, and perform corresponding operations according to the result of the judgment made by the central server on the basis of the mail feature information;

a central server, configured to receive the mail feature information sent by the mail server, determine, according to the mail feature information, whether the mail corresponding to the mail feature information is spam, and feed back the result of the judgment to the mail server. The central server further performs a clustering operation on the mails according to the mail feature information; the clustering operation classifies mails with similar mail feature information into one category.

7. The mail processing system according to claim 6, wherein the mail server comprises:

8. The mail processing system according to claim 7, wherein the central server comprises:

a receiving unit, configured to receive feature information sent by the mail server;

a determining unit, configured to compare the feature information with the spam feature information already held by the central server, to determine that the mail corresponding to the feature information is spam when the similarity between the feature information and the spam feature information exceeds a preset criterion, and otherwise to determine that the mail corresponding to the feature information is normal mail;

9. The mail processing system according to claim 8, wherein the central server further comprises: