CN112272139A

CN112272139A - Junk mail intercepting method and system

Info

Publication number: CN112272139A
Application number: CN202011229960.7A
Authority: CN
Inventors: 张嘉子; 王金恒
Original assignee: Guangzhou Institute of Technology
Current assignee: Guangzhou Institute of Technology
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2021-01-26

Abstract

The invention provides a method and a system for intercepting junk mail. The method comprises the following steps: 1, extracting the user name of a mail to be received, comparing the user name against a blacklist and a whitelist, and judging the mail to be a junk mail, a normal mail or a primary processed mail; 2, extracting key information of the primary processed mail, judging whether the key information matches a preset filtering condition, and judging the mail to be a junk mail or a preprocessed mail; and 3, sending the preprocessed mail to a Bayesian filter for analysis to determine whether it is a junk mail or a normal mail. With this method, junk mail can be identified accurately and the safety of e-mail is improved. The invention also provides a junk mail intercepting system comprising a receiving unit, a first processing unit, a white list, a black list, a second processing unit, a Bayesian filter, a mail identification unit, a third processing unit and a filtering unit.

Description

Junk mail intercepting method and system

Technical Field

The invention relates to the technical field of internet, in particular to a method and a system for intercepting junk mails.

Background

Spam e-mail is e-mail that the recipient has not consented to and does not want, sent for the purpose of forced advertising or promotion. Spam exists because existing e-mail systems place no restrictions on senders: a sender can send an unlimited amount of mail under a false sender address, so some people use e-mail as a medium for forced advertising or promotion and profit from it. Spam interferes with normal use of the e-mail system, and work efficiency suffers because useful messages must be picked out of a large volume of junk. Anti-spam technologies, i.e. technologies aimed at suppressing spam e-mail, have been developed in response.

The basic prior-art method is to let the user set filtering rules. When the features of an incoming mail match the feature description of a filtering rule, the corresponding filtering operation (such as rejecting, receiving or placing in a review area) is performed automatically, so that the e-mail system filters mail according to the rules as it arrives and reduces the number of useless mails the user has to process. Black and white list filtering is one such technique: the e-mail addresses of reliable users are placed on a white list and those of unreliable users on a black list. Mail whose sender address is on the white list is received, and mail whose sender address is on the black list is filtered out.

The main problem with this filtering technology is that an ideal filtering rule is often difficult to set, that is, a rule that filters out only spam or selects only useful mail. Either spam is received because the rule is too loose, or useful mail is rejected because it is too strict. People also tend to use simple filtering strategies for fear of rejecting important mail. The black and white list technique is one of the simple and effective anti-spam technologies, and its key point lies in the sources of the blacklist and whitelist data. Most mail systems can only maintain black and white lists at the IP level, and some systems can also do so at the user and domain-name level, but the entries are mainly set by users through a web settings page. Users need to know in advance who should be placed on the black list, or a user adds a correspondent to his or her own white list only after finding that mail exchanged with that correspondent has been mistakenly blocked by the anti-spam system. This way of acquiring data is inefficient, not intelligent and slow to react, and user-defined black and white lists are not generally applicable.

Chinese invention patent publication No. CN103220213B provides a mail filtering method and device, in which a mail transmission agent receives a mail; the mail transmission agent extracts key information from the mail; the mail transmission agent judges whether the key information matches a preset filtering condition; if so, the mail transmission agent intercepts the mail, otherwise it transmits the mail; the mail transmission agent submits the intercepted mail for examination and receives the examination result; the mail transmission agent judges from the examination result whether the examination is passed; if so, it continues to intercept the intercepted mail, otherwise it transmits the intercepted mail. That invention uses one or more items of key information in the mail, i.e. any one or more of the attachment type, the subject, the text, the domain name, the number of recipients and the sender/recipient information, to intercept mail containing specific key information, thereby improving the safety of e-mail.

However, that invention depends on preset filtering conditions and has no training or learning ability. Some advertising mails can easily bypass the preset filtering conditions, so the accuracy of junk mail recognition needs further improvement.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method and a system for intercepting junk mails, and the specific technical scheme of the invention is as follows:

a junk mail intercepting method comprises the following steps:

step 1, extracting a user name of a mail to be received, comparing and judging the user name of the mail to be received with a blacklist list and a white list, if the user name of the mail to be received is in the blacklist list, judging the mail to be received as a junk mail and intercepting the junk mail, if the user name of the mail to be received is in the white list, judging the mail to be received as a normal mail and receiving the normal mail, and if the user name of the mail to be received is not in the blacklist list or the white list, judging the mail to be received as a primary processed mail;

step 2, extracting key information of the primary processed mail, judging whether the key information is matched with a preset filtering condition, if so, extracting a user name of the primary processed mail, storing the user name of the primary processed mail into a blacklist list, and intercepting the primary processed mail, otherwise, judging the primary processed mail as the pre-processed mail;

step 3, sending the preprocessed mail to a Bayesian filter for analysis and judgment to determine whether the preprocessed mail is a junk mail, if so, extracting a user name of the preprocessed mail, storing the user name of the preprocessed mail into a blacklist, and intercepting the preprocessed mail, otherwise, determining the preprocessed mail as a normal mail and receiving the normal mail;

the key information comprises any one or more combined information of attachment type, subject, text, domain name and sender.
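The three steps above can be read as a short decision cascade. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `Verdict`, `classify`, `matches_filter` and `bayes_is_spam` are hypothetical, and the step 2 and step 3 checks are passed in as callables so that the ordering of the steps stays visible.

```python
from enum import Enum
from typing import Callable, Set


class Verdict(Enum):
    SPAM = "spam"
    NORMAL = "normal"


def classify(
    sender: str,
    blacklist: Set[str],
    whitelist: Set[str],
    matches_filter: Callable[[], bool],
    bayes_is_spam: Callable[[], bool],
) -> Verdict:
    """Run steps 1 to 3 in order; a later step runs only if the earlier ones are undecided."""
    # Step 1: exact lookup of the sender's user name in the two lists.
    if sender in blacklist:
        return Verdict.SPAM
    if sender in whitelist:
        return Verdict.NORMAL
    # Step 2: the mail is now a "primary processed mail"; test the preset filter conditions.
    if matches_filter():
        blacklist.add(sender)
        return Verdict.SPAM
    # Step 3: the mail is now a "preprocessed mail"; defer to the Bayesian filter.
    if bayes_is_spam():
        blacklist.add(sender)
        return Verdict.SPAM
    return Verdict.NORMAL
```

In this sketch a sender caught by step 2 or step 3 is added to the blacklist, so the next mail from the same sender is already decided in step 1.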

Optionally, in step 1, the user may add entries to or delete entries from the blacklist and the whitelist.

Optionally, in step 2, the user may add or delete the preset filtering conditions.

Optionally, in step 1, when the user name of the mail to be received is compared with the blacklist and the whitelist, an exact matching method is used.

A spam interception system comprising:

a receiving unit for storing mails to be received;

the first processing unit is used for extracting the user name of the mail to be received and comparing and judging the user name of the mail to be received with the white list and the black list;

a white list storing a user name list of trusted mails;

a blacklist list, which stores a user name list of an untrusted mail;

the second processing unit is used for extracting the key information of the primary processed mail and matching the key information with a preset filtering condition;

the Bayesian filter is used for calculating the probability that the preprocessed mails are junk mails;

the mail identification unit is used for judging whether the preprocessed mail is the junk mail according to the probability that the preprocessed mail is the junk mail;

and the third processing unit is used for extracting the user name of the preprocessed mail and storing the user name of the preprocessed mail into the blacklist.

Optionally, the spam intercepting system further includes a fourth processing unit, where the fourth processing unit is configured to extract a username of the normal email, and store the username of the normal email in a white list.

Optionally, the spam intercepting system further includes a filtering unit, and the filtering unit stores a filtering condition defined by a user.

Optionally, the spam intercepting system further includes a saving unit, and the saving unit is configured to store the normal email.

The beneficial effects obtained by the invention comprise:

1. intercepting the mails containing specific key information by using one or some key information in the mails, namely any one or more of the attachment type, the subject, the text, the domain name, the number of the receivers and the sender/receiver information, so that the safety requirements of different users on the mails can be met, and the safety of the mails is improved;

2. after the key information of the mails is used for intercepting the mails, the preprocessed mails are analyzed and judged by the Bayesian filter, so that the accuracy of recognizing the junk mails can be further improved.

Drawings

The invention will be further understood from the following description taken in conjunction with the accompanying drawings, which emphasize the principles of the embodiments.

FIG. 1 is a schematic flow chart of a first embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to embodiments thereof; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Other systems, methods, and/or features of the present embodiments will become apparent to those skilled in the art upon review of the following detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims. Additional features of the disclosed embodiments are described in, and will be apparent from, the detailed description that follows.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by the terms "upper", "lower", "left", "right", etc. based on the orientation or positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but it is not intended to indicate or imply that the device or component referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limiting the present patent, and the specific meaning of the terms described above will be understood by those of ordinary skill in the art according to the specific circumstances.

The invention relates to a method and a system for intercepting junk mail. The following embodiments are described with reference to FIGS. 1-3:

Example one:

Referring to FIG. 1, a junk mail intercepting method comprises the following steps:

In step 1, when the user name of the mail to be received is compared with the blacklist and the whitelist, an exact matching method may be used for the preliminary judgment. If the user name is identical to an entry in the blacklist, the mail is judged to be a junk mail; if it is identical to an entry in the whitelist, the mail is judged to be a normal mail. If the user name matches no entry in either the blacklist or the whitelist exactly, the mail to be received is judged to be a primary processed mail.

For example, suppose the user name extracted from the mail to be received is Received. If the blacklist contains the entry Received, the user name of the mail to be received matches an entry in the blacklist exactly. If the blacklist contains only Receiveable or Receiving, the user name does not match the blacklist. The same reasoning applies when the user name of the mail to be received is compared with the white list.
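The exact-matching rule can be illustrated with a few lines of Python. This is a sketch only: the helper names `extract_user_name` and `exact_match` are hypothetical, the address `Received@example.com` is an assumed placeholder, and taking the part before the `@` as the user name is an assumption about how the user name is extracted.

```python
def extract_user_name(address: str) -> str:
    """Return the part of an address before the '@' (the whole string if there is none)."""
    return address.split("@", 1)[0]


def exact_match(user_name: str, entries: set) -> bool:
    """True only if user_name is identical to one entry; prefixes and variants do not count."""
    return user_name in entries


user = extract_user_name("Received@example.com")   # assumed placeholder address
print(exact_match(user, {"Received"}))                   # identical entry
print(exact_match(user, {"Receiveable", "Receiving"}))   # similar entries only
```

The first check succeeds and the second fails, mirroring the example in the text: entries that merely resemble the user name are not treated as matches.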

Here, entries can be freely added to and deleted from the blacklist and the whitelist, whose contents include but are not limited to mail user names. One or more entries can be moved from the black list to the white list, or from the white list to the black list, by dragging.

In step 2, key information of the primary processed mail is extracted. The key information includes but is not limited to any one or any combination of attachment type, subject, text, domain name and sender. For example, the sender of the mail or the type of the mail attachment may be used as key information, to filter mail from a particular sender or mail whose attachment is of a certain type. Similarly, the sender and the attachment type can be used together as key information, to filter mail that both comes from a specific sender and has an attachment of a certain type.

The preset filtering conditions differ according to the key information. If the key information is the sender of the mail or the type of the mail attachment, the preset filtering condition will likewise be a sender or an attachment type. When the extracted key information is matched against the preset filtering conditions, a fuzzy matching method or an approximate matching method can be used.

When the key information of the primary processed mail is matched with the preset filtering condition, the user name of the primary processed mail, namely the sender, is extracted and stored in the blacklist so as to accurately intercept the mail sent by the sender afterwards.

Since the blacklist includes, but is not limited to, the user name of the mail, if the key information of the primary processed mail matches with the preset filtering condition, the subject and/or the domain name of the primary processed mail can also be added to the blacklist.
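The source does not specify which fuzzy or approximate matching method is used in step 2. As one possible illustration, the sketch below treats a condition as matched when it occurs as a case-insensitive substring of the field, or when the similarity ratio from the standard library's `difflib.SequenceMatcher` reaches a cut-off. The function names, the field names and the 0.8 cut-off are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def fuzzy_hit(value: str, pattern: str, min_ratio: float = 0.8) -> bool:
    """Case-insensitive match: substring containment, or similarity ratio >= min_ratio."""
    v, p = value.lower(), pattern.lower()
    if p in v:
        return True
    return SequenceMatcher(None, v, p).ratio() >= min_ratio


def matches_filter(key_info: Dict[str, str], conditions: Dict[str, List[str]]) -> bool:
    """True if any extracted field fuzzily matches any condition configured for that field."""
    for field, patterns in conditions.items():
        value = key_info.get(field, "")
        if value and any(fuzzy_hit(value, p) for p in patterns):
            return True
    return False


# Hypothetical key information and filter conditions.
mail = {"subject": "Big SALE this week", "attachment_type": "pdf"}
rules = {"subject": ["sale"], "attachment_type": ["exe"]}
print(matches_filter(mail, rules))
```

When `matches_filter` returns true, the method as described would store the sender's user name in the blacklist and intercept the mail.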

Example two:

a junk mail intercepting method comprises the following steps:

step 3, sending the preprocessed mail to a Bayesian filter for analysis to determine whether the preprocessed mail is a junk mail; if so, extracting the user name of the preprocessed mail, storing it in the blacklist, and intercepting the preprocessed mail; otherwise, determining the preprocessed mail to be a normal mail and receiving it. Here, a threshold may be set for the Bayesian filter, and the preprocessed mail is regarded as a junk mail only if the spam probability computed by the Bayesian filter exceeds the preset threshold.

Corresponding to the method for intercepting spam email described in this embodiment, the present invention further provides a system for intercepting spam email.

Referring to FIG. 2, a spam intercepting system according to the present invention includes a receiving unit, a first processing unit, a white list, a black list, a second processing unit, a Bayesian filter, a mail identification unit, a third processing unit, and a filtering unit.

The receiving unit is used for storing the mail to be received, and can be a buffer area for receiving the externally sent mail through a TCP/IP protocol.

The first processing unit is used for extracting the user name of the mail to be received and comparing it with the white list and the black list. The first processing unit performs an exact-match comparison between the user name of the mail to be received and the entries of the white list and the black list, and makes a preliminary judgment on the mail. If the user name of the mail to be received exactly matches an entry in the black list, the mail is judged to be a junk mail; if it exactly matches an entry in the white list, the mail is judged to be a normal mail.

If the user name of the mail to be received exactly matches no entry in either the black list or the white list, the mail is judged to be a primary processed mail.

The white list is the list of senders that the user trusts. The black list is the list of senders that the user does not trust.

An interface exists between the white list and the black list. Through this one interface, the user is free to move information in the white list into the black list or to move information in the black list into the white list.
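If the two lists are modelled as sets, moving an entry between them is a small operation. The sketch below is illustrative only; `move_entry` is a hypothetical name, and the source does not say how the lists are stored.

```python
def move_entry(entry: str, source: set, target: set) -> bool:
    """Move entry from source to target; return False if it was not in source."""
    if entry not in source:
        return False
    source.remove(entry)
    target.add(entry)
    return True


whitelist = {"alice"}
blacklist = {"mallory"}
move_entry("mallory", blacklist, whitelist)  # the user decides to trust this sender
print(sorted(whitelist), sorted(blacklist))
```

Removing the entry from the source list before adding it to the target keeps the two lists disjoint, so step 1 can never find the same user name in both.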

The second processing unit is used for extracting the key information of the primary processed mail and matching the key information with a preset filtering condition.

Whether the primary processed mail is a junk mail requires further analysis. Key information of the primary processed mail, such as any one or any combination of attachment type, text content, subject and sender, is extracted, and a fuzzy matching analysis is performed between the extracted key information and the preset filtering conditions.

The preset filtering conditions are stored in the filtering unit, and can be added and deleted by user definition. In the filtering unit, a classification list is set, for example, the attachment type, the text content, the subject and the sender are classified into four categories, and then the user adds and deletes the filtering content of each category according to the requirement.
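One way to picture the filtering unit's classification list is a mapping from category to a set of user-defined conditions. The following is a minimal sketch; the class name `FilterUnit`, the category identifiers and the method names are assumptions made for illustration.

```python
from typing import Dict, Set

# The four example categories named in the text, as hypothetical identifiers.
CATEGORIES = ("attachment_type", "text", "subject", "sender")


class FilterUnit:
    """Holds user-defined filter conditions, grouped by category."""

    def __init__(self) -> None:
        self.conditions: Dict[str, Set[str]] = {c: set() for c in CATEGORIES}

    def add(self, category: str, value: str) -> None:
        """Add a condition to a known category; unknown categories are rejected."""
        if category not in self.conditions:
            raise KeyError(f"unknown category: {category}")
        self.conditions[category].add(value)

    def delete(self, category: str, value: str) -> None:
        """Remove a condition; deleting a value that is absent is a no-op."""
        self.conditions[category].discard(value)


unit = FilterUnit()
unit.add("attachment_type", "exe")
unit.add("subject", "lottery")
unit.delete("subject", "lottery")
print(unit.conditions)
```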

The user configures which types of key information the second processing unit extracts. The unit then compares the extracted information with the filtering conditions of the corresponding types to judge whether the primary processed mail is a junk mail.

The Bayesian filter identifies mail using a Bayesian algorithm. The mail content and/or the mail subject is segmented into words; for each word in the segmentation result, the probability that a mail containing that word is a junk mail is obtained from a sample library; finally, these per-word probabilities are substituted into the Bayesian formula to calculate the spam probability of the mail. If the spam probability is greater than a preset threshold, the mail is marked as a junk mail. The threshold may be set according to the specific requirements of the e-mail system; for example, with a threshold of 0.9, a mail whose spam probability is greater than 0.9 is marked as a junk mail.

The process of calculating the probability of spam by using a Bayesian filter is described in detail as follows, and comprises the following steps:

dividing the subject and/or the content of the mail into a plurality of words by a word segmentation technique; for each word, finding in the sample library the number of times it appears in junk mails (BadCount) and the number of times it appears in non-junk mails (GoodCount), and obtaining the number of junk mail samples (BadEmailCount) and the number of normal mail samples (GoodEmailCount) in the sample library.
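The counting step can be sketched as follows. Two assumptions are made here that the source does not state: words are obtained with a simple regular expression suited to Latin-script text (Chinese text would need a dedicated word segmenter), and a word is counted once per sample mail, so that BadCount/BadEmailCount is the fraction of junk samples containing the word. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Very simple word segmentation: runs of lowercase letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_words(samples: Iterable[str]) -> Tuple[Counter, int]:
    """Return (number of sample mails containing each word, number of sample mails)."""
    counts: Counter = Counter()
    total = 0
    for mail in samples:
        total += 1
        counts.update(set(tokenize(mail)))  # count each word once per mail
    return counts, total


# Hypothetical junk-mail samples: yields BadCount per word and BadEmailCount.
bad_counts, bad_email_count = count_words(["free money free", "free offer"])
print(bad_counts["free"], bad_email_count)
```

Running the same function over the normal-mail samples would give GoodCount and GoodEmailCount.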

Suppose that event A is "the mail is a junk mail", and that $t_1, t_2, \ldots, t_n$ is the word segmentation result of the mail.

$P(A \mid t_i)$ represents the probability that a mail is a junk mail when the word $t_i$ occurs in the mail.

$P(A \mid t_i)$ may also be referred to as the spam probability of the word $t_i$. Clearly,

$$P(A \mid t_i) = \frac{\mathrm{BadCount}/\mathrm{BadEmailCount}}{\mathrm{GoodCount}/\mathrm{GoodEmailCount} + \mathrm{BadCount}/\mathrm{BadEmailCount}}$$
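The per-word formula translates directly into code. In the sketch below the function name is hypothetical, both sample libraries are assumed to be non-empty, and returning 0.5 for a word seen in neither library is an added assumption (the source does not say how unseen words are handled).

```python
def word_spam_probability(
    bad_count: int, bad_email_count: int, good_count: int, good_email_count: int
) -> float:
    """P(A|ti) = (BadCount/BadEmailCount) / (GoodCount/GoodEmailCount + BadCount/BadEmailCount)."""
    bad_rate = bad_count / bad_email_count
    good_rate = good_count / good_email_count
    if bad_rate + good_rate == 0:
        return 0.5  # assumption: a word seen in neither library is treated as neutral
    return bad_rate / (good_rate + bad_rate)


# Hypothetical counts: the word occurs in 30 of 100 junk samples and 5 of 200 normal samples.
print(word_spam_probability(30, 100, 5, 200))
```

With these hypothetical counts the rates are 0.3 and 0.025, so the word's spam probability is 0.3 / 0.325, roughly 0.92.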

Suppose that the probabilities that a mail is a junk mail when the words $t_1, t_2, \ldots, t_r$ occur are $P_1, P_2, \ldots, P_r$ respectively.

$P(A \mid t_1, t_2, \ldots, t_r)$ represents the probability that a mail is a junk mail when the words $t_1, t_2, \ldots, t_r$ occur in the mail at the same time.

According to the Bayesian formula, the spam probability of the mail is calculated as:

$$P(A \mid t_1, t_2, \ldots, t_r) = \frac{P_1 P_2 \cdots P_r}{P_1 P_2 \cdots P_r + (1 - P_1)(1 - P_2) \cdots (1 - P_r)}$$
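The combined probability can be computed with two products. This is a sketch under assumptions: the function name is hypothetical, `math.prod` requires Python 3.8 or later, and returning 0.5 when the denominator is zero is an added safeguard that the source does not describe.

```python
from math import prod
from typing import Sequence


def combined_spam_probability(probs: Sequence[float]) -> float:
    """Combine per-word probabilities P1..Pr into P(A | t1, ..., tr)."""
    spam = prod(probs)                   # P1 * P2 * ... * Pr
    ham = prod(1.0 - p for p in probs)   # (1 - P1) * (1 - P2) * ... * (1 - Pr)
    if spam + ham == 0:
        return 0.5  # assumption: contradictory evidence is treated as undecided
    return spam / (spam + ham)


# Hypothetical per-word probabilities for a three-word mail.
print(combined_spam_probability([0.9, 0.8, 0.3]))
```

For the hypothetical values 0.9, 0.8 and 0.3, the numerator is 0.216 and the second product is 0.1 × 0.2 × 0.7 = 0.014, giving 0.216 / 0.230, roughly 0.94.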

In this embodiment, the bayesian filter is used to calculate the probability P that the preprocessed email is spam. And the mail identification unit is used for judging whether the preprocessed mail is the junk mail or not according to the junk mail probability P.

The mail identification unit can set a threshold and identifies whether the preprocessed mail is a junk mail by comparing the spam probability P with the threshold. For example, if the threshold is 0.9, the preprocessed mail is a junk mail when its spam probability is greater than or equal to 0.9.

When the mail identification unit identifies the preprocessed mail as a junk mail, the third processing unit extracts the user name of the preprocessed mail and stores the user name in the blacklist.

Once the user name of the preprocessed mail is extracted and stored in the blacklist, the mail can be directly judged as the junk mail when the mail of the same sender is received next time. Therefore, the recognition accuracy of the system to the junk mails can be improved, and the working efficiency is improved.
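The interplay of the mail identification unit and the third processing unit can be sketched in one function. The name `identify_and_record` is hypothetical, and the comparison uses "greater than or equal to" as in the 0.9 example given for the mail identification unit.

```python
def identify_and_record(
    sender: str, spam_probability: float, blacklist: set, threshold: float = 0.9
) -> bool:
    """Return True, and store the sender in the blacklist, when the probability reaches the threshold."""
    is_spam = spam_probability >= threshold
    if is_spam:
        blacklist.add(sender)  # role of the third processing unit
    return is_spam


blacklist = set()
identify_and_record("promo", 0.95, blacklist)   # hypothetical sender and probability
print("promo" in blacklist)
```

After this call, a later mail from the same sender is caught by the exact blacklist lookup and never reaches the Bayesian filter.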

Example three:

Referring to FIG. 3, the spam intercepting system of the present invention includes a receiving unit, a first processing unit, a white list, a black list, a second processing unit, a Bayesian filter, a mail identification unit, a third processing unit, a filtering unit, a fourth processing unit, and a saving unit.

The fourth processing unit is used for extracting the user name of the normal mail and storing it in the white list, which improves the system's recognition accuracy for normal mail and its working efficiency.
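The fourth processing unit mirrors the third one on the white-list side. The sketch below is illustrative; `record_normal_sender` is a hypothetical name, and refusing to whitelist a sender who is already blacklisted is an added assumption that keeps the two lists disjoint.

```python
def record_normal_sender(sender: str, whitelist: set, blacklist: set) -> bool:
    """Store the sender of a normal mail in the whitelist unless the sender is blacklisted."""
    if sender in blacklist:
        return False  # assumption: an explicit blacklist entry takes precedence
    whitelist.add(sender)
    return True


whitelist, blacklist = set(), {"mallory"}
record_normal_sender("alice", whitelist, blacklist)
print(sorted(whitelist))
```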

The saving unit is used for storing normal mails.

In summary, the method and system for intercepting spam disclosed by the present invention have the following beneficial technical effects:

Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications may be made without departing from the scope of the invention. That is, the methods, systems, and devices discussed above are examples, and various configurations may omit, replace, or add various processes or components as appropriate. For example, in alternative configurations, the methods may be performed in an order different than that described and/or various components may be added, omitted, and/or combined. Moreover, features described with respect to certain configurations may be combined in various other configurations, as different aspects and elements of the configurations may be combined in a similar manner. Further, elements therein may be updated as technology evolves, i.e., many of the elements are examples and do not limit the scope of the disclosure or claims.

Specific details are given in the description to provide a thorough understanding of the exemplary configurations including implementations. However, configurations may be practiced without these specific details, such as well-known circuits, processes, algorithms, structures, and techniques, which have been shown without unnecessary detail in order to avoid obscuring the configurations. This description provides example configurations only, and does not limit the scope, applicability, or configuration of the claims. Rather, the foregoing description of the configurations will provide those skilled in the art with an enabling description for implementing the described techniques. Various changes may be made in the function and arrangement of elements without departing from the spirit or scope of the disclosure.

It is intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of this invention. The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims

1. A junk mail intercepting method is characterized by comprising the following steps:

2. A spam interception method according to claim 1, characterized in that in step 1, entries can be added to and deleted from said blacklist and whitelist by the user.

3. A spam intercepting method according to claim 2, wherein in step 2, the user can add and delete the preset filtering conditions.

4. A spam intercepting method according to claim 3, wherein in step 1, when comparing the user name of the mail to be received with the blacklist and whitelist, an exact matching method is used.

5. A spam interception system, comprising:

a receiving unit for storing mails to be received;

a white list storing a user name list of trusted mails;

a blacklist list, which stores a user name list of an untrusted mail;

the second processing unit is used for extracting key information of the primarily processed mail and matching the key information with a preset filtering condition;

the mail identification unit is used for judging whether the preprocessed mails are junk mails or not according to the probability that the preprocessed mails are junk mails;

6. A spam interception system according to claim 5, further comprising a fourth processing unit for extracting the username of a normal e-mail and storing the username of the normal e-mail in a white list.

7. A spam interception system according to claim 6, further comprising a filtering unit storing user-defined filtering conditions.

8. A spam interception system according to claim 7, further comprising a saving unit for storing normal mails.