CN114629872A - Junk mail filtering method, device, system and storage medium

Info

Publication number: CN114629872A
Application number: CN202011463258.7A
Authority: CN
Inventors: 李天明
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-12-11
Filing date: 2020-12-11
Publication date: 2022-06-14

Abstract

The application discloses a junk mail filtering method comprising the following steps: reading the mail content; judging whether the mail content contains a picture; if yes, extracting a feature vector from the picture; judging whether the feature vector exists in a pre-stored vector set; and if it exists, marking the mail as junk mail and interrupting the operation. The method can effectively identify picture-only junk mails that contain no text, and reduces the number of junk mails received by a user.

Description

Junk mail filtering method, device, system and storage medium

Technical Field

The present application relates to the field of internet communications technologies, and in particular, to a method, an apparatus, a system, and a storage medium for filtering spam.

Background

Email is a means of exchanging information electronically and is the most widely used service on the internet. Through a networked e-mail system, a user can contact other network users in any corner of the world very quickly (a message can reach any specified destination within a few seconds) and at very low cost (only the network fee is charged, regardless of the destination).

Spam often appears in e-mail, for example advertisement mails for commercial promotion, phishing mails for stealing user account information, or reactionary mails spreading reactionary information. Such mail seriously threatens the shareability, interactivity and openness of network resources and degrades the experience of e-mail users.

Compared with ordinary junk mails that carry junk content as text, a junk mail maker can adopt a more concealed form: the mail has no text content, or its text is unrelated to the junk content, and only an attached picture carries the reactionary, pornographic, fraudulent or similar content without any text. A mail system based on text filtering cannot identify such junk mails, while the recipient can still recognize the information.

Therefore, designing a spam filtering method which can effectively identify spam and reduce the quantity of spam received by a user is a problem to be solved by technical personnel in the field.

Disclosure of Invention

In order to solve the technical problem, the application provides a spam filtering method which can effectively identify spam and reduce the quantity of spam received by a user.

The technical scheme provided by the application is as follows:

a junk mail filtering method comprises the following steps:

reading the mail content;

judging whether the mail content contains a picture or not;

if yes, extracting a feature vector in the picture;

judging whether the feature vector exists in a pre-stored vector set;

if it exists, marking the mail as junk mail, and interrupting the operation.

Preferably, before the reading of the mail content, the method further includes:

reading pictures in the existing junk mails;

intercepting an area image containing junk information in the picture;

extracting a feature vector of the region image;

and storing the characteristic vector into the vector set.

Preferably, after the determining whether the mail content includes a picture, the method further includes:

if not, text classification is carried out on the words of the mail content to form the content word group;

judging whether the content phrases contain sensitive phrases or not according to a filtering rule;

if yes, marking the mail as a junk mail, and interrupting the operation;

if not, the mail is marked as normal mail.

Preferably, before the reading of the mail content, the method further includes:

reading a mail title;

performing text classification on the titles to form a title phrase;

judging whether the title phrases contain sensitive phrases or not according to a filtering rule;

if yes, marking the mail as a junk mail, and interrupting the operation;

if not, the next step is carried out.

Preferably, the step of marking the mail as junk mail and interrupting the operation specifically comprises:

if yes, adding one to the record value of the sending mailbox;

and marking the mail as junk mail and interrupting the operation.

Preferably, before reading the title and content of the mail, the method further includes:

reading the mail sending mailbox of the mail;

judging whether the recorded value of the sending mailbox is greater than a frequency threshold value or not;

if the judgment result is yes, the mail is marked as a junk mail, and the operation is interrupted;

if the judgment result is negative, the next step is carried out.

A spam filtering device comprising:

the reading module is used for reading the mail content;

the first judging module is used for judging whether the mail content contains pictures or not;

the extraction module is used for extracting the feature vectors in the pictures;

the second judgment module is used for judging whether the characteristic vector exists in a pre-stored vector set or not;

and the marking module is used for marking the mail as the junk mail.

Preferably, the device further comprises:

the reading module is also used for reading pictures in the existing junk mails;

the intercepting module is used for intercepting an area image containing junk information in the picture;

the processing module is used for extracting the feature vector of the region image;

and the storage module is used for storing the characteristic vectors extracted by the processing module into the vector set.

A spam filtering system comprising a spam filtering device as claimed in any preceding claim, further comprising a server for updating said vector set.

A storage medium storing a computer program which, when executed, implements a spam filtering method as in any preceding claim.

The junk mail filtering method provided by the invention has the advantages that the mail content is read, the characteristic vector extracted from the picture is compared with the pre-stored vector set, and if the characteristic vector exists in the vector set, the mail is marked as the junk mail, so that the filtering of the pure picture junk mail without characters is realized. The method can effectively identify the junk mails, reduce the quantity of the junk mails received by users, and solve the problem that the junk mails seriously threaten the shareability, the interactivity and the openness of network resources and influence the experience of the users in using the e-mails.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a spam filtering method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a spam filtering apparatus according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be understood that the structures, proportions, sizes and the like shown in the drawings only accompany the disclosure of the specification so that those skilled in the art can read and understand it; they do not limit the conditions under which the present application can be practised and therefore have no substantive technical significance. Any modification of structure, change of proportion or adjustment of size that does not affect the efficacy and achievable purpose of the present application still falls within the scope of the technical content disclosed in the present application.

Embodiments of the present invention are written in a progressive manner.

The embodiment discloses a spam filtering method, as shown in fig. 1, including the following steps:

S1, reading the mail content;

S2, judging whether the mail content contains a picture;

S3, if yes, extracting a feature vector from the picture;

S4, judging whether the feature vector exists in a pre-stored vector set;

S5, if it exists, marking the mail as junk mail, and interrupting the operation.

After the mail content is read in step S1, step S2 judges whether the mail content contains a picture. If it does, step S3 extracts a feature vector from the picture, and step S4 searches the vector set for the extracted feature vector to judge whether a match exists. If a match exists, step S5 marks the mail as junk mail, which completes the filtering of picture junk mail.
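The flow of steps S1 to S5 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Mail` class, the `Picture` and `FeatureVector` types, and the `extract` callable are all hypothetical names introduced here, and the application does not specify how feature vectors are computed.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical types for illustration only; the application does not define them.
Picture = Tuple[Tuple[int, ...], ...]   # rows of grayscale pixel values
FeatureVector = Tuple[int, ...]


@dataclass
class Mail:
    body: str
    pictures: List[Picture] = field(default_factory=list)
    is_junk: bool = False


def filter_picture_mail(
    mail: Mail,
    extract: Callable[[Picture], Iterable[FeatureVector]],
    vector_set: Set[FeatureVector],
) -> bool:
    """Run steps S1 to S5; return True if the mail was marked as junk mail."""
    pictures = mail.pictures              # S1: read the mail content
    if not pictures:                      # S2: no picture, nothing to match here
        return False
    for picture in pictures:
        for vector in extract(picture):   # S3: extract feature vectors
            if vector in vector_set:      # S4: look the vector up in the set
                mail.is_junk = True       # S5: mark as junk mail and
                return True               #     interrupt the operation
    return False
```

Storing the vectors in a set makes step S4 an average constant-time lookup, but only exact matches count in this sketch; any tolerance for near matches would be a further assumption beyond the text.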

According to the junk mail filtering method provided by the embodiment of the invention, the mail content is read, the characteristic vector extracted from the picture is compared with the pre-stored vector set, and if the characteristic vector exists in the vector set, the mail is marked as a junk mail, so that the filtering of the pure picture junk mail without characters is realized. The method can effectively identify the junk mails, reduce the number of the junk mails received by the user, and solve the problem that the junk mails seriously threaten the sharing, the interactivity and the openness of network resources and influence the experience of the user in using the emails.

Preferably, before reading the mail content, the method further comprises:

reading pictures in the existing junk mails;

intercepting an area image containing junk information in the picture;

extracting a feature vector of the regional image;

and storing the feature vectors into a vector set.

In practical use, a feature vector library of junk pictures, i.e. the vector set, must first be formed. From a picture that has previously been determined to be a junk picture, a sub-region (e.g. the trademark pattern of an advertising company or another representative pattern) is intercepted; this is the region image containing junk information. The feature vector of the intercepted region image is then extracted and stored in a data area, i.e. the vector set. The vector set thus contains the feature vectors extracted from the intercepted regions of all junk pictures, and any picture containing a feature vector in the vector set is treated as a junk picture.
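Building the vector set might look like the sketch below. The application does not say which feature extractor is used, so `extract_feature_vector` here is a deliberately simple stand-in (a coarse block-brightness signature) chosen only to make the example runnable; `crop` and `add_spam_region` are likewise hypothetical helper names, and pictures are assumed to be rectangular grids of grayscale values.

```python
from typing import List, Set, Tuple

Picture = List[List[int]]          # rectangular grid of grayscale values, 0..255
FeatureVector = Tuple[int, ...]


def crop(picture: Picture, top: int, left: int, height: int, width: int) -> Picture:
    """Intercept the region image that contains the junk information."""
    return [row[left:left + width] for row in picture[top:top + height]]


def extract_feature_vector(region: Picture, grid: int = 2) -> FeatureVector:
    """Stand-in extractor (an assumption): one bit per block of a grid x grid
    partition, set when the block is brighter than the average block."""
    height = len(region)
    width = len(region[0]) if region else 0
    if height < grid or width < grid:
        raise ValueError("region is smaller than the grid")
    block_means = []
    for by in range(grid):
        rows = region[by * height // grid:(by + 1) * height // grid]
        for bx in range(grid):
            cells = [v for row in rows
                     for v in row[bx * width // grid:(bx + 1) * width // grid]]
            block_means.append(sum(cells) / len(cells))
    overall = sum(block_means) / len(block_means)
    return tuple(1 if mean > overall else 0 for mean in block_means)


def add_spam_region(vector_set: Set[FeatureVector], spam_picture: Picture,
                    top: int, left: int, height: int, width: int) -> FeatureVector:
    """Crop a known junk picture, extract the vector, and store it in the set."""
    vector = extract_feature_vector(crop(spam_picture, top, left, height, width))
    vector_set.add(vector)
    return vector
```

A production system would presumably use a more robust descriptor; the point of the sketch is only the order of operations: intercept the region, extract its vector, then store it.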

Preferably, after determining whether the mail content includes the picture, the method further includes:

if not, text classification is carried out on the words of the mail content to form content word groups;

judging whether the content phrases contain sensitive phrases or not according to the filtering rules;

if yes, marking the mail as a junk mail, and interrupting the operation;

if not, the mail is marked as normal mail.

For mails that do not contain pictures, text classification is carried out first; whether the mail contains sensitive phrases is then judged according to the filtering rule, and the mail is marked as junk mail or normal mail according to the result.
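A minimal sketch of this text branch is given below. It reads the "text classification" step as splitting the text into phrases, which is an interpretation, and it models the filtering rule as a plain set of sensitive phrases. The function names and the tokenising regular expression are assumptions made for the example.

```python
import re
from typing import Iterable, List, Set


def to_phrases(text: str) -> List[str]:
    """Split text into lowercase word tokens (a simple stand-in for the
    step that forms content phrases)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_sensitive_phrase(phrases: Iterable[str], sensitive: Set[str]) -> bool:
    """Filtering rule assumed here: any phrase found in the sensitive set."""
    return any(phrase in sensitive for phrase in phrases)


def classify_text_mail(body: str, sensitive: Set[str]) -> str:
    """Return "junk" or "normal" for a mail that contains no picture."""
    phrases = to_phrases(body)
    return "junk" if contains_sensitive_phrase(phrases, sensitive) else "normal"
```

For Chinese-language mail the tokeniser would have to be replaced by a proper word segmenter; the regular expression above only handles ASCII letters and digits.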

Preferably, before reading the mail content, the method further comprises:

reading a mail title;

carrying out text classification on the titles to form a title phrase;

judging whether the title phrases contain sensitive phrases or not according to the filtering rules;

if yes, marking the mail as a junk mail, and interrupting the operation;

if not, the next step is carried out.

The title is judged first, and the content is judged only when the title contains no sensitive phrases, which effectively improves efficiency. Compared with judging the content directly, the title in most cases contains fewer words than the content, so classifying and judging it takes less time. When the title contains a sensitive phrase, the operation is interrupted and the content no longer needs to be judged, which avoids unnecessary processing time and excessive use of system resources.
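The short-circuit described above can be illustrated as follows. In this sketch the content is supplied as a zero-argument callable so that it is only read when the title passes; that design, the function names, and the reduction of the "next step" to a text-only check are all assumptions made to keep the example small.

```python
import re
from typing import Callable, Set


def phrases(text: str) -> Set[str]:
    """Lowercase word tokens of the text (stand-in for phrase formation)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_title_then_content(
    title: str,
    read_content: Callable[[], str],
    sensitive: Set[str],
) -> str:
    """Judge the title first; read and judge the content only if needed."""
    if phrases(title) & sensitive:
        return "junk"                      # interrupt: content is never read
    if phrases(read_content()) & sensitive:
        return "junk"
    return "normal"
```

Passing the content lazily makes the saving explicit: when the title already contains a sensitive phrase, `read_content` is never called.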

Preferably, the step of marking the mail as junk mail and interrupting the operation specifically comprises:

if yes, adding one to the record value of the sending mailbox;

and marking the mail as junk mail and interrupting the operation.

It should be noted that, whether the sensitive phrase is found in text separated from a picture or in the text of the mail content itself, once the mail is marked as junk mail the record value of the corresponding sending mailbox is increased by one. That is, the count of junk mails sent by that mailbox increases, which can be understood as placing the sending mailbox on a blacklist.
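One way to keep the record value is a per-sender counter, as in the sketch below. The use of `collections.Counter`, the `mark_as_junk` name, and the list of junk mail identifiers are illustrative assumptions; the application does not say how the record value is stored.

```python
from collections import Counter
from typing import List


def mark_as_junk(mail_id: str, sender: str,
                 record_values: Counter, junk_mail_ids: List[str]) -> int:
    """Add one to the sender's record value, then mark the mail as junk.

    Returns the sender's updated record value.
    """
    record_values[sender] += 1          # missing senders start at zero
    junk_mail_ids.append(mail_id)
    return record_values[sender]
```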

Preferably, before reading the title and content of the mail, the method further comprises:

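reading the sending mailbox of the mail;

judging whether the record value of the sending mailbox is greater than a frequency threshold;

if the judgment result is yes, marking the mail as junk mail, and interrupting the operation;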

if the judgment result is negative, the next step is carried out.

By judging whether the record value of the sending mailbox exceeds the preset frequency threshold, junk mails can be filtered directly on the basis of the sending mailbox. If the record value is greater than the frequency threshold, all mails sent by that mailbox are treated as junk mails; the title and content need not be read, and subsequent operations such as text classification and sensitive-phrase judgment are skipped, which further improves processing efficiency.
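The sender pre-check could be sketched as below. The comparison is strictly "greater than", as in the text; the `prefilter_by_sender` name, the mapping of senders to record values, and the `next_step` callable standing for the title and content checks are assumptions for illustration.

```python
from typing import Callable, Mapping


def prefilter_by_sender(
    sender: str,
    record_values: Mapping[str, int],
    threshold: int,
    next_step: Callable[[], str],
) -> str:
    """Mark the mail as junk outright if the sender's record value is
    greater than the threshold; otherwise carry on with the next step."""
    if record_values.get(sender, 0) > threshold:
        return "junk"                   # title and content are never read
    return next_step()
```

A sender whose record value equals the threshold still goes through the full check in this sketch; only a value above it triggers the shortcut.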

A spam filtering device, as shown in fig. 2, comprising:

the reading module 1 is used for reading mail contents;

the first judging module 2 is used for judging whether the mail content contains pictures or not;

the extraction module 3 is used for extracting the feature vectors in the pictures;

the second judging module 4 is used for judging whether the feature vectors exist in a pre-stored vector set or not;

and the marking module 5 is used for marking the mails as junk mails.

The modules of the spam filtering device can carry out steps S1 to S5 of the filtering method; the specific data acquisition, processing and output processes are not described again here.

Preferably, as shown in Fig. 2, the device further includes:

the reading module 1 is also used for reading pictures in the existing junk mails;

the intercepting module 6 is used for intercepting an area image containing spam information in the picture;

the processing module 7 is used for extracting a feature vector of the region image;

and the storage module 8 is used for storing the characteristic vectors extracted by the processing module into a vector set.

A spam filtering system comprising a spam filtering device as described in any of the above, and further comprising a server for updating said vector set, can achieve the same technical effect.

A storage medium storing a computer program which, when executed, implements the spam filtering method as described above, and achieves the same technical effects.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the modules is only one logical functional division, and other division manners may be implemented in practice, such as: multiple modules or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be electrical, mechanical or other.

In addition, all functional modules in the embodiments of the present invention may be integrated into one processor, or each module may be separately used as one device, or two or more modules may be integrated into one device; each functional module in each embodiment of the present invention may be implemented in a form of hardware, or may be implemented in a form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by program instructions and related hardware, where the program instructions may be stored in a computer-readable storage medium, and when executed, the program instructions perform the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A spam filtering method is characterized by comprising the following steps:

reading the mail content;

judging whether the mail content contains pictures or not;

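if yes, extracting a feature vector in the picture;

judging whether the feature vector exists in a pre-stored vector set;

if it exists, marking the mail as junk mail, and interrupting the operation.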

2. The spam filtering method of claim 1, further comprising, prior to said reading mail content:

reading pictures in the existing junk mails;

intercepting an area image containing junk information in the picture;

extracting a feature vector of the region image;

and storing the characteristic vector into the vector set.

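3. The spam filtering method according to claim 1, further comprising, after said judging whether the mail content contains a picture:

if not, performing text classification on the words of the mail content to form content phrases;

judging whether the content phrases contain sensitive phrases according to a filtering rule;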

if yes, marking the mail as a junk mail, and interrupting the operation;

if not, the mail is marked as normal mail.

4. The spam filtering method of claim 1, further comprising, prior to said reading mail content:

reading a mail title;

performing text classification on the title to form title phrases;

judging whether the title phrases contain sensitive phrases according to a filtering rule;

if yes, marking the mail as a junk mail, and interrupting the operation;

if not, the next step is carried out.

5. The spam filtering method according to claim 1, wherein the step of marking the mail as junk mail and interrupting the operation specifically comprises:

if yes, adding one to the record value of the sending mailbox;

and marking the mail as junk mail and interrupting the operation.

6. The spam filtering method according to claim 5, further comprising, before said reading the title and content of the mail:

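reading the sending mailbox of the mail;

judging whether the record value of the sending mailbox is greater than a frequency threshold;

if the judgment result is yes, marking the mail as junk mail, and interrupting the operation;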

if the judgment result is negative, the next step is carried out.

7. A spam filtering device, comprising:

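the reading module is used for reading the mail content;

the first judging module is used for judging whether the mail content contains a picture;

the extraction module is used for extracting the feature vector in the picture;

the second judging module is used for judging whether the feature vector exists in a pre-stored vector set;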

and the marking module is used for marking the mail as the junk mail.

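8. The spam filtering device of claim 7, further comprising:

the reading module is also used for reading pictures in existing junk mails;

the intercepting module is used for intercepting a region image containing junk information in the picture;

the processing module is used for extracting the feature vector of the region image;

and the storage module is used for storing the feature vector extracted by the processing module into the vector set.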

9. A spam filtering system comprising a spam filtering device according to any of claims 7 to 8, further comprising a server for updating said vector set.

10. A storage medium storing a computer program, wherein the computer program, when executed, implements the spam filtering method of any of claims 1-6.