CN114629872A - Junk mail filtering method, device, system and storage medium - Google Patents

Junk mail filtering method, device, system and storage medium

Info

Publication number
CN114629872A
Authority
CN
China
Prior art keywords
mail
junk
reading
content
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011463258.7A
Other languages
Chinese (zh)
Inventor
李天明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202011463258.7A
Publication of CN114629872A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies

Abstract

The application discloses a junk mail filtering method comprising the following steps: reading the mail content; judging whether the mail content contains a picture; if so, extracting a feature vector from the picture; judging whether the feature vector exists in a pre-stored vector set; and, if it exists, marking the mail as junk mail and interrupting the operation. The method can effectively identify junk mails whose pictures contain no text, and reduces the number of junk mails received by a user.

Description

Junk mail filtering method, device, system and storage medium
Technical Field
The present application relates to the field of internet communications technologies, and in particular, to a method, an apparatus, a system, and a storage medium for filtering spam.
Background
Email is a communication method that exchanges information by electronic means, and is the most widely used service on the internet. Through a networked e-mail system, a user can contact network users in any corner of the world very quickly (a message can reach any specified destination within a few seconds) and at very low cost (only the network fee is charged, regardless of the destination).
Spam often appears in email, for example advertisement mails for commercial promotions, phishing mails that steal user account information, or mails that spread reactionary content. Spam seriously threatens the shareability, interactivity and openness of network resources, and degrades the user's experience of email.
Compared with ordinary spam that carries text content, a spammer may adopt a more covert form: the mail body contains no text, or only text unrelated to the spam content, and only the attached picture, which itself contains no text, carries reactionary, pornographic, fraudulent or similar content. A mail system based on text filtering cannot identify such spam, yet the recipient can still recognize the information.
Therefore, designing a spam filtering method that can effectively identify spam and reduce the quantity of spam received by a user is a problem to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the technical problem, the application provides a spam filtering method which can effectively identify spam and reduce the quantity of spam received by a user.
The technical scheme provided by the application is as follows:
a junk mail filtering method comprises the following steps:
reading the mail content;
judging whether the mail content contains a picture or not;
if yes, extracting a feature vector from the picture;
judging whether the feature vector exists in a pre-stored vector set or not;
if it exists, marking the mail as junk mail and interrupting the operation.
Preferably, before the reading of the mail content, the method further includes:
reading pictures in the existing junk mails;
intercepting an area image containing junk information in the picture;
extracting a feature vector of the region image;
and storing the characteristic vector into the vector set.
Preferably, after the determining whether the mail content includes a picture, the method further includes:
if not, text classification is carried out on the words of the mail content to form the content word group;
judging whether the content phrases contain sensitive phrases or not according to a filtering rule;
if yes, marking the mail as a junk mail, and interrupting the operation;
if not, the mail is marked as normal mail.
Preferably, before the reading of the mail content, the method further includes:
reading a mail title;
performing text classification on the titles to form a title phrase;
judging whether the title phrases contain sensitive phrases or not according to a filtering rule;
if yes, marking the mail as a junk mail, and interrupting the operation;
if not, the next step is carried out.
Preferably, the step of "if yes, marking the mail as a junk mail, and interrupting the operation" specifically comprises:
if yes, adding one to the record value of the sending mailbox;
and marking the mail as junk mail and interrupting the operation.
Preferably, before reading the title and content of the mail, the method further includes:
reading the mail sending mailbox of the mail;
judging whether the recorded value of the sending mailbox is greater than a frequency threshold value or not;
if the judgment result is yes, the mail is marked as a junk mail, and the operation is interrupted;
if the judgment result is negative, the next step is carried out.
A spam filtering device comprising:
the reading module is used for reading the mail content;
the first judging module is used for judging whether the mail content contains pictures or not;
the extraction module is used for extracting the feature vectors in the pictures;
the second judgment module is used for judging whether the characteristic vector exists in a pre-stored vector set or not;
and the marking module is used for marking the mail as the junk mail.
Preferably, the device further comprises:
the reading module is also used for reading pictures in the existing junk mails;
the intercepting module is used for intercepting an area image containing junk information in the picture;
the processing module is used for extracting the feature vector of the region image;
and the storage module is used for storing the characteristic vectors extracted by the processing module into the vector set.
A spam filtering system comprising a spam filtering device as claimed in any preceding claim, further comprising a server for updating said vector set.
A storage medium storing a computer program which, when executed, implements a spam filtering method as in any preceding claim.
The junk mail filtering method provided by the invention has the advantages that the mail content is read, the characteristic vector extracted from the picture is compared with the pre-stored vector set, and if the characteristic vector exists in the vector set, the mail is marked as the junk mail, so that the filtering of the pure picture junk mail without characters is realized. The method can effectively identify the junk mails, reduce the quantity of the junk mails received by users, and solve the problem that the junk mails seriously threaten the shareability, the interactivity and the openness of network resources and influence the experience of the users in using the e-mails.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a spam filtering method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a spam filtering apparatus according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that the structures, proportions, sizes and the like shown in the drawings are only intended to accompany the disclosure of the specification so that those skilled in the art can understand and read it. They are not intended to limit the conditions under which the present application can be implemented, and therefore have no substantive technical significance. Any modification of structure, change of proportion or adjustment of size that does not affect the effects and purposes achievable by the present application still falls within the scope of the technical content disclosed herein.
Embodiments of the present invention are written in a progressive manner.
The embodiment discloses a spam filtering method, as shown in fig. 1, including the following steps:
S1, reading the mail content;
S2, judging whether the mail content contains a picture or not;
S3, if yes, extracting a feature vector from the picture;
S4, judging whether the feature vector exists in a pre-stored vector set or not;
S5, if it exists, marking the mail as junk mail and interrupting the operation.
After the mail content is read in step S1, step S2 judges whether the mail content contains a picture. If it does, step S3 extracts a feature vector from the picture, and step S4 searches the vector set for the extracted feature vector to determine whether a match exists. If a match exists, step S5 marks the mail as junk mail, which completes the filtering of picture junk mail.
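Steps S1 to S5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the `Mail` dataclass, the `extract_feature_vector` helper (here a coarse 4-bin brightness histogram) and the exact-match set lookup are all hypothetical, since the application does not specify which features are extracted or how matching is performed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Picture = List[List[int]]        # hypothetical: grayscale pixels in 0..255
FeatureVector = Tuple[int, ...]  # a tuple is hashable, so it can be stored in a set


@dataclass
class Mail:
    body: str = ""
    pictures: List[Picture] = field(default_factory=list)
    label: Optional[str] = None


def extract_feature_vector(picture: Picture, bins: int = 4) -> FeatureVector:
    """Assumed feature: number of pixels falling into each brightness bin."""
    counts = [0] * bins
    for row in picture:
        for px in row:
            counts[min(px * bins // 256, bins - 1)] += 1
    return tuple(counts)


def filter_by_picture(mail: Mail, vector_set: Set[FeatureVector]) -> bool:
    """Return True if the mail was marked as junk, otherwise False."""
    # S1/S2: read the mail content and check whether it contains pictures.
    for picture in mail.pictures:
        vec = extract_feature_vector(picture)  # S3: extract the feature vector
        if vec in vector_set:                  # S4: look it up in the vector set
            mail.label = "junk"                # S5: mark and interrupt
            return True
    return False


# Example with one known junk picture and one unrelated picture.
spam_picture = [[0, 0], [255, 255]]
known_vectors = {extract_feature_vector(spam_picture)}
junk_mail = Mail(pictures=[spam_picture])
clean_mail = Mail(pictures=[[[128, 128], [128, 128]]])
junk_result = filter_by_picture(junk_mail, known_vectors)
clean_result = filter_by_picture(clean_mail, known_vectors)
```

An exact-match lookup is the simplest reading of "exists in the vector set"; a practical system would probably need a similarity threshold instead, which the application leaves open.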
According to the junk mail filtering method provided by the embodiment of the invention, the mail content is read, the characteristic vector extracted from the picture is compared with the pre-stored vector set, and if the characteristic vector exists in the vector set, the mail is marked as a junk mail, so that the filtering of the pure picture junk mail without characters is realized. The method can effectively identify the junk mails, reduce the number of the junk mails received by the user, and solve the problem that the junk mails seriously threaten the sharing, the interactivity and the openness of network resources and influence the experience of the user in using the emails.
Preferably, before reading the mail content, the method further comprises:
reading pictures in the existing junk mails;
intercepting an area image containing junk information in the picture;
extracting a feature vector of the regional image;
and storing the feature vectors into a vector set.
In practical use, a feature vector library for junk pictures, i.e. the vector set, must first be built. From a picture that has previously been judged to be a junk picture, a sub-region (for example the trademark of an advertising company or another representative mark) is cut out; this is the region image containing junk information. A feature vector is then extracted from the region image and stored in a data area, i.e. the vector set. The vector set thus contains the feature vectors extracted from the cropped regions of all known junk pictures, and any picture that contains a feature vector in the vector set is treated as a junk picture.
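Building the vector set can be sketched as below. The names `crop_region` and `add_to_vector_set`, the `(top, left, height, width)` box format and the histogram feature are assumptions made for illustration; the application only states that a region image is cut out, its feature vector is extracted, and the vector is stored.

```python
from typing import List, Set, Tuple

Picture = List[List[int]]        # hypothetical: grayscale pixels in 0..255
FeatureVector = Tuple[int, ...]
Box = Tuple[int, int, int, int]  # assumed format: (top, left, height, width)


def crop_region(picture: Picture, box: Box) -> Picture:
    """Cut out the sub-region (e.g. a logo) that carries the junk information."""
    top, left, height, width = box
    return [row[left:left + width] for row in picture[top:top + height]]


def extract_feature_vector(region: Picture, bins: int = 4) -> FeatureVector:
    """Assumed feature: number of pixels falling into each brightness bin."""
    counts = [0] * bins
    for row in region:
        for px in row:
            counts[min(px * bins // 256, bins - 1)] += 1
    return tuple(counts)


def add_to_vector_set(vector_set: Set[FeatureVector],
                      junk_picture: Picture, box: Box) -> FeatureVector:
    """Crop the region, extract its feature vector, and store it."""
    vec = extract_feature_vector(crop_region(junk_picture, box))
    vector_set.add(vec)
    return vec


# A 3x3 junk picture whose bright lower-right 2x2 block is the "logo".
junk_picture = [
    [10, 10, 10],
    [10, 200, 250],
    [10, 220, 240],
]
vector_set: Set[FeatureVector] = set()
stored = add_to_vector_set(vector_set, junk_picture, (1, 1, 2, 2))
```

Note that a global histogram of a whole incoming picture would not generally equal the histogram of a cropped region, so detecting that a picture "contains" a stored vector would in practice call for local features or a sliding-window search. That design choice is outside what the application describes.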
Preferably, after determining whether the mail content includes the picture, the method further includes:
if not, text classification is carried out on the words of the mail content to form content word groups;
judging whether the content phrases contain sensitive phrases or not according to the filtering rules;
if yes, marking the mail as a junk mail, and interrupting the operation;
if not, the mail is marked as normal mail.
After text classification is carried out on a mail that does not contain a picture, whether the mail contains sensitive phrases is judged according to the filtering rule, and the mail is marked as junk mail or normal mail according to the result.
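The text branch can be sketched as follows. The tokenizer `to_phrases` is only an assumed stand-in for the text-processing step (the application does not describe it), and the filtering rule is reduced here to membership in a hypothetical `SENSITIVE` set.

```python
import re
from typing import List, Set


def to_phrases(text: str) -> List[str]:
    """Assumed stand-in for phrase formation: lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_sensitive(phrases: List[str], sensitive: Set[str]) -> bool:
    """Assumed filtering rule: any phrase appears in the sensitive set."""
    return any(phrase in sensitive for phrase in phrases)


def classify_text_mail(body: str, sensitive: Set[str]) -> str:
    """Mark a picture-free mail as 'junk' or 'normal'."""
    if contains_sensitive(to_phrases(body), sensitive):
        return "junk"
    return "normal"


SENSITIVE = {"lottery", "prize"}  # hypothetical sensitive phrases
junk_label = classify_text_mail("You won the LOTTERY!", SENSITIVE)
normal_label = classify_text_mail("Meeting moved to noon", SENSITIVE)
```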
Preferably, before reading the mail content, the method further comprises:
reading a mail title;
carrying out text classification on the titles to form a title phrase;
judging whether the title phrases contain sensitive phrases or not according to the filtering rules;
if yes, marking the mail as a junk mail, and interrupting the operation;
if not, the next step is carried out.
Judging the title first, and judging the content only when the title contains no sensitive phrase, can effectively improve efficiency. In most cases the title has fewer words than the content, so classifying and judging it takes less time. When the title contains a sensitive phrase, the operation is interrupted and the content is not judged at all, which avoids unnecessary processing time and excessive use of system resources.
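The short-circuit from title to content can be illustrated as below. To make the saving visible, the content is passed as a callable (`read_content`) that is only invoked when the title is clean; this lazy-read interface, the helper names and the word-level rule are assumptions for the sketch, and only the text path is shown.

```python
import re
from typing import Callable, List, Set


def has_sensitive(text: str, sensitive: Set[str]) -> bool:
    """Assumed rule: any lowercase word token is in the sensitive set."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return any(word in sensitive for word in words)


def filter_title_then_content(title: str,
                              read_content: Callable[[], str],
                              sensitive: Set[str]) -> str:
    # Judge the (usually short) title first.
    if has_sensitive(title, sensitive):
        return "junk"  # interrupted: the content is never read
    # Only a clean title leads to the more expensive content check.
    if has_sensitive(read_content(), sensitive):
        return "junk"
    return "normal"


reads: List[str] = []  # records how often the content was actually read


def read_content() -> str:
    reads.append("read")
    return "Claim your prize today"


SENSITIVE = {"prize"}  # hypothetical sensitive phrase
first = filter_title_then_content("Free prize inside", read_content, SENSITIVE)
reads_after_first = len(reads)
second = filter_title_then_content("Hello", read_content, SENSITIVE)
```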
Preferably, the step of "if yes, marking the mail as a junk mail, and interrupting the operation" specifically comprises:
if yes, adding one to the record value of the sending mailbox;
and marking the mail as junk mail and interrupting the operation.
It should be noted that whenever a mail is marked as junk mail, whether because of its picture or because its title or content text contains a sensitive phrase, the record value of the sending mailbox corresponding to the mail is increased by one. That is, the count of junk mails sent by that mailbox increases, which can be understood as moving the sending mailbox towards a blacklist.
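A minimal sketch of the record value, assuming it is kept as a per-sender counter; the `mark_as_junk` helper and the dictionary representation of a mail are hypothetical.

```python
from collections import Counter
from typing import Dict

# Record value per sending mailbox: number of junk mails seen from it.
record_values: Counter = Counter()


def mark_as_junk(mail: Dict[str, str], records: Counter) -> None:
    """Increment the sender's record value, then mark the mail as junk."""
    records[mail["sender"]] += 1
    mail["label"] = "junk"


first_mail = {"sender": "promo@example.com"}
second_mail = {"sender": "promo@example.com"}
mark_as_junk(first_mail, record_values)
mark_as_junk(second_mail, record_values)
```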
Preferably, before reading the title and content of the mail, the method further comprises:
reading a mail sending mailbox of the mail;
judging whether the recorded value of the sending mailbox is greater than a frequency threshold value or not;
if the judgment result is yes, the mail is marked as a junk mail, and the operation is interrupted;
if the judgment result is negative, the next step is carried out.
By judging whether the record value of the sending mailbox exceeds the preset frequency threshold, junk mail can be filtered directly at the sender level. If the record value is greater than the frequency threshold, every mail sent by that mailbox is treated as junk mail; the title and content need not be read, and subsequent operations such as text classification and sensitive-phrase judgment are skipped, which further improves processing efficiency.
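The sender pre-check might look like the sketch below. The threshold value of 3 and the function name `sender_exceeds_threshold` are assumptions; the application only requires that the record value be strictly greater than a frequency threshold.

```python
from collections import Counter

FREQUENCY_THRESHOLD = 3  # assumed value; the application does not fix one


def sender_exceeds_threshold(sender: str, records: Counter,
                             threshold: int = FREQUENCY_THRESHOLD) -> bool:
    """True if mails from this sender should be marked as junk unread."""
    return records[sender] > threshold


records = Counter({"spam@example.com": 4, "edge@example.com": 3})
blocked = sender_exceeds_threshold("spam@example.com", records)
at_threshold = sender_exceeds_threshold("edge@example.com", records)
unknown = sender_exceeds_threshold("new@example.com", records)
```

Because the comparison is strict, a mailbox whose record value equals the threshold still passes on to the title and content checks.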
A spam filtering device, as shown in fig. 2, comprising:
the reading module 1 is used for reading mail contents;
the first judging module 2 is used for judging whether the mail content contains pictures or not;
the extraction module 3 is used for extracting the feature vectors in the pictures;
the second judging module 4 is used for judging whether the feature vectors exist in a pre-stored vector set or not;
and the marking module 5 is used for marking the mails as junk mails.
The modules of the spam filtering device carry out steps S1 to S5 of the filtering method; the specific data acquisition, processing and output processes are not described again here.
Preferably, as shown in fig. 2, the device further includes:
the reading module 1 is also used for reading pictures in the existing junk mails;
the intercepting module 6 is used for intercepting an area image containing spam information in the picture;
the processing module 7 is used for extracting a feature vector of the region image;
and the storage module 8 is used for storing the characteristic vectors extracted by the processing module into a vector set.
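One possible way to wire the modules together is sketched below. The class `SpamFilterDevice`, its method names, and the toy `brightness_range` feature are hypothetical; the mapping of fields to module numbers in the comments is only an interpretation of fig. 2.

```python
from dataclasses import dataclass, field
from typing import Callable, Hashable, List, Sequence, Set

Picture = Sequence[int]  # hypothetical: flattened grayscale pixels


def whole_picture(picture: Picture) -> Picture:
    """Default intercepting behaviour: treat the whole picture as the region."""
    return picture


@dataclass
class SpamFilterDevice:
    extract: Callable[[Picture], Hashable]                   # modules 3 and 7
    intercept: Callable[[Picture], Picture] = whole_picture  # module 6
    vector_set: Set[Hashable] = field(default_factory=set)   # filled by module 8

    def learn_from_junk(self, picture: Picture) -> None:
        """Modules 1, 6, 7, 8: read a known junk picture and store its vector."""
        self.vector_set.add(self.extract(self.intercept(picture)))

    def is_junk(self, pictures: List[Picture]) -> bool:
        """Modules 1 to 5: True if any picture's vector is in the vector set."""
        return any(self.extract(p) in self.vector_set for p in pictures)


def brightness_range(picture: Picture) -> Hashable:
    """Toy feature for the example: (darkest, brightest) pixel."""
    return (min(picture), max(picture))


device = SpamFilterDevice(extract=brightness_range)
device.learn_from_junk([200, 250])
match = device.is_junk([[250, 220, 200]])
no_match = device.is_junk([[0, 10]])
```

Passing the extraction and intercepting steps in as callables mirrors the statement that the division into modules is only one logical division and may be realized differently.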
A spam filtering system comprises a spam filtering device as described in any of the above and further comprises a server for updating the vector set; it can achieve the same technical effect.
A storage medium stores a computer program which, when executed, implements the spam filtering method described above and achieves the same technical effect.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the modules is only one logical functional division, and other division manners may be implemented in practice, such as: multiple modules or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be electrical, mechanical or other.
In addition, all functional modules in the embodiments of the present invention may be integrated into one processor, or each module may be separately used as one device, or two or more modules may be integrated into one device; each functional module in each embodiment of the present invention may be implemented in a form of hardware, or may be implemented in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by program instructions and related hardware, where the program instructions may be stored in a computer-readable storage medium, and when executed, the program instructions perform the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A spam filtering method is characterized by comprising the following steps:
reading the mail content;
judging whether the mail content contains pictures or not;
if yes, extracting a feature vector from the picture;
judging whether the feature vector exists in a pre-stored vector set or not;
if it exists, marking the mail as junk mail and interrupting the operation.
2. The spam filtering method of claim 1, further comprising, prior to said reading mail content:
reading pictures in the existing junk mails;
intercepting an area image containing junk information in the picture;
extracting a feature vector of the region image;
and storing the characteristic vector into the vector set.
3. The spam filtering method according to claim 1, wherein after said determining whether the mail content contains a picture, further comprising:
if not, text classification is carried out on the words of the mail content to form the content word group;
judging whether the content phrases contain sensitive phrases or not according to a filtering rule;
if yes, marking the mail as a junk mail, and interrupting the operation;
if not, the mail is marked as normal mail.
4. The spam filtering method of claim 1, further comprising, prior to said reading mail content:
reading a mail title;
performing text classification on the titles to form a title phrase;
judging whether the title phrases contain sensitive phrases or not according to a filtering rule;
if yes, marking the mail as a junk mail, and interrupting the operation;
if not, the next step is carried out.
5. The spam filtering method according to claim 1, wherein the step of marking the mail as junk mail and interrupting the operation specifically comprises:
if yes, adding one to the record value of the sending mailbox;
and marking the mail as junk mail and interrupting the operation.
6. The spam filtering method according to claim 5, further comprising, before said reading the title and content of the mail:
reading the mail sending mailbox of the mail;
judging whether the recorded value of the sending mailbox is greater than a frequency threshold value or not;
if the judgment result is yes, the mail is marked as a junk mail, and the operation is interrupted;
if the judgment result is negative, the next step is carried out.
7. A spam filtering device, comprising:
the reading module is used for reading the mail content;
the first judging module is used for judging whether the mail content contains pictures or not;
the extraction module is used for extracting the feature vectors in the pictures;
the second judgment module is used for judging whether the characteristic vector exists in a pre-stored vector set or not;
and the marking module is used for marking the mail as the junk mail.
8. The spam filtering device of claim 7, further comprising:
the reading module is also used for reading pictures in the existing junk mails;
the intercepting module is used for intercepting an area image containing junk information in the picture;
the processing module is used for extracting the feature vector of the region image;
and the storage module is used for storing the characteristic vectors extracted by the processing module into the vector set.
9. A spam filtering system comprising a spam filtering device according to any of claims 7 to 8, further comprising a server for updating said vector set.
10. A storage medium storing a computer program, wherein the computer program, when executed, implements the spam filtering method of any of claims 1-6.
CN202011463258.7A 2020-12-11 2020-12-11 Junk mail filtering method, device, system and storage medium Pending CN114629872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011463258.7A CN114629872A (en) 2020-12-11 2020-12-11 Junk mail filtering method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011463258.7A CN114629872A (en) 2020-12-11 2020-12-11 Junk mail filtering method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN114629872A (en) 2022-06-14

Family

ID=81896340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011463258.7A Pending CN114629872A (en) 2020-12-11 2020-12-11 Junk mail filtering method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN114629872A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101282310A (en) * 2008-05-23 2008-10-08 华东师范大学 Method and apparatus for preventing picture junk mail
CN111404805A (en) * 2020-03-12 2020-07-10 深信服科技股份有限公司 Junk mail detection method and device, electronic equipment and storage medium
CN111985896A (en) * 2020-08-19 2020-11-24 中国银行股份有限公司 Mail filtering method and device

Similar Documents

Publication Publication Date Title
US20060259558A1 (en) Method and program for handling spam emails
US20060149820A1 (en) Detecting spam e-mail using similarity calculations
JP2005208780A (en) Mail filtering system and url black list dynamic construction method to be used for the same
CN102760170A (en) Electronic book note-taking method based on screenshot
CN101072067A (en) Device and method for realizing short-message classified sending, receiving and displaying
CN101360074B (en) Method and system determining suspicious spam range
EP1955504B1 (en) Anti-spam application storage system
CN111221970B (en) Mail classification method and device based on behavior structure and semantic content joint analysis
CN101094197B (en) Method and mail server of resisting garbage mail
CN111010336A (en) Massive mail analysis method and device
US20050198181A1 (en) Method and apparatus to use a statistical model to classify electronic communications
CN107360331B (en) Short message display method
CN110048936B (en) Method for judging junk mail by semantic associated words
CN101340674A (en) Method and apparatus for adding description information to image in mobile terminal
CN114629872A (en) Junk mail filtering method, device, system and storage medium
US7715059B2 (en) Facsimile system, method and program product with junk fax disposal
CN104376304A (en) Identification method and device for text advertisement image
CN101552741A (en) E-mail system and its system e-mail ouput method and device
JP2005284454A (en) Junk e-mail distribution preventive system, and information terminal and e-mail server in the system
CN114629873A (en) Junk mail filtering method, device, system and storage medium
EP1733521B1 (en) A method and an apparatus to classify electronic communication
CN114629870A (en) Junk mail filtering method, device, system and storage medium
CN115080504A (en) File management method, terminal and storage medium
CN108182191B (en) Hotspot data processing method and device
KR101565821B1 (en) Method of filtering message, user terminal performing the same and storage media storing the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination