CN117834579A - Personalized spam filtering method, system, equipment and medium

Info

Publication number: CN117834579A
Application number: CN202311847703.3A
Authority: CN
Inventors: 林延中; 潘庆峰
Original assignee: Guangdong Yingshi Computer Technology Co ltd
Current assignee: Guangdong Yingshi Computer Technology Co ltd
Priority date: 2023-12-29
Filing date: 2023-12-29
Publication date: 2024-04-05

Abstract

The application relates to the technical field of information filtering, and in particular to a personalized spam filtering method, system, device and medium. Historical log information of an addressee is obtained and input into a recommendation model for training; a user mailbox address is acquired and input into the recommendation model to obtain user vector information; a mail-related vector is acquired and compared with the user vector information to obtain a vector similarity; and a similarity threshold is acquired, and the mail is reclassified when the vector similarity is greater than the similarity threshold. Misjudgment in mail filtering is thereby reduced, and the accuracy of spam filtering is improved.

Description

Personalized spam filtering method, system, equipment and medium

Technical Field

The present disclosure relates to the field of information filtering technologies, and in particular to a personalized spam filtering method, system, device, and medium.

Background

Current mail anti-spam technology mostly performs global filtering and ignores differences among users. For example, some users regard mails that solicit papers or manuscripts, or that invite manuscript review, as harassment, while the targeted users find them useful. The difference lies in whether the subject area of such mail is related to the field of the receiving user, or whether the level of the conference matches the academic level of the addressee: mail from a non-top conference may annoy addressees at certain universities but be useful to addressees at certain non-top universities. If a user-based personalized filtering engine re-examines the mail that the global engine has judged to be spam, it can correct such misjudgments, because the personalized engine can learn the mail semantics or habits of each addressee.

Any machine learning algorithm may misjudge when deciding whether a mail is spam. The misjudgment detection mechanism of a general anti-spam system depends heavily on manual feedback: only after a mail has been misjudged and the addressee has complained through feedback can the misjudgment be avoided in future, either by adding sender restriction information or by improving the selection of training samples for the model. In practice, however, because users are unfamiliar with how to use the feedback function, most misjudgments are not reported in time, so the problem of misjudged mail cannot be solved effectively and the experience of sending and receiving mail suffers. These problems remain to be solved.

Disclosure of Invention

In order to reduce misjudgment of mail filtering and improve accuracy of filtering junk mail, the application provides a personalized junk mail filtering method, system, equipment and medium, which adopt the following technical scheme:

in a first aspect, the present application provides a method for filtering personalized spam, including:

acquiring historical log information of the addressee, and inputting the historical log information into a recommendation model for training;

acquiring a user mailbox address, and inputting the user mailbox address into a recommendation model to obtain user vector information;

acquiring a mail related vector, and comparing user vector information with the mail related vector to obtain vector similarity;

and acquiring a similarity threshold, and reclassifying the mail when the vector similarity is greater than the similarity threshold.

Preferably, the method further comprises:

and acquiring mail sender information with vector similarity larger than a similarity threshold, and constructing a white list according to the mail sender information.

Preferably, the method further comprises:

the method comprises the steps of obtaining text content of mails, segmenting the text content to obtain a plurality of word information, and constructing a recommendation model according to the plurality of word information to screen junk mails.

Preferably, the method further comprises:

and acquiring the time information of mail reception, and inputting the time information of mail reception into a recommendation model to screen junk mails.

Preferably, the method further comprises:

and acquiring the information of the repeated occurrence times of the mails, and inputting the information of the repeated occurrence times of the mails into a recommendation model to screen junk mails.

In a second aspect, the present application provides a personalized spam filtering system comprising:

the training module is used for acquiring the history log information of the addressee and inputting the history log information into the recommendation model for training;

the user vector processing module is used for acquiring a user mailbox address, inputting the user mailbox address into the recommendation model and obtaining user vector information;

the comparison module is used for acquiring mail related vectors, and comparing the user vector information with the mail related vectors to obtain vector similarity;

and the judging module is used for acquiring the similarity threshold and reclassifying the mail when the vector similarity is greater than the similarity threshold.

Preferably, the system further comprises:

and the white list module is used for acquiring the mail sender information with the vector similarity larger than the similarity threshold value and constructing a white list according to the mail sender information.

Preferably, the system further comprises:

and the junk mail screening module is used for acquiring text content of the mails, segmenting the text content to obtain a plurality of word information, and constructing a recommendation model according to the plurality of word information to screen the junk mails.

In a third aspect, the present application provides a personalized spam filtering device comprising a memory storing a computer program and a processor arranged to run the computer program to perform a personalized spam filtering method as described above.

In a fourth aspect, the present application provides a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the personalized spam filtering method as described above when run.

To sum up, compared with the prior art, the beneficial effects brought by the technical scheme provided by the application at least include:

According to the present application, the recommendation model is trained on the historical log information; the user vector information is extracted from the recommendation model by means of the user mailbox address; the mail-related vector of a mail judged to be spam is then obtained and compared with the user vector information to obtain the vector similarity; and the spam verdict is revised according to the relationship between the vector similarity and the similarity threshold, so that misjudgment in mail filtering is reduced and the accuracy of spam filtering is improved.

Drawings

Fig. 1 is a schematic flow chart of a personalized spam filtering method according to an embodiment of the present application.

Fig. 2 is a schematic block diagram of a personalized spam filtering system according to an embodiment of the present application.

Reference numerals illustrate:

1. training module; 2. user vector processing module; 3. comparison module; 4. judging module; 5. white list module; 6. spam screening module.

Detailed Description

The application is described in further detail below with reference to Figs. 1-2. The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting.

When receiving mail, different users have different requirements for mail content. Some users regard mails about paper submission or manuscript review as spam, while others find them useful. The difference lies in whether the subject area of such mail is related to the field of the receiving user, or whether the level of the conference matches the academic level of the addressee: mail from a non-top conference may annoy addressees at certain universities but be useful to addressees at certain non-top universities. As another example, for advertising mail, e-commerce promotions, or promotions from foreign trade enterprises, some users happen to have a related need and therefore regard the mail as normal, while others have no such need and therefore regard it as spam.

Any machine learning algorithm is liable to misjudge when deciding whether a mail is spam, and the misjudgment detection mechanism of a general anti-spam system depends heavily on manual feedback. Moreover, senders whose mail is misjudged may appear at any time: for example, a customer brings a new OA system online and starts sending a new kind of notification message, a recruitment website suddenly adds a new category of event about which it wants to notify users by mail, or academic conference mail is sent out in bulk. It is therefore necessary to find misjudged senders quickly and to correct the decision errors of the machine learning model promptly by adding sender restriction information for them. If user feedback alone is relied on to find these senders, however, it often takes several weeks to accumulate enough feedback to reveal the misjudgments. The scheme of the present application is proposed to solve the misjudgment problem in this scenario.

Referring to fig. 1, a personalized spam filtering method according to the present application specifically includes:

step S1: acquiring historical log information of the addressee, and inputting the historical log information into a recommendation model for training;

step S2: acquiring a user mailbox address, and inputting the user mailbox address into a recommendation model to obtain user vector information;

step S3: acquiring a mail related vector, and comparing user vector information with the mail related vector to obtain vector similarity;

step S4: acquiring a similarity threshold, and reclassifying the mail when the vector similarity is greater than the similarity threshold.

Specifically, mails are first classified by a global spam classification system. Historical log information is then acquired to train the recommendation model; it is extracted from all mails of each addressee and includes the time at which each mail was received, the number of times the mail was repeated, the text content of the mail, and so on. In a recommendation model trained on all mail-related information of an addressee, the user vector is very similar to the mail vectors of the normal mail that the addressee frequently receives. Next, the user vector information is extracted from the recommendation model by means of the user mailbox address, and the mail-related vector of a mail judged to be spam is obtained and compared with the user vector information to obtain the vector similarity. The spam verdict is then revised according to the relationship between the vector similarity and the similarity threshold, so that misjudgment in mail filtering is reduced and the accuracy of spam filtering is improved.

The embodiment of the application introduces a recommendation system into the anti-spam scenario: the question of whether a mail is spam is converted into the question of whether the degree to which the mail would be recommended to the addressee is high enough, thereby realizing personalized anti-spam filtering.

A recommendation system calculates a vector for each user based on the user's historical information, such as search text and the title text of clicked commodities. For each commodity in the system, a vector is likewise calculated based on historical information, such as which users have clicked the commodity and the commodity's title and introduction text. By comparing the similarity of the two vectors, the degree to which the commodity should be recommended to the user can be evaluated.
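The vector comparison described above can be illustrated with a small sketch. The application does not name a particular similarity measure, so cosine similarity is used here purely as an assumption; the function name and the example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vectors, for illustration only.
user_vector = [0.9, 0.1, 0.3]
item_vector = [0.8, 0.2, 0.4]
score = cosine_similarity(user_vector, item_vector)
```

A score close to 1 indicates that the two vectors point in nearly the same direction, which in a recommendation setting would correspond to a high recommendation degree.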

As one implementation mode, text content of the mail is obtained, word segmentation is carried out on the text content to obtain a plurality of word information, and a recommendation model is constructed according to the plurality of word information to screen junk mails.

In particular, in the mail anti-spam scenario, the text of the mail is first segmented into words. For example, if the mail content is "this is a test", it is segmented into three words: "this is", "a" and "test". Each segmented word then corresponds to a "commodity" in the recommendation system; that is, a user receiving the mail "this is a test" is equivalent to the user clicking the commodities "this is", "a" and "test". Each such commodity word is in practice clicked by many different users, so a model can be generated by training a conventional recommendation system. Given a user and a mail text as input, the model outputs a recommendation degree, and as long as the recommendation degree exceeds a threshold, the user can be considered to want to receive the mail with high probability.
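The mapping from received mail to "clicked commodities" can be sketched as follows. This is only an illustration: the whitespace tokenizer stands in for a real word segmenter, and the function names and sample addresses are hypothetical rather than taken from the application.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace tokenizer, used only as a stand-in for a real word segmenter."""
    return text.lower().split()


def build_interactions(received_mail):
    """Map each addressee to a Counter of words, treating each word as a clicked item."""
    interactions = defaultdict(Counter)
    for address, body in received_mail:
        interactions[address].update(tokenize(body))
    return interactions


# Hypothetical mail log: (addressee, mail text) pairs.
mail_log = [
    ("alice@example.com", "this is a test"),
    ("alice@example.com", "a test of the conference system"),
    ("bob@example.com", "this is an invoice"),
]
interactions = build_interactions(mail_log)
```

The resulting user-to-word counts have the same shape as the user-to-item click data that a conventional recommendation system is trained on.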

As one embodiment, the time information of mail reception is acquired and input into the recommendation model to screen spam.

Specifically, in addition to discrete variables such as text, the input of the recommendation algorithm can support other discrete variables, such as the time of mail receipt. When the time of receipt is also used as an input feature, the same mail content sent to the same addressee can be judged to be normal if sent in the daytime and judged to be spam if sent in the middle of the night.
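One way to feed the time of receipt to a model as a discrete variable is to bucket it by hour and treat the bucket as one more token. The sketch below assumes hourly buckets and a hypothetical `hour=HH` token format; neither is specified in the application.

```python
from datetime import datetime


def hour_bucket(received_at):
    """Return a discrete token such as 'hour=03' for the hour of receipt."""
    return f"hour={received_at.hour:02d}"


def mail_features(body, received_at):
    """Combine the word tokens of the mail with its time-of-receipt token."""
    return body.lower().split() + [hour_bucket(received_at)]


# Same content, different time of receipt (hypothetical timestamps).
daytime = mail_features("quarterly report attached", datetime(2023, 12, 29, 10, 30))
midnight = mail_features("quarterly report attached", datetime(2023, 12, 29, 0, 5))
```

The two feature lists differ only in the final token, which is what would allow a model to score identical content differently depending on when it arrives.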

As one embodiment, the information on the number of times a mail is repeated is acquired and input into the recommendation model to screen spam.

Specifically, in addition to discrete variables, the input of current recommendation algorithms also supports continuous variables, such as the number of times a mail is repeated. When the repetition count is used as an input feature, the same mail content sent to the same addressee can be judged to be normal if it is repeated only a few times, and judged to be spam if it is sent repeatedly, for example 1000 times.
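As a rough illustration of a continuous input feature, the repetition count can be compressed before it is passed to a model. The log(1 + n) scaling used here is an assumption chosen for the sketch, not something the application prescribes, and the function name is hypothetical.

```python
import math


def repeat_feature(repeat_count):
    """Compress a raw repetition count into a continuous feature using log(1 + n)."""
    if repeat_count < 0:
        raise ValueError("repeat_count must be non-negative")
    return math.log1p(repeat_count)


few = repeat_feature(2)      # a mail repeated only a couple of times
many = repeat_feature(1000)  # a mail repeated 1000 times
```

The scaling keeps the feature monotonic in the raw count while preventing very large counts from dominating the other inputs.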

Using each user's sending and receiving records in the cloud, the normal-mail and spam samples in those records are obtained, so that an independent model can be trained and personalized models for users on the scale of a hundred million are supported. After the global anti-spam filtering engine has filtered the mail, the anti-spam system filters once more with the personalized model to obtain the degree of matching between the mail and the addressee; as long as the matching degree is greater than a threshold, the mail can be treated as normal mail.
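The two-stage decision described above can be sketched as a small function. The function name, the string labels, and the choice to re-examine only mail that the global engine flagged are assumptions made for illustration.

```python
def final_verdict(global_is_spam, match_score, threshold):
    """Second-pass decision: only mail flagged by the global engine is re-examined."""
    if not global_is_spam:
        return "normal"
    if match_score > threshold:
        # Reclassified: the matching degree suggests the addressee wants this mail.
        return "normal"
    return "spam"


# Hypothetical scores and threshold.
kept = final_verdict(global_is_spam=True, match_score=0.5, threshold=0.8)
rescued = final_verdict(global_is_spam=True, match_score=0.9, threshold=0.8)
```

In this sketch the personalized pass can only rescue mail; it never turns mail that the global engine accepted into spam.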

After the mails have been classified for spam, the historical log information of the addressee is obtained and input into the recommendation model for training. Specifically, the historical log information of each addressee is extracted, including the time at which each mail was received, the number of times the mail was repeated, the text content of the mail, and so on, and mail-related vectors are constructed from it. These vectors are used as training samples and a training program is run to train the corresponding recommendation model. In the trained model, each user vector is very similar to the mail vectors of the normal mail that the user frequently receives, and dissimilar to the mail vectors of the spam that the user frequently receives.

A user mailbox address is acquired and input into the recommendation model to obtain the user vector information. After training, the model stores the user vector corresponding to each user, so the user vector information for a user can be looked up in the model simply by inputting that user's email address.
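The lookup step can be pictured as a keyed store of per-user vectors. The class below is a hypothetical stand-in for the trained model, and the case-insensitive address normalization is an assumption of this sketch.

```python
class UserVectorStore:
    """Toy stand-in for a trained model that keeps one vector per mailbox address."""

    def __init__(self, user_vectors):
        # Normalize addresses so that lookups are case-insensitive (an assumption).
        self._vectors = {
            addr.strip().lower(): list(vec) for addr, vec in user_vectors.items()
        }

    def lookup(self, mailbox_address):
        """Return the stored vector, or None for an address the model never saw."""
        return self._vectors.get(mailbox_address.strip().lower())


# Hypothetical stored vector.
store = UserVectorStore({"Alice@Example.com": [0.9, 0.1, 0.3]})
alice = store.lookup("alice@example.com")
unknown = store.lookup("nobody@example.com")
```

Returning None for an unknown address leaves it to the caller to decide how to treat addressees with no history, for example by keeping the global engine's verdict.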

A mail-related vector is acquired and compared with the user vector information to obtain the vector similarity. Specifically, the text content of the mail judged to be spam, its time of receipt and its repetition count are passed as parameters to the recommendation model, which calculates the mail-related vector; comparing the user vector information with the mail-related vector gives the degree of similarity between the addressee vector and the mail-related vector. A preset similarity threshold is then acquired; if the degree of similarity exceeds the preset threshold, the mail is reclassified as normal mail and delivered to the user's inbox.
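The application does not state how the mail-related vector is calculated from its inputs. One simple possibility, shown here only as an assumption, is to average the learned vectors of the mail's feature tokens; the function name, token names and vector values are hypothetical.

```python
def mail_vector(feature_tokens, item_vectors, dim):
    """Average the vectors of known feature tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    known = 0
    for token in feature_tokens:
        vec = item_vectors.get(token)
        if vec is None:
            continue
        total = [t + x for t, x in zip(total, vec)]
        known += 1
    if known == 0:
        return total  # all-zero vector when no token is known
    return [t / known for t in total]


# Hypothetical learned vectors for two words and one time-of-receipt token.
item_vectors = {
    "conference": [1.0, 0.0],
    "deadline": [0.0, 1.0],
    "hour=10": [1.0, 1.0],
}
vec = mail_vector(["conference", "deadline", "hour=10", "unseen"], item_vectors, dim=2)
```

The resulting vector lives in the same space as the user vectors, so it can be compared with a user vector by any similarity measure and the result checked against the preset threshold.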

As one implementation, mail sender information with vector similarity greater than a similarity threshold is obtained, and a white list is constructed according to the mail sender information.

Specifically, the sending email addresses and sending domain names of reclassified mails are recorded in real time, together with the number of times mail from each sending email address and each sending domain name has been reclassified. Several thresholds are set for the reclassification count; in the embodiment of the application, four thresholds T1 to T4 are set. If the reclassification count of a sending email address exceeds threshold T1, an audit notice is triggered. If the reclassification count of a sending domain name exceeds threshold T2, an audit notice is triggered. If the reclassification count of a sending email address exceeds threshold T3 and the address has not yet been audited, sender restriction information is automatically added in advance. If the reclassification count of a sending domain name exceeds threshold T4 and the domain name has not yet been audited, sender restriction information is automatically added in advance. After receiving the manual audit notice, the maintainer of the restriction information judges whether sender restriction information needs to be added: if it does and the system has not added it automatically, the maintainer adds it manually; if it does not, the maintainer checks whether it has already been added automatically and, if so, manually deletes the corresponding restriction information.
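The threshold logic for T1 to T4 can be sketched as a small counter-based tracker. The class name, the action labels and the example threshold values are hypothetical; the application gives no concrete values, and manual review by the maintainer is not modelled.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ReclassificationTracker:
    t1: int  # audit threshold for a sending email address
    t2: int  # audit threshold for a sending domain name
    t3: int  # auto-add threshold for a sending email address
    t4: int  # auto-add threshold for a sending domain name
    email_counts: Counter = field(default_factory=Counter)
    domain_counts: Counter = field(default_factory=Counter)
    audited: set = field(default_factory=set)  # senders already reviewed manually

    def record(self, sender_email):
        """Count one reclassification and return the actions it triggers."""
        email = sender_email.lower()
        domain = email.rsplit("@", 1)[-1]
        self.email_counts[email] += 1
        self.domain_counts[domain] += 1
        actions = []
        if self.email_counts[email] > self.t1:
            actions.append(("audit_notice", email))
        if self.domain_counts[domain] > self.t2:
            actions.append(("audit_notice", domain))
        if self.email_counts[email] > self.t3 and email not in self.audited:
            actions.append(("auto_add_restriction", email))
        if self.domain_counts[domain] > self.t4 and domain not in self.audited:
            actions.append(("auto_add_restriction", domain))
        return actions


# Hypothetical thresholds; four reclassifications of mail from the same sender.
tracker = ReclassificationTracker(t1=1, t2=2, t3=3, t4=4)
results = [tracker.record("news@example.org") for _ in range(4)]
```

This sketch reports an action on every call once a threshold is exceeded; a production system would presumably deduplicate notices and record the outcome of the audit.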

Referring to fig. 2, an embodiment of the present application provides a personalized spam filtering system, the system comprising:

the training module 1 is used for acquiring the history log information of the addressee and inputting the history log information into the recommendation model for training;

the user vector processing module 2 is used for acquiring a user mailbox address, inputting the user mailbox address into the recommendation model and obtaining user vector information;

the comparison module 3 is used for obtaining mail related vectors, and comparing the user vector information with the mail related vectors to obtain vector similarity;

and the judging module 4 is used for acquiring the similarity threshold and reclassifying the mail when the vector similarity is greater than the similarity threshold.

Specifically, the system further comprises a whitelist module 5, which is used for acquiring mail sender information with vector similarity greater than a similarity threshold value, and constructing a whitelist according to the mail sender information.

The system also comprises a spam screening module 6, which is used for acquiring the text content of mails, segmenting the text content into words to obtain a plurality of word information, and constructing a recommendation model according to the plurality of word information to screen spam. The module is also used for acquiring the time information of mail reception and inputting it into the recommendation model to screen spam, and for acquiring the information on the number of times a mail is repeated and inputting it into the recommendation model to screen spam.

The embodiments of the present application provide a personalized spam filtering device comprising a memory storing a computer program and a processor arranged to run the computer program to perform the personalized spam filtering method as described above.

Embodiments of the present application provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform a personalized spam filtering method as described above when run.

It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the apparatus and the product described above may refer to the corresponding procedures in the foregoing method embodiments, which are not described herein again.

In the several embodiments provided herein, it should be understood that the disclosed methods, systems, apparatus, and program products may be embodied in other ways.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. A method for personalized spam filtering, comprising:

2. The personalized spam filtering method of claim 1, further comprising:

3. The personalized spam filtering method of claim 1, further comprising:

4. A personalized spam filtering method according to claim 3, further comprising:

5. A personalized spam filtering method according to claim 3, further comprising:

6. A personalized spam filtering system, comprising:

7. The personalized spam filtering system of claim 6, further comprising:

8. The personalized spam filtering system of claim 6, further comprising:

9. A personalized spam filtering device comprising a memory storing a computer program and a processor arranged to run the computer program to perform the personalized spam filtering method of any of claims 1-5.

10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program is arranged to execute the personalized spam filtering method of any of claims 1-5 at run-time.