CN110569357A

CN110569357A - Method and device for constructing mail classification model, terminal equipment and medium

Info

Publication number: CN110569357A
Application number: CN201910767882.7A
Authority: CN
Inventors: 陈磊华; 潘文辉; 朱南皓; 杨芸
Original assignee: On Keke Science And Technology (guangzhou) Co Ltd
Current assignee: On Keke Science And Technology (guangzhou) Co Ltd
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2019-12-13

Abstract

The invention discloses a method, a device, terminal equipment and a medium for constructing a mail classification model. The method comprises the following steps: constructing a target data set and a corpus from a sample mail data set; using the corpus to train word2vec models in one-to-one correspondence with the text data set, the URL link data set and the script data set, and converting the text data set, the URL link data set and the script data set into feature vectors by using the word2vec models; constructing classifiers in one-to-one correspondence with the data sets in the target data set other than the fusion data set, and training the classifiers to obtain the corresponding classification models; training the classification models with the fusion data set to obtain decision weights of the various kinds of data in the fusion data set; and, according to the decision weights, performing index evaluation, verification and optimization on the classification models by using a test mail data set. The invention can establish a mail classification model that covers the various kinds of data in mails, so that mails can be detected in multiple dimensions through the mail classification model and classified efficiently.

Description

Method and device for constructing mail classification model, terminal equipment and medium

Technical Field

The invention relates to the field of information security, in particular to a method, a device, terminal equipment and a medium for constructing a mail classification model.

Background

Email is now widely used in social, business, financial and other settings, but its spread has been accompanied by a flood of spam. In 2018, spam accounted for more than 50% of mail traffic. Spam not only occupies a large amount of network traffic and costs recipients a great deal of time, energy and money; the malicious links, malicious scripts and Trojan-carrying attachments contained in much spam can also leak users' information and directly cause various losses.

With the rapid development of the internet, spam has evolved from containing a single type of content to containing multiple types of content, such as text, images, URL links, attachments and JavaScript scripts. Traditional content-based spam detection systems detect spam in a single dimension: they build machine learning models only for images or text, and consider neither URL detection for promotional or malicious links nor detection of script jump links in the mail body. Such detection means fall short when faced with spam that fuses multiple types of features, cannot achieve good detection efficiency, and are therefore limited.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method, an apparatus, a terminal device and a medium for constructing a mail classification model, which can establish a mail classification model for various data in a mail, so that the mail can be subjected to multidimensional detection through the mail classification model, and efficient classification of the mail is realized.

In order to solve the technical problem, the invention provides a method for constructing a mail classification model, which comprises the following steps:

constructing a target data set and a corpus by using a sample mail data set; the target data set comprises a text data set, a URL link data set, a script data set, an image data set and a fusion data set, the fusion data set comprises data sets of various combinations of the text data, the URL link data, the script data and the image data, and the corpus comprises a text corpus, a URL link corpus and a script corpus;

the corpus is used for training word2vec models which are in one-to-one correspondence with the text data set, the URL link data set and the script data set, and the text data set, the URL link data set and the script data set are converted into feature vectors by utilizing the word2vec models;

constructing classifiers which correspond to the data sets in the target data set except the fused data set one by one, and training the classifiers to obtain corresponding classification models;

using the fusion data set to train the classification model to obtain decision weights of various data in the fusion data set;

and, according to the decision weight, performing index evaluation, verification and optimization on the classification model by using a test mail data set.

Further, the text corpus, the URL link corpus and the script corpus are respectively constructed according to the text data set, the URL link data set and the script data set.

Further, the classification model is a deep learning model.

Further, the corresponding classification model comprises:

The classification models of the text data set, the URL link data set, the script data set and the image data set are a CNN model, an RNN model, an LSTM model and a CNN model respectively.

The invention also provides a device for constructing the mail classification model, which comprises the following components:

The data acquisition module is used for constructing a target data set and a corpus by utilizing the sample mail data set; the target data set comprises a text data set, a URL link data set, a script data set, an image data set and a fusion data set, the fusion data set comprises data sets of various combinations of the text data, the URL link data, the script data and the image data, and the corpus comprises a text corpus, a URL link corpus and a script corpus;

The vector conversion module is used for training word2vec models which are in one-to-one correspondence with the text data set, the URL link data set and the script data set, and converting the text data set, the URL link data set and the script data set into feature vectors by using the word2vec models;

The model pre-building module is used for building classifiers which correspond to the data sets in the target data set except the fusion data set one by one, and training the classifiers to obtain corresponding classification models;

The weight acquisition module is used for using the fusion data set to train the classification model to obtain decision weights of various data in the fusion data set;

The model optimization module is used for performing index evaluation, verification and optimization on the classification model by using the test mail data set according to the decision weight.

Further, the classification model is a deep learning model.

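Further, the corresponding classification model comprises:

The classification models of the text data set, the URL link data set, the script data set and the image data set are a CNN model, an RNN model, an LSTM model and a CNN model respectively.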

The embodiment of the invention has the following beneficial effects:

The embodiment of the invention can establish a mail classification model aiming at various data in the mails, so that the mails can be subjected to multi-dimensional detection through the mail classification model, and the high-efficiency classification of the mails is realized.

The invention also provides a terminal device for constructing a mail classification model, which comprises a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the memory is coupled with the processor, and the processor implements the method for constructing the mail classification model when executing the computer program.

The invention also provides a computer-readable storage medium, which includes a stored computer program, wherein when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to execute the method for constructing the mail classification model.

Drawings

Fig. 1 is a schematic flow chart of a method for constructing a mail classification model according to a first embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an apparatus for constructing a mail classification model according to a second embodiment of the present invention.

Detailed Description

The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

It should be noted that, the step numbers in the text are only for convenience of explanation of the specific embodiments, and do not serve to limit the execution sequence of the steps. The method provided by the embodiment can be executed by the relevant server, and the server is taken as an example for explanation below.

A first embodiment. Please refer to Fig. 1.

As shown in Fig. 1, a method for constructing a mail classification model according to a first embodiment includes steps S1 to S5:

S1, constructing a target data set and a corpus by using the sample mail data set; the target data set comprises a text data set, a URL link data set, a script data set, an image data set and a fusion data set, the fusion data set comprises data sets of various combinations of the text data, the URL link data, the script data and the image data, and the corpus comprises a text corpus, a URL link corpus and a script corpus.

S2, the corpus is used for training word2vec models corresponding to the text data set, the URL link data set and the script data set one by one, and the text data set, the URL link data set and the script data set are converted into feature vectors by using the word2vec models.

S3, constructing classifiers corresponding to the data sets in the target data set except the fusion data set one by one, and training the classifiers to obtain corresponding classification models.

S4, using the fusion data set to train the classification model to obtain decision weights of various data in the fusion data set.

S5, according to the decision weight, performing index evaluation, verification and optimization on the classification model by using a test mail data set.

It should be noted that the sample mail data set includes normal mail data and spam mail data.

In a specific embodiment, the normal mails and the spam mails can be obtained from a mail sending and receiving system, an anti-malware and anti-spam mail system, user labels, expert labels and the like.

It is understood that, in step S1, the text data in the sample mail data set is used to construct the text data set and the text corpus; the URL link data in the sample mail data set is used to construct the URL link data set and the URL link corpus; the script data in the sample mail data set is used to construct the script data set and the script corpus; the image data in the sample mail data set is used to construct the image data set; and a plurality of combinations of the text data, the URL link data, the script data and the image data in the sample mail data set are used to construct the different fusion data sets.
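The data separation in step S1 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the regular expressions, the function names `split_mail` and `fusion_combinations`, and the reading of "various combinations" as every combination of two or more kinds of data present in a mail are all hypothetical choices made for the example. Only the Python standard library is used.

```python
import re
from email import message_from_string
from itertools import combinations

MODALITIES = ("text", "url", "script", "image")
# Hypothetical patterns; a production system would need a more careful HTML parser.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def split_mail(raw: str) -> dict:
    """Split one raw mail into the four kinds of data named in step S1."""
    msg = message_from_string(raw)
    sample = {name: [] for name in MODALITIES}
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        payload = part.get_payload(decode=True) or b""
        if ctype.startswith("image/"):
            sample["image"].append(payload)  # raw image bytes
        elif ctype in ("text/plain", "text/html"):
            charset = part.get_content_charset() or "utf-8"
            body = payload.decode(charset, errors="replace")
            sample["script"].extend(s.strip() for s in SCRIPT_RE.findall(body))
            sample["url"].extend(URL_RE.findall(body))
            # What remains after removing scripts, tags and URLs is treated as text.
            stripped = TAG_RE.sub(" ", SCRIPT_RE.sub(" ", body))
            sample["text"].append(" ".join(URL_RE.sub(" ", stripped).split()))
    return sample


def fusion_combinations(sample: dict) -> list:
    """Assumed reading of 'various combinations': all subsets of size >= 2."""
    present = [name for name in MODALITIES if sample[name]]
    return [c for r in range(2, len(present) + 1) for c in combinations(present, r)]


RAW_MAIL = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>Win a prize at http://promo.example/win-now today</p>"
    "<script>location.href='/jump';</script>\n"
)
sample = split_mail(RAW_MAIL)
```

Running `split_mail` over every mail in the sample mail data set and collecting each key into its own list would yield the text, URL link, script and image data sets; `fusion_combinations` indicates which fusion data sets a given mail could contribute to.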

In step S2, the word2vec model corresponding to the text data set is trained with the text corpus, so that the word2vec model converts the text data set into feature vectors; the word2vec model corresponding to the URL link data set is trained with the URL link corpus, so that the word2vec model converts the URL link data set into feature vectors; and the word2vec model corresponding to the script data set is trained with the script corpus, so that the word2vec model converts the script data set into feature vectors.

The word2vec models are trained using CBOW or skip-gram to convert the corresponding data into vectors that a computer can process. Converting the text data set, the URL link data set and the script data set into such vectors prevents processing from being interrupted because the computer cannot recognize the raw data.
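The difference between the two training modes mentioned above can be illustrated by the training examples each one draws from a token sequence. The sketch below is not a word2vec trainer; it only shows, in plain Python, how skip-gram pairs a center token with each context token while CBOW pairs the whole context with the center token, and how a trained lookup table would then turn tokens into vectors. The function names, the window size and the zero vector for unknown tokens are assumptions made for the example.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts each context token from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_pairs(tokens, window=2):
    """(context, center) pairs: CBOW predicts the center token from its context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            pairs.append((context, center))
    return pairs


def embed(tokens, table, dim):
    """Look up each token's vector; unknown tokens map to a zero vector (an assumption)."""
    return [list(table.get(tok, [0.0] * dim)) for tok in tokens]


url_tokens = ["promo", "example", "win", "now"]
sg = skipgram_pairs(url_tokens, window=1)
cb = cbow_pairs(url_tokens, window=1)
```

In practice a library would normally be used for the training itself; in gensim, for instance, the `Word2Vec` class selects between the two modes through its `sg` parameter.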

In step S3, a classifier is constructed from the text data set converted into feature vectors and trained to obtain the classification model corresponding to the text data set; a classifier is constructed from the URL link data set converted into feature vectors and trained to obtain the classification model corresponding to the URL link data set; a classifier is constructed from the script data set converted into feature vectors and trained to obtain the classification model corresponding to the script data set; and a classifier is constructed from the image data set and trained to obtain the classification model corresponding to the image data set. Each of these classification models is a single-dimensional classification model that classifies according to only one type of data.

In step S4, the classification model corresponding to the text data set is trained with the fusion data set to obtain the decision weight of the text data in the fusion data set; the classification model corresponding to the URL link data set is trained with the fusion data set to obtain the decision weight of the URL link data in the fusion data set; the classification model corresponding to the script data set is trained with the fusion data set to obtain the decision weight of the script data in the fusion data set; and the classification model corresponding to the image data set is trained with the fusion data set to obtain the decision weight of the image data in the fusion data set.

In step S5, according to the decision weights of the text data, the URL link data, the script data, and the image data, a test mail dataset is used to perform index evaluation verification and optimization on the classification models corresponding to the text dataset, the URL link dataset, the script dataset, and the image dataset.
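The text does not name the indexes used in step S5. Assuming the common binary classification indexes (accuracy, precision, recall and F1, with spam as the positive class), the evaluation on the test mail data set could be sketched as follows; the function name `evaluate` and the label convention (1 for spam, 0 for normal mail) are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Compute assumed evaluation indexes; labels are 1 (spam) or 0 (normal mail)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


report = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

If an index falls below a chosen target, the corresponding classification model or its decision weight would be adjusted and the evaluation repeated; the concrete targets are left open here because the text does not specify them.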

In this embodiment, the corresponding single-dimensional classification models are constructed from the text data, the URL link data, the script data and the image data in the sample mails, and the single-dimensional classification models are fused by using the decision weights of the different data, so that a multi-dimensional classification model is obtained.
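How the decision weights are computed and combined is not spelled out in the text, so the following is only one plausible reading, stated as an assumption: each single-dimensional model's weight is its accuracy on the fusion data set, normalized so that the weights sum to 1, and the multi-dimensional decision is the weighted average of the spam probabilities produced by the models whose kind of data is present in the mail. The names `decision_weights` and `fuse` and the 0.5 threshold are hypothetical.

```python
def decision_weights(accuracy_by_modality):
    """Normalize per-model accuracies on the fusion data set into weights (assumed scheme)."""
    total = sum(accuracy_by_modality.values())
    if total <= 0:
        raise ValueError("at least one model must have positive accuracy")
    return {name: acc / total for name, acc in accuracy_by_modality.items()}


def fuse(probabilities, weights, threshold=0.5):
    """Weighted average of spam probabilities; None marks data absent from the mail."""
    present = {name: p for name, p in probabilities.items() if p is not None}
    if not present:
        raise ValueError("the mail contains none of the modelled kinds of data")
    norm = sum(weights[name] for name in present)
    score = sum(weights[name] * p for name, p in present.items()) / norm
    return score, score >= threshold


weights = decision_weights({"text": 0.9, "url": 0.8, "script": 0.7, "image": 0.6})
score, is_spam = fuse({"text": 0.2, "url": 0.9, "script": None, "image": None}, weights)
```

In this example the text model alone would pass the mail as normal, while the URL model's higher-confidence output pushes the fused score just over the threshold, which is the kind of multi-dimensional effect the embodiment describes.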

Similarly, if mails are to be classified according to other data in the mail, a corresponding single-dimensional model may be added for that data, and the decision weight of that data is added to the multi-dimensional classification model of this embodiment.

In a specific embodiment, the text corpus, the URL link corpus, and the script corpus are constructed according to the text dataset, the URL link dataset, and the script dataset, respectively.

It is to be understood that the text corpus is constructed from the text data set, the URL link corpus is constructed from the URL link data set, and the script corpus is constructed from the script data set.

In this embodiment, for the text data set, a Chinese word segmentation tool and a Chinese stop-word list are used to segment the text and construct the text corpus; for the URL link data set, common URL symbols such as "-" and "/" are used as delimiters to split the links and construct the URL link corpus; and for the script data set, the scripts are parsed into abstract syntax trees to construct the script corpus. For a JavaScript script, for example, Esprima.js is used to parse the JavaScript code into an abstract syntax tree, from which the JavaScript script corpus is constructed.
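The three corpus-building rules above can be sketched in plain Python. Several points are assumptions made for the example rather than details from the text: the delimiter set beyond "-" and "/", the input to `text_tokens` being already segmented by some Chinese word segmentation tool, and the choice to flatten an abstract syntax tree into its sequence of node types. The tree used here is a small hand-written dictionary in the ESTree style that Esprima emits, not actual parser output.

```python
import re

# "-" and "/" come from the text; the remaining delimiters are an assumption.
URL_SPLIT_RE = re.compile(r"[-/.:?=&_%#]+")


def url_tokens(url):
    """Split a URL on common link symbols to obtain corpus tokens."""
    return [tok for tok in URL_SPLIT_RE.split(url.lower()) if tok]


def text_tokens(segmented_words, stop_words):
    """Drop stop words from text that a segmentation tool has already split."""
    return [w for w in segmented_words if w not in stop_words]


def ast_node_types(node):
    """Pre-order list of node types in an ESTree-style tree (assumed corpus form)."""
    types = []
    if isinstance(node, dict):
        if "type" in node:
            types.append(node["type"])
        for child in node.values():
            types.extend(ast_node_types(child))
    elif isinstance(node, list):
        for child in node:
            types.extend(ast_node_types(child))
    return types


# Hand-written tree for the script: location.href = '/jump'
EXAMPLE_AST = {
    "type": "Program",
    "body": [{
        "type": "ExpressionStatement",
        "expression": {
            "type": "AssignmentExpression",
            "operator": "=",
            "left": {
                "type": "MemberExpression",
                "object": {"type": "Identifier", "name": "location"},
                "property": {"type": "Identifier", "name": "href"},
            },
            "right": {"type": "Literal", "value": "/jump"},
        },
    }],
}
```

Each function returns one "sentence" of tokens; the list of such sentences over a whole data set is the kind of corpus a word2vec model is trained on.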

In a specific embodiment, the classification model is a deep learning model.

It can be understood that adopting a deep learning model as the classification model helps improve classification accuracy. Moreover, a deep learning model automatically extracts low-level features and combines them into high-level features for learning, so no additional manual feature extraction is needed, which helps improve classification efficiency.

In a specific embodiment, the corresponding classification model in step S3 includes: the classification models of the text data set, the URL link data set, the script data set and the image data set are a CNN model, an RNN model, an LSTM model and a CNN model respectively.

It is understood that, by using a CNN model as a classification model of the text data set/the image data set, local features in the text data set/the image data set can be effectively identified; an RNN model is adopted as a classification model of the URL link data set, so that time series characteristics in the URL link data set can be effectively identified; and by adopting the LSTM model as the classification model of the script data set, the context code association characteristics in the script data set can be effectively identified.
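The remark about local features can be made concrete with the core operation of a text CNN: one filter slides over consecutive embedding vectors, and max pooling keeps its strongest response wherever it occurs in the sequence. The sketch below implements just that single operation in plain Python with a hand-picked filter; it is not a trained model, and the function name and the toy numbers are hypothetical. A real implementation would use a deep learning framework with many learned filters.

```python
def conv1d_max_pool(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of vectors, apply ReLU, then max-pool over time."""
    width = len(kernel)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the filter width")
    best = 0.0  # ReLU floor: negative activations contribute nothing
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = bias + sum(
            k * x
            for kernel_row, vector in zip(kernel, window)
            for k, x in zip(kernel_row, vector)
        )
        best = max(best, activation)
    return best


# Three 2-dimensional token vectors and one filter spanning two tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pattern = [[1.0, 0.0], [0.0, 1.0]]
feature = conv1d_max_pool(tokens, pattern)
```

Because the pooled value depends only on whether the pattern appears somewhere in the sequence, the same filter responds to a suspicious phrase regardless of its position in the mail text, which is the sense in which a CNN identifies local features.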

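The embodiment of the invention has the following beneficial effects: a mail classification model can be established for the various kinds of data in mails, so that mails can be subjected to multi-dimensional detection through the mail classification model and classified efficiently.

A second embodiment. Please refer to Fig. 2.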

As shown in Fig. 2, a second embodiment provides an apparatus for constructing a mail classification model, including:

a data obtaining module 21, configured to construct a target data set and a corpus by using a sample mail data set; the target data set comprises a text data set, a URL link data set, a script data set, an image data set and a fusion data set, the fusion data set comprises data sets of various combinations of the text data, the URL link data, the script data and the image data, and the corpus comprises a text corpus, a URL link corpus and a script corpus;

a vector conversion module 22, configured to use the corpus to train word2vec models in one-to-one correspondence with the text data set, the URL link data set and the script data set, and to convert the text data set, the URL link data set and the script data set into feature vectors by using the word2vec models;

a model pre-building module 23, configured to construct classifiers in one-to-one correspondence with the data sets in the target data set other than the fusion data set, and to train the classifiers to obtain the corresponding classification models;

a weight obtaining module 24, configured to use the fusion data set to train the classification models to obtain decision weights of the various kinds of data in the fusion data set;

and a model optimization module 25, configured to perform index evaluation, verification and optimization on the classification models by using the test mail data set according to the decision weights.

It is understood that the data obtaining module 21 constructs the text data set and the text corpus by using the text data in the sample mail data set; constructs the URL link data set and the URL link corpus by using the URL link data in the sample mail data set; constructs the script data set and the script corpus by using the script data in the sample mail data set; constructs the image data set by using the image data in the sample mail data set; and constructs the different fusion data sets by using a plurality of combinations of the text data, the URL link data, the script data and the image data in the sample mail data set.

The vector conversion module 22 trains the word2vec model corresponding to the text data set by using the text corpus, so that the word2vec model converts the text data set into feature vectors; trains the word2vec model corresponding to the URL link data set by using the URL link corpus, so that the word2vec model converts the URL link data set into feature vectors; and trains the word2vec model corresponding to the script data set by using the script corpus, so that the word2vec model converts the script data set into feature vectors.

The model pre-building module 23 constructs a classifier from the text data set converted into feature vectors and trains it to obtain the classification model corresponding to the text data set; constructs a classifier from the URL link data set converted into feature vectors and trains it to obtain the classification model corresponding to the URL link data set; constructs a classifier from the script data set converted into feature vectors and trains it to obtain the classification model corresponding to the script data set; and constructs a classifier from the image data set and trains it to obtain the classification model corresponding to the image data set. Each of these classification models is a single-dimensional classification model that classifies according to only one type of data.

The weight obtaining module 24 trains the classification model corresponding to the text data set by using the fusion data set to obtain the decision weight of the text data in the fusion data set; trains the classification model corresponding to the URL link data set by using the fusion data set to obtain the decision weight of the URL link data in the fusion data set; trains the classification model corresponding to the script data set by using the fusion data set to obtain the decision weight of the script data in the fusion data set; and trains the classification model corresponding to the image data set by using the fusion data set to obtain the decision weight of the image data in the fusion data set.

The model optimization module 25 performs index evaluation verification and optimization on the classification models corresponding to the text data set, the URL link data set, the script data set, and the image data set one by one using a test mail data set according to decision weights of the text data, the URL link data, the script data, and the image data.

In a specific embodiment, the classification model is a deep learning model.

In a specific embodiment, the corresponding classification model includes: the classification models of the text data set, the URL link data set, the script data set and the image data set are a CNN model, an RNN model, an LSTM model and a CNN model respectively.

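The embodiment of the invention has the following beneficial effects: a mail classification model can be established for the various kinds of data in mails, so that mails can be subjected to multi-dimensional detection through the mail classification model and classified efficiently.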

A third embodiment.

A third embodiment provides a terminal device for constructing a mail classification model, which includes a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the memory is coupled to the processor, and the processor executes the computer program to implement the method for constructing a mail classification model as described above, and has the same beneficial effects as the method for constructing a mail classification model.

A fourth embodiment.

A fourth embodiment provides a computer-readable storage medium, which includes a stored computer program, where when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the method for constructing a mail classification model as described above, and has the same beneficial effects as the method for constructing the mail classification model.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

It will be understood by those skilled in the art that all or part of the processes of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims

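1. A method for constructing a mail classification model, characterized by comprising the following steps:

constructing a target data set and a corpus by using a sample mail data set; the target data set comprises a text data set, a URL link data set, a script data set, an image data set and a fusion data set, the fusion data set comprises data sets of various combinations of the text data, the URL link data, the script data and the image data, and the corpus comprises a text corpus, a URL link corpus and a script corpus;

using the corpus to train word2vec models which are in one-to-one correspondence with the text data set, the URL link data set and the script data set, and converting the text data set, the URL link data set and the script data set into feature vectors by using the word2vec models;

constructing classifiers which correspond to the data sets in the target data set except the fusion data set one by one, and training the classifiers to obtain corresponding classification models;

using the fusion data set to train the classification model to obtain decision weights of various data in the fusion data set;

and, according to the decision weight, performing index evaluation, verification and optimization on the classification model by using a test mail data set.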

2. The method for constructing the mail classification model according to claim 1, wherein the text corpus, the URL link corpus and the script corpus are constructed from the text dataset, the URL link dataset and the script dataset, respectively.

3. The method of constructing a mail classification model according to claim 1, wherein the classification model is a deep learning model.

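4. The method of constructing a mail classification model according to claim 1, wherein the corresponding classification model comprises:

the classification models of the text data set, the URL link data set, the script data set and the image data set are a CNN model, an RNN model, an LSTM model and a CNN model respectively.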

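5. An apparatus for constructing a mail classification model, comprising:

a data acquisition module, used for constructing a target data set and a corpus by utilizing a sample mail data set; the target data set comprises a text data set, a URL link data set, a script data set, an image data set and a fusion data set, the fusion data set comprises data sets of various combinations of the text data, the URL link data, the script data and the image data, and the corpus comprises a text corpus, a URL link corpus and a script corpus;

a vector conversion module, used for training word2vec models which are in one-to-one correspondence with the text data set, the URL link data set and the script data set by using the corpus, and converting the text data set, the URL link data set and the script data set into feature vectors by using the word2vec models;

a model pre-building module, used for building classifiers which correspond to the data sets in the target data set except the fusion data set one by one, and training the classifiers to obtain corresponding classification models;

a weight acquisition module, used for using the fusion data set to train the classification model to obtain decision weights of various data in the fusion data set;

and a model optimization module, used for performing index evaluation, verification and optimization on the classification model by using a test mail data set according to the decision weight.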

6. The apparatus for constructing a mail classification model according to claim 5, wherein the text corpus, the URL link corpus, and the script corpus are constructed from the text dataset, the URL link dataset, and the script dataset, respectively.

7. The apparatus for constructing a mail classification model according to claim 5, wherein the classification model is a deep learning model.

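8. The apparatus for constructing a mail classification model according to claim 5, wherein the corresponding classification model comprises:

the classification models of the text data set, the URL link data set, the script data set and the image data set are a CNN model, an RNN model, an LSTM model and a CNN model respectively.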

9. A terminal device for constructing a mail classification model, characterized in that it comprises a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the memory being coupled to the processor, and the processor implementing the method for constructing a mail classification model according to any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method of constructing a mail classification model according to any one of claims 1 to 4.