CN114499980A - Phishing mail detection method, device, equipment and storage medium - Google Patents


Info

Publication number
CN114499980A
Authority
CN
China
Prior art keywords
mail
training
classifier
data set
prediction
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111632166.1A
Other languages
Chinese (zh)
Inventor
黄章镕
范渊
刘博�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416: Event detection, e.g. attack signature detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148: Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/08: Annexed information, e.g. attachments
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441: Countermeasures against malicious traffic
    • H04L63/1483: Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Abstract

The invention discloses a phishing mail detection method, device, equipment and storage medium. The method comprises: acquiring a mail data set, extracting multi-dimensional features of the mails in the mail data set to obtain a feature data set, and taking part of the multi-dimensional features in the feature data set as a training set; training a classifier separately on the features of each dimension in the training set in a cross training-and-prediction manner, to obtain a plurality of base classifiers in one-to-one correspondence with the feature dimensions together with the prediction results of each base classifier on the training set, and training a further classifier on those prediction results to obtain a meta classifier; and inputting the feature of each dimension of a mail to be detected into the corresponding base classifier to obtain a plurality of sub-prediction results, and inputting the sub-prediction results into the meta classifier to obtain an overall detection result indicating whether the mail to be detected is a phishing mail. The method and the device can improve the robustness and generalization of mail detection, and thereby effectively improve the accuracy of mail detection.

Description

Phishing mail detection method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of information detection, in particular to a phishing mail detection method, a device, equipment and a storage medium.
Background
As one of the important infrastructures of the Internet, mail systems were designed early on around protocols such as SMTP (Simple Mail Transfer Protocol), POP3 (Post Office Protocol, Version 3) and IMAP (Internet Message Access Protocol). These protocols and services were insufficiently secure, which allowed spam to flood. DKIM (DomainKeys Identified Mail) and SPF (Sender Policy Framework) were designed to solve the problem of mail authorization and authentication, and alleviated the spam problem. With the development of security detection technology and security equipment, the defense capability of enterprise services and applications has been greatly enhanced; as a result, phishing mails, an attack means based on social engineering, are being adopted by more and more attackers for purposes such as intruding into computer systems and stealing sensitive data.
Phishing mails are typically mails carefully crafted by hackers to trick recipients into clicking malicious links in the mail or downloading malicious attachments. They are usually well disguised, so that recipients find it difficult to tell genuine from fake, and they are also strongly enticing. How to provide a technical scheme capable of detecting phishing mails is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a phishing mail detection method, a phishing mail detection device, phishing mail detection equipment and a phishing mail detection storage medium, which can improve the robustness and the generalization of mail detection and further effectively improve the accuracy of mail detection.
In order to achieve the above purpose, the invention provides the following technical scheme:
a phishing mail detection method comprising:
acquiring a mail data set, extracting multi-dimensional features of each mail in the mail data set to obtain a feature data set containing those multi-dimensional features, and taking part of the multi-dimensional features contained in the feature data set as a training set;
training a classifier separately on the features of each dimension in the training set in a cross training-and-prediction manner, to obtain a plurality of base classifiers in one-to-one correspondence with the feature dimensions and the prediction results of each base classifier on the training set, and training a classifier on those prediction results to obtain a meta classifier;
and inputting the feature of each dimension of the multi-dimensional features of a mail to be detected into the corresponding base classifier to obtain a plurality of sub-prediction results, and inputting the sub-prediction results into the meta classifier to obtain an overall detection result indicating whether the mail to be detected is a phishing mail.
Preferably, extracting the multi-dimensional features of each mail in the mail data set includes:
parsing each mail in the mail data set to extract the field data it contains, and extracting the domain name features, link features, mail text features and mail attachment features of each mail from the field data as the corresponding multi-dimensional features.
Preferably, after extracting the multi-dimensional features of each mail in the mail data set, the method further includes:
performing missing value filling on the extracted multi-dimensional features, and standardizing the multi-dimensional features after the missing value filling.
Preferably, after each base classifier and the meta classifier are obtained by training, the method further includes:
tuning the parameters of each base classifier and of the meta classifier by cross validation.
Preferably, after tuning the parameters of each base classifier and of the meta classifier by cross validation, the method further includes:
training each base classifier on the training set, and training the meta classifier on the prediction results obtained when each base classifier is trained on the training set.
Preferably, obtaining a training set based on the feature data set includes:
dividing the feature data set into a training set and a test set;
correspondingly, after training the meta classifier on the prediction results obtained when each base classifier is trained on the training set, the method further includes:
testing each base classifier on the test set, and testing the meta classifier on the prediction results that the base classifiers produce on the test set, to obtain the corresponding accuracy and false alarm rate; if the accuracy and the false alarm rate meet the requirements, determining that training of the meta classifier is complete, and otherwise outputting a corresponding error prompt.
Preferably, the domain name features include the domain name information in the sender, the recipient and all links of the corresponding mail; the link features include the character strings of all links in the corresponding mail; the mail text features include the title and body content of the corresponding mail; and the mail attachment features include the network behavior information of the attachments of the corresponding mail at run time.
A phishing mail detection device comprising:
an extraction module configured to: acquire a mail data set, extract multi-dimensional features of each mail in the mail data set to obtain a feature data set containing those multi-dimensional features, and take part of the multi-dimensional features contained in the feature data set as a training set;
a training module configured to: train a classifier separately on the features of each dimension in the training set in a cross training-and-prediction manner, to obtain a plurality of base classifiers in one-to-one correspondence with the feature dimensions and the prediction results of each base classifier on the training set, and train a classifier on those prediction results to obtain a meta classifier;
a detection module configured to: input the feature of each dimension of the multi-dimensional features of a mail to be detected into the corresponding base classifier to obtain a plurality of sub-prediction results, and input the sub-prediction results into the meta classifier to obtain an overall detection result indicating whether the mail to be detected is a phishing mail.
Phishing mail detection equipment comprising:
a memory for storing a computer program;
a processor for implementing the steps of the phishing mail detection method described in any one of the above when executing the computer program.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the phishing mail detection method described in any one of the above.
The invention provides a phishing mail detection method, device, equipment and storage medium. The method comprises: acquiring a mail data set, extracting multi-dimensional features of each mail in the mail data set to obtain a feature data set containing those multi-dimensional features, and taking part of the multi-dimensional features contained in the feature data set as a training set; training a classifier separately on the features of each dimension in the training set in a cross training-and-prediction manner, to obtain a plurality of base classifiers in one-to-one correspondence with the feature dimensions and the prediction results of each base classifier on the training set, and training a classifier on those prediction results to obtain a meta classifier; and inputting the feature of each dimension of the multi-dimensional features of a mail to be detected into the corresponding base classifier to obtain a plurality of sub-prediction results, and inputting the sub-prediction results into the meta classifier to obtain an overall detection result indicating whether the mail to be detected is a phishing mail.
In other words, one base classifier is trained for each feature dimension, a meta classifier is trained on the predictions that the base classifiers make on the training set, and any mail is then checked by feeding each dimension of its features to the corresponding base classifier and feeding the resulting sub-prediction results to the meta classifier. The application therefore models the features of different dimensions separately and, by adopting an ensemble learning method that combines the results of all the base classifiers, improves the robustness and generalization of mail detection, thereby effectively improving the accuracy of mail detection.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a phishing mail detection method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a phishing mail detection device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Current phishing mails are disguised mainly in the following four ways:
1. Official images and text, or even an entire official website, are copied to imitate a trusted mail, and the recipient is induced to submit account information, personal private data and the like, or to make a payment on a false pretext;
2. The mail title is made highly attention-grabbing, often something very attractive or frightening, to draw the recipient's attention;
3. The sender and link addresses in the mail are not trustworthy either: hackers often forge trusted addresses so that the recipient believes the mail is an official mail from a trusted user or website;
4. The recipient is induced to click malicious pictures and attachments: the attacker disguises content as a picture that cannot be displayed, or uses enticing language, to trick the recipient into clicking the picture or downloading the attachment, thereby stealing sensitive information or taking control of the mailbox.
Carefully disguised phishing mails are difficult to characterize with features of a single dimension, and vectorized results lose part of the information. In view of this, the application provides a detection model based on ensemble learning: feature information is extracted along multiple dimensions, and the results of different machine learning models (base classifiers and a meta classifier) are fused to improve the generalization performance and robustness of the detection model, thereby improving the accuracy of phishing mail detection. The phishing mail detection scheme provided by the present application is described in detail below.
Referring to fig. 1, a flowchart of a phishing mail detection method according to an embodiment of the present invention is shown, which may include:
S11: acquiring a mail data set, extracting multi-dimensional features of each mail in the mail data set to obtain a feature data set containing those multi-dimensional features, and taking part of the multi-dimensional features contained in the feature data set as a training set.
In the embodiment of the application, a mail data set is first prepared. The mail data set contains a plurality of labeled mails, the label of a mail indicating whether that mail is a phishing mail; the data format of the mails includes but is not limited to EML, JSON or TXT files. The mail files stored in the mail data set are then parsed, the data of each field contained in each mail is extracted, and issues such as differing character encodings and spaces and line breaks inside strings are handled. The data content extracted from each mail is then formatted, stored and output by feature dimension, and multi-dimensional features are extracted from it as the data for model processing, yielding a feature data set that contains the multi-dimensional features and the label of each mail. Part of the multi-dimensional features in the feature data set, together with the labels of the mails to which they belong, form the training set. The multi-dimensional features can be set according to actual needs, and include but are not limited to domain name features, link features, mail text features and mail attachment features.
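The parsing step for an EML-format mail might look roughly like the sketch below, which uses only Python's standard `email` package. The sample message, the `parse_mail` helper, the URL regular expression and the field names of the returned record are illustrative assumptions; the source does not specify a parser or a record layout.

```python
import re
from email import message_from_string
from email.policy import default

# Hypothetical sample mail; a real data set would read .eml files from disk.
RAW_EML = """\
From: Support <support@examp1e-bank.com>
To: alice@example.org
Subject: Urgent: verify your account
Content-Type: text/plain; charset="utf-8"

Dear user,
please confirm your password at http://examp1e-bank.com/login
within 24 hours.
"""

# Simplified URL pattern (assumption): good enough for a sketch, not for production.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def parse_mail(raw: str, label: int) -> dict:
    """Extract the fields later used for feature extraction, plus the label."""
    msg = message_from_string(raw, policy=default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    # Collapse spaces and line breaks inside strings, as the text describes.
    body = " ".join(body.split())
    return {
        "sender": str(msg["From"] or ""),
        "recipient": str(msg["To"] or ""),
        "subject": " ".join(str(msg["Subject"] or "").split()),
        "body": body,
        "links": URL_RE.findall(body),
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
        "label": label,  # 1 = phishing, 0 = legitimate (assumed encoding)
    }


record = parse_mail(RAW_EML, label=1)
print(record["links"])
```

Each record then carries the raw material for the four feature dimensions together with its label.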
S12: training a classifier separately on the features of each dimension in the training set in a cross training-and-prediction manner, to obtain a plurality of base classifiers in one-to-one correspondence with the feature dimensions and the prediction results of each base classifier on the training set, and training a classifier on those prediction results to obtain a meta classifier.
After the training set is obtained, a classifier is trained separately on the features of each dimension and the corresponding labels in the training set, in a cross training-and-prediction manner, to obtain the base classifier for each feature dimension and the prediction results of each base classifier on the training set. A classifier is then trained on these prediction results and the labels of the corresponding mails, to obtain a meta classifier that combines the prediction results of the base classifiers into the final prediction result. Specifically, in the embodiment of the application, the training set may be divided into K subsets, and the training-and-prediction process is performed K times in an iterative loop to obtain the base classifiers and all their prediction results on the training set. A single training-and-prediction pass may include: taking K-1 subsets of the training set as the training subsets and the remaining subset as the prediction subset; training a classifier on the features of each dimension and the corresponding labels in the training subsets, to obtain the base classifier for each feature dimension; and then predicting on the prediction subset with each base classifier, to obtain a prediction result for each multi-dimensional feature in the prediction subset. Over the K passes, each of the 1st to K-th subsets serves once as the prediction subset, so that predictions of each base classifier are obtained for all multi-dimensional features of the training set.
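The K-pass training-and-prediction loop can be sketched for one feature dimension as follows. The nearest-centroid classifier is only a stand-in chosen to keep the example dependency-free; the source does not name a model, and the fold assignment by index modulo K, the function names and the toy data are likewise assumptions.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Predictor = Callable[[Vector], int]


def train_nearest_centroid(X: List[Vector], y: List[int]) -> Predictor:
    """Stand-in base classifier: predict the class whose centroid is closer."""
    def centroid(label: int) -> List[float]:
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def predict(x: Vector) -> int:
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1 if d1 < d0 else 0

    return predict


def out_of_fold_predictions(X: List[Vector], y: List[int], k: int,
                            train: Callable[[List[Vector], List[int]], Predictor]) -> List[int]:
    """Train on K-1 subsets, predict the held-out subset, repeat K times."""
    n = len(X)
    preds = [0] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        kept = [i for i in range(n) if i % k != fold]
        model = train([X[i] for i in kept], [y[i] for i in kept])
        for i in held_out:
            preds[i] = model(X[i])  # each sample is predicted exactly once
    return preds


# Toy one-dimensional feature column for a single dimension (hypothetical values).
X_domain = [[0.0], [5.0], [1.0], [6.0], [0.5], [5.5]]
labels = [0, 1, 0, 1, 0, 1]
e1_pred = out_of_fold_predictions(X_domain, labels, k=3, train=train_nearest_centroid)
print(e1_pred)
```

Running the same loop once per feature dimension yields one prediction column per base classifier; because every prediction comes from a model that did not see that sample during training, the meta classifier is trained on less optimistic inputs.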
S13: inputting the feature of each dimension of the multi-dimensional features of a mail to be detected into the corresponding base classifier to obtain a plurality of sub-prediction results, and inputting the sub-prediction results into the meta classifier to obtain an overall detection result indicating whether the mail to be detected is a phishing mail.
When it needs to be detected whether a given mail is a phishing mail, that mail is taken as the mail to be detected and its multi-dimensional features are extracted (in the same way as the multi-dimensional features of any mail in the mail data set). The feature of each dimension is input into the corresponding base classifier, and the output of each base classifier is one sub-prediction result. The sub-prediction results output by the base classifiers are then combined into a new feature vector, this vector is input into the meta classifier, and the meta classifier's output on whether the mail to be detected is a phishing mail is taken as the overall detection result.
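A minimal sketch of this detection flow is given below. The dimension names, the `detect` function, and the threshold rules standing in for trained base and meta classifiers are all hypothetical; in practice the classifiers would come from the training step.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Fixed order of feature dimensions (names are assumptions for this sketch).
DIMENSIONS = ("domain", "link", "text", "attachment")


def detect(features: Dict[str, Sequence[float]],
           base_classifiers: Dict[str, Callable[[Sequence[float]], int]],
           meta_classifier: Callable[[List[int]], int]) -> Tuple[List[int], int]:
    """Run each base classifier on its own dimension, then the meta classifier."""
    sub_results = [base_classifiers[d](features[d]) for d in DIMENSIONS]
    return sub_results, meta_classifier(sub_results)


# Stand-in "trained" models: simple threshold rules, purely illustrative.
base_classifiers = {
    "domain": lambda v: 1 if v[0] > 0 else 0,
    "link": lambda v: 1 if v[0] >= 50 else 0,
    "text": lambda v: 1 if v[0] >= 2 else 0,
    "attachment": lambda v: 1 if v[0] > 0 else 0,
}
meta_classifier = lambda subs: 1 if sum(subs) >= 2 else 0

mail_features = {"domain": [1.0], "link": [10.0], "text": [3.0], "attachment": [0.0]}
sub_results, verdict = detect(mail_features, base_classifiers, meta_classifier)
print(sub_results, verdict)
```

Keeping the dimension order fixed matters: the meta classifier must see the sub-prediction results in the same order at detection time as during its training.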
In the phishing mail detection method provided by the embodiment of the invention, extracting the multi-dimensional features of each mail in the mail data set may include:
parsing each mail in the mail data set to extract the field data it contains, and extracting the domain name features, link features, mail text features and mail attachment features of each mail from the field data as the corresponding multi-dimensional features; the domain name features include the domain name information in the sender, the recipient and all links of the corresponding mail, the link features include the character strings of all links in the corresponding mail, the mail text features include the title and body content of the corresponding mail, and the mail attachment features include the network behavior information of the attachments of the corresponding mail at run time.
When the multi-dimensional feature extraction of the mail is realized, the following four-dimensional features can be specifically extracted:
C1, extracting domain name features of the mail: the domain name information in the sender, the recipient and all links of the mail is extracted and vectorized. Domain name features include but are not limited to: the number of domain names appearing in the mail, the number of domain names that hit an IoC (Indicator of Compromise) list, the minimum Hamming distance between the domain names and common domain names, the number of domain names that hit the Alexa whitelist, the number of domain names that miss the Alexa whitelist, and the like;
C2, extracting link features of the mail: the character strings of all links in the mail are extracted and vectorized. Link features include but are not limited to: the number of resource types represented in the links and the number of links of each resource type, the score returned when the resource behind a URL is queried on the VirusTotal website, the number of associated IP addresses, and the like; VirusTotal is a well-known threat intelligence query website;
C3, extracting mail text features (mail title and body features): the title and body content of the mail are extracted, Chinese content is segmented with Jieba and English content is split on spaces, and features of the title and body are extracted from the segmentation result and converted into a vector representation. Mail text features include but are not limited to: the importance of each word computed with the TF-IDF (term frequency-inverse document frequency) algorithm, the number of inductive keywords, the number of threat keywords, and the like; Jieba is a well-known Chinese word segmentation tool, and the inductive keywords and threat keywords are keyword sets predefined based on expert knowledge;
C4, extracting mail attachment features: the attachments of the mail are run in a sandbox to obtain their network behavior at run time, and this behavior is vectorized. Mail attachment features include but are not limited to: the number of domain names requested by the attachment at run time that hit IoC, the number of URLs requested by the attachment at run time that hit IoC, the VirusTotal score of the URLs requested by the attachment at run time, and the like.
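The domain name and link features in C1 and C2 could be vectorized along the lines of the sketch below. Everything concrete here is an assumption made for illustration: the IoC set, the whitelist, the list of common domains, the precomputed `scores` mapping (standing in for a threat intelligence lookup, which is not called here), the extension groups, and the handling of unequal-length strings in the Hamming distance, which the source does not define.

```python
import posixpath
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def hamming(a: str, b: str) -> int:
    """Mismatching positions; a length difference counts as mismatches (assumption)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def domain_features(sender: str, recipient: str, links: Iterable[str],
                    ioc_domains: set, whitelist: set, common_domains: List[str]) -> List[int]:
    domains = {sender.rsplit("@", 1)[-1].lower(), recipient.rsplit("@", 1)[-1].lower()}
    domains |= {urlparse(u).hostname or "" for u in links}
    domains.discard("")
    return [
        len(domains),                                   # number of distinct domain names
        sum(d in ioc_domains for d in domains),         # IoC hits
        min((hamming(d, c) for d in domains for c in common_domains), default=0),
        sum(d in whitelist for d in domains),           # whitelist hits
        sum(d not in whitelist for d in domains),       # whitelist misses
    ]


def link_features(links: List[str], scores: Dict[str, int]) -> List[int]:
    exts = [posixpath.splitext(urlparse(u).path)[1].lower() for u in links]
    return [
        len(links),
        sum(e in {".exe", ".zip", ".js"} for e in exts),   # executable/archive links
        sum(e in {".png", ".jpg", ".gif"} for e in exts),  # image links
        max((scores.get(u, 0) for u in links), default=0), # worst reputation score
    ]


links = ["http://examp1e-bank.com/login", "https://cdn.example.net/a.png"]
dom_vec = domain_features("support@examp1e-bank.com", "alice@example.org", links,
                          ioc_domains={"examp1e-bank.com"},
                          whitelist={"example.org", "example-bank.com"},
                          common_domains=["example-bank.com"])
link_vec = link_features(links, scores={"http://examp1e-bank.com/login": 12})
print(dom_vec, link_vec)
```

In this toy input the look-alike domain differs from the common domain in a single character, so the minimum distance is small, which is the kind of signal the Hamming-distance feature is meant to capture. The attachment features in C4 are counts of the same kind, computed over the domains and URLs observed in the sandbox.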
The multi-dimensional feature extraction process thus characterizes a mail from several aspects to obtain information that is as comprehensive as possible, which addresses the problem that a single-dimension detection model, because its feature information is incomplete, shows insufficient robustness and generalization once deployed online.
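For the mail text features in C3, a small TF-IDF and keyword-count sketch is shown below. It uses whitespace tokenization only (Chinese text would first need a segmenter such as Jieba, which is not used here), the plain `tf * log(N / df)` weighting, which is one of several common TF-IDF variants, and a hypothetical keyword set.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lower-case and split on whitespace (punctuation is left attached)."""
    return text.lower().split()


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Per-document word weights: term frequency times log(N / document frequency)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({tok: (c / len(toks)) * math.log(n_docs / doc_freq[tok])
                        for tok, c in counts.items()})
    return weights


def keyword_count(text: str, keywords: Set[str]) -> int:
    """Number of tokens that belong to a predefined keyword set."""
    return sum(tok in keywords for tok in tokenize(text))


docs = ["verify your account now", "meeting notes for your team"]
weights = tf_idf(docs)
inductive_hits = keyword_count(docs[0], {"verify", "urgent"})  # hypothetical keyword set
print(weights[0], inductive_hits)
```

A word that appears in every document, such as "your" here, receives weight zero, while words specific to one mail are weighted higher.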
When the multi-dimensional features comprise features of four dimensions, namely the domain name features, link features, mail text features and mail attachment features, the process of obtaining the corresponding four base classifiers and one meta classifier may include:
1. Base classifier training: the features of each dimension are learned with a separate machine learning model:
constructing a machine learning model based on the domain name features for the domain name features in C1;
constructing a machine learning model based on the link characteristics for the link characteristics in C2;
constructing a machine learning model based on the mail text features for the mail text features in C3;
for the mail attachment features in C4, a machine learning model based on the mail attachment features is constructed.
The method for training the base classifier based on the features of different dimensions is realized in the same principle, firstly, a training set is divided into K parts, and the training and predicting process of each round of base classifier is specifically divided into the following steps: initializing a weight parameter of a base classifier; selecting K-1 parts of training sets as training subsets of the base classifier, using the rest parts of training sets as prediction subsets, and training the base classifier on the data; and training and predicting the trained base classifier on the prediction subset. Performing K times of training prediction process in an iterative loop to obtain all prediction results of the base classifier on a training set; the results predicted by the base classifiers on the training set are respectively recorded as E1_ Pred, E2_ Pred, E3_ Pred and E4_ Pred.
It should be noted that, in the model training of the embodiment of the application, a separate machine learning model is constructed for the features of each dimension. Each model can therefore learn a different intrinsic pattern of the data, and it subsequently predicts whether a mail to be detected is a phishing mail based on the pattern it has learned.
2. Base classifier result fusion and meta classifier training: the prediction results of the four base classifiers, E1_Pred, E2_Pred, E3_Pred and E4_Pred, are combined into a new feature vector, and a machine learning model, called the meta classifier, is constructed on this new feature vector. Ways of combining the four prediction results include, but are not limited to, majority voting, taking the maximum, taking the minimum, and averaging.
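The cross training-and-prediction procedure and the fusion step described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation of the application: `CentroidClassifier`, `out_of_fold_predictions`, `fuse`, and `toy_features` are hypothetical names, the nearest-centroid model merely stands in for whichever machine learning model is chosen for a dimension, and the feature values are invented.

```python
from statistics import mean


class CentroidClassifier:
    """Toy stand-in for a base classifier (assumption): nearest class centroid."""

    def fit(self, rows, labels):
        self.centroids = {}
        for cls in (0, 1):
            members = [r for r, y in zip(rows, labels) if y == cls]
            # Column-wise mean of the rows that belong to this class.
            self.centroids[cls] = [mean(col) for col in zip(*members)]
        return self

    def predict_proba(self, row):
        d0 = sum((a - b) ** 2 for a, b in zip(row, self.centroids[0])) ** 0.5
        d1 = sum((a - b) ** 2 for a, b in zip(row, self.centroids[1])) ** 0.5
        # The closer the row is to the phishing centroid (class 1), the higher the score.
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def out_of_fold_predictions(rows, labels, k=5):
    """Train on K-1 parts, predict the held-out part, and repeat K times."""
    preds = [None] * len(rows)
    for fold in range(k):
        held_out = [i for i in range(len(rows)) if i % k == fold]
        train = [i for i in range(len(rows)) if i % k != fold]
        # A fresh model per round corresponds to re-initializing the parameters.
        model = CentroidClassifier().fit(
            [rows[i] for i in train], [labels[i] for i in train]
        )
        for i in held_out:
            preds[i] = model.predict_proba(rows[i])
    return preds


def fuse(sub_predictions, method="vector"):
    """Combine the base-classifier outputs for one mail."""
    if method == "vector":
        return list(sub_predictions)  # new feature vector for the meta classifier
    if method == "majority":
        votes = sum(p >= 0.5 for p in sub_predictions)
        return 1.0 if votes * 2 > len(sub_predictions) else 0.0
    if method == "max":
        return max(sub_predictions)
    if method == "min":
        return min(sub_predictions)
    if method == "mean":
        return mean(sub_predictions)
    raise ValueError(f"unknown fusion method: {method}")


labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # 1 = phishing (invented data)


def toy_features(jitter):
    # Two invented features per mail for one dimension.
    return [[y + jitter * ((i % 3) - 1), 1.0 - y] for i, y in enumerate(labels)]


oof = {
    name: out_of_fold_predictions(toy_features(jitter), labels)
    for name, jitter in [("E1_Pred", 0.1), ("E2_Pred", 0.2),
                         ("E3_Pred", 0.3), ("E4_Pred", 0.4)]
}
# One 4-element vector per mail; these are the meta classifier's training inputs.
meta_features = [fuse(column) for column in zip(*oof.values())]
```

Because every prediction comes from a model that never saw that sample during training, the meta classifier is trained on outputs that resemble what the base classifiers will produce for unseen mails.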
After extracting the multi-dimensional features of each mail in the mail data set, the phishing mail detection method provided by the embodiment of the invention may further include:
performing missing value filling on the extracted multi-dimensional features of the mails in the mail data set, and standardizing the multi-dimensional features after missing value filling.
To further ensure that the model learns the information characterizing the mails more fully and accurately, in the embodiment of the application the extracted multi-dimensional features are preprocessed after extraction and before they are used for model training. The data preprocessing may include: handling missing values in the multi-dimensional features, using methods that include but are not limited to constant filling, maximum filling, minimum filling, or mean filling; and, because different features are measured on different scales, standardizing the features.
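The two preprocessing steps can be illustrated with a short sketch. The function names `fill_missing` and `standardize` are hypothetical, the feature column is invented, and z-score scaling is an assumption: the text only says that the features are standardized, without naming a formula.

```python
from statistics import mean, pstdev


def fill_missing(column, strategy="mean", constant=0.0):
    """Replace None entries using constant, max, min, or mean filling."""
    present = [v for v in column if v is not None]
    if strategy == "constant" or not present:
        fill = constant
    elif strategy == "max":
        fill = max(present)
    elif strategy == "min":
        fill = min(present)
    elif strategy == "mean":
        fill = mean(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


def standardize(column):
    """Z-score scaling (assumed): subtract the mean, divide by the standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(v - mu) / sigma for v in column]


# Invented example: link length in characters, missing for the second mail.
link_lengths = [20, None, 80, 50]
filled = fill_missing(link_lengths, strategy="mean")
scaled = standardize(filled)
```

Filling before scaling matters here: the mean and standard deviation used for standardization are computed on the completed column.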
After each base classifier and the meta classifier are obtained through training, the phishing mail detection method provided by the embodiment of the invention may further include:
performing parameter tuning on each base classifier and the meta classifier by means of cross-validation.
It should be noted that, after the base classifiers and the meta classifier are obtained through training, their model parameters can be tuned by leave-one-out cross-validation, K-fold cross-validation, repeated K-fold cross-validation, and the like. The embodiment of the application preferably uses K-fold cross-validation, specifically: K-fold cross-validation is applied to each of the four trained base classifiers, and for each of them the group of parameters with the best detection performance is selected as its final parameters; K-fold cross-validation is likewise applied to the meta classifier, and the group of parameters with the best detection performance is selected as the final parameters of the meta classifier.
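A minimal sketch of parameter selection by K-fold cross-validation follows. It assumes a one-feature k-nearest-neighbor model as a stand-in classifier and accuracy as the measure of "detection performance"; neither choice comes from the application. The names `knn_predict`, `cross_val_accuracy`, and `tune`, and the data, are hypothetical.

```python
def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:n_neighbors]
    votes = sum(y for _, y in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def cross_val_accuracy(xs, ys, n_neighbors, k=5):
    """Average accuracy over K rounds, each holding out a different part."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        held_out = [i for i in range(len(xs)) if i % k == fold]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct += sum(
            knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
            for i in held_out
        )
    return correct / len(xs)


def tune(xs, ys, candidates, k=5):
    """Score every candidate parameter and keep the best-performing one."""
    scores = {c: cross_val_accuracy(xs, ys, c, k) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Invented one-dimensional feature; 1 = phishing.
xs = [0.10, 0.90, 0.20, 0.80, 0.15, 0.85, 0.30, 0.70, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best, scores = tune(xs, ys, candidates=[1, 3, 9])
```

The same loop would be run once per base classifier and once for the meta classifier, each with its own candidate parameter groups.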
After performing parameter tuning on each base classifier and the meta classifier by means of cross-validation, the phishing mail detection method provided by the embodiment of the invention may further include: training each base classifier on the training set, and training the meta classifier based on the prediction results obtained when each base classifier is trained on the training set.
Obtaining a training set based on the feature data set, which may include: the feature data set is divided into a training set and a testing set.
After training the meta classifier based on the prediction results obtained when each base classifier is trained on the training set, the method may further include: testing each base classifier on the test set, and testing the meta classifier on the prediction results that the base classifiers produce on the test set, to obtain the corresponding accuracy and false alarm rate; if the accuracy and the false alarm rate meet the requirements, determining that training of the meta classifier is finished, and otherwise outputting a corresponding error prompt.
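The acceptance check described above can be sketched as follows. Two points are assumptions rather than statements of the application: the false alarm rate is taken to be the share of legitimate mails flagged as phishing, FP / (FP + TN), and the thresholds `min_accuracy` and `max_false_alarm_rate` are placeholder values. The function names and labels are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, false_alarm_rate) for binary labels, 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    # Assumed definition: fraction of legitimate mails wrongly flagged.
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, false_alarm_rate


def check_requirements(y_true, y_pred, min_accuracy=0.95, max_false_alarm_rate=0.05):
    """Return (ok, message); the thresholds are illustrative placeholders."""
    accuracy, far = evaluate(y_true, y_pred)
    if accuracy >= min_accuracy and far <= max_false_alarm_rate:
        return True, "training finished"
    return False, (
        f"requirements not met: accuracy={accuracy:.2f}, "
        f"false alarm rate={far:.2f}"
    )


# Invented test-set labels and meta classifier outputs.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
accuracy, false_alarm_rate = evaluate(y_true, y_pred)
ok, message = check_requirements(y_true, y_pred)
```

A failed check returns the message that would serve as the error prompt, which a caller could use to trigger retraining.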
After the feature data set is obtained, it is divided into a training set and a test set, so that the four base classifiers and the meta classifier are trained on the training set and tested on the test set. Specifically, after the parameters of the four base classifiers and the meta classifier have been tuned, each base classifier performs training and prediction on all data in the training set, and the meta classifier is trained on the resulting predictions. Each base classifier then predicts on the test set, and the meta classifier predicts again on those test predictions, which yields the final mail detection accuracy and false alarm rate. When the accuracy and the false alarm rate meet the requirements, the base classifiers and the meta classifier are deemed successfully obtained; otherwise a corresponding error prompt is output, so that model training is triggered again either manually or automatically. These steps realize training and verification of the optimal model and further ensure that the performance of the model meets the requirements.
To overcome the insufficient robustness and generalization of existing phishing mail detection technology, the application provides a technical scheme for phishing mail detection based on ensemble learning. In a specific implementation, the technical scheme may include:
A) Preparing a data set: preparing a labeled mail data set, whose data format includes but is not limited to EML, JSON, or TXT files, and dividing the data set into a training set and a test set.
B) Parsing data:
B1. parsing the files that store the mails, extracting the data of each field, and handling character encodings, spaces within strings, and line breaks;
B2. classifying and storing the mail content by domain name, link, title, body, and attachment file path.
C) Extracting multi-dimensional features:
C1. extracting the domain name features of the mail;
C2. extracting the link features of the mail;
C3. extracting the mail text features of the mail;
C4. extracting the mail attachment features of the mail.
D) Data preprocessing:
D1. handling missing values in the extracted multi-dimensional features;
D2. standardizing the features, which are measured on different scales.
E) Base classifier training: the features of each dimension are learned with a different machine learning model:
E1. constructing a perceptron model based on the domain name features for the domain name features in C1;
E2. constructing a support vector machine model based on the link features for the link features in C2;
E3. constructing a random forest model based on the mail text features for the mail text features in C3;
E4. constructing an AdaBoost model based on the mail attachment features for the mail attachment features in C4;
E5. the base classifiers in E1-E4 are trained in the same way: the training set is divided into five parts, and each round of training and prediction consists of the following steps:
E5.1. initializing the weight parameters of the base classifier;
E5.2. selecting four parts as the training subset of the base classifier, and training the base classifier on them;
E5.3. predicting on the remaining part of the training set with the trained base classifier;
E5.4. repeating the training-and-prediction round five times to obtain the prediction results of the base classifier on the whole training set;
E6. the results predicted on the training set by the base classifiers of E1-E4 are recorded as E1_Pred, E2_Pred, E3_Pred and E4_Pred, respectively.
F) Base classifier result fusion and meta classifier training: combining the prediction results E1_Pred, E2_Pred, E3_Pred and E4_Pred of the four base classifiers into a new feature vector, and constructing a neural network model, called the meta classifier, on the new feature vector.
G) Parameter optimization: optimizing the model parameters of each base classifier and the meta classifier by means of cross-validation.
H) Training and verifying the optimal model: training all the optimal base classifiers on all the training data, and testing performance on the test set to obtain the accuracy and false alarm rate of the final model.
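Step B of the procedure above (parsing a stored mail and sorting its content by dimension) can be sketched with Python's standard `email` package. This is an illustration under assumptions, not the parser of the application: the function name `parse_mail`, the URL regular expression, the output layout, and the sample message are all hypothetical.

```python
import re
from email import message_from_string, policy
from email.utils import parseaddr
from urllib.parse import urlparse

# Simplified URL pattern (assumption); real mails may need a stricter matcher.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def parse_mail(raw):
    """Split one EML-style message into the dimensions named in step B2."""
    msg = message_from_string(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    links = URL_PATTERN.findall(body)
    sender = parseaddr(str(msg.get("From", "")))[1]
    receiver = parseaddr(str(msg.get("To", "")))[1]
    return {
        "domains": {
            "sender": sender.rpartition("@")[2].lower(),
            "receiver": receiver.rpartition("@")[2].lower(),
            "links": [urlparse(u).hostname or "" for u in links],
        },
        "links": links,
        # Collapse repeated spaces and line breaks, as in step B1.
        "title": " ".join(str(msg.get("Subject", "")).split()),
        "body": " ".join(body.split()),
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


RAW = (
    "From: Support <support@paypa1-secure.example>\n"
    "To: alice@corp.example\n"
    "Subject: Verify   your account\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Please confirm at http://login.paypa1-secure.example/verify?id=1\n"
    "within 24 hours.\n"
)
parsed = parse_mail(RAW)
```

The resulting dictionary keeps the raw material for each dimension separate, so that the feature extraction in step C can work on one dimension at a time.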
In summary, the method extracts the multi-dimensional features of the mails through feature engineering and uses the features of each dimension to train a separate base classifier; each base classifier is trained on the features of only one dimension and predicts on the training set in a five-fold cross manner. The results predicted by the four base classifiers on the training set are then used as new features to train the meta classifier, which finally predicts whether a mail is a phishing mail. Lastly, the classifier parameters with the best detection performance are selected by K-fold cross-validation.
Here, ensemble learning refers to using multiple learning algorithms, in statistics and machine learning, to obtain better prediction performance than any of the individual learning algorithms alone; feature engineering is the process of using domain knowledge to transform raw data into features that better represent the nature of the problem; and a phishing mail is a malicious mail that uses a disguised e-mail to deceive the recipient into replying with information such as account numbers and passwords to a specified party, or that leads the recipient to visit a malicious website and download a malicious attachment in order to intrude into the user's computer system.
The application can thus fuse the multi-dimensional information of the mails and extract the characteristic information of phishing mails more comprehensively, thereby improving the robustness, generalization, and accuracy of the detection model. That is, the technical scheme for phishing mail detection based on ensemble learning provided by the embodiment of the application has the following characteristics:
1. it solves the problems of incomplete information representation, poor robustness, and poor generalization in existing phishing mail detection schemes based on single-dimensional features;
2. multi-dimensional features are considered from different angles and comprehensively represent the original information of phishing mails, which solves the information loss of detection methods based on single-dimensional features;
3. the features of different dimensions are modeled separately, and an ensemble learning method combines the results of all base classifiers, which improves the robustness and generalization of the detection method and system.
An embodiment of the present invention further provides a phishing mail detection device, as shown in fig. 2, the device may include:
an extraction module 11, configured to: acquire a mail data set, extract multi-dimensional features of the mails in the mail data set to obtain a feature data set containing those features, and take part of the multi-dimensional features contained in the feature data set as a training set;
a training module 12, configured to: train, in a cross training-and-prediction manner, a classifier on the features of each dimension in the training set, to obtain a plurality of base classifiers in one-to-one correspondence with the dimensions together with the prediction result of each base classifier on the training set, and train a classifier on the prediction results to obtain a meta classifier;
a detection module 13, configured to: input the features of each dimension of the multi-dimensional features of a mail to be detected into the corresponding base classifier to obtain a plurality of sub-prediction results, and input the plurality of sub-prediction results into the meta classifier to obtain an overall detection result indicating whether the mail to be detected is a phishing mail.
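The data flow through the detection module can be sketched as below. All classifiers here are hand-written stand-ins rather than trained models, and every feature name, weight, and threshold is invented for illustration; only the two-stage structure (per-dimension sub-predictions, then a meta classifier) follows the text.

```python
import math

DIMENSIONS = ("domain", "link", "text", "attachment")


def detect(features_by_dimension, base_classifiers, meta_classifier):
    """Route each dimension to its own base classifier, then fuse the outputs."""
    sub_predictions = [
        base_classifiers[d](features_by_dimension[d]) for d in DIMENSIONS
    ]
    return meta_classifier(sub_predictions), sub_predictions


# Hypothetical stand-ins for the four trained base classifiers.
base_classifiers = {
    "domain": lambda f: 1.0 if f["sender_domain_age_days"] < 30 else 0.0,
    "link": lambda f: min(1.0, f["links_with_ip_host"] / 2),
    "text": lambda f: 1.0 if f["mentions_password"] else 0.0,
    "attachment": lambda f: 1.0 if f["contacts_unknown_host"] else 0.0,
}


def meta_classifier(sub_predictions):
    """Stand-in meta classifier: logistic score over the sub-predictions."""
    weights = [1.2, 0.8, 0.6, 1.5]  # invented values, not learned
    bias = -1.5
    z = bias + sum(w * p for w, p in zip(weights, sub_predictions))
    return 1 / (1 + math.exp(-z)) >= 0.5


suspicious_mail = {
    "domain": {"sender_domain_age_days": 3},
    "link": {"links_with_ip_host": 1},
    "text": {"mentions_password": True},
    "attachment": {"contacts_unknown_host": False},
}
is_phishing, sub_predictions = detect(suspicious_mail, base_classifiers, meta_classifier)
```

Keeping the base classifiers in a mapping keyed by dimension makes the one-to-one correspondence between dimensions and base classifiers explicit.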
An embodiment of the present invention further provides phishing mail detection equipment, which may include:
a memory for storing a computer program;
a processor for implementing the steps of any one of the above phishing mail detection methods when executing the computer program.
The embodiment of the invention also provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program can realize the steps of any one of the phishing mail detection methods.
It should be noted that for the description of the relevant parts in the phishing mail detection device, the equipment and the storage medium provided by the embodiment of the invention, reference is made to the detailed description of the corresponding parts in the phishing mail detection method provided by the embodiment of the invention, and the detailed description is omitted here. In addition, parts of the above technical solutions provided in the embodiments of the present invention that are consistent with the implementation principles of the corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A phishing mail detection method, comprising:
acquiring a mail data set, extracting multi-dimensional features of mails in the mail data set to obtain a feature data set containing the multi-dimensional features of the mails in the mail data set, and acquiring part of the multi-dimensional features contained in the feature data set as a training set;
training, in a cross training-and-prediction manner, a classifier on the features of each dimension in the training set, to obtain a plurality of base classifiers in one-to-one correspondence with the features of each dimension and a prediction result predicted by each base classifier on the training set, and training a classifier based on the prediction results to obtain a meta classifier;
and inputting the features of each dimension of the multi-dimensional features of a mail to be detected respectively into the corresponding base classifier to obtain a plurality of sub-prediction results, and inputting the plurality of sub-prediction results into the meta classifier to obtain an overall detection result of whether the mail to be detected is a phishing mail.
2. The method of claim 1, wherein extracting multidimensional features of each email in the email dataset comprises:
parsing each mail in the mail data set to extract the field data contained in each mail, and extracting, from the field data, the domain name features, link features, mail text features, and mail attachment features of each mail as the corresponding multi-dimensional features.
3. The method of claim 2, wherein after extracting the multidimensional feature of each email in the email dataset, further comprising:
and performing missing value filling processing on the multi-dimensional features of the mails in the extracted mail data set, and performing standardization processing on the multi-dimensional features subjected to the missing value filling processing.
4. The method of claim 3, wherein after each base classifier and the meta classifier are obtained through training, the method further comprises:
performing parameter tuning on each base classifier and the meta classifier by means of cross-validation.
5. The method of claim 4, wherein after performing parameter tuning on each of the base classifier and the meta classifier by means of cross validation, the method further comprises:
training each base classifier on the training set, and training the meta classifier based on a prediction result when each base classifier is trained on the training set.
6. The method of claim 5, wherein deriving a training set based on the feature dataset comprises:
dividing the characteristic data set into a training set and a testing set;
correspondingly, after training the meta classifier based on the prediction result of training each of the base classifiers on the training set, the method further includes:
testing each base classifier on the test set, and testing the meta classifier based on the prediction results obtained when each base classifier is tested on the test set, to obtain the corresponding accuracy and false alarm rate; if the accuracy and the false alarm rate meet the requirements, determining that training of the meta classifier is finished, and otherwise outputting a corresponding error prompt.
7. The method of claim 6, wherein the domain name characteristics comprise domain name information of a sender, a receiver and all links in the corresponding mail, the link characteristics comprise character strings of all links in the corresponding mail, the mail text characteristics comprise mail titles and mail body contents in the corresponding mail, and the mail attachment characteristics comprise network behavior information of attachments of the corresponding mail during operation.
8. A phishing mail detection device, comprising:
an extraction module, configured to: acquire a mail data set, extract multi-dimensional features of mails in the mail data set to obtain a feature data set containing the multi-dimensional features of the mails in the mail data set, and take part of the multi-dimensional features contained in the feature data set as a training set;
a training module, configured to: train, in a cross training-and-prediction manner, a classifier on the features of each dimension in the training set, to obtain a plurality of base classifiers in one-to-one correspondence with the features of each dimension and a prediction result predicted by each base classifier on the training set, and train a classifier based on the prediction results to obtain a meta classifier;
a detection module, configured to: input the features of each dimension of the multi-dimensional features of a mail to be detected respectively into the corresponding base classifier to obtain a plurality of sub-prediction results, and input the plurality of sub-prediction results into the meta classifier to obtain an overall detection result of whether the mail to be detected is a phishing mail.
9. Phishing mail detection equipment, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the phishing mail detection method as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the phishing mail detection method of any one of claims 1 to 7.
CN202111632166.1A 2021-12-28 2021-12-28 Phishing mail detection method, device, equipment and storage medium Withdrawn CN114499980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111632166.1A CN114499980A (en) 2021-12-28 2021-12-28 Phishing mail detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111632166.1A CN114499980A (en) 2021-12-28 2021-12-28 Phishing mail detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114499980A true CN114499980A (en) 2022-05-13

Family

ID=81496641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111632166.1A Withdrawn CN114499980A (en) 2021-12-28 2021-12-28 Phishing mail detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114499980A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138913A (en) * 2015-07-24 2015-12-09 四川大学 Malware detection method based on multi-view ensemble learning
CN108965245A (en) * 2018-05-31 2018-12-07 国家计算机网络与信息安全管理中心 Detection method for phishing site and system based on the more disaggregated models of adaptive isomery

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116223962A (en) * 2023-05-08 2023-06-06 中科航迈数控软件(深圳)有限公司 Method, device, equipment and medium for predicting electromagnetic compatibility of wire harness
CN116223962B (en) * 2023-05-08 2023-07-07 中科航迈数控软件(深圳)有限公司 Method, device, equipment and medium for predicting electromagnetic compatibility of wire harness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220513
