CN111614543B

CN111614543B - URL-based spear phishing mail detection method and system

Info

Publication number: CN111614543B
Application number: CN202010279729.2A
Authority: CN
Inventors: 汪秋云; 姜政伟; 汪姝玮; 辛丽玲; 丁雄; 刘宝旭
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2020-04-10
Filing date: 2020-04-10
Publication date: 2021-09-14
Anticipated expiration: 2040-04-10
Also published as: CN111614543A

Abstract

The invention discloses a URL-based spear phishing mail detection method and system, relating to the fields of information security detection and cyberspace security. The method detects whether a mail body contains a URL link and selects the mails that do; extracts a feature vector of the URL links from each selected mail based on the mail history record; classifies the URL link feature vectors with a trained link classifier and selects the mails carrying malicious links; extracts the metadata of the mails carrying malicious links and, using the mail history record, extracts mail feature vectors from the metadata; and classifies the mail feature vectors with a trained spear classifier to detect URL-based spear phishing mails. The invention achieves a lower false alarm rate and a higher detection rate with the support of historical mails alone.

Description

URL-based spear phishing mail detection method and system

Technical Field

The invention relates to the fields of information security detection and cyberspace security, and in particular to a URL-based spear phishing mail detection method and system.

Background

With the rapid development of computers and the internet, electronic mail has become an indispensable part of people's daily life and work. However, the same convenience that e-mail offers to ordinary users is also available to attackers, who use social engineering to send phishing mails or spear phishing mails to users and employees in order to steal money or sensitive information.

A spear phishing mail is a highly targeted phishing mail: the attacker first collects information about the user, then carefully customizes the mail content and sends the mail to the recipient while impersonating a sender or a third-party service provider. Attackers usually implement spear phishing in one of two ways: sending mails with malicious attachments, or sending mails with phishing links. A malicious attachment usually exploits a 0-day or other vulnerability, and finding such vulnerabilities is time-consuming and labor-intensive, so the barrier to producing an attachment is high and demands considerable skill from the attacker. In addition, many current mail gateways inspect attachments closely, for example by executing them in a sandbox, which further lowers the success rate of attachment-based attacks. For link-based spear phishing, by contrast, the attacker only needs to build one phishing website; the cost and technical requirements are relatively low, the attack is easy to carry out, and existing detection techniques have difficulty finding spear phishing mails that contain carefully crafted links. Against this background, the present invention provides a detection scheme for URL-based spear phishing mails.

Currently, URL-based spear phishing mail detection schemes suffer from three main problems. First, the false alarm rate is too high: most detection methods have a false alarm rate of about 1%, and for the tens of thousands or hundreds of thousands of mails handled per day in practice, a 1% rate produces hundreds or even thousands of false alarms per day, which is unacceptable in practical use. Second, URL-based spear phishing mails cannot be detected accurately: the detection rate is low, about 80%, and since a single missed spear phishing mail can cause a company huge economic loss and other consequences, such a detection rate is unacceptable to companies and enterprises. Third, there is an academic method with a high detection rate and a very low false alarm rate, but it requires complete logs (such as NIDS and SMTP logs); most companies and organizations have no facility for recording such detailed and complete logs, so the method is not widely practical.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a spear phishing mail detection method and system based on URL, which can achieve lower false alarm rate and higher detection rate only by the support of historical mails.

In order to achieve the purpose, the invention adopts the technical scheme that:

a spear phishing mail detection method based on URL comprises the following steps:

detecting whether the mail body contains URL links or not, and selecting the mail containing the URL links;

extracting a feature vector of a URL link from a mail containing the URL link based on a mail history record;

classifying the characteristic vectors of the URL links by using a trained link classifier, and selecting the mails with malicious links;

extracting the metadata of the mails with malicious links, and extracting mail feature vectors from the metadata by using mail history records;

and classifying the mail feature vectors by using a trained spear classifier, and detecting the spear phishing mails based on the URL.

Further, extracting the feature vector of the URL link includes the following steps:

extracting URLs and corresponding domain names from the mail body, and calculating the number of unique URLs and the number of domain names after duplication removal;

inquiring the ranking of each domain name, and taking the lowest ranking as the global ranking of the mail links;

inquiring the registration date of each domain name, and taking the latest registration date as the registration date of the mail link;

inquiring the score of each URL and domain name analyzed as malicious, wherein the score is equal to the ratio of the number of the analysis engines for judging the URL or the domain name as malicious to the total number of the analysis engines, and the highest score is taken as the score of the URL and the domain name of the mail;

inquiring whether each URL is a phishing link or not, and taking the worst inquiry result as the score of the mail as a phishing mail;

counting the occurrence times of a Fully Qualified Domain Name (FQDN) corresponding to a URL in historical data and the time interval of the last occurrence of the Fully Qualified Domain Name;

and forming a feature vector of the URL link by using the result obtained in the step.
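The first of the steps above (extracting URLs and domain names and counting them after de-duplication) can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expression `URL_PATTERN`, the function names, and the use of the URL host name in place of a registered domain name are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: any http(s) URL up to whitespace or a closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_links(body):
    """Return the de-duplicated URLs and host names found in a mail body."""
    urls = []
    for match in URL_PATTERN.findall(body):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    hosts = []
    for url in urls:
        # urlparse lower-cases the host; it is used here as the domain name.
        host = urlparse(url).hostname or ""
        if host and host not in hosts:
            hosts.append(host)
    return urls, hosts


def link_count_features(body):
    """Count unique URLs and unique domain names, as in the first step."""
    urls, hosts = extract_links(body)
    return {"num_unique_urls": len(urls), "num_unique_domains": len(hosts)}
```

For a body that mentions the same link twice and a second link once, both counts come out as 2.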

Further, the ranking of each domain name is queried from the Alexa global domain name ranking website.

Further, the registration date of each domain name is queried from the WHOIS domain name query website.

Further, the score with which each URL and domain name is judged malicious is queried from the VirusTotal malicious code analysis website.

Further, whether each URL is a phishing link is queried from the PhishTank phishing link analysis website.

Further, the feature vector of the URL link includes a reputation feature and a statistical feature.

Further, the metadata includes sender IP, sender address, sender name, recipient address, mail subject, mail body, and mail attachment.

Further, the mail feature vector comprises a reputation feature, a forwarding relation feature and a habit feature.

Further, the step of extracting the mail feature vector from the metadata comprises the following steps:

inquiring the maliciousness scores of the sender IP and the sender mailbox address;

counting the number of times the sender name and the sender address appear together in the historical data and the time interval since their latest joint appearance;

counting the occurrence times of the names of the senders in the historical data and the time interval of the latest occurrence;

counting the occurrence times of the sender address in the historical data and the time interval of the latest occurrence;

extracting the forwarding scale quantity of the mail from the recipient list;

counting the times of the simultaneous occurrence of all recipients of the mail in the historical data and the time interval of the latest occurrence;

judging whether the mail text has a telephone number or not;

judging whether a bank account number appears in the mail body;

calculating the subject richness of the mail, wherein the subject richness is equal to the ratio of the number of words in the subject of the mail to the number of characters;

calculating the text richness of the mail, wherein the text richness is equal to the ratio of the number of words in the mail text to the number of characters;

and constructing a mail feature vector by using the results obtained in the steps.
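Several of the steps above share one pattern: count how often something (a sender name, a sender address, a name and address pair, or a set of recipients) appears in the mail history and measure the time since it last appeared. A hedged sketch of that pattern follows; the `(key, timestamp)` representation of the history, the function name `history_stats`, and the choice of days as the unit are assumptions for illustration only.

```python
from datetime import datetime


def history_stats(key, history, now):
    """Count earlier occurrences of `key` and the days since the latest one.

    `history` is an iterable of (key, timestamp) pairs (an assumed layout).
    Returns (0, None) when the key has never been seen before `now`.
    """
    times = [ts for k, ts in history if k == key and ts <= now]
    if not times:
        return 0, None
    days_since_last = (now - max(times)).total_seconds() / 86400.0
    return len(times), days_since_last
```

The same function covers the different statistics by changing the key: a `(name, address)` tuple for the joint sender feature, a plain string for the name or address alone, or a `frozenset` of recipient addresses for the recipient co-occurrence feature.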

Further, the maliciousness scores of the sender IP and the sender mailbox address are queried from the VirusTotal malicious code analysis website.

Further, the link classifier and the spear classifier are each trained with a random forest classification algorithm, wherein the training set of the link classifier consists of mails containing malicious links and mails not containing malicious links, and the data set of the spear classifier consists of malicious mails containing malicious links, namely spear phishing mails and non-spear-phishing mails.

Further, the spear phishing mail data is enhanced by using a K-means based SMOTE (Synthetic Minority Over-sampling Technique) algorithm, and the obtained samples are used for training the spear classifier, wherein the enhancement comprises the following steps:

taking a minority class URL-based spearphishing mail data set as a minority class sample, and taking a majority class non-URL-based spearphishing mail data set as a majority class sample;

calculating K neighbors of each minority sample on the whole data set by using Euclidean distance;

if the K neighbors of a minority sample are all majority samples, recording that sample as a noise sample;

carrying out unsupervised clustering, using a K-means algorithm, on the minority samples remaining after the noise samples are removed, to obtain a plurality of clusters;

enhancing each cluster by using a SMOTE algorithm;

calculating K neighbor of each sample in the cluster by using the Euclidean distance, randomly selecting one sample in the K neighbor, interpolating based on the sample, the randomly selected sample and a generated random number to obtain a new sample, and repeating the step for R times, wherein R is an enhanced proportion;

and finally obtaining a new sample as a URL-based spearphishing mail sample set.
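The noise-removal part of the enhancement (a minority sample whose K neighbors over the whole data set are all majority samples is treated as noise) can be sketched as below. It is an illustrative brute-force version under stated assumptions: samples are numeric tuples, and the names `euclidean` and `drop_noise` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def drop_noise(minority, majority, k):
    """Keep minority samples with at least one minority sample among their
    k nearest neighbors; the others are treated as noise and dropped."""
    # Minority samples come first, so minority[i] is labeled[i].
    labeled = [(p, True) for p in minority] + [(q, False) for q in majority]
    kept = []
    for i, p in enumerate(minority):
        others = [item for j, item in enumerate(labeled) if j != i]
        others.sort(key=lambda item: euclidean(p, item[0]))
        neighbors = others[:k]
        if any(is_minority for _, is_minority in neighbors):
            kept.append(p)
    return kept
```

The surviving samples would then be passed to K-means clustering before interpolation; the clustering itself is not shown here.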

A URL-based spear phishing mail detection system comprises a memory and a processor, the memory storing a computer program configured to be executed by the processor to perform the steps of the above method.

Compared with the prior art, the method can quickly screen out suspicious URL-based spearphishing mails. Under the condition of only adopting historical mails, the requirement on data is greatly reduced, the false alarm rate of detection is greatly reduced due to the double-layer classifier structure, and the classification accuracy is greatly improved due to the combination of multi-source characteristics.

Drawings

Fig. 1 is a flowchart of a method for detecting a spearphishing mail based on a URL according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a spearphishing mail detection system based on a URL according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of three key methods provided in the embodiment of the present invention.

Fig. 4 is a schematic diagram of extracting mail reputation features, forwarding relationship features, and habit features using a third-party platform and historical data according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of a process of enhancing a URL-based spearphishing mail by the K-means-based SMOTE algorithm according to an embodiment of the present invention.

Detailed Description

In order to make the technical solution of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

The embodiment provides a method for detecting a spear phishing mail based on a URL, which detects the spear phishing mail based on the URL by using a double-layer classifier structure, and as shown in FIG. 1, the method comprises the following steps:

firstly, extracting the text of a new mail, judging whether the mail contains a URL link, if not, judging the mail as a non-URL-based spearphishing mail, and if so, entering the second step;

secondly, extracting the characteristics of the URL link by using a third-party platform and a history record to obtain a characteristic vector of the mail URL link;

thirdly, classifying the feature vectors of the URL links by using a trained link classifier, if the link classifier judges that the links have no maliciousness, judging the mail as a non-URL-based spearphishing mail, and if the mail is judged to be maliciousness, entering the fourth step;

fourthly, extracting the characteristics of the mail metadata by using a third-party platform and a history record, and extracting reputation characteristics, forwarding relation characteristics and habit characteristics to obtain a mail characteristic vector;

and fifthly, classifying the mail feature vectors by using the trained spear classifier, and outputting an alarm through a system output module if the spear classifier judges the mail to be a suspicious URL-based spear phishing mail.
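The control flow of the five steps above is a cascade in which most mails exit early. The sketch below shows only that flow; the dictionary layout of `mail`, the verdict strings, and the four callables passed in (two feature extractors and two classifiers returning a boolean) are hypothetical placeholders rather than components defined by the patent.

```python
def detect(mail, link_features, link_classifier, mail_features, spear_classifier):
    """Run the two-layer cascade on one mail and return a verdict string."""
    # Step 1: a mail without any URL in its body leaves the pipeline at once.
    if not mail.get("urls"):
        return "not-url-spear-phishing"
    # Steps 2-3: the first layer only looks at features of the links.
    if not link_classifier(link_features(mail)):
        return "not-url-spear-phishing"
    # Steps 4-5: the second layer looks at features of the mail metadata.
    if spear_classifier(mail_features(mail)):
        return "suspected-url-spear-phishing"
    return "not-url-spear-phishing"
```

Because the second layer only sees mails that the first layer already flagged, the more expensive metadata features are computed for a small fraction of the traffic.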

The present embodiment provides a URL-based spear phishing mail detection system comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the program comprising instructions for performing the steps of the above method. As an embodiment, the system can be divided by function into six modules, namely a data preprocessing module, a feature extraction module, a spear phishing mail enhancement module, a two-layer classifier pre-training module, an external interface calling module and a system output module, together with the two-layer classifier detection model, as shown in figure 2. It is noted that the present invention is not limited to these six modules.

In the model training phase:

The data preprocessing module extracts the metadata of each mail from the original mail flow, wherein the metadata comprises a sender IP, a sender address, a sender name, a receiver address, a mail subject, a mail body, a mail attachment and the like. It then detects whether a URL link is embedded in the mail body, and passes the mails containing URL links on to the subsequent flow.

The feature extraction module extracts only the features of the mail URL links, comprising reputation features and statistical features, for the first-layer link classifier of the model; for the second-layer spear classifier, it extracts habit features, forwarding relation features, reputation features and the like from the metadata of the mails. Finally, all features are uniformly coded to realize sample standardization.

The spear phishing mail enhancement module enhances the spear phishing mail data by using a K-means-based SMOTE algorithm.

The double-layer classifier pre-training module trains the link classifier of the first layer and the spear classifier of the second layer by using a random forest classification algorithm. The training set of the link classifier consists of mails containing malicious links and mails without malicious links, and the data set of the spear classifier consists of spear phishing mails containing malicious links and other malicious mails.

In the model use stage:

The detection interface of the external interface calling module calls the data preprocessing module and the feature extraction module to preprocess the input mails.

The external interface calling module then calls the trained double-layer classifier.

The detection result processing interface of the external interface calling module outputs the detection result of the mail.

The detection result is the confidence that the mail is a URL-based spear phishing mail. The confidence represents the degree of possibility: the invention takes 0.5 as the boundary, a value higher than 0.5 indicates that a spear phishing mail has been detected, and the closer the value is to 1, the greater the possibility.
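A small sketch of the 0.5 boundary follows. The text does not say how the confidence is computed; the example assumes, purely for illustration, that it is the fraction of trees in the random forest that vote "spear phishing". Both function names are hypothetical.

```python
def confidence_from_votes(votes):
    """Fraction of positive votes, used here as an assumed confidence."""
    votes = list(votes)
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(1 for v in votes if v) / len(votes)


def is_spear_phishing(confidence, boundary=0.5):
    """A mail is reported only when the confidence is above the boundary."""
    return confidence > boundary
```

With three of four trees voting positive the confidence is 0.75 and the mail is reported; a confidence of exactly 0.5 is not, since the text requires a value higher than 0.5.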

The core of the method and the system lies in three aspects as shown in figure 3:

extracting mail reputation characteristics, forwarding relation characteristics, habit characteristics and the like by using a third-party platform and historical data;

carrying out data enhancement on the spearphishing mails based on the URL by utilizing a SMOTE algorithm based on K-means;

URL-based spearphishing mails are detected by utilizing a double-layer classifier architecture.

In this embodiment, a third-party platform and historical data are used to extract a mail reputation feature, a statistical feature, a forwarding relation feature, and a habit feature, as shown in fig. 4, the specific steps are as follows:

for the first tier link classifier:

firstly, extracting URLs and corresponding domain names from a mail text, and calculating the number of unique URLs and the number of domain names after duplication removal;

secondly, inquiring the ranking of each domain name from an Alexa global domain name ranking website, and taking the lowest ranking as the global ranking of the mail link;

thirdly, inquiring the registration date of each domain name from the WHOIS domain name inquiry website, and taking the latest registration date as the registration date of the mail link;

fourthly, querying from the VirusTotal malicious code analysis website the score with which each URL and domain name is judged malicious, the score being calculated according to the following formula, and taking the highest score as the score of the URL and domain name of the mail;

score = numMalicious / numTotal

wherein: numMalicious is the number of analysis engines that judge the URL or domain name to be malicious, and numTotal is the total number of analysis engines;

fifthly, querying from the PhishTank phishing link analysis website whether each URL is a phishing link, and taking the worst query result as the score of the mail being a phishing mail;

sixthly, counting the number of occurrences in the historical data of the FQDN corresponding to each URL and the time interval since the last occurrence of the FQDN;

and seventhly, obtaining a feature vector for each sample in the data set by using the above link feature extraction method, the feature vectors forming the input of the link classifier.
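The per-link results are reduced to one value per mail by always keeping the most suspicious one: the lowest ranking, the latest registration date, the highest malicious score and the worst phishing verdict. The sketch below shows that reduction on results that are assumed to have been fetched already; the dictionary keys and function names are hypothetical, no third-party API is called, and "lowest ranking" is read as the largest rank number.

```python
from datetime import date


def malicious_score(num_malicious, num_total):
    """score = numMalicious / numTotal, with 0.0 when no engine responded."""
    if num_total <= 0:
        return 0.0
    return num_malicious / num_total


def aggregate_link_reputation(per_link):
    """Reduce per-link lookup results to mail-level reputation features."""
    return {
        # Lowest ranking is taken to mean the largest rank number.
        "global_rank": max(item["rank"] for item in per_link),
        # The most recently registered domain is the most suspicious.
        "registration_date": max(item["registered"] for item in per_link),
        "malicious_score": max(
            malicious_score(item["num_malicious"], item["num_total"])
            for item in per_link
        ),
        # Worst result: one phishing link marks the whole mail.
        "phishing": any(item["is_phish"] for item in per_link),
    }
```

Together with the unique URL and domain counts and the FQDN history statistics, these values would make up the link feature vector.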

For the second layer of spear classifiers:

step one, inquiring the malicious scores of an IP (Internet protocol) of a sender and a mailbox address of the sender from a Virus Total malicious code analysis website;

secondly, counting the times of the names and the addresses of the senders appearing together in the historical data and the time interval of the latest appearance;

thirdly, counting the occurrence times of the names of the senders in the historical data and the time interval of the latest occurrence;

fourthly, counting the occurrence frequency of the address of the sender in the historical data and the time interval of the latest occurrence;

fifthly, extracting the forwarding scale quantity of the mail from the recipient list;

sixthly, counting the times of the simultaneous occurrence of all the recipients of the mail and the time interval of the latest occurrence in the historical data;

seventhly, judging whether a telephone number appears in the mail body;

eighthly, judging whether a bank account number appears in the mail body;

ninthly, calculating the subject richness of the mail, namely:

subjectRichness = subjectNumWords / subjectNumCharacters

wherein: subjectNumWords is the number of words in the mail subject, and subjectNumCharacters is the number of characters in the mail subject;

tenthly, calculating the body richness of the mail, namely:

bodyRichness = bodyNumWords / bodyNumCharacters

wherein: bodyNumWords is the number of words in the mail body, and bodyNumCharacters is the number of characters in the mail body;

and eleventhly, obtaining a feature vector for each sample in the data set by using the above mail feature extraction method, the feature vectors forming the input of the spear classifier.
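The habit features in the later steps are simple text statistics. The sketch below computes the richness ratio and a telephone-number flag. The text does not define how words or characters are counted, nor what counts as a telephone number, so whitespace-separated words, a character count that includes spaces, and the pattern `PHONE_PATTERN` are assumptions made for illustration.

```python
import re

# Assumed pattern: an optional "+", then 8 to 16 digits that may be
# separated by spaces or hyphens. Real deployments would need locale rules.
PHONE_PATTERN = re.compile(r"\+?\d[\d\- ]{6,14}\d")


def richness(text):
    """Ratio of the number of words to the number of characters."""
    num_chars = len(text)
    if num_chars == 0:
        return 0.0
    return len(text.split()) / num_chars


def has_phone_number(body):
    """True when something resembling a telephone number is in the body."""
    return PHONE_PATTERN.search(body) is not None
```

The same `richness` function serves both subjectRichness and bodyRichness; a bank account check could follow the same shape as `has_phone_number` with a different pattern.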

The SMOTE algorithm based on K-means is used for enhancing the URL-based spearphishing mails, and as shown in FIG. 5, the specific steps are as follows:

First, denoting the minority-class URL-based spear phishing mail data set as P = {p_1, p_2, …, p_m} and the majority-class non-URL-based spear phishing mail data set as Q = {q_1, q_2, …, q_n};

Second, calculating, using the Euclidean distance, the K nearest neighbors of each minority-class sample p_i (i = 1, 2, …, m) over the entire data set;

Third, if the K nearest neighbors of a minority-class sample p_i (i = 1, 2, …, m) are all majority-class samples, marking the sample as a noise sample, which does not participate in the subsequent interpolation process;

Fourth, carrying out unsupervised clustering, using a K-means algorithm, on the minority-class samples remaining after the noise samples are removed, to obtain d clusters denoted C = {C_1, C_2, …, C_d};

Fifth, for each cluster C_i (i = 1, 2, …, d), performing enhancement using the SMOTE algorithm (the sixth and seventh steps) as follows;

Sixth, for each sample X in C_i, calculating the K nearest neighbors of X using the Euclidean distance, and randomly selecting a sample X_j from the K nearest neighbors;

Seventh, randomly generating a number λ between 0 and 1, and generating a new sample X_new using the following formula:

X_new = X + λ × (X_j - X)

Eighthly, repeating the sixth step and the seventh step for R times, wherein R is the enhanced proportion;

and ninthly, obtaining a newly generated spear phishing mail sample set based on the URL through the interpolation.
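The interpolation inside one cluster (the sixth to eighth steps) can be sketched as follows. This is a minimal version under stated assumptions: samples are numeric tuples, neighbors are searched only within the cluster, R is an integer number of new samples per original sample, and the function name `smote_cluster` and its `rng` parameter are hypothetical.

```python
import math
import random


def smote_cluster(cluster, k, r, rng=None):
    """Generate r synthetic samples per original sample of one cluster,
    using X_new = X + lambda * (X_j - X) with X_j a random near neighbor."""
    rng = rng or random.Random()
    new_samples = []
    if len(cluster) < 2:
        return new_samples  # no neighbor to interpolate towards
    for i, x in enumerate(cluster):
        others = [s for j, s in enumerate(cluster) if j != i]
        neighbors = sorted(others, key=lambda s: math.dist(x, s))[:k]
        for _ in range(r):
            x_j = rng.choice(neighbors)  # sixth step: random neighbor
            lam = rng.random()           # seventh step: lambda in [0, 1)
            new_samples.append(
                tuple(a + lam * (b - a) for a, b in zip(x, x_j))
            )
    return new_samples
```

Each synthetic sample lies on the line segment between an original sample and one of its neighbors, so the new points stay inside the region already covered by the cluster. Restricting interpolation to a cluster is what keeps samples from being generated between unrelated groups of spear phishing mails.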

The above embodiments are only intended to illustrate the technical solution of the present invention, but not to limit it, and a person skilled in the art can modify the technical solution of the present invention or substitute it with an equivalent, and the protection scope of the present invention is subject to the claims.

Claims

1. A spear phishing mail detection method based on URL is characterized by comprising the following steps:

2. The method of claim 1, wherein extracting feature vectors for URL links comprises the steps of:

counting the occurrence times of the fully qualified domain name corresponding to the URL in the historical data and the time interval of the last occurrence of the fully qualified domain name;

3. The method of claim 2, wherein the ranking of each domain name is queried from an Alexa global domain name ranking website, the registration date of each domain name is queried from a WHOIS domain name query website, the score with which each URL and domain name is judged malicious is queried from a VirusTotal malicious code analysis website, and whether each URL is a phishing link is queried from a PhishTank phishing link analysis website.

4. The method of claim 1 or 2, wherein the feature vector of the URL link includes a reputation feature and a statistical feature.

5. The method of claim 1 wherein the metadata includes sender IP, sender address, sender name, recipient address, mail subject, mail body, and mail attachment, and the mail feature vector includes reputation features, forwarding relationship features, and habit features.

6. The method of claim 1 or 5, wherein extracting the mail feature vector from the metadata comprises the steps of:

extracting the forwarding scale quantity of the mail from the recipient list;

judging whether the mail text has a telephone number or not;

judging whether a bank account number appears in the mail body;

7. The method of claim 6 wherein the maliciousness scores of the sender IP and the sender mailbox address are queried from a VirusTotal malicious code analysis website.

8. The method of claim 1, wherein the link classifier and the spear classifier are each trained by a random forest classification algorithm, wherein the training set of the link classifier consists of mails with and without malicious links, and the data set of the spear classifier consists of malicious mails with malicious links, namely spear phishing mails and non-spear-phishing mails.

9. The method of claim 1 or 8, wherein the spear phishing mail data is enhanced using a K-means based SMOTE algorithm, and the resulting samples are used for training the spear classifier, the enhancement comprising the steps of:

enhancing each cluster by using a SMOTE algorithm;

10. A URL-based spearphishing mail detection system comprising a memory and a processor, the memory storing a computer program, characterized in that the computer program is configured to perform the steps of the method of any of the preceding claims 1-9 by the processor.