CN112948725A

CN112948725A - Phishing website URL detection method and system based on machine learning

Info

Publication number: CN112948725A
Application number: CN202110231656.4A
Authority: CN
Inventors: 于金龙; 王智民; 王高杰; 卯路宁
Original assignee: Beijing 6Cloud Technology Co Ltd; Beijing 6Cloud Information Technology Co Ltd
Current assignee: Beijing 6Cloud Technology Co Ltd; Beijing 6Cloud Information Technology Co Ltd
Priority date: 2021-03-02
Filing date: 2021-03-02
Publication date: 2021-06-11

Abstract

The invention provides a phishing website URL detection method and system based on machine learning, and belongs to the field of information security. The method comprises the following steps: parsing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected; extracting URL features according to the URL to be detected, the structural information of the URL to be detected and the words forming the URL to be detected; and inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL. Compared with the traditional blacklist technology, the method extracts features from URLs to train a model for prediction, so its coverage is wider and its detection results are more accurate; because detection uses a trained URL model, frequent updating is not needed, fewer resources are occupied, an ordinary computer can run it, and the requirements of a large number of users are met.

Description

Phishing website URL detection method and system based on machine learning

Technical Field

The invention relates to the field of information security, in particular to a phishing website URL detection method based on machine learning and a phishing website URL detection system based on machine learning.

Background

Phishing is a major problem on today's internet, and many users become victims of criminals' deceptive methods. Phishing is a fraud technique that uses spoofed email as its primary medium and then obtains the desired information from the victim, such as usernames, passwords, credit card numbers and bank account details, through a deceptive website.

The action requested in such an email is typically to open a Web link and fill in sensitive personal information on the Web page, or to provide personal identity or bank information in a reply to the email. After clicking on the Web link provided in the deceptive email, the user is directed to the phishing website created by the phisher. Since the phishing website looks similar to the original website, the user often cannot recognize it as malicious and enters the requested information, and is thereby successfully phished. In addition to email, an attacker may also direct a user to malicious links by embedding advertising links on legitimate websites. Furthermore, in some cases, a compromised DNS may redirect users to abnormal websites and phishing websites.

Blacklisting remains the most common defense against such phishing websites; it uses an approximate matching algorithm to check whether a suspicious URL is present in the blacklist. However, this method has the following technical problems that it cannot solve:

1. Blacklisting is a passive defense method that requires constant maintenance and frequent updating (deleting URLs that have expired, adding new phishing URLs), which is not a simple matter.

2. After taking down a phishing web page, an attacker may re-host it on a server that is considered secure, in which case the blacklist-based approach fails to detect the phishing website.

3. The blacklist keeps growing over time, and the blacklist data occupies more and more system resources, a situation the system cannot cope with. Therefore, the blacklist technology can no longer meet users' requirements.

Disclosure of Invention

Compared with the traditional blacklist technology, the URL detection method of the invention extracts features from URLs to train a model for prediction, so its coverage is wider and its detection results are more accurate; because detection uses a trained URL model, frequent updating is not needed, fewer resources are occupied, an ordinary computer can run it, and the requirements of a large number of users are met.

In order to achieve the above object, a first aspect of the present invention provides a phishing website URL detection method based on machine learning, the method including:

analyzing the URL to be detected, and extracting the structural information of the URL to be detected and words forming the URL to be detected;

extracting URL features according to the URL to be detected, the structural information of the URL to be detected and words forming the URL to be detected;

and inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL.

Optionally, the structural information of the URL includes: a URL sub-domain name, a URL suffix, and a URL path; parsing the URL to be detected and extracting the structural information of the URL and the words forming the URL to be detected comprises:

analyzing the URL to be detected, and extracting the structure information of the URL according to the structure of the URL;

and dividing the URL according to the special characters, and extracting words forming the URL to be detected. After the URL is analyzed and decomposed, more accurate characteristics can be extracted, and therefore the detection accuracy rate is improved.

Optionally, the URL features include a first feature, a second feature, and a third feature; extracting URL features according to the URL to be detected, the structural information of the URL and the words forming the URL to be detected comprises:

extracting a first feature according to the structure information of the URL;

extracting a second feature according to the URL to be detected;

and extracting a third feature according to the words forming the URL to be detected.

Further, the extracting the first feature according to the URL structure information includes:

judging whether the URL domain name is in an IP address form or not to obtain the judgment result of the IP address form of the URL;

judging whether the URL domain name is a DGA domain name or not to obtain a DGA domain name judgment result of the URL;

judging whether the URL to be detected exists in a list of the top one million ranked domain names;

the first feature includes: the IP address form judgment result of the URL, the DGA domain name judgment result of the URL, and whether the URL to be detected exists in the list of the top one million ranked domain names. The first feature is extracted from the structural information of the URL and reflects the structural characteristics of the URL.
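The three judgments of the first feature can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the invention: the function names, the comparison of the last two host labels against the ranking list, and the caller-supplied `is_dga` predicate are all hypothetical, and the DGA judgment itself is left as a placeholder.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_host(host):
    """Return True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def first_feature(url, top_domains, is_dga):
    """Build [ip_form, dga, in_top_list] for one URL.

    top_domains: set of ranked domain names (assumed to be loaded elsewhere).
    is_dga: caller-supplied predicate; the DGA classifier is not shown here.
    """
    host = urlsplit(url).hostname or ""
    # Assumption: the last two labels approximate the registered domain.
    registered = ".".join(host.split(".")[-2:])
    return [
        int(is_ip_host(host)),
        int(is_dga(host)),
        int(registered in top_domains),
    ]


# Hypothetical inputs for illustration only.
top = {"example.com"}
never_dga = lambda host: False  # placeholder DGA judgment
ip_case = first_feature("http://192.168.0.1/login", top, never_dga)
ranked_case = first_feature("https://www.example.com/", top, never_dga)
```

Multi-label suffixes such as `co.uk` would need a public suffix list, which this sketch omits.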

Further, the extracting a second feature according to the URL to be detected includes:

counting the length of the URL to be detected;

counting the number of special characters in the URL to be detected;

judging whether special keywords exist in the URL to be detected or not;

counting the number of digits in the URL to be detected;

calculating the ratio of digits to letters in the URL;

calculating the entropy of the URL;

calculating the Kolmogorov-Smirnov (KS) test value of the URL;

calculating the KL distance (Kullback-Leibler divergence) value of the URL;

calculating the Euclidean distance value of the URL;

calculating the ratio of vowels to consonants in the URL;

judging whether the URL has an HTML entity to obtain an HTML entity judgment result of the URL;

the second feature includes: the length of the URL to be detected, the number of special characters in the URL to be detected, the number of digits in the URL to be detected, whether special keywords exist in the URL to be detected, the ratio of digits to letters in the URL, the entropy of the URL, the KS test value of the URL, the KL distance value of the URL, the Euclidean distance value of the URL, the ratio of vowels to consonants in the URL, and the HTML entity judgment result of the URL. The second feature is extracted from the URL itself and reflects the overall characteristics of the URL.
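The counting-based parts of the second feature can be sketched as below. This is an illustration only: the function names, the treatment of every non-alphanumeric character as a special character, the regular expression used for HTML entities, and the keyword list are assumptions that the text does not specify.

```python
import math
import re
from collections import Counter

VOWELS = set("aeiou")
# Assumed pattern for named or numeric HTML entities such as &lt; or &#60;
HTML_ENTITY = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def ratio(a, b):
    """Safe division that returns 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def shannon_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    n = len(text)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def basic_url_stats(url, keywords):
    lower = url.lower()
    digits = sum(ch.isdigit() for ch in url)
    letters = sum(ch.isalpha() for ch in url)
    vowels = sum(ch in VOWELS for ch in lower)
    consonants = sum(ch.isalpha() and ch not in VOWELS for ch in lower)
    return {
        "length": len(url),
        "special_chars": sum(not ch.isalnum() for ch in url),
        "has_keyword": int(any(k in lower for k in keywords)),
        "digits": digits,
        "digit_letter_ratio": ratio(digits, letters),
        "entropy": shannon_entropy(url),
        "vowel_consonant_ratio": ratio(vowels, consonants),
        "has_html_entity": int(bool(HTML_ENTITY.search(url))),
    }


# Hypothetical URL and keyword list for illustration.
stats = basic_url_stats("http://a1.bank.com", keywords=["bank", "money"])
```

The KS, KL and Euclidean values are omitted here because they require a reference distribution.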

Further, the extracting a third feature according to the words forming the URL to be detected includes:

adding the words forming the URL to be tested into a remaining word list;

judging whether the words in the remaining word list are random characters one by one, adding the words which are the random characters into a random character word list, and keeping the words which are not added in the remaining word list;

judging, one by one, whether the words in the remaining word list whose length is larger than a set length threshold are compound words formed by a plurality of words, adding the compound words into a compound word list, and keeping the words which are not added in the remaining word list;

judging, one by one, whether the words in the remaining word list are misspelled, adding the misspelled words into an error word list, and keeping the words which are not added in the remaining word list;

calculating the similarity between the words in the remaining word list and the brand names one by one, judging the words with the similarity larger than a set similarity threshold value as similar words, adding the similar words into a similar word list, and keeping the words which are not added in the remaining word list;

calculating a length of the random character word list, a length of the compound word list, a length of the error word list, a length of the similar word list, and a length of the remaining word list;

the third feature includes the length of the random character word list, the length of the compound word list, the length of the error word list, the length of the similar word list, and the length of the remaining word list. The third feature is extracted from the words composing the URL and captures random characters, compound words, misspellings, words similar to known brand names, and the like. Because features are extracted along three dimensions (the URL itself, the structural information of the URL, and the words forming the URL), the extracted features comprehensively reflect the characteristics of the URL, so that the detection result is more accurate.
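The successive filtering of the remaining word list can be sketched as a cascade of predicates. The sketch below is illustrative only: `third_feature`, the list names, and the toy set-membership predicates are hypothetical stand-ins for the real random-character, compound-word, misspelling, and brand-similarity judgments.

```python
def third_feature(words, checks):
    """Partition words through an ordered cascade of (list_name, predicate).

    Each word is moved into the first list whose predicate accepts it;
    words accepted by no predicate stay in the remaining word list.
    Returns the list lengths (in cascade order, remaining last) and the lists.
    """
    remaining = list(words)
    lists = {}
    for name, predicate in checks:
        matched, kept = [], []
        for word in remaining:
            (matched if predicate(word) else kept).append(word)
        lists[name] = matched
        remaining = kept
    lists["remaining"] = remaining
    lengths = [len(lists[name]) for name, _ in checks] + [len(remaining)]
    return lengths, lists


# Toy predicates for illustration; real judgments would replace them.
checks = [
    ("random", lambda w: w in {"xkqzt"}),
    ("compound", lambda w: len(w) > 8 and w in {"securelogin"}),
    ("error", lambda w: w in {"acount"}),
    ("similar", lambda w: w in {"paypa1"}),
]
words = ["www", "xkqzt", "securelogin", "acount", "paypa1", "com"]
lengths, lists = third_feature(words, checks)
```

Passing the checks as an ordered list keeps the cascade order explicit and easy to change.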

Further, the one-by-one judgment of whether the words in the remaining word list are random characters includes:

and establishing a Markov chain model according to the N-Gram language model to judge whether the words in the remaining word list are random characters. The N-Gram language model is trained on ordinary documents, so the training process is simple, and the language model can accurately judge whether a word consists of random characters.

Optionally, the trained URL detection model is: a random forest algorithm model, a decision tree model, a GBDT model, an XGBoost algorithm model or an SVM model.
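One way to realize such a judgment is a character-level bigram (N = 2) Markov chain that scores a word by its average log transition probability. The following is a sketch under assumptions: the class name, the add-one smoothing, the tiny training word list, and the threshold value are all illustrative choices, not values given in the text.

```python
import math
from collections import defaultdict


class BigramModel:
    """Character bigram Markov chain with add-one smoothing."""

    def __init__(self, corpus_words, alphabet="abcdefghijklmnopqrstuvwxyz"):
        self.vocab = len(alphabet) + 1  # letters plus the end marker
        self.counts = defaultdict(lambda: defaultdict(int))
        for word in corpus_words:
            padded = "^" + word.lower() + "$"  # start and end markers
            for a, b in zip(padded, padded[1:]):
                self.counts[a][b] += 1

    def avg_log_prob(self, word):
        """Average log probability per transition; lower means less word-like."""
        padded = "^" + word.lower() + "$"
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            row = self.counts.get(a, {})
            total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + self.vocab))
        return total / (len(padded) - 1)

    def is_random(self, word, threshold):
        return self.avg_log_prob(word) < threshold


# Tiny illustrative corpus; a real model would be trained on ordinary documents.
corpus = ["login", "secure", "account", "online", "banking", "service", "update",
          "payment", "customer", "support", "security", "website", "internet",
          "password"]
model = BigramModel(corpus)
THRESHOLD = -3.0  # assumed value, would be tuned on labelled words
```

With this corpus, `model.is_random("xqzjk", THRESHOLD)` is true and `model.is_random("secure", THRESHOLD)` is false.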

The invention provides a phishing website URL detection system based on machine learning in a second aspect, which comprises:

the URL parsing unit is used for parsing the URL to be detected and extracting the structural information of the URL to be detected and the words forming the URL to be detected;

the feature extraction unit is used for extracting URL features according to the URL to be detected, the structural information of the URL to be detected and the words forming the URL to be detected;

and the abnormal URL detection unit is used for inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL. The system can effectively determine the probability that a URL is an abnormal URL, and has a simple structure and high detection accuracy.
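The cooperation of the three units can be sketched as a small pipeline object. The class and field names below are hypothetical, and the three callables in the usage example are trivial stand-ins that only show how data flows between the units.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class PhishingUrlDetector:
    # URL parsing unit: url -> (structural information, words)
    parse: Callable[[str], Tuple[dict, Sequence[str]]]
    # Feature extraction unit: (url, structure, words) -> feature vector
    extract: Callable[[str, dict, Sequence[str]], Sequence[float]]
    # Abnormal URL detection unit: feature vector -> probability
    predict_proba: Callable[[Sequence[float]], float]

    def detect(self, url: str) -> float:
        structure, words = self.parse(url)
        features = self.extract(url, structure, words)
        return self.predict_proba(features)


# Stand-in units for illustration only; they are not real detectors.
detector = PhishingUrlDetector(
    parse=lambda url: ({}, url.replace("/", ".").split(".")),
    extract=lambda url, structure, words: [len(url), len(words)],
    predict_proba=lambda features: min(1.0, features[0] / 100),
)
probability = detector.detect("http://a.b/c")
```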

In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to perform the machine learning-based phishing URL detection method described herein.

Through the above technical scheme, the URL detection method extracts features from URLs to train a model for prediction, so its coverage is wider and its detection results are more accurate; because detection uses a trained URL model, frequent updating is not needed, fewer resources are occupied, an ordinary computer can run it, and the requirements of a large number of users are met.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:

FIG. 1 is a flowchart of a method for detecting URLs in phishing websites based on machine learning according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating an example URL structure provided by the present invention;

FIG. 3 is a block diagram of a phishing website URL detection system based on machine learning according to an embodiment of the invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.

Fig. 1 is a flowchart of a phishing website URL detection method based on machine learning according to an embodiment of the present invention. As shown in fig. 1, the method includes:

The basic structure of a URL is shown in FIG. 2; the entire URL includes a protocol, domain name, suffix, path, etc. A URL is composed of meaningful or meaningless words and of special characters that separate important components of the address. For example, a dot (".") separates a domain name from a sub-domain name. In the path, folders are separated by "/" symbols. Furthermore, each component of the URL may also contain delimiters, i.e. special characters, such as "-", "?", "=" and the like.

For the URL in fig. 2, the extracted sub-domain name, domain name, suffix, and path information are 'www', 'abc-def', 'com', 'details/index. The extracted words include: https, www, abc, def, com, details, Index, html.
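The parsing step can be sketched with the Python standard library. The URL below is a hypothetical one chosen to be consistent with the words listed above; `parse_url` and the rule that the suffix is the last host label are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def parse_url(url):
    """Split a URL into structural information and its constituent words."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Assumption: last label is the suffix, the one before it is the domain.
    suffix = labels[-1] if len(labels) >= 2 else ""
    domain = labels[-2] if len(labels) >= 2 else host
    structure = {
        "subdomain": ".".join(labels[:-2]),
        "domain": domain,
        "suffix": suffix,
        "path": parts.path.lstrip("/"),
    }
    # Every run of non-alphanumeric characters is treated as a delimiter.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]
    return structure, words


# Hypothetical URL for illustration.
structure, words = parse_url("https://www.abc-def.com/details/Index.html")
```

Here `words` is `['https', 'www', 'abc', 'def', 'com', 'details', 'Index', 'html']`.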

extracting a first feature according to the structure information of the URL;

extracting a second feature according to the URL to be detected;

judging whether the URL to be detected exists in a list of the top one million ranked domain names;

It should be noted that a DGA (domain generation algorithm) is an algorithm that generates random domain names, and a DGA domain name is a domain name generated by such an algorithm.

counting the length of the URL to be detected;

counting the number of special characters in the URL to be detected;

judging whether special keywords exist in the URL to be detected or not;

counting the number of digits in the URL to be detected;

calculating the ratio of digits to letters in the URL;

calculating the entropy of the URL;

calculating the Kolmogorov-Smirnov (KS) test value of the URL;

calculating the KL distance (Kullback-Leibler divergence) value of the URL;

calculating the Euclidean distance value of the URL;

calculating the ratio of vowels to consonants in the URL;

In some embodiments, the special keywords are words set according to user requirements, such as "bank", "money", and the like.
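The KS, KL and Euclidean values listed above all compare two distributions, but the text does not say which ones. The sketch below assumes, purely for illustration, that the character frequency distribution of the URL is compared with a reference distribution built from sample text. The function names, the add-one smoothing, and the alphabet ordering used for the KS statistic are also assumptions.

```python
import math
from collections import Counter
from itertools import accumulate

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def char_distribution(text, smoothing=1.0):
    """Smoothed relative frequency of each alphabet character in text."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values()) + smoothing * len(ALPHABET)
    return [(counts[ch] + smoothing) / total for ch in ALPHABET]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); smoothing keeps q positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def ks_statistic(p, q):
    """Largest gap between the two cumulative distributions.

    The characters are ordered as in ALPHABET, which is an arbitrary choice.
    """
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Hypothetical reference text and URL word string for illustration.
reference = char_distribution("www example com login account secure index html")
candidate = char_distribution("xq7zk9vj2w")
kl = kl_divergence(candidate, reference)
dist = euclidean_distance(candidate, reference)
ks = ks_statistic(candidate, reference)
```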

adding the words forming the URL to be tested into a remaining word list;

It should be noted that the order of the judging steps for random characters, compound words, misspelled words and similar words in the third feature may be adjusted arbitrarily; only one of the possible orders is described in the present application.

In some embodiments of the invention, a spell check library of Python is used to determine whether a word is misspelled. In some embodiments of the present invention, the similarity between a word and a brand name is calculated using the edit distance: the smaller the edit distance, the higher the similarity, and a word whose edit distance is smaller than a set edit distance threshold is determined to be a similar word.
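Among the judging steps noted above, the compound word judgment can be sketched as a dictionary-based segmentation. The text does not describe how compound words are recognized, so the dynamic programming approach, the function names, the vocabulary, and the threshold values below are all assumptions.

```python
def split_compound(word, vocabulary, min_part=2):
    """Return vocabulary words that concatenate to `word`, or None.

    best[i] holds one segmentation of word[:i], or None if there is none.
    """
    word = word.lower()
    n = len(word)
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        # Try the shortest allowed last part first, then longer ones.
        for start in range(end - min_part, -1, -1):
            if best[start] is not None and word[start:end] in vocabulary:
                best[end] = best[start] + [word[start:end]]
                break
    parts = best[n]
    return parts if parts and len(parts) >= 2 else None


def is_compound(word, vocabulary, length_threshold=7):
    """Only words longer than the length threshold are tested."""
    return len(word) > length_threshold and split_compound(word, vocabulary) is not None


# Hypothetical vocabulary for illustration.
vocab = {"secure", "login", "pay", "pal", "account"}
parts = split_compound("securelogin", vocab)
```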
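The edit distance comparison can be sketched with a standard Levenshtein distance. The brand list and the threshold value below are hypothetical, and `similar_brand` is an illustrative helper name.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def similar_brand(word, brands, threshold=2):
    """Return (brand, distance) for the closest brand if below the threshold."""
    word = word.lower()
    distance, brand = min((edit_distance(word, b), b) for b in brands)
    return (brand, distance) if distance < threshold else None


# Hypothetical brand list for illustration.
brands = ["paypal", "amazon"]
match = similar_brand("paypa1", brands)
```

Note that an exact brand name has distance 0 and is also reported as similar under this rule.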

In some embodiments of the invention, the trained URL detection model is trained by:

firstly, collecting a large number of phishing website URLs and normal URLs, marking the normal URLs as 0, and marking the phishing website URLs as 1;

then extracting the structural information of the collected URLs and the words forming them, following the same processing as for the URL to be detected;

extracting the features corresponding to the collected URLs according to the structural information, the URLs and the words forming the URLs;

and constructing a URL detection model according to the collected URLs and the extracted features, using Bayesian optimization to search for the parameter combination that minimizes the cross-validation error, using that parameter combination as the optimal parameters of the URL detection model, and storing the model after modeling is completed, wherein the stored model is the trained URL detection model. In other embodiments, the URL detection model is optimized using another optimization method such as grid search.
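The training procedure can be sketched with scikit-learn, assuming it and NumPy are installed. This sketch uses the grid search variant rather than Bayesian optimization, a random forest as the model, and synthetic random vectors in place of real extracted URL features; the parameter grid is an arbitrary illustrative choice.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-ins for feature vectors: label 0 = normal, 1 = phishing.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(60, 5))
X_phish = rng.normal(loc=1.5, scale=1.0, size=(60, 5))
X = np.vstack([X_normal, X_phish])
y = np.array([0] * 60 + [1] * 60)

# Search for the parameter combination with the best cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 30], "max_depth": [2, 4]},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Store the fitted model, then reload it as the trained detection model.
stored = pickle.dumps(search.best_estimator_)
model = pickle.loads(stored)

# Probability that a feature vector belongs to class 1 (phishing).
prob = model.predict_proba(X_phish[:1])[0, 1]
```

Maximizing cross-validated accuracy is equivalent to minimizing the cross-validation error rate.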

FIG. 3 is a block diagram of a phishing website URL detection system based on machine learning according to an embodiment of the invention. As shown in fig. 3, the system includes:

In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to execute the phishing website URL detection method based on machine learning according to the present application.

Those skilled in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program, which is stored in a storage medium and includes several instructions that enable a single-chip microcomputer, a chip, or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.

While the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications are within the scope of the embodiments of the present invention. It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention will not be described separately for the various possible combinations.

In addition, any combination of the various embodiments of the present invention is also possible, and the same should be considered as disclosed in the embodiments of the present invention as long as it does not depart from the spirit of the embodiments of the present invention.

Claims

1. A phishing website URL detection method based on machine learning is characterized by comprising the following steps:

2. A phishing website URL detection method as claimed in claim 1 wherein said URL structure information comprises: a URL sub-domain name, a URL suffix, and a URL path; parsing the URL to be detected and extracting the structural information of the URL and the words forming the URL to be detected comprises:

and dividing the URL according to the special characters, and extracting words forming the URL to be detected.

3. A phishing website URL detection method as claimed in claim 2 wherein said URL features include a first feature, a second feature and a third feature; extracting URL features according to the URL to be detected, the structural information of the URL and the words forming the URL to be detected comprises:

extracting a first feature according to the structure information of the URL;

extracting a second feature according to the URL to be detected;

4. A phishing website URL detection method as claimed in claim 3 wherein said extracting first features from said URL structure information comprises:

the first feature includes: the IP address form judgment result of the URL, the DGA domain name judgment result of the URL, and whether the URL to be detected exists in a list of the top one million ranked domain names.

5. A phishing website URL detection method as claimed in claim 3 wherein said extracting second features from said URL to be tested comprises:

counting the length of the URL to be detected;

counting the number of special characters in the URL to be detected;

judging whether special keywords exist in the URL to be detected or not;

counting the number of digits in the URL to be detected;

calculating the ratio of digits to letters in the URL;

calculating the entropy of the URL;

calculating the Kolmogorov-Smirnov (KS) test value of the URL;

calculating the KL distance (Kullback-Leibler divergence) value of the URL;

calculating the Euclidean distance value of the URL;

calculating the ratio of vowels to consonants in the URL;

the second feature includes: the length of the URL to be detected, the number of special characters in the URL to be detected, the number of digits in the URL to be detected, whether special keywords exist in the URL to be detected, the ratio of digits to letters in the URL, the entropy of the URL, the KS test value of the URL, the KL distance value of the URL, the Euclidean distance value of the URL, the ratio of vowels to consonants in the URL, and the HTML entity judgment result of the URL.

6. A phishing website URL detection method as claimed in claim 3 wherein said extracting third feature from said words constituting the URL to be tested comprises:

adding the words forming the URL to be tested into a remaining word list;

the third feature includes a length of the random character word list, a length of the compound word list, a length of the error word list, a length of the similar word list, and a length of the remaining word list.

7. A phishing website URL detection method as claimed in claim 6 wherein said one by one determining if words in said remaining word list are random characters comprises:

and establishing a Markov chain model according to the N-Gram language model to judge whether the words in the remaining word list are random characters.

8. A phishing website URL detection method as claimed in claim 1 wherein said trained URL detection model is: a random forest algorithm model, a decision tree model, a GBDT model, an XGBoost algorithm model or an SVM model.

9. A phishing website URL detection system based on machine learning, the system comprising:

and the abnormal URL detection unit is used for inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL.

10. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the method for machine learning based phishing website URL detection of any one of claims 1-8 herein.