CN112948725A - Phishing website URL detection method and system based on machine learning - Google Patents

Phishing website URL detection method and system based on machine learning

Info

Publication number
CN112948725A
Authority
CN
China
Prior art keywords
url
detected
words
word list
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110231656.4A
Other languages
Chinese (zh)
Inventor
于金龙
王智民
王高杰
卯路宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing 6Cloud Technology Co Ltd
Beijing 6Cloud Information Technology Co Ltd
Original Assignee
Beijing 6Cloud Technology Co Ltd
Beijing 6Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing 6Cloud Technology Co Ltd and Beijing 6Cloud Information Technology Co Ltd
Priority to CN202110231656.4A
Publication of CN112948725A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention provides a phishing website URL detection method and system based on machine learning, and belongs to the field of information security. The method comprises the following steps: analyzing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected; extracting URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected; and inputting the URL features into a trained URL detection model for detection, to obtain the probability that the URL to be detected is an abnormal URL. Compared with the traditional blacklist technology, the method trains a model on features extracted from URLs and uses it for prediction, which gives wider coverage and more accurate detection results. Because detection uses a trained URL model, frequent updating is not needed, few resources are occupied, the method can run on an ordinary computer, and the needs of a large number of users are met.

Description

Phishing website URL detection method and system based on machine learning
Technical Field
The invention relates to the field of information security, and in particular to a phishing website URL detection method based on machine learning and a phishing website URL detection system based on machine learning.
Background
Phishing is a major problem on today's internet, and many users fall victim to the deceptive methods of criminals. Phishing is a fraud technique that uses spoofed email as its primary medium and then obtains information from the victim, such as usernames, passwords, and credit card and bank account details, through a deceptive website.
The action requested in the email is typically to open a Web link and fill in sensitive personal information on the Web page, or to reply to the email with personal identity or bank information. After clicking the Web link provided in the deceptive email, the user is directed to the phishing website created by the phisher. Because the phishing website looks similar to the original website, the user often cannot recognize it as malicious and enters the requested information, and is thereby successfully phished. In addition to email, an attacker may also direct a user to malicious links by embedding advertising links on legitimate websites. Furthermore, in some cases, an infected DNS may cause users to be redirected to abnormal websites and phishing websites.
Blacklisting remains the most common defense against such phishing websites: an approximate-matching algorithm checks whether a suspicious URL is present in the blacklist. However, this approach has the following technical problems that it cannot solve:
1. Blacklisting is a passive defense that requires constant maintenance and frequent updates (deleting expired URLs and adding new phishing URLs), which is not a simple matter.
2. After taking down a phishing web page, an attacker may implant it on a server that is considered secure, in which case the blacklist-based approach fails to detect the phishing website.
3. Blacklisting cannot cope with a continuously growing list: the number of blacklist entries increases over time, and the blacklist data occupies more and more system resources. Therefore, blacklist technology can no longer meet users' requirements.
Disclosure of Invention
Compared with the traditional blacklist technology, the URL detection method of the invention trains a model on features extracted from URLs and uses it for prediction, which gives wider coverage and more accurate detection results. Because detection uses a trained URL model, frequent updating is not needed, few resources are occupied, the method can run on an ordinary computer, and the needs of a large number of users are met.
In order to achieve the above object, a first aspect of the present invention provides a phishing website URL detection method based on machine learning, the method including:
analyzing the URL to be detected, and extracting the structural information of the URL to be detected and words forming the URL to be detected;
extracting URL features according to the URL to be detected, the structural information of the URL to be detected and words forming the URL to be detected;
and inputting the URL features into a trained URL detection model for detection, to obtain the probability that the URL to be detected is an abnormal URL.
Optionally, the structural information of the URL includes: a URL sub-domain name, a URL suffix, and a URL path; analyzing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected, comprises:
analyzing the URL to be detected, and extracting the structural information of the URL according to the structure of the URL;
and splitting the URL at special characters, and extracting the words forming the URL to be detected. Once the URL has been analyzed and decomposed, more accurate features can be extracted, which improves detection accuracy.
Optionally, the URL features include a first feature, a second feature, and a third feature; extracting URL features according to the URL to be detected, the structural information of the URL, and the words forming the URL to be detected comprises:
extracting a first feature according to the structural information of the URL;
extracting a second feature according to the URL to be detected;
and extracting a third feature according to the words forming the URL to be detected.
Further, extracting the first feature according to the structural information of the URL includes:
judging whether the URL domain name is in IP address form, to obtain an IP address form judgment result of the URL;
judging whether the URL domain name is a DGA domain name, to obtain a domain name judgment result of the URL;
judging whether the URL to be detected is in a list of the top one million ranked domain names;
the first feature includes: the IP address form judgment result of the URL, the domain name judgment result of the URL, and the judgment result of whether the URL to be detected is in the list of the top one million ranked domain names. The first feature is extracted from the structural information of the URL and reflects the structural characteristics of the URL.
Further, extracting the second feature according to the URL to be detected includes:
counting the length of the URL to be detected;
counting the number of special characters in the URL to be detected;
judging whether special keywords exist in the URL to be detected;
counting the number of digits in the URL to be detected;
calculating the ratio of digits to letters in the URL;
calculating the entropy of the URL;
calculating the Kolmogorov-Smirnov (KS) test value of the URL;
calculating the KL distance (Kullback-Leibler divergence) value of the URL;
calculating the Euclidean distance value of the URL;
calculating the ratio of vowels to consonants in the URL;
judging whether the URL contains an HTML entity, to obtain an HTML entity judgment result of the URL;
the second feature includes: the length of the URL to be detected, the number of special characters in the URL to be detected, the number of digits in the URL to be detected, whether special keywords exist in the URL to be detected, the ratio of digits to letters in the URL, the entropy of the URL, the KS test value of the URL, the KL distance value of the URL, the Euclidean distance value of the URL, the ratio of vowels to consonants in the URL, and the HTML entity judgment result of the URL. The second feature is extracted from the URL itself and reflects the overall characteristics of the URL.
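As an illustration, several of the statistics listed above can be computed directly from the URL string. The following Python sketch is not taken from the application: the function name `basic_url_statistics`, the example keywords, the example URL, and the regular expression used to spot HTML entities are all assumptions made for the example. The KS, KL, and Euclidean values are left out because they need a reference distribution.

```python
import math
import re
from collections import Counter

VOWELS = set("aeiou")
# Assumed pattern for named or numeric HTML entities such as &amp; or &#47;
HTML_ENTITY = re.compile(r"&(#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def shannon_entropy(text):
    """Shannon entropy, in bits, of the character distribution of text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in Counter(text).values())


def ratio(numerator, denominator):
    """Ratio that falls back to 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def basic_url_statistics(url, keywords=("bank", "money")):
    """Return a dict of simple whole-URL statistics (hypothetical helper)."""
    lowered = url.lower()
    letters = [ch for ch in lowered if ch.isalpha()]
    digits = sum(ch.isdigit() for ch in url)
    vowels = sum(ch in VOWELS for ch in letters)
    return {
        "length": len(url),
        "special_chars": sum(not ch.isalnum() for ch in url),
        "digits": digits,
        "has_keyword": int(any(k in lowered for k in keywords)),
        "digit_letter_ratio": ratio(digits, len(letters)),
        "entropy": shannon_entropy(url),
        "vowel_consonant_ratio": ratio(vowels, len(letters) - vowels),
        "has_html_entity": int(bool(HTML_ENTITY.search(url))),
    }


# Hypothetical example URL
stats = basic_url_statistics("http://pay-bank1.com")
```

Returning a dictionary keeps each value labeled; a real pipeline would probably flatten it into a fixed-order numeric vector before passing it to the model.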
Further, extracting the third feature according to the words forming the URL to be detected includes:
adding the words forming the URL to be detected to a remaining word list;
judging one by one whether the words in the remaining word list are random characters, adding the words that are random characters to a random character word list, and keeping the words that were not added in the remaining word list;
judging one by one whether the words in the remaining word list whose length is greater than a set length threshold are compound words formed of multiple words, adding the compound words to a compound word list, and keeping the words that were not added in the remaining word list;
judging one by one whether the words in the remaining word list are misspelled, adding the misspelled words to a misspelled word list, and keeping the words that were not added in the remaining word list;
calculating one by one the similarity between the words in the remaining word list and brand names, judging words whose similarity is greater than a set similarity threshold to be similar words, adding the similar words to a similar word list, and keeping the words that were not added in the remaining word list;
calculating the length of the random character word list, the length of the compound word list, the length of the misspelled word list, the length of the similar word list, and the length of the remaining word list;
the third feature includes the length of the random character word list, the length of the compound word list, the length of the misspelled word list, the length of the similar word list, and the length of the remaining word list. The third feature is extracted from the words forming the URL and captures random characters, compound words, misspellings, words similar to known brand names, and the like, which makes the detection result more accurate. Features are extracted along three dimensions, namely the URL itself, the structural information of the URL, and the words forming the URL, so the extracted features comprehensively reflect the characteristics of the URL and make the detection result more accurate.
Further, judging one by one whether the words in the remaining word list are random characters includes:
building a Markov chain model based on an N-Gram language model to judge whether the words in the remaining word list are random characters. The N-Gram language model is trained on ordinary documents, so the training process is simple, and the language model can accurately judge whether a word consists of random characters.
Optionally, the trained URL detection model is: a random forest algorithm model, a decision tree model, a GBDT model, an XGBoost algorithm model, or an SVM model.
A second aspect of the invention provides a phishing website URL detection system based on machine learning, the system comprising:
a URL analyzing unit, configured to analyze the URL to be detected and extract the structural information of the URL to be detected and the words forming the URL to be detected;
a feature extraction unit, configured to extract URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected;
and an abnormal URL detection unit, configured to input the URL features into a trained URL detection model for detection, to obtain the probability that the URL to be detected is an abnormal URL. The system can effectively detect the probability that a URL is abnormal, and has a simple structure and high detection accuracy.
In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to perform the machine learning-based phishing website URL detection method described herein.
Through the above technical solution, the URL detection method trains a model on features extracted from URLs and uses it for prediction, which gives wider coverage and more accurate detection results. Because detection uses a trained URL model, frequent updating is not needed, few resources are occupied, the method can run on an ordinary computer, and the needs of a large number of users are met.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a flowchart of a method for detecting URLs in phishing websites based on machine learning according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example URL structure provided by the present invention;
FIG. 3 is a block diagram of a phishing website URL detection system based on machine learning according to an embodiment of the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flowchart of a phishing website URL detection method based on machine learning according to an embodiment of the present invention. As shown in fig. 1, the method includes:
analyzing the URL to be detected, and extracting the structural information of the URL to be detected and words forming the URL to be detected;
extracting URL features according to the URL to be detected, the structural information of the URL to be detected and words forming the URL to be detected;
and inputting the URL features into a trained URL detection model for detection, to obtain the probability that the URL to be detected is an abnormal URL.
Optionally, the structural information of the URL includes: a URL sub-domain name, a URL suffix, and a URL path; analyzing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected, comprises:
analyzing the URL to be detected, and extracting the structural information of the URL according to the structure of the URL;
and splitting the URL at special characters, and extracting the words forming the URL to be detected. Once the URL has been analyzed and decomposed, more accurate features can be extracted, which improves detection accuracy.
The basic structure of a URL is shown in FIG. 2; the entire URL includes a protocol, a domain name, a suffix, a path, and so on. A URL is composed of meaningful or meaningless words and of special characters that separate the important components of the address. For example, a dot (".") separates a domain name from a sub-domain name. In the path, folders are separated by "/" symbols. Furthermore, each component of the URL may also contain delimiters, i.e. special characters, such as "-", "?", "=" and the like.
For the URL in FIG. 2, the extracted sub-domain name, suffix, and path information are 'www', 'abc-def', 'com', 'details/index. The extracted words include: https, www, abc, def, com, details, Index, html.
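The decomposition described above can be sketched in Python with the standard library. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the function name `parse_url`, the example URL, and the simplification that the suffix is the last dot-separated label of the host (which ignores multi-label suffixes such as co.uk) are all assumptions.

```python
import re
from urllib.parse import urlsplit


def parse_url(url):
    """Split a URL into structural parts and into its constituent words."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Naive assumption: last label is the suffix, the one before it the domain.
    suffix = labels[-1] if len(labels) >= 2 else ""
    domain = labels[-2] if len(labels) >= 2 else host
    subdomain = ".".join(labels[:-2])
    # Words are the alphanumeric runs left after cutting at special characters.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]
    return {
        "subdomain": subdomain,
        "domain": domain,
        "suffix": suffix,
        "path": parts.path.lstrip("/"),
        "words": words,
    }


# Hypothetical example URL
info = parse_url("https://login.example-bank.com/account/verify.php?id=42")
# info["words"] is
# ['https', 'login', 'example', 'bank', 'com', 'account', 'verify', 'php', 'id', '42']
```

A production parser would likely consult a public suffix list rather than taking the last label, since the suffix boundary cannot be determined from the string alone.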
Optionally, the URL features include a first feature, a second feature, and a third feature; extracting URL features according to the URL to be detected, the structural information of the URL, and the words forming the URL to be detected comprises:
extracting a first feature according to the structural information of the URL;
extracting a second feature according to the URL to be detected;
and extracting a third feature according to the words forming the URL to be detected.
Further, extracting the first feature according to the structural information of the URL includes:
judging whether the URL domain name is in IP address form, to obtain an IP address form judgment result of the URL;
judging whether the URL domain name is a DGA domain name, to obtain a domain name judgment result of the URL;
judging whether the URL to be detected is in a list of the top one million ranked domain names;
the first feature includes: the IP address form judgment result of the URL, the domain name judgment result of the URL, and the judgment result of whether the URL to be detected is in the list of the top one million ranked domain names. The first feature is extracted from the structural information of the URL and reflects the structural characteristics of the URL.
It should be noted that a DGA (domain generation algorithm) is an algorithm that generates pseudo-random domain names, and a DGA domain name is a domain name generated by such an algorithm.
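A minimal Python sketch of the three structural judgments follows. It rests on several assumptions that are not in the application: the function names, the use of the standard `ipaddress` module for the IP form check, looking up the last two host labels in the ranked list, and passing the DGA judgment in as a caller-supplied predicate `is_dga`, since the text does not say how that judgment is made.

```python
import ipaddress


def is_ip_host(host):
    """True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def first_feature(host, top_domains, is_dga):
    """Return [ip_form, dga_domain, in_top_list] as 0/1 values.

    top_domains: set of ranked domain names (assumed to be loaded elsewhere).
    is_dga: hypothetical predicate deciding whether the host looks generated.
    """
    # Naive assumption: the registered domain is the last two labels.
    registered = ".".join(host.split(".")[-2:])
    return [
        int(is_ip_host(host)),
        int(is_dga(host)),
        int(registered in top_domains),
    ]


# Tiny stand-in for a top-one-million list, and a predicate that never fires
top = {"example.com"}
never_dga = lambda host: False

ip_case = first_feature("192.168.0.1", top, never_dga)
ranked_case = first_feature("www.example.com", top, never_dga)
```

Keeping the ranked list in a set makes the membership test constant time, which matters if the list really holds a million entries.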
Further, extracting the second feature according to the URL to be detected includes:
counting the length of the URL to be detected;
counting the number of special characters in the URL to be detected;
judging whether special keywords exist in the URL to be detected;
counting the number of digits in the URL to be detected;
calculating the ratio of digits to letters in the URL;
calculating the entropy of the URL;
calculating the Kolmogorov-Smirnov (KS) test value of the URL;
calculating the KL distance (Kullback-Leibler divergence) value of the URL;
calculating the Euclidean distance value of the URL;
calculating the ratio of vowels to consonants in the URL;
judging whether the URL contains an HTML entity, to obtain an HTML entity judgment result of the URL;
the second feature includes: the length of the URL to be detected, the number of special characters in the URL to be detected, the number of digits in the URL to be detected, whether special keywords exist in the URL to be detected, the ratio of digits to letters in the URL, the entropy of the URL, the KS test value of the URL, the KL distance value of the URL, the Euclidean distance value of the URL, the ratio of vowels to consonants in the URL, and the HTML entity judgment result of the URL. The second feature is extracted from the URL itself and reflects the overall characteristics of the URL.
In some embodiments, the special keywords are words set according to user requirements, such as "bank", "money", and the like.
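The KS, KL, and Euclidean values in the list above each compare two distributions, but the application does not say which distributions are compared. The Python sketch below shows one possible reading, assuming the letter-frequency distribution of the URL is compared with a reference letter-frequency distribution supplied by the caller. The function names, the reference text, and the smoothing constant `eps` are assumptions made for the example.

```python
import math
import string

ALPHABET = string.ascii_lowercase


def letter_distribution(text, alphabet=ALPHABET):
    """Relative frequency of each alphabet letter in text."""
    lowered = text.lower()
    counts = [lowered.count(ch) for ch in alphabet]
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(alphabet)


def kl_divergence(p, q, eps=1e-9):
    """KL divergence D(p || q); zero entries of q are floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def ks_statistic(p, q):
    """Largest gap between the two cumulative distributions."""
    cum_p = cum_q = largest = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        largest = max(largest, abs(cum_p - cum_q))
    return largest


# Hypothetical reference text; a real system might use letters from benign URLs.
reference = letter_distribution("the quick brown fox jumps over the lazy dog")
observed = letter_distribution("http://xkqzjw-pay.com/login")
features = [
    ks_statistic(observed, reference),
    kl_divergence(observed, reference),
    euclidean_distance(observed, reference),
]
```

All three values are zero when the URL's letter frequencies match the reference exactly and grow as the URL drifts away from it, which is the property a classifier can exploit.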
Further, extracting the third feature according to the words forming the URL to be detected includes:
adding the words forming the URL to be detected to a remaining word list;
judging one by one whether the words in the remaining word list are random characters, adding the words that are random characters to a random character word list, and keeping the words that were not added in the remaining word list;
judging one by one whether the words in the remaining word list whose length is greater than a set length threshold are compound words formed of multiple words, adding the compound words to a compound word list, and keeping the words that were not added in the remaining word list;
judging one by one whether the words in the remaining word list are misspelled, adding the misspelled words to a misspelled word list, and keeping the words that were not added in the remaining word list;
calculating one by one the similarity between the words in the remaining word list and brand names, judging words whose similarity is greater than a set similarity threshold to be similar words, adding the similar words to a similar word list, and keeping the words that were not added in the remaining word list;
calculating the length of the random character word list, the length of the compound word list, the length of the misspelled word list, the length of the similar word list, and the length of the remaining word list;
the third feature includes the length of the random character word list, the length of the compound word list, the length of the misspelled word list, the length of the similar word list, and the length of the remaining word list. The third feature is extracted from the words forming the URL and captures random characters, compound words, misspellings, words similar to known brand names, and the like, which makes the detection result more accurate. Features are extracted along three dimensions, namely the URL itself, the structural information of the URL, and the words forming the URL, so the extracted features comprehensively reflect the characteristics of the URL and make the detection result more accurate.
It should be noted that the order of the judging steps for random characters, compound words, misspelled words, and similar words in the third feature may be adjusted arbitrarily; the present application describes only one of the possible orders.
In some embodiments of the invention, a Python spell-check library is used to determine whether a word is misspelled. In some embodiments of the present invention, the similarity between a word and a brand name is calculated using the edit distance: the smaller the edit distance, the higher the similarity, and a word whose edit distance is smaller than a set edit distance threshold is determined to be a similar word.
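The cascade over the remaining word list, together with the edit-distance comparison against brand names, can be sketched as follows. This is a sketch under assumptions: the three judgments for random characters, compound words, and misspellings are passed in as caller-supplied predicates because the application leaves their details to other components, and the function names, thresholds, word list, and brand list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def third_feature(words, is_random, is_compound, is_misspelled, brands,
                  length_threshold=7, distance_threshold=2):
    """Return the five list lengths in the order described in the text."""
    remaining = list(words)

    def move(predicate):
        # Pull matching words out of the remaining list and return them.
        hits = [w for w in remaining if predicate(w)]
        remaining[:] = [w for w in remaining if not predicate(w)]
        return hits

    random_words = move(is_random)
    compound_words = move(lambda w: len(w) > length_threshold and is_compound(w))
    misspelled_words = move(is_misspelled)
    similar_words = move(lambda w: any(
        edit_distance(w.lower(), b) < distance_threshold for b in brands))
    return [len(random_words), len(compound_words), len(misspelled_words),
            len(similar_words), len(remaining)]


# Hypothetical words and stand-in predicates for illustration only
words = ["https", "xkqzjw", "securelogin", "recieve", "paypa1", "com"]
feature = third_feature(
    words,
    is_random=lambda w: w == "xkqzjw",
    is_compound=lambda w: w == "securelogin",
    is_misspelled=lambda w: w == "recieve",
    brands=["paypal"],
)
```

Because each stage removes its matches from the remaining list, every word is counted in exactly one list, and reordering the stages changes which list an ambiguous word ends up in.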
Further, judging one by one whether the words in the remaining word list are random characters includes:
building a Markov chain model based on an N-Gram language model to judge whether the words in the remaining word list are random characters. The N-Gram language model is trained on ordinary documents, so the training process is simple, and the language model can accurately judge whether a word consists of random characters.
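One way such a judgment could work is a character-level bigram Markov chain: transition counts are learned from ordinary words, and a word whose average transition log-probability is low is treated as random. The Python sketch below assumes N = 2, add-one smoothing, a tiny hypothetical training corpus, and a hypothetical threshold of -3.0; none of these values come from the application.

```python
import math
from collections import defaultdict

START, END = "^", "$"


def train_bigram_model(corpus_words):
    """Count character-to-character transitions, including word boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in corpus_words:
        padded = START + word.lower() + END
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts


def average_log_prob(word, counts, vocab_size=28):
    """Mean log-probability of the word's transitions with add-one smoothing.

    vocab_size assumes 26 letters plus the two boundary markers.
    """
    padded = START + word.lower() + END
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + vocab_size))
    return total / (len(padded) - 1)


def is_random_word(word, counts, threshold=-3.0):
    """Flag a word whose transitions are unlikely under the trained model."""
    return average_log_prob(word, counts) < threshold


# Hypothetical training corpus standing in for ordinary documents
corpus = ["login", "account", "secure", "update", "online", "banking",
          "service", "verify", "payment", "signin", "password", "customer",
          "support", "information", "security"]
model = train_bigram_model(corpus)
```

With this toy corpus, `is_random_word("login", model)` is false and `is_random_word("xqzkvj", model)` is true; a usable threshold would have to be tuned on a much larger corpus.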
Optionally, the trained URL detection model is: a random forest algorithm model, a decision tree model, a GBDT model, an XGBoost algorithm model, or an SVM model.
In some embodiments of the invention, the trained URL detection model is trained by:
first, collecting a large number of phishing website URLs and normal URLs, labeling the normal URLs as 0 and the phishing website URLs as 1;
then extracting the structural information of the collected URLs and the words forming them, following the same processing as for the URL to be detected;
extracting the features of each collected URL according to its structural information, the URL itself, and the words forming it;
and building a URL detection model from the collected URLs and the extracted features, using Bayesian optimization to search for the parameter combination that minimizes the cross-validation error, taking that combination as the optimal parameters of the URL detection model, and saving the model once modeling is complete; the saved model is the trained URL detection model. In other embodiments, the URL detection model is tuned using another optimization method such as grid search.
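The parameter search can be illustrated with the grid-search variant, which needs only the standard library. In the Python sketch below, a k-nearest-neighbour classifier is swapped in as a stand-in model purely to keep the example self-contained; it is not one of the models named in the application. The function names, the toy one-dimensional features, and the parameter grid are likewise assumptions. Labels follow the text: 0 for normal URLs and 1 for phishing URLs.

```python
import itertools


def knn_fit(X, y, n_neighbors):
    """Stand-in model: k-nearest neighbours simply stores the training data."""
    return {"X": X, "y": y, "k": n_neighbors}


def knn_predict_proba(model, x):
    """Fraction of the k nearest training samples labeled 1 (phishing)."""
    ranked = sorted(zip(model["X"], model["y"]),
                    key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)))
    nearest = [label for _, label in ranked[:model["k"]]]
    return sum(nearest) / len(nearest)


def cross_validation_error(X, y, params, folds=3):
    """Misclassification rate averaged over interleaved folds."""
    errors = 0
    for start in range(folds):
        held_out = set(range(start, len(X), folds))
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        model = knn_fit(train_X, train_y, **params)
        errors += sum(int(knn_predict_proba(model, X[i]) >= 0.5) != y[i]
                      for i in held_out)
    return errors / len(X)


def grid_search(X, y, grid, folds=3):
    """Return (lowest error, parameters); ties keep the earlier combination."""
    names = sorted(grid)
    best = None
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        error = cross_validation_error(X, y, params, folds)
        if best is None or error < best[0]:
            best = (error, params)
    return best


# Toy one-dimensional features: normal URLs (0) near 2, phishing URLs (1) near 9
X = [[1.0], [8.0], [2.0], [9.0], [3.0], [10.0],
     [1.5], [8.5], [2.5], [9.5], [2.0], [9.0]]
y = [0, 1] * 6
best_error, best_params = grid_search(X, y, {"n_neighbors": [1, 3]})
```

Bayesian optimization replaces the exhaustive loop in `grid_search` with a strategy that proposes the next parameter combination from the errors seen so far; the cross-validation error it minimizes is the same quantity.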
FIG. 3 is a block diagram of a phishing website URL detection system based on machine learning according to an embodiment of the invention. As shown in fig. 3, the system includes:
a URL analyzing unit, configured to analyze the URL to be detected and extract the structural information of the URL to be detected and the words forming the URL to be detected;
a feature extraction unit, configured to extract URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected;
and an abnormal URL detection unit, configured to input the URL features into a trained URL detection model for detection, to obtain the probability that the URL to be detected is an abnormal URL. The system can effectively detect the probability that a URL is abnormal, and has a simple structure and high detection accuracy.
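The three units can be pictured as a small pipeline in which each unit feeds the next. The Python sketch below is only an illustration of that composition: the class names, the deliberately tiny feature vector, and the hand-written `toy_model` with made-up weights are assumptions, and a real system would load the trained URL detection model instead.

```python
import math
import re
from urllib.parse import urlsplit


class UrlAnalyzingUnit:
    def analyze(self, url):
        """Extract (simplified) structural information and constituent words."""
        host = urlsplit(url).hostname or ""
        words = [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]
        return {"host": host, "words": words}


class FeatureExtractionUnit:
    def extract(self, url, analyzed):
        """Deliberately small feature vector: length, digit count, word count."""
        return [len(url), sum(ch.isdigit() for ch in url), len(analyzed["words"])]


class AbnormalUrlDetectionUnit:
    def __init__(self, model):
        self.model = model  # callable mapping a feature vector to a probability

    def detect(self, features):
        return self.model(features)


class DetectionSystem:
    def __init__(self, model):
        self.analyzer = UrlAnalyzingUnit()
        self.extractor = FeatureExtractionUnit()
        self.detector = AbnormalUrlDetectionUnit(model)

    def probability(self, url):
        analyzed = self.analyzer.analyze(url)
        features = self.extractor.extract(url, analyzed)
        return self.detector.detect(features)


def toy_model(features):
    """Hypothetical stand-in for a trained model; the weights are made up."""
    length, digits, _ = features
    score = 0.02 * length + 0.3 * digits - 2.0
    return 1.0 / (1.0 + math.exp(-score))


system = DetectionSystem(toy_model)
probability = system.probability("http://192.168.0.1/login.php?id=12345")
```

Injecting the model as a callable keeps the detection unit independent of the model family, so any of the listed model types could sit behind the same interface.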
In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to execute the machine learning-based phishing website URL detection method of the present application.
Those skilled in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program, which is stored in a storage medium and includes several instructions that enable a single-chip microcomputer, a chip, or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
While the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications are within the scope of the embodiments of the present invention. It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention will not be described separately for the various possible combinations.
In addition, any combination of the various embodiments of the present invention is also possible, and the same should be considered as disclosed in the embodiments of the present invention as long as it does not depart from the spirit of the embodiments of the present invention.

Claims (10)

1. A phishing website URL detection method based on machine learning is characterized by comprising the following steps:
analyzing the URL to be detected, and extracting the structural information of the URL to be detected and words forming the URL to be detected;
extracting URL features according to the URL to be detected, the structural information of the URL to be detected and words forming the URL to be detected;
and inputting the URL features into a trained URL detection model for detection, to obtain the probability that the URL to be detected is an abnormal URL.
2. A phishing website URL detection method as claimed in claim 1 wherein said structural information of the URL comprises: a URL sub-domain name, a URL suffix, and a URL path; analyzing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected, comprises:
analyzing the URL to be detected, and extracting the structural information of the URL according to the structure of the URL;
and splitting the URL at special characters, and extracting the words forming the URL to be detected.
3. A phishing website URL detection method as claimed in claim 2 wherein said URL features include a first feature, a second feature, and a third feature; extracting URL features according to the URL to be detected, the structural information of the URL, and the words forming the URL to be detected comprises:
extracting a first feature according to the structural information of the URL;
extracting a second feature according to the URL to be detected;
and extracting a third feature according to the words forming the URL to be detected.
4. A phishing website URL detection method as claimed in claim 3 wherein said extracting the first feature according to the structural information of the URL comprises:
judging whether the URL domain name is in IP address form, to obtain an IP address form judgment result of the URL;
judging whether the URL domain name is a DGA domain name, to obtain a domain name judgment result of the URL;
judging whether the URL to be detected is in a list of the top one million ranked domain names;
the first feature includes: the IP address form judgment result of the URL, the domain name judgment result of the URL, and the judgment result of whether the URL to be detected is in the list of the top one million ranked domain names.
5. A phishing website URL detection method as claimed in claim 3 wherein said extracting the second feature according to the URL to be detected comprises:
counting the length of the URL to be detected;
counting the number of special characters in the URL to be detected;
judging whether special keywords exist in the URL to be detected;
counting the number of digits in the URL to be detected;
calculating the ratio of digits to letters in the URL;
calculating the entropy of the URL;
calculating the Kolmogorov-Smirnov (KS) test value of the URL;
calculating the KL distance (Kullback-Leibler divergence) value of the URL;
calculating the Euclidean distance value of the URL;
calculating the ratio of vowels to consonants in the URL;
judging whether the URL contains an HTML entity, to obtain an HTML entity judgment result of the URL;
the second feature includes: the length of the URL to be detected, the number of special characters in the URL to be detected, the number of digits in the URL to be detected, whether special keywords exist in the URL to be detected, the ratio of digits to letters in the URL, the entropy of the URL, the KS test value of the URL, the KL distance value of the URL, the Euclidean distance value of the URL, the ratio of vowels to consonants in the URL, and the HTML entity judgment result of the URL.
6. A phishing website URL detection method as claimed in claim 3 wherein said extracting third features from said words constituting the URL to be detected comprises:
adding the words constituting the URL to be detected to a remaining word list;
judging, one by one, whether the words in the remaining word list are random characters, adding the words that are random characters to a random character word list, and keeping the words that were not added in the remaining word list;
judging, one by one, whether the words in the remaining word list whose length exceeds a set length threshold are compound words formed from a plurality of words, adding the compound words to a compound word list, and keeping the words that were not added in the remaining word list;
judging, one by one, whether the words in the remaining word list are misspelled, adding the misspelled words to an error word list, and keeping the words that were not added in the remaining word list;
calculating, one by one, the similarity between the words in the remaining word list and brand names, judging the words whose similarity exceeds a set similarity threshold to be similar words, adding the similar words to a similar word list, and keeping the words that were not added in the remaining word list;
calculating the length of the random character word list, the length of the compound word list, the length of the error word list, the length of the similar word list, and the length of the remaining word list;
the third features include the length of the random character word list, the length of the compound word list, the length of the error word list, the length of the similar word list, and the length of the remaining word list.
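The successive filtering of claim 6 can be sketched as a cascade in which each step moves matching words out of the remaining word list. In this sketch the three predicates `is_random`, `is_compound` and `is_misspelled` are hypothetical callables supplied by the caller, the two default thresholds are arbitrary, and brand similarity is measured with `difflib.SequenceMatcher` purely as a stand-in, since the claim does not name a similarity measure.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List


def third_features(words: Iterable[str], brands: List[str],
                   is_random: Callable[[str], bool],
                   is_compound: Callable[[str], bool],
                   is_misspelled: Callable[[str], bool],
                   length_threshold: int = 7,
                   similarity_threshold: float = 0.8) -> Dict[str, int]:
    remaining = list(words)

    def take(predicate: Callable[[str], bool]) -> List[str]:
        """Move every word satisfying predicate out of the remaining list."""
        nonlocal remaining
        taken, kept = [], []
        for word in remaining:
            (taken if predicate(word) else kept).append(word)
        remaining = kept
        return taken

    def brand_similarity(word: str) -> float:
        """Highest similarity between the word and any brand name (0.0 if none)."""
        return max((SequenceMatcher(None, word, b).ratio() for b in brands),
                   default=0.0)

    # The order of the steps follows the order in which the claim lists them.
    random_words = take(is_random)
    compound_words = take(lambda w: len(w) > length_threshold and is_compound(w))
    error_words = take(is_misspelled)
    similar_words = take(lambda w: brand_similarity(w) > similarity_threshold)

    return {
        "random": len(random_words),
        "compound": len(compound_words),
        "error": len(error_words),
        "similar": len(similar_words),
        "remaining": len(remaining),
    }
```

Because each step only sees the words left over by the previous one, a word is counted in exactly one of the five lists.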
7. A phishing website URL detection method as claimed in claim 6 wherein said one by one determining if words in said remaining word list are random characters comprises:
establishing a Markov chain model based on an N-Gram language model to judge whether the words in the remaining word list are random characters.
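One way to read claim 7 is a character-level Markov chain whose transition probabilities are estimated from N-gram counts over ordinary words, with a word judged random when its transitions are unusually improbable. The sketch below makes several assumptions that are not in the claim: N = 2 (bigrams), add-one smoothing, start and end markers `^` and `$`, scoring by average log-probability per transition, and a caller-chosen `threshold`.

```python
import math
from collections import defaultdict

# Assumed symbol set: a-z, 0-9, plus the start and end markers.
VOCAB_SIZE = 26 + 10 + 2


def train_bigram_chain(corpus_words):
    """Count character-to-character transitions over a corpus of ordinary words."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in corpus_words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts


def avg_log_prob(word, counts):
    """Average log-probability per transition, with add-one smoothing."""
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + VOCAB_SIZE))
    return total / (len(padded) - 1)


def is_random(word, counts, threshold):
    """Judge a word random when its transitions are less likely than threshold."""
    return avg_log_prob(word, counts) < threshold
```

With a chain trained on a handful of ordinary words, a string such as "xqzjkv" scores lower than "login". The training corpus and the threshold would need to be chosen on real data, which this toy setup does not attempt.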
8. A phishing website URL detection method as claimed in claim 1 wherein said trained URL detection model is: a random forest algorithm model, a decision tree model, a GBDT model, an XGBoost algorithm model or an SVM model.
9. A phishing website URL detection system based on machine learning, the system comprising:
a URL parsing unit, configured to parse the URL to be detected and to extract the structure information of the URL to be detected and the words constituting the URL to be detected;
a feature extraction unit, configured to extract URL features from the URL to be detected, the structure information of the URL to be detected, and the words constituting the URL to be detected;
and an abnormal URL detection unit, configured to input the URL features into a trained URL detection model for detection, to obtain the probability that the URL to be detected is an abnormal URL.
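The three units of claim 9 can be pictured as a small pipeline. The sketch below is only an illustration of the data flow: `ParsedUrl`, the rule of splitting words on non-alphanumeric characters, and the two placeholder features are assumptions, and the trained model is represented by any callable that maps a feature vector to a probability.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List
from urllib.parse import urlsplit


@dataclass
class ParsedUrl:
    url: str
    structure: Dict[str, str]   # scheme, host, path, query
    words: List[str]            # tokens that make up the URL


def parse_url(url: str) -> ParsedUrl:
    """URL parsing unit: structure information plus constituent words."""
    parts = urlsplit(url)
    structure = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }
    # Assumed tokenisation: split on every non-alphanumeric character.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url.lower()) if w]
    return ParsedUrl(url, structure, words)


def extract_features(parsed: ParsedUrl) -> List[float]:
    """Feature extraction unit; the two features are placeholders."""
    return [float(len(parsed.url)), float(len(parsed.words))]


def detect(url: str, model: Callable[[List[float]], float]) -> float:
    """Abnormal URL detection unit: probability that the URL is abnormal."""
    return model(extract_features(parse_url(url)))
```

In practice the placeholder feature vector would be replaced by the first, second and third features of the dependent method claims, and `model` by the prediction function of whichever trained classifier is used.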
10. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the machine learning based phishing website URL detection method of any one of claims 1-8.
CN202110231656.4A 2021-03-02 2021-03-02 Phishing website URL detection method and system based on machine learning Pending CN112948725A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110231656.4A CN112948725A (en) 2021-03-02 2021-03-02 Phishing website URL detection method and system based on machine learning

Publications (1)

Publication Number Publication Date
CN112948725A (en) 2021-06-11

Family

ID=76247228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110231656.4A Pending CN112948725A (en) 2021-03-02 2021-03-02 Phishing website URL detection method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN112948725A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114928472A (en) * 2022-04-20 2022-08-19 哈尔滨工业大学(威海) Method for filtering bad site grey list based on full-volume circulation main domain name
CN116633684A (en) * 2023-07-19 2023-08-22 中移(苏州)软件技术有限公司 Phishing detection method, system, electronic device and readable storage medium
CN117176483A (en) * 2023-11-03 2023-12-05 北京艾瑞数智科技有限公司 Abnormal URL identification method and device and related products

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106789888A (en) * 2016-11-18 2017-05-31 重庆邮电大学 A phishing webpage detection method based on multi-feature fusion
CN107798080A (en) * 2017-10-13 2018-03-13 中国科学院信息工程研究所 A similar sample set construction method for phishing URL detection
CN107992469A (en) * 2017-10-13 2018-05-04 中国科学院信息工程研究所 A phishing URL detection method and system based on word sequences
CN109274632A (en) * 2017-07-12 2019-01-25 中国移动通信集团广东有限公司 A website recognition method and device
CN110290116A (en) * 2019-06-04 2019-09-27 中山大学 A malicious domain name detection method based on knowledge graph
US20190361998A1 (en) * 2018-05-24 2019-11-28 Paypal, Inc. Efficient random string processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
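Zhang Meng: "Research on URL Security Detection Technology Based on Machine Learning", China Master's Theses Full-text Database, Information Science and Technology *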

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114928472A (en) * 2022-04-20 2022-08-19 哈尔滨工业大学(威海) Method for filtering bad site grey list based on full-volume circulation main domain name
CN114928472B (en) * 2022-04-20 2023-07-18 哈尔滨工业大学(威海) Bad site gray list filtering method based on full circulation main domain name
CN116633684A (en) * 2023-07-19 2023-08-22 中移(苏州)软件技术有限公司 Phishing detection method, system, electronic device and readable storage medium
CN116633684B (en) * 2023-07-19 2023-10-13 中移(苏州)软件技术有限公司 Phishing detection method, system, electronic device and readable storage medium
CN117176483A (en) * 2023-11-03 2023-12-05 北京艾瑞数智科技有限公司 Abnormal URL identification method and device and related products

Similar Documents

Publication Publication Date Title
US10178107B2 (en) Detection of malicious domains using recurring patterns in domain names
Tan et al. PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder
US10404745B2 (en) Automatic phishing email detection based on natural language processing techniques
EP2803031B1 (en) Machine-learning based classification of user accounts based on email addresses and other account information
US10033757B2 (en) Identifying malicious identifiers
Mahajan et al. Phishing website detection using machine learning algorithms
Kiruthiga et al. Phishing websites detection using machine learning
Chu et al. Protect sensitive sites from phishing attacks using features extractable from inaccessible phishing URLs
Buber et al. NLP based phishing attack detection from URLs
US20150067833A1 (en) Automatic phishing email detection based on natural language processing techniques
CN112948725A (en) Phishing website URL detection method and system based on machine learning
Das Guptta et al. Modeling hybrid feature-based phishing websites detection using machine learning techniques
Tan et al. Phishing website detection using URL-assisted brand name weighting system
Joshi et al. Phishing attack detection using feature selection techniques
CN110572359A (en) Phishing webpage detection method based on machine learning
Deshpande et al. Detection of phishing websites using Machine Learning
CN116917894A (en) Detecting phishing URLs using a converter
Nowroozi et al. An adversarial attack analysis on malicious advertisement url detection framework
CN115314236A (en) System and method for detecting phishing domains in a Domain Name System (DNS) record set
Valiyaveedu et al. Survey and analysis on AI based phishing detection techniques
US11647046B2 (en) Fuzzy inclusion based impersonation detection
Dangwal et al. Feature selection for machine learning-based phishing websites detection
Franchina et al. Detecting phishing e-mails using Text Mining and features analysis
Hossain et al. PhishRescue: A Stacked Ensemble Model to Identify Phishing Website Using Lexical Features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210611
