CN112948725A - Phishing website URL detection method and system based on machine learning - Google Patents
- Publication number
- CN112948725A, CN202110231656.4A
- Authority
- CN
- China
- Prior art keywords
- url
- detected
- words
- word list
- extracting
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention provides a phishing website URL detection method and system based on machine learning, and belongs to the field of information security. The method comprises the following steps: analyzing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected; extracting URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected; and inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL. Compared with traditional blacklist technology, the method trains a model on features extracted from URLs and uses it for prediction, which gives wider coverage and more accurate detection results. Because detection uses a trained URL model, frequent updating is not needed and few resources are occupied, so the method can run on an ordinary computer and meet the needs of a large number of users.
Description
Technical Field
The invention relates to the field of information security, and in particular to a phishing website URL detection method based on machine learning and a phishing website URL detection system based on machine learning.
Background
Phishing is a major problem on today's Internet, and many users fall victim to the deceptive methods of criminals. Phishing is a fraud technique that uses spoofed email as its primary medium and then obtains the desired information from the victim, such as usernames, passwords, credit card numbers, and bank account details, through a deceptive website.
The action requested in such an email is typically to open a Web link and fill in sensitive personal information on the Web page, or to reply to the email with personal identity or bank information. After clicking the Web link provided in the deceptive email, the user is directed to the phishing website created by the phisher. Because the phishing website looks similar to the original website, the user often fails to recognize it as malicious and enters the requested information, and is thereby successfully phished. In addition to email, an attacker may also lead users to malicious links by embedding advertising links on legitimate websites. Furthermore, in some cases, a compromised DNS may redirect users to abnormal websites and phishing websites.
Blacklisting remains the most common defense against such phishing websites; it uses an approximate matching algorithm to check whether a suspicious URL is present in the blacklist. However, this method has the following technical problems that it cannot solve:
1. Blacklisting is a passive defense that requires constant maintenance and frequent updating (deleting expired URLs and adding new phishing URLs), which is not a simple matter.
2. An attacker may take down a phishing webpage and then implant it into a server that is considered secure, in which case the blacklist-based approach fails to detect the phishing website.
3. Blacklisting cannot cope with a continuously growing blacklist: the number of entries increases over time, and the blacklist data occupies more and more system resources. Therefore, blacklist technology can no longer meet users' requirements.
Disclosure of Invention
Compared with traditional blacklist technology, the URL detection method of the invention trains a model on features extracted from URLs and uses it for prediction, which gives wider coverage and more accurate detection results. Because detection uses a trained URL model, frequent updating is not needed and few resources are occupied, so the method can run on an ordinary computer and meet the needs of a large number of users.
In order to achieve the above object, a first aspect of the present invention provides a phishing website URL detection method based on machine learning, the method including:
analyzing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected;
extracting URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected;
and inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL.
Optionally, the structural information of the URL includes: a URL sub-domain name, a URL suffix, and a URL path; the analyzing the URL to be detected and extracting the structural information of the URL to be detected and the words forming the URL to be detected comprises:
analyzing the URL to be detected, and extracting the structural information of the URL according to the structure of the URL;
and splitting the URL at special characters, and extracting the words forming the URL to be detected. Once the URL has been analyzed and decomposed, more accurate features can be extracted, which improves detection accuracy.
Optionally, the URL features include a first feature, a second feature, and a third feature; the extracting URL features according to the URL to be detected, the structural information of the URL, and the words forming the URL to be detected comprises:
extracting the first feature according to the structural information of the URL;
extracting the second feature according to the URL to be detected;
and extracting the third feature according to the words forming the URL to be detected.
Further, the extracting the first feature according to the structural information of the URL includes:
judging whether the URL domain name is in IP address form, to obtain an IP address form judgment result for the URL;
judging whether the URL domain name is a DGA domain name, to obtain a DGA domain name judgment result for the URL;
judging whether the URL to be detected exists in a list of the top one million ranked domain names;
the first feature includes: the IP address form judgment result of the URL, the DGA domain name judgment result of the URL, and the result of judging whether the URL to be detected exists in the list of the top one million ranked domain names. The first feature is extracted from the structural information of the URL and reflects the structural characteristics of the URL.
Further, the extracting the second feature according to the URL to be detected includes:
counting the length of the URL to be detected;
counting the number of special characters in the URL to be detected;
judging whether special keywords exist in the URL to be detected;
counting the number of digits in the URL to be detected;
calculating the ratio of digits to letters in the URL;
calculating the entropy of the URL;
calculating the KS test value of the URL;
calculating the KL distance value of the URL;
calculating the Euclidean distance value of the URL;
calculating the ratio of vowels to consonants in the URL;
judging whether the URL contains an HTML entity, to obtain an HTML entity judgment result for the URL;
the second feature includes: the length of the URL to be detected, the number of special characters in the URL to be detected, the number of digits in the URL to be detected, whether special keywords exist in the URL to be detected, the ratio of digits to letters in the URL, the entropy of the URL, the KS test value of the URL, the KL distance value of the URL, the Euclidean distance value of the URL, the ratio of vowels to consonants in the URL, and the HTML entity judgment result of the URL. The second feature is extracted from the URL itself and reflects the overall characteristics of the URL.
Further, the extracting the third feature according to the words forming the URL to be detected includes:
adding the words forming the URL to be detected to a remaining word list;
judging one by one whether the words in the remaining word list are random characters, adding the words that are random characters to a random character word list, and keeping the words that were not added in the remaining word list;
judging one by one whether the words in the remaining word list whose length is greater than a set length threshold are compound words formed from several words, adding the compound words to a compound word list, and keeping the words that were not added in the remaining word list;
judging one by one whether the words in the remaining word list are misspelled, adding the misspelled words to an error word list, and keeping the words that were not added in the remaining word list;
calculating one by one the similarity between the words in the remaining word list and brand names, judging words whose similarity is greater than a set similarity threshold to be similar words, adding the similar words to a similar word list, and keeping the words that were not added in the remaining word list;
calculating the length of the random character word list, the length of the compound word list, the length of the error word list, the length of the similar word list, and the length of the remaining word list;
the third feature includes the length of the random character word list, the length of the compound word list, the length of the error word list, the length of the similar word list, and the length of the remaining word list. The third feature is extracted from the words composing the URL and captures characteristics such as random characters, compound words, misspellings, and words similar to known brand names, which makes the detection result more accurate. Features are thus extracted along three dimensions: the URL itself, the structural information of the URL, and the words forming the URL. Together they reflect the characteristics of the URL comprehensively, which makes the detection result more accurate.
Further, the judging one by one whether the words in the remaining word list are random characters includes:
establishing a Markov chain model based on an N-Gram language model to judge whether the words in the remaining word list are random characters. The N-Gram language model is trained on ordinary documents; the training process is simple, and the language model can accurately judge whether a word consists of random characters.
Optionally, the trained URL detection model is: a random forest model, a decision tree model, a GBDT model, an XGBoost model, or an SVM model.
A second aspect of the present invention provides a phishing website URL detection system based on machine learning, the system including:
a URL analyzing unit, used for analyzing the URL to be detected and extracting the structural information of the URL to be detected and the words forming the URL to be detected;
a feature extraction unit, used for extracting URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected;
and an abnormal URL detection unit, used for inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL. The system can effectively determine the probability that a URL is abnormal, and has a simple structure and high detection accuracy.
In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to perform the machine learning-based phishing website URL detection method described herein.
Through the above technical solution, the URL detection method trains a model on features extracted from URLs and uses it for prediction, which gives wider coverage and more accurate detection results. Because detection uses a trained URL model, frequent updating is not needed and few resources are occupied, so the method can run on an ordinary computer and meet the needs of a large number of users.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a flowchart of a method for detecting URLs in phishing websites based on machine learning according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example URL structure provided by the present invention;
FIG. 3 is a block diagram of a phishing website URL detection system based on machine learning according to an embodiment of the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
FIG. 1 is a flowchart of a phishing website URL detection method based on machine learning according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
analyzing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected;
extracting URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected;
and inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL.
Optionally, the structural information of the URL includes: a URL sub-domain name, a URL suffix, and a URL path; the analyzing the URL to be detected and extracting the structural information of the URL to be detected and the words forming the URL to be detected comprises:
analyzing the URL to be detected, and extracting the structural information of the URL according to the structure of the URL;
and splitting the URL at special characters, and extracting the words forming the URL to be detected. Once the URL has been analyzed and decomposed, more accurate features can be extracted, which improves detection accuracy.
The basic structure of a URL is shown in FIG. 2; the entire URL includes a protocol, a domain name, a suffix, a path, and so on. A URL is composed of meaningful or meaningless words and of special characters that separate the important components of the address. For example, a dot (".") separates a domain name from a sub-domain name. In the path, folders are separated by "/" symbols. Furthermore, each component of the URL may itself contain delimiters, that is, special characters such as "-", "?", and "=".
For the URL in FIG. 2, the extracted sub-domain name, domain name, suffix, and path information are 'www', 'abc-def', 'com', and 'details/index. The extracted words include: https, www, abc, def, com, details, Index, html.
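The parsing step above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `parse_url` and the sample URL are assumptions modeled on the FIG. 2 example, and the suffix is naively taken to be the last dot-separated label of the host, so multi-label suffixes are not handled.

```python
import re
from urllib.parse import urlsplit


def parse_url(url):
    """Split a URL into structural parts and its constituent words."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Naive assumption: last label = suffix, the label before it = domain,
    # everything earlier = sub-domain name.
    suffix = labels[-1] if len(labels) >= 2 else ""
    domain = labels[-2] if len(labels) >= 2 else host
    subdomain = ".".join(labels[:-2])
    # Words are the runs of letters and digits between special characters.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]
    return {
        "scheme": parts.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "suffix": suffix,
        "path": parts.path.lstrip("/"),
        "words": words,
    }


# Hypothetical sample URL, used only for illustration.
info = parse_url("https://www.abc-def.com/details/index.html")
```

Splitting on every non-alphanumeric character treats ".", "/", "-", "?", and "=" uniformly as delimiters, which matches the word list given for the example.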
Optionally, the URL features include a first feature, a second feature, and a third feature; the extracting URL features according to the URL to be detected, the structural information of the URL, and the words forming the URL to be detected comprises:
extracting the first feature according to the structural information of the URL;
extracting the second feature according to the URL to be detected;
and extracting the third feature according to the words forming the URL to be detected.
Further, the extracting the first feature according to the structural information of the URL includes:
judging whether the URL domain name is in IP address form, to obtain an IP address form judgment result for the URL;
judging whether the URL domain name is a DGA domain name, to obtain a DGA domain name judgment result for the URL;
judging whether the URL to be detected exists in a list of the top one million ranked domain names;
the first feature includes: the IP address form judgment result of the URL, the DGA domain name judgment result of the URL, and the result of judging whether the URL to be detected exists in the list of the top one million ranked domain names. The first feature is extracted from the structural information of the URL and reflects the structural characteristics of the URL.
It should be noted that DGA stands for domain generation algorithm, an algorithm that generates pseudo-random domain names, and a DGA domain name is a domain name generated by such an algorithm.
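As a rough sketch of the first feature, the three judgments can be encoded as a vector of 0/1 values. The names `is_ip_host` and `first_feature` are hypothetical; the source does not say how a DGA domain name is recognized or which ranking list is used, so the DGA check is passed in as a caller-supplied function and the top-domain list as a plain set.

```python
import ipaddress


def is_ip_host(host):
    """Return True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def first_feature(host, registered_domain, top_domains, is_dga):
    """Build [IP-form flag, DGA flag, top-list membership flag].

    `is_dga` is a placeholder callable and `top_domains` a set of domain
    names; both are assumptions standing in for components the source
    leaves unspecified.
    """
    return [
        int(is_ip_host(host)),
        int(is_dga(registered_domain)),
        int(registered_domain in top_domains),
    ]


never_dga = lambda domain: False  # stub used only for this illustration
ranked = {"example.com"}          # stand-in for a top-one-million list
by_ip = first_feature("192.168.0.1", "192.168.0.1", ranked, never_dga)
by_name = first_feature("www.example.com", "example.com", ranked, never_dga)
```

Keeping the DGA classifier and the ranking list as parameters lets either be replaced without touching the feature layout.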
Further, the extracting the second feature according to the URL to be detected includes:
counting the length of the URL to be detected;
counting the number of special characters in the URL to be detected;
judging whether special keywords exist in the URL to be detected;
counting the number of digits in the URL to be detected;
calculating the ratio of digits to letters in the URL;
calculating the entropy of the URL;
calculating the KS test value of the URL;
calculating the KL distance value of the URL;
calculating the Euclidean distance value of the URL;
calculating the ratio of vowels to consonants in the URL;
judging whether the URL contains an HTML entity, to obtain an HTML entity judgment result for the URL;
the second feature includes: the length of the URL to be detected, the number of special characters in the URL to be detected, the number of digits in the URL to be detected, whether special keywords exist in the URL to be detected, the ratio of digits to letters in the URL, the entropy of the URL, the KS test value of the URL, the KL distance value of the URL, the Euclidean distance value of the URL, the ratio of vowels to consonants in the URL, and the HTML entity judgment result of the URL. The second feature is extracted from the URL itself and reflects the overall characteristics of the URL.
In some embodiments, the special keywords are words set according to user requirements, such as "bank" and "money".
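The statistics that make up the second feature can be sketched as below. Several details are assumptions, because the source does not fix them: the set of special characters, the regular expression used to spot an HTML entity, and above all the reference distribution against which the KS, KL, and Euclidean values are computed (here both distributions are simply passed in as lists of probabilities over the same alphabet). All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed values: the source does not enumerate the special characters and
# gives "bank" and "money" only as example keywords.
SPECIAL_CHARS = set("-_.?=&/@%:~#+")
KEYWORDS = ("bank", "money")
VOWELS = set("aeiou")
HTML_ENTITY = re.compile(r"&(#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def shannon_entropy(text):
    """Entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def safe_ratio(a, b):
    return a / b if b else 0.0


def basic_stats(url):
    lower = url.lower()
    digits = sum(ch.isdigit() for ch in url)
    letters = sum(ch.isalpha() for ch in url)
    vowels = sum(ch in VOWELS for ch in lower)
    consonants = sum(ch.isalpha() and ch not in VOWELS for ch in lower)
    return {
        "length": len(url),
        "special_chars": sum(ch in SPECIAL_CHARS for ch in url),
        "has_keyword": int(any(k in lower for k in KEYWORDS)),
        "digits": digits,
        "digit_letter_ratio": safe_ratio(digits, letters),
        "entropy": shannon_entropy(url),
        "vowel_consonant_ratio": safe_ratio(vowels, consonants),
        "has_html_entity": int(bool(HTML_ENTITY.search(url))),
    }


def distribution(text, alphabet):
    """Relative frequency of each alphabet character in the text."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    return [counts[ch] / total if total else 0.0 for ch in alphabet]


def kl_distance(p, q, eps=1e-9):
    """KL divergence of p from q; zero entries of q are floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def ks_statistic(p, q):
    """Largest gap between the two cumulative distributions."""
    gap = cum_p = cum_q = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        gap = max(gap, abs(cum_p - cum_q))
    return gap


stats = basic_stats("http://bank1.com/a")  # hypothetical sample URL
```

In practice the reference distribution might be estimated from a corpus of normal URLs or ordinary text; that choice is not stated in the source.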
Further, the extracting the third feature according to the words forming the URL to be detected includes:
adding the words forming the URL to be detected to a remaining word list;
judging one by one whether the words in the remaining word list are random characters, adding the words that are random characters to a random character word list, and keeping the words that were not added in the remaining word list;
judging one by one whether the words in the remaining word list whose length is greater than a set length threshold are compound words formed from several words, adding the compound words to a compound word list, and keeping the words that were not added in the remaining word list;
judging one by one whether the words in the remaining word list are misspelled, adding the misspelled words to an error word list, and keeping the words that were not added in the remaining word list;
calculating one by one the similarity between the words in the remaining word list and brand names, judging words whose similarity is greater than a set similarity threshold to be similar words, adding the similar words to a similar word list, and keeping the words that were not added in the remaining word list;
calculating the length of the random character word list, the length of the compound word list, the length of the error word list, the length of the similar word list, and the length of the remaining word list;
the third feature includes the length of the random character word list, the length of the compound word list, the length of the error word list, the length of the similar word list, and the length of the remaining word list. The third feature is extracted from the words composing the URL and captures characteristics such as random characters, compound words, misspellings, and words similar to known brand names, which makes the detection result more accurate. Features are thus extracted along three dimensions: the URL itself, the structural information of the URL, and the words forming the URL. Together they reflect the characteristics of the URL comprehensively, which makes the detection result more accurate.
It should be noted that the order of the judging steps for random characters, compound words, error words, and similar words in the third feature may be adjusted arbitrarily; the present application describes only one of the possible orders.
In some embodiments of the invention, a Python spell-checking library is used to determine whether a word is misspelled. In some embodiments of the invention, the similarity between a word and a brand name is calculated using the edit distance: the smaller the edit distance, the higher the similarity, and a word whose edit distance is smaller than a set edit distance threshold is judged to be a similar word.
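The cascade of word lists can be sketched as follows, in the order described above. The random-character, compound-word, and misspelling checks are passed in as caller-supplied predicates because the source does not name a specific library or algorithm for them; the function names, the length threshold of 7, and the edit distance threshold of 2 are all assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def split_off(words, predicate):
    """Return (matched, rest), preserving the original order."""
    matched, rest = [], []
    for w in words:
        (matched if predicate(w) else rest).append(w)
    return matched, rest


def third_feature(words, is_random, is_compound, is_misspelled, brands,
                  length_threshold=7, distance_threshold=2):
    """Return the lengths of the five word lists, in the source's order."""
    remaining = list(words)
    random_words, remaining = split_off(remaining, is_random)
    compound_words, remaining = split_off(
        remaining, lambda w: len(w) > length_threshold and is_compound(w))
    error_words, remaining = split_off(remaining, is_misspelled)
    similar_words, remaining = split_off(
        remaining,
        lambda w: any(edit_distance(w.lower(), b) < distance_threshold
                      for b in brands))
    return [len(random_words), len(compound_words), len(error_words),
            len(similar_words), len(remaining)]


# Stub predicates and sample words, used only for this illustration.
sample = ["xkqzvb", "securelogin", "acount", "paypa1", "www"]
feature = third_feature(
    sample,
    is_random=lambda w: w == "xkqzvb",
    is_compound=lambda w: w == "securelogin",
    is_misspelled=lambda w: w == "acount",
    brands=["paypal"],
)
```

Because each step only sees the words left over by the previous one, changing the order of the four checks can change which list a word ends up in, which is why the order is a design choice.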
Further, the judging one by one whether the words in the remaining word list are random characters includes:
establishing a Markov chain model based on an N-Gram language model to judge whether the words in the remaining word list are random characters. The N-Gram language model is trained on ordinary documents; the training process is simple, and the language model can accurately judge whether a word consists of random characters.
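One way to realize such a check is a character-level bigram (N = 2) Markov chain that scores a word by its average log transition probability and flags low-scoring words as random. The sketch below assumes add-one smoothing, a lowercase Latin alphabet, a tiny training text, and a threshold of -3.0; none of these choices comes from the source, and the class name is hypothetical.

```python
import math
from collections import defaultdict


class CharBigramModel:
    """Character bigram Markov chain with add-one smoothing."""

    def __init__(self, alphabet="abcdefghijklmnopqrstuvwxyz"):
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        """Count character transitions inside each whitespace-separated word."""
        for word in text.lower().split():
            chars = [c for c in word if c in self.alphabet]
            for a, b in zip(chars, chars[1:]):
                self.counts[a][b] += 1

    def avg_log_prob(self, word):
        """Mean log probability of the word's character transitions."""
        chars = [c for c in word.lower() if c in self.alphabet]
        pairs = list(zip(chars, chars[1:]))
        if not pairs:
            return 0.0
        total = 0.0
        for a, b in pairs:
            row = self.counts.get(a, {})
            prob = (row.get(b, 0) + 1) / (sum(row.values()) + len(self.alphabet))
            total += math.log(prob)
        return total / len(pairs)

    def is_random(self, word, threshold):
        return self.avg_log_prob(word) < threshold


model = CharBigramModel()
# Tiny stand-in corpus; a real model would be trained on ordinary documents.
model.train("the quick brown fox jumps over the lazy dog "
            "and then the other thing there")
```

The threshold would normally be tuned on held-out examples of ordinary words and random strings; with the toy corpus above, `model.is_random("xqzkv", -3.0)` flags the string while `model.is_random("there", -3.0)` does not.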
Optionally, the trained URL detection model is: a random forest model, a decision tree model, a GBDT model, an XGBoost model, or an SVM model.
In some embodiments of the invention, the trained URL detection model is obtained by the following training procedure:
first, collecting a large number of phishing website URLs and normal URLs, labeling the normal URLs as 0 and the phishing website URLs as 1;
then, extracting the structural information of each collected URL and the words forming it, following the same processing applied to the URL to be detected;
extracting the features of each collected URL according to its structural information, the URL itself, and the words forming it;
and constructing a URL detection model from the collected URLs and the extracted features, using Bayesian optimization to search for the parameter combination that minimizes the cross-validation error, taking that combination as the optimal parameters of the URL detection model, and saving the model once modeling is complete; the saved model is the trained URL detection model. In other embodiments, the URL detection model is tuned using another optimization method, such as grid search.
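The parameter search can be illustrated with the grid search variant mentioned for other embodiments. The sketch below is deliberately reduced: instead of one of the listed models it uses a toy learner that thresholds a single feature, and the only parameter searched is which feature to use. All names (`k_fold_indices`, `cross_val_error`, `grid_search`, `fit_threshold`) and the sample data are hypothetical; labels follow the convention above (0 for normal, 1 for phishing).

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k interleaved folds."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def cross_val_error(fit, X, y, params, k=3):
    """Fraction of misclassified held-out samples across the k folds."""
    errors = total = 0
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train], **params)
        errors += sum(predict(X[i]) != y[i] for i in test)
        total += len(test)
    return errors / total


def grid_search(fit, X, y, grid, k=3):
    """Return the parameter combination with the lowest cross-validation error."""
    names = sorted(grid)
    candidates = [dict(zip(names, combo))
                  for combo in product(*(grid[name] for name in names))]
    return min(candidates, key=lambda p: cross_val_error(fit, X, y, p, k))


def fit_threshold(X, y, feature):
    """Toy learner: cut one feature at the midpoint of the two class means."""
    pos = [row[feature] for row, label in zip(X, y) if label == 1]
    neg = [row[feature] for row, label in zip(X, y) if label == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    cut = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda row: int(sign * (row[feature] - cut) > 0)


# Made-up feature rows: column 0 is uninformative, column 1 separates classes.
X = [[5, 2.0], [5, 4.0], [6, 2.1], [6, 4.1], [7, 2.2], [7, 4.2]]
y = [0, 1, 0, 1, 0, 1]
best = grid_search(fit_threshold, X, y, {"feature": [0, 1]})
```

With a real model, `fit` would wrap the chosen classifier and the grid would hold its hyperparameters; Bayesian optimization replaces the exhaustive loop over candidates with a guided search but minimizes the same cross-validation error.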
FIG. 3 is a block diagram of a phishing website URL detection system based on machine learning according to an embodiment of the invention. As shown in FIG. 3, the system includes:
a URL analyzing unit, used for analyzing the URL to be detected and extracting the structural information of the URL to be detected and the words forming the URL to be detected;
a feature extraction unit, used for extracting URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected;
and an abnormal URL detection unit, used for inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL. The system can effectively determine the probability that a URL is abnormal, and has a simple structure and high detection accuracy.
In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to execute the machine learning-based phishing website URL detection method described in the present application.
Those skilled in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program, which is stored in a storage medium and includes several instructions that enable a single-chip microcomputer, a chip, or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
While the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications are within the scope of the embodiments of the present invention. It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention will not be described separately for the various possible combinations.
In addition, any combination of the various embodiments of the present invention is also possible, and the same should be considered as disclosed in the embodiments of the present invention as long as it does not depart from the spirit of the embodiments of the present invention.
Claims (10)
1. A phishing website URL detection method based on machine learning, characterized by comprising the following steps:
analyzing the URL to be detected, and extracting the structural information of the URL to be detected and the words forming the URL to be detected;
extracting URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected;
and inputting the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL.
2. The phishing website URL detection method as claimed in claim 1, wherein the structural information of the URL comprises: a URL sub-domain name, a URL suffix, and a URL path; the analyzing the URL to be detected and extracting the structural information of the URL to be detected and the words forming the URL to be detected comprises:
analyzing the URL to be detected, and extracting the structural information of the URL according to the structure of the URL;
and splitting the URL at special characters, and extracting the words forming the URL to be detected.
3. The phishing website URL detection method as claimed in claim 2, wherein the URL features include a first feature, a second feature, and a third feature; the extracting URL features according to the URL to be detected, the structural information of the URL, and the words forming the URL to be detected comprises:
extracting the first feature according to the structural information of the URL;
extracting the second feature according to the URL to be detected;
and extracting the third feature according to the words forming the URL to be detected.
4. The phishing website URL detection method as claimed in claim 3, wherein the extracting the first feature according to the structural information of the URL comprises:
judging whether the URL domain name is in IP address form, to obtain an IP address form judgment result for the URL;
judging whether the URL domain name is a DGA domain name, to obtain a DGA domain name judgment result for the URL;
judging whether the URL to be detected exists in a list of the top one million ranked domain names;
the first feature includes: the IP address form judgment result of the URL, the DGA domain name judgment result of the URL, and the result of judging whether the URL to be detected exists in the list of the top one million ranked domain names.
5. The phishing website URL detection method as claimed in claim 3, wherein the extracting the second feature according to the URL to be detected comprises:
counting the length of the URL to be detected;
counting the number of special characters in the URL to be detected;
judging whether special keywords exist in the URL to be detected;
counting the number of digits in the URL to be detected;
calculating the ratio of digits to letters in the URL;
calculating the entropy of the URL;
calculating the KS test value of the URL;
calculating the KL distance value of the URL;
calculating the Euclidean distance value of the URL;
calculating the ratio of vowels to consonants in the URL;
judging whether the URL contains an HTML entity, to obtain an HTML entity judgment result for the URL;
the second feature includes: the length of the URL to be detected, the number of special characters in the URL to be detected, the number of digits in the URL to be detected, whether special keywords exist in the URL to be detected, the ratio of digits to letters in the URL, the entropy of the URL, the KS test value of the URL, the KL distance value of the URL, the Euclidean distance value of the URL, the ratio of vowels to consonants in the URL, and the HTML entity judgment result of the URL.
6. A phishing website URL detection method as claimed in claim 3 wherein said extracting third feature from said words constituting the URL to be tested comprises:
adding the words forming the URL to be tested into a remaining word list;
judging whether the words in the remaining word list are random characters one by one, adding the words which are the random characters into a random character word list, and keeping the words which are not added in the remaining word list;
judging one by one whether the words in the remaining word list whose length is greater than a set length threshold are compound words formed from a plurality of words, adding the compound words into a compound word list, and keeping the words which are not added in the remaining word list;
judging one by one whether the words in the remaining word list are misspelled, adding the misspelled words into an error word list, and keeping the words which are not added in the remaining word list;
calculating the similarity between the words in the remaining word list and the brand names one by one, judging the words with the similarity larger than a set similarity threshold value as similar words, adding the similar words into a similar word list, and keeping the words which are not added in the remaining word list;
calculating a length of the random character word list, a length of the compound word list, a length of the error word list, a length of the similar word list, and a length of the remaining word list;
the third feature includes a length of the random character word list, a length of the compound word list, a length of the error word list, a length of the similar word list, and a length of the remaining word list.
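The cascade in claim 6 can be sketched as below. Only the order of the stages and the five list lengths come from the claim; the dictionary, the brand list, both thresholds, and the three helper heuristics are hypothetical stand-ins chosen to keep the example self-contained. Similarity is computed with `difflib.SequenceMatcher` from the Python standard library, which is an assumption rather than the measure used in the patent.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical vocabulary and brand list, for illustration only.
DICTIONARY = {"secure", "login", "account", "bank", "update", "mail"}
BRANDS = ("paypal", "google", "apple")


def looks_random(word):
    """Placeholder heuristic: treat a word with no vowels as random."""
    return not any(ch in "aeiou" for ch in word.lower())


def is_compound(word):
    """True if the word splits into two dictionary words."""
    return any(word[:i] in DICTIONARY and word[i:] in DICTIONARY
               for i in range(1, len(word)))


def is_misspelled(word):
    """True if the word is not in the dictionary but is close to an entry."""
    if word in DICTIONARY:
        return False
    return bool(get_close_matches(word, DICTIONARY, n=1, cutoff=0.8))


def brand_similarity(word):
    """Highest similarity ratio between the word and any brand name."""
    return max(SequenceMatcher(None, word, brand).ratio() for brand in BRANDS)


def third_feature(words, length_threshold=7, similarity_threshold=0.8):
    """Run the stages in claim order and return the length of each list."""
    stages = [
        ("random", looks_random),
        ("compound", lambda w: len(w) > length_threshold and is_compound(w)),
        ("error", is_misspelled),
        ("similar", lambda w: brand_similarity(w) > similarity_threshold),
    ]
    remaining = list(words)
    lists = {}
    for name, predicate in stages:
        lists[name] = []
        kept = []
        for word in remaining:
            if predicate(word):
                lists[name].append(word)  # word leaves the remaining list
            else:
                kept.append(word)
        remaining = kept
    lists["remaining"] = remaining
    return {name: len(items) for name, items in lists.items()}
```

Because each stage only sees the words that earlier stages left behind, a word is counted in exactly one list, which matches the "keeping the words which are not added in the remaining word list" wording of the claim.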
7. A phishing website URL detection method as claimed in claim 6, wherein said judging one by one whether the words in said remaining word list are random characters comprises:
establishing a Markov chain model based on the N-Gram language model to judge whether the words in the remaining word list are random characters.
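One way to read claim 7 is a first-order Markov chain over character bigrams: transition counts are learned from known words, and a word whose average transition log-probability is low is judged to be random. The sketch below follows that reading. The function names, the add-one smoothing, the smoothing constant, and the threshold are all assumptions; the claim does not fix the value of N or the decision rule.

```python
import math


def train_bigram_model(corpus_words):
    """Count character-to-character transitions, with start/end markers."""
    counts = {}
    for word in corpus_words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            row = counts.setdefault(a, {})
            row[b] = row.get(b, 0) + 1
    return counts


def avg_log_prob(word, counts, vocab_size=28):
    """Average log transition probability of the word under the model.

    Add-one smoothing keeps unseen transitions from having zero probability;
    `vocab_size` is an assumed number of possible next symbols.
    """
    padded = "^" + word.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for a, b in pairs:
        row = counts.get(a, {})
        prob = (row.get(b, 0) + 1) / (sum(row.values()) + vocab_size)
        total += math.log(prob)
    return total / len(pairs)


def is_random(word, counts, threshold=-3.0):
    """Judge a word as random when its average log-probability is low."""
    return avg_log_prob(word, counts) < threshold
```

A real model would be trained on a much larger word list, and the threshold would be tuned on labelled examples rather than fixed by hand.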
8. A phishing website URL detection method as claimed in claim 1, wherein said trained URL detection model is: a random forest model, a decision tree model, a GBDT model, an XGBoost model, or an SVM model.
9. A phishing website URL detection system based on machine learning, the system comprising:
a URL analyzing unit, configured to analyze the URL to be detected and extract the structural information of the URL to be detected and the words forming the URL to be detected;
a feature extraction unit, configured to extract URL features according to the URL to be detected, the structural information of the URL to be detected, and the words forming the URL to be detected;
and an abnormal URL detection unit, configured to input the URL features into a trained URL detection model for detection to obtain the probability that the URL to be detected is an abnormal URL.
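The three units of claim 9 map naturally onto three functions. The sketch below shows one possible arrangement, using `urllib.parse.urlsplit` from the Python standard library for the analyzing step. The field names, the word-splitting rule, and the `extract_features` and `model` callables are hypothetical; in particular, taking the last two host labels as domain and top-level domain is a simplification that mishandles suffixes such as `co.uk`.

```python
import re
from urllib.parse import urlsplit


def parse_url(url):
    """URL analyzing unit: return (structural information, words)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    structure = {
        "scheme": parts.scheme,
        "subdomain": ".".join(labels[:-2]),
        "domain": labels[-2] if len(labels) >= 2 else host,
        "tld": labels[-1] if len(labels) >= 2 else "",
        "path": parts.path,
        "query": parts.query,
    }
    # Split host, path, and query on anything that is not a letter or digit.
    text = f"{host} {parts.path} {parts.query}"
    words = [w for w in re.split(r"[^A-Za-z0-9]+", text) if w]
    return structure, words


def detect(url, extract_features, model):
    """Chain the three units and return the abnormal-URL probability.

    `extract_features(url, structure, words)` stands in for the feature
    extraction unit and `model(features)` for the trained detection model;
    both are supplied by the caller.
    """
    structure, words = parse_url(url)
    features = extract_features(url, structure, words)
    return model(features)
```

Keeping the model behind a plain callable means any of the model families named in claim 8 could be plugged in without changing the other two units.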
10. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the machine-learning-based phishing website URL detection method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110231656.4A CN112948725A (en) | 2021-03-02 | 2021-03-02 | Phishing website URL detection method and system based on machine learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110231656.4A CN112948725A (en) | 2021-03-02 | 2021-03-02 | Phishing website URL detection method and system based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112948725A (en) | 2021-06-11 |
Family
ID=76247228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110231656.4A Pending CN112948725A (en) | 2021-03-02 | 2021-03-02 | Phishing website URL detection method and system based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112948725A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106789888A (en) * | 2016-11-18 | 2017-05-31 | 重庆邮电大学 | Phishing webpage detection method based on multi-feature fusion |
CN107798080A (en) * | 2017-10-13 | 2018-03-13 | 中国科学院信息工程研究所 | Similar sample set construction method for phishing URL detection |
CN107992469A (en) * | 2017-10-13 | 2018-05-04 | 中国科学院信息工程研究所 | Phishing URL detection method and system based on word sequences |
CN109274632A (en) * | 2017-07-12 | 2019-01-25 | 中国移动通信集团广东有限公司 | Website recognition method and device |
CN110290116A (en) * | 2019-06-04 | 2019-09-27 | 中山大学 | Malicious domain name detection method based on knowledge graph |
US20190361998A1 (en) * | 2018-05-24 | 2019-11-28 | Paypal, Inc. | Efficient random string processing |
Non-Patent Citations (1)
Title |
---|
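张萌 (Zhang Meng): "Research on URL Security Detection Technology Based on Machine Learning", China Master's Theses Full-text Database, Information Science and Technology Series * |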
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114928472A (en) * | 2022-04-20 | 2022-08-19 | 哈尔滨工业大学(威海) | Method for filtering bad site grey list based on full-volume circulation main domain name |
CN114928472B (en) * | 2022-04-20 | 2023-07-18 | 哈尔滨工业大学(威海) | Bad site gray list filtering method based on full circulation main domain name |
CN116633684A (en) * | 2023-07-19 | 2023-08-22 | 中移(苏州)软件技术有限公司 | Phishing detection method, system, electronic device and readable storage medium |
CN116633684B (en) * | 2023-07-19 | 2023-10-13 | 中移(苏州)软件技术有限公司 | Phishing detection method, system, electronic device and readable storage medium |
CN117176483A (en) * | 2023-11-03 | 2023-12-05 | 北京艾瑞数智科技有限公司 | Abnormal URL identification method and device and related products |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10178107B2 (en) | Detection of malicious domains using recurring patterns in domain names | |
Tan et al. | PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder | |
US10404745B2 (en) | Automatic phishing email detection based on natural language processing techniques | |
EP2803031B1 (en) | Machine-learning based classification of user accounts based on email addresses and other account information | |
US10033757B2 (en) | Identifying malicious identifiers | |
Mahajan et al. | Phishing website detection using machine learning algorithms | |
Kiruthiga et al. | Phishing websites detection using machine learning | |
Chu et al. | Protect sensitive sites from phishing attacks using features extractable from inaccessible phishing URLs | |
Buber et al. | NLP based phishing attack detection from URLs | |
US20150067833A1 (en) | Automatic phishing email detection based on natural language processing techniques | |
CN112948725A (en) | Phishing website URL detection method and system based on machine learning | |
Das Guptta et al. | Modeling hybrid feature-based phishing websites detection using machine learning techniques | |
Tan et al. | Phishing website detection using URL-assisted brand name weighting system | |
Joshi et al. | Phishing attack detection using feature selection techniques | |
CN110572359A (en) | Phishing webpage detection method based on machine learning | |
Deshpande et al. | Detection of phishing websites using Machine Learning | |
CN116917894A (en) | Detecting phishing URLs using a converter | |
Nowroozi et al. | An adversarial attack analysis on malicious advertisement url detection framework | |
CN115314236A (en) | System and method for detecting phishing domains in a Domain Name System (DNS) record set | |
Valiyaveedu et al. | Survey and analysis on AI based phishing detection techniques | |
US11647046B2 (en) | Fuzzy inclusion based impersonation detection | |
Dangwal et al. | Feature selection for machine learning-based phishing websites detection | |
Franchina et al. | Detecting phishing e-mails using Text Mining and features analysis | |
Hossain et al. | PhishRescue: A Stacked Ensemble Model to Identify Phishing Website Using Lexical Features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210611 |