WO2016201938A1 - Multi-stage phishing website detection method and system - Google Patents

Multi-stage phishing website detection method and system

Info

Publication number
WO2016201938A1
WO2016201938A1 PCT/CN2015/098463 CN2015098463W WO2016201938A1 WO 2016201938 A1 WO2016201938 A1 WO 2016201938A1 CN 2015098463 W CN2015098463 W CN 2015098463W WO 2016201938 A1 WO2016201938 A1 WO 2016201938A1
Authority
WO
WIPO (PCT)
Prior art keywords
filtering
website
phishing
stage
websites
Prior art date
Application number
PCT/CN2015/098463
Other languages
French (fr)
Chinese (zh)
Inventor
耿光刚
李晓东
Original Assignee
中国互联网络信息中心
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国互联网络信息中心 (China Internet Network Information Center)
Publication of WO2016201938A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Definitions

  • The present invention relates to the field of information technology, in particular to the field of network security technology, and specifically to a multi-stage phishing website detection method and system.
  • "Phishing" is an internationally used neologism in which "ph", the first two letters of "phreak" (a person who illegally taps telephone lines), replaces the "f" of "fishing". It is a means of cybercrime that combines social engineering (i.e., deception) with network communication technology.
  • The purpose of Internet phishing is to defraud victims of their website accounts and passwords (online banking, online games, Alipay, etc.), credit card information, and personal data, in order to, for example, make online transfers, steal online game equipment, steal email information, or make fraudulent credit card charges.
  • Internet phishing is mainly carried out through phishing sites.
  • For example, a phishing website can disguise itself as a bank's online banking page to steal a user's bank card number and password and then transfer away the deposits in the user's bank account; disguise itself as the official website of an online game to steal the user's game account and the user's virtual currency or equipment in the game; disguise itself as a website giving away Q coins to steal the user's QQ number and password and thereby the QQ account itself; or disguise itself as a prize-winning website to steal the user's personal information and then use that information for criminal purposes. The same means can also be used to obtain a user's email account and password and thus read the user's email correspondence, for the purpose of spying on others' privacy or even stealing trade secrets.
  • The machine learning methods currently used in statistical-learning-based detection of phishing websites mainly include decision trees, Bagging, and support vector machines. These general-purpose machine learning algorithms are widely used in pattern recognition fields such as text classification and face recognition, and can also be applied directly to phishing website detection. For a model learned with these algorithms to perform well on the real Internet, a necessary condition is that the training samples cover all kinds of Internet pages. However, most existing anti-phishing research verifies the effectiveness of its algorithms on relatively small sample sets, some containing only a few dozen samples, so its generalizability is questionable.
  • The present invention provides a multi-stage phishing website detection method and system whose core idea is to combine fast filtering with accurate filtering.
  • Through multi-stage fast filtering, the suspected phishing websites are confined to a relatively small range; further, an accurate determination model is trained by analyzing the statistical features of the positive and negative samples within that small range.
  • One of the objectives of the present invention is to provide a multi-stage phishing website detection method comprising the following steps:
  • In step 1), the fast filtering of the websites to be detected on the Internet includes:
  • In step 1-1), the first layer of filtering is used to quickly exclude normal brand websites and to guarantee fast access to key websites.
  • The sensitive words include bank, credit card, payment, winning, login, and password.
  • In step 1-2), the second layer of filtering uses Bayesian filtering.
  • The website-related features include PageRank, domain name registration time, and favicon.
  • The accurate determination in step 2) includes training an accurate determination model by analyzing the statistical features of the positive and negative samples in the remaining range.
  • The statistical features of the positive and negative samples include existing statistical phishing detection features, DNS registration and resolution features, and brand element features.
  • The accurate determination model in step 2) is trained on a confusable data set.
  • Another objective of the present invention is to provide a multi-stage phishing website detection system, comprising:
  • a fast filtering module for selecting a range of websites to be detected, performing fast filtering on them, and excluding the obvious non-phishing websites;
  • an accurate determination module for making an accurate determination on the websites to be detected that remain after fast filtering.
  • The fast filtering module includes:
  • a first filtering module for performing first-layer filtering using a brand domain name library and/or a domain name whitelist;
  • a second filtering module for performing second-layer filtering using sensitive words;
  • a third filtering module for performing third-layer filtering using website-related features.
  • The method and system of the present invention determine in multiple stages whether the websites within a range to be detected are phishing websites. The multi-layer fast filtering of the earlier stages quickly filters out the large number of non-phishing websites and confines the suspected phishing websites to a relatively small range; the accurate determination then uses multi-dimensional features to train a classification model that makes an accurate determination on the suspected phishing websites. This both improves the efficiency of phishing website detection and determines phishing websites accurately. It not only overcomes the difficulty of obtaining good results when phishing website detection is treated directly as an extremely class-imbalanced detection problem, but also greatly speeds up detection, making it suitable for online applications.
  • FIG. 1 is a schematic diagram of the class imbalance of the phishing fraud detection problem according to the present invention.
  • FIG. 2 is a workflow diagram of the multi-stage phishing website detection method of the present invention.
  • FIG. 3 is a schematic diagram of the module composition of the system of the present invention.
  • A necessary premise for a pattern-classification-based phishing detection method to achieve good results is that the training samples are rich enough, that is, that they cover all kinds of web pages.
  • However, phishing website detection in the real Internet environment is an extreme class imbalance problem. As shown in FIG. 1, the black spot in the center of the figure represents phishing websites, and the gray circle represents non-phishing websites.
  • None of the existing statistical-learning-based phishing detection methods and strategies takes this fact into account, and they lack the necessary explanation of the coverage and reasonableness of the test data sets they construct.
  • In view of the above, the present invention designs a layered detection strategy, namely multi-stage phishing detection.
  • The core of this strategy is to design the filtering rules of each layer rationally so as to improve both detection efficiency and accuracy.
  • To this end, the first few stages of the detection strategy focus on improving detection efficiency: they quickly exclude the obvious non-phishing pages that make up the vast majority of the Internet, that is, they remove the websites outside the black circle shown in FIG. 1 and narrow the suspected phishing websites down to those inside the black circle. The subsequent stage then concentrates on determining the suspected phishing websites inside the black circle, to ensure high accuracy and a low false detection rate.
  • The multi-stage phishing website detection method of the present invention is described in detail below with reference to the accompanying drawings.
  • The range to be detected to which the method applies may be a collection of websites.
  • The present invention does not limit the size of the collection, which may be the collection of websites of the entire Internet. As shown in FIG. 2:
  • The multi-stage phishing website detection method includes two major stages, fast filtering and accurate determination, where fast filtering is implemented by a fast filtering module and accurate determination by an accurate determination module.
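The two-stage structure described above can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not the implementation disclosed in the publication: the function name `run_cascade`, the dictionary fields, and the toy predicates are all hypothetical, and each fast filter is assumed to return True when a site can be excluded as clearly non-phishing.

```python
def run_cascade(sites, fast_filters, accurate_classifier):
    """Apply cheap filters first, then the costly classifier to what is left.

    fast_filters: callables returning True if a site is clearly non-phishing.
    accurate_classifier: callable returning True if a site is judged phishing.
    """
    remaining = list(sites)
    for is_clearly_benign in fast_filters:
        # Each layer only sees the sites that earlier layers could not exclude.
        remaining = [site for site in remaining if not is_clearly_benign(site)]
    return [site for site in remaining if accurate_classifier(site)]


# Hypothetical toy data: every field below is an assumption for illustration.
sites = [
    {"host": "brand.example", "whitelisted": True, "sensitive": True, "score": 0.1},
    {"host": "blog.example", "whitelisted": False, "sensitive": False, "score": 0.2},
    {"host": "brand-login.test", "whitelisted": False, "sensitive": True, "score": 0.9},
    {"host": "shop.example", "whitelisted": False, "sensitive": True, "score": 0.3},
]
fast_filters = [
    lambda s: s["whitelisted"],    # layer 1: brand/whitelist check
    lambda s: not s["sensitive"],  # layer 2: no sensitive content found
]
flagged = run_cascade(sites, fast_filters, lambda s: s["score"] > 0.5)
```

The point of the ordering is that the expensive classifier runs only on the two sites the cheap layers could not exclude.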
  • Operating environment: the software environment is not limited to Windows or Unix systems, and the method can be implemented in any common development language, such as C++, Java, or Perl.
  • The hardware environment is likewise not limited and may be an ordinary personal computer or an ordinary server.
  • The three-stage fast filtering performed by the present invention effectively excludes the vast majority of non-phishing websites on the Internet, so that only a small number of websites requiring accurate determination enter the final accurate classification stage. High efficiency is critical for phishing detection.
  • In this embodiment, the fast filtering includes three specific stages.
  • The first stage uses the brand domain name library and the domain name whitelist to perform the first layer of filtering, which quickly excludes normal brand websites. Because these websites receive a very large volume of daily visits, this layer of filtering guarantees fast access to key sites.
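A first-layer whitelist check of this kind might look like the sketch below. The whitelist entries and function names are hypothetical; a real deployment would load a maintained brand domain name library rather than a hard-coded set. The match is made on the host name's suffix so that a look-alike host that merely starts with a brand domain is not excluded.

```python
from urllib.parse import urlsplit

# Hypothetical whitelist entries, used only for illustration.
BRAND_DOMAIN_WHITELIST = frozenset({"example-bank.com", "example-pay.com"})


def host_is_whitelisted(host, whitelist=BRAND_DOMAIN_WHITELIST):
    """True if host is a whitelisted domain or a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == domain or host.endswith("." + domain) for domain in whitelist)


def first_layer_excludes(url, whitelist=BRAND_DOMAIN_WHITELIST):
    """True if the URL belongs to a normal brand website and needs no further checks."""
    host = urlsplit(url).hostname or ""
    return host_is_whitelisted(host, whitelist)
```

For example, `first_layer_excludes("https://www.example-bank.com/login")` is true, while a host such as `example-bank.com.evil.test` is not excluded and continues to the next layer.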
  • The second stage filters on the login box, sensitive words, and copyright information; this is the second layer of filtering.
  • The sensitive words include, but are not limited to, "bank", "credit card", "payment", "winning", "login", and "password", and the settings can be updated as new types of phishing websites appear.
  • The second layer of filtering uses Bayesian filtering, also known as Bayesian classification. Its working principles are generally known to those skilled in the art and are not described here. The second layer of filtering filters out most ordinary web pages, which greatly improves overall detection efficiency.
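Since the text leaves the Bayesian filter unspecified, the following is one possible reading: a small Bernoulli naive Bayes classifier over the presence of each sensitive word and of a login box. The class name, the feature layout, the labels, and the five training pages are all assumptions made for illustration; the copyright feature is omitted for brevity.

```python
import math

SENSITIVE_WORDS = ("bank", "credit card", "payment", "winning", "login", "password")


def page_features(text, has_login_box):
    """Binary features: presence of each sensitive word, plus the login box flag."""
    text = text.lower()
    features = {word: (word in text) for word in SENSITIVE_WORDS}
    features["login_box"] = has_login_box
    return features


class BernoulliNaiveBayes:
    def fit(self, samples, labels):
        self.classes = sorted(set(labels))
        self.keys = sorted(samples[0])
        self.prior, self.p_true = {}, {}
        for c in self.classes:
            rows = [s for s, y in zip(samples, labels) if y == c]
            self.prior[c] = len(rows) / len(samples)
            # Laplace smoothing keeps every probability strictly between 0 and 1.
            self.p_true[c] = {
                k: (sum(r[k] for r in rows) + 1) / (len(rows) + 2) for k in self.keys
            }
        return self

    def log_score(self, features, c):
        score = math.log(self.prior[c])
        for k in self.keys:
            p = self.p_true[c][k]
            score += math.log(p if features[k] else 1.0 - p)
        return score

    def predict(self, features):
        return max(self.classes, key=lambda c: self.log_score(features, c))


# Hypothetical training pages: "suspect" pages continue to the next layer,
# "ordinary" pages are filtered out.
training = [
    (page_features("Welcome to your bank login, enter password", True), "suspect"),
    (page_features("You are winning! Confirm payment with credit card", True), "suspect"),
    (page_features("Weather forecast for the weekend", False), "ordinary"),
    (page_features("Recipe: tomato soup", False), "ordinary"),
    (page_features("Local football results", False), "ordinary"),
]
model = BernoulliNaiveBayes().fit([f for f, _ in training], [y for _, y in training])
```

A page that mentions a bank login and password and shows a login box is scored as "suspect", while a page with none of these signals is scored as "ordinary" and dropped from further processing.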
  • The third stage further judges the pages that contain the relevant sensitive words, based on PageRank, domain name registration time, and favicon.
  • PageRank [The PageRank Citation Ranking: Bringing Order to the Web; Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry; Technical Report, Stanford InfoLab; 1999] is a web page ranking algorithm proposed by Larry Page.
  • Its basic idea is that, compared with an unpopular website, a popular website has more popular websites linking to it. This has two aspects: the more websites link to a website, the more popular that website is; and the more popular the websites linking to a website are, the more popular that website is. That is, the popularity of a website is proportional to the number of websites linking to it and to the popularity of those linking websites.
  • The favicon (favorites icon) is the small icon that appears at the left of the browser's address bar, also known as the website avatar. Its display differs between browsers: in most mainstream browsers, such as Firefox and Internet Explorer (5.5 and above), the favicon is displayed not only in the favorites but also in the address bar, from which the user can drag it to the desktop to create a shortcut to the website.
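The PageRank idea described above can be illustrated with a short power-iteration sketch. The publication only obtains PageRank values as a feature and does not say how they are computed; the damping factor of 0.85, the iteration count, and the toy link graph below are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph: three pages link to "hub", and "hub" links back to "a" only.
ranks = pagerank({"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]})
```

In this toy graph "hub" ends up with the highest rank because the most pages link to it, and "a" outranks "b" and "c" because it is linked from the popular "hub" page, which matches the two aspects stated in the text.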
  • The third stage is based on the following principle: normal brand websites usually have a high PageRank, a domain name registered for more than K years (for example, more than 3 years), and an authentic brand favicon, whereas phishing websites are just the opposite. That is, if a website remaining after the first and second layers of filtering lacks a high PageRank, has a short domain name registration time, and/or has no authentic brand favicon, it will be judged as a suspected phishing website.
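The principle above can be written as a plain rule, shown here as a simplified sketch. The names `SiteSignals` and `third_layer_excludes` are hypothetical, the PageRank threshold of 5.0 is an assumed value, and only the 3-year figure for K comes from the text's own example. The publication says this stage is actually implemented by training a simple classifier, so a fixed rule is a stand-in.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    pagerank: float          # assumed to be on a 0-10 scale
    domain_age_years: float  # time since the domain name was registered
    has_brand_favicon: bool


def third_layer_excludes(site, min_pagerank=5.0, min_age_years=3.0):
    """True if the site looks like a normal brand website on all three signals."""
    return (
        site.pagerank >= min_pagerank
        and site.domain_age_years > min_age_years
        and site.has_brand_favicon
    )


def is_suspected_phishing(site):
    """Sites that fail any signal stay in the suspected set for accurate determination."""
    return not third_layer_excludes(site)
```

Requiring all three signals before excluding a site errs on the side of keeping doubtful sites in the suspected set, which the later accurate determination stage can still clear.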
  • The second and third stages above are both implemented by training a simple classifier on a small number of features.
  • Next comes the accurate detection and determination stage: the rich features used in existing statistical phishing detection (URL characters, titles, DOM trees, search engine rankings, login boxes, etc.), together with a series of DNS registration and resolution features, brand element features, and so on, are used to train an accurate determination model on a confusable data set. The confusable data set may be a data set composed of the samples (websites) inside the black circle in FIG. 1, and the model makes the final determination of whether a website is a phishing website.
  • Model training is common knowledge in pattern recognition and machine learning, especially supervised learning, that is, learning or building a model from training data; see http://en.wikipedia.org/wiki/%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0. It is not described further here.
  • FIG. 3 is a schematic diagram of the module structure of a multi-stage phishing website detection system according to an embodiment of the present invention. The system includes:
  • a fast filtering module, used to select a range of websites to be detected, perform fast filtering on them, and exclude the obvious non-phishing websites;
  • a first filtering module, used to perform first-layer filtering using a brand domain name library and/or a domain name whitelist, including a brand domain name filtering module and a host whitelist filtering module;
  • a second filtering module, used to perform second-layer filtering using sensitive words, including a login box detection module and a sensitive word filtering module;
  • a third filtering module, used to perform third-layer filtering using website-related features, including a PageRank acquisition module, a domain name registration information acquisition module, and a favicon acquisition and matching module;
  • an accurate determination module, used to make an accurate determination on the websites to be detected that remain after fast filtering, including:
  • a multi-dimensional feature extraction module, used to extract multi-dimensional features including, but not limited to, those used by the above three filtering modules:
  • Domain registration feature: the registration duration of the domain name used by the website;
  • Favicon feature: whether the suspected phishing website contains a brand's favicon;
  • PageRank feature: the PageRank value of the domain name used by the website;
  • Login box feature: whether the website contains a login box;
  • Sensitive word feature: whether the website contains keywords such as "bank", "payment", "password", or "winning";
  • Copyright statement feature: whether the website contains a brand's copyright statement;
  • HTTPS feature: whether the website uses the HTTPS protocol.
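The seven features listed above can be gathered into a single record and flattened into a numeric vector for a classifier. The sketch below is one assumed encoding: the class name `PageFeatures`, the field names, and the choice of 0.0/1.0 for the yes/no features are hypothetical and not specified in the publication.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    domain_age_years: float    # domain registration feature
    has_brand_favicon: bool    # favicon feature
    pagerank: float            # PageRank feature
    has_login_box: bool        # login box feature
    has_sensitive_word: bool   # sensitive word feature
    has_brand_copyright: bool  # copyright statement feature
    uses_https: bool           # HTTPS feature

    def to_vector(self):
        """Return the features as floats, in the order they are declared."""
        return [
            self.domain_age_years,
            float(self.has_brand_favicon),
            self.pagerank,
            float(self.has_login_box),
            float(self.has_sensitive_word),
            float(self.has_brand_copyright),
            float(self.uses_https),
        ]
```

Keeping a fixed field order matters because the classifier trained on these vectors relies on each position always holding the same feature.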
  • The accurate determination module uses the above multi-dimensional features on the training set to train a classifier such as a support vector machine [https://en.wikipedia.org/wiki/Support_vector_machine] or a decision tree [https://en.wikipedia.org/wiki/Decision_tree], yielding a classification model that makes the determination on suspected websites. Details of model training and classification decisions can be found at https://en.wikipedia.org/wiki/Statistical_classification.
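As a deliberately reduced stand-in for the classifiers named above, the sketch below trains a one-level decision tree (a decision stump) in plain Python. It is not the publication's model: the three-feature layout, the six training rows, and the label convention (1 for phishing, 0 for benign) are all assumptions chosen to keep the example self-contained.

```python
def train_stump(rows, labels):
    """Find the single feature/threshold split with the fewest training errors.

    Returns (feature_index, threshold, label_if_above).
    """
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            for label_above in (0, 1):
                predictions = [
                    label_above if row[j] > threshold else 1 - label_above
                    for row in rows
                ]
                errors = sum(p != y for p, y in zip(predictions, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, label_above)
    _, j, threshold, label_above = best
    return j, threshold, label_above


def predict_stump(stump, row):
    j, threshold, label_above = stump
    return label_above if row[j] > threshold else 1 - label_above


# Hypothetical rows: [domain_age_years, has_brand_favicon, uses_https].
rows = [
    [0.2, 0, 0], [0.5, 0, 1], [0.1, 1, 0],   # labelled phishing
    [8.0, 1, 1], [12.0, 1, 1], [5.0, 0, 1],  # labelled benign
]
labels = [1, 1, 1, 0, 0, 0]
stump = train_stump(rows, labels)
```

On this toy data the stump splits on domain age, since that feature alone separates the two classes; a full decision tree or support vector machine would combine several features in the same supervised fashion.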
  • The multi-stage phishing detection method and system of the present invention is expected to improve the performance of statistical-machine-learning-based phishing website detection in two respects: detection efficiency and robustness.
  • The fast filtering step efficiently filters out most non-phishing websites, which largely overcomes the drawback that existing phishing detection methods must extract a large number of features and judge them comprehensively. It balances detection efficiency and accuracy, and suits both large-scale server-side processing and client applications such as browser plug-ins.
  • The differences in phishing website detection performance compared with the prior art are described below (in tabular form in the original publication):
  • Prior art 1 is the heuristic phishing detection method, which uses a series of heuristic rules to identify phishing.
  • This method requires the heuristic parameters to be set manually, and phishers can easily evade the rules.
  • As a result, heuristic rule methods are often ill-suited to the rapidly changing Internet environment; in particular, they are entirely unsuited to discovering emerging phishing patterns, and their limitations are obvious.
  • Prior art 2 is the single-stage phishing detection method based on statistical machine learning. This kind of method avoids the weakness that the parameter settings of heuristic rule methods are easily evaded by phishers, and adapts readily to many kinds of phishing. However, building a high-accuracy model requires extracting a large number of features, and the feature extraction phase takes a long time, which makes the method unsuitable for online detection with strict time requirements.
  • Although the multi-stage phishing website detection is described in this embodiment as the above four stages, in practice those skilled in the art may adjust and test the stages according to the effectiveness and extraction complexity of the website features, until they obtain the phishing detection strategy best suited to the current network environment. That is, the method of the present invention is not limited to the above four stages, and the number of stages may be increased or reduced as circumstances require. For example, the second and third stages may be merged into one stage, or a URL similarity filtering stage may be added between the first and second stages (the URL of a phishing website often contains the brand name of the phishing target).
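The optional URL similarity stage mentioned above is not detailed in the text. One simple interpretation, sketched below, flags a URL that contains a brand name while not being hosted on that brand's official domain. The brand table and function name are hypothetical, and plain substring matching is an assumed simplification of "URL similarity".

```python
from urllib.parse import urlsplit

# Hypothetical mapping from brand keyword to the brand's official domain.
OFFICIAL_DOMAINS = {"examplebank": "examplebank.com"}


def url_mentions_foreign_brand(url, official=OFFICIAL_DOMAINS):
    """True if the URL contains a brand name but is not on that brand's own domain."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    haystack = host + parts.path.lower()
    for brand, domain in official.items():
        on_official_domain = host == domain or host.endswith("." + domain)
        if brand in haystack and not on_official_domain:
            return True
    return False
```

A URL such as `http://examplebank.secure-login.test/verify` is flagged, whereas `https://www.examplebank.com/login` is not, because the latter is served from the brand's own domain.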
  • Adjustments such as those above conform to the technical idea of the present invention, and the scope of protection of the present invention shall be defined by the claims.

Abstract

The present invention discloses a multi-stage phishing website detection method and system, and combines means of both fast filtering and accurate filtering. Multiple stages of fast filtering are used to control the number of potential phishing websites to be in a relatively small range; furthermore, an accurate determination model is trained by analyzing statistical features of positive and negative samples in a small range. The method comprises the following steps: selecting a to-be-detected range of websites to perform fast filtering and excluding obvious non-phishing websites therefrom; and performing accurate determination on the remaining range of the websites after the fast filtering to determine whether said websites are phishing websites. The system comprises: a fast filtering module, configured to select a range of to-be-detected websites to perform fast filtering and exclude obvious non-phishing websites therefrom; and an accurate determination module, configured to perform accurate determination on the remaining range of the to-be-detected websites after the fast filtering.

Description

Multi-stage phishing website detection method and system
Technical field
The present invention relates to the field of information technology, in particular to the field of network security technology, and specifically to a multi-stage phishing website detection method and system.
Background art
Today, the Internet has become an important part of people's social life. However, as the Internet continues to spread and its level of use continues to rise, Internet phishing fraud has gradually become one of the main means of attack used by cybercriminals, alongside traditional information security threats such as Trojans, viruses, and botnets.
"Phishing" is an internationally used neologism in which "ph", the first two letters of "phreak" (a person who illegally taps telephone lines), replaces the "f" of "fishing". It is a means of cybercrime that combines social engineering (i.e., deception) with network communication technology. The purpose of Internet phishing is to defraud victims of their website accounts and passwords (online banking, online games, Alipay, etc.), credit card information, and personal data, in order to, for example, make online transfers, steal online game equipment, steal email information, or make fraudulent credit card charges. Internet phishing is mainly carried out through phishing sites. For example, a phishing website can disguise itself as a bank's online banking page to steal a user's bank card number and password and then transfer away the deposits in the user's bank account; disguise itself as the official website of an online game to steal the user's game account and the user's virtual currency or equipment in the game; disguise itself as a website giving away Q coins to steal the user's QQ number and password and thereby the QQ account itself; or disguise itself as a prize-winning website to steal the user's personal information and then use that information for criminal purposes. The same means can also be used to obtain a user's email account and password and thus read the user's email correspondence, for the purpose of spying on others' privacy or even stealing trade secrets.
To prevent and combat the crime of Internet phishing and to protect the interests and privacy of Internet users, using detection methods to find the phishing websites hidden on the Internet is the most effective and direct technical means.
With the continuous development of information technology, more and more phishing websites exist on the Internet; phishing websites of every kind emerge one after another, covering Internet pages of all types in all fields. At present, Internet phishing fraud detection mostly uses pattern classification techniques based on statistical machine learning. Owing to the demonstration effect of the successful application of artificial intelligence and machine learning theory in many fields in recent years, phishing website detection based on statistical machine learning has gradually become the prevailing detection method.
The machine learning methods currently used in statistical-learning-based detection of phishing websites mainly include decision trees, Bagging, and support vector machines. These general-purpose machine learning algorithms are widely used in pattern recognition fields such as text classification and face recognition, and can also be applied directly to phishing website detection. For a model learned with these algorithms to perform well on the real Internet, a necessary condition is that the training samples cover all kinds of Internet pages. However, most existing anti-phishing research verifies the effectiveness of its algorithms on relatively small sample sets, some containing only a few dozen samples, so its generalizability is questionable. In addition, even if the sample set were large enough to cover all kinds of samples in proportions matching the real Internet, phishing detection is an extreme class imbalance problem (among the hundreds of millions of websites worldwide, there are only on the order of hundreds of thousands of phishing websites each year), so directly applying existing pattern classification algorithms is unlikely to achieve good detection results.
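The effect of this class imbalance can be made concrete with a short calculation. The numbers below are assumptions chosen only to match the orders of magnitude in the text, and the classifier's 99% detection rate and 1% false positive rate, as well as the share of benign sites removed by filtering, are hypothetical.

```python
def precision(n_phishing, n_benign, true_positive_rate, false_positive_rate):
    """Fraction of flagged sites that are really phishing."""
    true_positives = n_phishing * true_positive_rate
    false_positives = n_benign * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Assumed illustrative counts: hundreds of millions of sites in total,
# hundreds of thousands of them phishing.
N_PHISHING = 300_000
N_BENIGN = 299_700_000

# A classifier assumed to catch 99% of phishing with a 1% false positive rate,
# applied directly to every site.
direct = precision(N_PHISHING, N_BENIGN, 0.99, 0.01)

# The same classifier after a fast filter that is assumed to remove 99% of the
# benign sites without discarding any phishing sites.
after_filtering = precision(N_PHISHING, N_BENIGN * 0.01, 0.99, 0.01)

print(f"precision without filtering: {direct:.3f}")
print(f"precision after filtering:   {after_filtering:.3f}")
```

Under these assumptions, most sites flagged by the directly applied classifier are false alarms, while the same classifier applied after filtering flags mostly real phishing sites.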
Summary of the invention
In view of the above problems, the present invention provides a multi-stage phishing website detection method and system whose core idea is to combine fast filtering with accurate filtering. Through multi-stage fast filtering, the suspected phishing websites are confined to a relatively small range; further, an accurate determination model is trained by analyzing the statistical features of the positive and negative samples within that small range.
One of the objectives of the present invention is to provide a multi-stage phishing website detection method comprising the following steps:
1) selecting the websites within a range to be detected, performing fast filtering on them, and excluding the obvious non-phishing websites;
2) extracting the multi-dimensional features used in the fast filtering;
3) using the above multi-dimensional features on a training set to make an accurate determination on the websites remaining after fast filtering, determining whether each is a phishing website.
Further, the fast filtering in step 1) of the websites to be detected on the Internet includes:
1-1) performing first-layer filtering using brand hosts and/or a domain name whitelist;
1-2) performing second-layer filtering using the login box, sensitive words, and copyright information;
1-3) performing third-layer filtering using website-related features.
Further, in step 1-1), the first layer of filtering is used to quickly exclude normal brand websites and to guarantee fast access to key websites.
Further, in step 1-2), the sensitive words include bank, credit card, payment, winning, login, and password.
Further, in step 1-2), the second layer of filtering uses Bayesian filtering.
Further, in step 1-3), the website-related features include PageRank, domain name registration time, and favicon.
Further, the accurate determination in step 2) includes training an accurate determination model by analyzing the statistical features of the positive and negative samples in the remaining range.
Further, the statistical features of the positive and negative samples include existing statistical phishing detection features, DNS registration and resolution features, and brand element features.
Further, the accurate determination model in step 2) is trained on a confusable data set.
Another objective of the present invention is to provide a multi-stage phishing website detection system, comprising:
a fast filtering module for selecting a range of websites to be detected, performing fast filtering on them, and excluding the obvious non-phishing websites;
an accurate determination module for making an accurate determination on the websites to be detected that remain after fast filtering.
Further, the fast filtering module includes:
a first filtering module for performing first-layer filtering using a brand domain name library and/or a domain name whitelist;
a second filtering module for performing second-layer filtering using sensitive words;
a third filtering module for performing third-layer filtering using website-related features.
The method and system of the present invention determine in multiple stages whether the websites within a range to be detected are phishing websites. The multi-layer fast filtering of the earlier stages quickly filters out the large number of non-phishing websites and confines the suspected phishing websites to a relatively small range; the accurate determination then uses multi-dimensional features to train a classification model that makes an accurate determination on the suspected phishing websites. This both improves the efficiency of phishing website detection and determines phishing websites accurately. It not only overcomes the difficulty of obtaining good results when phishing website detection is treated directly as an extremely class-imbalanced detection problem, but also greatly speeds up detection, making it suitable for online applications.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the class imbalance of the phishing fraud detection problem addressed by the present invention.
FIG. 2 is a workflow diagram of the multi-stage phishing website detection method of the present invention.
FIG. 3 is a schematic diagram of the module composition of the system of the present invention.
Detailed Description
A necessary precondition for phishing detection methods based on pattern classification to work well is that the training samples are sufficiently rich, i.e. that they cover all kinds of web pages. In the real Internet environment, however, phishing website detection is an extremely class-imbalanced problem. As shown in FIG. 1, the black spot at the center of the figure represents phishing websites and the gray circle represents non-phishing websites.
Existing phishing detection methods and strategies based on statistical learning do not take this fact into account, and they lack the necessary explanation of the coverage and soundness of the test data sets they construct. In view of this, the present invention designs a layered detection strategy, namely multi-stage phishing detection. The core of the strategy is to design the filtering rules of each layer sensibly so as to improve both detection efficiency and accuracy. To this end, the first few stages of the strategy focus on efficiency: they quickly exclude the obviously non-phishing web pages that make up the vast majority of the Internet, i.e. they remove the websites outside the black circle shown in FIG. 1 and narrow the suspected phishing websites down to those inside it. The subsequent stage then concentrates on the suspected phishing websites inside the black circle, so as to ensure a high accuracy and a low false detection rate.
The multi-stage phishing website detection method of the present invention is described in detail below with reference to the accompanying drawings. The range to be detected may be any set of websites; the present invention does not limit the size of the set, which may be the set of all websites on the Internet. As shown in FIG. 2:
In one embodiment of the present invention, the multi-stage phishing website detection method comprises two major stages, fast filtering and precise determination, where fast filtering is implemented by a fast filtering module and precise determination by a precise determination module. As for the operating environment, the software environment is not limited to Windows or Unix systems, and any common development language, such as C++, Java or Perl, may be used. The hardware environment is not specifically limited either; it may be an ordinary personal computer or a common server.
Compared with traditional single-stage phishing detection methods, the three stages of fast filtering performed by the present invention efficiently exclude the non-phishing websites that make up the vast majority of the Internet, so that only a small number of websites requiring precise determination enter the final precise classification stage. High efficiency is critical for phishing detection.
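The staged flow described above can be pictured as a cascade in which each cheap filter either clears a website or passes it on to the next layer. The following is a minimal sketch under stated assumptions, not the implementation of the invention: the function `run_cascade`, the dictionary keys and the toy filters are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List

# A fast filter returns True when a site is clearly NOT phishing and can be dropped.
Filter = Callable[[dict], bool]

def run_cascade(sites: Iterable[dict],
                fast_filters: List[Filter],
                precise_classifier: Callable[[dict], bool]) -> List[dict]:
    """Return the sites finally judged to be phishing."""
    suspects = list(sites)
    for is_clearly_benign in fast_filters:
        # Each fast layer shrinks the suspect set before the next layer runs.
        suspects = [s for s in suspects if not is_clearly_benign(s)]
    # Only the small remainder reaches the expensive precise classifier.
    return [s for s in suspects if precise_classifier(s)]

# Toy stand-ins for the three fast layers and the precise model (all hypothetical).
whitelisted = lambda s: s["host"] in {"example-bank.com"}
no_sensitive_content = lambda s: not s["has_sensitive_words"]
looks_established = lambda s: s["domain_age_years"] > 3
precise = lambda s: s["score"] > 0.5

sites = [
    {"host": "example-bank.com", "has_sensitive_words": True, "domain_age_years": 10, "score": 0.1},
    {"host": "blog.example.org", "has_sensitive_words": False, "domain_age_years": 1, "score": 0.2},
    {"host": "examp1e-bank.xyz", "has_sensitive_words": True, "domain_age_years": 0, "score": 0.9},
]
flagged = run_cascade(sites, [whitelisted, no_sensitive_content, looks_established], precise)
```

Ordering the layers from cheapest to most expensive means the costly feature extraction of the final stage is paid only for the few sites that survive every earlier layer.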
In this embodiment, fast filtering comprises three specific stages. The first stage performs first-layer filtering using a brand domain name library and a domain name whitelist, so as to quickly exclude normal brand websites. Since these websites receive a very large number of visits every day, this layer of filtering ensures fast access to key websites.
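The first-layer check amounts to a set lookup on the host name of the URL. The sketch below is one possible reading, with assumptions labeled: the contents of `BRAND_DOMAINS` and `DOMAIN_WHITELIST` are placeholder data, and treating subdomains of a listed domain as trusted is a design choice of this example rather than something the text specifies.

```python
from urllib.parse import urlsplit

# Hypothetical data: registered domains of known brands and a general whitelist.
BRAND_DOMAINS = {"example-bank.com", "example-pay.com"}
DOMAIN_WHITELIST = {"example.org"}

def host_of(url: str) -> str:
    """Lower-cased host name of a URL, without the port."""
    return (urlsplit(url).hostname or "").lower()

def passes_first_layer(url: str) -> bool:
    """True if the site is on a trusted list and needs no further checks."""
    host = host_of(url)
    trusted = BRAND_DOMAINS | DOMAIN_WHITELIST
    # Accept the listed domain itself or any subdomain of it (assumption),
    # but not hosts that merely start with a trusted name.
    return any(host == d or host.endswith("." + d) for d in trusted)
```

Matching on the end of the host name matters: `example-bank.com.evil.xyz` begins with a trusted name but is registered under a different domain, so it is not cleared by this layer.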
The second stage filters on login boxes, sensitive words, copyright information and the like; this is the second-layer filtering. The sensitive words include, but are not limited to, "bank", "credit card", "payment", "prize winning", "login" and "password", and the list can be updated as new types of phishing websites appear. The second-layer filtering uses Bayesian filtering, also known as Bayesian classification, whose working principle is well known to those skilled in the art and is not repeated here. This step filters out the vast majority of ordinary web pages, which greatly improves the overall detection efficiency.
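As an illustration of Bayesian filtering over these signals, the sketch below extracts binary features (login box, each sensitive word, copyright notice) and fits a small Bernoulli naive Bayes model with Laplace smoothing. Everything concrete here is an assumption made for the example: the English keyword list stands in for the Chinese terms, the regular expression is a crude login-box test, and the four training pages are invented.

```python
import math
import re

SENSITIVE_WORDS = ["bank", "credit card", "payment", "prize", "login", "password"]

def page_features(html):
    """Binary features: login box, each sensitive word, copyright notice."""
    text = html.lower()
    has_login_box = bool(re.search(r"<input[^>]*type=[\"']?password", text))
    has_copyright = "copyright" in text or "\u00a9" in text
    return [int(has_login_box)] + [int(w in text) for w in SENSITIVE_WORDS] + [int(has_copyright)]

class BernoulliNB:
    """Tiny Bernoulli naive Bayes with Laplace smoothing (labels 0 and 1)."""

    def fit(self, X, y):
        n_features = len(X[0])
        self.log_prior, self.theta = {}, {}
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log((len(rows) + 1) / (len(X) + 2))
            # Smoothed estimate of P(feature_j = 1 | class c), never exactly 0 or 1.
            self.theta[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                             for j in range(n_features)]
        return self

    def predict(self, x):
        def score(c):
            return self.log_prior[c] + sum(
                math.log(t) if v else math.log(1 - t)
                for v, t in zip(x, self.theta[c]))
        return 1 if score(1) > score(0) else 0

# Invented training pages: label 1 = keep for further checks, 0 = ordinary page.
train = [
    ("<form><input type='password'></form> bank login password", 1),
    ("<input type=\"password\"> payment credit card copyright example-bank", 1),
    ("weather forecast for the weekend", 0),
    ("recipe: how to bake bread", 0),
]
nb = BernoulliNB().fit([page_features(p) for p, _ in train], [label for _, label in train])
```

Pages for which `nb.predict(...)` returns 0 would be dropped at this layer, and the rest move on to the third layer.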
The third stage makes a further determination on the pages that contain relevant sensitive words. This stage is based on features such as PageRank, domain name registration time and favicon.
PageRank [The PageRank Citation Ranking: Bringing Order to the Web; Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry; Technical Report, Stanford InfoLab; 1999] is a web page ranking algorithm proposed by Larry Page. Its basic idea is that, compared with an unpopular website, a popular website is characterized by having more popular websites linking to it. This intuition has two aspects: the more websites link to a website, the more popular it is; and the more popular the websites linking to a website are, the more popular it is. In other words, the popularity of a website is proportional to the number of websites linking to it and to the popularity of those websites.
A favicon (favorites icon) is the small icon that appears on the left side of the browser's address bar, also called the website avatar. Its display differs between browsers: in most mainstream browsers, such as Firefox and Internet Explorer (version 5.5 and above), the favicon is shown not only in the favorites list but also in the address bar, from where the user can drag it to the desktop to create a shortcut to the website.
The third-stage determination rests on the following principle: a normal brand website usually has a high PageRank, uses a domain name that has been registered for more than K years (for example, more than 3 years), and usually carries an authentic favicon of its own, whereas a phishing website is just the opposite. That is, after the first-layer and second-layer filtering, a remaining website is judged to be a suspected phishing website if it does not have a high PageRank, if its domain name was registered only recently, and/or if it does not have an authentic favicon.
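The third-stage principle can be written as a short rule over three signals. The sketch below is only one reading of it: the text does not say exactly how the signals are combined, so this example assumes a site is cleared only when all three look like a normal brand website. The names `SiteSignals` and `is_suspected_phishing`, the PageRank cutoff of 4.0 and the default K of 3 years are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class SiteSignals:
    pagerank: float              # assumed to be on a 0-10 scale
    domain_age_years: float      # time since the domain name was registered
    favicon_is_authentic: bool   # result of a favicon matching step (assumed available)

def is_suspected_phishing(s: SiteSignals,
                          min_pagerank: float = 4.0,
                          k_years: float = 3.0) -> bool:
    """Keep a site as a suspect unless all three signals look like a brand site."""
    high_rank = s.pagerank >= min_pagerank
    old_domain = s.domain_age_years > k_years
    # Assumption: any single unfavorable signal is enough to keep the site
    # in the suspect set for the precise determination stage.
    return not (high_rank and old_domain and s.favicon_is_authentic)
```

A stricter or looser combination, or a small trained classifier over the same three features, would fit the description equally well.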
Both the second and third stages can be implemented by training a simple classifier on a small number of features. They quickly exclude a large number of non-phishing websites from the detection range, which not only improves detection efficiency but also saves hardware and software resources.
Next comes the precise detection and determination stage. The rich features used in existing statistical phishing detection (URL characters, title, DOM tree, search engine ranking, login box, etc.) are combined with a series of DNS registration and resolution features, brand element features and so on, and a precise determination model is trained on a confusable data set to make the final phishing-or-not determination. For example, the confusable data set may consist of the samples (websites) inside the black circle in FIG. 1. Model training is a well-known technique in pattern recognition and machine learning, in particular supervised learning, in which a model is learned or built from training data; see http://zh.wikipedia.org/wiki/%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0. It is not described further here.
FIG. 3 is a schematic diagram of the module composition of a multi-stage phishing website detection system in one embodiment of the present invention. The system comprises:
a fast filtering module for selecting websites to be detected within a given range and filtering them quickly, so as to exclude the obviously non-phishing websites among them, comprising:
a first filtering module for performing first-layer filtering using a brand domain name library and/or a domain name whitelist, comprising a brand domain name filtering module and a host whitelist filtering module;
a second filtering module for performing second-layer filtering using sensitive words, comprising a login box detection module and a sensitive word filtering module;
a third filtering module for performing third-layer filtering using website-related features, comprising a PageRank acquisition module, a domain name registration information acquisition module, a favicon acquisition and matching module, and the like.
a precise determination module for making a precise determination on the websites that remain after the fast filtering, comprising:
a multi-dimensional feature extraction module for extracting multi-dimensional features, including but not limited to those used by the three filtering modules above:
Domain name registration feature: how long the domain name used by the website has been registered;
Logo feature: whether the suspected phishing website contains a brand logo;
Favicon feature: whether the suspected phishing website contains a brand favicon;
PageRank feature: the PageRank value of the domain name used by the website;
Login box feature: whether the website contains a login box;
Sensitive word feature: whether the website contains keywords such as "bank", "payment", "password" and "prize winning";
Copyright notice feature: whether the website contains a copyright notice of some brand;
HTTPS feature: whether the website uses the HTTPS protocol.
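The eight features listed above map naturally onto a fixed-length numeric vector that a classifier can consume. The sketch below shows one possible layout. The class name `SiteFeatures`, the field names and the choice to encode booleans as 0.0 or 1.0 are assumptions for illustration, and how each value is actually obtained (WHOIS lookup, image matching and so on) is outside the sketch.

```python
from dataclasses import dataclass, astuple

@dataclass
class SiteFeatures:
    domain_age_years: float     # domain name registration feature
    has_brand_logo: bool        # logo feature
    has_brand_favicon: bool     # favicon feature
    pagerank: float             # PageRank feature
    has_login_box: bool         # login box feature
    has_sensitive_words: bool   # sensitive word feature
    has_brand_copyright: bool   # copyright notice feature
    uses_https: bool            # HTTPS feature

    def to_vector(self) -> list:
        """Fixed-order numeric vector; booleans become 0.0 or 1.0."""
        return [float(v) for v in astuple(self)]

# Invented example: a young domain that imitates a brand and asks for a login.
vec = SiteFeatures(0.5, True, True, 0.0, True, True, True, False).to_vector()
```

Keeping the field order fixed is what allows vectors from different sites to be stacked into one training matrix.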
The precise determination module uses the above multi-dimensional features on a training set to train a classifier such as a support vector machine [https://en.wikipedia.org/wiki/Support_vector_machine] or a decision tree [https://en.wikipedia.org/wiki/Decision_tree], obtaining a classification model that makes the determination on suspected websites. For model training and classification in detail, see https://en.wikipedia.org/wiki/Statistical_classification.
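To make the training step concrete without depending on any library, the sketch below fits a decision stump, i.e. a decision tree of depth one, which is a deliberately simplified stand-in for the support vector machine or full decision tree named in the text. The function `train_stump`, the four-feature layout and the six training rows are invented for illustration; a real system would likely use an established machine learning library and far more data.

```python
def train_stump(X, y):
    """Fit a one-level decision tree: the (feature, threshold, polarity)
    triple with the fewest training errors. Labels: 1 = phishing, 0 = benign."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for polarity in (1, -1):
                # polarity 1 predicts phishing when value <= t, -1 when value > t.
                preds = [int((row[j] <= t) == (polarity == 1)) for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    if best is None:
        raise ValueError("no feature takes more than one value")
    _, j, t, polarity = best
    return lambda row: int((row[j] <= t) == (polarity == 1))

# Invented rows: [domain_age_years, pagerank, has_login_box, uses_https]
X = [[0.1, 0.0, 1, 0], [0.3, 1.0, 1, 1], [8.0, 6.0, 1, 1],
     [12.0, 7.0, 0, 1], [0.2, 0.5, 1, 0], [5.0, 4.0, 1, 1]]
y = [1, 1, 0, 0, 1, 0]
model = train_stump(X, y)
```

On this toy data the stump ends up splitting on domain age, which matches the intuition in the text that recently registered domains are more suspicious.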
As described above, the multi-stage phishing detection method and system of the present invention aim to improve the performance of phishing website detection based on statistical machine learning in terms of both detection efficiency and robustness. Through multi-stage filtering, the fast filtering steps efficiently filter out the vast majority of non-phishing websites, which largely resolves the defect of existing phishing detection methods that extracting a large number of features for a combined determination is time-consuming. The approach balances detection efficiency and accuracy, and is suitable both for large-scale server-side processing and for client applications such as browser plug-ins. The following table compares the phishing website detection performance of the method and system of the present invention with that of the prior art:
Figure PCTCN2015098463-appb-000001
Comparison table of phishing website detection performance between the present invention and the prior art
In the above table, prior art 1 is the heuristic phishing detection method, which uses a series of heuristic rules to identify phishing. This method requires heuristic parameters to be set manually, and phishers can evade the rules fairly easily. Heuristic rule methods are therefore often unsuited to the rapidly changing Internet environment; in particular, they are entirely unsuited to discovering newly emerging phishing patterns, which is a clear limitation.
In the above table, prior art 2 is the single-stage phishing detection method based on statistical machine learning. Methods of this kind avoid the defect that the parameter settings of heuristic rule methods are easily evaded by phishers, and they adapt fairly easily to many kinds of phishing. However, building a high-accuracy model requires extracting a large number of features, and the feature extraction stage is time-consuming, so these methods are not suitable for online detection with strict time requirements.
It should be noted that although this embodiment describes multi-stage phishing website detection as the four stages above, those skilled in the art may in practice adjust and test the stages according to the effectiveness of the website-related features, the complexity of extracting them and so on, until the phishing detection strategy best suited to the current network environment is obtained. That is, the method of the present invention is not limited to the four stages above, and the number of stages may be increased or reduced according to the actual situation. For example, the second and third stages may be merged into one stage, or a URL similarity filtering stage may be added between the first and second stages (the URL of a phishing website often contains the brand string of the phishing target). Adjustments such as these are consistent with the technical concept of the present invention and fall within its scope; the scope of protection of the present invention shall be as defined by the claims.
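The optional URL similarity stage mentioned above could, in its simplest form, look for a brand string inside a URL that does not belong to that brand's own domain. The sketch below is a hypothetical illustration of that idea only: the `BRANDS` mapping and the function `imitates_brand` are invented, and plain substring matching is an assumption of the example that a real system would probably replace with a more tolerant similarity measure.

```python
from urllib.parse import urlsplit

# Hypothetical brand strings mapped to each brand's legitimate domain.
BRANDS = {"examplebank": "examplebank.com", "examplepay": "examplepay.com"}

def imitates_brand(url: str) -> bool:
    """True if the URL contains a brand string but is not on that brand's domain."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.lower()
    for brand, legit in BRANDS.items():
        on_legit_domain = host == legit or host.endswith("." + legit)
        # A brand string in the host or path of a foreign domain is suspicious.
        if not on_legit_domain and (brand in host or brand in path):
            return True
    return False
```

In contrast to the earlier fast layers, a stage like this would promote matching URLs to the suspect set rather than clear them.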

Claims (10)

  1. A multi-stage phishing website detection method, comprising the following steps:
    1) selecting the websites within a range to be detected and performing fast filtering on them to exclude the obviously non-phishing websites among them;
    2) extracting the multi-dimensional features used in performing the fast filtering;
    3) using the above multi-dimensional features on a training set to make a precise determination on the websites that remain after the fast filtering, determining whether each is a phishing website.
  2. The multi-stage phishing website detection method according to claim 1, wherein the fast filtering of the websites to be detected on the Internet in step 1) comprises:
    1-1) performing first-layer filtering using brand hosts and/or a domain name whitelist;
    1-2) performing second-layer filtering using login boxes, sensitive words and copyright information;
    1-3) performing third-layer filtering using website-related features.
  3. The multi-stage phishing website detection method according to claim 2, wherein in step 1-1) the first-layer filtering is used to exclude normal brand websites.
  4. The multi-stage phishing website detection method according to claim 2, wherein in step 1-2) the sensitive words include "bank", "credit card", "payment", "prize winning", "login" and "password".
  5. The multi-stage phishing website detection method according to claim 2, wherein in step 1-2) the second-layer filtering uses Bayesian filtering.
  6. The multi-stage phishing website detection method according to claim 2, wherein in step 1-3) the website-related features include PageRank, domain name registration time and favicon.
  7. The multi-stage phishing website detection method according to claim 1, wherein the precise determination in step 3) comprises: training a precise determination model by analyzing the statistical features of the positive and negative samples within the remaining range.
  8. The multi-stage phishing website detection method according to claim 7, wherein the statistical features of the positive and negative samples include existing statistical phishing detection features, DNS registration and resolution features, and brand element features.
  9. A multi-stage phishing website detection system, comprising:
    a fast filtering module for selecting websites to be detected within a given range and filtering them quickly, so as to exclude the obviously non-phishing websites among them;
    a multi-dimensional feature extraction module for extracting the multi-dimensional features used by the fast filtering module in performing the fast filtering;
    a precise determination module for using the above multi-dimensional features on a training set to make a precise determination on the websites to be detected that remain after the fast filtering.
  10. The multi-stage phishing website detection system according to claim 9, wherein the fast filtering module comprises:
    a first filtering module for performing first-layer filtering using a brand domain name library and/or a domain name whitelist;
    a second filtering module for performing second-layer filtering using sensitive words;
    a third filtering module for performing third-layer filtering using website-related features.
PCT/CN2015/098463 2015-06-17 2015-12-23 Multi-stage phishing website detection method and system WO2016201938A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510337127.7A CN104899508B (en) 2015-06-17 2015-06-17 A kind of multistage detection method for phishing site and system
CN201510337127.7 2015-06-17

Publications (1)

Publication Number Publication Date
WO2016201938A1 true WO2016201938A1 (en) 2016-12-22

Family

ID=54032168

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/098463 WO2016201938A1 (en) 2015-06-17 2015-12-23 Multi-stage phishing website detection method and system

Country Status (2)

Country Link
CN (1) CN104899508B (en)
WO (1) WO2016201938A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI636371B (en) * 2017-07-31 2018-09-21 中華電信股份有限公司 Associated sentiment cluster method
US10375091B2 (en) 2017-07-11 2019-08-06 Horizon Healthcare Services, Inc. Method, device and assembly operable to enhance security of networks
WO2021062015A1 (en) * 2019-09-27 2021-04-01 Mcafee, Llc Methods and apparatus to detect website phishing attacks

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899508B (en) * 2015-06-17 2018-12-07 中国互联网络信息中心 A kind of multistage detection method for phishing site and system
CN108023868B (en) * 2016-10-31 2021-02-02 腾讯科技(深圳)有限公司 Malicious resource address detection method and device
CN106776946A (en) * 2016-12-02 2017-05-31 重庆大学 A kind of detection method of fraudulent website
CN107659564B (en) * 2017-09-15 2020-07-31 广州唯品会研究院有限公司 Method for actively detecting phishing website and electronic equipment
CN112182578A (en) * 2017-10-24 2021-01-05 创新先进技术有限公司 Model training method, URL detection method and device
CN108306878A (en) * 2018-01-30 2018-07-20 平安科技(深圳)有限公司 Detection method for phishing site, device, computer equipment and storage medium
CN109347786A (en) * 2018-08-14 2019-02-15 国家计算机网络与信息安全管理中心 Detection method for phishing site
CN109831460B (en) * 2019-03-27 2021-03-16 杭州师范大学 Web attack detection method based on collaborative training
CN110784462B (en) * 2019-10-23 2020-11-03 北京邮电大学 Three-layer phishing website detection system based on hybrid method
CN114070653B (en) * 2022-01-14 2022-06-24 浙江大学 Hybrid phishing website detection method and device, electronic equipment and storage medium
CN114095278B (en) * 2022-01-19 2022-05-24 南京明博互联网安全创新研究院有限公司 Phishing website detection method based on mixed feature selection frame

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120158626A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation Detection and categorization of malicious urls
CN103544436A (en) * 2013-10-12 2014-01-29 深圳先进技术研究院 System and method for distinguishing phishing websites
US20140298460A1 (en) * 2013-03-26 2014-10-02 Microsoft Corporation Malicious uniform resource locator detection
CN104217160A (en) * 2014-09-19 2014-12-17 中国科学院深圳先进技术研究院 Method and system for detecting Chinese phishing website
CN104899508A (en) * 2015-06-17 2015-09-09 中国互联网络信息中心 Multistage phishing website detecting method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6785732B1 (en) * 2000-09-11 2004-08-31 International Business Machines Corporation Web server apparatus and method for virus checking
CN103379111A (en) * 2012-04-21 2013-10-30 中南林业科技大学 Intelligent anti-phishing defensive system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10375091B2 (en) 2017-07-11 2019-08-06 Horizon Healthcare Services, Inc. Method, device and assembly operable to enhance security of networks
TWI636371B (en) * 2017-07-31 2018-09-21 中華電信股份有限公司 Associated sentiment cluster method
WO2021062015A1 (en) * 2019-09-27 2021-04-01 Mcafee, Llc Methods and apparatus to detect website phishing attacks
US11831419B2 (en) 2019-09-27 2023-11-28 Mcafee, Llc Methods and apparatus to detect website phishing attacks

Also Published As

Publication number Publication date
CN104899508A (en) 2015-09-09
CN104899508B (en) 2018-12-07

Similar Documents

Publication Publication Date Title
WO2016201938A1 (en) Multi-stage phishing website detection method and system
Parra et al. Detecting Internet of Things attacks using distributed deep learning
Xiang et al. Cantina+ a feature-rich machine learning framework for detecting phishing web sites
CN104077396B (en) Method and device for detecting phishing website
CN104954372B (en) A kind of evidence obtaining of fishing website and verification method and system
CN103530367B (en) A kind of fishing website identification system and method
CN104504335B (en) Fishing APP detection methods and system based on page feature and URL features
CN104217160A (en) Method and system for detecting Chinese phishing website
CN109274632A (en) A kind of recognition methods of website and device
Liu et al. An efficient multistage phishing website detection model based on the CASE feature framework: Aiming at the real web environment
CN110177114A (en) The recognition methods of network security threats index, unit and computer readable storage medium
Chen et al. Ai@ ntiphish—machine learning mechanisms for cyber-phishing attack
Rasymas et al. Detection of phishing URLs by using deep learning approach and multiple features combinations
Vanitha et al. Malicious-URL detection using logistic regression technique
CN107231383B (en) CC attack detection method and device
Valiyaveedu et al. Survey and analysis on AI based phishing detection techniques
Sahu et al. Kernel K-means clustering for phishing website and malware categorization
Zaman et al. Phishing website detection using effective classifiers and feature selection techniques
CN105653941A (en) Heuristic detection method and system for phishing website
Ahmed et al. A framework for phishing attack identification using rough set and formal concept analysis
US20230164180A1 (en) Phishing detection methods and systems
Chen et al. A Malicious URL detection method based on CNN
Kalabarige et al. A Boosting based Hybrid Feature Selection and Multi-layer Stacked Ensemble Learning Model to detect phishing websites
Bikku et al. Optimized Machine Learning Algorithm to classify Phishing Websites
CN115001763A (en) Phishing website attack detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15895503

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15895503

Country of ref document: EP

Kind code of ref document: A1