CN108337255B - Phishing website detection method based on automated web testing and broad learning

Info

Publication number: CN108337255B
Application number: CN201810088364.8A
Authority: CN
Inventors: 袁巍; 聂依凡; 李浩鹏; 贾昂; 蔡明辉; 姜源
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2018-01-30
Filing date: 2018-01-30
Publication date: 2020-08-04
Anticipated expiration: 2038-01-30
Also published as: CN108337255A

Abstract

The invention discloses a phishing website detection method based on automated web testing and broad learning, belonging to the technical field of computer network security. Conventional features are extracted from the URL and the HTML page, interactive features are extracted using automated web testing, and a broad learning model is then trained on the preprocessed training samples obtained after feature extraction. Phishing websites can thus be identified accurately and quickly, protecting users' online information and property.

Description

A phishing website detection method based on automated web testing and broad learning

Technical Field

The invention belongs to the technical field of computer network security and, more particularly, relates to a phishing website detection method based on automated web testing and broad learning.

Background

Phishing is an attack that steals sensitive information, such as personal identity data and financial account details, by mass-mailing deceptive spam that claims to come from banks or well-known institutions, by placing fake advertisements on web pages, and by similar means. In the most typical attack, the user is lured to a carefully crafted phishing website that closely resembles the website of the target organization; the site then captures the sensitive personal information the user enters or tricks the user into transferring money. Because victims rarely become suspicious during such an attack, phishing websites have become one of the most serious forms of Internet crime, and phishing website detection has become one of the most active research directions in network security.

In 2016, the National Engineering Laboratory for Internet Domain Name Management Technology, whose establishment was led by CNNIC, together with the international Anti-Phishing Working Group (APWG) and the Anti-Phishing Alliance of China (APAC), jointly released the "Statistical Analysis Report on the Current State of Global Chinese-Language Phishing Websites (2016)" (hereinafter the "Report"). According to the Report, the number of phishing websites in China grew by 150.96% year on year in 2016. The main impersonation targets were Taobao, China Mobile and the major banks, and the domains used were mainly .COM, .CC, .PW and .NET.

In the third quarter of 2017, 360 Mobile Guard blocked phishing websites 790 million times for mobile phone users nationwide, an increase of 102.6% over the third quarter of 2016. Among the blocked mobile phishing websites, gambling sites accounted for 80.2% of the total, followed in decreasing order by fake shopping, fake recruitment, finance and securities, counterfeit medicine and phishing advertisements.

Although the number of blocks is large, most of the blocked websites had already existed for a long time, and the newest phishing websites remain hard to catch and block. The average lifetime of a phishing website is only 4.684 days, whereas the average reporting cycle is 13.327 days. Phishing websites must therefore be identified and blocked within a very short time; otherwise they threaten the public's property.

At present, phishing websites are identified and blocked by antivirus software and by browsers themselves. The techniques fall into the following categories:

① Blacklist filtering: phishing websites found by manual inspection or reported by the public are added to a blacklist. When a visited URL (Uniform Resource Locator) is on the blacklist, the visit is blocked and a warning is shown. This approach cannot identify the newest phishing websites and requires manual verification.

② URL feature extraction: features such as the domain name are extracted from the visited URL. This approach is unreliable because a URL alone does not carry the decisive characteristics of a phishing website, so its false-positive and false-negative rates are high.

③ Detection using various page elements as features: this approach is more accurate than the second one, but because obtaining page features takes time, its execution speed and efficiency are low.

Summary of the Invention

In view of the above defects of, or needs for improvement in, the prior art, the invention provides a phishing website detection method based on automated web testing and broad learning. Its purpose is to extract conventional features from the URL and the HTML page, to extract interactive features using automated web testing, and to train a broad learning model on the preprocessed training samples obtained after feature extraction, so that phishing websites are identified and detected accurately and quickly and the public's online information and property are protected.

To achieve the above purpose, according to one aspect of the invention, a phishing website detection method based on automated web testing and broad learning is provided, comprising the following steps:

(1) on a PC (Personal Computer), performing static feature extraction, dynamic feature extraction and interactive feature extraction on the large number of phishing websites and normal websites in a data set to form a set of feature vectors;

the data set consists of phishing websites and normal websites collected from the Internet, or is obtained directly from a network security company;

(2) dividing the set of feature vectors from step (1) into a training set and a validation set by k-fold cross-validation;

(3) training a broad learning model on the training set, testing and comparing on the validation set, building a base model and optimizing the performance of the classifier;

the classifier is the model trained by the broad learning algorithm; in use, a URL is input to the classifier, which outputs whether it is a phishing website; the performance of the classifier means the accuracy with which it identifies phishing websites; optimizing the performance means improving that accuracy;

(4) collecting misclassified websites and newly recorded websites as a new set of feature vectors, and performing input-increment incremental learning on the model to optimize it.

Preferably, step (1) specifically comprises:

(1.1) extracting static features from the URL;

(1.2) using automated web testing to drive a headless browser that visits the URLs in the data set;

(1.3) extracting dynamic features from the page reached through the URL;

(1.4) having the simulated browser interactively click through and browse the page, and returning the interactive features.

Preferably, the static features in step (1.1) include:

① whether the URL contains an IP address;

② whether the part of the domain name from its start to the first dot is purely numeric;

③ whether the URL contains sensitive characters such as @;

④ whether the URL port is port 80;

⑤ whether the URL is shorter than 23 characters;

⑥ whether the URL contains keywords related to shopping or financial accounts, such as account, banking or taobao;

The above six static features are denoted <F1,F2,F3,F4,F5,F6>.

Preferably, the dynamic features in step (1.3) include:

① whether the HTML title contains sensitive words such as '彩票' (lottery), '境外赌博' (overseas gambling) or '中奖' (winning a prize);

② whether the page contains a form;

③ whether the image resources are served from the same domain as the original URL;

④ whether the href of each link is on the same domain as the URL; href, short for Hypertext Reference, is the URL that specifies the target of a hyperlink;

The above four dynamic features are denoted <F7,F8,F9,F10>.

Preferably, the interactive features in step (1.4) include:

① whether the form is rigorous;

② whether a clicked link is empty;

③ whether clicking a link triggers a URL redirect;

The above three interactive features are denoted <F11,F12,F13>.

Preferably, step (2) specifically comprises:

(2.1) setting the value of k;

(2.2) dividing the data set of step (1) into a training set and a validation set by k-fold cross-validation.

Preferably, step (3) specifically comprises:

(3.1) training the broad learning model on the feature vectors of the web page samples in the training set of step (2) and testing the performance of the classifier; the web page samples are the phishing website URLs and normal website URLs in the training set;

(3.2) repeatedly adjusting the network architecture by adding feature nodes and enhancement nodes, training and testing until the classifier reaches the expected performance, then obtaining the weights of each layer and saving the model.

Preferably, step (3.1) specifically comprises:

(3.11) initializing the number of feature windows N2, the number of feature nodes per window N1 and the number of enhancement nodes N3; randomly initializing the feature-node weight matrix of the classifier model, and processing the feature-node weights with a sparse autoencoder;

(3.12) multiplying the feature vectors of the web page samples by the weight matrix obtained in step (3.11) to obtain the feature-node matrix;

(3.13) randomly initializing the enhancement-node weight matrix;

(3.14) multiplying the feature-node matrix obtained in step (3.12) by the weight matrix obtained in step (3.13) to obtain the enhancement-node matrix;

(3.15) concatenating the feature-node matrix obtained in step (3.12) and the enhancement-node matrix obtained in step (3.14) horizontally, column-wise, to obtain the input matrix;

(3.16) computing the Moore-Penrose pseudoinverse (the "plus" generalized inverse) of the input matrix obtained in step (3.15) and multiplying it by <Y> to obtain the weight matrix; <Y> is the matrix formed from the labels of the web page samples; the label of a web page sample indicates whether it is a phishing website, for example label 1 for a phishing website and label 0 for a non-phishing website;

(3.17) because step (2) uses k-fold cross-validation, repeating step (3.1) k times and averaging the accuracy over the k runs;

(3.18) gradually increasing the values of N1, N2 and N3, observing whether the accuracy of the broad learning model improves, and finding the optimal parameters.

Preferably, step (3.2) specifically comprises:

(3.21) adjusting the model obtained in step (3.1) by the incremental learning method of increasing the number of feature nodes and the number of enhancement nodes, and testing it;

(3.22) repeating step (3.21) a set number of times and recording the test accuracy each time, comparing the results to determine the optimal numbers of feature nodes and enhancement nodes, and saving this optimal model.

Preferably, step (4) specifically comprises: collecting misclassified websites and newly recorded websites as a new set of feature vectors, and performing input-increment incremental learning on the model to obtain an adjusted weight matrix, thereby optimizing the model.

Preferably, in step (1), regular expressions are used to extract the static features of the URL, and PhantomJS is used as a headless browser for headless UI automated testing; PhantomJS is a WebKit-based headless browser that is scripted through a JavaScript API.

The invention combines static URL feature extraction, dynamic HTML feature extraction and interactive feature extraction based on automated web testing. The static features of the URL are extracted first. Headless UI automated testing is then performed: a simulated browser visits the URL and retrieves the page source, from which the dynamic features are extracted. Link clicks and form operations such as account entry and login are simulated, and because page rendering is skipped, interactive features are extracted quickly. For a new phishing website that is not on any blacklist, simulated link clicks test whether links are empty, a simulated login tests whether the form is legitimate, and simulated link clicks test whether URL redirection occurs. These new interactive features allow phishing websites to be detected quickly and accurately.

The invention uses a broad learning model to improve the recognition of new phishing websites. Broad learning is a new machine learning approach. Unlike deep learning, a broad learning architecture is shallow and needs fewer computing resources. Moreover, when new samples arrive, a deep learning model must be retrained as a whole, which takes a great deal of time, whereas the broad learning algorithm does not need to retrain the original model: features are extracted only from the newly added phishing website samples and the existing model is adjusted and extended, so the detection accuracy keeps improving as the model updates itself.

Overall, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:

1. In the method provided by the invention, static features are first extracted from the URL, dynamic features are then extracted from the HTML page, and automated web testing is finally used to simulate a browser and extract interactive features from the page. The features progress from static to dynamic to interactive feedback, so feature mining goes from shallow to deep, which ensures both the quantity and the quality of the features. Finally, the low resource requirements, fast training and incremental learning of the broad learning model yield an accurate, fast and adaptive phishing website identification technique;

2. Combining interactive features with static URL features and dynamic HTML features greatly improves the accuracy of phishing website identification and also applies to the newest phishing websites, which can be detected and identified accurately and quickly;

3. With broad learning, features are extracted from newly added phishing website samples and the existing model is adjusted and extended, so the time needed to update the model is greatly reduced and the demand on computing resources is low; at the same time, the detection accuracy of broad learning keeps improving through self-updating.

Description of the Drawings

Fig. 1 is the overall flowchart of a phishing website detection method based on automated web testing and broad learning in a preferred embodiment of the invention;

Fig. 2 is a schematic diagram of phishing website feature extraction in a preferred embodiment of the invention;

Fig. 3 is a schematic diagram of preparing the k-fold cross-validation data sets for broad learning in a preferred embodiment of the invention;

Fig. 4 is a schematic diagram of training and optimizing the broad learning model in a preferred embodiment of the invention.

Detailed Description

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

The invention provides an intelligent phishing website detection method based on automated web testing and broad learning. Fig. 1 is the main flowchart of the invention and shows the overall process and the relationships between the steps. The implementation of each step is described below:

(1) Step 1: on a PC, perform static feature extraction, dynamic feature extraction and interactive click-through access on the large number of phishing websites and normal websites in the data set.

As shown in Fig. 2, step 1 is as follows:

Step 1.1: extract static features from the URL itself, comprising the following six features:

① whether the URL contains an IP address: an IP address can be used to avoid domain name registration and user scrutiny;

② whether the part of the domain name from its start to the first dot is purely numeric: legitimate sites rarely put purely numeric labels in their domain names; compare the legitimate Baidu URL https://www.baidu.com/ with the phishing URL http://www.030033.com/;

③ whether the URL contains sensitive characters such as @: the part before the @ is treated as account information and the real address is the part after it, a trick phishing websites often use;

④ whether the URL port is port 80: legitimate URLs are accessed through port 80, so a non-80 port suggests a phishing website;

⑤ whether the URL is shorter than 23 characters: statistically, the URL of a legitimate site generally does not exceed 23 characters;

⑥ whether the URL contains keywords such as account, banking or taobao: URLs involving shopping or banking warrant caution;

The above six features are matched and extracted from the URL with six regular expressions. For example, re.match(r'.*?//(.*?)/.*', url) combined with split('.') extracts the domain name from the URL. The other features are extracted in the same way. If a feature is present, 1 is returned; otherwise 0 is returned. Together they form a set of six features <F1,F2,F3,F4,F5,F6>.
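As a minimal sketch of step 1.1, the six static features could be computed as shown below. This is an illustration under stated assumptions rather than the patented implementation: the function name `static_features`, the exact regular expressions, the keyword list, and the choice to ignore a leading `www.` label for F2 (so that the example http://www.030033.com/ is flagged) are all hypothetical. Each feature is oriented so that 1 marks the suspicious case, which the source does not spell out for every feature.

```python
import re
from urllib.parse import urlsplit

# Hypothetical keyword list; the source names only these three examples.
KEYWORDS = ("account", "banking", "taobao")


def static_features(url):
    """Return [F1..F6] for a URL; 1 marks the suspicious case (an assumption)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # F1: the host is a dotted-quad IPv4 address.
    f1 = int(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) is not None)
    # F2: first domain label is purely numeric (leading "www." ignored: assumption).
    label = re.sub(r"^www\.", "", host).split(".")[0]
    f2 = int(re.fullmatch(r"\d+", label) is not None)
    # F3: the sensitive character '@' appears anywhere in the URL.
    f3 = int("@" in url)
    # F4: an explicit port other than 80 (no explicit port counts as normal).
    f4 = int(parts.port is not None and parts.port != 80)
    # F5: the URL is 23 characters or longer.
    f5 = int(len(url) >= 23)
    # F6: the URL contains a shopping or account keyword.
    f6 = int(re.search("|".join(KEYWORDS), url, re.IGNORECASE) is not None)
    return [f1, f2, f3, f4, f5, f6]
```

Calling `static_features("http://www.030033.com/")` sets only F2, while a URL such as `http://192.168.1.10:8080/banking/login@x` sets all six features.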

Step 1.2: access the URL from code with a headless browser; skipping the rendering of the browser page saves time for the subsequent feature extraction. The code is as follows:

from selenium import webdriver

# PhantomJS support was removed in Selenium 4; this call needs an older Selenium release
driver = webdriver.PhantomJS()
driver.get('http://www.douyu.com/directory/all')

Step 1.3: extract dynamic features from the HTML (HyperText Markup Language) source of the page reached through the URL, comprising the following four features:

① whether the HTML title contains sensitive words such as '彩票' (lottery), '境外赌博' (overseas gambling) or '中奖' (winning a prize): some illegal phishing websites exploit users' desire for gain;

② whether the page contains a form: a form is also a sensitive feature, because the ultimate goal of a phishing website is to steal account names and passwords;

③ whether the image resources are served from the same domain as the original URL: phishing websites often steal images from legitimate websites;

④ whether the href of each link is on the same domain as the URL: phishing websites do not write their own content and mostly link to content hosted elsewhere; href, short for Hypertext Reference, is the URL that specifies the target of a hyperlink.

The above four features are parsed from the page source using the driver object from step 1.2. For example, driver.find_element_by_tag_name("title").text extracts the content of the title. The other features are extracted in the same way. If a feature is present, 1 is returned; otherwise 0 is returned. Together they form a set of four features <F7,F8,F9,F10>.
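The four dynamic features can be illustrated without a browser by parsing the page source directly. The sketch below is an assumption-laden stand-in for the driver-based extraction described above: it uses Python's standard `html.parser` instead of the Selenium driver, the names `dynamic_features` and `_PageScan` are hypothetical, and F9 and F10 are set to 1 when any image or link points to a different host, which is one possible reading of the "same domain" test.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

SENSITIVE_WORDS = ("彩票", "境外赌博", "中奖")  # examples named in the source


class _PageScan(HTMLParser):
    """Collect the title text, form presence, image sources and link targets."""

    def __init__(self):
        super().__init__()
        self.title, self.has_form = "", False
        self.img_srcs, self.hrefs = [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def _any_foreign(page_url, targets):
    """Return 1 if any http(s) target resolves to a host other than the page's."""
    host = urlsplit(page_url).hostname
    for target in targets:
        parts = urlsplit(urljoin(page_url, target))
        if parts.scheme in ("http", "https") and parts.hostname != host:
            return 1
    return 0


def dynamic_features(page_url, html):
    """Return [F7..F10]; 1 marks the suspicious case (an assumption)."""
    scan = _PageScan()
    scan.feed(html)
    f7 = int(any(word in scan.title for word in SENSITIVE_WORDS))
    f8 = int(scan.has_form)
    f9 = _any_foreign(page_url, scan.img_srcs)
    f10 = _any_foreign(page_url, scan.hrefs)
    return [f7, f8, f9, f10]
```

In a real pipeline the `html` argument would come from the headless browser, for example from the driver's page source, so that scripts have already run before the page is scanned.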

Step 1.4: interactively click through the page itself, yielding the following three features:

① whether the form is rigorous, that is, whether a random account name and password can log in: a phishing website usually has no access to the original account database, so generally any account name and password will be accepted;

② whether a clicked link is empty: if most links are empty, the website is more likely to be a phishing website;

③ whether clicking a link triggers a URL redirect: phishing websites are short-lived and often use URL redirection to jump to an IP address that has not yet been blocked;

The above three features are obtained by using the driver to simulate a browser clicking through the page in real time and extracting features from the returned values. For example, elem.send_keys(u'随机生成的账号') fills a randomly generated account name into the form (the string literal means "randomly generated account"), and the password is filled in the same way; if clicking the login button then succeeds, the URL is very likely a phishing website. The other features are extracted in the same way. If a feature is present, 1 is returned; otherwise 0 is returned. Together they form a set of three features <F11,F12,F13>;
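Driving a real browser cannot be shown in a self-contained snippet, so the sketch below assumes the clicks and the random login have already been simulated and recorded, and only shows how the three interactive features might be derived from those observations. The `ClickResult` record, the list of href values treated as "empty", the majority threshold for F12 and the host comparison for F13 are all hypothetical choices, not details given in the source.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Assumed notion of an "empty" link target.
EMPTY_HREFS = ("", "#", "javascript:;", "javascript:void(0)")


@dataclass
class ClickResult:
    """Hypothetical record of one simulated link click."""
    href: str        # absolute target declared on the link
    final_url: str   # URL the browser ended on after the click


def _is_empty(href):
    return href.strip().lower() in EMPTY_HREFS


def interactive_features(random_login_accepted, clicks):
    """Return [F11, F12, F13]; 1 marks the suspicious case (an assumption)."""
    # F11: the form accepted a randomly generated account and password.
    f11 = int(bool(random_login_accepted))
    # F12: more than half of the clicked links are empty.
    n_empty = sum(_is_empty(c.href) for c in clicks)
    f12 = int(n_empty * 2 > len(clicks))
    # F13: a non-empty link ended on a different host than its href declared.
    f13 = int(any(
        not _is_empty(c.href)
        and urlsplit(c.href).hostname != urlsplit(c.final_url).hostname
        for c in clicks
    ))
    return [f11, f12, f13]
```

Keeping the browser automation separate from the feature logic in this way also makes the feature rules easy to unit-test against recorded sessions.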

(2) Step 2: assign the training set and validation set by k-fold cross-validation. As shown in Fig. 3, step 2 is as follows:

Step 2.1: set the value of k. The parameter k is the number of times training and testing are repeated, and the resources consumed are proportional to it;

Step 2.2: use k-fold cross-validation to split off the training set and validation set used to train the model. The original data set is divided into k parts, k-1 parts serve as the training set and 1 part as the validation set, and this is repeated k times, which improves the generalization ability of the model.
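The split in step 2.2 can be sketched with NumPy as follows. The function name `k_fold_indices` and the fixed random seed are illustrative assumptions, and the sketch assumes k is at least 2.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle the samples once
    folds = np.array_split(order, k)     # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]               # one part for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```

Each sample lands in the validation set exactly once, so averaging the k validation accuracies uses every sample for both training and validation.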

Fig. 4 is a schematic diagram of training and optimizing the broad learning model; see steps 3 and 4 below for details:

(3) Step 3: train the broad learning model and optimize it. Step 3 is as follows:

Step 3.1: train the broad learning model on the feature vectors of the phishing website samples obtained in step 1 and test the performance of the classifier. This comprises the following steps:

Step 3.1.1: initialize the number of feature windows N2, the number of feature nodes per window N1 and the number of enhancement nodes N3. Based on experience from repeated experiments, initialize N1*N2 = samples/600 and N3 = samples/10, where samples is the number of samples. Randomly initialize the feature-node weight matrix We of the classifier model, and process the feature-node weights with a sparse autoencoder;

Step 3.1.2: multiply the feature matrix X of the phishing website training samples by the weight matrix We obtained in step 3.1.1 to obtain the feature-node matrix Z;

Step 3.1.3: randomly initialize the enhancement-node weight matrix Wh;

Step 3.1.4: multiply the feature-node matrix Z obtained in step 3.1.2 by the weight matrix Wh obtained in step 3.1.3 to obtain the enhancement-node matrix H;

Step 3.1.5: concatenate the feature-node matrix Z obtained in step 3.1.2 and the enhancement-node matrix H obtained in step 3.1.4 horizontally, column-wise, to obtain the input matrix A;

Step 3.1.6: compute the Moore-Penrose pseudoinverse of the input matrix A obtained in step 3.1.5 and multiply it by the label set of the training samples to obtain the weight matrix W. The code is as follows:

import numpy as np

# Ridge-regularized pseudoinverse solution: W = (A^T A + lamda*I)^-1 A^T train_y
W = np.linalg.inv(A.T.dot(A) + lamda * np.eye(A.shape[1])).dot(A.T.dot(train_y))

Step 3.1.7: because step 2 uses k-fold cross-validation, repeat step 3.1 k times and average the accuracy over the k runs, which strengthens the generalization ability of the classifier;

Step 3.1.8: tune the parameters N1, N2 and N3 based on experimental experience, take the highest accuracy at the peak, and save the values of these three parameters.
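Steps 3.1.1 to 3.1.6 can be pulled together in a short NumPy sketch. It is a simplified illustration rather than the patented implementation: the sparse-autoencoder refinement of We is omitted, a tanh activation is applied to the enhancement nodes (an assumption; the text above describes only a matrix product), and the function names, the regularization value `lam` and the 0.5 decision threshold are hypothetical.

```python
import numpy as np


def train_bls(X, y, n1, n2, n3, lam=1e-3, seed=0):
    """Fit a small broad learning model; returns (We, Wh, W)."""
    rng = np.random.default_rng(seed)
    # Steps 3.1.1-3.1.2: n2 windows of n1 random feature nodes each, Z = X We.
    We = rng.standard_normal((X.shape[1], n1 * n2))
    Z = X @ We
    # Steps 3.1.3-3.1.4: random enhancement weights; the tanh is an assumption.
    Wh = rng.standard_normal((Z.shape[1], n3))
    H = np.tanh(Z @ Wh)
    # Step 3.1.5: input matrix A = [Z | H].
    A = np.hstack([Z, H])
    # Step 3.1.6: ridge-regularized pseudoinverse solution for the output weights.
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    return We, Wh, W


def predict_bls(X, We, Wh, W):
    """Return 1 (phishing) or 0 (normal) for each row of X."""
    Z = X @ We
    A = np.hstack([Z, np.tanh(Z @ Wh)])
    return (A @ W >= 0.5).astype(int)
```

Here `X` would hold one 13-element row <F1,...,F13> per website and `y` a column of 0/1 labels. Only the output weights `W` are solved for; `We` and `Wh` stay fixed after random initialization, which is what keeps training cheap.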

Step 3.2: adjust the network architecture by increasing the number of nodes in the matrices, training and testing until the classifier reaches the expected performance or the number of adjustments reaches a set limit, then obtain and save the weights of each layer in the best case;

Step 3.2 specifically comprises the following:

Step 3.2.1: adjust the model obtained in step 3.1 by the incremental learning method of increasing the number of feature nodes and the number of enhancement nodes, and test it;

Step 3.2.2: repeat step 3.2.1 a set number of times and record the test accuracy each time, compare the results to determine the optimal numbers of feature nodes and enhancement nodes, and save this optimal model.
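One way to add nodes without refitting from scratch is the block pseudoinverse update commonly used in the broad learning literature; whether the patent uses exactly this update is not stated, so the sketch below is an assumption. It works with the exact pseudoinverse rather than the ridge-regularized form, the name `add_columns` is hypothetical, and the update is exact when the new columns are either linearly independent of the existing ones and of each other, or lie entirely inside their span.

```python
import numpy as np


def add_columns(A, A_pinv, W, H_new, Y):
    """Append node columns H_new to A and update pinv(A) and W without refitting."""
    D = A_pinv @ H_new
    C = H_new - A @ D                    # part of H_new outside the span of A
    if np.allclose(C, 0):
        # New columns add no new direction: use the closed-form correction.
        B_T = np.linalg.solve(np.eye(D.shape[1]) + D.T @ D, D.T @ A_pinv)
    else:
        B_T = np.linalg.pinv(C)
    A_ext = np.hstack([A, H_new])
    pinv_ext = np.vstack([A_pinv - D @ B_T, B_T])
    W_ext = np.vstack([W - D @ (B_T @ Y), B_T @ Y])
    return A_ext, pinv_ext, W_ext
```

Only the pseudoinverse of the small matrix `C` is computed, so each round of step 3.2.1 costs far less than recomputing the pseudoinverse of the whole enlarged input matrix.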

(4) Step 4: collect misclassified websites and newly recorded websites to obtain a new set of feature vectors, and perform input-increment incremental learning on the model when appropriate;

Step 4 specifically comprises the following:

4.1: extract and save the features of the examples that were misclassified in step 3;

4.2: extract the features of new websites and save them in a local file;

4.3: when the number of saved samples reaches a preset value, perform input-increment incremental learning on them in one batch and adjust the weight matrix W, thereby updating the model;
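The batch update in step 4.3 can be sketched as a recursive least-squares update of the ridge solution, which gives the same W as refitting on the old and new samples together. This is an illustrative assumption rather than the update rule stated in the source; the names `init_state` and `add_samples` and the value of `lam` are hypothetical. The rows of `A_new` are the new samples after they have been passed through the same feature and enhancement nodes as the original training data.

```python
import numpy as np


def init_state(A, Y, lam=1e-3):
    """Return (P, W) with P = (A^T A + lam*I)^-1 and W = P A^T Y."""
    P = np.linalg.inv(A.T @ A + lam * np.eye(A.shape[1]))
    return P, P @ A.T @ Y


def add_samples(P, W, A_new, Y_new):
    """Fold new rows (A_new, Y_new) into the ridge solution without refitting."""
    S = np.eye(A_new.shape[0]) + A_new @ P @ A_new.T
    K = np.linalg.solve(S, A_new @ P).T      # gain matrix, shape (nodes, new rows)
    W = W + K @ (Y_new - A_new @ W)          # correct W by the new residuals
    P = P - K @ A_new @ P                    # Woodbury update of the inverse
    return P, W
```

The only matrix that has to be solved against has one row per new sample, so the cost of an update depends on the batch size rather than on the size of the accumulated training set.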

In summary, with the technical solution provided by the present invention, the intelligent phishing website detection method based on web automated testing and breadth learning combines traditional features with interactive features and trains the model with breadth learning. It consumes few resources, achieves fast adaptive updating while maintaining the accuracy of the model, and can accurately intercept and counter phishing websites within their extremely short life cycle.

The k-fold cross-validation method described in the present invention shuffles the samples and divides them evenly into k parts. In turn, k-1 parts are selected for training and the remaining part is used for validation, and the sum of squared prediction errors is computed. Finally, the k sums of squared prediction errors are averaged, and this average serves as the basis for selecting the optimal model structure. Given N samples, the special case k = N is the leave-one-out method.
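A minimal sketch of this splitting and averaging is given below. The function names and the `evaluate` callback (which is assumed to train on the training part and return the validation error or accuracy) are illustrative, not part of the described method.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx): shuffle, cut into k near-equal folds,
    and hold out each fold once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validate(samples, k, evaluate, seed=0):
    """Average evaluate(train, val) over the k folds."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(samples), k, seed):
        train = [samples[i] for i in train_idx]
        val = [samples[i] for i in val_idx]
        scores.append(evaluate(train, val))
    return sum(scores) / k
```

Calling `k_fold_splits(N, N)` holds out exactly one sample per round, which is the leave-one-out case.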

Let matrix A be an m*n matrix. The plus-sign generalized inverse (pseudo-inverse) of A described in the present invention refers to (A'A)^(-1)A', where A' denotes the transpose of A.
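As a worked illustration, assuming A has full column rank so that A'A is invertible, the plus-sign generalized inverse and its use in solving for a weight matrix W from a label matrix Y can be written as follows. The small numerical case is illustrative only.

```latex
\begin{align*}
A^{+} &= (A^{\top}A)^{-1}A^{\top}, \qquad A^{+}A = I_n, \\
W &= A^{+}Y = \arg\min_{W}\ \lVert AW - Y \rVert_F^{2}. \\
\intertext{Example with $A = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$:}
A^{\top}A &= 5, \qquad
A^{+} = \tfrac{1}{5}\begin{pmatrix} 1 & 2 \end{pmatrix}, \qquad
A^{+}A = \tfrac{1}{5}(1 + 4) = 1.
\end{align*}
```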

The weights described in the present invention refer to the parameters of the breadth learning model.

Sparse autoencoding (Sparse Autoencoder) refers to a technique that automatically learns features from unlabeled data and gives a better feature description than the original data.

Those skilled in the art will readily understand that the above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A phishing website detection method based on breadth learning, characterized in that it comprises the following steps:

(1) Perform static feature extraction, dynamic feature extraction and interactive feature extraction on a large number of phishing websites and normal websites in the website dataset on the PC side to form a feature vector set;

(2) Divide the feature vector set of step (1) into a training set and a validation set using the k-fold cross-validation method;

(3) Use the training set to train the breadth learning model, use the validation set to test and compare, build a basic model and optimize the performance of the classifier; the performance of the classifier refers to the accuracy with which the classifier identifies phishing websites;

(4) Collect misjudged websites and newly included websites as a new feature vector set, and perform input-increment incremental learning on the model to optimize it;

Step (1) is specifically:

(1.1) Perform static feature extraction on the url; the static features include: whether the part of the url's domain name from its start to the first dot consists purely of digits, and whether the length of the url is less than 23 characters;

(1.2) Use web automated testing technology to simulate a headless browser that accesses the urls in the dataset;

(1.3) Perform dynamic feature extraction on the page accessed through the url; the dynamic features include: whether image resources have the same domain name as the original url;

(1.4) Simulate interactive clicking and browsing of the page in the browser, and return the interactive features.

2. The phishing website detection method based on breadth learning as claimed in claim 1, characterized in that the static features described in step (1.1) further comprise:

① Whether there is an IP address in the url;

③ Whether there are sensitive characters in the url, the sensitive characters including @;

④ Whether the url port is port 80;

⑥ Whether the url contains keywords related to shopping or property accounts, the keywords including account, banking and taobao.
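The static features of claims 1 and 2 can be sketched with the Python standard library as follows. The function name, the returned field names, the inference of a default port from the scheme when none is written in the url, and plain substring matching for keywords are all assumptions made for illustration.

```python
import ipaddress
from urllib.parse import urlsplit

KEYWORDS = ("account", "banking", "taobao")
DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed when the url has no port


def static_url_features(url):
    """Return the url-only features as a dict of booleans."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    first_label = host.split(".", 1)[0]  # domain name up to the first dot
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False
    try:
        port = parts.port
    except ValueError:  # malformed port in the url
        port = None
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    lowered = url.lower()
    return {
        "first_label_all_digits": first_label.isdigit(),
        "length_under_23": len(url) < 23,
        "has_ip_address": has_ip,
        "has_at_sign": "@" in url,
        "port_is_80": port == 80,
        "has_sensitive_keyword": any(k in lowered for k in KEYWORDS),
    }
```

Each boolean would become one component of the feature vector for the url.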

3. The phishing website detection method based on breadth learning as claimed in claim 1, characterized in that the dynamic features described in step (1.3) further comprise:

① Whether the title of the html contains sensitive words, the sensitive words including 'lottery', 'overseas gambling' and 'winner';

② Whether a form is present;

④ Whether the href of a link has the same domain name as the url; href is short for Hypertext Reference and specifies the url of the hyperlink target.
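The page-level checks of claims 1 and 3 can be sketched on already-fetched html as below. This sketch uses the standard-library parser rather than a browser, compares exact host names as a stand-in for "same domain name", and reports whether all images or links stay on the page's host; these choices and all names are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

SENSITIVE_TITLE_WORDS = ("lottery", "overseas gambling", "winner")


class _PageScanner(HTMLParser):
    """Collect the title text, form presence, link hrefs and image sources."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.has_form = False
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def _same_host(page_url, ref):
    # Resolve relative references against the page url before comparing.
    return urlsplit(urljoin(page_url, ref)).hostname == urlsplit(page_url).hostname


def dynamic_page_features(page_url, html):
    scanner = _PageScanner()
    scanner.feed(html)
    title = scanner.title.lower()
    return {
        "title_has_sensitive_word": any(w in title for w in SENSITIVE_TITLE_WORDS),
        "has_form": scanner.has_form,
        "all_images_same_domain": all(_same_host(page_url, r) for r in scanner.images),
        "all_links_same_domain": all(_same_host(page_url, r) for r in scanner.links),
    }
```

In the described method the html would come from the page loaded by the simulated browser, so content generated by scripts would also be visible to these checks.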

4. The phishing website detection method based on breadth learning as claimed in claim 1, characterized in that the interactive features described in step (1.4) comprise:

① Whether the form is rigorous;

② Whether a clicked link is empty;

③ Whether clicking a link triggers a url redirection.

5. The phishing website detection method based on breadth learning as claimed in claim 1, characterized in that step (2) is specifically:

(2.1) Set the k value;

(2.2) Use the k-fold cross-validation method to divide the data set in step (1) into the training set and the validation set.

6. The phishing website detection method based on breadth learning as claimed in claim 1, characterized in that step (3) is specifically:

(3.1) Use the feature vector set of the web page samples in the training set of step (2) to train the breadth learning model and test the performance of the classifier;

(3.2) Continuously adjust the network architecture by adding feature nodes and enhancement nodes, training and testing until the classifier achieves the expected performance; obtain the weight information of each layer and save the model.

7. The phishing website detection method based on breadth learning as claimed in claim 6, characterized in that step (3.1) is specifically:

(3.11) Initialize the number of feature windows N2, the number of feature nodes per window N1, and the number of enhancement nodes N3; randomly initialize the weight matrix of the feature nodes of the classifier model, and process the feature node weights with sparse autoencoding;

(3.12) Perform matrix multiplication between the feature vector set of the webpage sample and the weight matrix obtained in step (3.11) to obtain a feature node matrix;

(3.13) Randomly initialize the enhancement node weight matrix;

(3.14) Multiply the feature node matrix obtained in step (3.12) by the weight matrix obtained in step (3.13) to obtain an enhancement node matrix;

(3.15) Horizontally concatenate, by column, the feature node matrix obtained in step (3.12) and the enhancement node matrix obtained in step (3.14) to obtain an input matrix;

(3.16) Compute the plus-sign generalized inverse of the input matrix obtained in step (3.15) and multiply it by <Y> to obtain a weight matrix; <Y> is the matrix composed of the labels of the web page samples, where each label indicates whether the sample is a phishing website;

(3.17) Since step (2) uses k-fold cross-validation, repeat step (3.1) k times and average the k accuracy values;

(3.18) Gradually increase the values of N1, N2 and N3, observe whether the accuracy of the breadth learning model improves, and find the optimal parameters.
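Steps (3.11) to (3.16) can be sketched in dependency-free Python as below. Several points are assumptions rather than part of the claim: the sparse autoencoding of the feature weights is omitted, all N1*N2 feature nodes are generated from one random matrix, a tanh activation is applied to the enhancement nodes as is common in breadth learning, and a small ridge term is added before inversion so that the normal equations are always solvable. All function names are illustrative.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(m, rhs):
    """Gauss-Jordan elimination with partial pivoting; returns m^-1 * rhs."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def expand(X, We, Wh):
    """Build the input matrix [Z | H] from the samples X."""
    Z = matmul(X, We)                                           # (3.12)
    H = [[math.tanh(v) for v in row] for row in matmul(Z, Wh)]  # (3.14), tanh assumed
    return [z + h for z, h in zip(Z, H)]                        # (3.15)


def bls_fit(X, Y, n1, n2, n3, ridge=1e-3, seed=0):
    rng = random.Random(seed)
    We = random_matrix(len(X[0]), n1 * n2, rng)  # (3.11), autoencoder step omitted
    Wh = random_matrix(n1 * n2, n3, rng)         # (3.13)
    A = expand(X, We, Wh)
    At = [list(col) for col in zip(*A)]
    G = matmul(At, A)
    for i in range(len(G)):
        G[i][i] += ridge                         # assumed regularization
    W = solve(G, matmul(At, Y))                  # (3.16): W = (A'A + ridge*I)^-1 A'Y
    return We, Wh, W


def bls_predict(model, X):
    We, Wh, W = model
    return matmul(expand(X, We, Wh), W)
```

For step (3.17) the fit would be repeated on each training fold and the validation accuracies averaged; for step (3.18) the call would be repeated with larger `n1`, `n2` and `n3`.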

8. The phishing website detection method based on breadth learning as claimed in claim 6, characterized in that step (3.2) is specifically:

(3.21) Adjust and test the model obtained in step (3.1) using the incremental learning method that increases the number of feature nodes and the number of enhancement nodes;

(3.22) Repeat step (3.21) a set number of times and record the test accuracy each time; compare the results to determine the optimal numbers of feature nodes and enhancement nodes, and save the optimal model.

9. The phishing website detection method based on breadth learning as claimed in claim 1, characterized in that step (4) is specifically: collect misjudged websites and newly included websites as a new feature vector set, and perform input-increment incremental learning on the model to obtain an adjusted weight matrix and thereby optimize the model.