CN108965245B - Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model - Google Patents


Info

Publication number
CN108965245B
CN108965245B CN201810549417.1A CN201810549417A CN108965245B CN 108965245 B CN108965245 B CN 108965245B CN 201810549417 A CN201810549417 A CN 201810549417A CN 108965245 B CN108965245 B CN 108965245B
Authority
CN
China
Prior art keywords
training
website
classification
sample
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810549417.1A
Other languages
Chinese (zh)
Other versions
CN108965245A (en)
Inventor
臧天宁
强倩
杜飞
周渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruichi Xinan Technology Co ltd
National Computer Network and Information Security Management Center
Original Assignee
Beijing Ruichi Xinan Technology Co ltd
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruichi Xinan Technology Co ltd, National Computer Network and Information Security Management Center
Priority to CN201810549417.1A
Publication of CN108965245A
Application granted
Publication of CN108965245B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Abstract

The invention provides a phishing website detection method and system based on a self-adaptive heterogeneous multi-classification model. The method comprises: constructing a self-adaptive heterogeneous multi-classification model by linear addition of multiple base classification algorithms; training the multi-classification model, wherein the input of the model is the input of each base classification algorithm, the output of the model is a sample label, and each base classification algorithm extracts its corresponding features from the sample record as input; and solving the model parameters with a machine learning algorithm, then testing and optimizing on the test set to obtain the final phishing website detection model. The system comprises a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structure style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training data set management module and a detection and alarm module. The invention realizes real-time detection of phishing websites and improves the accuracy and stability of phishing website detection.

Description

Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model
Technical Field
The invention relates to the field of computer network security, in particular to a phishing website detection method and system based on a self-adaptive heterogeneous multi-classification model.
Background
With the rapid development of Internet technology, network security problems keep emerging. Phishing is a typical form of online fraud that uses the Internet as a carrier and disguises itself as a reputable, legitimate website to trick users into giving up sensitive information; deceived users suffer personal information leakage to varying degrees, resulting in economic loss. How to detect phishing websites quickly and accurately has become a research hotspot in Web (World Wide Web) information security. Currently disclosed phishing website detection technologies mainly include the following:
(1) Detection technology based on a black and white list mechanism: as a practical core technology, black and white lists have the advantages of high efficiency and accuracy. A phishing website can be located quickly by checking its domain name, which makes this one of the most common implementation techniques [1].
(2) Detection technology based on visual similarity: Cao Jieshen et al. proposed a web page similarity determination algorithm based on nested EMD (Earth Mover's Distance), which segments the web page image and constructs an ARG (Attributed Relational Graph) of the web page from the features of the segmented sub-images. After the distances between different ARG attributes are calculated, the similarity between the phishing website and the web page of the protected website is obtained by the nested EMD method, achieving high-precision detection of phishing websites [2].
Lubrouhahan distinguishes phishing websites by visual similarity based on the EMD algorithm. The algorithm uses the result of comparing pixel similarity between web pages as the basis for identifying a phishing website [3].
(3) Detection technology based on the Bayesian algorithm: Jinqing et al. formulated a series of rules for matching phishing websites based on a rule-based Bayesian algorithm, assigned a weight to each rule, and calculated a correction coefficient for each rule to obtain the probability that the tested website is a phishing website. A probability threshold then determines whether the website is a phishing website [4][5].
Su-maid et al. constructed a system capable of intelligently detecting phishing attacks by classifying and integrating 8 features of a website, such as the content of the web page title tag, keyword information, page description information, image links and website copyright information, using an extended Bayesian algorithm and an improved support vector machine [6].
(4) Detection technology based on document structure: Guo Mingzhi et al. analyzed web page document objects and extracted the phishing-sensitive features in the document object model that phishers often exploit, in order to determine whether a website is a phishing website. The algorithm effectively filters out phishing pages and strongly hinders phishing attacks [7].
(5) Phishing website detection technology based on deep learning: a DBN-KNN model with a multilayer structure has been proposed in combination with deep learning technology and applied to feature recognition of phishing websites in order to identify them [8].
(6) Other types of detection technology: Huang Huajun et al. proposed active phishing defense based on semi-fragile watermarks [9] and a phishing URL detection algorithm based on abnormal features [10]; Zhangjia Yiet et al. proposed a phishing detection algorithm based on text semantic understanding [11]; the remaining defensive measures include the analysis of cross-site phishing attacks against Web-mail mailboxes, cloud-computing-based URL filtering [12], SVM learning algorithms [13], and the like.
Among the above technologies, the detection method based on black and white lists has poor timeliness and limited list coverage; the detection technology based on visual similarity involves complex algorithms and long detection times and is not suitable for online real-time detection of massive numbers of URLs (Uniform Resource Locators); the detection technology based on the Bayesian algorithm has unsatisfactory robustness and generalization performance; the detection technology based on document structure suffers from incomplete feature coverage and many missed detections; the phishing website detection technology based on deep learning has advantages in feature recognition, but its features are unstable and it is easily disturbed by sample contamination.
References:
[1] Huang C., Ma S., Chen K. Using One-Time Passwords to Prevent Password Phishing Attacks [J]. Journal of Network and Computer Applications, 2011, 34(4): 1292-1301.
[2] Cao European New, Roman, et al. Phishing web page detection algorithm based on nested EMD [J]. Computer Science, 2009, 32(5): 922-.
[3] Plum blossom, Liu Dong. Phishing detection method based on visual similarity [J]. Journal of Tsinghua University (Natural Science Edition), 2009, 49(1): 146-.
[4] Zhang H., Liu G., Chow T.W.S., et al. Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach [J]. IEEE Transactions on Neural Networks, 2011, 22(10): 1532-1546.
[5] Jinqing, Wu Nationality New, Litdan, et al. Filtering of Internet phishing with a rule-based Bayesian algorithm [R]. Hunan: The Fifth Academic Conference on Information and Communication Technology Security, 2007: 372-378.
[6] Classification-integration-based intelligent phishing website detection system [J]. Systems Engineering Theory and Practice, 2011, 31(10): 2008-2020.
[7] Guo Mingzhi, Yuanjinsheng, Wang Yachao, et al.
[8] Research on phishing website detection technology based on deep learning [D]. University of Electronic Technology, 2017.
[9] Invar army, Liu Jia, Liu Liang. Study of cross-site phishing attacks against Web-mail mailboxes [J]. Communication Technology, 2010, 43(8): 164-.
[10] Huang H.J., Wang Y.J., Xie L.L., et al. An Active Anti-Phishing Solution Based on Semi-fragile Watermark [J]. Information Technology Journal, 2013, 12(1): 198-203.
[11] Huang H.J., Qian L., Wang Y.J. A SVM-Based Technique to Detect Phishing URLs [J]. Information Technology Journal, 2012, 11(7): 921-925.
[12] Zhang H., Liu G., Chow T.W.S., et al. Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach [J]. IEEE Transactions on Neural Networks, 2011, 22(10): 1532-1546.
[13] Sheng S., Wardman B., Warner G., et al. An Empirical Analysis of Phishing Blacklists [C]. In: Proc. of the Sixth Conference on Email and Anti-Spam, 2009: 1-10.
Disclosure of Invention
Aiming at the problems of the existing method, the invention discloses a phishing website detection method and system based on a self-adaptive heterogeneous multi-classification model, which are used for detecting phishing websites in real time and have higher accuracy and stability.
The invention discloses a phishing website detection method based on a self-adaptive heterogeneous multi-classification model (AHMC), which comprises the steps of learning the self-adaptive heterogeneous multi-classification model and detecting the phishing website, and specifically comprises the following steps:
Step 1, for a phishing website sample set D of a certain category, with |D| = n, performing cross-validation using the leave-one-out method to divide the samples into training sets and test sets. Let the j-th training set be denoted $D_j$; the corresponding j-th test set is denoted $D/D_j$, where j is a positive integer. Each sample contains a sample record and a sample label; the sample record contains the URL and web page information of the website, and the sample label marks whether the website is a phishing website.
Step 2, constructing an adaptive heterogeneous multi-classification model H through linear addition, as follows:
[Formula image BDA0001680048800000032: $H(x)$ as a linear combination of the base classification algorithms $h_i(x)$, $i = 1, \ldots, T$, with weight parameters $\omega_i$ and an adjustment factor]
where T is the number of base classification algorithms, $h_i$ is the i-th base classification algorithm, $\omega_i$ is the weight parameter of the i-th base classification algorithm, the model further includes an adjustment factor, and x represents the sample record.
Step 3, the input of the multi-classification model H is the input of each base classification algorithm, and its output is a sample label; for each sample in the training set $D_j$, the features required by each base classification algorithm are extracted from the sample record as input. Each base classification algorithm is characterized by a linear function, and the parameters of the classification algorithms are independent and identically distributed.
Step 4, based on the training set $D_j$, using a machine learning algorithm to train and solve the parameters of each base classification algorithm together with the weight parameters $\omega_i$ and the adjustment factor of the multi-classification model H. During training, the corresponding features are extracted from the sample record as the input of each base classification algorithm, and the output of the multi-classification model is required to be the sample label.
Step 5, on the test set $D/D_j$, testing and optimizing the multi-classification model H until the parameters of each base classification algorithm and the parameters of the multi-classification model H (the weight parameters $\omega_i$ and the adjustment factor) converge, at which point the machine learning algorithm for the multi-classification model H ends.
Step 6, from the finally obtained parameters of each base classification algorithm and the parameters of the multi-classification model H (the weight parameters $\omega_i$ and the adjustment factor), obtaining the phishing website detection model H'.
Step 7, acquiring the record of the website to be detected, including the URL (Uniform Resource Locator) and web page information of the website, and inputting it into the detection model H' to determine whether the website is a phishing website.
The invention also discloses a phishing website detection system based on the self-adaptive heterogeneous multi-classification model, which comprises a domain name morpheme feature classifier, a theme index feature classifier, a content similarity feature classifier, a structure style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training data set management module and a detection and alarm module. The functions of each module in the system operation are as follows:
the domain name morpheme feature classifier performs feature extraction and training on the domain name string of the input website URL;
the topic index feature classifier performs feature extraction and training on the contents of the web page tags <title> and <meta> and the page footer of the input website;
the content similarity feature classifier performs feature extraction and semantic abstraction on the information in the web page content of the input website, and trains on the resulting features;
the structure style feature classifier performs feature extraction and training on the web page source code structure of the input website;
the visual rule feature classifier extracts and learns the symbolic visual features of the web pages of the input website;
the linear addition training module linearly combines the domain name morpheme feature classifier, the topic index feature classifier, the content similarity feature classifier, the structure style feature classifier and the visual rule feature classifier, in the following form:
[Formula image BDA0001680048800000041: $H(x)$ as a linear combination of $h_1(x)$ to $h_5(x)$ with weight parameters $\omega_i$ and an adjustment factor]
where $H(x)$ denotes the multi-classification model combining the five classifiers, $h_1$-$h_5$ are the classification functions of the five classifiers respectively, $\omega_i$ is the weight parameter of the i-th classifier, the model further includes an adjustment factor, and x represents the sample record;
the linear addition training module uses the training set and the test set to train and optimize the parameters of the five classifiers together with the weight parameters $\omega_i$ and the adjustment factor;
the integrated classifier is the final model output by the linear addition training module; it constitutes the phishing website detection model and dynamically maintains the weight of each classifier;
the training data set management module stores the training data set, performs label management and grouping of the training samples, divides and maintains the training set and the test set, and manages the sampling balance among the different groups;
the detection and alarm module checks the website to be detected against the constructed phishing website detection model, and raises an alarm when a phishing website is detected.
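The sampling balance handled by the training data set management module can be illustrated with a minimal Python sketch. The patent does not say how balance is achieved; downsampling every label group to the size of the smallest group is an assumption made here, and the function name `balance_by_label`, the fixed seed and the toy URLs are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], int]  # (record, label); label 1 = phishing, 0 = legitimate


def balance_by_label(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Downsample every label group to the size of the smallest group (assumed policy)."""
    if not samples:
        return []
    groups: Dict[int, List[Sample]] = defaultdict(list)
    for record, label in samples:
        groups[label].append((record, label))
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced: List[Sample] = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], smallest))
    return balanced


# Hypothetical toy data: five phishing records and two legitimate ones.
data = [({"url": f"http://phish{i}.example.top/"}, 1) for i in range(5)]
data += [({"url": f"https://site{i}.example.com/"}, 0) for i in range(2)]
balanced = balance_by_label(data)
print(len(balanced))
```

Downsampling discards data; oversampling the smaller group would be an equally plausible reading of the text.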
Compared with the prior art, the invention has the following obvious advantages:
(1) The method and the system of the invention adopt ensemble learning to linearly combine a plurality of weak classifiers, obtaining generalization performance clearly superior to that of a single classifier and improving the accuracy and stability of phishing website detection.
(2) The method and the system of the invention adopt adaptive weight learning: the weights are learned from the samples without relying on prior experience or the advice of domain experts, and they are updated automatically during training when the sample features and distribution change.
(3) The five weak classifiers in the system are heterogeneous, each single classifier has certain accuracy, and strong dependency relationship does not exist among the classifiers, so that the integrated model is ensured to have higher accuracy. The overall cost of five weak classifiers is no greater than that of a single complex learning model.
(4) The method can detect the phishing website in real time, and has higher accuracy, recall rate and robustness.
(5) The system can process high-performance real-time detection of mass URLs, can be applied to an online engineering system, and has high availability and stability.
Drawings
FIG. 1 is a flow chart of the phishing website detection method based on the adaptive heterogeneous multi-classification model of the invention;
FIG. 2 is a block diagram of the phishing website detection system based on the adaptive heterogeneous multi-classification model according to the present invention;
FIG. 3 is a schematic diagram of a linear addition training module for data training in the phishing website detection system of the present invention;
FIG. 4 is a deployment diagram of the phishing website detection system based on the adaptive heterogeneous multi-classification model of the invention.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples. The examples are given solely for the purpose of illustration and are not intended to limit the scope of the invention.
As shown in FIG. 1, the invention provides a phishing website detection method based on an adaptive heterogeneous multi-classification model (AHMC), which comprises learning of the adaptive heterogeneous multi-classification model and detection of phishing websites, and the following specific implementation steps are described.
Step 1, selecting phishing websites of the same category, for example phishing websites counterfeiting banks of the same type, as a sample set D, with |D| = n, where n is the number of samples in D. Cross-validation by the leave-one-out method is used to divide the samples into training sets and test sets.
The j-th training sample set is: $D_j = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ $(1 \le j \le n,\ 1 < m < n)$;
The corresponding j-th test sample set is: $D/D_j$,
where each sample comprises a record x and a label y; for example, $(x_1, y_1)$ in the sample set represents a specific sample instance, in which $x_1$ is the record of the sample and $y_1$ is the sample label. The record of a sample includes the URL of the website and the corresponding web page information, and the sample label marks whether the website is a phishing website. $D/D_j$ denotes the set D with $D_j$ removed.
In this step, the sample size n should be as large as possible; it is recommended that it be no lower than 100.
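The split into training and test sets can be sketched in Python as follows. This is a minimal sketch that assumes the textbook form of leave-one-out, in which each split holds out exactly one sample as the test set $D/D_j$ and trains on the rest; the function name `leave_one_out_splits` and the toy records are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], int]  # (record x, label y); label 1 = phishing, 0 = legitimate


def leave_one_out_splits(samples: List[Sample]) -> List[Tuple[List[Sample], List[Sample]]]:
    """Return one (training set D_j, test set D/D_j) pair per sample in D."""
    splits = []
    for j in range(len(samples)):
        test = [samples[j]]                    # the single held-out sample
        train = samples[:j] + samples[j + 1:]  # every other sample
        splits.append((train, test))
    return splits


# Hypothetical toy records; real records would also carry web page information.
D = [
    ({"url": "http://icbc-login.example.top/"}, 1),
    ({"url": "https://www.example.com/"}, 0),
    ({"url": "http://cmb-verify.example.xyz/"}, 1),
]
splits = leave_one_out_splits(D)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With n samples this yields n splits, so the recommended n of at least 100 means at least 100 training runs.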
Step 2, constructing an adaptive heterogeneous multi-classification model H as follows:
[Formula image BDA0001680048800000051: $H(x)$ as a linear combination of the base classification algorithms $h_i(x)$, $i = 1, \ldots, T$, with weight parameters $\omega_i$ and an adjustment factor]
where T is the number of base classification algorithms, $h_i$ is the i-th base classification algorithm, $\omega_i$ is the weight parameter of the i-th base classification algorithm, and the model further includes an adjustment factor.
In the following description a base classification algorithm is also called a classification algorithm or a learning algorithm, and the corresponding classifier is also called a learner. The base classification algorithms are heterogeneous, which ensures the diversity of the algorithms. The embodiment of the invention uses 5 fixed classification algorithms: $h_1$ is the domain name morpheme feature classification algorithm, which judges whether a website is a phishing website according to its domain name morphemes; $h_2$ is the topic index feature classification algorithm, which judges according to the content under the topic tags of the web page; $h_3$ is the content similarity feature classification algorithm, which judges by comparing the similarity of the content enclosed by the content tags of the web page; $h_4$ is the structure style feature classification algorithm, which judges according to the structure of the source code; $h_5$ is the visual rule feature classification algorithm, which judges according to the icons, color scheme, pictures and the like of the web page. In practice the set can be extended as needed according to the principle of heterogeneity.
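The linear combination of base classification algorithms can be sketched in Python. The exact formula is only available as an image in the source, so this sketch assumes the adjustment factor enters as an additive term, $H(x) = \sum_i \omega_i h_i(x) + b$; the names `make_ensemble`, `h_domain` and `h_topic`, the toy rules inside them and the weights are hypothetical.

```python
from typing import Callable, Dict, Sequence

Record = Dict[str, str]
BaseClassifier = Callable[[Record], float]  # returns a score, here 0.0 or 1.0


def make_ensemble(classifiers: Sequence[BaseClassifier],
                  weights: Sequence[float],
                  adjustment: float = 0.0) -> Callable[[Record], float]:
    """Build H(x) = sum_i w_i * h_i(x) + adjustment (additive form is an assumption)."""
    if len(classifiers) != len(weights):
        raise ValueError("one weight per base classifier is required")

    def H(x: Record) -> float:
        return sum(w * h(x) for w, h in zip(weights, classifiers)) + adjustment

    return H


# Two toy stand-ins for heterogeneous base classifiers (not the patent's algorithms).
def h_domain(x: Record) -> float:
    return 1.0 if x["url"].endswith(".top") else 0.0


def h_topic(x: Record) -> float:
    return 1.0 if "bank" in x["title"].lower() else 0.0


H = make_ensemble([h_domain, h_topic], weights=[0.75, 0.25])
suspect = H({"url": "http://icbc-login.example.top", "title": "Example Bank"})
benign = H({"url": "https://www.example.com", "title": "Example Bank"})
print(suspect, benign)  # 1.0 0.25
```

Because each base classifier only needs to map a record to a score, a sixth classifier can be added by appending one function and one weight.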
Step 3, the sample record of each sample in the training set $D_j$ is used as the input of the base classification algorithms in the multi-classification model H, and the sample label is used as the output. For a sample $(x_i, y_i)$, $x_i$ is the input of the classification algorithms $h_1$-$h_5$: each base classification algorithm extracts the features it needs from $x_i$ and takes $y_i$ as the corresponding output, and the features and parameters of the corresponding classification algorithm are trained as follows:
$h_1(x_i) \to y_i,\ h_2(x_i) \to y_i,\ h_3(x_i) \to y_i,\ h_4(x_i) \to y_i,\ h_5(x_i) \to y_i$
In the embodiment of the present invention, the multi-classification model H is represented as follows:
[Formula image BDA0001680048800000061: $H(x)$ as the linear weighted combination of $h_1(x)$ to $h_5(x)$ with weight parameters $\omega_i$ and an adjustment factor]
When the input is a sample record $x_i$, the corresponding sample label $y_i$ is output; the right-hand side of the above formula is a linear weighted combination of the classification functions. Training the multi-classification model means computing the weight parameters $\omega_i$ and the adjustment factor.
During training, the features corresponding to each classification algorithm $h_1$-$h_5$ are extracted from the input sample record $x_i$ as that algorithm's input, and the required output of the multi-classification model is the sample label $y_i$; the output of each classification algorithm can likewise be set to $y_i$. The parameters of each classification algorithm, the weight parameters of the multi-classification model and the adjustment factor are then trained. Each base classification algorithm is characterized by a linear function, and the parameters of the classification algorithms are independent and identically distributed, so the cost of the whole training is no higher than that of a single complex learning model. Different base classification algorithms may take different input features, so suitable features must be selected from the sample record. For example, the input features of the domain name morpheme feature classification algorithm $h_1$ include the top-level domain name, the second-level domain name, and so on.
Step 4, a machine learning algorithm is used to train and solve the parameters of each classification algorithm, the weight parameters and the adjustment factor. For example, the parameters of each classification algorithm may be solved by maximum likelihood estimation, the parameters of the integrated model H may be solved iteratively by the EM (expectation-maximization) algorithm, and the constraint conditions may be formalized by minimizing a loss function. The solution process may be implemented within a unified computational framework, namely parameter solving by maximum likelihood estimation, and is carried out uniformly as matrix computation when executed on a computer.
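The idea of solving the base classifier parameters and the combination weights in one computation can be sketched in Python. This sketch does not reproduce the EM iteration mentioned above: it swaps in plain stochastic gradient descent on the logistic loss, which is the maximum likelihood criterion for a logistic model. It also assumes that each base classifier is a linear function $\theta_i \cdot f_i(x)$ of its own features and that the adjustment factor is an additive term b. All names and the toy data are hypothetical.

```python
import math
from typing import List, Tuple

Features = List[List[float]]  # one feature vector per base classifier


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(theta: List[List[float]], omega: List[float], b: float, f: Features) -> float:
    """H(x) before the sigmoid: sum_i omega_i * (theta_i . f_i) + b."""
    return sum(w * sum(t * v for t, v in zip(th, fi))
               for w, th, fi in zip(omega, theta, f)) + b


def train_jointly(features: List[Features], labels: List[int],
                  epochs: int = 1000, lr: float = 0.1
                  ) -> Tuple[List[List[float]], List[float], float]:
    """Jointly fit base parameters theta, weights omega and adjustment b by SGD."""
    T = len(features[0])
    theta = [[0.1] * len(features[0][i]) for i in range(T)]
    omega = [1.0 / T] * T
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            h = [sum(t * v for t, v in zip(theta[i], f[i])) for i in range(T)]
            # Gradient of the log loss with respect to the pre-sigmoid score.
            err = sigmoid(sum(omega[i] * h[i] for i in range(T)) + b) - y
            for i in range(T):
                for d in range(len(theta[i])):
                    theta[i][d] -= lr * err * omega[i] * f[i][d]
                omega[i] -= lr * err * h[i]
            b -= lr * err
    return theta, omega, b


# Toy data: two base classifiers with one binary feature each;
# a site is labelled phishing (1) if either feature fires.
X = [[[1.0], [1.0]], [[1.0], [0.0]], [[0.0], [1.0]], [[0.0], [0.0]]]
y = [1, 1, 1, 0]
theta, omega, b = train_jointly(X, y)
probs = [sigmoid(score(theta, omega, b, f)) for f in X]
print([p > 0.5 for p in probs])
```

Updating all three parameter groups in the same loop mirrors the single-stage training the text describes, as opposed to fitting the base classifiers first and the weights afterwards.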
Step 5, on the test sample set $D/D_j$, the model H is tested and optimized. All test and training samples are polled until the parameters and the adjustment factor converge to a stable threshold, at which point the learning algorithm of the model ends.
This step serves both testing and optimization. When a conflict occurs on a test sample, or the adjustment factor fails to converge, the samples are corrected: the training samples are reclassified and handled individually, the sample labels are corrected, and the training set is updated. The model H is then trained again to adjust the parameters of the classification algorithms, that is, the training process of Step 4 is executed again, thereby optimizing the parameters and the adjustment factor.
The method of the invention uses the leave-one-out method to obtain the training sets and test sets, giving K groups of training and test sets in total. Steps 3-5 are executed on each group, which finally yields multiple groups of base classification algorithm parameters and adjustment factors. The obtained classification algorithm parameters and adjustment factors can then be combined and averaged to give the final result.
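The combined averaging over the K groups can be sketched in Python as an element-wise mean. This sketch assumes that each group's result is flattened into a parameter vector of the same length and order; the function name `average_parameters` and the numbers are hypothetical.

```python
from typing import List


def average_parameters(fold_params: List[List[float]]) -> List[float]:
    """Element-wise mean of K parameter vectors, one vector per training/test group."""
    if not fold_params:
        raise ValueError("at least one parameter vector is required")
    k = len(fold_params)
    return [sum(column) / k for column in zip(*fold_params)]


# Hypothetical results from K = 2 groups: [omega_1, omega_2, adjustment factor].
averaged = average_parameters([[0.5, 0.25, 1.0], [0.25, 0.75, 0.0]])
print(averaged)  # [0.375, 0.5, 0.5]
```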
Step 6, from the parameters of each classification algorithm, the weight parameters $\omega_i$ and the adjustment factor obtained in Step 5, the adaptive heterogeneous multi-classification model H' for this category of phishing websites is obtained.
The model H obtained by the optimization in Step 5 serves as a model instance; its parameters are migrated to initialize the phishing website detection algorithm H'. The models H' and H are isomorphic, and in the embodiment of the invention H' is a hybrid model integrating $h_1$-$h_5$.
Step 7, acquiring the record of the website to be detected, including the website URL, the web page source code and other web page information, and inputting it into the detection model H' to determine whether the URL belongs to a phishing website, which target it counterfeits, and other information. The input web page information does not need to be formatted; the features used by each classifier are obtained automatically from the web page source code structure.
In this step, the website information and source code data corresponding to the URL can be obtained with crawler technology. When new features and variants appear, only the affected base classification algorithm and its features need to be updated, and the impact on the weight parameters and the adjustment factor is small.
The invention adopts the idea of ensemble learning, and differs from classical ensemble learning mainly in the following respect: classical ensemble learning has two stages, the first training the base classifiers and the second using the output of the first stage as input to train the parameters of the combination of base classifiers. The invention trains everything together within a unified computational framework and makes no two-stage division.
The invention discloses a phishing website detection system based on a self-adaptive heterogeneous multi-classification model, which mainly comprises 9 parts: a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structure style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training data set management module and a detection and alarm module. As shown in FIG. 2, the functions of the modules in the operation of the system are described below.
A domain name morpheme feature classifier: the features of the classifier are mainly from the statistical features of the domain name part in the URL character string of the phishing website. The domain name morpheme feature classifier performs feature extraction and training on a domain name character string of an input website URL, and the realized functions include but are not limited to: 1) judging the suspicious degree of the top-level domain name; 2) extracting morpheme information contained in the secondary domain name; 3) acquiring the hierarchical structure of the domain name and the length of the sub-domain name; 4) and constructing and perfecting a morpheme feature library.
In the domain name morpheme feature classifier: 1) the degree of suspicion of a top-level domain name comes from statistical experience; top-level domain names such as pw, win, top and xyz usually appear in phishing websites with higher probability; 2) morpheme information in the second-level domain name refers to bank short names, such as 95588, 95533, cmb, icbc and boc, contained in the string that forms the second-level domain name; 3) third-level or fourth-level domain names may contain a hyphen-joined abbreviation of a bank website, such as www-bankofbeijing-com-cn.
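The three kinds of domain name feature described above can be extracted with a minimal Python sketch. The suspicious top-level domains and bank morphemes are the ones listed in the text; treating the last dot-separated label as the top-level domain is a simplification that ignores multi-part suffixes such as com.cn, and the function name `domain_features` and the example URL are hypothetical.

```python
from typing import Dict, List, Union
from urllib.parse import urlsplit

SUSPICIOUS_TLDS = {"pw", "win", "top", "xyz"}              # examples given in the text
BANK_MORPHEMES = ("95588", "95533", "cmb", "icbc", "boc")  # examples given in the text

FeatureValue = Union[bool, int, List[str], List[int]]


def domain_features(url: str) -> Dict[str, FeatureValue]:
    """Extract simple domain name features; the last label is taken as the TLD."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    second_level = labels[-2] if len(labels) >= 2 else ""
    subdomains = labels[:-2]  # third-level and deeper labels
    return {
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
        "morphemes": [m for m in BANK_MORPHEMES if m in second_level],
        "depth": len(labels),
        "subdomain_lengths": [len(s) for s in subdomains],
        "hyphenated_subdomain": any("-" in s for s in subdomains),
    }


# Hypothetical URL combining the patterns described above.
feats = domain_features("http://www-bankofbeijing-com-cn.icbc-login.top/index.html")
print(feats)
```

Substring matching on morphemes is deliberately naive here; a real morpheme feature library would need word boundaries or a curated dictionary to avoid accidental hits.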
The topic index feature classifier mainly performs feature extraction and training on the contents of the web page tags <title> and <meta> and the page footer of the input website. Its functions include but are not limited to: 1) extracting the features in the tags, and performing conflict resolution and type classification of the features; 2) constructing and refining a topic index feature library. The topic index feature classifier has the advantage of fast and accurate localization, and the disadvantages of weak generalization ability and a high false alarm rate: the content of the <title> tag is often hard to distinguish from that of normal, non-counterfeit websites, or is unrelated to the body content of the web page. The classifier therefore needs to be used together with a white list library.
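Extraction of the <title> and <meta> contents can be sketched with Python's standard `html.parser` module. This sketch covers only the title and the keywords and description meta tags; footer extraction, conflict resolution and the white list check are not shown, and the class name `TopicIndexParser` and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class TopicIndexParser(HTMLParser):
    """Collect the <title> text and the keywords/description <meta> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: Dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attributes.get("content") or ""

    def handle_endtag(self, tag: str) -> None:
        if tag == "title":
            self._in_title = False

    def handle_data(self, data: str) -> None:
        if self._in_title:
            self.title += data.strip()


page = (
    "<html><head><title> Example Bank Login </title>"
    '<meta name="keywords" content="bank,login">'
    '<meta name="description" content="Online banking">'
    "</head><body></body></html>"
)
parser = TopicIndexParser()
parser.feed(page)
parser.close()
print(parser.title, parser.meta)
```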
Content similarity feature classifier: this classifier mainly performs feature extraction and semantic abstraction on the short text in the webpage content. Its functions include, but are not limited to: 1) extracting the text within the <body> tag, where the content extracted from tags such as <a>, <p>, <div>, <span>, <td>, <table>, and <form> may not exceed 15 characters in length, and content within the text is extracted in units of 2 to 8 characters; 2) vectorizing and normalizing the text features; 3) word embedding, in which words are mapped to low-dimensional feature vectors using the Word2Vec tool; 4) building a word feature vector library.
The content similarity feature classifier has a stable detection effect, and its accuracy and recall are better than those of the other classifiers. Here, vectorization means deduplicating and filtering the short texts; normalization means deleting specific time words, frequently changing numbers, interference words that occur too often, advertisements with no distinguishing value, third-party link words, and the like.
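The deduplication, filtering, and normalization steps can be sketched as below. This is a simplified reading, not the patented procedure: removing digit runs stands in for "frequently changing numbers", the stopword tuple is a hypothetical stand-in for interference and advertisement words, and the 2 to 8 character extraction and the Word2Vec embedding are not shown.

```python
import re

MAX_LEN = 15  # length limit on extracted tag content, taken from the description


def normalize(text: str, stopwords=()) -> str:
    """Strip digit runs and caller-supplied interference words."""
    text = re.sub(r"\d+", "", text)
    for word in stopwords:
        text = text.replace(word, "")
    return text.strip()


def short_text_features(texts, stopwords=()):
    """Normalize, filter by length, and deduplicate short texts, keeping order."""
    seen = set()
    kept = []
    for raw in texts:
        text = normalize(raw, stopwords)
        if 0 < len(text) <= MAX_LEN and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept
```

Normalizing before deduplicating means that two texts differing only in a counter or timestamp collapse into one feature.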
Structure style feature classifier: this classifier extracts and trains on features of the webpage source code structure of an input website. Its main functions include: 1) parsing the source code according to the JS script code, CSS styles, forms, and DOM structure; 2) analyzing homologous code structures in the source code and extracting common code segments; 3) building a homologous code similarity matrix.
The structure style feature classifier can pre-judge new phishing websites developed by the same behind-the-scenes organization. The common code segments include, but are not limited to: 1) identical function names; 2) similar CSS color schemes; 3) identical JS scripts; 4) identical selection lists and <form> forms; 5) identical hyperlinks and off-page redirect links.
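One way to picture a homologous code similarity matrix is with set overlap between the code segments found on each page. The text does not say which similarity measure is used, so the Jaccard index below is an assumption, as are the function names and the representation of a page as a set of segment identifiers.

```python
def jaccard(a, b) -> float:
    """Jaccard index of two collections of code segments."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrix(pages: dict) -> list:
    """Pairwise similarity between pages, in the dict's insertion order.

    `pages` maps a page identifier to the set of code segments found on it
    (function names, script hashes, form signatures, and so on).
    """
    names = list(pages)
    return [[jaccard(pages[r], pages[c]) for c in names] for r in names]
```

Pages built from the same phishing kit would share most segments and score close to 1, which is the kind of signal the pre-judgment capability relies on.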
Visual rule feature classifier: this classifier mainly performs feature extraction and learning on the distinctive visual features of the web pages of an input website. The extracted features include, but are not limited to: 1) the logo icon of the target website; 2) the overall color scheme and layout framework of the website; 3) distinctive picture modules.
The disadvantages of the visual rule feature classifier are that learning and detecting visual features takes a long time, and that pixel-level differences in the same logo can cause large errors, so the requirements on training sample quality are strict. The visual feature library must contain no fewer than 30,000 entries.
Linear addition training module: this module linearly combines and trains the heterogeneous base classifiers, namely the domain name morpheme feature classifier, the topic index feature classifier, the content similarity feature classifier, the structure style feature classifier, and the visual rule feature classifier, by learning the weighting parameters and the adjustment factor, and obtains stable values for the weighting parameters and the adjustment factor.
In the linear addition training module, linear addition training depends mainly on the quality of the training samples, and the weighting parameters and the adjustment factor are computed automatically by a built-in algorithm. Using the training set and the test set, the module trains and optimizes the parameters of the five classifiers together with the weighting parameters ω_i and the adjustment factor. During training, the sample records of the training data are input in parallel into the five base classifiers, and the output of the multi-classification model formed by combining the five classifiers is the label of the corresponding sample, in the form shown in Fig. 3.
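As a rough illustration of how a weighted linear combination of base classifier outputs yields a label, consider the sketch below. The exact formula is given as an image in the source and is not reproduced here, so the additive form (weighted sum plus adjustment factor), the 0/1 label, and the 0.5 threshold are all assumptions; the function names are hypothetical.

```python
def combine(scores, weights, adjustment=0.0) -> float:
    """Weighted sum of base classifier scores plus an adjustment term.

    Assumed form: sum_i(weights[i] * scores[i]) + adjustment.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores)) + adjustment


def label(scores, weights, adjustment=0.0, threshold=0.5) -> int:
    """Return 1 (phishing) if the combined score reaches the threshold, else 0."""
    return 1 if combine(scores, weights, adjustment) >= threshold else 0
```

With five base classifiers, `scores` would hold their five outputs for one sample record, and `weights` the learned weighting parameters.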
Integrated classifier: this is the final model output by the linear addition training module. Its functions include, but are not limited to: 1) constructing the detection model for phishing websites; 2) matching counterfeited targets with category labels; 3) dynamically maintaining the classifier weights and iteratively integrating the feature libraries. The feature libraries are those of the classifiers in use, such as the feature library of the domain name morpheme feature classifier.
In the integrated classifier, not every base classifier participates in the final detection: if a weight parameter is 0, the corresponding base classifier is not enabled. In addition, the integrated model takes classifier performance into account; for example, the visual rule feature classifier is not used in coarse classification because it is time-consuming.
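The enabling rule can be sketched as follows: a base classifier is skipped when its weight is 0, and classifiers marked as slow are skipped in a coarse pass. The `Base` record, the `slow` flag, and the additive adjustment term are illustrative assumptions rather than details from the source.

```python
from typing import Callable, NamedTuple


class Base(NamedTuple):
    """A base classifier with its learned weight (hypothetical structure)."""
    name: str
    weight: float
    score: Callable[[dict], float]
    slow: bool = False  # e.g. the visual rule feature classifier


def active(bases, coarse: bool):
    """Base classifiers that actually take part in detection."""
    return [b for b in bases if b.weight != 0 and not (coarse and b.slow)]


def ensemble_score(bases, record: dict, coarse: bool = False,
                   adjustment: float = 0.0) -> float:
    """Weighted sum over the active classifiers only; inactive ones are never called."""
    return sum(b.weight * b.score(record) for b in active(bases, coarse)) + adjustment
```

Filtering before scoring, rather than multiplying by a zero weight afterwards, avoids paying for feature extraction in classifiers that cannot affect the result.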
Training data set management module: this module stores the training data. The training data set mainly consists of data samples such as the URLs, source code, and website information of phishing websites. The functions of the module include: 1) label management and grouping of the training sample data; 2) dividing the training samples into a test set and a training set and maintaining them; 3) managing the balance of sample sampling across different groups.
The quality of the training sample data is as important to the final result as the quality of the classifiers. Sample management is therefore handled in an independent module, which focuses on managing the distribution of samples across categories so as to prevent sample imbalance.
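A simple sketch of grouping, balancing, and splitting labeled samples is given below. The text does not state how balance is achieved, so downsampling every group to the size of the smallest one is an assumption, as are the function names, the fixed random seed, and the default test ratio.

```python
import random
from collections import defaultdict


def group_by_label(samples):
    """Group (record, label) pairs by label."""
    groups = defaultdict(list)
    for record, lab in samples:
        groups[lab].append((record, lab))
    return groups


def balance(samples, seed=0):
    """Downsample every label group to the size of the smallest group."""
    groups = group_by_label(samples)
    k = min(len(g) for g in groups.values())
    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, k))
    return balanced


def split(samples, test_ratio=0.2, seed=0):
    """Shuffle and split into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]
```

Keeping these operations in one place mirrors the design choice described above: sample distribution is managed separately from classifier training.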
Detection and alarm module: this module has two functions: 1) detection of phishing websites, where the phishing website model consists of the parameters and feature libraries of the integrated classifier; 2) raising alarms for detected phishing websites, where the alarm information and level can be configured by the user.
In this module, alarm information is graded by importance according to how much attention the user pays to different counterfeited targets. For example, a page may show lottery information on its first screen but counterfeit Bank of China information on its second screen. Because the bank is of greater concern than the lottery, the page is preferentially classified and alarmed as a bank.
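The prioritization in this example can be sketched as a lookup in a user-configured priority table. The function name `pick_alarm`, the numeric levels, and the default level for unknown categories are hypothetical choices made for illustration.

```python
def pick_alarm(detected, priority, default_level=0):
    """Pick the detected category the user cares about most.

    `detected` lists the counterfeited categories found on a page;
    `priority` maps a category to a user-defined alarm level.
    Returns (category, level), or None if nothing was detected.
    """
    if not detected:
        return None
    best = max(detected, key=lambda category: priority.get(category, default_level))
    return best, priority.get(best, default_level)
```

With `priority = {"bank": 3, "lottery": 1}`, a page on which both categories are detected is reported as a bank alarm, matching the example above.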
The 5 base classifiers in the system are independent of one another and uncorrelated. The linear addition training module is the core of the system: the training of all parameters is completed in this module, and the linear model ensures the performance of the system and the convergence of the computation. The integrated classifier determines which base classifiers in the base classifier module are combined according to the parameters output by linear addition; the integration may involve only 2 or 3 classifiers rather than all 5.
Fig. 4 shows a deployment diagram of the system of the present invention. The five base classifiers form a base learner server group, and the linear addition training module, the integrated classifier, the training data set management module, and the detection and alarm module are deployed in a distributed manner at the networking switch.

Claims (6)

1. A phishing website detection method based on an adaptive heterogeneous multi-classification model is characterized by comprising the following steps:
step 1, performing cross-validation on a phishing website sample set D of the same category using the leave-one-out method to divide it into training sets and test sets; the jth training set is denoted D_j, and the corresponding jth test set is D/D_j, that is, the set D with D_j removed; each sample contains a sample record and a sample label; the sample record comprises the URL and webpage information of a website, and the sample label marks whether the website is a phishing website; j is a positive integer;
step 2, constructing an adaptive heterogeneous multi-classification model H through linear addition, as follows:
[formula image not reproduced: H(x) is a weighted linear combination of the base classification algorithms h_1(x) to h_T(x) with weights ω_i, together with an adjustment factor]
wherein T is the number of base classification algorithms, h_i is the ith base classification algorithm, ω_i is the weighting parameter of the ith base classification algorithm, the model further includes an adjustment factor, and x represents the sample record;
the base classification algorithms include: h_1, the domain name morpheme feature classification algorithm, which judges whether a website is a phishing website according to its domain name morphemes; h_2, the topic index feature classification algorithm, which judges whether a website is a phishing website according to the content under the topic tags in the webpage; h_3, the content similarity feature classification algorithm, which compares the similarity of the content under the content tags in the webpage to judge whether a website is a phishing website; h_4, the structure style feature classification algorithm, which judges whether a website is a phishing website according to the structure of the source code; h_5, the visual rule feature classification algorithm, which judges whether a website is a phishing website according to the icons, color scheme, and pictures of the webpage;
the base classification algorithms are characterized as linear functions, and the parameters of each classification algorithm are independent and identically distributed;
step 3, the input of the multi-classification model H is the input of each base classification algorithm, and its output is a sample label; for the training set D_j, the features corresponding to each base classification algorithm are extracted from the sample record of each sample as input;
step 4, based on the training set D_j, using a machine learning algorithm to train and solve the parameters of each base classification algorithm and the parameters ω_i and the adjustment factor in the multi-classification model H;
solving the parameters of each base classification algorithm by maximum likelihood estimation, and iteratively solving the parameters ω_i and the adjustment factor in the multi-classification model H by the expectation-maximization algorithm;
step 5, testing and optimizing the multi-classification model H on the jth test set until the parameters of each base classification algorithm and the parameters ω_i and the adjustment factor in the multi-classification model H converge, whereupon the machine learning algorithm of the multi-classification model H ends;
step 6, obtaining a detection model H' of the phishing website from the finally obtained parameters of each base classification algorithm and the parameters ω_i and the adjustment factor in the multi-classification model H;
step 7, acquiring the record of a website to be detected, including the URL and webpage information of the website, and inputting it into the detection model H' to judge whether the website is a phishing website.
2. The method of claim 1, wherein the size of the sample set D cannot be below 100.
3. The method according to claim 1 or 2, wherein in step 1, the training set and the test set are represented as follows:
jth training set D_j = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, 1 ≤ j ≤ n, 1 < m < n;
corresponding jth test set D/D_j;
wherein n is the number of samples in D, m is the number of samples in D_j, and D/D_j denotes the set D with D_j removed; the ith sample (x_i, y_i) consists of the record x_i and the label y_i of the ith sample.
4. The method of claim 1, wherein in step 5, when the parameters ω_i and the adjustment factor in the multi-classification model H cannot converge, the sample labels are corrected, the training set samples are updated, and the training process of step 4 is re-executed.
5. A phishing website detection system based on an adaptive heterogeneous multi-classification model, characterized by comprising a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structure style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training data set management module, and a detection and alarm module;
the domain name morpheme feature classifier is used for extracting and training the features of the domain name character string of the input website URL;
the topic index feature classifier is used for extracting and training the features of the contents of the <title> and <meta> tags and the page footer of an input website;
the content similarity feature classifier is used for performing feature extraction and semantic abstraction on the information in the webpage content of an input website, and training on the features;
the structure style feature classifier is used for extracting and training the features of the webpage source code structure of the input website;
the visual rule feature classifier extracts and learns the distinctive visual features of the web pages of an input website;
the linear addition training module linearly combines the domain name morpheme feature classifier, the topic index feature classifier, the content similarity feature classifier, the structure style feature classifier and the visual rule feature classifier, and the combination form is as follows:
[formula image not reproduced: H(x) is a weighted linear combination of the classification functions h_1(x) to h_5(x) with weights ω_i, together with an adjustment factor]
wherein H(x) represents the multi-classification model combining the five classifiers, h_1 to h_5 are the classification functions corresponding to the five classifiers respectively, ω_i is the weight parameter of the ith classifier, the model further includes an adjustment factor, and x represents the sample record; the classification functions h_1 to h_5 are characterized as linear functions, and the parameters of each classification function are independent and identically distributed;
the linear addition training module uses the training set and the test set to train and optimize the parameters of the five classifiers together with ω_i and the adjustment factor; the parameters of each base classification algorithm are solved by maximum likelihood estimation, and the parameters ω_i and the adjustment factor in the multi-classification model H are solved iteratively by the expectation-maximization algorithm;
the integrated classifier is a final model output by the linear addition training module, a detection model of the phishing website is constructed, and the weight sum of each classifier is dynamically maintained;
the training data set management module stores a training data set, performs label management and grouping on training samples, performs division and maintenance on the training set and a test set, and manages the sample sampling balance in different groups;
the detection and alarm module detects the website to be detected according to the constructed detection model of the phishing website, and alarms when the phishing website is detected.
6. The system of claim 5, wherein the domain name morpheme feature classifier performs functions comprising: judging the suspicious degree of the top-level domain name; extracting morpheme information contained in the secondary domain name; acquiring the hierarchical structure of the domain name and the length of the sub-domain name; constructing and perfecting a morpheme feature library;
when the top-level domain is pw, win, top or xyz, the degree of suspicion of a phishing website is high; morpheme information in the second-level domain refers to the short names of banks contained in the string that forms the second-level domain; the domain name morpheme feature classifier also extracts hyphenated short forms of bank websites in the third-level or fourth-level domain.
CN201810549417.1A 2018-05-31 2018-05-31 Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model Active CN108965245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810549417.1A CN108965245B (en) 2018-05-31 2018-05-31 Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model

Publications (2)

Publication Number Publication Date
CN108965245A CN108965245A (en) 2018-12-07
CN108965245B true CN108965245B (en) 2021-04-13

Family

ID=64493105

Country Status (1)

Country Link
CN (1) CN108965245B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant