CN108965245B

CN108965245B - Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model

Info

Publication number: CN108965245B
Application number: CN201810549417.1A
Authority: CN
Inventors: 臧天宁; 强倩; 杜飞; 周渊
Original assignee: Beijing Ruichi Xinan Technology Co ltd; National Computer Network and Information Security Management Center
Current assignee: Beijing Ruichi Xinan Technology Co ltd; National Computer Network and Information Security Management Center
Priority date: 2018-05-31
Filing date: 2018-05-31
Publication date: 2021-04-13
Anticipated expiration: 2038-05-31
Also published as: CN108965245A

Abstract

The invention provides a phishing website detection method and system based on a self-adaptive heterogeneous multi-classification model. The method comprises the steps of constructing a self-adaptive heterogeneous multi-classification model by linear addition on multiple base classification algorithms, training the multi-classification model, wherein the input of the model is the input of each base classification algorithm, the output of the model is a sample label, and each base classification algorithm extracts corresponding characteristics from a sample record as input; and solving the model parameters by adopting a machine learning algorithm, and testing and optimizing by using the test set to finally obtain the detection model of the phishing website. The system comprises a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structure style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training data set management module and a detection and alarm module. The invention realizes real-time detection of the phishing website and improves the accuracy and stability of the phishing website detection.

Description

Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model

Technical Field

The invention relates to the field of computer network security, in particular to a phishing website detection method and system based on a self-adaptive heterogeneous multi-classification model.

Background

With the rapid development of Internet technology, network security problems keep emerging. Phishing is a typical form of online fraud: using the Internet as a carrier, it deceives users into giving up sensitive information by masquerading as a reputable legitimate website. Deceived users suffer personal information leakage to varying degrees, which leads to economic loss. How to detect phishing websites quickly and accurately has become a research hotspot in Web (World Wide Web) information security. The phishing website detection technologies disclosed so far mainly fall into the following categories:

(1) Detection technology based on a black and white list mechanism: as a practical core technology, black and white lists are efficient and accurate. A phishing website can be located quickly by checking its domain name, which makes this one of the most common implementation technologies [1].

(2) Detection technology based on visual similarity: Cao Jieshen et al. propose a webpage similarity determination algorithm based on nested EMD (Earth Mover's Distance), which segments the Web image and constructs an ARG (Attributed Relational Graph) of the webpage from the features of the segmented sub-images. After the distances between the attributes of different ARGs are calculated, the similarity between the phishing webpage and the webpage of the protected website is obtained by the nested EMD method, thereby achieving high-precision detection of phishing websites [2].

Another approach distinguishes phishing websites by visual similarity based on the EMD algorithm, using the result of comparing pixel-level similarity between webpages as the basis for identifying a phishing website [3].

(3) Detection technology based on the Bayesian algorithm: Jinqing et al. formulated a series of rules for matching phishing websites using a rule-based Bayesian algorithm. A weight is then assigned to each rule and a correction coefficient is calculated for it, which yields the probability that the tested website is a phishing website. A probability threshold is then used to decide whether the website is a phishing website [4][5].

Another work constructs a system capable of intelligently detecting phishing attacks by classifying and integrating eight features of a website, such as the content of the title tag, keyword information, page description information, picture links and website copyright information, using an extended Bayesian algorithm and an improved support vector machine [6].

(4) Detection technology based on document structure: Guo Mingzhi et al. analyzed Web page document objects and extracted the phishing-sensitive features in the document object model that phishers often exploit, in order to determine whether a website is a phishing website. The algorithm effectively filters out phishing pages and helps to prevent phishing attacks [7].

(5) Phishing website detection technology based on deep learning: a DBN-KNN model with a multilayer structure has been proposed in combination with deep learning technology and applied to the feature recognition of phishing websites in order to identify them [8].

(6) Other types of detection techniques: Huang H.J. et al. propose phishing active defense based on semi-fragile watermarks [9] and a phishing URL detection algorithm based on abnormal features [10]; Zhangjia Yiet et al. propose a phishing detection algorithm based on text semantic understanding [11]; other defensive measures include a study of cross-site phishing attacks against Web-mail mailboxes, cloud-computing-based URL filtering [12], SVM learning algorithms [13], and the like.

Among the above technologies, detection based on black and white lists has poor timeliness and limited list coverage; detection based on visual similarity involves complex algorithms and long detection times, and is not suitable for online real-time detection of massive numbers of URLs (Uniform Resource Locators); detection based on the Bayesian algorithm has unsatisfactory robustness and generalization performance; detection based on document structure suffers from incomplete feature coverage and many false negatives; detection based on deep learning has advantages in feature recognition, but the features are unstable and the method is easily disturbed by sample pollution.

Reference documents:

[1]Huang C.,Ma S,Chen K.,Using One-Time Passwords to Prevent Password Phishing Attacks[J].Journal of Network and Computer Applications.2011,34(4):1292-1301.

[2] cao European New, Roman, etc. fishing webpage detection algorithm [ J ] based on nested EMD, computer science, 2009,32(5): 922-.

[3] Plum blossom, Liu Dong, Phishing detection method based on visual similarity [ J ]. school newspaper of Qinghua university (Nature science edition), 2009,49(1): 146-.

[4]Zhang H.,Liu G.,Chow T.W.S..et al.Textual and Visual Content-Based Anti-Phishing:A Baysian Approach[J].IEEE Transactions on Neural Networks.2011,22(10):1532-1546.

[5] Jinqing, wu nationality new, litdan, etc. filtering of internet fishing is achieved with rule-based bayes algorithm [ R ]. hunan: the fifth academic conference on information and communication technology security, 2007,372-378.

[6] Phishing website intelligent detection system based on classification integration [J]. Systems Engineering - Theory & Practice, 2011, 31(10): 2008-2020.

[7] Guo Mingzhi, Yuanjinsheng, Wang Yachao, et al.

[8] Research on phishing website detection technology based on deep learning [ D ] university of electronic technology, 2017.

[9] Invar army, liu jia, liu liang.study of cross-site phishing attacks against the Web-mail na box [ J ] communication technology, 2010,43(8): 164-.

[10]Huang H.J.,Wang Y.J.,Xie L.L..et al.An Active Anti-Phishing Solution Based on Semi-fragile Watermark[J].Information Technology Journal.2013,12(1):198-203.

[11]Huang H.J.,Qian L.,Wang Y.J..A SVM-Based Technique to Detect Phishing URLs[J].Information Technology Journal.2012,11(7):921-925.

[12]Zhang H.,Liu G.,Chow T.W.S..et al.Textual and Visual Content-Based Anti-Phishing:A Baysian Approach[J].IEEE Transactions on Neural Networks.2011,22(10):1532-1546.

[13]Sheng S.,Wardman B.,Warner G..et al.An Empirical Analysis of Phishing Blacklists[C].In:Proc.of the sixth Conference on Email and Anti-Spam.2009:1-10.

Disclosure of Invention

Aiming at the problems of the existing method, the invention discloses a phishing website detection method and system based on a self-adaptive heterogeneous multi-classification model, which are used for detecting phishing websites in real time and have higher accuracy and stability.

The invention discloses a phishing website detection method based on a self-adaptive heterogeneous multi-classification model (AHMC), which comprises the steps of learning the self-adaptive heterogeneous multi-classification model and detecting the phishing website, and specifically comprises the following steps:

Step 1: For a sample set D of phishing websites of a given category, with $|D| = n$, perform cross-validation by the leave-one-out method to divide it into training sets and test sets. Let the j-th training set be denoted $D_j$; the corresponding j-th test set is then $D \setminus D_j$, where j is a positive integer. Each sample contains a sample record and a sample label; the sample record contains the URL and webpage information of the website, and the sample label marks whether the website is a phishing website.

Step 2: Construct an adaptive heterogeneous multi-classification model H by linear addition, that is, as a linear weighted combination of T base classification algorithms together with an adjustment factor. Here T is the number of base classification algorithms, $h_i$ is the i-th base classification algorithm, $\omega_i$ is the weight parameter of the i-th base classification algorithm, and x represents the sample record.

Step 3: The input of the multi-classification model H is the input of each base classification algorithm, and its output is a sample label. For training set $D_j$, the features required by each base classification algorithm are extracted from the sample record of each sample as input. Each base classification algorithm is represented by a linear function, and the parameters of the classification algorithms are independent and identically distributed.

Step 4: Based on training set $D_j$, use a machine learning algorithm to train and solve for the parameters of each base classification algorithm together with the weight parameters $\omega_i$ and the adjustment factor of the multi-classification model H. During training, the corresponding features are extracted from the sample record as input for each base classification algorithm, and priority is given to ensuring that the output of the multi-classification model is the sample label.

Step 5: On the test set $D \setminus D_j$, test and optimize the multi-classification model H until the parameters of each base classification algorithm and the weight parameters $\omega_i$ and adjustment factor of H converge, at which point the machine learning algorithm for the multi-classification model H ends.

Step 6: From the finally obtained parameters of each base classification algorithm and the weight parameters $\omega_i$ and adjustment factor of the multi-classification model H, obtain the detection model $H'$ of the phishing website.

Step 7: Acquire the record of the website to be detected, including its URL (uniform resource locator) and webpage information, and input it into the detection model $H'$ to judge whether the website is a phishing website.

The invention also discloses a phishing website detection system based on the self-adaptive heterogeneous multi-classification model, which comprises a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structure style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training data set management module and a detection and alarm module. The functions of each module in the operation of the system are as follows:

the domain name morpheme feature classifier is used for extracting and training the features of the domain name character string of the input website URL;

the topic index feature classifier is used for extracting and training the features of the contents of the webpage tags <title> and <meta> and of the page footer of an input website;

the content similarity feature classifier is used for extracting features from, and semantically abstracting, the information in the webpage content of an input website, and for training on those features;

the structure style feature classifier is used for extracting and training the features of the webpage source code structure of the input website;

the visual rule feature classifier extracts and learns the symbolic visual features of the webpages of an input website;

the linear addition training module linearly combines the domain name morpheme feature classifier, the topic index feature classifier, the content similarity feature classifier, the structure style feature classifier and the visual rule feature classifier into a multi-classification model $H(x)$, which is a linear weighted combination of the classification functions $h_1$ to $h_5$ of the five classifiers together with an adjustment factor, where $\omega_i$ is the weight parameter of the i-th classifier and x represents the sample record;

the linear addition training module uses the training set and the test set to train and optimize the parameters of the five classifiers together with the weight parameters $\omega_i$ and the adjustment factor;

the integrated classifier is a final model output by the linear addition training module, a detection model of the phishing website is constructed, and the weight of each classifier is dynamically maintained;

the training data set management module stores a training data set, performs label management and grouping on training samples, performs division and maintenance on the training set and a test set, and manages the sample sampling balance in different groups;

the detection and alarm module detects the website to be detected according to the constructed detection model of the phishing website, and alarms when the phishing website is detected.

Compared with the prior art, the invention has the following obvious advantages:

(1) The method and the system adopt ensemble learning to linearly combine a plurality of weak classifiers, obtaining generalization performance clearly superior to that of a single classifier and improving the accuracy and stability of phishing website detection.

(2) The method and the system of the invention adopt weight adaptive learning, learn the weight through the sample, do not depend on prior experience and the suggestion of field experts, and automatically update the weight in the learning process of training under the condition that the sample characteristics and distribution change.

(3) The five weak classifiers in the system are heterogeneous, each single classifier has certain accuracy, and strong dependency relationship does not exist among the classifiers, so that the integrated model is ensured to have higher accuracy. The overall cost of five weak classifiers is no greater than that of a single complex learning model.

(4) The method can detect the phishing website in real time, and has higher accuracy, recall rate and robustness.

(5) The system can process high-performance real-time detection of mass URLs, can be applied to an online engineering system, and has high availability and stability.

Drawings

FIG. 1 is a flow chart of the phishing website detection method based on the adaptive heterogeneous multi-classification model of the invention;

FIG. 2 is a block diagram of the phishing website detection system based on the adaptive heterogeneous multi-classification model according to the present invention;

FIG. 3 is a schematic diagram of a linear addition training module for data training in the phishing website detection system of the present invention;

FIG. 4 is a deployment diagram of the phishing website detection system based on the adaptive heterogeneous multi-classification model of the invention.

Detailed Description

The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples. The examples are given solely for the purpose of illustration and are not intended to limit the scope of the invention.

As shown in FIG. 1, the invention provides a phishing website detection method based on an adaptive heterogeneous multi-classification model (AHMC), which comprises learning of the adaptive heterogeneous multi-classification model and detection of phishing websites, and the following specific implementation steps are described.

Step 1: Select phishing websites of the same category, for example phishing websites that counterfeit the same type of bank, as the sample set D, with $|D| = n$, where n is the number of samples in D. Cross-validation by the leave-one-out method is used to divide the samples into training sets and test sets.

The j-th training sample set is: $D_j = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ $(1 \le j \le n,\ 1 < m < n)$;

The corresponding j-th test sample set is: $D \setminus D_j$;

where each sample comprises a record x and a label y. For example, $(x_1, y_1)$ in the sample set represents a specific sample instance, where $x_1$ is the record of the sample and $y_1$ is its label. The record of a sample includes the URL of the website and the corresponding webpage information, and the sample label marks whether the website is a phishing website. $D \setminus D_j$ denotes the set D with $D_j$ removed.

In this step, the size n of the sample should be as large as possible, and it is recommended that it not be lower than 100.
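The split described in step 1 can be illustrated with a minimal Python sketch. It assumes the strict reading of leave-one-out, in which each test set holds exactly one sample; the function name `leave_one_out_splits` and the toy sample set are hypothetical and are used only for illustration.

```python
from typing import Iterator, List, Tuple

Sample = Tuple[str, int]  # (record, label); label 1 = phishing, 0 = legitimate


def leave_one_out_splits(samples: List[Sample]) -> Iterator[Tuple[List[Sample], List[Sample]]]:
    """Yield (training_set, test_set) pairs, one pair per held-out sample."""
    for j in range(len(samples)):
        test_set = [samples[j]]                        # plays the role of D \ D_j
        training_set = samples[:j] + samples[j + 1:]   # plays the role of D_j
        yield training_set, test_set


# Hypothetical toy sample set D with n = 4 (the text recommends n >= 100)
D = [("http://a.example", 0), ("http://b.example", 1),
     ("http://c.example", 0), ("http://d.example", 1)]
splits = list(leave_one_out_splits(D))
```

Each of the n splits then feeds steps 3 to 5 independently.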

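Step 2: Construct the adaptive heterogeneous multi-classification model H as a linear weighted combination of the base classification algorithms, with a weight parameter $\omega_i$ for the i-th base classification algorithm and an adjustment factor.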

In the following, a base classification algorithm is also called a classification algorithm or a learning algorithm, and the corresponding classifier is also called a learner. The base classification algorithms are heterogeneous, which ensures the diversity of the algorithms. Five fixed classification algorithms are used in the embodiment of the invention: $h_1$, the domain name morpheme feature classification algorithm, judges whether a website is a phishing website from the morphemes of its domain name; $h_2$, the topic index feature classification algorithm, judges from the content under the topic tags of the webpage; $h_3$, the content similarity feature classification algorithm, judges by comparing the similarity of the content under the content tags of the webpage; $h_4$, the structure style feature classification algorithm, judges from the structure of the source code; $h_5$, the visual rule feature classification algorithm, judges from the icons, color scheme, pictures and the like of the webpage. In practice, the set can be extended as needed in accordance with the heterogeneity principle.
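As an illustration of the linear combination in step 2, the following Python sketch builds a combined scoring function from a list of base classifiers. It rests on two assumptions that the text does not state: that the adjustment factor enters additively, and that each base classifier returns a score in [-1, 1]. The helper `combine` and the two stub classifiers are hypothetical.

```python
from typing import Callable, Sequence

BaseClassifier = Callable[[dict], float]  # maps a sample record to a score


def combine(classifiers: Sequence[BaseClassifier],
            weights: Sequence[float],
            adjustment: float = 0.0) -> Callable[[dict], float]:
    """Return H(x) = sum_i w_i * h_i(x) + adjustment (additive form is an assumption)."""
    if len(classifiers) != len(weights):
        raise ValueError("one weight per base classifier is required")

    def H(record: dict) -> float:
        return sum(w * h(record) for w, h in zip(weights, classifiers)) + adjustment

    return H


# Hypothetical stand-ins for two of the five heterogeneous base classifiers
def h_domain(record: dict) -> float:
    return 1.0 if record["url"].endswith(".xyz") else -1.0


def h_topic(record: dict) -> float:
    return 1.0 if "bank" in record["title"].lower() else -1.0


H = combine([h_domain, h_topic], weights=[0.6, 0.4])
score = H({"url": "http://icbc-login.xyz", "title": "Bank login"})
```

Because every base classifier receives the whole record and selects its own features, a further heterogeneous classifier can be added by appending one function and one weight.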

Step 3: The sample record of each sample in training set $D_j$ is used as the input of the base classification algorithms in the multi-classification model H, and the sample label is used as the output. That is, for a sample $(x_i, y_i)$, $x_i$ is the input of classification algorithms $h_1$ to $h_5$: each base classification algorithm extracts the features it needs from $x_i$, takes $y_i$ as the corresponding output, and trains its features and parameters as follows:

$h_1(x_i) \to y_i,\ h_2(x_i) \to y_i,\ h_3(x_i) \to y_i,\ h_4(x_i) \to y_i,\ h_5(x_i) \to y_i$

In the embodiment of the present invention, the multi-classification model H is a linear weighted combination of the classification functions $h_1$ to $h_5$: when the input is a sample record $x_i$, the corresponding sample label $y_i$ is output. Training the multi-classification model yields the weight parameters $\omega_i$ and the adjustment factor.

During training, the corresponding features are extracted from the input sample record $x_i$ as the input of each classification algorithm $h_1$ to $h_5$. Priority is given to ensuring that the output of the multi-classification model is the sample label $y_i$; the output of each classification algorithm can also be set to $y_i$. The parameters of each classification algorithm and the weight parameters and adjustment factor of the multi-classification model are then trained. Each base classification algorithm is represented by a linear function, and the parameters of the classification algorithms are independent and identically distributed, so the cost of the whole training is no higher than that of a single complex learning model. Different base classification algorithms may take different input features, so suitable features must be selected from the sample record for each of them. For example, the input features of the domain name morpheme feature classification algorithm $h_1$ include the top-level domain name, the second-level domain name, and so on.

Step 4: A machine learning algorithm is used to train and solve for the parameters of each classification algorithm, the weight parameters and the adjustment factor. For example, the parameters of each classification algorithm may be solved by maximum likelihood estimation, and the parameters of the integrated model H may be solved iteratively by the EM (expectation-maximization) algorithm. The constraint conditions may be formalized as the minimization of a loss function, so that the whole solution process can be carried out within one unified computation framework, namely parameter solving by maximum likelihood estimation, with the computation performed uniformly as matrix operations when executed on a computer.
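The idea of solving all parameters in one pass, rather than in two stages, can be sketched as follows. This is not the EM procedure mentioned above: it substitutes plain stochastic gradient descent on a logistic loss, assumes an additive adjustment factor, and uses labels of +1 and -1. All names and the toy data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_jointly(features, labels, epochs=200, lr=0.1):
    """features[k][i]: feature vector of sample k for base classifier i.
    labels[k]: +1 for phishing, -1 for legitimate."""
    T = len(features[0])
    base = [[0.1] * len(features[0][i]) for i in range(T)]  # linear base classifiers
    weights = [1.0 / T] * T                                   # weight parameters
    adjustment = 0.0                                          # assumed additive factor
    for _ in range(epochs):
        for f, y in zip(features, labels):
            h = [dot(base[i], f[i]) for i in range(T)]
            score = dot(weights, h) + adjustment
            # Gradient of log(1 + exp(-y * score)) with respect to score
            g = -y / (1.0 + math.exp(min(y * score, 50.0)))
            for i in range(T):
                w_old = weights[i]
                weights[i] -= lr * g * h[i]
                base[i] = [p - lr * g * w_old * x for p, x in zip(base[i], f[i])]
            adjustment -= lr * g
    return base, weights, adjustment


def predict(model, f):
    base, weights, adjustment = model
    score = sum(w * dot(p, x) for w, p, x in zip(weights, base, f)) + adjustment
    return 1 if score >= 0 else -1


# Hypothetical toy data: two base classifiers with 2 and 1 features respectively
X = [[[1.0, 0.5], [1.0]], [[0.8, 0.0], [0.9]],
     [[-1.0, 0.2], [-0.7]], [[-0.9, 0.0], [-1.0]]]
y = [1, 1, -1, -1]
model = train_jointly(X, y)
predictions = [predict(model, f) for f in X]
```

Every update touches the base classifier parameters, the weights and the adjustment factor together, which mirrors the unified framework described in the text.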

Step 5: The model H is tested and optimized on the test sample set $D \setminus D_j$. All test and training samples are polled until the parameters and the adjustment factor converge to a stable threshold, at which point the learning algorithm of the model ends.

This step serves both testing and optimization. When a conflict occurs on a test sample, or the adjustment factor fails to converge, the samples are corrected: the training samples are reclassified and handled individually, the sample labels are corrected and the training set is updated. The model H is then retrained to adjust the parameters of the classification algorithms, that is, the training process of step 4 is executed again, thereby optimizing the parameters and the adjustment factor.

The method of the invention uses the leave-one-out method to obtain training sets and test sets. With K groups of training and test sets in total, steps 3 to 5 are executed on each group, which finally yields multiple groups of base classification algorithm parameters and adjustment factors. The resulting classification algorithm parameters and adjustment factors may then be combined and averaged to give the final result.
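One possible reading of this combination step is a plain arithmetic mean over the K runs, sketched below. The text does not specify the averaging scheme, so the unweighted mean, the dictionary layout and the function name `average_parameters` are assumptions.

```python
from statistics import mean


def average_parameters(runs):
    """runs: one dict per fold with 'weights' (list of floats) and 'adjustment' (float)."""
    T = len(runs[0]["weights"])
    return {
        "weights": [mean(r["weights"][i] for r in runs) for i in range(T)],
        "adjustment": mean(r["adjustment"] for r in runs),
    }


# Hypothetical results from K = 2 folds with two base classifiers
runs = [{"weights": [0.2, 0.8], "adjustment": 0.1},
        {"weights": [0.4, 0.6], "adjustment": 0.3}]
final = average_parameters(runs)
```

The parameters of the base classifiers themselves could be averaged in the same way, position by position.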

Step 6: From the parameters of each classification algorithm, the weight parameters $\omega_i$ and the adjustment factor obtained in step 5, the adaptive heterogeneous multi-classification model $H'$ corresponding to this category of phishing websites is obtained.

The model H obtained by the optimization in step 5 serves as a model instance; its parameters are migrated to initialize the phishing website detection algorithm $H'$. The models $H'$ and H are isomorphic, and in the embodiment of the invention $H'$ is a mixed model integrating $h_1$ to $h_5$.

Step 7: Acquire the record of the website to be detected, including the website URL, the webpage source code and other webpage information, and input it into the detection model $H'$ to determine whether the URL is a phishing website, which object it counterfeits, and other information. The input webpage information does not need to be formatted; the features used by each classifier are obtained automatically from the webpage source code structure.

In this step, the website information and source code data corresponding to the URL can be obtained with crawler technology. When new features and variants appear, only the corresponding base classification algorithm and its features need to be updated; the influence on the weight parameters and the adjustment factor is small.

The invention adopts the thought of ensemble learning, and the difference from the existing classical ensemble learning is mainly reflected in that: the classical ensemble learning includes two stages, the first stage is to train the base classifiers first, and the second stage is to train the parameters after the combination of the base classifiers by using the output of the first stage as the input. The invention adopts a unified computing frame to train together, and does not carry out two-stage division.

The invention discloses a phishing website detection system based on a self-adaptive heterogeneous multi-classification model, which mainly comprises 9 parts, namely a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structure style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training data set management module and a detection and alarm module. The functions of the modules in the operation of the system are described below, as shown in FIG. 2.

A domain name morpheme feature classifier: the features of the classifier are mainly from the statistical features of the domain name part in the URL character string of the phishing website. The domain name morpheme feature classifier performs feature extraction and training on a domain name character string of an input website URL, and the realized functions include but are not limited to: 1) judging the suspicious degree of the top-level domain name; 2) extracting morpheme information contained in the secondary domain name; 3) acquiring the hierarchical structure of the domain name and the length of the sub-domain name; 4) and constructing and perfecting a morpheme feature library.

In the domain name morpheme feature classifier, 1) the suspicious degree of the top-level domain name comes from statistical experience, and the probability that the top-level domain names such as pw, win, top, xyz and the like appear in phishing websites is usually higher; 2) morpheme information in the secondary domain name refers to short names of banks, such as 95588, 95533, cmb, icbc, boc and the like, contained in character strings forming the secondary domain name; 3) short words of the bank website composed of hyphens such as www-bankofbeijing-com-cn can be contained in the third-level or fourth-level domain names.
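The three kinds of domain name features described above can be sketched with the Python standard library. The TLD and morpheme lists are the examples named in the text; the function name `domain_features` and the returned field names are hypothetical, and the split into top-level and second-level labels is a simplification that ignores multi-label suffixes such as com.cn.

```python
from urllib.parse import urlsplit

SUSPICIOUS_TLDS = {"pw", "win", "top", "xyz"}               # examples from the text
BANK_MORPHEMES = ("95588", "95533", "cmb", "icbc", "boc")   # examples from the text


def domain_features(url: str) -> dict:
    """Extract simple morpheme and hierarchy features from the host part of a URL."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    second = labels[-2] if len(labels) >= 2 else ""
    subdomains = labels[:-2]
    return {
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
        "morphemes": [m for m in BANK_MORPHEMES if m in second],
        "depth": len(labels),
        "subdomain_lengths": [len(s) for s in subdomains],
        "hyphenated_subdomain": any("-" in s for s in subdomains),
    }


features = domain_features("http://www-bankofbeijing-com-cn.icbc-login.xyz/index.html")
```

In a trained classifier these raw features would be turned into numeric inputs; the sketch stops at extraction.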

The topic index feature classifier mainly performs feature extraction and training on the contents of the webpage tags <title> and <meta> and of the page footer of an input website. The functions realized by the topic index feature classifier include but are not limited to: 1) extracting the features in the tags, and performing conflict resolution and type classification of the features; 2) constructing and refining a topic index feature library. The advantage of the topic index feature classifier is fast and accurate positioning; its disadvantages are weak generalization ability and a high false-positive rate, because the content of the <title> tag may be hard to distinguish from that of normal, non-counterfeit websites, or may have no association with the body content of the webpage. Therefore, this classifier needs to be used together with a whitelist library for classification.
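Extraction of the <title> and <meta> content can be sketched with Python's built-in `html.parser` module. Footer extraction and the whitelist lookup are omitted; the class name `TopicIndexParser` and the helper `topic_features` are hypothetical.

```python
from html.parser import HTMLParser


class TopicIndexParser(HTMLParser):
    """Collect the <title> text and named <meta> contents of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            if a.get("name") and a.get("content"):
                self.meta[a["name"].lower()] = a["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def topic_features(html: str) -> dict:
    parser = TopicIndexParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "meta": parser.meta}


page = ('<html><head><title> ICBC Online Banking </title>'
        '<meta name="Keywords" content="bank,login"></head><body></body></html>')
topic = topic_features(page)
```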

Content similarity feature classifier: this classifier mainly performs feature extraction and semantic abstraction on the short text information in the webpage content. Its functions include but are not limited to: 1) extracting the text in the <body> tag, where the content extracted from tags such as <a>, <p>, <div>, <span>, <td>, <table> and <form> may not exceed 15 characters, and the content of the text is extracted in segments of 2-8 characters; 2) vectorizing and normalizing the text features; 3) word embedding, mapping the word vectors to low-dimensional feature vectors using the Word2Vec tool; 4) constructing a word feature vector library.

The content similarity feature classifier has stable detection effect, and the indexes of accuracy and recall rate are better than those of other classifiers. Wherein the vectorization is the de-duplication and filtering of the short text; the normalization is to delete specific time words, frequently-changed numbers, interference words with too high occurrence frequency, advertisements without distinction, link words of a third party and the like.
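The de-duplication, filtering and comparison of short texts can be sketched as below. The sketch substitutes a bag-of-words cosine similarity for the Word2Vec embedding named in the text, and treats "frequently changed numbers" simply as any run of digits; both are simplifications, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def normalize(snippets, stop_words=frozenset()):
    """De-duplicate short snippets, strip digits and drop caller-supplied interference words."""
    seen, out = set(), []
    for s in snippets:
        s = re.sub(r"\d+", "", s).strip().lower()
        if s and s not in stop_words and s not in seen:
            seen.add(s)
            out.append(s)
    return out


def cosine(a, b):
    """Cosine similarity between two lists of tokens, using raw counts."""
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    den = (math.sqrt(sum(v * v for v in ca.values()))
           * math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0


suspect = normalize(["Login", "login", "2018 News", "Ads"], stop_words={"ads"})
protected = normalize(["Login", "News", "Transfer"])
similarity = cosine(suspect, protected)
```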

The structure style feature classifier is used for extracting and training the features of the webpage source code structure of the input website. The main functions of the structure style feature classifier include: 1) analyzing the source code in terms of its JS script code, CSS styles, form elements and DOM structure; 2) analyzing homologous code structures in the source code and extracting common code segments; 3) constructing a homologous code similarity matrix.

The structure style feature classifier is able to anticipate new phishing websites developed by the same behind-the-scenes organization. The common code segments include but are not limited to: 1) the same function names; 2) compatible CSS color schemes; 3) the same JS scripts; 4) the same selection lists and <form> forms; 5) the same hyperlinks, off-page jump links, and the like.
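A homologous code similarity matrix can be sketched by collecting a few common code markers from each page and comparing the resulting sets. The use of Jaccard similarity, the two regular expressions and all function names are assumptions made for illustration; the text does not say how the matrix is computed.

```python
import re


def code_fingerprints(source: str) -> set:
    """Collect JS function names and form action targets as crude shared-code markers."""
    names = re.findall(r"function\s+([A-Za-z_$][\w$]*)", source)
    actions = re.findall(r"<form[^>]*\baction=[\"']([^\"']+)[\"']", source, flags=re.I)
    return {("fn", n) for n in names} | {("form", a) for a in actions}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0


def similarity_matrix(sources):
    prints = [code_fingerprints(s) for s in sources]
    return [[jaccard(p, q) for q in prints] for p in prints]


# Hypothetical page sources
pages = [
    '<script>function checkLogin(){} function sendPwd(){}</script><form action="post.php">',
    '<script>function checkLogin(){}</script><form action="post.php">',
    '<script>function render(){}</script>',
]
matrix = similarity_matrix(pages)
```

A high off-diagonal entry suggests that two pages share a code base and may come from the same organization.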

The visual rule feature classifier mainly performs feature extraction and learning on the symbolic visual features of the webpages of the input website. The extracted features include but are not limited to: 1) the logo icon of the target website; 2) the overall color scheme and layout framework of the website; 3) symbolic picture modules, and the like.

The disadvantages of the visual rule feature classifier are that learning and detecting visual features takes a long time, and that pixel-level differences between instances of the same logo can cause large errors, so the requirements on training sample quality are strict. The visual feature library should contain no fewer than 30,000 entries.

The linear addition training module performs linear combination and training on heterogeneous base classifiers, namely a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structure style feature classifier and a visual rule feature classifier through the learning of the weighting parameters and the adjustment factors to obtain stable weighting parameters and adjustment factors.

In the linear addition training module, the training of the linear addition depends mainly on the quality of the training samples, and the weighting parameters and the adjustment factors are calculated automatically by a built-in algorithm. The linear addition training module uses the training set and the test set to train and optimize the parameters of the five classifiers together with the weighting parameters ω_i and the adjustment factor. During training, the sample records of the training data are input into the five base classifiers in parallel, and the output of the multi-classification model formed by combining the five classifiers is the label of the corresponding sample, as shown in fig. 3.
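The combination step can be illustrated with a minimal sketch. It assumes that each base classifier outputs a score in [0, 1], that the adjustment factor is a single additive term (called `bias` here), and that the weights are fitted by plain stochastic gradient descent on squared error. That fitting rule is a stand-in chosen for brevity, not the built-in algorithm of the module, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def combine(scores: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Weighted linear addition of base classifier scores plus an adjustment term."""
    return sum(w * s for w, s in zip(weights, scores)) + bias


def fit_weights(samples: List[Sequence[float]], labels: Sequence[float],
                lr: float = 0.1, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit weights and bias by stochastic gradient descent on squared error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for scores, y in zip(samples, labels):
            err = combine(scores, weights, bias) - y
            weights = [w - lr * err * s for w, s in zip(weights, scores)]
            bias -= lr * err
    return weights, bias


def predict(scores: Sequence[float], weights: Sequence[float], bias: float,
            threshold: float = 0.5) -> int:
    """Return 1 (phishing) when the combined score reaches the threshold, else 0."""
    return 1 if combine(scores, weights, bias) >= threshold else 0
```

In this sketch a base classifier whose scores carry no information about the label ends up with a weight near zero.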

The integrated classifier is the final model output by the linear addition training module. Its functions include but are not limited to: 1) constructing the detection model for phishing websites; 2) matching counterfeited objects with category labels; 3) dynamically maintaining the classifier weights and iteratively integrating the feature libraries. A feature library here is the feature library of a classifier in use, such as that of the domain name morpheme feature classifier.

In the integrated classifier, not every base classifier participates in the final detection: if a weight parameter is 0, the corresponding base classifier is not enabled. The integrated model also takes classifier performance into account; for example, the visual rule feature classifier is not used in coarse classification because it is time-consuming.
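The gating behaviour described above can be sketched as follows, assuming the base classifiers are callables keyed by name, the adjustment factor is a single additive `bias`, and the set of slow classifiers is configured by hand. The function `detect` and its parameters are hypothetical.

```python
from typing import Callable, Dict, FrozenSet


def detect(record: str,
           classifiers: Dict[str, Callable[[str], float]],
           weights: Dict[str, float],
           bias: float,
           slow: FrozenSet[str] = frozenset({"visual"}),
           coarse: bool = True) -> float:
    """Combine only the enabled base classifiers into one score.

    A classifier is skipped when its weight is 0, or when it is marked
    slow and only a coarse classification is requested.
    """
    total = bias
    for name, classifier in classifiers.items():
        weight = weights.get(name, 0.0)
        if weight == 0.0:
            continue  # weight 0: the base classifier is not enabled
        if coarse and name in slow:
            continue  # too time-consuming for the coarse pass
        total += weight * classifier(record)
    return total
```

Skipped classifiers are never called, so their cost is avoided entirely rather than merely multiplied by zero.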

The training data set management module stores the training data. The training data set mainly comprises data samples such as the URLs, source code and website information of phishing websites. The functions of the module include: 1) label management and grouping of the training sample data; 2) dividing the training samples into test sets and training sets and maintaining them; 3) balancing the samples across the different groups.

The quality of the training sample data is as important to the final result as the quality of the classifiers. Sample management is therefore handled in an independent module, which focuses on managing the distribution of samples across categories so as to prevent sample imbalance.
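A minimal sketch of the grouping, balancing and division functions follows. It assumes samples are `(record, label)` pairs, balances by downsampling every group to the size of the smallest one, and uses a simple hold-out split. These choices and the function names are assumptions made for illustration; the module's actual policy is not specified in the text.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (sample record, sample label)


def balance_groups(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Group samples by label and downsample each group to the smallest group size."""
    groups: Dict[str, List[Sample]] = defaultdict(list)
    for record, label in samples:
        groups[label].append((record, label))
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced: List[Sample] = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], smallest))
    return balanced


def split_samples(samples: List[Sample], test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle the samples and divide them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed keeps the division reproducible, which helps when the same sets have to be maintained across training runs.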

The detection and alarm module has two functions: 1) detecting phishing websites, where the phishing website model consists of the parameters and feature libraries of the integrated classifier; 2) raising alarms for detected phishing websites, where the alarm information and level are user-configurable.

In this module, alarm information is graded by importance according to how much attention the user pays to different counterfeited objects. For example, a page may show lottery information on its first screen but counterfeit Bank of China information on its second screen. Because the bank is of greater concern than the lottery, the page is preferentially classified and alarmed as counterfeiting the bank.
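The prioritization in the example can be sketched as a lookup in a user-defined priority table. The table contents, the level values and the function `alarm_for` are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List, Optional

# Hypothetical user-configured priorities: a higher value means greater concern.
PRIORITY: Dict[str, int] = {"bank": 3, "payment": 2, "lottery": 1}


def alarm_for(detected_targets: List[str],
              priority: Optional[Dict[str, int]] = None) -> Optional[Dict[str, object]]:
    """Pick the counterfeited object the user cares about most and its alarm level."""
    if not detected_targets:
        return None
    table = PRIORITY if priority is None else priority
    target = max(detected_targets, key=lambda name: table.get(name, 0))
    return {"target": target, "level": table.get(target, 0)}
```

For a page that matches both lottery and bank content, `alarm_for(["lottery", "bank"])` selects the bank.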

The five base classifiers in the system are independent of one another and uncorrelated. The linear addition training module is the core of the system: all parameters are trained in this module, and the linear model ensures the performance of the system and the convergence of the computation. The integrated classifier determines which base classifiers to combine according to the parameters output by the linear addition; possibly only 2 to 3 classifiers are needed, and not all 5 classifiers necessarily take part in the integration.

Fig. 4 shows a deployment diagram of the system of the present invention. The five base classifiers form a base learner server group, and the linear addition training module, the integrated classifier, the training data set management module and the detection and alarm module are deployed in a distributed manner at the networking switch.

Claims

1. A phishing website detection method based on an adaptive heterogeneous multi-classification model is characterized by comprising the following steps:

step 1, performing cross validation on a phishing website sample set D of the same category by the leave-one-out method to divide it into training sets and test sets; the j-th training set is denoted D_j and has a corresponding j-th test set; each sample contains a sample record and a sample label; the sample record comprises the URL and web page information of a website, and the sample label marks whether the website is a phishing website; j is a positive integer;

step 2, constructing an adaptive heterogeneous multi-classification model H through linear addition, as follows:

wherein T is the number of base classification algorithms, h_i is the i-th base classification algorithm, ω_i is the weighting parameter of the i-th base classification algorithm, the model further includes an adjustment factor, and x represents the sample record;

the base classification algorithms include: h₁, representing the domain name morpheme feature classification algorithm, which judges whether a website is a phishing website according to its domain name morphemes; h₂, representing the topic index feature classification algorithm, which judges whether a website is a phishing website according to the content under the topic tags in the web page; h₃, representing the content similarity feature classification algorithm, which compares the similarity of the content under the content tags in the web page to judge whether the website is a phishing website; h₄, representing the structure style feature classification algorithm, which judges whether a website is a phishing website according to the structure of the source code; h₅, representing the visual rule feature classification algorithm, which judges whether a website is a phishing website according to the icons, color scheme and pictures of the web page;

each base classification algorithm is a linear function, and the parameters of the classification algorithms are independent and identically distributed;

step 3, the input of the multi-classification model H being the input of each base classification algorithm and its output being a sample label; for the training set D_j, extracting from the sample record of each sample the features corresponding to each base classification algorithm as input;

step 4, based on the training set D_j, using a machine learning algorithm to train and solve the parameters of each base classification algorithm and the parameters ω_i and the adjustment factor in the multi-classification model H;

solving the parameters of each base classification algorithm by maximum likelihood estimation, and iteratively solving the parameters ω_i and the adjustment factor in the multi-classification model H by the expectation-maximization algorithm;

step 5, testing and optimizing the multi-classification model H on the corresponding j-th test set until the parameters of each base classification algorithm and the parameters ω_i and the adjustment factor in the multi-classification model H converge, whereupon the machine learning procedure for the multi-classification model H is finished;

step 6, obtaining the detection model H' of the phishing website from the finally obtained parameters of each base classification algorithm and the parameters ω_i and the adjustment factor in the multi-classification model H.

2. The method of claim 1, wherein the size of the sample set D is not less than 100.

3. The method according to claim 1 or 2, wherein in step 1, the training set and the test set are represented as follows:

the j-th training set D_j = {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)}, 1 ≤ j ≤ n, 1 < m < n;

the corresponding j-th test set is D/D_j;

wherein n is the number of samples in D, m is the number of samples in D_j, and D/D_j denotes the set D with D_j removed; the i-th sample (x_i, y_i) consists of the record x_i and the label y_i of the i-th sample.

4. The method of claim 1, wherein in step 5, when the parameters ω_i and the adjustment factor in the multi-classification model H cannot converge, the sample labels are corrected, the training set samples are updated, and the training process of step 4 is re-executed.

5. A phishing website detection system based on a self-adaptive heterogeneous multi-classification model is characterized by comprising a domain name morpheme feature classifier, a theme index feature classifier, a content similarity feature classifier, a structure style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training data set management module and a detection and alarm module;

for adjustment factors, x represents the sample record; the classification functions h₁ to h₅ are linear functions, and the parameters of the classification functions are independent and identically distributed;

Training and optimizing; solving the parameters of each base classification algorithm by maximum likelihood estimation, and iteratively solving the parameters ω_i and the adjustment factor in the multi-classification model H by the expectation-maximization algorithm;

the integrated classifier is the final model output by the linear addition training module; it constructs the detection model of the phishing website and dynamically maintains the weights of the classifiers;

6. The system of claim 5, wherein the domain name morpheme feature classifier performs functions comprising: judging the suspicious degree of the top-level domain name; extracting morpheme information contained in the secondary domain name; acquiring the hierarchical structure of the domain name and the length of the sub-domain name; constructing and perfecting a morpheme feature library;

when the top-level domain name is pw, win, top or xyz, the degree of suspicion of a phishing website is high; morpheme information in the secondary domain name refers to the abbreviations of certain banks contained in the character string forming the secondary domain name; the domain name morpheme feature classifier also extracts abbreviations of bank websites joined by hyphens in third-level or fourth-level domain names.