CN108965245A

CN108965245A - Phishing website detection method and system based on an adaptive heterogeneous multi-classification model

Info

Publication number: CN108965245A
Application number: CN201810549417.1A
Authority: CN
Inventors: 臧天宁; 强倩; 杜飞; 周渊
Original assignee: BEIJING RUICHI XINAN TECHNOLOGY Co Ltd; National Computer Network and Information Security Management Center
Current assignee: BEIJING RUICHI XINAN TECHNOLOGY Co Ltd; National Computer Network and Information Security Management Center
Priority date: 2018-05-31
Filing date: 2018-05-31
Publication date: 2018-12-07
Anticipated expiration: 2038-05-31
Also published as: CN108965245B

Abstract

The present invention provides a phishing website detection method and system based on an adaptive heterogeneous multi-classification model. The method constructs the adaptive heterogeneous multi-classification model by linearly adding multiple base classification algorithms and trains it. The model's input is the input of each base classification algorithm and its output is the sample label; each base classification algorithm extracts its corresponding features from the sample record as input. The model parameters are solved with a machine learning algorithm and then tested and optimized on a test set, finally yielding a detection model for that class of phishing website. The system comprises a domain-name morpheme feature classifier, a subject index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training dataset management module, and a detection and alarm module. The invention achieves real-time detection of phishing websites and improves the accuracy and stability of phishing website detection.

Description

Phishing website detection method and system based on an adaptive heterogeneous multi-classification model

Technical field

The present invention relates to the field of computer network security, and in particular to a phishing website detection method and system based on an adaptive heterogeneous multi-classification model.

Background

With the rapid development of Internet technology, network security problems keep emerging. Phishing is a typical form of online fraud: using the Internet as its carrier, it deceives users by masquerading as a reputable legitimate website in order to obtain their sensitive information. Deceived users suffer varying degrees of personal information leakage, which in turn leads to economic loss. How to detect phishing websites quickly and accurately has become a research hotspot in Web (World Wide Web) information security. Published phishing website detection techniques mainly include the following:

(1) Detection based on blacklist and whitelist mechanisms: as a practical core technique, blacklists and whitelists are efficient and accurate. By checking the domain name, a phishing website can be located quickly; this is one of the most commonly implemented techniques [1].

(2) Detection based on visual similarity: Cao Jiuxin et al. proposed a web page similarity decision algorithm based on nested EMD (Earth Mover's Distance). The web page image is segmented, and the features of the resulting sub-images are used to construct the ARG (Attributed Relational Graph) of the page. After the distances between the attributes of different ARGs are computed, the nested EMD method yields the similarity between the phishing page and the protected page, enabling high-precision detection of phishing websites [2].

Li et al. use the EMD algorithm to identify phishing websites by visual similarity. This kind of algorithm uses the result of comparing pixel similarity between web pages as the basis for judging a phishing website [3].

(3) Detection based on Bayesian algorithms: Jin Qing et al. formulated a series of rules for matching phishing websites using a rule-based Bayesian algorithm. Each rule is assigned a weight and its correction factor is calculated, giving the probability that the website under test is a phishing website. A probability threshold then decides whether it is a phishing website [4][5].

Zhuang Weiwei et al. take 8 features, including the page title tag content, website keyword information, page description information, image links, and website copyright information, and integrate classification by an extended Bayesian algorithm and an improved SVM to build a system that can intelligently detect phishing attacks [6].

(4) Detection based on document structure: Guo Minzhe et al. analyze the Web page document object and extract the sensitive phishing features in the document object model that phishers commonly exploit, in order to judge whether the website is a phishing website. The algorithm effectively filters out phishing pages and strongly guards against malicious phishing attacks [7].

(5) Phishing website detection based on deep learning: Xu Long, drawing on deep learning techniques, proposed a multi-layer DBN-KNN model and applied it to feature recognition for identifying phishing websites [8].

(6) Other detection techniques: Huang Huajun et al. proposed active phishing defense based on semi-fragile watermarking [9] and a phishing URL detection algorithm based on anomalous features [10]; Zhang Jianyi et al. proposed a phishing detection algorithm based on text semantic understanding [11]; other defensive measures include research on cross-site phishing attacks against Web-mail mailboxes, cloud-based URL filtering [12], SVM learning algorithms [13], and so on.

Among the above techniques, blacklist- and whitelist-based detection has poor timeliness and incomplete list coverage; visual-similarity detection uses complex algorithms and takes a long time per detection, so it is unsuitable for online real-time detection of massive numbers of URLs (Uniform Resource Locators); Bayesian detection is less than ideal in robustness and generalization; document-structure detection suffers from incomplete feature coverage and therefore many missed detections; deep-learning detection has an advantage in feature recognition, but its features are unstable and easily disturbed by contaminated samples.

References:

[1] Huang C., Ma S., Chen K. Using One-Time Passwords to Prevent Password Phishing Attacks [J]. Journal of Network and Computer Applications, 2011, 34(4): 1292-1301.

[2] Cao Jiuxin, Mao Bo, Luo Junzhou, et al. Phishing web page detection algorithm based on nested EMD [J]. Chinese Journal of Computers, 2009, 32(5): 922-929.

[3] Li et al. Phishing detection method based on visual similarity [J]. Journal of Tsinghua University (Natural Science Edition), 2009, 49(1): 146-148.

[4] Zhang H., Liu G., Chow T.W.S., et al. Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach [J]. IEEE Transactions on Neural Networks, 2011, 22(10): 1532-1546.

[5] Jin Qing, Wu Guoxin, Li Dan, et al. Filtering phishing with a rule-based Bayesian algorithm [R]. Hunan: 5th China Conference on Information and Communication Security, 2007: 372-378.

[6] Zhuang Weiwei, Ye Yanfang, Li Tao, et al. Intelligent phishing website detection system based on classification ensemble [J]. Systems Engineering - Theory & Practice, 2011, 31(10): 2008-2020.

[7] Guo Minzhe, Yuan Jinsheng, Wang Yachao, et al. Phishing Web page detection algorithm [J]. Computer Engineering, 2008, 34(20): 161-163.

[8] Xu Long. Research on phishing website detection technology based on deep learning [D]. University of Electronic Science and Technology of China, 2017.

[9] Yin Shuijun, Liu Jiayong, Liu Liang. Research on cross-site phishing attacks against Web-mail mailboxes [J]. Communications Technology, 2010, 43(8): 164-166.

[10] Huang H.J., Wang Y.J., Xie L.L., et al. An Active Anti-Phishing Solution Based on Semi-fragile Watermark [J]. Information Technology Journal, 2013, 12(1): 198-203.

[11] Huang H.J., Qian L., Wang Y.J. A SVM-Based Technique to Detect Phishing URLs [J]. Information Technology Journal, 2012, 11(7): 921-925.

[12] Zhang H., Liu G., Chow T.W.S., et al. Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach [J]. IEEE Transactions on Neural Networks, 2011, 22(10): 1532-1546.

[13] Sheng S., Wardman B., Warner G., et al. An Empirical Analysis of Phishing Blacklists [C]. In: Proc. of the Sixth Conference on Email and Anti-Spam, 2009: 1-10.

Summary of the invention

In view of the problems of the above existing methods, the invention discloses a phishing website detection method and system based on an adaptive heterogeneous multi-classification model, for detecting phishing websites in real time with high accuracy and stability.

The invention discloses a phishing website detection method based on the adaptive heterogeneous multi-classification (AHMC) model. The method comprises learning the adaptive heterogeneous multi-classification model and detecting phishing websites. The specific steps are:

Step 1: for a phishing website sample set D of a given category, with |D| = n, divide it into training sets and test sets using leave-one-out cross-validation. Denote the j-th training set D_j and the corresponding j-th test set D/D_j, where j is a positive integer. Each sample comprises a sample record and a sample label; the sample record contains the website's URL and web page information, and the sample label marks whether the website is a phishing website.

Step 2: the adaptive heterogeneous multi-classification model H is constructed by linear addition: H(x) is the sum of the base classification algorithms h_i(x), each weighted by ω_i, together with an adjustment factor.

Here T is the number of base classification algorithms, h_i is the i-th base classification algorithm, ω_i is the weight parameter of the i-th base classification algorithm, and x denotes a sample record.
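The formula for H is not legible in this text. Under the definitions given here, and assuming the adjustment factor is a single additive term (written b below; both the symbol and its placement are assumptions, not taken from the source), one consistent reading is:

```latex
H(x) = \sum_{i=1}^{T} \omega_i \, h_i(x) + b
```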

Step 3: the input of the multi-classification model H is the input of each base classification algorithm, and its output is the sample label. For training set D_j, the features corresponding to each base classification algorithm are extracted from the sample record of each sample as input. Each base classification algorithm is characterized as a linear function, and the parameters of the classification algorithms are independent and identically distributed.

Step 4: based on training set D_j, use a machine learning algorithm to train and solve for the parameters of each base classification algorithm together with the weight parameters ω_i and the adjustment factor of the multi-classification model H. During training, the corresponding features are extracted from the sample record as input to each base classification algorithm, with priority given to ensuring that the output of the multi-classification model is the sample label.

Step 5: test and optimize the multi-classification model H on the test set D/D_j until the parameters of each base classification algorithm and the weight parameters ω_i and adjustment factor of H converge, then terminate the machine learning algorithm for H.

Step 6: the finally obtained parameters of each base classification algorithm, together with the weight parameters ω_i and adjustment factor of H, give the detection model H' for this class of phishing website.

Step 7: obtain the record of the website to be detected, including its URL and web page information, and input it into the detection model H' to judge whether it is a phishing website.

The invention also discloses a phishing website detection system based on the adaptive heterogeneous multi-classification model, comprising a domain-name morpheme feature classifier, a subject index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training dataset management module, and a detection and alarm module. The functions of the modules during system operation are as follows:

The domain-name morpheme feature classifier performs feature extraction and training on the domain name string of the input website URL;

The subject index feature classifier performs feature extraction and training on the content of the <title> and <meta> tags and the footer of the input website's web page;

The content similarity feature classifier performs feature extraction and semantic abstraction on the information in the web page content of the input website, and trains on the features;

The structural style feature classifier performs feature extraction and training on the web page source code structure of the input website;

The visual rule feature classifier extracts and learns the salient visual features of the input website's web page;

The linear addition training module linearly combines the domain-name morpheme, subject index, content similarity, structural style, and visual rule feature classifiers. In the combination, H(x) denotes the multi-classification model combining the five classifiers, h₁ to h₅ are the classification functions of the five classifiers, ω_i is the weight parameter of the i-th classifier, and x denotes a sample record; the combination also includes an adjustment factor;

The linear addition training module uses the training set and test set to train and optimize the parameters of the five classifiers together with ω_i and the adjustment factor;

The integrated classifier is the final model output by the linear addition training module; it constructs the phishing website detection model and dynamically maintains the weight of each classifier;

The training dataset management module stores the training dataset, manages the labels and grouping of training samples, divides and maintains the training and test sets, and manages the balance of sample sampling across groups;

The detection and alarm module detects the website to be detected according to the constructed phishing website detection model and raises an alarm when a phishing website is detected.

Compared with the prior art, the invention has the following clear advantages:

(1) The method and system use ensemble learning: multiple weak classifiers are linearly combined to obtain generalization performance significantly better than that of a single classifier, improving the accuracy and stability of phishing website detection.

(2) The method and system use adaptive weight learning: weights are learned from samples rather than relying on prior experience or the advice of domain experts, and when sample features and distributions change, the weights are updated automatically during training.

(3) The five weak classifiers in the system are heterogeneous. Each single classifier has a certain accuracy, and the classifiers do not depend strongly on one another, which ensures that the integrated model has higher accuracy. The overall cost of the five weak classifiers is no greater than that of a single complex learning model.

(4) The method can detect phishing websites in real time with high accuracy, recall, and robustness.

(5) The system can handle high-performance real-time detection of massive numbers of URLs, is practical for online engineering systems, and has high availability and stability.

Brief description of the drawings

Fig. 1 is a flow diagram of the phishing website detection method of the invention based on the adaptive heterogeneous multi-classification model;

Fig. 2 is a schematic diagram of the module composition of the phishing website detection system of the invention based on the adaptive heterogeneous multi-classification model;

Fig. 3 is a schematic diagram of how the linear addition training module performs data training in the phishing website detection system of the invention;

Fig. 4 is a deployment diagram of the phishing website detection system of the invention based on the adaptive heterogeneous multi-classification model.

Detailed description of the embodiments

The technical solution of the invention is described in detail below with reference to the drawings and embodiments. The examples serve only to explain the invention and are not intended to limit its scope.

As shown in Fig. 1, the invention provides a phishing website detection method based on the adaptive heterogeneous multi-classification (AHMC) model. The method comprises learning the adaptive heterogeneous multi-classification model and detecting phishing websites; each implementation step is described below.

Step 1: select phishing websites of the same category, for example all phishing websites that counterfeit banks, as the sample set D, where |D| = n and n is the number of samples in D. Divide the samples into training sets and test sets using leave-one-out cross-validation.

The j-th training sample set is: D_j = {(x₁, y₁), (x₂, y₂), …, (x_m, y_m)} (1 ≤ j ≤ n, 1 < m < n);

The corresponding j-th test sample set is D/D_j.

Each sample comprises a record x and a label y. For example, (x₁, y₁) in the sample set denotes one specific sample instance, where x₁ is the sample record and y₁ is the sample label. The sample record here includes the website's URL and the corresponding web page information, and the sample label marks whether the website is a phishing website. D/D_j denotes the set D with D_j removed.

In this step the sample size n should be as large as possible; at least 100 is recommended.
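The split in step 1 can be sketched as follows. The text names leave-one-out cross-validation but also allows a training set of size m with 1 < m < n, so this sketch takes m as a parameter; m = n - 1 gives the strict leave-one-out case. The function name, the random shuffling, and the seed are illustrative assumptions, not part of the source.

```python
import random


def make_splits(samples, m, k, seed=0):
    """Return k (train, test) pairs: train is D_j with m samples, test is D/D_j."""
    n = len(samples)
    if not 1 < m < n:
        raise ValueError("need 1 < m < n")
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(k):
        order = list(range(n))
        rng.shuffle(order)
        train = [samples[i] for i in order[:m]]   # D_j
        test = [samples[i] for i in order[m:]]    # D/D_j: everything not in D_j
        splits.append((train, test))
    return splits


# Each sample is (record, label); here the record is just a placeholder URL string.
samples = [("url%d" % i, i % 2) for i in range(100)]
splits = make_splits(samples, m=99, k=3)
```

Every pair partitions D, so no sample appears in both the training set and its test set.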

Step 2: construct the adaptive heterogeneous multi-classification model H as a linear weighted sum of the base classification algorithms together with an adjustment factor.

Here T is the number of base classification algorithms, h_i is the i-th base classification algorithm, and ω_i is the weight parameter of the i-th base classification algorithm.

In what follows, a base classification algorithm is also called a classification algorithm or learning algorithm, and the corresponding classifier is also called a learner. The base classification algorithms are heterogeneous, which ensures algorithmic diversity. This embodiment uses five fixed classification algorithms: h₁ is the domain-name morpheme feature classification algorithm, which judges whether a website is a phishing website from the morphemes of its domain name; h₂ is the subject index feature classification algorithm, which judges from the content under the subject tags of the web page; h₃ is the content similarity feature classification algorithm, which judges by comparing the similarity of content under the content tags of the web page; h₄ is the structural style feature classification algorithm, which judges from the structure of the source code; h₅ is the visual rule feature classification algorithm, which judges from the page's icons, color scheme, pictures, and so on. In practice the set can be extended as needed, following the principle of heterogeneity.

Step 3: take the sample record of each sample in training set D_j as the input of the base classification algorithms and the sample label as the output, and train the multi-classification model H. For a sample (x_i, y_i), x_i is the input to each classification algorithm h₁ to h₅: for each base classification algorithm, the features it needs are extracted from x_i, and y_i is taken as the corresponding output to train the features and parameters of that classification algorithm, as follows:

h₁(x_i) → y_i, h₂(x_i) → y_i, h₃(x_i) → y_i, h₄(x_i) → y_i, h₅(x_i) → y_i

For the multi-classification model H in this embodiment, when the input is the sample record x_i the output is the corresponding sample label y_i, expressed as a linear weighted combination of the classification functions h₁ to h₅. Training the multi-classification model yields the weight parameters ω_i and the adjustment factor.

During training, the method extracts the corresponding features from the input sample record x_i as the inputs of the classification algorithms h₁ to h₅, giving priority to ensuring that the output of the multi-classification model is the sample label y_i; the output of each classification algorithm can also be set to y_i. The parameters of each classification algorithm and the weight parameters and adjustment factor of the multi-classification model are trained. The base classification algorithms in the invention are characterized as linear functions and their parameters are independent and identically distributed, so the overall training cost is no greater than that of a single complex learning model. Different base classification algorithms may have different input features, and the applicable features must be selected from the sample record. For example, the input features of the domain-name morpheme feature classification algorithm h₁ include the top-level domain, the second-level domain, and so on.

Step 4: use a machine learning algorithm to train and solve for the parameters of each classification algorithm together with the weight parameters and the adjustment factor. For example, the parameters of each classification algorithm can be solved by maximum likelihood estimation, and the parameters of the integrated model H can be solved iteratively with the EM (expectation-maximization) algorithm. The constraints can be formalized as minimizing a loss function. The solution process can be implemented in a unified computational framework, maximum-likelihood parameter estimation, which the computer carries out uniformly as matrix operations.
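A minimal sketch of joint training is given below, assuming each base classifier is a linear function of its own feature list and that the adjustment factor is one additive term b. It fits everything in a single pass by stochastic gradient descent on the logistic loss, which is a maximum-likelihood fit for 0/1 labels; the EM iteration and matrix formulation mentioned in the text are not implemented, and all names here are hypothetical.

```python
import math


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def base_output(params, feats):
    """Linear base classifier h_i: dot(v, feats) + c."""
    v, c = params
    return sum(vj * fj for vj, fj in zip(v, feats)) + c


def score(model, views):
    """H(x) = sum_i w_i * h_i(x) + b; views[i] is the feature list for h_i."""
    total = model["b"]
    for w, params, feats in zip(model["w"], model["base"], views):
        total += w * base_output(params, feats)
    return total


def train(data, dims, epochs=200, lr=0.1):
    """data: list of (views, label), label in {0, 1}; dims: feature count per h_i."""
    model = {"w": [1.0] * len(dims), "b": 0.0,
             "base": [([0.0] * d, 0.0) for d in dims]}
    for _ in range(epochs):
        for views, y in data:
            err = sigmoid(score(model, views)) - y  # d(log-loss)/dH
            new_base = []
            for i, ((v, c), f) in enumerate(zip(model["base"], views)):
                h, w = base_output((v, c), f), model["w"][i]
                # Base parameters and their ensemble weight move in the same step.
                v = [vj - lr * err * w * fj for vj, fj in zip(v, f)]
                c -= lr * err * w
                model["w"][i] = w - lr * err * h
                new_base.append((v, c))
            model["base"] = new_base
            model["b"] -= lr * err
    return model


# Toy data: two base classifiers with one feature each; the label follows the first.
data = [([[1.0], [0.0]], 1), ([[0.0], [1.0]], 0),
        ([[1.0], [1.0]], 1), ([[0.0], [0.0]], 0)]
model = train(data, dims=[1, 1], epochs=500)
```

Updating the base parameters and the weights together reflects the text's point that training is not split into two stages.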

Step 5: test and optimize the model H on the test sample set D/D_j. Cycle through all test samples and training samples until the parameters and the adjustment factor converge to a stable threshold, at which point the model's learning algorithm terminates.

This step serves two purposes, testing and optimization. When test samples conflict, or the adjustment factor fails to converge, the samples are corrected: the training samples are reclassified and handled individually, the sample labels are corrected, and the training set is updated. The model H is then retrained once to adjust the classification algorithm parameters, that is, the training process of step 4 is re-executed, thereby optimizing the parameters and the adjustment factor.

The method obtains training and test sets by the leave-one-out method. If K groups of training and test sets are obtained in total, steps 3 to 5 above are executed for each group, finally giving multiple groups of base classification algorithm parameters and adjustment factors. These classification algorithm parameters and adjustment factors can then be averaged to give the final result.
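The averaging over K groups can be sketched as below, assuming each group's parameters have been flattened into a list of the same length. The text only says the results are combined by averaging, so the element-wise mean and the function name are assumptions.

```python
def average_parameters(runs):
    """Element-wise mean of K parameter vectors, one per train/test group."""
    if not runs:
        raise ValueError("no runs to average")
    k = len(runs)
    # zip(*runs) groups the same parameter position across all K runs.
    return [sum(values) / k for values in zip(*runs)]


# Two groups, each with one weight parameter and one adjustment factor.
final = average_parameters([[1.0, 2.0], [3.0, 6.0]])
```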

Step 6: from the classification algorithm parameters, weight parameters ω_i, and adjustment factor obtained in step 5, obtain the adaptive heterogeneous multi-classification model H' for this class of phishing website.

The optimized H obtained in step 5 is a model instance. Its parameters are migrated to initialize the phishing website detection algorithm H'. Models H' and H are isomorphic; in this embodiment H' is a mixed model integrating h₁ to h₅.

Step 7: for a website to be detected, obtain its record, including web page information such as the website URL and page source code, and input it into the detection model H' to determine whether the URL is a phishing website, along with information such as the counterfeited target. The input web page information does not need to be formatted; the features used by each classifier are all obtained automatically from the page source code structure.

In this step, crawler techniques can be used to obtain the site information and source code data corresponding to the URL. When new features or variants appear, only the corresponding base classification algorithm and its features need to be updated; the effect on the weight parameters and adjustment factor is small.

The invention uses the idea of ensemble learning. Its main difference from classical ensemble learning is this: classical ensemble learning has two stages, the first training each base classifier and the second taking the first-stage outputs as input to train the parameters for merging the base classifiers. The invention instead trains everything together in a unified computational framework, with no two-stage division.

The invention discloses a phishing website detection system based on the adaptive heterogeneous multi-classification model, consisting mainly of nine parts: a domain-name morpheme feature classifier, a subject index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training dataset management module, and a detection and alarm module. The functions of the modules during system operation are described below with reference to Fig. 2.

Domain-name morpheme feature classifier: the features of this classifier are statistical features of the domain name portion of the phishing website URL string. The classifier performs feature extraction and training on the domain name string of the input website URL. Its functions include but are not limited to: 1) judging how suspicious the top-level domain is; 2) extracting the morpheme information contained in the second-level domain; 3) obtaining the hierarchical structure of the domain name and the lengths of the subdomains; 4) building and improving the morpheme feature library.

In the domain-name morpheme feature classifier: 1) the suspiciousness of a top-level domain comes from statistical experience; phishing websites are usually more likely under top-level domains such as pw, win, top, and xyz; 2) morpheme information in the second-level domain means that the second-level domain string contains the abbreviation or identifier of a bank, such as 95588, 95533, cmb, icbc, or boc; 3) third- or fourth-level domains may contain a hyphen-joined short form of a bank address, such as www-bankofbeijing-com-cn.
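The domain-name features listed above can be sketched as follows. The top-level domains and bank morphemes are the examples named in the text; the function name, the returned keys, and the naive split on dots (which ignores multi-part suffixes such as com.cn) are assumptions, and the example URL is made up.

```python
from urllib.parse import urlsplit

SUSPICIOUS_TLDS = {"pw", "win", "top", "xyz"}               # examples from the text
BANK_MORPHEMES = ("95588", "95533", "cmb", "icbc", "boc")   # examples from the text


def domain_features(url):
    """Extract simple statistical features from the host part of a URL."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    sld = labels[-2] if len(labels) >= 2 else ""
    subdomains = labels[:-2]  # third-level domain and deeper
    return {
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
        "sld_morphemes": [m for m in BANK_MORPHEMES if m in sld],
        "depth": len(labels),
        "subdomain_lengths": [len(s) for s in subdomains],
        "hyphenated_subdomain": any("-" in s for s in subdomains),
    }


features = domain_features("http://www-bankofbeijing-com-cn.icbc-login.top/index.html")
```

A production version would need a public-suffix list to find the registrable domain; the dot split here is only for illustration.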

Subject index feature classifier: it mainly performs feature extraction and training on the content of the <title> and <meta> tags and the footer of the input website's web page. Its functions include but are not limited to: 1) extracting the features in the tags and performing conflict resolution and type classification on them; 2) building and improving the subject index feature library. Its advantage is fast and accurate localization; its disadvantages are weak generalization and a high false positive rate. The content of the <title> tag may differ little from that of a legitimate, non-counterfeit website, or may be unrelated to the page body. The classifier therefore needs to work together with a whitelist library for classification.

Content similarity feature classifier: it mainly performs feature extraction and semantic abstraction on short-text information in the web page content. Its functions include but are not limited to: 1) extracting the text inside the <body> tag, where content in tags such as <a>, <p>, <div>, <span>, <td>, <table>, and <form> is extracted when it is no more than 15 characters long, and content in the body text is extracted in spans of 2 to 8 characters; 2) vectorizing and standardizing the text features; 3) word embedding, using the Word2Vec tool to map words to low-dimensional feature vectors; 4) building the word feature vector library.

The detection performance of the content similarity feature classifier is stable, and its accuracy and recall are better than those of the other classifiers. Vectorization here means deduplicating and filtering the short texts; standardization means deleting specific time words, frequently changing numbers, noise words that occur too often, non-discriminative advertisements, third-party link words, and the like.
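The first extraction rule can be sketched with Python's standard `html.parser` module. The sketch covers only the 15-character rule for text directly inside the listed tags within <body>; the 2 to 8 character rule, vectorization, and Word2Vec embedding are not shown. The class name, the void-tag list, and the tolerant handling of unclosed tags are assumptions.

```python
from html.parser import HTMLParser

SHORT_TEXT_TAGS = {"a", "p", "div", "span", "td", "table", "form"}  # tags named in the text
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}            # never have a close tag


class ShortTextExtractor(HTMLParser):
    def __init__(self, max_len=15):
        super().__init__()
        self.max_len = max_len
        self.in_body = False
        self.stack = []      # currently open tags, innermost last
        self.snippets = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed inner tags.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not (self.in_body and text and self.stack):
            return
        if self.stack[-1] in SHORT_TEXT_TAGS and len(text) <= self.max_len:
            self.snippets.append(text)


page = ("<html><head><title>t</title></head><body>"
        "<p>Online banking</p>"
        "<div>This sentence is much longer than fifteen characters</div>"
        "<a href='#'>Login</a></body></html>")
extractor = ShortTextExtractor()
extractor.feed(page)
```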

Structural style feature classifier: it performs feature extraction and training on the web page source code structure of the input website. Its main functions include: 1) source code analysis of JS script code, CSS styles, form lists, and DOM structure; 2) analysis of homologous code structures in the source code and extraction of common code snippets; 3) building the homologous-code similarity matrix.

The structural style feature classifier can predict new phishing websites developed by the same organization behind the scenes. The common code snippets include but are not limited to: 1) identical function names; 2) matching CSS color schemes; 3) identical JS scripts; 4) identical selection lists and <form> forms; 5) identical redirect hyperlinks and external page links.
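One way to picture a homologous-code similarity matrix is to represent each page by the set of code tokens found in it (function names, script names, form names) and compare the sets pairwise. The text does not say which similarity measure is used; Jaccard overlap and the names below are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Share of tokens common to both sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(pages):
    """pages: list of token sets, one per page; returns a square list of lists."""
    return [[jaccard(p, q) for q in pages] for p in pages]


# Hypothetical token sets: the first two pages share the same function names.
pages = [{"checkLogin", "submitForm"}, {"checkLogin", "submitForm"}, {"render"}]
matrix = similarity_matrix(pages)
```

Pages built from the same kit score close to 1 against each other, which matches the text's claim that the classifier can anticipate new sites from the same developers.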

Visual rule feature classifier: it mainly extracts and learns the salient visual features of the input website's web page. The extracted features include but are not limited to: 1) the logo icon of the target website; 2) the overall color scheme and frame layout of the website; 3) salient picture modules.

The disadvantage of the visual rule feature classifier is that both learning and detecting visual features take a long time, and pixel-level differences in the same logo can cause large errors, so the quality requirements for training samples are stricter. The visual feature library should contain no fewer than 30000 entries.

Linear addition training module passes through the study of weighting parameters and Dynamic gene, to base classifier --- the domain name of isomery Morpheme feature classifiers, subject index feature classifiers, content similarities feature classifiers, structural style feature classifiers, view Feel that rule feature classifier carries out linear combination and training, obtains stable weighting parameters and Dynamic gene.

In linear addition training module, the training of linear addition depends on the quality of training sample, weighting parameters The operation automated with Dynamic gene according to built-in algorithm.Linear addition training module utilizes training set and test set pair Sorting parameter and ω in five kinds of classifiers_i、It is trained and optimizes.In training, the sample record of training data is defeated parallel Enter in five kinds of base classifiers, the output of more disaggregated models of five kinds of classifiers combinations is the label of corresponding sample, and the form of expression is such as Shown in Fig. 3.

Integrated classifier is the final mask of linear addition training module output, and the function of realization includes but is not limited to 1) to bear Blame the detection model building of fishing website；2) counterfeit object and class label are matched；3) to the Dynamic Maintenance of classifier weight It is integrated with the iteration of feature database.Feature database herein is the feature database of used classifier, such as domain name morpheme tagsort Feature database of device etc..

In the integrated classifier, not every base classifier takes part in the final detection: if a weight parameter is 0, the corresponding base classifier is not enabled. The integrated model also takes classifier performance into account; for example, the visual rule feature classifier is not used in coarse classification because it takes longer to run.
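The enabling rule above can be sketched as a small selection step. This is an illustrative sketch under stated assumptions: the classifier names, the `slow` set, and the `coarse` flag are hypothetical and stand in for whatever the integrated classifier tracks internally.

```python
from typing import Dict, List, Set


def select_active(weights: Dict[str, float], slow: Set[str], coarse: bool) -> List[str]:
    """Return the names of the base classifiers that take part in detection."""
    active = []
    for name, weight in weights.items():
        if weight == 0:
            # A zero weight means the base classifier is not enabled.
            continue
        if coarse and name in slow:
            # Time-consuming classifiers are skipped in coarse classification.
            continue
        active.append(name)
    return active


# Hypothetical weights for the five base classifiers.
weights = {"domain": 0.4, "topic": 0.0, "content": 0.3, "structure": 0.2, "visual": 0.1}
coarse_set = select_active(weights, slow={"visual"}, coarse=True)
fine_set = select_active(weights, slow={"visual"}, coarse=False)
```

With these example weights, the topic classifier is never selected, and the visual classifier is selected only when `coarse` is false.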

The training dataset management module stores the training data. The training dataset consists mainly of data samples such as the URLs, source code, and site information of phishing websites. The functions of the module include: 1) label management and grouping of the training sample data; 2) division of the training samples into test sets and training sets, and maintenance of those sets; 3) managing the balance of samples across the different groups.

For the final result, the quality of the training sample data is as important as the quality of the classifiers. Sample management is therefore handled in an independent module, whose emphasis is on managing the distribution of samples across classes so that imbalanced samples are avoided.
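One way to picture the balance management described above is a per-label split plus an imbalance check. This is a sketch, not the module's actual procedure: the stratified split, the `test_fraction` parameter, and the function names are assumptions introduced for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple

Sample = Tuple[str, Hashable]  # (sample record, label); hypothetical layout


def imbalance_ratio(labels: Sequence[Hashable]) -> float:
    """Largest class size divided by smallest class size (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def stratified_split(
    samples: Sequence[Sample], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Split each label group separately so both sets keep the class mix."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        k = max(1, round(len(group) * test_fraction))
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test
```

A management module could, for example, refuse to start training when `imbalance_ratio` exceeds a configured threshold; that threshold is likewise an assumption and is not specified in the text.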

Detection and alarm module: this module provides two functions: 1) phishing website detection, where the model consists of the parameters and feature libraries of the integrated classifier; 2) alarms for detected phishing websites, where the alarm information and level can be custom-configured by the user.

In this module, alarm information is graded by importance, mainly according to how much attention the user pays to different impersonated targets. For example, suppose a page shows lottery information on its first screen but information impersonating Bank of China on its second screen. Because the attention given to banks is higher than that given to lottery sites, the page is preferably classified and alarmed as a bank impersonation.
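The grading rule in this example can be sketched as picking the detected target with the highest user-configured attention. The attention scores, category names, and function name below are hypothetical; the text only states that alarm information and level are configurable by the user.

```python
from typing import Dict, Optional, Sequence


def primary_target(detected: Sequence[str], attention: Dict[str, int]) -> Optional[str]:
    """Return the impersonated target the alarm should be filed under."""
    if not detected:
        return None
    # Targets without a configured score fall back to the lowest priority (0).
    return max(detected, key=lambda target: attention.get(target, 0))


# Hypothetical user configuration: banks matter more than lottery sites.
attention = {"bank": 10, "lottery": 3}
# Targets in the order they appear on the page (first screen, second screen).
label = primary_target(["lottery", "bank"], attention)
```

Here the page is filed under the bank category even though the lottery content appears first on the page.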

The five base classifiers in the system of the present invention are independent of one another and uncorrelated. The linear addition training module is the core of the system: all parameters are trained in this module, and the linear model ensures the performance of the system and the convergence of the computation. The integrated classifier determines the combination of base classifiers according to the parameters output by the linear addition training module; not all five classifiers necessarily take part in the integration, and only two or three may be needed.

Fig. 4 shows a deployment diagram of the system of the present invention. The five base classifiers form a base learner server group, while the linear addition training module, the integrated classifier, the training dataset management module, and the detection and alarm module are deployed in a distributed manner at the network switch.

Claims

1. A phishing website detection method based on an adaptive heterogeneous multi-classification model, characterized in that the method comprises:

Step 1: for a phishing website sample set D of a single category, divide D into training sets and test sets using leave-one-out cross-validation; the j-th training set is denoted D_j and has a corresponding j-th test set; each sample comprises a sample record and a sample label; the sample record contains the URL and web page information of the website, and the sample label marks whether the website is a phishing website; j is a positive integer;

where T is the number of base classification algorithms, h_i is the i-th base classification algorithm, ω_i is the weight parameter of the i-th base classification algorithm, and x denotes a sample record; the model additionally contains an adjustment factor;

Step 3: the input of the multi-classification model H is the input of each base classification algorithm, and the output is the sample label; for training set D_j, the features corresponding to each base classification algorithm are extracted from the sample record of each sample as input; each base classification algorithm is characterized as a linear function, and the parameters of the classification algorithms are independent and identically distributed;

Step 4: based on training set D_j, use a machine learning algorithm to train and solve for the parameters of each base classification algorithm and the parameters in the multi-classification model H;

Step 5: test and optimize the multi-classification model H on the corresponding test set until the parameters of each base classification algorithm and the parameters in the multi-classification model H converge, then terminate the machine learning algorithm of the multi-classification model H;

Step 6: obtain the detection model H′ for this category of phishing website from the finally obtained parameters of each base classification algorithm and the parameters in the multi-classification model H;

Step 7: obtain the record of a website to be detected, including the URL and web page information of the website, and input it into the detection model H′ to judge whether the website is a phishing website.

2. The method according to claim 1, characterized in that the size of the sample set D is not less than 100.

3. The method according to claim 1 or 2, characterized in that, in step 1, the training set and the test set are expressed as follows:

The j-th training set D_j = {(x₁,y₁),(x₂,y₂),…,(x_m,y_m)}, 1≤j≤n, 1<m<n;

The corresponding j-th test set is D/D_j;

where n is the number of samples in D, m is the number of samples in D_j, and D/D_j denotes removing D_j from the set D; the i-th sample (x_i, y_i) comprises the i-th sample record x_i and its label y_i.

4. The method according to claim 1, characterized in that, in step 4, the parameters of each base classification algorithm are solved by maximum likelihood estimation, and the parameters in the multi-classification model H are solved iteratively using the EM algorithm.

5. The method according to claim 1, characterized in that, in step 5, when the parameters in the multi-classification model H cannot converge, the sample labels are corrected, the training set samples are updated, and the training process of step 4 is executed again.

6. A phishing website detection system based on an adaptive heterogeneous multi-classification model, characterized by comprising a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training dataset management module, and a detection and alarm module;

The domain name morpheme feature classifier performs feature extraction and training on the domain name string of the URL of the input website;

The topic index feature classifier performs feature extraction and training on the content of the <title> and <meta> tags and the footer of the web pages of the input website;

The content similarity feature classifier performs feature extraction and semantic abstraction on the information in the web page content of the input website, and trains on the resulting features;

The visual rule feature classifier extracts and learns the salient visual features of the web pages of the input website;

The linear addition training module linearly combines the domain name morpheme feature classifier, the topic index feature classifier, the content similarity feature classifier, the structural style feature classifier, and the visual rule feature classifier, in the following form:

where H(x) denotes the multi-classification model combining the five classifiers, h₁ to h₅ are the classification functions corresponding to the five classifiers, ω_i is the weight parameter of the i-th classifier, and x denotes a sample record; the model additionally contains an adjustment factor;

The linear addition training module uses the training set and the test set to train and optimize the parameters of the five classifiers, the weight parameters ω_i, and the adjustment factor;

The integrated classifier is the final model output by the linear addition training module; it builds the detection model for phishing websites and dynamically maintains the weight of each classifier;

The training dataset management module stores the training dataset, performs label management and grouping of the training samples, divides and maintains the training set and the test set, and manages the balance of samples across the different groups;

The detection and alarm module detects websites to be detected according to the constructed phishing website detection model, and raises an alarm when a phishing website is detected.

7. The system according to claim 6, characterized in that, in the integrated classifier, a weight parameter of 0 for a classifier indicates that the classifier is not enabled.

8. The system according to claim 6, characterized in that the functions implemented by the domain name morpheme feature classifier include: judging the degree of suspicion of the top-level domain; extracting the morpheme information contained in the second-level domain; obtaining the hierarchical structure of the domain name and the lengths of its subdomains; and building and improving the morpheme feature library;

wherein the degree of suspicion of a phishing website is high when the top-level domain is pw, win, top, or xyz; the morpheme information in the second-level domain refers to abbreviations of certain banks contained in the second-level domain string; and the domain name morpheme feature classifier also extracts short words for bank addresses formed with hyphens in third-level or fourth-level domain names.