CN108965245A - Phishing website detection method and system based on an adaptive heterogeneous multi-classification model - Google Patents

Info

Publication number
CN108965245A
Authority
CN
China
Prior art keywords
sample
feature
training
website
classifiers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810549417.1A
Other languages
Chinese (zh)
Other versions
CN108965245B (en)
Inventor
臧天宁
强倩
杜飞
周渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING RUICHI XINAN TECHNOLOGY Co Ltd
National Computer Network and Information Security Management Center
Original Assignee
BEIJING RUICHI XINAN TECHNOLOGY Co Ltd
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING RUICHI XINAN TECHNOLOGY Co Ltd, National Computer Network and Information Security Management Center
Priority to CN201810549417.1A
Publication of CN108965245A
Application granted
Publication of CN108965245B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a phishing website detection method and system based on an adaptive heterogeneous multi-classification model. The method constructs the adaptive heterogeneous multi-classification model by linearly adding multiple base classification algorithms and trains the model; the model input is the input of each base classification algorithm and the output is the sample label, and each base classification algorithm extracts its corresponding features from the sample record as input. The model parameters are solved with a machine learning algorithm and then tested and optimized on a test set, finally yielding a detection model for that class of phishing website. The system comprises a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training dataset management module, and a detection and alarm module. The invention achieves real-time detection of phishing websites and improves the accuracy and stability of phishing website detection.

Description

Phishing website detection method and system based on an adaptive heterogeneous multi-classification model
Technical field
The present invention relates to the field of computer network security, and in particular to a phishing website detection method and system based on an adaptive heterogeneous multi-classification model.
Background
With the rapid development of Internet technology, network security problems keep emerging. Phishing is a typical form of online fraud: using the Internet as a carrier, it deceives users by masquerading as a reputable legitimate website in order to obtain their sensitive information. Deceived users suffer varying degrees of personal information leakage, which in turn leads to economic loss. How to detect phishing websites quickly and accurately has become a research hotspot in Web (World Wide Web) information security. Published phishing website detection techniques mainly include the following:
(1) Detection based on black and white lists: as a practical core technique, black and white lists are efficient and accurate. By checking the domain name, a phishing website can be located quickly; this is one of the most commonly used techniques [1].
(2) Detection based on visual similarity: Cao Jiuxin et al. proposed a web page similarity decision algorithm based on nested EMD (Earth Mover's Distance). The web page image is segmented, and the features of the resulting sub-images are used to build the ARG (Attributed Relational Graph) of the page. After the distances between the attributes of different ARGs are computed, the nested EMD method yields the similarity between a phishing page and the protected page, enabling high-precision detection of phishing websites [2].
Li et al. also build on the EMD algorithm and identify phishing websites by visual similarity. This kind of algorithm uses the result of comparing pixel similarity between web pages as the basis for judging a phishing website [3].
(3) Detection based on Bayesian algorithms: Jin et al. used a rule-based Bayesian algorithm and formulated a series of rules for matching phishing websites. Each rule is assigned a weight and its correction factor is computed, giving the probability that the website under test is a phishing website; this probability is then compared with a threshold to decide whether the site is a phishing website [4] [5].
Zhuang Weiwei et al. take eight features, including the content of the web page title tag, website keyword information, page description information, image links and website copyright information, and integrate classification by an extended Bayesian algorithm and an improved SVM, building a system that can intelligently detect phishing attacks [6].
(4) Detection based on document structure: Guo Minzhe et al. analyze the web page document object and extract the phishing-sensitive features in the document object model that phishers commonly exploit, in order to judge whether a website is a phishing website. The algorithm effectively filters out the phishing pages of a website and blocks malicious phishing attacks by phishers [7].
(5) Detection based on deep learning: Xu Long, combining deep learning techniques, proposed a multi-layer DBN-KNN model and applied it to phishing website feature recognition in order to identify phishing websites [8].
(6) Other detection techniques: Huang Huajun et al. proposed active phishing defense based on semi-fragile watermarking [9] and a phishing URL detection algorithm based on abnormal features [10]; Zhang Jianyi et al. proposed a phishing detection algorithm based on text semantic understanding [11]; other defensive measures include research on cross-site phishing attacks against Web-mail mailboxes, URL filtering based on cloud computing [12], and SVM learning algorithms [13].
Among the above techniques, detection based on black and white lists has poor timeliness and limited list coverage; detection based on visual similarity is algorithmically complex and time-consuming, and is not suitable for online real-time detection of massive numbers of URLs (Uniform Resource Locators); detection based on Bayesian algorithms is less than ideal in robustness and generalization performance; detection based on document structure suffers from incomplete feature coverage and many missed detections; detection based on deep learning has an advantage in feature recognition, but its features are less stable and are easily disturbed by sample contamination.
Bibliography:
[1] Huang C., Ma S., Chen K. Using One-Time Passwords to Prevent Password Phishing Attacks [J]. Journal of Network and Computer Applications, 2011, 34(4): 1292-1301.
[2] Cao Jiuxin, Mao Bo, Luo Junzhou, et al. Phishing web page detection algorithm based on nested EMD [J]. Chinese Journal of Computers, 2009, 32(5): 922-929.
[3] Lee is noisy, and Dong Liu is (natural by Phishing detection method [J] Tsinghua University journal of the vision based on similar Scientific version), 2009,49 (1): 146-148.
[4] Zhang H., Liu G., Chow T.W.S., et al. Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach [J]. IEEE Transactions on Neural Networks, 2011, 22(10): 1532-1546.
[5] Jin Qing, Wu Guoxin, Li Dan, et al. Filtering phishing with a rule-based Bayesian algorithm [R]. Hunan: 5th China Conference on Information and Communications Security, 2007: 372-378.
[6] Zhuang Weiwei, Ye Yanfang, Li Tao, et al. Intelligent phishing website detection system based on classification ensemble [J]. Systems Engineering - Theory & Practice, 2011, 31(10): 2008-2020.
[7] Guo Minzhe, Yuan Jinsheng, Wang Yachao, et al. Phishing Web page detection algorithm [J]. Computer Engineering, 2008, 34(20): 161-163.
[8] Xu Long. Research on phishing website detection technology based on deep learning [D]. University of Electronic Science and Technology of China, 2017.
[9] Yin Shuijun, Liu Jiayong, Liu Liang. Research on cross-site phishing attacks against Web-mail mailboxes [J]. Communications Technology, 2010, 43(8): 164-166.
[10] Huang H.J., Wang Y.J., Xie L.L., et al. An Active Anti-Phishing Solution Based on Semi-fragile Watermark [J]. Information Technology Journal, 2013, 12(1): 198-203.
[11] Huang H.J., Qian L., Wang Y.J. A SVM-Based Technique to Detect Phishing URLs [J]. Information Technology Journal, 2012, 11(7): 921-925.
[12] Zhang H., Liu G., Chow T.W.S., et al. Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach [J]. IEEE Transactions on Neural Networks, 2011, 22(10): 1532-1546.
[13] Sheng S., Wardman B., Warner G., et al. An Empirical Analysis of Phishing Blacklists [C]. In: Proc. of the Sixth Conference on Email and Anti-Spam, 2009: 1-10.
Summary of the invention
To address the problems of the existing methods above, the invention discloses a phishing website detection method and system based on an adaptive heterogeneous multi-classification model, for detecting phishing websites in real time with high accuracy and stability.
The invention discloses a phishing website detection method based on an adaptive heterogeneous multi-classification (AHMC) model. The method comprises learning of the adaptive heterogeneous multi-classification model and detection of phishing websites; the specific steps are:
Step 1: For a phishing website sample set $D$ of a given category, with $|D| = n$, divide the samples into training and test sets using leave-one-out cross-validation. Let the j-th training set be denoted $D_j$ and the corresponding j-th test set be denoted $D/D_j$, where j is a positive integer. Each sample contains a sample record and a sample label; the sample record contains the URL and web page information of the website, and the sample label marks whether the website is a phishing website.
Step 2: Construct the adaptive heterogeneous multi-classification model H by linear addition, as follows:
$$H(x) = \sum_{i=1}^{T} \omega_i h_i(x) + \varepsilon$$
where $T$ is the number of base classification algorithms, $h_i$ is the i-th base classification algorithm, $\omega_i$ is the weight parameter of the i-th base classification algorithm, $\varepsilon$ is the adjustment factor, and $x$ denotes a sample record.
Step 3: The input of the multi-classification model H is the input of each base classification algorithm, and the output is the sample label. For training set $D_j$, the features corresponding to each base classification algorithm are extracted from the sample record of each sample as input. The base classification algorithms are characterized as linear functions, and the parameters of the classification algorithms are independent and identically distributed.
Step 4: Based on training set $D_j$, use a machine learning algorithm to train and solve for the parameters of each base classification algorithm and the parameters $\omega_i$ and $\varepsilon$ of the multi-classification model H. During training, each base classification algorithm takes the corresponding features extracted from the sample record as input, and priority is given to ensuring that the output of the multi-classification model is the sample label.
Step 5: Test and optimize the multi-classification model H on the test set $D/D_j$ until the parameters of each base classification algorithm and the parameters $\omega_i$ and $\varepsilon$ of H converge, then terminate the machine learning algorithm for H.
Step 6: From the finally obtained parameters of each base classification algorithm and the parameters $\omega_i$ and $\varepsilon$ of the multi-classification model H, obtain the detection model H' for this class of phishing website.
Step 7: Obtain the record of the website to be detected, including its URL and web page information, and input it into the detection model H' to judge whether it is a phishing website.
The invention also discloses a phishing website detection system based on the adaptive heterogeneous multi-classification model, comprising a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training dataset management module, and a detection and alarm module. During system operation the modules function as follows:
The domain name morpheme feature classifier performs feature extraction and training on the domain name string of the input website URL;
The topic index feature classifier performs feature extraction and training on the content of the web page tags <title> and <meta> and the footer of the input website;
The content similarity feature classifier performs feature extraction and semantic abstraction on the information in the web page content of the input website, and trains on the resulting features;
The structural style feature classifier performs feature extraction and training on the web page source code structure of the input website;
The visual rule feature classifier extracts and learns the salient visual features of the web pages of the input website;
The linear addition training module linearly combines the domain name morpheme feature classifier, topic index feature classifier, content similarity feature classifier, structural style feature classifier and visual rule feature classifier in the following form:
$$H(x) = \sum_{i=1}^{5} \omega_i h_i(x) + \varepsilon$$
where $H(x)$ denotes the multi-classification model combining the five classifiers, $h_1$ to $h_5$ are the classification functions of the five classifiers, $\omega_i$ is the weight parameter of the i-th classifier, $\varepsilon$ is the adjustment factor, and $x$ denotes a sample record;
The linear addition training module uses the training and test sets to train and optimize the parameters of the five classifiers together with $\omega_i$ and $\varepsilon$;
The integrated classifier is the final model output by the linear addition training module; it constitutes the phishing website detection model and dynamically maintains the weight of each classifier;
The training dataset management module stores the training dataset, manages the labels and grouping of training samples, divides and maintains the training and test sets, and manages the balance of sample sampling across groups;
The detection and alarm module detects the website under test according to the constructed phishing website detection model and raises an alarm when a phishing website is detected.
Compared with the prior art, the present invention has the following clear advantages:
(1) The method and system use ensemble learning: multiple weak classifiers are linearly combined to obtain generalization performance significantly better than that of a single classifier, improving the accuracy and stability of phishing website detection.
(2) The method and system use adaptive weight learning: the weights are learned from samples rather than relying on prior experience or the advice of domain experts, and when sample features and distributions change, the weights can be updated automatically during training.
(3) The five weak classifiers in the system are heterogeneous. Each single classifier has a certain accuracy and the classifiers have no strong dependence on one another, which ensures that the integrated model has higher accuracy. The overall overhead of the five weak classifiers is no larger than that of a single complex learning model.
(4) The method can detect phishing websites in real time with high accuracy, recall and robustness.
(5) The system can handle high-performance real-time detection of massive numbers of URLs and is practical for online engineering systems, with high availability and stability.
Description of the drawings
Fig. 1 is a flow diagram of the phishing website detection method of the invention based on the adaptive heterogeneous multi-classification model;
Fig. 2 is a schematic diagram of the module composition of the phishing website detection system of the invention based on the adaptive heterogeneous multi-classification model;
Fig. 3 is a schematic diagram of the linear addition training module performing data training in the phishing website detection system of the invention;
Fig. 4 is a deployment diagram of the phishing website detection system of the invention based on the adaptive heterogeneous multi-classification model.
Detailed description of embodiments
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and embodiments. The examples serve only to explain the invention and are not intended to limit its scope.
As shown in Fig. 1, the invention provides a phishing website detection method based on an adaptive heterogeneous multi-classification (AHMC) model. The method comprises learning of the adaptive heterogeneous multi-classification model and detection of phishing websites; each implementation step is explained below.
Step 1: Select phishing websites of the same category, for example phishing websites that all counterfeit banks, as the sample set $D$, with $|D| = n$, where n is the number of samples in D. Divide the samples into training and test sets using leave-one-out cross-validation.
The j-th training sample set is: $D_j = \{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$ $(1 \le j \le n,\ 1 < m < n)$;
The corresponding j-th test sample set is: $D/D_j$
Here each sample contains a record x and a label y; for example, $(x_1,y_1)$ in the sample set denotes one concrete sample instance, where $x_1$ is the sample record and $y_1$ is the sample label. The sample record contains the URL of the website and the corresponding web page information, and the sample label marks whether the website is a phishing website. $D/D_j$ denotes the set D with $D_j$ removed.
In this step the sample size n should be as large as possible; it is recommended that n be no less than 100.
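The split described in step 1 can be sketched as follows. This is a minimal illustration, not the patent's procedure: the function name `make_splits`, the random shuffling, and the values of m and K are assumptions; the text only fixes that each training set $D_j$ has m samples with 1 < m < n and that the test set is the remainder of D.

```python
import random

def make_splits(samples, m, k, seed=0):
    """Return k (train, test) pairs: each train set has m samples and the
    matching test set is the rest of the sample set (D with D_j removed)."""
    if not 1 < m < len(samples):
        raise ValueError("need 1 < m < n")
    rng = random.Random(seed)
    splits = []
    for _ in range(k):
        idx = list(range(len(samples)))
        rng.shuffle(idx)  # assumption: groups are drawn at random
        train = [samples[i] for i in idx[:m]]
        test = [samples[i] for i in idx[m:]]
        splits.append((train, test))
    return splits

# Toy sample set of (record, label) pairs; label 1 = phishing, 0 = legitimate.
D = [({"url": f"http://site{i}.example"}, i % 2) for i in range(100)]
splits = make_splits(D, m=80, k=5)
```

With n = 100 (the recommended minimum) and m = 80, each of the five groups holds 80 training samples and 20 test samples.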
Step 2: Construct the adaptive heterogeneous multi-classification model H, as follows:
$$H(x) = \sum_{i=1}^{T} \omega_i h_i(x) + \varepsilon$$
where $T$ is the number of base classification algorithms, $h_i$ is the i-th base classification algorithm, $\omega_i$ is the weight parameter of the i-th base classification algorithm, and $\varepsilon$ is the adjustment factor.
In the following, a base classification algorithm is also called a classification algorithm or learning algorithm, and the corresponding classifier is also called a learner. The base classification algorithms are heterogeneous, which ensures algorithmic diversity. The embodiment uses five fixed classification algorithms: $h_1$ is the domain name morpheme feature classification algorithm, which judges whether a site is a phishing website from its domain name morphemes; $h_2$ is the topic index feature classification algorithm, which judges from the content under the topic tags of the web page; $h_3$ is the content similarity feature classification algorithm, which judges by comparing the similarity of the content under the content tags of the web page; $h_4$ is the structural style feature classification algorithm, which judges from the structure of the source code; $h_5$ is the visual rule feature classification algorithm, which judges from the page's icons, color scheme, pictures and so on. In practice the set can be extended as needed, following the principle of heterogeneity.
Step 3: Take the sample record of each sample in training set $D_j$ as the input of the base classification algorithms and the sample label as the output, and train the multi-classification model H. For each sample $(x_i, y_i)$, $x_i$ is the input of each classification algorithm $h_1$ to $h_5$: each base classification algorithm extracts the features it needs from $x_i$, and $y_i$ is the corresponding output, so that the features and parameters of each classification algorithm are trained as follows:
$h_1(x_i) \to y_i,\ h_2(x_i) \to y_i,\ h_3(x_i) \to y_i,\ h_4(x_i) \to y_i,\ h_5(x_i) \to y_i$
The multi-classification model H in the embodiment is expressed as follows:
$$y_i = H(x_i) = \omega_1 h_1(x_i) + \omega_2 h_2(x_i) + \omega_3 h_3(x_i) + \omega_4 h_4(x_i) + \omega_5 h_5(x_i) + \varepsilon$$
When the input is sample record $x_i$, the output is the corresponding sample label $y_i$; the right-hand side of the formula is a linear weighted combination of the classification functions. Training the multi-classification model yields the weight parameters $\omega_1$ to $\omega_5$ and the adjustment factor $\varepsilon$.
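The linear weighted combination can be illustrated with a short sketch. The two toy base classifiers, the weights and the adjustment value below are hypothetical stand-ins chosen for illustration, and treating the adjustment factor as an additive constant is an assumption about the model form.

```python
from typing import Callable, Dict, Sequence

Record = Dict[str, str]
BaseClassifier = Callable[[Record], float]

def build_model(classifiers: Sequence[BaseClassifier],
                weights: Sequence[float],
                adjustment: float) -> BaseClassifier:
    """Return H(x) = sum_i w_i * h_i(x) + adjustment."""
    if len(classifiers) != len(weights):
        raise ValueError("one weight per base classifier is required")

    def H(x: Record) -> float:
        return sum(w * h(x) for w, h in zip(weights, classifiers)) + adjustment

    return H

# Hypothetical stand-ins for two of the five base classifiers.
def h_domain(x: Record) -> float:
    return 1.0 if x["url"].endswith(".top") else 0.0

def h_topic(x: Record) -> float:
    return 1.0 if "bank" in x["title"].lower() else 0.0

H = build_model([h_domain, h_topic], weights=[0.5, 0.25], adjustment=-0.25)
score = H({"url": "http://icbc-login.top", "title": "Welcome"})  # 0.5 + 0.0 - 0.25
```

Each base classifier receives the whole sample record and picks out its own features, which mirrors the statement that the model input is the input of every base classification algorithm.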
During training, the method extracts the corresponding features from the input sample record $x_i$ as the input of each classification algorithm $h_1$ to $h_5$ and gives priority to ensuring that the output of the multi-classification model is the sample label $y_i$; the output of each classification algorithm can also be set to $y_i$. The parameters of each classification algorithm, together with the weight parameters and adjustment factor of the multi-classification model, are then trained. The base classification algorithms in the invention are characterized as linear functions and their parameters are independent and identically distributed, so the overall training overhead is no larger than that of a single complex learning model. Different base classification algorithms may have different input features and need to select the applicable features from the sample record. For example, the input features of the domain name morpheme feature classification algorithm $h_1$ include the top-level domain, the second-level domain, and so on.
Step 4: Use a machine learning algorithm to train and solve for the parameters of each classification algorithm together with the weight parameters $\omega_i$ and the adjustment factor $\varepsilon$. For example, the parameters of each classification algorithm can be solved by maximum likelihood estimation, and the parameters of the integrated model H can be solved iteratively with the EM (expectation-maximization) algorithm; the constraint can be formalized as minimizing a loss function. The solution process can be realized within a unified computational framework, namely maximum likelihood parameter estimation, which during computer execution is carried out uniformly as matrix solving.
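A minimal sketch of joint training under stated assumptions: each base classifier is a linear function $h_i(x) = \theta_i \cdot f_i(x)$ of its own feature vector, the label probability is a sigmoid of H, and the likelihood is maximized by plain stochastic gradient descent on the logistic loss. Gradient descent is swapped in here for the EM iteration and matrix solving mentioned in the text; the function names, learning rate and toy data are all hypothetical.

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def score(thetas, omegas, eps, feats):
    """H(x) = sum_i omega_i * (theta_i . f_i(x)) + eps."""
    return sum(w * sum(t * f for t, f in zip(theta, fv))
               for w, theta, fv in zip(omegas, thetas, feats)) + eps

def train(data, dims, lr=0.1, epochs=500):
    """Jointly fit base-classifier parameters, weights and adjustment factor."""
    thetas = [[0.0] * d for d in dims]
    omegas = [1.0] * len(dims)
    eps = 0.0
    for _ in range(epochs):
        for feats, y in data:
            # Gradient of the logistic loss with respect to the score.
            err = sigmoid(score(thetas, omegas, eps, feats)) - y
            for i, fv in enumerate(feats):
                h_i = sum(t * f for t, f in zip(thetas[i], fv))
                w_i = omegas[i]
                omegas[i] -= lr * err * h_i
                for k, f in enumerate(fv):
                    thetas[i][k] -= lr * err * w_i * f
            eps -= lr * err
    return thetas, omegas, eps

# Toy data: per-classifier feature vectors and a label (1 = phishing).
data = [
    ([[1.0, 0.0], [1.0]], 1),
    ([[1.0, 1.0], [0.0]], 1),
    ([[0.0, 0.0], [0.0]], 0),
    ([[0.0, 1.0], [0.0]], 0),
]
thetas, omegas, eps = train(data, dims=[2, 1])
```

All parameters are updated in the same loop, which reflects the single-stage training the text contrasts with classical two-stage ensemble learning.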
Step 5: Test and optimize model H on the test sample set $D/D_j$. Iterate over all test and training samples until the parameters and the adjustment factor $\varepsilon$ converge to a stable threshold, at which point the model's learning algorithm terminates.
This step has two purposes, testing and optimization. When conflicts arise on the test samples, or the adjustment factor $\varepsilon$ fails to converge, the samples are corrected: the training samples are reclassified and handled individually, the sample labels are corrected, and the training set is updated. Model H is then retrained once to adjust the parameters of the classification algorithms, that is, the training process of step 4 is re-executed, thereby optimizing the parameters and the adjustment factor.
The method obtains the training and test sets by the leave-one-out method. If K groups of training and test sets are obtained in total, steps 3 to 5 are executed for each group, finally yielding multiple groups of base classification algorithm parameters and adjustment factors $\varepsilon$. These classification algorithm parameters and adjustment factors can then be combined by averaging to give the final result.
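The averaging over the K groups might look like the sketch below. Storing each group's result as a flat dictionary of named parameters is an assumption made for illustration, as are the parameter names and values.

```python
def average_parameters(runs):
    """Element-wise mean of K parameter dictionaries that share the same keys."""
    if not runs:
        raise ValueError("at least one run is required")
    keys = set(runs[0])
    if any(set(r) != keys for r in runs):
        raise ValueError("all runs must contain the same parameters")
    k = len(runs)
    return {name: sum(r[name] for r in runs) / k for name in runs[0]}

# Hypothetical results from K = 3 groups.
runs = [
    {"omega_1": 0.5, "epsilon": 0.0},
    {"omega_1": 1.0, "epsilon": 0.25},
    {"omega_1": 1.5, "epsilon": 0.5},
]
final = average_parameters(runs)
```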
Step 6: From the classification algorithm parameters and the $\omega_i$ and $\varepsilon$ obtained in step 5, obtain the adaptive heterogeneous multi-classification model H' for this class of phishing website.
The H obtained by optimization in step 5 is a model instance; its parameters are migrated to initialize the phishing website detection algorithm H'. Models H' and H are isomorphic, and in the embodiment H' is a mixed model integrating $h_1$ to $h_5$.
Step 7: For a website to be detected, obtain its record, including the website URL and web page information such as the page source code, and input it into the detection model H' to obtain information such as whether the URL is a phishing website and which target it counterfeits. The input web page information does not need to be formatted; the features used by each classifier are all obtained automatically from the page source code structure.
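The detection step can be sketched as a threshold test on the model output. The threshold value, the alert callback and the stand-in model below are assumptions; the text does not state how the output of H' is mapped to a decision.

```python
def detect(record, model, threshold=0.0, alert=print):
    """Score a website record with the detection model and alert on phishing."""
    score = model(record)
    is_phishing = score > threshold  # assumption: a higher score means phishing
    if is_phishing:
        alert(f"phishing suspected: {record['url']} (score={score:.2f})")
    return is_phishing

# Hypothetical stand-in for a trained detection model H'.
def H_prime(record):
    return 0.8 if "icbc" in record["url"] else -0.3

alerts = []
flagged = detect({"url": "http://icbc-secure.top"}, H_prime, alert=alerts.append)
clean = detect({"url": "http://example.org"}, H_prime, alert=alerts.append)
```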
In this step, crawler technology can be used to obtain the site information and source code data corresponding to the URL. When new features or variants appear, only the corresponding base classification algorithm and its features need to be updated; the effect on the weight parameters and the adjustment factor $\varepsilon$ is small.
The invention adopts the idea of ensemble learning. The main difference from classical ensemble learning is that classical ensemble learning has two stages: the first stage trains each base classifier, and the second stage takes the outputs of the first stage as input to train the parameters that combine the base classifiers. The present invention instead trains everything together within a unified computational framework, with no two-stage division.
The invention discloses a phishing website detection system based on the adaptive heterogeneous multi-classification model, consisting mainly of nine parts: a domain name morpheme feature classifier, a topic index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training dataset management module, and a detection and alarm module. As shown in Fig. 2, the function of each module during system operation is described below.
Domain name morpheme feature classifier: the features of this classifier are statistical features of the domain name portion of phishing website URL strings. The classifier performs feature extraction and training on the domain name string of the input website URL. Its functions include but are not limited to: 1) judging how suspicious the top-level domain is; 2) extracting the morpheme information contained in the second-level domain; 3) obtaining the hierarchical structure of the domain name and the length of the subdomains; 4) building and improving the morpheme feature library.
In the domain name morpheme feature classifier: 1) the suspiciousness of a top-level domain comes from statistical experience; top-level domains such as pw, win, top and xyz usually have a higher probability of hosting phishing websites; 2) morpheme information in the second-level domain means that the second-level domain string contains the abbreviation of certain banks, such as 95588, 95533, cmb, icbc or boc; 3) third-level or fourth-level domains may contain a hyphenated short form of a bank address, such as www-bankofbeijing-com-cn.
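The domain name features listed above could be extracted along the following lines. The top-level domains and bank morphemes come from the text; the function name, the returned field names, the example URL, and the naive label split (which ignores multi-part suffixes such as com.cn) are assumptions of this sketch.

```python
from urllib.parse import urlsplit

SUSPICIOUS_TLDS = {"pw", "win", "top", "xyz"}              # examples from the text
BANK_MORPHEMES = ("95588", "95533", "cmb", "icbc", "boc")  # examples from the text

def domain_features(url):
    """Extract simple morpheme and hierarchy features from a URL's host name."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    sld = labels[-2] if len(labels) >= 2 else ""
    subdomains = labels[:-2]
    return {
        "tld_suspicious": tld in SUSPICIOUS_TLDS,
        "sld_morphemes": [m for m in BANK_MORPHEMES if m in sld],
        "depth": len(labels),
        "subdomain_length": sum(len(s) for s in subdomains),
        "hyphenated_subdomain": any("-" in s for s in subdomains),
    }

features = domain_features("http://www-bankofbeijing-com-cn.icbc-login.top/index")
```

A production system would more likely resolve the registrable domain against a public suffix list rather than splitting on dots.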
The topic index feature classifier mainly performs feature extraction and training on the content of the web page tags <title> and <meta> and the footer of the input website. Its functions include but are not limited to: 1) extracting the features in the tags and performing conflict resolution and type classification of the features; 2) building and improving the topic index feature library. The advantage of the topic index feature classifier is fast and accurate localization; its disadvantages are weak generalization and a high false positive rate. The content of the <title> tag may differ little from that of normal, non-counterfeit websites, or may be unrelated to the content of the page body. The classifier therefore needs to work together with a whitelist library for classification.
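A minimal sketch of pulling the <title> and <meta> content out of page source with Python's standard `html.parser`. The class name `TopicIndexParser` is hypothetical, the footer is left out because the text does not say how it is located, and the conflict resolution and whitelist steps are not shown.

```python
from html.parser import HTMLParser

class TopicIndexParser(HTMLParser):
    """Collect the <title> text and named <meta> contents of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()

page = ('<html><head><title>ICBC Online Banking</title>'
        '<meta name="Keywords" content="bank, login"></head><body></body></html>')
parser = TopicIndexParser()
parser.feed(page)
```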
Content similarities feature classifiers: feature extraction and semanteme are carried out mainly for information in the short essay in web page contents It is abstract.Including but not limited to 1) extract<body>text in label, wherein<a>,<p>,<div>,<span>,<td>,< Table>, the content-length in the labels such as<form>extracts no more than 15 characters, and the content in text is mentioned according to 2-8 character It takes;2) vectorization and standardization are carried out to text feature;3) word is embedded in, and word amount is mapped as low-dimensional spy using Word2Vec tool Levy vector;4) word feature vector library is constructed.
The detection performance of the content similarity feature classifier is stable, and its accuracy and recall are better than those of the other classifiers. Here, vectorization refers to deduplicating and filtering the short texts; normalization refers to deleting specific time words, frequently changing numbers, noise words that occur too often, advertisements with no discriminative power, third-party link words, and the like.
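The short-text extraction and the deduplication and noise filtering described above might look like the sketch below. It covers only the 15-character rule for tag content; the 2 to 8 character body-text rule and the Word2Vec embedding are omitted. All names, the noise-word parameter, and the regular expression for dates and counters are assumptions.

```python
import re
from html.parser import HTMLParser

SHORT_TEXT_TAGS = {"a", "p", "div", "span", "td", "table", "form"}
MAX_LEN = 15  # from the description: tag content of at most 15 characters


class ShortTextParser(HTMLParser):
    """Collect short text nodes found inside the tags of interest."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside SHORT_TEXT_TAGS
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in SHORT_TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SHORT_TEXT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and 0 < len(text) <= MAX_LEN:
            self.texts.append(text)


def normalize(texts, noise_words=frozenset()):
    """Deduplicate short texts and drop dates, counters and listed noise words."""
    seen, kept = set(), []
    for text in texts:
        if text in seen or text in noise_words:
            continue
        if re.fullmatch(r"[\d\s:/.-]+", text):  # pure dates, times, numbers
            continue
        seen.add(text)
        kept.append(text)
    return kept


def extract_short_texts(html: str, noise_words=frozenset()):
    parser = ShortTextParser()
    parser.feed(html)
    return normalize(parser.texts, noise_words)
```

The depth counter assumes well-formed markup; pages with unclosed tags would need a more forgiving parser.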
The structural style feature classifier performs feature extraction and training on the web page source code structure of the input website. Its main functions are: 1) source code analysis of JS script code, CSS styles, form elements, and the DOM structure; 2) analysis of homologous code structures in the source code and extraction of common code fragments; 3) construction of a homologous-code similarity matrix.
The structural style feature classifier can predict new phishing websites developed by the same organization behind the scenes. The common code fragments include, but are not limited to: 1) identical function names; 2) consistent CSS color schemes; 3) identical JS scripts; 4) identical selection lists and <form> forms; 5) identical redirect hyperlinks and external page links.
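One way to picture the homologous-code similarity matrix is a pairwise comparison of the common fragments found in each page. The sketch below uses only two of the fragment types listed above (function names and form field names) and Jaccard similarity; the patent does not name a similarity measure, so that choice, the regular expressions, and all function names are assumptions.

```python
import re


def code_fragments(source: str) -> set:
    """Collect coarse structural tokens: JS function names and form field names."""
    functions = re.findall(r"function\s+([A-Za-z_$][\w$]*)", source)
    fields = re.findall(
        r"<input[^>]*\bname=[\"']([^\"']+)[\"']", source, flags=re.I
    )
    return {f"fn:{name}" for name in functions} | {f"field:{name}" for name in fields}


def jaccard(a: set, b: set) -> float:
    """Shared fragments divided by all fragments (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(pages: list) -> list:
    """Pairwise fragment similarity between page sources."""
    frags = [code_fragments(page) for page in pages]
    return [[jaccard(x, y) for y in frags] for x in frags]


page_a = ('<form><input name="card"><input name="pwd"></form>'
          '<script>function checkForm(){} function send(){}</script>')
page_b = ('<form><input name="card"><input name="pwd"></form>'
          '<script>function checkForm(){}</script>')
page_c = '<script>function unrelated(){}</script>'
matrix = similarity_matrix([page_a, page_b, page_c])
```

Pages built from the same phishing kit would be expected to score high against each other, which is the basis for the predictive ability claimed for this classifier.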
The visual rule feature classifier mainly extracts and learns the salient visual features of the web pages of the input website. The extracted features include, but are not limited to: 1) the logo icon of the target website; 2) the overall color scheme and frame composition of the website; 3) picture modules with distinctive features.
The disadvantage of the visual rule feature classifier is that both learning and detection of visual features take a long time, and the same logo may cause large errors because of pixel-level differences, so the requirements on training sample quality are stricter. The visual feature library contains no fewer than 30000 entries.
The linear additive training module learns weight parameters and an adjustment factor in order to linearly combine and train the heterogeneous base classifiers (the domain-name morpheme feature classifier, subject index feature classifier, content similarity feature classifier, structural style feature classifier, and visual rule feature classifier), yielding stable weight parameters and a stable adjustment factor.
In the linear additive training module, the training of the linear combination depends on the quality of the training samples; the weight parameters and the adjustment factor are computed automatically by the built-in algorithm. The module uses the training set and the test set to train and optimize the classification parameters of the five classifiers and the weights ω_i. During training, the sample records of the training data are fed in parallel into the five base classifiers, and the output of the multi-classification model combining the five classifiers is the label of the corresponding sample, as shown in Fig. 3.
The integrated classifier is the final model output by the linear additive training module. Its functions include, but are not limited to: 1) building the phishing website detection model; 2) matching impersonated targets to class labels; 3) dynamically maintaining classifier weights and iteratively integrating the feature libraries. The feature libraries here are those of the classifiers in use, such as the feature library of the domain-name morpheme feature classifier.
In the integrated classifier, not every base classifier takes part in the final detection: if a weight parameter is 0, the corresponding base classifier is not enabled. The integrated model also takes classifier performance into account; for example, the visual rule feature classifier is not used in coarse classification because it takes longer to run.
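The behaviour described above (a weighted linear combination in which zero-weight classifiers are skipped and slow classifiers are left out of coarse classification) can be sketched as follows. The exact combination formula is not reproduced in this text, so treating the adjustment factor as an additive term, the decision threshold, the score range, and every name in the code are assumptions.

```python
from typing import Callable, Sequence

# Assumed interface: a base classifier maps a sample record to a score in [0, 1].
BaseClassifier = Callable[[dict], float]


def make_integrated_classifier(
    classifiers: Sequence[BaseClassifier],
    weights: Sequence[float],
    adjustment: float = 0.0,      # assumed to enter the sum additively
    threshold: float = 0.5,       # hypothetical decision threshold
    skip_in_coarse: Sequence[int] = (),
):
    """Build a detector from weighted base classifiers."""
    if len(classifiers) != len(weights):
        raise ValueError("one weight per base classifier is required")

    def classify(record: dict, coarse: bool = False) -> bool:
        score = adjustment
        for index, (h, w) in enumerate(zip(classifiers, weights)):
            if w == 0:
                continue  # weight 0: this base classifier is not enabled
            if coarse and index in skip_in_coarse:
                continue  # e.g. the slow visual rule classifier
            score += w * h(record)
        return score >= threshold

    return classify


# Usage with three stand-in base classifiers; the third is disabled by weight 0.
detector = make_integrated_classifier(
    [lambda r: 1.0, lambda r: 0.0, lambda r: 1.0],
    [0.6, 0.4, 0.0],
)
is_phishing = detector({"url": "http://icbc-login.xyz"})
```

Skipping disabled classifiers before calling them, rather than multiplying their output by zero, avoids paying the cost of the expensive ones.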
The training dataset management module stores the training data. The training dataset mainly consists of data samples such as phishing website URLs, source code, and site information. The functions of the module include: 1) label management and grouping of the training sample data; 2) division and maintenance of the test set and training set; 3) management of the sampling balance across groups.
For the final result, the quality of the training sample data is as important as the quality of the classifiers. Sample management is therefore handled in a separate module, whose focus is managing the distribution of samples across categories to prevent imbalanced samples.
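A minimal sketch of the grouping, splitting, and balance checks described above is given below. It uses a plain per-label hold-out split rather than the leave-one-out scheme of the claims; the function names, the 20% test ratio, and the imbalance measure are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_ratio=0.2, seed=0):
    """Split (record, label) pairs so each label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for record, label in samples:
        by_label[label].append((record, label))

    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_ratio))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def imbalance_ratio(samples):
    """Largest class size divided by smallest class size (1.0 means balanced)."""
    counts = defaultdict(int)
    for _, label in samples:
        counts[label] += 1
    return max(counts.values()) / min(counts.values())


samples = [(f"bank-{i}", "bank") for i in range(10)] + \
          [(f"lottery-{i}", "lottery") for i in range(5)]
train, test = stratified_split(samples)
ratio = imbalance_ratio(samples)
```

A management module could reject or resample a group whose ratio exceeds a configured limit before training starts.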
Detection and alert module: this module has two functions: 1) phishing website detection, where the model consists of the parameters and feature libraries of the integrated classifier; 2) alerting on detected phishing websites, where the alert information and level can be customized by the user.
In this module, alert information is graded by importance, mainly according to how much attention the user pays to different impersonated targets. For example, if a page shows lottery information on its first screen but counterfeit Bank of China information on its second screen, and the user's attention level for banks is higher than for lottery, the page is preferentially classified and alerted as a bank case.
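The lottery and bank example can be expressed as a small priority lookup. The `ATTENTION_LEVEL` table, its numeric levels, the "payment" category, and the function name are hypothetical; the description only states that the user configures how much each impersonated target matters.

```python
# Hypothetical user configuration: a higher number means a higher attention level.
ATTENTION_LEVEL = {"bank": 3, "payment": 2, "lottery": 1}


def alert_category(detected_targets):
    """Pick the impersonated category the user cares about most.

    `detected_targets` lists categories in page order (first screen first).
    Unknown categories get level 0.
    """
    if not detected_targets:
        return None
    # max() returns the first maximal element, so page order breaks ties.
    return max(detected_targets, key=lambda c: ATTENTION_LEVEL.get(c, 0))


# Lottery content on the first screen, counterfeit bank content on the second.
category = alert_category(["lottery", "bank"])
```

Here the page is alerted under the bank category even though the lottery content appears first.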
The five base classifiers in the system of the present invention are independent of one another and uncorrelated. The linear additive training module is the core of the system: all parameters are trained in this module, and the linear model ensures the performance of the system and the convergence of the computation. The integrated classifier decides, from the parameters output by the linear combination, which base classifiers in the base classifier module to combine; not all five classifiers necessarily take part in the integration, and two or three may be enough.
Fig. 4 shows a deployment diagram of the system of the present invention. The five base classifiers form a base learner server group, while the linear additive training module, the integrated classifier, the training dataset management module, and the detection and alert module are deployed in a distributed manner at a network switch.

Claims (8)

1. A phishing website detection method based on an adaptive heterogeneous multi-classification model, characterized in that the method comprises:
Step 1: for a phishing website sample set D of a single category, divide training sets and test sets using leave-one-out cross-validation; the j-th training set is denoted D_j and has a corresponding j-th test set; each sample comprises a sample record and a sample label; the sample record contains the URL and web page information of a website, and the sample label marks whether the website is a phishing website; j is a positive integer;
Step 2: construct the adaptive heterogeneous multi-classification model H by linear addition [formula not reproduced in the source text], where T is the number of base classification algorithms, h_i is the i-th base classification algorithm, ω_i is the weight parameter of the i-th base classification algorithm, the remaining parameter is the adjustment factor, and x denotes a sample record;
Step 3: the input of the multi-classification model H is the input of each base classification algorithm, and its output is the sample label; for training set D_j, the features corresponding to each base classification algorithm are extracted from the sample record of each sample as input; each base classification algorithm is represented as a linear function, and the parameters of the classification algorithms are independent and identically distributed;
Step 4: based on training set D_j, use a machine learning algorithm to train and solve for the parameters of each base classification algorithm and the parameters of the multi-classification model H;
Step 5: test and optimize the multi-classification model H on the corresponding test set until the parameters of each base classification algorithm and the parameters of the multi-classification model H converge, then terminate the machine learning algorithm for H;
Step 6: obtain the detection model H′ for this category of phishing website from the finally obtained parameters of each base classification algorithm and the parameters of the multi-classification model H;
Step 7: obtain the record of a website to be detected, including its URL and web page information, and input it into the detection model H′ to determine whether it is a phishing website.
2. The method according to claim 1, characterized in that the size of the sample set D is not less than 100.
3. The method according to claim 1 or 2, characterized in that, in step 1, the training set and the test set are expressed as follows:
the j-th training set D_j = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, 1 ≤ j ≤ n, 1 < m < n;
the corresponding j-th test set is D/D_j;
where n is the number of samples in D, m is the number of samples in D_j, and D/D_j denotes removing D_j from the set D; the i-th sample (x_i, y_i) comprises the i-th sample record x_i and its label y_i.
4. The method according to claim 1, characterized in that, in step 4, the parameters of each base classification algorithm are solved by maximum likelihood estimation, and the parameters of the multi-classification model H are solved iteratively with the EM algorithm.
5. The method according to claim 1, characterized in that, in step 5, when the parameters of the multi-classification model H cannot converge, the sample labels are corrected, the training set samples are updated, and the training process of step 4 is executed again.
6. A phishing website detection system based on an adaptive heterogeneous multi-classification model, characterized in that it comprises a domain-name morpheme feature classifier, a subject index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear additive training module, an integrated classifier, a training dataset management module, and a detection and alert module;
the domain-name morpheme feature classifier performs feature extraction and training on the domain name string of the URL of the input website;
the subject index feature classifier performs feature extraction and training on the content of the web page tags <title> and <meta> and the footer of the input website;
the content similarity feature classifier performs feature extraction and semantic abstraction on the information in the web page content of the input website, and trains on the features;
the structural style feature classifier performs feature extraction and training on the web page source code structure of the input website;
the visual rule feature classifier extracts and learns the salient visual features of the web pages of the input website;
the linear additive training module linearly combines the domain-name morpheme feature classifier, subject index feature classifier, content similarity feature classifier, structural style feature classifier, and visual rule feature classifier; in the combination [formula not reproduced in the source text], H(x) denotes the multi-classification model combining the five classifiers, h_1 to h_5 are the classification functions of the five classifiers, ω_i is the weight parameter of the i-th classifier, the remaining parameter is the adjustment factor, and x denotes a sample record;
the linear additive training module uses the training set and the test set to train and optimize the parameters of the five classifiers and ω_i;
the integrated classifier is the final model output by the linear additive training module; it builds the phishing website detection model and dynamically maintains the weight of each classifier;
the training dataset management module stores the training dataset, performs label management and grouping of the training samples, divides and maintains the training set and the test set, and manages the sampling balance across groups;
the detection and alert module detects websites to be detected according to the constructed phishing website detection model, and raises an alert when a phishing website is detected.
7. The system according to claim 6, characterized in that, in the integrated classifier, a weight parameter of 0 for a classifier indicates that the classifier is not enabled.
8. The system according to claim 6, characterized in that the functions of the domain-name morpheme feature classifier include: judging the suspiciousness of the top-level domain; extracting the morpheme information contained in the second-level domain; obtaining the hierarchical structure of the domain name and the length of the subdomain; and building and refining the morpheme feature library;
wherein the suspiciousness of a phishing website is high when the top-level domain is pw, win, top, or xyz; morpheme information in the second-level domain means that the second-level domain string contains the abbreviation of a bank; and the domain-name morpheme feature classifier also extracts hyphen-joined imitations of bank addresses from third- or fourth-level domain labels.
CN201810549417.1A 2018-05-31 2018-05-31 Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model Active CN108965245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810549417.1A CN108965245B (en) 2018-05-31 2018-05-31 Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model

Publications (2)

Publication Number Publication Date
CN108965245A true CN108965245A (en) 2018-12-07
CN108965245B CN108965245B (en) 2021-04-13

Family

ID=64493105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810549417.1A Active CN108965245B (en) 2018-05-31 2018-05-31 Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model

Country Status (1)

Country Link
CN (1) CN108965245B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103379111A (en) * 2012-04-21 2013-10-30 中南林业科技大学 Intelligent anti-phishing defensive system
US20140359760A1 (en) * 2013-05-31 2014-12-04 Adi Labs, Inc. System and method for detecting phishing webpages
CN104217160A (en) * 2014-09-19 2014-12-17 中国科学院深圳先进技术研究院 Method and system for detecting Chinese phishing website
CN106453351A (en) * 2016-10-31 2017-02-22 重庆邮电大学 Financial fishing webpage detection method based on Web page characteristics
CN106789888A (en) * 2016-11-18 2017-05-31 重庆邮电大学 A kind of fishing webpage detection method of multiple features fusion
CN107181730A (en) * 2017-03-13 2017-09-19 烟台中科网络技术研究所 A kind of counterfeit website monitoring recognition methods and system
CN107911360A (en) * 2017-11-13 2018-04-13 哈尔滨工业大学(威海) One kind is hacked website detection method and system

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710925A (en) * 2018-12-12 2019-05-03 新华三大数据技术有限公司 Name entity recognition method and device
CN110162624A (en) * 2019-04-16 2019-08-23 腾讯科技(深圳)有限公司 A kind of text handling method, device and relevant device
CN110162624B (en) * 2019-04-16 2024-04-09 腾讯科技(深圳)有限公司 Text processing method and device and related equipment
CN113812130A (en) * 2019-05-14 2021-12-17 国际商业机器公司 Detection of phishing activities
WO2020230053A1 (en) * 2019-05-14 2020-11-19 International Business Machines Corporation Detection of phishing campaigns
GB2600028A (en) * 2019-05-14 2022-04-20 Ibm Detection of phishing campaigns
US11303674B2 (en) 2019-05-14 2022-04-12 International Business Machines Corporation Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications
GB2600028B (en) * 2019-05-14 2023-09-13 Crowdstrike Inc Detection of phishing campaigns
US11818170B2 (en) 2019-05-14 2023-11-14 Crowdstrike, Inc. Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications
CN110266647A (en) * 2019-05-22 2019-09-20 北京金睛云华科技有限公司 It is a kind of to order and control communication check method and system
EP3771171A4 (en) * 2019-05-29 2021-06-02 Wangsu Science & Technology Co., Ltd. Website detection method and system
CN110324316A (en) * 2019-05-31 2019-10-11 河南恩湃高科集团有限公司 A kind of industry control anomaly detection method based on a variety of machine learning algorithms
CN110324316B (en) * 2019-05-31 2022-04-22 河南九域恩湃电力技术有限公司 Industrial control abnormal behavior detection method based on multiple machine learning algorithms
CN110334262A (en) * 2019-06-06 2019-10-15 阿里巴巴集团控股有限公司 A kind of model training method, device and electronic equipment
CN110334262B (en) * 2019-06-06 2023-12-29 创新先进技术有限公司 Model training method and device and electronic equipment
CN110766165B (en) * 2019-10-23 2023-08-08 扬州大学 Online active machine learning method for malicious URL detection
CN110766165A (en) * 2019-10-23 2020-02-07 扬州大学 Online active machine learning method for malicious URL detection
CN110912910A (en) * 2019-11-29 2020-03-24 北京工业大学 DNS network data filtering method and device
CN111125699A (en) * 2019-12-04 2020-05-08 中南大学 Malicious program visual detection method based on deep learning
CN111125699B (en) * 2019-12-04 2023-04-18 中南大学 Malicious program visual detection method based on deep learning
CN111552649A (en) * 2020-05-18 2020-08-18 支付宝(杭州)信息技术有限公司 Packet testing method and device
CN111552649B (en) * 2020-05-18 2022-02-22 支付宝(杭州)信息技术有限公司 Packet testing method and device
CN111859451A (en) * 2020-07-23 2020-10-30 北京尚隐科技有限公司 Processing system of multi-source multi-modal data and method applying same
CN111859451B (en) * 2020-07-23 2024-02-06 北京尚隐科技有限公司 Multi-source multi-mode data processing system and method for applying same
CN112507333A (en) * 2020-12-01 2021-03-16 北京天融信网络安全技术有限公司 Website detection and model training method and device and electronic equipment
CN113051500B (en) * 2021-03-25 2022-08-16 武汉大学 Phishing website identification method and system fusing multi-source data
CN113051500A (en) * 2021-03-25 2021-06-29 武汉大学 Phishing website identification method and system fusing multi-source data
CN113438209A (en) * 2021-06-04 2021-09-24 中国计量大学 Phishing website detection method based on improved Stacking strategy
CN114124564A (en) * 2021-12-03 2022-03-01 北京天融信网络安全技术有限公司 Counterfeit website detection method and device, electronic equipment and storage medium
CN114124564B (en) * 2021-12-03 2023-11-28 北京天融信网络安全技术有限公司 Method and device for detecting counterfeit website, electronic equipment and storage medium
CN114363019A (en) * 2021-12-20 2022-04-15 北京华云安信息技术有限公司 Method, device and equipment for training phishing website detection model and storage medium
CN114363019B (en) * 2021-12-20 2024-04-16 北京华云安信息技术有限公司 Training method, device, equipment and storage medium for phishing website detection model
CN114499980A (en) * 2021-12-28 2022-05-13 杭州安恒信息技术股份有限公司 Phishing mail detection method, device, equipment and storage medium
CN114070653A (en) * 2022-01-14 2022-02-18 浙江大学 Hybrid phishing website detection method and device, electronic equipment and storage medium
CN114070653B (en) * 2022-01-14 2022-06-24 浙江大学 Hybrid phishing website detection method and device, electronic equipment and storage medium
CN114896348A (en) * 2022-05-11 2022-08-12 天津大学 Data exploration method and system
CN114896348B (en) * 2022-05-11 2024-06-04 天津大学 Visual data pattern recognition method and system
CN116028880A (en) * 2023-02-07 2023-04-28 支付宝(杭州)信息技术有限公司 Method for training behavior intention recognition model, behavior intention recognition method and device

Also Published As

Publication number Publication date
CN108965245B (en) 2021-04-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant