CN108965245A - Phishing website detection method and system based on an adaptive heterogeneous multi-classification model - Google Patents
- Publication number: CN108965245A
- Application number: CN201810549417.1A
- Authority: CN (China)
- Prior art keywords: sample, feature, training, website, classifiers
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
Abstract
The present invention provides a phishing website detection method and system based on an adaptive heterogeneous multi-classification model. The method builds the adaptive heterogeneous multi-classification model by linear addition of several base classification algorithms and trains it; the model's input is the input of each base classification algorithm and its output is the sample label, and each base classification algorithm extracts its own features from the sample record as input. The model parameters are solved with a machine learning algorithm, then tested and optimized on a test set, which finally yields a detection model for that category of phishing website. The system comprises a domain-name morpheme feature classifier, a subject-index feature classifier, a content-similarity feature classifier, a structural-style feature classifier, a visual-rule feature classifier, a linear addition training module, an ensemble classifier, a training dataset management module, and a detection and alarm module. The invention detects phishing websites in real time and improves the accuracy and stability of phishing website detection.
Description
Technical field
The present invention relates to the field of computer network security, and in particular to a phishing website detection method and system based on an adaptive heterogeneous multi-classification model.
Background art
With the rapid development of Internet technology, network security problems keep emerging. Phishing is a typical form of online fraud: using the Internet as a carrier, it deceives users by masquerading as a reputable legitimate website in order to obtain their sensitive information. Victims suffer varying degrees of personal information leakage, which in turn leads to economic loss. How to detect phishing websites quickly and accurately has become a research focus in Web (World Wide Web) information security. Published phishing website detection techniques currently fall mainly into the following categories:
(1) Detection based on blacklists and whitelists: as a practical core technique, black/white lists are efficient and accurate. Checking the domain name locates a phishing website quickly, which makes this one of the most commonly deployed techniques [1].
(2) Detection based on visual similarity: Cao Jiuxin et al. proposed a web page similarity decision algorithm based on nested EMD (Earth Mover's Distance). The web page image is segmented, and the features of the resulting sub-images are used to build the ARG (Attributed Relational Graph) of the page. After the distances between the attributes of different ARGs are computed, the nested EMD method yields the similarity between a phishing page and the protected page, which enables high-precision detection of phishing websites [2].
Lee et al. also build on the EMD algorithm and identify phishing websites by visual similarity; their algorithm uses the result of comparing pixel similarity between pages as the basis for the decision [3].
(3) Detection based on Bayesian algorithms: Jin Qing et al. formulated a set of rules for matching phishing websites on top of a rule-based Bayesian algorithm. Each rule is assigned a weight and a correction factor is computed, giving the probability that the tested website is a phishing website; a probability threshold then decides whether it is one [4][5].
Zhuang Weiwei et al. took eight features, including the page title content, website keywords, page description, image links, and website copyright information, and integrated an extended Bayesian algorithm with an improved SVM for classification, building a system that detects phishing attacks intelligently [6].
(4) Detection based on document structure: Guo Minzhe et al. analyzed the Web page document object and extracted the sensitive-information features of the document object model that phishers commonly exploit, in order to judge whether a website is a phishing website. The algorithm effectively filters out phishing pages and guards well against malicious phishing attacks [7].
(5) Detection based on deep learning: Xu Long proposed a multi-layer DBN-KNN model built on deep learning techniques and applied it to recognizing the features of phishing websites [8].
(6) Other detection techniques: Huang Huajun et al. proposed an active phishing defense based on semi-fragile watermarking [9] and a phishing URL detection algorithm based on anomalous features [10]; Zhang Jianyi et al. proposed a phishing detection algorithm based on text semantic understanding [11]. Other defensive measures include research on cross-site phishing attacks against Web-mail mailboxes, URL filtering based on cloud computing [12], and SVM learning algorithms [13].
Among the above techniques, detection based on black/white lists has poor timeliness and incomplete list coverage. Detection based on visual similarity is algorithmically complex and time-consuming, so it is unsuitable for online real-time detection of massive numbers of URLs (Uniform Resource Locators). Detection based on Bayesian algorithms is less than ideal in robustness and generalization. Detection based on document structure suffers from incomplete feature coverage and therefore a high false-negative rate. Detection based on deep learning has an advantage in feature recognition, but the features are unstable and easily disturbed by contaminated samples.
References:
[1] Huang C., Ma S., Chen K. Using One-Time Passwords to Prevent Password Phishing Attacks [J]. Journal of Network and Computer Applications, 2011, 34(4): 1292-1301.
[2] Cao Jiuxin, Mao Bo, Luo Junzhou, et al. Phishing web page detection algorithm based on nested EMD [J]. Chinese Journal of Computers, 2009, 32(5): 922-929.
[3] Lee, Dong Liu, et al. Phishing detection method based on visual similarity [J]. Journal of Tsinghua University (Natural Science Edition), 2009, 49(1): 146-148.
[4] Zhang H., Liu G., Chow T.W.S., et al. Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach [J]. IEEE Transactions on Neural Networks, 2011, 22(10): 1532-1546.
[5] Jin Qing, Wu Guoxin, Li Dan, et al. Filtering phishing with a rule-based Bayesian algorithm [R]. Hunan: 5th China Conference on Information and Communication Security, 2007: 372-378.
[6] Zhuang Weiwei, Ye Yanfang, Li Tao, et al. Intelligent phishing website detection system based on classification ensemble [J]. Systems Engineering Theory and Practice, 2011, 31(10): 2008-2020.
[7] Guo Minzhe, Yuan Jinsheng, Wang Yachao, et al. Phishing Web page detection algorithm [J]. Computer Engineering, 2008, 34(20): 161-163.
[8] Xu Long. Research on phishing website detection technology based on deep learning [D]. University of Electronic Science and Technology of China, 2017.
[9] Yin Shuijun, Liu Jiayong, Liu Liang. Research on cross-site phishing attacks against Web-mail mailboxes [J]. Communications Technology, 2010, 43(8): 164-166.
[10] Huang H.J., Wang Y.J., Xie L.L., et al. An Active Anti-Phishing Solution Based on Semi-fragile Watermark [J]. Information Technology Journal, 2013, 12(1): 198-203.
[11] Huang H.J., Qian L., Wang Y.J. A SVM-Based Technique to Detect Phishing URLs [J]. Information Technology Journal, 2012, 11(7): 921-925.
[12] Zhang H., Liu G., Chow T.W.S., et al. Textual and Visual Content-Based Anti-Phishing: A Bayesian Approach [J]. IEEE Transactions on Neural Networks, 2011, 22(10): 1532-1546.
[13] Sheng S., Wardman B., Warner G., et al. An Empirical Analysis of Phishing Blacklists [C]. In: Proc. of the Sixth Conference on Email and Anti-Spam, 2009: 1-10.
Summary of the invention
In view of the problems of the existing methods above, the invention discloses a phishing website detection method and system based on an adaptive heterogeneous multi-classification model, which detects phishing websites in real time with high accuracy and stability.
The disclosed phishing website detection method based on the adaptive heterogeneous multi-classification (AHMC) model comprises learning the adaptive heterogeneous multi-classification model and detecting phishing websites. The specific steps are:
Step 1: for a sample set D of phishing websites of a given category, with |D| = n, divide the samples into training sets and test sets by leave-one-out cross-validation. Denote the j-th training set by D_j and the corresponding j-th test set by D/D_j, where j is a positive integer. Each sample consists of a sample record and a sample label; the sample record contains the URL and web page information of the website, and the sample label marks whether the website is a phishing website.
Step 2: construct the adaptive heterogeneous multi-classification model H by linear addition, that is, as a weighted linear combination of the base classification algorithms applied to the sample record x. Here T is the number of base classification algorithms, h_i is the i-th base classification algorithm, ω_i is the weight parameter of the i-th base classification algorithm, and the model additionally carries an adjustment factor.
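A minimal sketch of the linear-addition form that is consistent with the definitions above. It assumes the adjustment factor enters as an additive term and writes it as ε; both the symbol and the additive placement are assumptions, not taken from the source.

```latex
H(x) = \sum_{i=1}^{T} \omega_i \, h_i(x) + \varepsilon
```

Under this reading, training has to determine the parameters inside each h_i together with the T weights ω_i and the single scalar ε.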
Step 3: the input of the multi-classification model H is the input of each base classification algorithm, and its output is the sample label. For training set D_j, the features required by each base classification algorithm are extracted from the sample record of every sample and used as input. Each base classification algorithm is characterized as a linear function, and the parameters of the individual classification algorithms are independent and identically distributed.
Step 4: on training set D_j, use a machine learning algorithm to train and solve for the parameters of each base classification algorithm together with the weights ω_i and the adjustment factor of the multi-classification model H. During training, the features each base classification algorithm needs are extracted from the sample record as input, and priority is given to making the output of the multi-classification model equal the sample label.
Step 5: test and optimize the multi-classification model H on test set D/D_j until the parameters of each base classification algorithm and the weights ω_i and adjustment factor of H converge, at which point the machine learning algorithm for H terminates.
Step 6: from the finally obtained parameters of each base classification algorithm and the weights ω_i and adjustment factor of H, obtain the detection model H' for that category of phishing website.
Step 7: obtain the record of the website to be detected, including its URL and web page information, and input it into the detection model H' to judge whether it is a phishing website.
The invention also discloses a phishing website detection system based on the adaptive heterogeneous multi-classification model, comprising a domain-name morpheme feature classifier, a subject-index feature classifier, a content-similarity feature classifier, a structural-style feature classifier, a visual-rule feature classifier, a linear addition training module, an ensemble classifier, a training dataset management module, and a detection and alarm module. While the system runs, the modules function as follows:
The domain-name morpheme feature classifier performs feature extraction and training on the domain-name string of the input website's URL;
The subject-index feature classifier performs feature extraction and training on the contents of the <title> and <meta> tags and the footer of the input website's pages;
The content-similarity feature classifier performs feature extraction and semantic abstraction on the information in the input website's page content and trains on the resulting features;
The structural-style feature classifier performs feature extraction and training on the source code structure of the input website's pages;
The visual-rule feature classifier extracts and learns the salient visual features of the input website's pages;
The linear addition training module linearly combines the domain-name morpheme, subject-index, content-similarity, structural-style, and visual-rule feature classifiers into a multi-classification model H(x) of the five classifiers, where h_1 to h_5 are the classification functions of the five classifiers, ω_i is the weight parameter of the i-th classifier, the model additionally carries an adjustment factor, and x denotes the sample record;
The linear addition training module uses the training and test sets to train and optimize the parameters of the five classifiers together with ω_i and the adjustment factor;
The ensemble classifier is the final model output by the linear addition training module; it constitutes the phishing website detection model and dynamically maintains the weight of each classifier;
The training dataset management module stores the training datasets, manages the labels and grouping of training samples, divides and maintains the training and test sets, and manages the balance of samples across groups;
The detection and alarm module uses the constructed detection model to check websites to be detected and raises an alarm when a phishing website is found.
Compared with the prior art, the invention has the following clear advantages:
(1) The method and system use ensemble learning: several weak classifiers are combined linearly to obtain generalization performance markedly better than that of a single classifier, which improves the accuracy and stability of phishing website detection.
(2) The method and system learn the weights adaptively from samples and do not depend on prior experience or the advice of domain experts; when sample features and distributions change, the weights are updated automatically during training.
(3) The five weak classifiers in the system are heterogeneous. Each single classifier has a certain accuracy and the classifiers are not strongly dependent on one another, which ensures that the ensemble model has higher accuracy. The overall cost of the five weak classifiers is no larger than that of a single complex learning model.
(4) The method detects phishing websites in real time with high accuracy, recall, and robustness.
(5) The system handles high-performance real-time detection of massive numbers of URLs, is practical for online engineering systems, and offers high availability and stability.
Brief description of the drawings
Fig. 1 is a flow diagram of the phishing website detection method of the invention based on the adaptive heterogeneous multi-classification model;
Fig. 2 is a schematic diagram of the module composition of the phishing website detection system of the invention based on the adaptive heterogeneous multi-classification model;
Fig. 3 is a schematic diagram of data training by the linear addition training module in the phishing website detection system of the invention;
Fig. 4 is a deployment diagram of the phishing website detection system of the invention based on the adaptive heterogeneous multi-classification model.
Detailed description of the embodiments
The technical solution of the invention is described in detail below with reference to the drawings and embodiments. The examples serve only to explain the invention and are not intended to limit its scope.
As shown in Fig. 1, the invention provides a phishing website detection method based on the adaptive heterogeneous multi-classification (AHMC) model. The method comprises learning the adaptive heterogeneous multi-classification model and detecting phishing websites; the individual implementation steps are described below.
Step 1: choose phishing websites of the same category, for example all counterfeit banking websites, as the sample set D, with |D| = n, where n is the number of samples in D. Divide the samples into training sets and test sets by leave-one-out cross-validation.
The j-th training sample set is D_j = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)} (1 ≤ j ≤ n, 1 < m < n);
the corresponding j-th test sample set is D/D_j.
Each sample consists of a record x and a label y; for example, (x_1, y_1) in a sample set denotes one specific sample instance, where x_1 is the sample record and y_1 is the sample label. The sample record contains the URL of the website and the corresponding web page information, and the sample label marks whether the website is a phishing website. D/D_j denotes the set D with D_j removed.
In this step the sample size n should be as large as possible; at least 100 samples are recommended.
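The split described in step 1 can be sketched as follows. This is a minimal illustration of textbook leave-one-out cross-validation, in which each test set holds exactly one sample; the toy records and the name `leave_one_out` are hypothetical, and the source's own grouping (training sets of size m with 1 < m < n) may differ.

```python
from typing import Dict, Iterator, List, Tuple

# A sample is (record x, label y); label 1 = phishing, 0 = legitimate.
Sample = Tuple[Dict[str, str], int]


def leave_one_out(samples: List[Sample]) -> Iterator[Tuple[List[Sample], List[Sample]]]:
    """Yield (training set D_j, test set D/D_j) pairs, one pair per sample."""
    for j in range(len(samples)):
        test = [samples[j]]                      # the single held-out sample
        train = samples[:j] + samples[j + 1:]    # everything else
        yield train, test


# Hypothetical toy sample set D; real records hold a URL plus page information.
D: List[Sample] = [({"url": f"http://example{k}.test/"}, k % 2) for k in range(4)]
splits = list(leave_one_out(D))
```

With n samples this produces n training/test groups, which is why the later averaging step works over several parameter sets.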
Step 2: construct the adaptive heterogeneous multi-classification model H as a weighted linear combination of the base classification algorithms, where T is the number of base classification algorithms, h_i is the i-th base classification algorithm, ω_i is the weight parameter of the i-th base classification algorithm, and the model additionally carries an adjustment factor.
In what follows a base classification algorithm is also called a classification algorithm or learning algorithm, and the corresponding classifier is likewise called a learner. The base classification algorithms are heterogeneous, which guarantees the diversity of the algorithms. The embodiment uses five fixed classification algorithms: h_1 is the domain-name morpheme feature classification algorithm, which judges whether a website is a phishing website from its domain-name morphemes; h_2 is the subject-index feature classification algorithm, which judges from the content under the subject tags of the page; h_3 is the content-similarity feature classification algorithm, which judges by comparing the similarity of the content under the content tags of the page; h_4 is the structural-style feature classification algorithm, which judges from the structure of the source code; h_5 is the visual-rule feature classification algorithm, which judges from the page's icons, color scheme, images, and so on. In practice the set can be extended as needed, following the principle of heterogeneity.
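A minimal sketch of how five heterogeneous base classifiers could be combined by linear addition. The five `h*` functions are hypothetical stand-ins rather than the patent's classifiers, the weights and the 0.5 threshold are invented for illustration, and treating the adjustment factor as an additive term is an assumption.

```python
from typing import Callable, Dict, Sequence

Record = Dict[str, str]                      # a sample record x (URL, page source, ...)
BaseClassifier = Callable[[Record], float]   # h_i: record -> score in [0, 1]


def make_ensemble(bases: Sequence[BaseClassifier],
                  weights: Sequence[float],
                  adjustment: float = 0.0) -> Callable[[Record], float]:
    """Return H(x) = sum_i w_i * h_i(x) + adjustment (additive term is an assumption)."""
    if len(bases) != len(weights):
        raise ValueError("one weight per base classifier is required")

    def H(x: Record) -> float:
        return sum(w * h(x) for w, h in zip(weights, bases)) + adjustment

    return H


# Hypothetical stand-ins for h_1 ... h_5; each looks only at its own part of the record.
def h1_domain(x: Record) -> float:
    return 1.0 if x["url"].split("/")[2].endswith(".xyz") else 0.0

def h2_subject(x: Record) -> float:
    return 1.0 if "bank" in x.get("title", "").lower() else 0.0

def h3_content(x: Record) -> float:
    return 1.0 if "password" in x.get("body", "").lower() else 0.0

def h4_structure(x: Record) -> float:
    return 1.0 if "<form" in x.get("source", "").lower() else 0.0

def h5_visual(x: Record) -> float:
    return 0.0  # placeholder: this sketch has no image features


H = make_ensemble([h1_domain, h2_subject, h3_content, h4_structure, h5_visual],
                  weights=[0.3, 0.2, 0.2, 0.2, 0.1])

record = {"url": "http://icbc-login.xyz/index", "title": "Online Bank",
          "body": "Enter your password", "source": "<form action='/p'>"}
score = H(record)
is_phishing = score >= 0.5  # illustrative threshold, not from the source
```

Because every base classifier receives the whole record and picks out its own features, adding a sixth heterogeneous classifier only means appending one function and one weight.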
Step 3: take the sample record of each sample in training set D_j as the input of the base classification algorithms and the sample label as the output, and train the multi-classification model H. For a sample (x_i, y_i), x_i is the input of each classification algorithm h_1 to h_5: each base classification algorithm extracts the features it needs from x_i, takes y_i as the corresponding output, and trains its features and parameters, as follows:
h_1(x_i) → y_i, h_2(x_i) → y_i, h_3(x_i) → y_i, h_4(x_i) → y_i, h_5(x_i) → y_i
In the embodiment, the multi-classification model H is the linearly weighted combination of these classification functions: when the input is the sample record x_i, H outputs the corresponding sample label y_i. Training the multi-classification model yields the weight parameters ω_i and the adjustment factor.
During training, the method extracts the corresponding features from the input sample record x_i as the input of each classification algorithm h_1 to h_5 and gives priority to making the output of the multi-classification model equal the sample label y_i; the output of each classification algorithm may also be set to y_i. The parameters of each classification algorithm and the weight parameters and adjustment factor of the multi-classification model are trained together. The base classification algorithms used in the invention are characterized as linear functions and the parameters of the individual classification algorithms are independent and identically distributed, so the overall training cost is no larger than that of a single complex learning model. Different base classification algorithms may take different input features, so the applicable features have to be selected from the sample record. For example, the input features of the domain-name morpheme feature classification algorithm h_1 include the top-level domain, the second-level domain, and so on.
Step 4: use a machine learning algorithm to train and solve for the parameters of each classification algorithm together with the weight parameters and the adjustment factor. For example, the parameters of each classification algorithm can be solved by maximum likelihood estimation, and the parameters of the ensemble model H can be solved iteratively with the EM (expectation-maximization) algorithm. The constraint can be formalized as minimizing a loss function. The solution process can be carried out in a single computational framework, parameter solving by maximum likelihood estimation, which the computer executes as a unified matrix solve.
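The idea of solving all parameters in one framework can be illustrated with a small sketch. It deliberately swaps in plain gradient descent on a squared loss in place of the maximum likelihood and EM procedures named above, assumes linear base classifiers h_i(x) = θ_i · f_i(x), and assumes an additive adjustment factor `eps`. The data, names, and hyperparameters are all hypothetical.

```python
from typing import List, Tuple

# A sample is (one feature vector per base classifier, label).
Sample = Tuple[List[List[float]], float]


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def predict(omega, theta, eps, feats) -> float:
    """H(x) = sum_i omega_i * (theta_i . f_i(x)) + eps, with linear base classifiers."""
    return sum(w * dot(t, f) for w, t, f in zip(omega, theta, feats)) + eps


def mse(omega, theta, eps, data: List[Sample]) -> float:
    return sum((predict(omega, theta, eps, f) - y) ** 2 for f, y in data) / len(data)


def train_jointly(data: List[Sample], lr: float = 0.05, steps: int = 2000):
    dims = [len(f) for f in data[0][0]]
    T, n = len(dims), len(data)
    omega = [1.0 / T] * T                 # ensemble weights
    theta = [[0.1] * d for d in dims]     # parameters of each linear base classifier
    eps = 0.0                             # adjustment factor (assumed additive)
    for _ in range(steps):
        g_omega = [0.0] * T
        g_theta = [[0.0] * d for d in dims]
        g_eps = 0.0
        for feats, y in data:
            err = predict(omega, theta, eps, feats) - y
            for i in range(T):
                g_omega[i] += 2 * err * dot(theta[i], feats[i]) / n
                for k in range(dims[i]):
                    g_theta[i][k] += 2 * err * omega[i] * feats[i][k] / n
            g_eps += 2 * err / n
        # Every parameter is updated in the same step: no separate first and second stage.
        omega = [w - lr * g for w, g in zip(omega, g_omega)]
        theta = [[t - lr * g for t, g in zip(ts, gs)] for ts, gs in zip(theta, g_theta)]
        eps -= lr * g_eps
    return omega, theta, eps


# Toy data: two base classifiers with two features each; the label follows feature 0 of the first.
data: List[Sample] = [
    ([[1.0, 0.0], [1.0, 0.0]], 1.0),
    ([[0.0, 1.0], [0.0, 1.0]], 0.0),
    ([[1.0, 0.0], [0.0, 1.0]], 1.0),
    ([[0.0, 1.0], [1.0, 0.0]], 0.0),
]
initial_loss = mse([0.5, 0.5], [[0.1, 0.1], [0.1, 0.1]], 0.0, data)
omega, theta, eps = train_jointly(data)
final_loss = mse(omega, theta, eps, data)
```

The point of the sketch is the single update loop: base-classifier parameters, weights, and adjustment factor share one loss, which is what distinguishes this setup from training the base classifiers first and the combiner second.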
Step 5: test and optimize the model H on the test sample set D/D_j. Cycle through all test and training samples until the parameters and the adjustment factor converge to a stable threshold, at which point the learning algorithm of the model terminates.
This step serves two purposes, testing and optimization. When test samples conflict, or when the adjustment factor fails to converge, the samples are corrected: the affected training samples are reclassified and handled individually, the sample labels are corrected, and the training set is updated. The model H is then retrained once to adjust the parameters of the classification algorithms, that is, the training process of step 4 is executed again, which optimizes the parameters and the adjustment factor.
The method obtains the training and test sets by the leave-one-out method. If K groups of training and test sets are obtained in total, steps 3 to 5 above are executed for each group, which finally yields multiple sets of base classification algorithm parameters and adjustment factors. These classification algorithm parameters and adjustment factors can then be averaged, and the combined average is taken as the final result.
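A minimal sketch of the averaging step, assuming each of the K groups yields a flat mapping from parameter name to value. The parameter names and numbers are invented for illustration.

```python
from typing import Dict, List


def average_parameters(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each named parameter over the K training/test groups."""
    if not runs:
        raise ValueError("at least one parameter set is required")
    return {name: sum(run[name] for run in runs) / len(runs) for name in runs[0]}


# Hypothetical parameter sets from K = 3 groups.
runs = [
    {"omega_1": 0.30, "omega_2": 0.70, "adjustment": 0.02},
    {"omega_1": 0.34, "omega_2": 0.66, "adjustment": 0.00},
    {"omega_1": 0.32, "omega_2": 0.68, "adjustment": 0.04},
]
final_parameters = average_parameters(runs)
```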
Step 6: from the classification algorithm parameters, the weights ω_i, and the adjustment factor obtained in step 5, obtain the adaptive heterogeneous multi-classification model H' for that category of phishing website.
The model H obtained by the optimization in step 5 is a model instance. Its parameters are migrated to initialize the phishing website detection algorithm H'. The models H' and H are isomorphic; in the embodiment, H' is a hybrid model that integrates h_1 to h_5.
Step 7: for a website to be detected, obtain its record, including web page information such as the website URL and page source code, and input it into the detection model H' to learn whether the URL is a phishing website and, if so, which object it counterfeits. The input web page information does not need to be formatted; the features each classifier uses are all obtained automatically from the structure of the page source code.
In this step the site information and source code corresponding to the URL can be fetched with crawler technology. When new features or variants appear, only the affected base classification algorithm and its features need to be updated, with little effect on the weight parameters and the adjustment factor.
The invention adopts the idea of ensemble learning. Its main difference from classical ensemble learning is this: classical ensemble learning has two stages, where the first stage trains each base classifier and the second stage takes the first-stage outputs as input to train the parameters that combine the base classifiers. The invention instead trains everything together in a single computational framework, without the two-stage division.
The invention also discloses a phishing website detection system based on the adaptive heterogeneous multi-classification model, consisting mainly of nine parts: a domain-name morpheme feature classifier, a subject-index feature classifier, a content-similarity feature classifier, a structural-style feature classifier, a visual-rule feature classifier, a linear addition training module, an ensemble classifier, a training dataset management module, and a detection and alarm module. The system is shown in Fig. 2; the function of each module at run time is described below.
Domain-name morpheme feature classifier: the features of this classifier are statistical features of the domain-name part of a phishing website's URL string. The classifier performs feature extraction and training on the domain-name string of the input website's URL. Its functions include but are not limited to: 1) judging how suspicious the top-level domain is; 2) extracting the morpheme information contained in the second-level domain; 3) obtaining the hierarchy of the domain name and the length of the subdomains; 4) building and improving the morpheme feature library.
In the domain-name morpheme feature classifier: 1) the suspiciousness of a top-level domain comes from statistical experience; phishing websites are usually more likely to appear under top-level domains such as pw, win, top, and xyz; 2) the morpheme information in the second-level domain refers to strings within the second-level domain that contain an abbreviation or identifier of a bank, such as 95588, 95533, cmb, icbc, and boc; 3) third- or fourth-level domain names may contain short hyphen-joined strings imitating a bank address, such as www-bankofbeijing-com-cn.
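The three kinds of domain-name evidence listed above can be sketched as a small feature extractor. The top-level domains and bank morphemes are the examples given in the text; the function name, the returned keys, and the sample URL are hypothetical, and the naive split on dots does not handle multi-label public suffixes such as com.cn.

```python
from typing import Dict, List, Union
from urllib.parse import urlsplit

SUSPICIOUS_TLDS = {"pw", "win", "top", "xyz"}               # examples from the text
BANK_MORPHEMES = ("95588", "95533", "cmb", "icbc", "boc")   # examples from the text

FeatureValue = Union[bool, int, List[str], List[int]]


def domain_features(url: str) -> Dict[str, FeatureValue]:
    """Extract simple morpheme features from the host name of a URL."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    second_level = labels[-2] if len(labels) >= 2 else ""
    subdomains = labels[:-2]                                 # third level and deeper
    return {
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
        "morphemes": [m for m in BANK_MORPHEMES if m in second_level],
        "depth": len(labels),
        "subdomain_lengths": [len(s) for s in subdomains],
        "hyphenated_subdomain": any("-" in s for s in subdomains),
    }


features = domain_features("http://www-bankofbeijing-com-cn.icbc-login.xyz/index")
```

A trained classifier would turn these raw features into a score; here they are only extracted.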
Subject-index feature classifier: it mainly performs feature extraction and training on the contents of the <title> and <meta> tags and the footer of the input website's pages. Its functions include but are not limited to: 1) extracting the features in these tags, resolving conflicts between features, and classifying feature types; 2) building and improving the subject-index feature library. The advantage of the subject-index feature classifier is fast and accurate localization; its disadvantage is weak generalization and a high false-positive rate, because the content of the <title> tag may barely differ from that of a normal, non-counterfeit website or may be unrelated to the body text of the page. This classifier therefore needs to work together with a whitelist library when classifying.
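A minimal sketch of pulling the subject-index inputs out of a page with Python's standard `html.parser`. The class name is hypothetical, only the `keywords` and `description` meta names are collected, and reading the footer from a `<footer>` element is an assumption, since the text does not say how the footer is located.

```python
from html.parser import HTMLParser
from typing import Dict


class SubjectIndexParser(HTMLParser):
    """Collect the <title> text, selected <meta> contents, and <footer> text."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: Dict[str, str] = {}
        self.footer = ""
        self._in_title = False
        self._footer_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "footer":
            self._footer_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "footer" and self._footer_depth > 0:
            self._footer_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._footer_depth > 0:
            self.footer += data


page = ("<html><head><title>ICBC Online Banking</title>"
        "<meta name='keywords' content='bank,login'></head>"
        "<body><footer>Copyright Example Bank</footer></body></html>")
parser = SubjectIndexParser()
parser.feed(page)
parser.close()
```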
Content-similarity feature classifier: it mainly performs feature extraction and semantic abstraction on the short-text information in the page content. Its functions include but are not limited to: 1) extracting the text inside the <body> tag, where content inside tags such as <a>, <p>, <div>, <span>, <td>, <table>, and <form> is extracted when it is no longer than 15 characters, and content in the body text is extracted in pieces of 2 to 8 characters; 2) vectorizing and standardizing the text features; 3) word embedding, mapping words to low-dimensional feature vectors with the Word2Vec tool; 4) building the word feature vector library.
The detection performance of the content-similarity feature classifier is stable, and its accuracy and recall are better than those of the other classifiers. Vectorization here means deduplicating and filtering the short texts; standardization means deleting words denoting specific times, frequently changing numbers, noise words that occur too often, advertisements with no discriminative value, third-party link words, and the like.
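The short-text extraction and deduplication can be sketched with the standard `html.parser`. This sketch keeps only text that sits directly inside one of the listed tags and is at most 15 characters long, and drops repeats; the 2 to 8 character rule for running body text, the standardization filters, and the Word2Vec embedding are not modelled. The class and constant names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List

SHORT_TEXT_TAGS = {"a", "p", "div", "span", "td", "table", "form"}  # tags named in the text
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}            # never closed, so never pushed
MAX_LEN = 15                                                        # length limit from the text


class ShortTextParser(HTMLParser):
    """Collect deduplicated short text fragments found directly inside the listed tags."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[str] = []
        self.fragments: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if (text and self._stack and self._stack[-1] in SHORT_TEXT_TAGS
                and len(text) <= MAX_LEN and text not in self.fragments):
            self.fragments.append(text)


page = ("<body><div><a href='#'>Log in</a>"
        "<p>Please enter your card number and password to continue.</p>"
        "<span>Log in</span><script>var x=1;</script></div></body>")
short_texts = ShortTextParser()
short_texts.feed(page)
short_texts.close()
```

The resulting fragment list is the kind of input that would then be standardized and embedded.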
Structural-style feature classifier: it performs feature extraction and training on the source code structure of the input website's pages. Its main functions include: 1) source code analysis of JS script code, CSS styles, form elements, and the DOM structure; 2) analysis of homologous code structures in the source code and extraction of common code snippets; 3) building the homologous-code similarity matrix.
The structural-style feature classifier can predict new phishing websites developed by the same organization behind the scenes. The common code snippets include but are not limited to: 1) identical function names; 2) matching CSS color schemes; 3) identical JS scripts; 4) identical selection lists and <form> forms; 5) identical redirect hyperlinks and external page links.
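One way to picture the homologous-code similarity matrix is a pairwise Jaccard similarity over a single kind of shared snippet. The sketch below uses only JS function names found by a simple regular expression; the choice of Jaccard similarity, the regular expression, and the sample pages are assumptions for illustration.

```python
import re
from typing import List, Set

# Matches declarations of the form "function name(".
FUNC_RE = re.compile(r"function\s+([A-Za-z_$][\w$]*)\s*\(")


def function_names(source: str) -> Set[str]:
    """Return the set of declared JS function names in a page source."""
    return set(FUNC_RE.findall(source))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_matrix(pages: List[str]) -> List[List[float]]:
    """Pairwise similarity of pages based on shared function names."""
    name_sets = [function_names(page) for page in pages]
    return [[jaccard(a, b) for b in name_sets] for a in name_sets]


pages = [
    "function checkCard(){} function submitForm(){}",
    "function submitForm(){} function track(){}",
    "var a = 1;",
]
matrix = similarity_matrix(pages)
```

Pages built from the same phishing kit would tend to show high off-diagonal values in such a matrix.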
Visual-rule feature classifier: it mainly extracts and learns the salient visual features of the input website's pages. The extracted features include but are not limited to: 1) the logo icon of the target website; 2) the overall color scheme and frame composition of the website; 3) salient picture modules, and so on.
The disadvantage of the visual-rule feature classifier is that both learning and detecting visual features take a long time, and pixel-level differences in the same logo can cause large errors, so the quality requirements for training samples are stricter. The visual feature library should contain no fewer than 30,000 entries.
Linear addition training module: by learning the weight parameters and the adjustment factor, this module linearly combines and trains the heterogeneous base classifiers (the domain name morpheme feature classifier, the subject index feature classifier, the content similarity feature classifier, the structural style feature classifier and the visual rule feature classifier) to obtain stable weight parameters and a stable adjustment factor.
In the linear addition training module, the training of the linear addition depends on the quality of the training samples; the weight parameters and the adjustment factor are computed automatically by the built-in algorithm. The module uses the training set and the test set to train and optimize the classification parameters of the five classifiers together with the weights $\omega_i$ and the adjustment factor. During training, the sample records of the training data are fed in parallel into the five base classifiers, and the output of the multi-classification model that combines the five classifiers is the label of the corresponding sample, as shown in Fig. 3.
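The combination step can be illustrated with a short sketch. The exact formula is not reproduced in this text, so the form used here, a weighted sum of base classifier scores plus an additive adjustment term followed by a threshold, is an assumption, as are the names `combine`, `predict_label`, `adjustment` and `threshold`. The training procedure itself is not shown.

```python
from typing import Callable, Sequence

# A base classifier is modeled as a function from a sample record to a score.
BaseClassifier = Callable[[dict], float]


def combine(record: dict,
            classifiers: Sequence[BaseClassifier],
            weights: Sequence[float],
            adjustment: float = 0.0) -> float:
    """Weighted sum of base scores plus an adjustment term (assumed form)."""
    if len(classifiers) != len(weights):
        raise ValueError("one weight is required per base classifier")
    # Every base classifier receives the same sample record.
    return sum(w * h(record) for h, w in zip(classifiers, weights)) + adjustment


def predict_label(record: dict,
                  classifiers: Sequence[BaseClassifier],
                  weights: Sequence[float],
                  adjustment: float = 0.0,
                  threshold: float = 0.5) -> int:
    """Return 1 (phishing) when the combined score reaches the threshold."""
    score = combine(record, classifiers, weights, adjustment)
    return int(score >= threshold)
```

With two stub classifiers scoring a record 1.0 and 0.0 and weights 0.6 and 0.4, the combined score is 0.6 and the record is labeled as phishing under the assumed 0.5 threshold.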
Integrated classifier: this is the final model output by the linear addition training module. Its functions include but are not limited to: 1) building the phishing website detection model; 2) matching counterfeited targets with category labels; 3) dynamically maintaining the classifier weights and iteratively integrating the feature libraries. The feature libraries here are those of the classifiers in use, for example the feature library of the domain name morpheme feature classifier.
In the integrated classifier, not every base classifier takes part in the final detection: if a weight parameter is 0, the corresponding base classifier is not enabled. In addition, the integrated model also takes classifier performance into account. For example, the visual rule feature classifier is not used in coarse classification because it takes longer to run.
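The selection rule described above can be sketched as follows. The classifier keys and the `SLOW_CLASSIFIERS` set are hypothetical names chosen for illustration; the source only states that zero-weight classifiers are disabled and that the visual rule classifier is skipped in coarse classification.

```python
from typing import Dict, List

# Classifiers considered too slow for the coarse stage (assumed set,
# based on the remark about the visual rule feature classifier).
SLOW_CLASSIFIERS = {"visual_rule"}


def enabled_classifiers(weights: Dict[str, float]) -> List[str]:
    """Base classifiers whose learned weight is non-zero."""
    return [name for name, w in weights.items() if w != 0]


def coarse_stage_classifiers(weights: Dict[str, float]) -> List[str]:
    """Enabled classifiers that are also fast enough for coarse classification."""
    return [n for n in enabled_classifiers(weights) if n not in SLOW_CLASSIFIERS]
```

For example, with weights of 0 for the subject index and structural style classifiers, only the remaining three are enabled, and the coarse stage further drops the visual rule classifier.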
Training dataset management module: used to store the training data. The training dataset mainly consists of data samples such as phishing website URLs, source code and site information. The functions of the module include: 1) label management and grouping of the training sample data; 2) division of the training samples into test and training sets, and maintenance of those sets; 3) management of the balance of sample sampling across the different groups.
For the final result, the quality of the training sample data is as important as the quality of the classifiers. Sample management is therefore carried out in an independent module, whose emphasis is on managing the distribution of samples across categories so as to prevent imbalanced samples.
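A rough sketch of the split and balance bookkeeping is given below. It assumes a stratified hold-out split, which is a simplification chosen for illustration and not necessarily the scheme used by the module; the function names, the `test_fraction` parameter and the imbalance ratio are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (record, label) pairs so each label keeps its share in both sets."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for group in by_label.values():
        group = group[:]
        rng.shuffle(group)
        k = max(1, round(len(group) * test_fraction))
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test


def imbalance_ratio(samples):
    """Largest class size divided by smallest class size (1.0 = balanced)."""
    counts = Counter(label for _, label in samples)
    return max(counts.values()) / min(counts.values())
```

A management module along these lines could monitor the imbalance ratio of each group and resample or collect more data when it drifts too far from 1.0.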
Detection and alarm module: this module has two functions: 1) detection of phishing websites, where the model consists of the parameters and feature libraries of the integrated classifier; 2) alerting on the phishing websites that are detected, where the alert information and alert levels can be custom-configured by the user.
In this module, alerts are graded by importance according to how much the user cares about the different counterfeited targets. For example, a page may show lottery-industry information on the first screen but counterfeit Bank of China information on the second screen. Because the user's level of concern for banks is higher than for lotteries, the page is preferentially classified and alerted as a bank case.
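The prioritization in the bank and lottery example can be sketched as a lookup against a user-configured priority table. The function name, the category strings and the numeric priorities are hypothetical; the source only says that alert information and levels are user-configurable.

```python
from typing import Dict, List, Optional


def alert_category(detected: List[str], priority: Dict[str, int]) -> Optional[str]:
    """Pick the detected category the user cares about most.

    detected lists the counterfeited categories found on the page in page
    order; priority maps a category to a user-configured level of concern
    (higher means more important, unknown categories default to 0).
    """
    if not detected:
        return None
    return max(detected, key=lambda category: priority.get(category, 0))
```

With a page that shows lottery content first and bank content second, and a configuration that ranks banks above lotteries, the alert is raised for the bank category.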
The 5 base classifiers in the system of the present invention are independent of one another, with no correlation between them. The linear addition training module is the core of the system: the training of all parameters is completed in this module, and the linear model ensures the performance of the system and the convergence of the computation. The integrated classifier decides how to combine the base classifiers in the base classifier module according to the parameters output by the linear addition; the 5 classifiers do not all have to take part in the integration, and only 2-3 may be needed.
Fig. 4 shows a deployment diagram of the system of the present invention. The five base classifiers form a base learner server cluster, while the linear addition training module, the integrated classifier, the training dataset management module and the detection and alarm module are deployed in a distributed manner at the network switch.
Claims (8)
1. A phishing website detection method based on an adaptive heterogeneous multi-classification model, characterized in that the method comprises:
Step 1: for a phishing website sample set $D$ of a single category, dividing the samples into training sets and test sets using leave-one-out cross-validation; the j-th training set is denoted $D_j$ and has a corresponding j-th test set; each sample comprises a sample record and a sample label; the sample record contains the URL and web page information of the website, and the sample label marks whether the website is a phishing website; $j$ is a positive integer;
Step 2: constructing the adaptive heterogeneous multi-classification model $H$ by linear addition, as a weighted combination of the base classification algorithms together with an adjustment factor, where $T$ is the number of base classification algorithms, $h_i$ is the i-th base classification algorithm, $\omega_i$ is the weight parameter of the i-th base classification algorithm, and $x$ denotes a sample record;
Step 3: the input of the multi-classification model $H$ is the input of each base classification algorithm, and the output is the sample label; for the training set $D_j$, the feature corresponding to each base classification algorithm is extracted from the sample record of each sample as input; the base classification algorithms are characterized as linear functions, and the parameters of the classification algorithms are independent and identically distributed;
Step 4: based on the training set $D_j$, training and solving, by means of a machine learning algorithm, the parameters of each base classification algorithm and the parameters of the multi-classification model $H$;
Step 5: testing and optimizing the multi-classification model $H$ on the test set until the parameters of each base classification algorithm and the parameters of the multi-classification model $H$ converge, then terminating the machine learning algorithm of the multi-classification model $H$;
Step 6: obtaining the detection model $H'$ for this category of phishing website from the finally obtained parameters of each base classification algorithm and the parameters of the multi-classification model $H$;
Step 7: obtaining the record of a website to be detected, including the URL and web page information of the website, and inputting it into the detection model $H'$ to judge whether the website is a phishing website.
2. The method according to claim 1, characterized in that the size of the sample set $D$ is not less than 100.
3. The method according to claim 1 or 2, characterized in that, in step 1, the training set and the test set are expressed as follows:
the j-th training set is $D_j = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, with $1 \le j \le n$ and $1 < m < n$;
the corresponding j-th test set is $D / D_j$;
where $n$ is the number of samples in $D$, $m$ is the number of samples in $D_j$, and $D / D_j$ denotes removing $D_j$ from the set $D$; the i-th sample $(x_i, y_i)$ comprises the i-th sample record $x_i$ and the label $y_i$.
4. The method according to claim 1, characterized in that, in step 4, the parameters of each base classification algorithm are solved by maximum likelihood estimation, and the parameters of the multi-classification model $H$ are solved iteratively by the EM algorithm.
5. The method according to claim 1, characterized in that, in step 5, when the parameters of the multi-classification model $H$ cannot converge, the sample labels are corrected, the training set samples are updated, and the training process of step 4 is executed again.
6. A phishing website detection system based on an adaptive heterogeneous multi-classification model, characterized in that it comprises a domain name morpheme feature classifier, a subject index feature classifier, a content similarity feature classifier, a structural style feature classifier, a visual rule feature classifier, a linear addition training module, an integrated classifier, a training dataset management module, and a detection and alarm module;
the domain name morpheme feature classifier performs feature extraction and training on the domain name string of the URL of the input website;
the subject index feature classifier performs feature extraction and training on the content of the web page tags <title> and <meta> and of the footer of the input website;
the content similarity feature classifier performs feature extraction and semantic abstraction on the information in the web page content of the input website, and trains on the features;
the structural style feature classifier performs feature extraction and training on the web page source code structure of the input website;
the visual rule feature classifier extracts and learns the salient visual features of the web pages of the input website;
the linear addition training module linearly combines the domain name morpheme feature classifier, the subject index feature classifier, the content similarity feature classifier, the structural style feature classifier and the visual rule feature classifier, where $H(x)$ denotes the multi-classification model combining the five classifiers, $h_1$ to $h_5$ are the classification functions corresponding to the five classifiers, $\omega_i$ is the weight parameter of the i-th classifier, the combination further includes an adjustment factor, and $x$ denotes a sample record;
the linear addition training module uses the training set and the test set to train and optimize the parameters of the five classifiers together with $\omega_i$ and the adjustment factor;
the integrated classifier is the final model output by the linear addition training module; it builds the phishing website detection model and dynamically maintains the weight of each classifier;
the training dataset management module stores the training dataset, performs label management and grouping of the training samples, divides and maintains the training set and the test set, and manages the balance of sample sampling across the different groups;
the detection and alarm module detects a website to be detected according to the constructed phishing website detection model, and raises an alarm when a phishing website is detected.
7. The system according to claim 6, characterized in that, in the integrated classifier, a weight parameter of 0 for a classifier indicates that the classifier is not enabled.
8. The system according to claim 6, characterized in that the functions realized by the domain name morpheme feature classifier include: judging the suspiciousness of the top-level domain; extracting the morpheme information contained in the second-level domain; obtaining the hierarchical structure of the domain name and the length of the subdomains; and building and improving the morpheme feature library;
wherein, when pw, win, top or xyz appears as the top-level domain, the suspicion that the website is a phishing website is high; the morpheme information in the second-level domain refers to abbreviations of certain banks contained in the second-level domain string; and the domain name morpheme feature classifier also extracts the short words of bank addresses joined by hyphens in third-level or fourth-level domains.
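The domain name checks listed in claim 8 can be illustrated with a small sketch. The suspicious top-level domains are taken from the claim; everything else is an assumption made for illustration: the function name, the example bank abbreviations, and the naive splitting of the host name on dots, which ignores multi-part suffixes such as com.cn.

```python
from urllib.parse import urlsplit

SUSPICIOUS_TLDS = {"pw", "win", "top", "xyz"}   # listed in claim 8
BANK_ABBREVIATIONS = {"icbc", "boc", "ccb"}      # hypothetical examples


def domain_features(url):
    """Extract simple domain-name features from a URL with a scheme."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    tld = labels[-1] if labels else ""
    second_level = labels[-2] if len(labels) >= 2 else ""
    sub_labels = labels[:-2]  # third-level domain and deeper
    return {
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
        # Bank abbreviations found inside the second-level domain string.
        "bank_morphemes": sorted(
            abbr for abbr in BANK_ABBREVIATIONS if abbr in second_level
        ),
        "depth": len(labels),
        "subdomain_length": len(".".join(sub_labels)),
        # Short words joined by hyphens in third-level or deeper labels.
        "hyphen_tokens": [
            token
            for label in sub_labels if "-" in label
            for token in label.split("-") if token
        ],
    }
```

For a URL such as `http://www.bank-login.icbc-safe.top/index.html`, this sketch flags the top-level domain, finds the abbreviation `icbc` in the second-level domain, and returns the hyphen-joined tokens `bank` and `login` from the subdomain.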
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810549417.1A CN108965245B (en) | 2018-05-31 | 2018-05-31 | Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108965245A true CN108965245A (en) | 2018-12-07 |
CN108965245B CN108965245B (en) | 2021-04-13 |
Family
ID=64493105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810549417.1A Active CN108965245B (en) | 2018-05-31 | 2018-05-31 | Phishing website detection method and system based on self-adaptive heterogeneous multi-classification model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108965245B (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710925A (en) * | 2018-12-12 | 2019-05-03 | 新华三大数据技术有限公司 | Name entity recognition method and device |
CN110162624A (en) * | 2019-04-16 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of text handling method, device and relevant device |
CN110266647A (en) * | 2019-05-22 | 2019-09-20 | 北京金睛云华科技有限公司 | It is a kind of to order and control communication check method and system |
CN110324316A (en) * | 2019-05-31 | 2019-10-11 | 河南恩湃高科集团有限公司 | A kind of industry control anomaly detection method based on a variety of machine learning algorithms |
CN110334262A (en) * | 2019-06-06 | 2019-10-15 | 阿里巴巴集团控股有限公司 | A kind of model training method, device and electronic equipment |
CN110766165A (en) * | 2019-10-23 | 2020-02-07 | 扬州大学 | Online active machine learning method for malicious URL detection |
CN110912910A (en) * | 2019-11-29 | 2020-03-24 | 北京工业大学 | DNS network data filtering method and device |
CN111125699A (en) * | 2019-12-04 | 2020-05-08 | 中南大学 | Malicious program visual detection method based on deep learning |
CN111552649A (en) * | 2020-05-18 | 2020-08-18 | 支付宝(杭州)信息技术有限公司 | Packet testing method and device |
CN111859451A (en) * | 2020-07-23 | 2020-10-30 | 北京尚隐科技有限公司 | Processing system of multi-source multi-modal data and method applying same |
WO2020230053A1 (en) * | 2019-05-14 | 2020-11-19 | International Business Machines Corporation | Detection of phishing campaigns |
CN112507333A (en) * | 2020-12-01 | 2021-03-16 | 北京天融信网络安全技术有限公司 | Website detection and model training method and device and electronic equipment |
EP3771171A4 (en) * | 2019-05-29 | 2021-06-02 | Wangsu Science & Technology Co., Ltd. | Website detection method and system |
CN113051500A (en) * | 2021-03-25 | 2021-06-29 | 武汉大学 | Phishing website identification method and system fusing multi-source data |
CN113438209A (en) * | 2021-06-04 | 2021-09-24 | 中国计量大学 | Phishing website detection method based on improved Stacking strategy |
CN114070653A (en) * | 2022-01-14 | 2022-02-18 | 浙江大学 | Hybrid phishing website detection method and device, electronic equipment and storage medium |
CN114124564A (en) * | 2021-12-03 | 2022-03-01 | 北京天融信网络安全技术有限公司 | Counterfeit website detection method and device, electronic equipment and storage medium |
CN114363019A (en) * | 2021-12-20 | 2022-04-15 | 北京华云安信息技术有限公司 | Method, device and equipment for training phishing website detection model and storage medium |
CN114499980A (en) * | 2021-12-28 | 2022-05-13 | 杭州安恒信息技术股份有限公司 | Phishing mail detection method, device, equipment and storage medium |
CN114896348A (en) * | 2022-05-11 | 2022-08-12 | 天津大学 | Data exploration method and system |
CN116028880A (en) * | 2023-02-07 | 2023-04-28 | 支付宝(杭州)信息技术有限公司 | Method for training behavior intention recognition model, behavior intention recognition method and device |
CN114896348B (en) * | 2022-05-11 | 2024-06-04 | 天津大学 | Visual data pattern recognition method and system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103379111A (en) * | 2012-04-21 | 2013-10-30 | 中南林业科技大学 | Intelligent anti-phishing defensive system |
US20140359760A1 (en) * | 2013-05-31 | 2014-12-04 | Adi Labs, Inc. | System and method for detecting phishing webpages |
CN104217160A (en) * | 2014-09-19 | 2014-12-17 | 中国科学院深圳先进技术研究院 | Method and system for detecting Chinese phishing website |
CN106453351A (en) * | 2016-10-31 | 2017-02-22 | 重庆邮电大学 | Financial fishing webpage detection method based on Web page characteristics |
CN106789888A (en) * | 2016-11-18 | 2017-05-31 | 重庆邮电大学 | A kind of fishing webpage detection method of multiple features fusion |
CN107181730A (en) * | 2017-03-13 | 2017-09-19 | 烟台中科网络技术研究所 | A kind of counterfeit website monitoring recognition methods and system |
CN107911360A (en) * | 2017-11-13 | 2018-04-13 | 哈尔滨工业大学(威海) | One kind is hacked website detection method and system |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710925A (en) * | 2018-12-12 | 2019-05-03 | 新华三大数据技术有限公司 | Name entity recognition method and device |
CN110162624A (en) * | 2019-04-16 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of text handling method, device and relevant device |
CN110162624B (en) * | 2019-04-16 | 2024-04-09 | 腾讯科技(深圳)有限公司 | Text processing method and device and related equipment |
CN113812130A (en) * | 2019-05-14 | 2021-12-17 | 国际商业机器公司 | Detection of phishing activities |
WO2020230053A1 (en) * | 2019-05-14 | 2020-11-19 | International Business Machines Corporation | Detection of phishing campaigns |
GB2600028A (en) * | 2019-05-14 | 2022-04-20 | Ibm | Detection of phishing campaigns |
US11303674B2 (en) | 2019-05-14 | 2022-04-12 | International Business Machines Corporation | Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications |
GB2600028B (en) * | 2019-05-14 | 2023-09-13 | Crowdstrike Inc | Detection of phishing campaigns |
US11818170B2 (en) | 2019-05-14 | 2023-11-14 | Crowdstrike, Inc. | Detection of phishing campaigns based on deep learning network detection of phishing exfiltration communications |
CN110266647A (en) * | 2019-05-22 | 2019-09-20 | 北京金睛云华科技有限公司 | It is a kind of to order and control communication check method and system |
EP3771171A4 (en) * | 2019-05-29 | 2021-06-02 | Wangsu Science & Technology Co., Ltd. | Website detection method and system |
CN110324316A (en) * | 2019-05-31 | 2019-10-11 | 河南恩湃高科集团有限公司 | A kind of industry control anomaly detection method based on a variety of machine learning algorithms |
CN110324316B (en) * | 2019-05-31 | 2022-04-22 | 河南九域恩湃电力技术有限公司 | Industrial control abnormal behavior detection method based on multiple machine learning algorithms |
CN110334262A (en) * | 2019-06-06 | 2019-10-15 | 阿里巴巴集团控股有限公司 | A kind of model training method, device and electronic equipment |
CN110334262B (en) * | 2019-06-06 | 2023-12-29 | 创新先进技术有限公司 | Model training method and device and electronic equipment |
CN110766165B (en) * | 2019-10-23 | 2023-08-08 | 扬州大学 | Online active machine learning method for malicious URL detection |
CN110766165A (en) * | 2019-10-23 | 2020-02-07 | 扬州大学 | Online active machine learning method for malicious URL detection |
CN110912910A (en) * | 2019-11-29 | 2020-03-24 | 北京工业大学 | DNS network data filtering method and device |
CN111125699A (en) * | 2019-12-04 | 2020-05-08 | 中南大学 | Malicious program visual detection method based on deep learning |
CN111125699B (en) * | 2019-12-04 | 2023-04-18 | 中南大学 | Malicious program visual detection method based on deep learning |
CN111552649A (en) * | 2020-05-18 | 2020-08-18 | 支付宝(杭州)信息技术有限公司 | Packet testing method and device |
CN111552649B (en) * | 2020-05-18 | 2022-02-22 | 支付宝(杭州)信息技术有限公司 | Packet testing method and device |
CN111859451A (en) * | 2020-07-23 | 2020-10-30 | 北京尚隐科技有限公司 | Processing system of multi-source multi-modal data and method applying same |
CN111859451B (en) * | 2020-07-23 | 2024-02-06 | 北京尚隐科技有限公司 | Multi-source multi-mode data processing system and method for applying same |
CN112507333A (en) * | 2020-12-01 | 2021-03-16 | 北京天融信网络安全技术有限公司 | Website detection and model training method and device and electronic equipment |
CN113051500B (en) * | 2021-03-25 | 2022-08-16 | 武汉大学 | Phishing website identification method and system fusing multi-source data |
CN113051500A (en) * | 2021-03-25 | 2021-06-29 | 武汉大学 | Phishing website identification method and system fusing multi-source data |
CN113438209A (en) * | 2021-06-04 | 2021-09-24 | 中国计量大学 | Phishing website detection method based on improved Stacking strategy |
CN114124564A (en) * | 2021-12-03 | 2022-03-01 | 北京天融信网络安全技术有限公司 | Counterfeit website detection method and device, electronic equipment and storage medium |
CN114124564B (en) * | 2021-12-03 | 2023-11-28 | 北京天融信网络安全技术有限公司 | Method and device for detecting counterfeit website, electronic equipment and storage medium |
CN114363019A (en) * | 2021-12-20 | 2022-04-15 | 北京华云安信息技术有限公司 | Method, device and equipment for training phishing website detection model and storage medium |
CN114363019B (en) * | 2021-12-20 | 2024-04-16 | 北京华云安信息技术有限公司 | Training method, device, equipment and storage medium for phishing website detection model |
CN114499980A (en) * | 2021-12-28 | 2022-05-13 | 杭州安恒信息技术股份有限公司 | Phishing mail detection method, device, equipment and storage medium |
CN114070653A (en) * | 2022-01-14 | 2022-02-18 | 浙江大学 | Hybrid phishing website detection method and device, electronic equipment and storage medium |
CN114070653B (en) * | 2022-01-14 | 2022-06-24 | 浙江大学 | Hybrid phishing website detection method and device, electronic equipment and storage medium |
CN114896348A (en) * | 2022-05-11 | 2022-08-12 | 天津大学 | Data exploration method and system |
CN114896348B (en) * | 2022-05-11 | 2024-06-04 | 天津大学 | Visual data pattern recognition method and system |
CN116028880A (en) * | 2023-02-07 | 2023-04-28 | 支付宝(杭州)信息技术有限公司 | Method for training behavior intention recognition model, behavior intention recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108965245B (en) | 2021-04-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||