CN108038173A - Web page classification method, system, and web page classification device - Google Patents

Web page classification method, system, and web page classification device

Info

Publication number
CN108038173A
Authority
CN
China
Prior art keywords
webpage
sorted
web page
dimensional
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711285419.6A
Other languages
Chinese (zh)
Other versions
CN108038173B (en)
Inventor
刘文印
黎宇坤
陈旭
袁华平
杨振国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201711285419.6A
Publication of CN108038173A
Application granted
Publication of CN108038173B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation

Abstract

The invention discloses a web page classification method comprising: obtaining N-dimensional current features of a web page to be classified, where N is a positive integer; inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N+n)-dimensional features of the web page to be classified, where the stacking model is a model in which p base classification models are stacked in q layers, n is the product of p and q, and n, p, and q are positive integers; and obtaining a classification result for the web page to be classified by applying a classification algorithm to the (N+n)-dimensional features. By using the stacking model to expand the N-dimensional current features of the web page to be classified, the disclosed method improves the accuracy of web page classification without relying on a search engine or third-party services. The invention also discloses a web page classification system, a web page classification device, and a computer-readable storage medium, which achieve the same technical effect.

Description

Web page classification method, system, and web page classification device
Technical field
The present invention relates to the technical field of network security, and more specifically to a web page classification method and system, a web page classification device, and a computer-readable storage medium.
Background
Phishing is a form of online fraud in which criminals use various means to counterfeit the URL and page content of a genuine site in order to steal users' private information, such as important accounts, bank or credit card account numbers, and passwords. Criminals usually design the pages of a phishing website to look identical to the interface of the genuine site, luring visitors into submitting their account names and passwords.
In recent years, many researchers have devised practical solutions to the anti-phishing problem. These solutions fall mainly into the following categories: (1) methods based on blacklists and whitelists; (2) methods that extract text, image, or URL features from the web page and use a search engine to help detect phishing websites; (3) methods that detect phishing websites by the visual similarity between a phishing page and a well-known page; (4) methods that find phishing websites using abnormal DNS information of the web page; (5) methods that extract text, image, or specific URL features from the HTML and then detect phishing websites with heuristic or machine learning algorithms. Among these, blacklist- and whitelist-based methods require constant manual maintenance of the lists; search-engine-based methods are often limited by the performance of the search engine and cannot detect in real time; visual-similarity methods are easily affected by the accuracy of target identification; and DNS-based methods need a third-party service to supply the DNS information, which makes development costly.
Therefore, how to improve the accuracy of web page classification without relying on a search engine or third-party services is a problem to be solved by those skilled in the art.
Summary of the invention
An object of the present invention is to provide a web page classification method and system, a web page classification device, and a computer-readable storage medium that improve the accuracy of web page classification without relying on a search engine or third-party services.
To achieve the above object, an embodiment of the present invention provides a web page classification method, comprising:
obtaining N-dimensional current features of a web page to be classified, where N is a positive integer;
inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N+n)-dimensional features of the web page to be classified, where the stacking model is a model in which p base classification models are stacked in q layers, n is the product of p and q, and n, p, and q are positive integers;
obtaining a classification result for the web page to be classified by applying a classification algorithm to the (N+n)-dimensional features.
Inputting the N-dimensional current features into the trained stacking model for feature expansion to obtain the (N+n)-dimensional features of the web page to be classified comprises:
S1: Obtain a training set and divide it into m portions of training samples, where each web page in the training set has N-dimensional features;
S2: Select one portion of the training samples and train the base classification model on it, then use the trained base classification model to predict the web page to be classified; repeat until all m portions have been selected, obtaining m intermediate prediction results, and vote on the m intermediate prediction results to obtain a one-dimensional added feature;
S3: Repeat S2 until all base classification models have been trained, obtaining p-dimensional added features, and merge the p-dimensional added features with the current features of the web page to be classified to form its new current features;
S4: Repeat S2 and S3 until the q layers of the stacking model have been stacked, obtaining the (N+n)-dimensional features of the web page to be classified.
If p = 3, the base classification models comprise a GBDT classification model, an XGBoost classification model, and a LightGBM classification model.
After the N-dimensional features of the web page to be classified are obtained, the method further comprises:
standardizing the N-dimensional features using the Z-score method.
The classification algorithm comprises the GBDT algorithm.
The N-dimensional current features comprise URL features and HTML features of the web page to be classified. The URL features include the number of top-level domains and similarity to well-known brand names. The HTML features include the number of empty anchor links, whether the brand in the title is consistent with the brand in the URL, whether the most frequently occurring brand in the HTML is consistent with the brand in the URL, the numbers of internal and external resources, and Word2vec features.
To achieve the above object, an embodiment of the present invention provides a web page classification system, comprising:
an acquisition module for obtaining N-dimensional current features of a web page to be classified, where N is a positive integer;
an expansion module for inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N+n)-dimensional features of the web page to be classified, where the stacking model is a model in which p base classification models are stacked in q layers, n is the product of p and q, and n, p, and q are positive integers;
a classification module for obtaining a classification result for the web page to be classified by applying a classification algorithm to the (N+n)-dimensional features.
The expansion module comprises:
a division unit for obtaining a training set and dividing it into m portions of training samples, where each web page in the training set has N-dimensional features;
a prediction unit for selecting one portion of the training samples and training the base classification model on it, using the trained base classification model to predict the web page to be classified, repeating until all m portions have been selected to obtain m intermediate prediction results, and voting on the m intermediate prediction results to obtain a one-dimensional added feature; the workflow of the prediction unit is repeated until all base classification models have been trained and p-dimensional added features are obtained, after which the workflow of the merging unit is started;
the merging unit, for merging the p-dimensional added features with the current features of the web page to be classified to form its new current features, and for restarting the workflow of the prediction unit until the q layers of the stacking model have been stacked, obtaining the (N+n)-dimensional features of the web page to be classified.
To achieve the above object, an embodiment of the present invention provides a web page classification device, comprising:
a memory for storing a web page classification program;
a processor for implementing the steps of the above web page classification method when executing the web page classification program.
To achieve the above object, an embodiment of the present invention provides a computer-readable storage medium on which a web page classification program is stored; when the web page classification program is executed by a processor, the above web page classification method is implemented.
According to the above scheme, the web page classification method provided by the embodiment of the present invention comprises: obtaining N-dimensional current features of a web page to be classified, where N is a positive integer; inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N+n)-dimensional features of the web page to be classified, where the stacking model is a model in which p base classification models are stacked in q layers, n is the product of p and q, and n, p, and q are positive integers; and obtaining a classification result for the web page to be classified by applying a classification algorithm to the (N+n)-dimensional features.
The web page classification method provided by the embodiment of the present invention uses the stacking model to expand the N-dimensional current features of the web page to be classified, improving the accuracy of web page classification without relying on a search engine or third-party services. The invention also discloses a web page classification system, a web page classification device, and a computer-readable storage medium, which achieve the same technical effect.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a web page classification method disclosed in an embodiment of the present invention;
Fig. 2 is a flow chart of another web page classification method disclosed in an embodiment of the present invention;
Fig. 3 is a flow chart of feature expansion in the stacking model of the other web page classification method disclosed in an embodiment of the present invention;
Fig. 4 is a structure diagram of a web page classification system disclosed in an embodiment of the present invention;
Fig. 5 is a structure diagram of a web page classification device disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention discloses a web page classification method that improves the accuracy of web page classification without relying on a search engine or third-party services.
Referring to Fig. 1, a flow chart of a web page classification method disclosed in an embodiment of the present invention, the method includes:
S101: Obtain N-dimensional current features of the web page to be classified, where N is a positive integer;
The web page classification method provided by this embodiment can be applied to the detection of phishing websites. In a specific implementation, the N-dimensional current features include URL features and HTML features of the web page to be classified. The URL features are described as follows:
IP address: the main domain of a phishing website is often just an IP address, for example http://62.141.45.54/PortaleTitolaris8/. Regular expression matching can determine whether the main domain of a website is an IP address. The feature is 1 if it is and 0 otherwise.
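The IP address check above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `host_is_ip` and the restriction to dotted IPv4 addresses are assumptions.

```python
import re
from urllib.parse import urlsplit

# Dotted-quad IPv4 pattern; each octet is limited to 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_RE = re.compile(rf"^{_OCTET}(?:\.{_OCTET}){{3}}$")

def host_is_ip(url: str) -> int:
    """Return 1 if the host part of the URL is an IPv4 address, else 0."""
    host = urlsplit(url).hostname or ""
    return 1 if IPV4_RE.match(host) else 0
```

For the example URL in the text, `host_is_ip("http://62.141.45.54/PortaleTitolaris8/")` evaluates to 1.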
Suspicious symbols: these include '@', '-', and '~'. If '@' appears in the authority part of a URL, the browser treats everything to the left of '@' as user information and connects to the host on its right, so a familiar name placed before '@' can mislead users. The symbols '-' and '~' are rarely used in the URLs of normal websites because their meaning is unclear.
https: HTTPS is the secure form of HTTP data transmission, providing authentication and encrypted communication.
Length information: mainly the total length of the URL and the length of the main domain.
Number of '.' in the main domain: to be easy for users to remember, the main domain of a normal website is usually short and usually contains no more than 3 '.' characters, for example www.baidu.com.
Sensitive words: the URLs of phishing websites often contain certain sensitive words. This embodiment can create a list of sensitive words, for example ['secure', 'account', 'webscr', 'login', 'ebayisapi', 'signin', 'banking', 'confirm', 'submit', 'update'], and count the number of occurrences of these words in the URL as the feature.
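The lexical URL features described above (suspicious symbols, https, lengths, dot count, sensitive words) could be collected in one pass. The sketch below is illustrative only: the function name, the dictionary keys, and the choice of case-insensitive substring counting are assumptions, while the word list is the one given in the text.

```python
from urllib.parse import urlsplit

SENSITIVE_WORDS = ["secure", "account", "webscr", "login", "ebayisapi",
                   "signin", "banking", "confirm", "submit", "update"]
SUSPICIOUS_SYMBOLS = "@-~"

def lexical_url_features(url: str) -> dict:
    """Compute simple lexical features of a URL (names are hypothetical)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()  # assumption: sensitive words are matched case-insensitively
    return {
        "suspicious_symbols": sum(url.count(c) for c in SUSPICIOUS_SYMBOLS),
        "uses_https": 1 if parts.scheme == "https" else 0,
        "url_length": len(url),
        "host_length": len(host),
        "host_dots": host.count("."),
        "sensitive_words": sum(lowered.count(w) for w in SENSITIVE_WORDS),
    }
```

Each value becomes one dimension of the feature vector; a real system would fix the key order so that every page yields features in the same positions.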
Abnormal top-level domains: top-level domains fall into two classes. The first is country and region top-level domains (country code top-level domains, abbreviated nTLDs), for example 'cn' for China and 'jp' for Japan. The second is international top-level domains (generic top-level domains, abbreviated gTLDs), for example '.com' for commercial enterprises, '.net' for network providers, and '.org' for non-profit organizations. Stuffgate has counted 735 top-level domains in use among the websites in the Alexa top 1,000,000. A URL is very suspicious if several common top-level domains appear in its main domain, as in http://www.ebay.com.urgd.com/path, or if a top-level domain still appears in the URL path, that is, in the part after the main domain. This embodiment can extract 3 features: whether the top-level domain is in the Stuffgate top-level domain list, the number of top-level domains in the main domain, and the number of top-level domains in the URL path.
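A rough sketch of the three top-level domain features follows. The set `KNOWN_TLDS` is a tiny hypothetical stand-in for the Stuffgate list of 735 entries, and the function name and the rule for spotting top-level domains in the path (a dot followed by letters) are assumptions.

```python
import re
from urllib.parse import urlsplit

# Tiny stand-in for the Stuffgate list of 735 top-level domains (assumption).
KNOWN_TLDS = {"com", "net", "org", "cn", "jp"}

def tld_features(url: str, known_tlds=KNOWN_TLDS) -> dict:
    """Return the three top-level domain features for a URL."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # Tokens in the path that look like ".tld", e.g. "/paypal.com/login".
    path_tokens = re.findall(r"\.([a-z]{2,})\b", parts.path.lower())
    return {
        "tld_in_list": 1 if labels[-1] in known_tlds else 0,
        "tlds_in_host": sum(1 for label in labels if label in known_tlds),
        "tlds_in_path": sum(1 for tok in path_tokens if tok in known_tlds),
    }
```

With this stand-in list, the example http://www.ebay.com.urgd.com/path yields two top-level domains in the host, which is the suspicious pattern the text describes.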
Similar well-known brand names: phishers exploit brands that are well known; for example, a phisher may change paypal to paypail to confuse users. The Levenshtein distance (edit distance) measures the similarity between two strings. It is the minimum number of edits needed to convert one string into the other, where an edit is the replacement, insertion, or deletion of a character. Computing the Levenshtein distance between given well-known brand names and the strings in the URL finds similar brand names.
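The edit distance described above can be computed with the standard dynamic programming recurrence. In the sketch below, the brand list `BRANDS`, the helper `closest_brand`, and the threshold `max_distance=2` are hypothetical choices for illustration; the text does not specify them.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

BRANDS = ["paypal", "ebay", "apple"]  # hypothetical list of well-known brands

def closest_brand(token: str, brands=BRANDS, max_distance: int = 2):
    """Return (brand, distance) for the nearest brand, or None if none is close."""
    best = min(brands, key=lambda b: levenshtein(token.lower(), b))
    distance = levenshtein(token.lower(), best)
    return (best, distance) if distance <= max_distance else None
```

A distance of 0 means the token is the brand itself; small nonzero distances, such as 1 for paypail against paypal, are the look-alike names the feature is meant to catch.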
HTML features are described as follows:
Numbers of internal and external links: an internal link is a link whose main domain is the same as the main domain of the page's URL. To make users believe that a page is legitimate, a phishing website usually uses external resources, especially resources of the site it impersonates, so it has very few internal links and relatively many external links. All links in the HTML are extracted, and the main domain of each is cut out and compared with the main domain of the URL to identify internal and external links.
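One way to count internal and external links with the standard library is sketched below. Several points are assumptions rather than details from the text: links are taken from `href` and `src` attributes, the "main domain" is approximated by the last two labels of the host name (a production system would consult the Public Suffix List), and non-HTTP links such as `mailto:` are ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

def main_domain(url: str) -> str:
    """Approximate the main domain by the last two host labels (assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])

class LinkCollector(HTMLParser):
    """Collect every non-empty href/src attribute value in the document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def count_links(page_url: str, html: str):
    """Return (internal, external) link counts relative to page_url."""
    parser = LinkCollector()
    parser.feed(html)
    internal = external = 0
    for link in parser.links:
        absolute = urljoin(page_url, link)  # resolve relative links
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue
        if main_domain(absolute) == main_domain(page_url):
            internal += 1
        else:
            external += 1
    return internal, external
```

Relative links resolve against the page URL and therefore count as internal.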
Empty anchor links: these take the two forms <a href=""></a> and <a href="#"></a>; clicking such a link has no effect. Phishing websites use them to make a page appear to contain many hyperlinks and so confuse users. This embodiment can count the empty anchor links as the feature.
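Counting the two empty anchor forms can be sketched with the standard library HTML parser. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class EmptyAnchorCounter(HTMLParser):
    """Count <a> tags whose href is empty or just '#'."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None and href.strip() in ("", "#"):
            self.count += 1

def count_empty_anchors(html: str) -> int:
    parser = EmptyAnchorCounter()
    parser.feed(html)
    return parser.count
```

Anchors without an `href` attribute at all (for example named anchors) are not counted, since the text only names the two forms with an empty or '#' target.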
Login window: phishing websites often lure users into revealing sensitive personal information through a login window. This embodiment uses the following logic to decide whether a page contains a login window. First find all <form> tags in the page, then find the <input> tags inside them, and finally match each <input> tag against keywords such as password and pass. If neither password nor pass is matched, fall back to matching keywords such as login and signin within all <form> tags.
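The two-stage login window check could look roughly like this. It is a sketch under assumptions: keywords are matched as case-insensitive substrings of attribute values and text, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class LoginFormDetector(HTMLParser):
    """Gather attribute values and text that occur inside <form> elements."""
    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.input_text = []  # attribute values of <input> tags inside forms
        self.form_text = []   # all attribute values and text inside forms

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        if self.form_depth == 0:
            return
        values = [v.lower() for _, v in attrs if v]
        self.form_text.extend(values)
        if tag == "input":
            self.input_text.extend(values)

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1

    def handle_data(self, data):
        if self.form_depth > 0:
            self.form_text.append(data.lower())

def has_login_form(html: str) -> int:
    detector = LoginFormDetector()
    detector.feed(html)
    # Stage 1: password-like keywords on <input> tags inside a form.
    if any(k in t for t in detector.input_text for k in ("password", "pass")):
        return 1
    # Stage 2: fall back to login-like keywords anywhere inside a form.
    if any(k in t for t in detector.form_text for k in ("login", "signin")):
        return 1
    return 0
```

An `<input type="password">` outside any `<form>` is deliberately ignored here, because the described logic starts from the `<form>` tags.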
Length features of the HTML content: the purpose of a phishing website is to steal users' login information, so its design tends to be simple and its HTML code is not handled in much detail. The most direct sign is that the code of a phishing website is usually shorter than that of a normal website. The length features are refined to the lengths of the contents of individual tags, namely '<style>', '<script>', '<link>', '<!-- -->', and '<form>'. These tags are chosen for the following reasons. The code in a '<style>' tag mainly sets the page style (CSS), and the designer of a phishing website, developing quickly, usually does not design the style elaborately. The '<!-- -->' tag holds code comments; a phishing website is developed once and not maintained afterwards, so its developers usually do not write comments.
Hidden or restricted information: this usually appears on <div>, <button>, and <input> tags. For <div>, both <div style="visibility:hidden"> and <div style="display:none"> hide the content of the <div> so that it is not shown in the rendered page. For <button>, <button disabled="disabled"> disables clicking the button. For <input>, <input type="hidden"> hides the input box, <input disabled="disabled"> disables input in the box, and <input value="hello"> pre-fills the input box with unrelated information.
Whether the brand in the title is consistent with the brand in the URL: the title of a web page generally contains the page's brand name, and phishing websites are no exception. The brand names of well-known websites correspond one-to-one with their main domains. If the brand of a page is known but is inconsistent with the brand in the main domain of its URL, the page is likely a phishing page.
Whether the most frequently occurring brand in the HTML is consistent with the brand in the URL: the principle is similar to that of internal and external links above. Most links in the pages of a normal website point to its own brand, so the brand occurring most often in its links should be consistent with the brand of its URL. A phishing website uses many linked resources of the target website, so the brand occurring most often in its links is inconsistent with the brand of its URL. First, all links in the HTML code are extracted. Then the brand names occurring in the main domains of these links and their numbers of occurrences are counted, giving a brand dictionary. The most frequently occurring brand name B is found in the brand dictionary and compared with the brand name A in the URL. If the two brand names are consistent, the feature is set to 0 and the number of occurrences of brand name B is output as a feature; otherwise the feature is set to 1.
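The most-frequent-brand comparison just described can be sketched as below. The brand list, the rule that a brand must equal one label of the host name, and the behaviour when no brand is found in any link (flag 1, count 0) are all assumptions. The text specifies the count output only for the consistent case; this sketch returns it in both cases.

```python
from collections import Counter
from urllib.parse import urlsplit

BRANDS = ["paypal", "ebay", "apple"]  # hypothetical brand list

def brand_in_host(url: str, brands=BRANDS):
    """Return the first listed brand that equals a label of the host, else None."""
    labels = (urlsplit(url).hostname or "").lower().split(".")
    for brand in brands:
        if brand in labels:
            return brand
    return None

def link_brand_feature(page_url: str, links, brands=BRANDS):
    """Return (mismatch_flag, occurrences of the most frequent link brand)."""
    found = (brand_in_host(link, brands) for link in links)
    counts = Counter(b for b in found if b)  # the brand dictionary
    if not counts:
        return 1, 0  # assumption: no brand in any link counts as inconsistent
    top_brand, top_count = counts.most_common(1)[0]
    mismatch = 0 if top_brand == brand_in_host(page_url, brands) else 1
    return mismatch, top_count
```

The `links` argument is expected to hold absolute URLs, for example the resolved links gathered while counting internal and external links.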
Internal and external resources: because phishing websites tend to use external resources, the numbers of internal and external resources used by each of the four tags '<link>', '<img>', '<script>', and '<noscript>' are counted as features.
Number of occurrences of the URL brand name in the HTML: because the HTML code of a normal website mostly uses internal resources, the brand name in its URL occurs relatively often in its HTML. In a phishing website it occurs less often.
Alert window: some phishing websites pop up an alert window and have users enter personal information in it.
Redirection: many phishing website designers first build a normal website; when a user follows a link to that website, redirection sends the user to the phishing website. The HTML code of a redirecting page generally contains the string 'redirect'.
Word2Vec features: Word2Vec is a widely used method in natural language processing. Using ideas from deep learning, it is trained to reduce the processing of text content to vector operations in a K-dimensional vector space, where similarity in the vector space represents semantic similarity of the text. word2vec, open-sourced by Google in 2013, is an efficient tool for representing words as real-valued vectors; its function is to map a word to a vector of a specified dimension. This embodiment uses word2vec to represent the HTML code as a K-dimensional vector and uses this vector as a feature. The specific procedure is as follows:
1. Remove all content of the <style>, <script>, and <noscript> tags from the HTML code;
2. Remove the remaining tags but keep the text content inside them; for example, "<div id="doc" class="wrap"><h2>This is my homepage</h2></div>" is converted into "This is my homepage";
3. Match Chinese and English in the text content according to certain rules, dividing the text into English content and Chinese content;
4. Segment the Chinese content of the text into words with the jieba tool and split the English content on spaces, obtaining a word list of k words for this HTML code;
5. Convert each word in the word list into a 300-dimensional vector with word2vec;
6. Add all the vectors and divide each dimension by k, finally obtaining one 300-dimensional vector that represents the text content of this HTML code.
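The six steps above can be sketched with the standard library alone. The sketch makes several substitutions that are assumptions, not the patent's tooling: Chinese text is split into single characters instead of being segmented with jieba, the `embeddings` dictionary stands in for a trained word2vec model, unknown words contribute a zero vector, and the demonstration uses 2 dimensions instead of 300.

```python
import re
from html.parser import HTMLParser

SKIPPED_TAGS = {"style", "script", "noscript"}

class TextExtractor(HTMLParser):
    """Steps 1-2: drop style/script/noscript content, keep all other text."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())

def tokenize(text: str):
    """Steps 3-4: English words, plus single Chinese characters standing in
    for jieba word segmentation (assumption)."""
    return re.findall(r"[A-Za-z0-9']+|[\u4e00-\u9fff]", text)

def mean_vector(tokens, embeddings, dim):
    """Steps 5-6: sum the word vectors and divide each dimension by k."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok.lower(), [0.0] * dim)  # unknown word: zeros
        total = [t + v for t, v in zip(total, vec)]
    k = len(tokens)
    return [t / k for t in total] if k else total
```

With a real model, `embeddings` would map each word to its 300-dimensional word2vec vector and `dim` would be 300.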
Extracting features by the methods described above and combining the URL features with the HTML features yields 338-dimensional features in total.
S102: Input the N-dimensional current features into a trained stacking model for feature expansion to obtain (N+n)-dimensional features of the web page to be classified, where the stacking model is a model in which p base classification models are stacked in q layers, n is the product of p and q, and n, p, and q are positive integers;
In a specific implementation, inputting the N-dimensional current features into the trained stacking model for feature expansion to obtain the (N+n)-dimensional features of the web page to be classified comprises the following steps:
S1: Obtain a training set and divide it into m portions of training samples, where each web page in the training set has N-dimensional features;
S2: Select one portion of the training samples and train the base classification model on it, then use the trained base classification model to predict the web page to be classified; repeat until all m portions have been selected, obtaining m intermediate prediction results, and vote on the m intermediate prediction results to obtain a one-dimensional added feature;
In a specific implementation, one portion of the training samples is selected at a time to train the base classification model, and each portion is selected once, so m prediction results are obtained.
S3: Repeat S2 until all base classification models have been trained, obtaining p-dimensional added features, and merge the p-dimensional added features with the current features of the web page to be classified to form its new current features;
S4: Repeat S2 and S3 until the q layers of the stacking model have been stacked, obtaining the (N+n)-dimensional features of the web page to be classified.
Steps S2 and S3 together constitute one layer of stacking in the stacking model. Repeating S2 and S3 until the q layers of the stacking model have been stacked yields the (N+n)-dimensional features of the web page to be classified and completes its feature expansion. This embodiment does not limit the number of layers of the stacking model; the number of layers is preferably 2, at which the system accuracy is highest.
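Steps S1 to S4 can be sketched as follows, under loudly stated assumptions. The two toy classifiers stand in for base models such as GBDT, XGBoost, and LightGBM; class labels are assumed to be numeric (0 or 1) so that votes can serve as features; and, because the text does not say how the training samples acquire their added features for later layers, this sketch assumes each training row is predicted by the models trained on the other portions (which requires m of at least 2).

```python
from collections import Counter

class NearestCentroid:
    """Toy stand-in for a base classifier such as GBDT (assumption)."""
    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, v in enumerate(row):
                acc[i] += v
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict_one(self, row):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[c]))
        return min(sorted(self.centroids), key=dist)

class OneNearestNeighbor:
    """Second toy base classifier (assumption)."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict_one(self, row):
        def dist(i):
            return sum((a - b) ** 2 for a, b in zip(row, self.X[i]))
        return self.y[min(range(len(self.X)), key=dist)]

def vote(predictions):
    """Majority vote over intermediate prediction results."""
    return Counter(predictions).most_common(1)[0][0]

def expand_features(train_X, train_y, page, base_models, m=3, q=2):
    """Return the page's features extended from N to N + p*q dimensions."""
    train_X = [list(r) for r in train_X]
    page = list(page)
    folds = [list(range(i, len(train_X), m)) for i in range(m)]  # S1: m portions
    for _ in range(q):                                           # S4: q layers
        new_train = [[] for _ in train_X]
        new_page = []
        for make_model in base_models:                           # S3: p models
            models = [make_model().fit([train_X[i] for i in f],
                                       [train_y[i] for i in f]) for f in folds]
            # S2: m intermediate predictions for the page, combined by voting.
            new_page.append(vote([mdl.predict_one(page) for mdl in models]))
            # Assumption: training rows get their added feature from the models
            # trained on the other portions, so the next layer has N+p columns.
            for k, f in enumerate(folds):
                others = [mdl for j, mdl in enumerate(models) if j != k]
                for i in f:
                    new_train[i].append(
                        vote([mdl.predict_one(train_X[i]) for mdl in others]))
        train_X = [r + extra for r, extra in zip(train_X, new_train)]
        page = page + new_page
    return page
```

With p = 2 toy models and q = 2 layers, a 2-dimensional input grows to 2 + 2 × 2 = 6 dimensions, mirroring the N + p·q rule of the method.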
S103: Obtain the classification result of the web page to be classified by applying a classification algorithm to the (N+n)-dimensional features.
In a specific implementation, the classification algorithm includes the GBDT algorithm; those skilled in the art may of course select other classification algorithms according to the actual situation, and no specific limitation is imposed here. For phishing website detection, the output is the detection result indicating whether the web page to be classified is a phishing website.
The web page classification method provided by this embodiment uses the stacking model to expand the N-dimensional current features of the web page to be classified, improving the accuracy of web page classification without relying on a search engine or third-party services.
An embodiment of the present invention discloses another web page classification method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically:
Referring to Fig. 2 and Fig. 3, Fig. 2 is a flow chart of another web page classification method provided by an embodiment of the present invention, and Fig. 3 is a flow chart of feature expansion in the stacking model of this web page classification method. As shown in Fig. 2, the method includes:
S211: Obtain 338-dimensional current features of the web page to be classified;
S212: Standardize the 338-dimensional current features using the Z-score method;
S221: Obtain a training set and divide it into m portions of training samples, where each web page in the training set has 338-dimensional features;
S222: Select one portion of the training samples and train the base classification model on it, then use the trained base classification model to predict the web page to be classified; repeat until all m portions have been selected, obtaining m intermediate prediction results, and vote on the m intermediate prediction results to obtain a one-dimensional added feature;
In this embodiment, the base classification models include a GBDT classification model, an XGBoost classification model, and a LightGBM classification model. Those skilled in the art may of course select other classification models according to the actual situation, all of which fall within the scope of protection of the present invention.
S223: Repeat S222 until all base classification models have been trained, obtaining three-dimensional added features, and merge the three-dimensional added features with the current features of the web page to be classified to form its new current features;
S224: Repeat S222 and S223 until the two layers of the stacking model have been stacked, obtaining 344-dimensional features of the web page to be classified;
S203: Obtain the classification result of the web page to be classified by applying the GBDT algorithm to the 344-dimensional features.
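The Z-score standardization in S212 and the dimension count in S224 can be illustrated briefly. The sketch assumes the population standard deviation is used and that a constant column maps to 0.0; the text specifies neither, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def fit_zscore(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]

def apply_zscore(row, means, stds):
    """Standardize one row: z = (x - mean) / std, column by column."""
    # A constant column has zero deviation; map it to 0.0 instead of dividing by zero.
    return [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]

def expanded_dims(N: int, p: int, q: int) -> int:
    """Feature count after stacking p base models in q layers."""
    return N + p * q
```

With N = 338, p = 3, and q = 2, `expanded_dims` gives 338 + 3 × 2 = 344, matching the 344-dimensional features in S224. The statistics would normally be fitted on the training set and then applied to each page to be classified.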
For the above embodiment, a larger data set S was chosen, containing the page source code of 10000 normal websites and 10000 phishing websites. The normal websites come from the websites ranked 10000 to 12000 on Alexa and some of the links within those websites, which ensures that the URL lengths are evenly distributed. The phishing websites are verified phishing websites listed on Phishtank from June 2009 to June 2017. In addition, for ease of comparison, we also collected a small data set T as the test set. The normal websites in the test set come from the websites ranked 10000 to 12000 on Alexa and some of the links within those websites, 1000 in total; the phishing websites are verified phishing links listed on Phishtank from July 12, 2017 to July 15, 2017, 1000 in total.
Accuracy, false dismissal rate (miss rate), and false alarm rate are used as the evaluation indicators. Let P be the number of phishing websites in the test set, L the number of legitimate websites in the test set, α the number of phishing websites predicted correctly, and β the number of legitimate websites predicted correctly. The accuracy is then calculated as follows:
$$\text{Accuracy} = \frac{\alpha + \beta}{P + L}$$
The false dismissal rate is calculated as follows:
$$\text{False dismissal rate} = \frac{P - \alpha}{P}$$
The false alarm rate is calculated as follows:
$$\text{False alarm rate} = \frac{L - \beta}{L}$$
To demonstrate the effectiveness of the stacking model in this embodiment, 3,000 phishing websites and 3,000 legitimate websites were randomly selected from data set S as the training set; the models were trained on it and then evaluated on the test set. The performance of each model is shown in Table 1 below. It can be seen that the method provided in this embodiment outperforms the other single models in accuracy, false dismissal rate, and false alarm rate.
Table 1. Performance of the various models
Many research results and practical applications already exist in the field of phishing website detection. Prior-art methods were applied to our data set for testing, and the results are shown in Table 2 below:
Table 2. Performance comparison of the methods
Method | False dismissal rate (%) | False alarm rate (%) | Accuracy (%)
Cantina | 70 | 7.5 | 61.25
Varshney | 7.6 | 48 | 72.2
Rakesh | 7.8 | 9.5 | 91.35
This embodiment | 3.4 | 3.7 | 96.45
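As a quick sanity check, the three evaluation indicators can be computed from the quantities P, L, α, and β defined earlier. The sketch below assumes the usual definitions (missed phishing pages over P, misflagged legitimate pages over L, correct predictions over P + L); the function name `evaluate` is hypothetical, and the counts 966 and 963 are back-computed from the rates reported for this embodiment rather than taken from the source.

```python
def evaluate(P, L, alpha, beta):
    """Return (false dismissal rate, false alarm rate, accuracy) in percent.

    Assumed definitions: (P - alpha) / P, (L - beta) / L, (alpha + beta) / (P + L).
    """
    false_dismissal = (P - alpha) / P * 100
    false_alarm = (L - beta) / L * 100
    accuracy = (alpha + beta) / (P + L) * 100
    return false_dismissal, false_alarm, accuracy

# Test set T holds 1000 phishing and 1000 legitimate pages.
# alpha and beta are back-computed from the reported 3.4 % and 3.7 % (assumption).
fd, fa, acc = evaluate(P=1000, L=1000, alpha=966, beta=963)
print(round(fd, 2), round(fa, 2), round(acc, 2))  # 3.4 3.7 96.45
```

Under these definitions and a balanced test set, the accuracy equals 100 minus the mean of the two error rates, which agrees with every row of Table 2 (for example, 100 - (70 + 7.5) / 2 = 61.25 for Cantina).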
A web page classification system provided by an embodiment of the present invention is introduced below. The web page classification system described below and the web page classification method described above may be cross-referenced.
Referring to Fig. 4, a structural diagram of a web page classification system provided by an embodiment of the present invention, the system includes:
Acquisition module 401, configured to obtain the N-dimensional current features of the webpage to be classified, where N is a positive integer;
Expansion module 402, configured to input the N-dimensional current features into a trained stacking model for feature extension, obtaining the (N+n)-dimensional features of the webpage to be classified, where the stacking model is a model in which p base classification models are stacked over q layers, n is the product of p and q, and n, p, and q are positive integers;
Classification module 403, configured to obtain the classification result of the webpage to be classified from the (N+n)-dimensional features using a classification algorithm.
The web page classification system provided by this embodiment of the present invention uses a stacking model to extend the N-dimensional current features of the webpage to be classified, improving the accuracy of web page classification without relying on search engines or third-party services.
On the basis of the above embodiment, preferably, the expansion module 402 includes:
A division unit, configured to obtain a training set and divide the training set into m parts of training samples, where each webpage in the training set includes N-dimensional features;
A prediction unit, configured to select one part of the training samples to train the base classification model and to use the trained base classification model to predict the webpage to be classified, until all m parts of the training samples have been selected, obtaining m intermediate prediction results; to vote on the m intermediate prediction results to obtain a one-dimensional added feature; and to restart the workflow of the prediction unit repeatedly until all of the base classification models have been trained, obtaining p-dimensional added features, and then to start the workflow of the merging unit;
The merging unit, configured to merge the p-dimensional added features with the initial features of the webpage to be classified, take the result as the initial features of the webpage to be classified, and start the workflow of the prediction unit, until the q-layer stacking of the stacking model is completed, obtaining the (N+n)-dimensional features of the webpage to be classified.
On the basis of the above embodiment, preferably, the system further includes:
A standardization module, configured to standardize the N-dimensional features using the Z-score method.
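Z-score standardization rescales each feature dimension to zero mean and unit standard deviation. A minimal sketch follows; the function name `zscore_columns` is hypothetical, and the use of the population standard deviation and the mapping of constant columns to 0.0 are assumptions, since the text does not specify these details.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each feature column: z = (value - column mean) / column std."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]   # population std (assumption)
    return [[(v - mu) / s if s else 0.0  # constant column -> 0.0 (assumption)
             for v, mu, s in zip(row, mus, sigmas)]
            for row in rows]

scaled = zscore_columns([[1, 10], [2, 10], [3, 10]])
print(scaled[1])  # [0.0, 0.0]
```

In practice the column means and standard deviations would be computed on the training set and reused for the webpage to be classified, so that both are scaled identically.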
The present application also provides a web page classification device. Referring to Fig. 5, a structural diagram of a web page classification device provided by an embodiment of the present invention, the device includes:
Memory 501, configured to store a web page classification program;
Processor 502, configured to implement the steps provided by the above embodiments when executing the web page classification program. Of course, the web page classification device may also include components such as various network interfaces and a power supply.
The web page classification device provided by this embodiment of the present invention uses a stacking model to extend the N-dimensional current features of the webpage to be classified, improving the accuracy of web page classification without relying on search engines or third-party services.
The present application also provides a computer-readable storage medium on which a web page classification program is stored; when the web page classification program is executed by a processor, the steps provided by the above embodiments are implemented. The storage medium may include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media capable of storing program code.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method. It should be noted that those of ordinary skill in the art may make several improvements and modifications to the present application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present application.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

Claims (10)

  1. A web page classification method, characterized by comprising:
    obtaining the N-dimensional current features of a webpage to be classified, where N is a positive integer;
    inputting the N-dimensional current features into a trained stacking model for feature extension to obtain the (N+n)-dimensional features of the webpage to be classified, where the stacking model is a model in which p base classification models are stacked over q layers, n is the product of p and q, and n, p, and q are positive integers;
    obtaining the classification result of the webpage to be classified from the (N+n)-dimensional features using a classification algorithm.
  2. The web page classification method according to claim 1, characterized in that inputting the N-dimensional current features into the trained stacking model for feature extension to obtain the (N+n)-dimensional features of the webpage to be classified comprises:
    S1: obtaining a training set and dividing the training set into m parts of training samples, where each webpage in the training set includes N-dimensional features;
    S2: selecting one part of the training samples to train the base classification model, and using the trained base classification model to predict the webpage to be classified, until all m parts of the training samples have been selected, obtaining m intermediate prediction results; and voting on the m intermediate prediction results to obtain a one-dimensional added feature;
    S3: repeating S2 until all of the base classification models have been trained, obtaining p-dimensional added features; and merging the p-dimensional added features with the current features of the webpage to be classified, the result serving as the current features of the webpage to be classified;
    S4: repeating S2-S3 until the q-layer stacking of the stacking model is completed, obtaining the (N+n)-dimensional features of the webpage to be classified.
  3. The web page classification method according to claim 2, characterized in that, if p = 3, the base classification models include a GBDT classification model, an XGBoost classification model, and a LightGBM classification model.
  4. The web page classification method according to claim 1, characterized in that, after obtaining the N-dimensional features of the webpage to be classified, the method further comprises:
    standardizing the N-dimensional features using the Z-score method.
  5. The web page classification method according to claim 1, characterized in that the classification algorithm includes the GBDT algorithm.
  6. The web page classification method according to any one of claims 1-5, characterized in that the N-dimensional current features include URL features and HTML features of the webpage to be classified; the URL features include the number of top-level domains and similar well-known brand names; and the HTML features include the number of empty anchor links, whether the brand in the title is consistent with the brand in the URL, whether the most frequently occurring brand in the HTML is consistent with the brand in the URL, the numbers of internal and external resources, and Word2vec features.
  7. A web page classification system, characterized by comprising:
    an acquisition module, configured to obtain the N-dimensional current features of a webpage to be classified, where N is a positive integer;
    an expansion module, configured to input the N-dimensional current features into a trained stacking model for feature extension to obtain the (N+n)-dimensional features of the webpage to be classified, where the stacking model is a model in which p base classification models are stacked over q layers, n is the product of p and q, and n, p, and q are positive integers;
    a classification module, configured to obtain the classification result of the webpage to be classified from the (N+n)-dimensional features using a classification algorithm.
  8. The web page classification system according to claim 7, characterized in that the expansion module comprises:
    a division unit, configured to obtain a training set and divide the training set into m parts of training samples, where each webpage in the training set includes N-dimensional features;
    a prediction unit, configured to select one part of the training samples to train the base classification model and to use the trained base classification model to predict the webpage to be classified, until all m parts of the training samples have been selected, obtaining m intermediate prediction results; to vote on the m intermediate prediction results to obtain a one-dimensional added feature; and to restart the workflow of the prediction unit repeatedly until all of the base classification models have been trained, obtaining p-dimensional added features, and then to start the workflow of the merging unit;
    the merging unit, configured to merge the p-dimensional added features with the initial features of the webpage to be classified, take the result as the initial features of the webpage to be classified, and start the workflow of the prediction unit, until the q-layer stacking of the stacking model is completed, obtaining the (N+n)-dimensional features of the webpage to be classified.
  9. A web page classification device, characterized by comprising:
    a memory, configured to store a web page classification program;
    a processor, configured to implement the steps of the web page classification method according to any one of claims 1 to 6 when executing the web page classification program.
  10. A computer-readable storage medium, characterized in that a web page classification program is stored on the computer-readable storage medium, and the web page classification method according to any one of claims 1 to 6 is implemented when the web page classification program is executed by a processor.
CN201711285419.6A 2017-12-07 2017-12-07 Webpage classification method and system and webpage classification equipment Active CN108038173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711285419.6A CN108038173B (en) 2017-12-07 2017-12-07 Webpage classification method and system and webpage classification equipment

Publications (2)

Publication Number Publication Date
CN108038173A true CN108038173A (en) 2018-05-15
CN108038173B CN108038173B (en) 2021-11-26

Family

ID=62096244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711285419.6A Active CN108038173B (en) 2017-12-07 2017-12-07 Webpage classification method and system and webpage classification equipment

Country Status (1)

Country Link
CN (1) CN108038173B (en)

Also Published As

Publication number Publication date
CN108038173B (en) 2021-11-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant