CN108038173B - Webpage classification method and system and webpage classification equipment - Google Patents

Info

Publication number
CN108038173B
Authority
CN
China
Prior art keywords
classification
dimensional
classified
webpage
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711285419.6A
Other languages
Chinese (zh)
Other versions
CN108038173A (en)
Inventor
刘文印
黎宇坤
陈旭
袁华平
杨振国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201711285419.6A
Publication of CN108038173A
Application granted
Publication of CN108038173B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage classification method comprising the following steps: acquiring N-dimensional current features of a webpage to be classified, where N is a positive integer; inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N + n)-dimensional features of the webpage to be classified, where the stacking model is formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers; and obtaining a classification result of the webpage to be classified from the (N + n)-dimensional features by using a classification algorithm. By expanding the N-dimensional current features of the webpage with the stacking model, the disclosed method improves the accuracy of webpage classification without depending on a search engine or a third-party service. The invention also discloses a webpage classification system, webpage classification equipment and a computer-readable storage medium, which achieve the same technical effect.

Description

Webpage classification method and system and webpage classification equipment
Technical Field
The present invention relates to the field of network security technologies, and in particular, to a method and a system for classifying web pages, a web page classification device, and a computer-readable storage medium.
Background
Phishing is a form of online fraud in which criminals imitate the URL and page content of a genuine website in order to steal users' private information, such as important account numbers, bank or credit card numbers, and passwords. Phishers usually design the pages of a phishing website to look identical to the genuine website's interface and entice visitors to submit their account names and passwords.
In recent years, many researchers have designed practical anti-phishing solutions. These solutions fall mainly into the following categories: (1) methods based on blacklists and whitelists; (2) methods that extract text, image or URL features from the webpage and use a search engine to help detect phishing websites; (3) methods that detect a phishing website by the visual similarity between the phishing webpage and a known webpage; (4) methods that discover phishing websites from abnormal DNS information of the webpage; and (5) methods that extract text, image or special URL features from the HTML and then detect the phishing website with a heuristic or machine learning algorithm. Among these, blacklist- and whitelist-based methods require the lists to be maintained manually and continuously; search-engine-based methods are limited by the performance of the search engine and cannot achieve real-time detection; visual-similarity-based methods are easily affected by the accuracy of target recognition; and DNS-based methods require a third-party service to provide DNS information, so the development cost is high.
Therefore, how to improve the accuracy of webpage classification without depending on a search engine or a third-party service is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a webpage classification method, a webpage classification system, webpage classification equipment and a computer readable storage medium, which improve the accuracy of webpage classification on the premise of not depending on a search engine or a third-party service.
In order to achieve the above object, an embodiment of the present invention provides a method for classifying web pages, including:
acquiring N-dimensional current characteristics of a webpage to be classified; wherein N is a positive integer;
inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N + n)-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;
and obtaining a classification result of the webpage to be classified from the (N + n)-dimensional features by using a classification algorithm.
Wherein inputting the N-dimensional current features into the trained stacking model for feature expansion to obtain the (N + n)-dimensional features of the webpage to be classified comprises:
S1: acquiring a training set, and dividing the training set into m parts of training samples; wherein each webpage in the training set has N-dimensional features;
S2: selecting one part of the training samples to train the basic classification model, and predicting the webpage to be classified with the trained basic classification model, until all m parts of training samples have been selected and m intermediate prediction results are obtained; then voting on the m intermediate prediction results to obtain a one-dimensional added feature;
S3: repeating S2 until all the basic classification models have been trained, obtaining p-dimensional added features, and merging the p-dimensional added features with the current features of the webpage to be classified to serve as the new current features of the webpage to be classified;
S4: repeating S2-S3 until the q layers of the stacking model have been stacked, obtaining the (N + n)-dimensional features of the webpage to be classified.
Wherein, if p is 3, the basic classification models comprise a GBDT classification model, an XGBoost classification model and a LightGBM classification model.
After acquiring the N-dimensional features of the web pages to be classified, the method further includes:
normalizing the N-dimensional features using the Z-score method.
Wherein the classification algorithm comprises a GBDT algorithm.
Wherein the N-dimensional current features comprise URL features and HTML features of the webpage to be classified; the URL features comprise the number of top-level domain names and similarity to well-known brands, and the HTML features comprise the number of empty anchor links, whether the brand in the title is consistent with the brand in the URL, whether the most frequently occurring brand in the HTML is consistent with the brand in the URL, the numbers of internal and external resources, and Word2vec features.
In order to achieve the above object, an embodiment of the present invention provides a web page classification system, including:
the acquisition module is used for acquiring the N-dimensional current characteristics of the web pages to be classified; wherein N is a positive integer;
the expansion module is used for inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N + n)-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;
and the classification module is used for obtaining a classification result of the webpage to be classified from the (N + n)-dimensional features by using a classification algorithm.
Wherein the expansion module comprises:
a dividing unit, used for acquiring a training set and dividing the training set into m parts of training samples; wherein each webpage in the training set has N-dimensional features;
a prediction unit, used for selecting one part of the training samples to train the basic classification model and predicting the webpage to be classified with the trained basic classification model, until all m parts of training samples have been selected and m intermediate prediction results are obtained; voting on the m intermediate prediction results to obtain a one-dimensional added feature; repeating this workflow until all the basic classification models have been trained and p-dimensional added features are obtained; and then starting the workflow of the merging unit;
a merging unit, used for merging the p-dimensional added features with the current features of the webpage to be classified to serve as the new current features of the webpage to be classified, and restarting the workflow of the prediction unit until the q layers of the stacking model have been stacked, obtaining the (N + n)-dimensional features of the webpage to be classified.
In order to achieve the above object, an embodiment of the present invention provides a web page classification device, including:
a memory for storing a web page classification program;
and the processor is used for realizing the steps of the webpage classification method when executing the webpage classification program.
To achieve the above object, an embodiment of the present invention provides a computer-readable storage medium, on which a web page classification program is stored, and the web page classification program, when executed by a processor, implements the web page classification method.
According to the above scheme, the webpage classification method provided by the embodiment of the invention comprises the following steps: acquiring N-dimensional current features of a webpage to be classified, where N is a positive integer; inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N + n)-dimensional features of the webpage to be classified, where the stacking model is formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers; and obtaining a classification result of the webpage to be classified from the (N + n)-dimensional features by using a classification algorithm.
According to the webpage classification method provided by the embodiment of the invention, the N-dimensional current characteristics of the webpage to be classified are expanded by using the stacking model, and the accuracy of webpage classification is improved on the premise of not depending on a search engine or third-party service. The invention also discloses a webpage classification system, webpage classification equipment and a computer readable storage medium, which can also realize the technical effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for classifying web pages according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for classifying web pages according to an embodiment of the present invention;
FIG. 3 is a flowchart of feature extension in a Stacking model of another webpage classification method disclosed in the embodiments of the present invention;
FIG. 4 is a block diagram of a web page classification system according to an embodiment of the present invention;
FIG. 5 is a structural diagram of a web page classification device disclosed in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a webpage classification method, which improves the accuracy of webpage classification on the premise of not depending on a search engine or third-party service.
Referring to fig. 1, a flowchart of a method for classifying web pages disclosed in the embodiment of the present invention is shown in fig. 1, and includes:
S101: acquiring N-dimensional current features of a webpage to be classified; wherein N is a positive integer;
the webpage classification method provided by the embodiment of the invention can be applied to detection of phishing websites, and in specific implementation, the N-dimensional current characteristics comprise URL characteristics and HTML characteristics of the webpage to be classified. The URL features are described as follows:
For IP addresses: the main domain name of a phishing website is often just an IP address, for example http://62.141.45.54/portaleTitolaris8/. Whether the main domain name of a website is an IP address can be determined by regular-expression matching. If it is, the feature is 1; otherwise it is 0.
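The IP-address check can be sketched with a regular expression, as the text suggests. This is a minimal sketch: the function name `host_is_ip` is hypothetical, it assumes the "main domain name" is the URL's host, and it handles IPv4 only.

```python
import re
from urllib.parse import urlparse

# Dotted-quad IPv4 pattern; each octet is limited to 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
_IPV4_RE = re.compile(rf"^{_OCTET}(?:\.{_OCTET}){{3}}$")


def host_is_ip(url: str) -> int:
    """Return 1 if the URL's host is an IPv4 address, otherwise 0."""
    host = urlparse(url).hostname or ""  # assumption: host == "main domain name"
    return 1 if _IPV4_RE.match(host) else 0
```

For the example URL in the text, `host_is_ip("http://62.141.45.54/portaleTitolaris8/")` yields 1.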
For suspicious symbols: these include '@' and '-'. If the '@' symbol appears in a URL, the browser ignores the part of the address to the left of the '@' when resolving the site, so the real host is whatever follows it. The '-' symbol is rarely used in the URLs of normal websites because its meaning is unclear.
For https, https is a secure http data transmission method, providing authentication and encrypted communications.
The length information mainly includes the total length of the URL address and the length of the main domain name.
For the number of '.' characters in the main domain name: the main domain name of a normal website is usually short so that users can remember it easily, and the number of '.' characters in it generally does not exceed 3, for example www.baidu.com.
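The simpler lexical URL features described above (suspicious symbols, https, length information and the dot count) could be gathered in one pass. This is a minimal sketch: the function name and dictionary keys are hypothetical, and the "main domain name" is again assumed to be the URL's host.

```python
from urllib.parse import urlparse


def lexical_url_features(url: str) -> dict:
    """Collect simple lexical features of a URL (illustrative names)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # for "user@host" this is the part after '@'
    return {
        "has_at": int("@" in url),             # suspicious '@' symbol
        "has_hyphen": int("-" in url),         # suspicious '-' symbol
        "is_https": int(parsed.scheme == "https"),
        "url_length": len(url),                # total length of the URL
        "host_length": len(host),              # length of the main domain name
        "host_dots": host.count("."),          # number of '.' in the host
    }
```

With a URL such as `http://www.paypal.com@evil-site.example/login` (a made-up address), the parsed host is `evil-site.example`, which shows why an '@' in the URL is treated as suspicious.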
For sensitive words: certain sensitive words are often used in the URL addresses of phishing websites. In this embodiment, a list of sensitive words can be created, for example ['secure', 'account', 'webscr', 'login', 'ebayisapi', 'sign', 'banking', 'confirm', 'submit', 'update'], and the number of these sensitive words appearing in the URL address is counted as the feature.
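Counting sensitive words is a simple substring test. The sketch below assumes the feature is the number of distinct list entries that occur in the URL (the text could also be read as counting every occurrence), and the word list is an illustrative one modeled on the example in the text.

```python
# Illustrative word list (assumption), modeled on the example in the text.
SENSITIVE_WORDS = (
    "secure", "account", "webscr", "login", "ebayisapi",
    "sign", "banking", "confirm", "submit", "update",
)


def count_sensitive_words(url: str, words=SENSITIVE_WORDS) -> int:
    """Count how many distinct sensitive words occur in the URL."""
    lowered = url.lower()
    return sum(1 for word in words if word in lowered)
```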
For unusual top-level domains: top-level domains are divided into two categories. The first is country and region top-level domains (country code top-level domains, ccTLDs for short), for example 'cn' for China and 'jp' for Japan. The second is international top-level domains (generic top-level domains, gTLDs), such as 'com' for commercial enterprises, 'net' for network providers and 'org' for non-profit organizations. Stuffgate counted 735 top-level domain names used by the top 1 million Alexa-ranked websites. If several common top-level domain names appear in a website's main domain name, for example http://www.ebay.com.urgd.com/path, or if a top-level domain name still appears in the URL path (the section following the main domain name), the URL is very suspicious. In this embodiment, 3 features can be extracted: whether the top-level domain name is in Stuffgate's list of top-level domain names, the number of top-level domain names in the main domain name, and the number of top-level domain names in the URL path.
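The three top-level-domain features can be sketched as follows. The set `COMMON_TLDS` is a tiny hypothetical stand-in for the Stuffgate list mentioned in the text, and the function and key names are illustrative.

```python
import re
from urllib.parse import urlparse

# Tiny illustrative subset (assumption); the text uses Stuffgate's full list.
COMMON_TLDS = frozenset({"com", "net", "org", "cn", "jp"})


def tld_features(url: str, known_tlds=COMMON_TLDS) -> dict:
    """Extract the three TLD-related features described in the text."""
    parsed = urlparse(url)
    host_labels = (parsed.hostname or "").split(".")
    # Split the path on anything that is not a letter or digit.
    path_tokens = re.split(r"[^a-z0-9]+", parsed.path.lower())
    return {
        "tld_in_list": int(host_labels[-1] in known_tlds),
        "tlds_in_host": sum(label in known_tlds for label in host_labels),
        "tlds_in_path": sum(token in known_tlds for token in path_tokens),
    }
```

For the example `http://www.ebay.com.urgd.com/path`, 'com' occurs twice among the host labels, so `tlds_in_host` is 2.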
For similarity to well-known brands: phishers often exploit relatively popular brands; for example, a phisher may slightly alter 'paypal' to confuse users. The Levenshtein distance (also called the edit distance) measures the similarity between two character strings: it is the minimum number of edits (character replacements, insertions and deletions) needed to convert one string into the other. Look-alike brand names can be found by computing the Levenshtein distance between a given well-known brand and a string in the URL.
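The Levenshtein distance is the standard dynamic-programming edit distance. In the sketch below, the helper `closest_brand`, its `max_distance` threshold and the token 'paypa1' are hypothetical illustrations rather than values from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # replace (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_brand(token: str, brands, max_distance: int = 2):
    """Return (brand, distance) if token is a near miss of a brand, else None."""
    best = min(brands, key=lambda brand: levenshtein(token, brand))
    distance = levenshtein(token, best)
    # Distance 0 is the genuine brand; only near misses are look-alikes.
    return (best, distance) if 0 < distance <= max_distance else None
```

A URL token such as 'paypa1' is one edit away from 'paypal', so it would be flagged as a look-alike of that brand.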
The HTML features are described as follows:
For the numbers of inner and outer links: an inner link is a link whose main domain name is the same as the main domain name of the page's URL address. Phishing websites typically rely on external resources, especially those of the site being counterfeited, to trick users into believing that the webpage is legitimate; as a result, a phishing website has very few inner links and a large number of outer links. The inner and outer links are identified by extracting the main domain name of every link in the HTML and comparing it with the main domain name of the URL address.
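Counting inner and outer links can be sketched with Python's standard `html.parser`. The sketch assumes that only `<a href>` links are counted, that relative links resolve against the page URL, and that hosts are compared exactly; the function name is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def count_inner_outer_links(page_url: str, html: str):
    """Return (inner, outer) link counts relative to the page's host."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    page_host = urlparse(page_url).hostname
    inner = outer = 0
    for href in collector.links:
        host = urlparse(urljoin(page_url, href)).hostname
        if host is None:  # e.g. javascript: or mailto: links
            continue
        if host == page_host:
            inner += 1
        else:
            outer += 1
    return inner, outer
```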
For empty anchor links: an empty anchor link, such as <a href='#'>, produces no reaction when clicked. Phishing websites use such links so that the webpage appears to contain many hyperlinks, which confuses the user. The number of empty anchor links is counted as the feature.
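Counting empty anchors can be sketched in the same style. Which `href` values count as "empty" is an assumption here (an empty string or a bare '#'); the set is a parameter so other forms can be added.

```python
from html.parser import HTMLParser


class _EmptyAnchorCounter(HTMLParser):
    """Count <a> tags whose href points nowhere."""

    def __init__(self, empty_targets):
        super().__init__()
        self.empty_targets = empty_targets
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "href" in attributes and (attributes["href"] or "").strip() in self.empty_targets:
            self.count += 1


def count_empty_anchors(html: str, empty_targets=frozenset({"", "#"})) -> int:
    """Return the number of empty anchor links (assumed forms: '' and '#')."""
    counter = _EmptyAnchorCounter(empty_targets)
    counter.feed(html)
    counter.close()
    return counter.count
```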
For the login window: phishing websites often induce users to reveal sensitive personal information through a login window. The following logic is used to judge whether a webpage contains a login window. First, all <form> tags in the page are found; then the <input> tags inside them are found; finally, each <input> tag is matched against keywords such as 'password' and 'pass'. If neither 'password' nor 'pass' matches, keywords such as 'login' and 'sign in' are matched within all <form> tags.
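The login-window logic above can be sketched as a two-stage check. This reading makes some assumptions: the 'pass' keyword is matched against the attribute values of each `<input>`, and the fallback keywords ('login', 'sign in') are matched against the text and attribute values inside `<form>` tags. All names are illustrative.

```python
from html.parser import HTMLParser


class _FormScanner(HTMLParser):
    """Track <input> attributes and text that appear inside <form> tags."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.password_input = False
        self.form_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        if not self.form_depth:
            return
        values = " ".join(value for _, value in attrs if value).lower()
        self.form_text.append(values)
        if tag == "input" and "pass" in values:  # "pass" also covers "password"
            self.password_input = True

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth:
            self.form_depth -= 1

    def handle_data(self, data):
        if self.form_depth:
            self.form_text.append(data.lower())


def has_login_form(html: str, fallback_keywords=("login", "sign in")) -> int:
    """Return 1 if the page appears to contain a login window, otherwise 0."""
    scanner = _FormScanner()
    scanner.feed(html)
    scanner.close()
    if scanner.password_input:
        return 1
    text = " ".join(scanner.form_text)
    return int(any(keyword in text for keyword in fallback_keywords))
```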
For the length features of HTML content: because phishing websites aim only to steal users' login information, they are simply designed and their HTML code is not elaborate; the most direct sign is that the code of a phishing website is usually shorter than that of a normal website. The length feature is refined to the length of the content of individual tags, namely '<style>', '<script>', '<link>', '<!-- -->' and '<form>'. These tags are chosen for the following reasons. For example, the code in the '<style>' tag mainly sets the style (CSS) of the webpage, and the designer of a phishing website, developing quickly, usually does not put much effort into page styling. Similarly, the '<!-- -->' tag holds code comments; a phishing website is developed once and never maintained afterwards, so its developers usually do not write comments.
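Per-tag content lengths could be measured as sketched below. The sketch makes assumptions: length is counted in characters of text content, and `<link>` is left out because it is a void element with no text content. The names are illustrative.

```python
from html.parser import HTMLParser


class _ContentLength(HTMLParser):
    """Accumulate text length inside tracked tags and inside comments."""

    TRACKED = ("style", "script", "form")

    def __init__(self):
        super().__init__()
        self.lengths = {tag: 0 for tag in self.TRACKED}
        self.lengths["comment"] = 0
        self._open = []  # tracked tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        for tag in set(self._open):
            self.lengths[tag] += len(data)

    def handle_comment(self, data):
        self.lengths["comment"] += len(data)


def tag_content_lengths(html: str) -> dict:
    """Return character counts of <style>, <script>, <form> and comment content."""
    meter = _ContentLength()
    meter.feed(html)
    meter.close()
    return meter.lengths
```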
For hidden/restricted information: this typically appears on the '<div>', '<button>' and '<input>' tags. <div> tag: the content inside the <div> is hidden and is not displayed in the rendered page. <button> tag: <button disabled="disabled"> disables the click function of the button. <input> tag: <input type="hidden"> hides the input box, <input disabled="disabled"> disables input to the box, and <input value="hello"> pre-fills the box with irrelevant information.
For whether the title brand is consistent with the brand in the URL, the title of a general webpage will contain the brand name of the webpage, and so will the phishing website. The brand names of known websites are one-to-one with their main domain name, and if the brand of the web page is known but the main domain name in the URL does not match the brand, then the web page is likely to be a phishing web page.
For whether the most frequently occurring brand in the HTML is consistent with the brand in the URL: the principle is similar to that of inner and outer links. Most links in a normal website's pages point to its own brand, so the brand that appears most often in the links of a normal website should be consistent with the brand in the URL. A phishing website, in contrast, makes heavy use of the target website's link resources, so the brand that appears most often in its links is inconsistent with the brand in the URL. First, all links in the HTML code are extracted, and the brand names appearing in the main domain names of these links are counted together with their numbers of occurrences, giving a brand dictionary. Then the most frequent brand name B in the brand dictionary is compared with the brand name A in the URL. If the two brand names are consistent, the feature is set to 0 and the number of occurrences of brand name B is output as a feature; otherwise the feature is set to 1.
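The brand-dictionary comparison can be sketched with a `Counter`. Two assumptions are made: the "brand" of a host is naively taken to be the label just before the top-level domain (which is wrong for suffixes such as `co.uk`), and the function returns both the mismatch flag and the occurrence count of brand B, which is one possible reading of the text.

```python
from collections import Counter
from urllib.parse import urlparse


def brand_of(host: str) -> str:
    """Naive brand guess (assumption): the label just before the TLD."""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def dominant_brand_feature(page_url: str, link_urls):
    """Return (mismatch_flag, count_of_most_frequent_brand)."""
    brand_a = brand_of(urlparse(page_url).hostname or "")
    hosts = (urlparse(url).hostname for url in link_urls)
    brand_dictionary = Counter(brand_of(host) for host in hosts if host)
    if not brand_dictionary:  # no usable links: treated as consistent (assumption)
        return 0, 0
    brand_b, occurrences = brand_dictionary.most_common(1)[0]
    return (0 if brand_b == brand_a else 1), occurrences
```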
For internal and external resources: based on the observation that phishing websites prefer external resources, the numbers of internal and external resources used are counted as features for each of four kinds of tags, including '<link>', '<img>' and '<script>'.
For the number of times of appearing URL brand names in HTML, internal resources are frequently used in HTML codes of normal websites, so that the number of times of appearing URL brand names in HTML is large. In contrast, phishing websites are less.
For the warning window: some phishing websites pop up a warning window and have the user enter personal information in it.
For redirection, many phishing site designers will first create a normal website, and when a user accesses the website link, the link accessed by the user is redirected to the phishing site by a redirection method. The HTML code of the redirected web page typically has a 'redirect' string.
For Word2Vec features: Word2Vec is a widely used method in natural language processing. Drawing on ideas from deep learning, it is trained to reduce the processing of text content to vector operations in a K-dimensional vector space, where similarity in the vector space represents semantic similarity of the text. Open-sourced by Google in 2013, word2vec is an efficient tool for representing words as real-valued vectors; it maps a word to a vector of a specified dimension. In this embodiment, word2vec is used to represent the HTML code as a K-dimensional vector, which is taken as the feature. The specific procedure is as follows:
1. removing all contents of < style >, < script > tags in the HTML codes;
2. removing the remaining tags but keeping the text content inside them, for example: "<div id="doc"><h2>This is my homepage</h2></div>" is converted into "This is my homepage";
3. performing Chinese and English matching on the text content according to a certain rule, and dividing the text into English content and Chinese content;
4. for the Chinese content in the text, performing Chinese word segmentation by using a jieba tool, and performing word segmentation on English content according to a blank space, thereby obtaining a word list with k words in the HTML code;
5. for each word in the vocabulary list, converting the word into a 300-dimensional vector by using word2 vec;
6. and adding all the vectors, and dividing each dimension by k to obtain a 300-dimensional vector to represent the text content of the HTML code.
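Steps 1 to 6 can be sketched end to end with a toy embedding table. The sketch simplifies several things: the `embeddings` dictionary stands in for a trained word2vec model, text is split on whitespace only (Chinese segmentation with jieba is omitted), and words missing from the table contribute a zero vector, which is an assumption the text does not address.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Keep text content, skipping everything inside <style> and <script>."""

    SKIPPED = ("style", "script")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_words(html: str):
    """Steps 1-4 (simplified): strip tags, then split the text on whitespace."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks).split()


def mean_vector(words, embeddings, dim: int):
    """Steps 5-6: sum the word vectors and divide each dimension by k."""
    total = [0.0] * dim
    for word in words:
        vector = embeddings.get(word.lower())  # unknown words add nothing
        if vector is not None:
            for d in range(dim):
                total[d] += vector[d]
    k = len(words)
    return [value / k for value in total] if k else total
```

In the embodiment `dim` would be 300; the toy table only needs to map each lower-cased word to a list of `dim` floats.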
Feature extraction is performed by the method described above, and the URL feature and the HTML feature are combined, so that 338-dimensional features are finally extracted.
S102: inputting the N-dimensional current features into a trained stacking model for feature expansion to obtain (N + n)-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;
in a specific implementation, inputting the N-dimensional current features into the trained stacking model for feature expansion to obtain the (N + n)-dimensional features of the webpage to be classified comprises the following steps:
S1: acquiring a training set, and dividing the training set into m parts of training samples; wherein each webpage in the training set has N-dimensional features;
S2: selecting one part of the training samples to train the basic classification model, and predicting the webpage to be classified with the trained basic classification model, until all m parts of training samples have been selected and m intermediate prediction results are obtained; then voting on the m intermediate prediction results to obtain a one-dimensional added feature;
in a specific implementation, selecting one part of the training samples to train the basic classification model means that each of the m parts is selected exactly once, so that m prediction results are obtained.
S3: repeating S2 until all the basic classification models have been trained, obtaining p-dimensional added features, and merging the p-dimensional added features with the current features of the webpage to be classified to serve as the new current features of the webpage to be classified;
S4: repeating S2-S3 until the q layers of the stacking model have been stacked, obtaining the (N + n)-dimensional features of the webpage to be classified.
Steps S2-S3 constitute the stacking of one layer of the stacking model. Steps S2-S3 are repeated, and when q layers of the stacking model have been stacked, the (N + n)-dimensional features of the webpage to be classified are obtained and its feature expansion is complete. This embodiment does not limit the number of layers of the stacking model; the preferred number is 2, at which the system accuracy is highest.
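Steps S1 to S4 can be sketched in plain Python to show how each layer appends p voted predictions to the feature vector. This is a sketch under several assumptions and not the patented implementation. The two toy trainers stand in for GBDT, XGBoost and LightGBM. The training set is split by interleaving. The training rows are expanded with the same voted predictions so that later layers can be trained on the longer vectors; the text does not specify this detail, and out-of-fold predictions are the more usual choice.

```python
from collections import Counter


def centroid_trainer(rows, labels):
    """Toy stand-in base model: predict the class of the nearest centroid."""
    dim = len(rows[0])
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(r[d] for r in members) / len(members) for d in range(dim)]

    def predict(row):
        return min(centroids, key=lambda c: sum(
            (row[d] - centroids[c][d]) ** 2 for d in range(dim)))
    return predict


def stump_trainer(rows, labels):
    """Toy stand-in base model: threshold the first feature."""
    pos = [r[0] for r, y in zip(rows, labels) if y == 1]
    neg = [r[0] for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:  # only one class seen: predict it always
        constant = 1 if pos else 0
        return lambda row: constant
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    pos_is_high = mean_pos > mean_neg
    return lambda row: int((row[0] > threshold) == pos_is_high)


def vote(predictions):
    """Majority vote over the m intermediate predictions (use odd m to avoid ties)."""
    return Counter(predictions).most_common(1)[0][0]


def expand_features(train_rows, train_labels, page, trainers, m, q):
    """Expand an N-dimensional page vector to N + p*q dimensions."""
    train = [list(row) for row in train_rows]
    current = list(page)
    for _ in range(q):                       # S4: one pass per stacking layer
        page_added, train_added = [], []
        for trainer in trainers:             # S3: loop over the p base models
            # S1/S2: train one model per part (interleaved split, assumption)
            models = [trainer(train[i::m], train_labels[i::m]) for i in range(m)]
            page_added.append(float(vote([model(current) for model in models])))
            train_added.append([float(vote([model(row) for model in models]))
                                for row in train])
        current += page_added                # merge the p added features
        train = [row + [column[i] for column in train_added]
                 for i, row in enumerate(train)]
    return current
```

With p = 3 base models and q = 2 layers, a 338-dimensional input grows to 338 + 3 × 2 = 344 dimensions, matching the embodiment described in the text.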
S103: obtaining a classification result of the webpage to be classified from the (N + n)-dimensional features by using a classification algorithm.
In a specific implementation, the classification algorithm includes a GBDT algorithm, but a person skilled in the art may select other classification algorithms according to practical situations, and the classification algorithm is not limited in this embodiment. And for the detection of the phishing website, outputting a detection result whether the webpage to be classified is the phishing website.
According to the webpage classification method provided by the embodiment of the invention, the N-dimensional current characteristics of the webpage to be classified are expanded by using the stacking model, and the accuracy of webpage classification is improved on the premise of not depending on a search engine or third-party service.
An embodiment of the invention discloses another webpage classification method, which further explains and optimizes the technical solution of the previous embodiment. Specifically:
Referring to Fig. 2 and Fig. 3, Fig. 2 is a flowchart of another webpage classification method provided by an embodiment of the invention, and Fig. 3 is a flowchart of feature expansion in the stacking model of that method. As shown in Fig. 2, the method includes:
S211: acquiring the 338-dimensional current features of the webpage to be classified;
S212: standardizing the 338-dimensional current features using the Z-score method;
S221: acquiring a training set and dividing the training set into m training samples, wherein each webpage in the training set has 338-dimensional features;
S222: selecting one of the training samples to train a basic classification model and predicting the webpage to be classified with the trained basic classification model, until all m training samples have been selected and m intermediate prediction results are obtained; then voting on the m intermediate prediction results to obtain a one-dimensional added feature;
In this embodiment, the basic classification models are a GBDT classification model, an XGBoost classification model, and a LightGBM classification model. Those skilled in the art may of course select other classification models according to actual conditions, all of which fall within the scope of the present invention.
S223: repeating S222 until all basic classification models have been trained, obtaining three-dimensional added features, and merging the three-dimensional added features with the current features of the webpage to be classified to form its new current features;
S224: repeating S222 to S223 until the two-layer stacking of the stacking model is completed, obtaining the 344-dimensional features of the webpage to be classified;
S203: obtaining the classification result of the webpage to be classified from the 344-dimensional features using the GBDT algorithm.
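The feature expansion of steps S221 to S224 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `CentroidModel` is a hypothetical toy stand-in for the GBDT, XGBoost and LightGBM base models, the m training samples are taken by simple striding, and the training rows are expanded with the same voted predictions so that the second layer sees matching dimensions (the text does not spell out this detail).

```python
from collections import Counter

def _distance(kind):
    """Return a distance function; the three kinds give three distinct toy models."""
    if kind == "l1":
        return lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
    if kind == "l2":
        return lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return lambda a, b: max(abs(x - y) for x, y in zip(a, b))  # "linf"

class CentroidModel:
    """Hypothetical stand-in for a base classifier (NOT GBDT/XGBoost/LightGBM)."""
    def __init__(self, kind):
        self.dist = _distance(kind)

    def fit(self, X, y):
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(c) / len(rows) for c in zip(*rows)]
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda k: self.dist(x, self.centroids[k]))

def vote(predictions):
    """Majority vote over the m intermediate prediction results."""
    return Counter(predictions).most_common(1)[0][0]

def expand_features(train_X, train_y, pages, kinds=("l1", "l2", "linf"), q=2, m=5):
    """Append p = len(kinds) voted features per layer, for q layers."""
    train_X = [list(r) for r in train_X]
    pages = [list(r) for r in pages]
    for _ in range(q):                                   # S224: one pass per layer
        folds = [(train_X[i::m], train_y[i::m]) for i in range(m)]  # S221
        extra_train = [[] for _ in train_X]
        extra_pages = [[] for _ in pages]
        for kind in kinds:                               # S223: one feature per base model
            models = [CentroidModel(kind).fit(fx, fy) for fx, fy in folds]  # S222
            for row, extra in zip(train_X, extra_train):  # assumption: expand training rows too
                extra.append(vote([mdl.predict(row) for mdl in models]))
            for row, extra in zip(pages, extra_pages):
                extra.append(vote([mdl.predict(row) for mdl in models]))
        train_X = [r + e for r, e in zip(train_X, extra_train)]
        pages = [r + e for r, e in zip(pages, extra_pages)]
    return train_X, pages

# Toy data: 2 features instead of 338, label 1 = phishing (illustrative only).
toy_X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1], [0.1, 0.2], [1.1, 1.0]]
toy_y = [0, 0, 1, 1, 0, 1]
_, expanded = expand_features(toy_X, toy_y, [[0.95, 1.0]], m=2)
print(len(expanded[0]))  # 2 original features + 3 models * 2 layers = 8
```

With N = 338, p = 3 and q = 2, the same bookkeeping gives 338 + 3 × 2 = 344 dimensions, matching S224.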
For the above embodiment, a relatively large data set S was collected, containing the webpage source code of 10,000 normal websites and 10,000 phishing websites. The normal websites come from sites with Alexa ranks between 10,000 and 12,000, together with some of the links found on those sites, which keeps the URL length distribution uniform; the phishing websites come from PhishTank and span June 2009 to June 2017. For comparison, a smaller data set T was also collected as the test set. The normal websites in the test set likewise come from sites with Alexa ranks between 10,000 and 12,000 and some of the links on those sites, while the phishing websites are 1,000 links verified on PhishTank between July 12, 2017 and July 15, 2017.
Accuracy, missed alarm rate, and false alarm rate are used as evaluation metrics. Let P be the number of phishing websites in the test set, L the number of legitimate websites in the test set, α the number of phishing websites correctly predicted, and β the number of legitimate websites correctly predicted. The accuracy is calculated as:
$$\text{Accuracy} = \frac{\alpha + \beta}{P + L}$$
The missed alarm rate is calculated as:
$$\text{Missed alarm rate} = \frac{P - \alpha}{P}$$
The false alarm rate is calculated as:
$$\text{False alarm rate} = \frac{L - \beta}{L}$$
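The three metrics can be computed directly from the four counts defined above. The sketch below assumes the usual definitions (accuracy as the share of all websites predicted correctly, missed alarm rate as the share of phishing websites not detected, false alarm rate as the share of legitimate websites wrongly flagged); the function name `evaluate` and the counts in the usage line are hypothetical and are not taken from the experiments.

```python
def evaluate(P, L, alpha, beta):
    """Return (accuracy, missed_alarm_rate, false_alarm_rate) in percent.

    P     -- number of phishing websites in the test set
    L     -- number of legitimate websites in the test set
    alpha -- phishing websites predicted correctly
    beta  -- legitimate websites predicted correctly
    """
    accuracy = 100.0 * (alpha + beta) / (P + L)
    missed_alarm_rate = 100.0 * (P - alpha) / P
    false_alarm_rate = 100.0 * (L - beta) / L
    return accuracy, missed_alarm_rate, false_alarm_rate

# Illustrative counts only (not experimental data).
acc, mar, far = evaluate(P=1000, L=1000, alpha=950, beta=980)
print(acc, mar, far)  # 96.5 5.0 2.0
```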
To demonstrate the effectiveness of the stacking model in this embodiment, 3,000 phishing websites and 3,000 legitimate websites were randomly selected from data set S as the training set, the trained models were evaluated on the test set, and the performance of each model is shown in Table 1 below.
TABLE 1 Performance of the various models
(Table 1 appears as images BDA0001498336070000104 and BDA0001498336070000111 in the original publication.)
Phishing website detection has been the subject of much research and practical application. Existing methods were applied to our data set for testing, with the results shown in Table 2 below:
TABLE 2 Comparison of the performance of the methods
Method            Missed alarm rate (%)   False alarm rate (%)   Accuracy (%)
Cantina           70                      7.5                    61.25
Varshney          7.6                     48                     72.2
Rakesh            7.8                     9.5                    91.35
This embodiment   3.4                     3.7                    96.45
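As a consistency check on Table 2, the three metrics are linked. Write M for the missed alarm rate and F for the false alarm rate, and assume the usual definitions M = (P − α)/P and F = (L − β)/L, with accuracy A = (α + β)/(P + L). Substituting α = P(1 − M) and β = L(1 − F) gives:

$$A = \frac{\alpha + \beta}{P + L} = 1 - \frac{P\,M + L\,F}{P + L}$$

If the test set is balanced (P = L, which is an assumption here, since the number of normal websites in T is not stated), this reduces to:

$$A = 1 - \frac{M + F}{2}$$

Every row of Table 2 satisfies this relation; for the last row, 100 − (3.4 + 3.7)/2 = 96.45.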
A webpage classification system provided by an embodiment of the invention is introduced below; the system described below and the webpage classification method described above may be cross-referenced.
Referring to Fig. 4, a structural diagram of a webpage classification system provided by an embodiment of the invention, the system includes:
an obtaining module 401, configured to obtain an N-dimensional current feature of a webpage to be classified; wherein N is a positive integer;
an extension module 402, configured to input the N-dimensional current features into a trained stacking model for feature extension, so as to obtain the (N + n)-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;
a classification module 403, configured to obtain a classification result of the webpage to be classified by using a classification algorithm according to the (N + n)-dimensional features.
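The three modules can be read as a simple pipeline. The sketch below is illustrative only: the class name, the toy URL features and the threshold rule are hypothetical stand-ins, and the only property taken from the text is that the extension module adds exactly n = p × q dimensions.

```python
class WebPageClassificationSystem:
    """Chains the modules: obtain (401) -> extend (402) -> classify (403)."""

    def __init__(self, acquire, extend, classify, p, q):
        self.acquire = acquire    # page -> N-dimensional current features
        self.extend = extend      # N features -> N + n features
        self.classify = classify  # N + n features -> class label
        self.n = p * q            # number of added feature dimensions

    def run(self, page):
        current = self.acquire(page)
        expanded = self.extend(current)
        if len(expanded) != len(current) + self.n:
            raise ValueError("extension module must add exactly p * q features")
        return self.classify(expanded)

# Hypothetical stand-ins, for illustration only.
def toy_acquire(url):
    return [float(len(url)), float(url.count("."))]

def toy_extend(features):
    return features + [0.0] * 6  # placeholder for the stacking output (p=3, q=2)

def toy_classify(features):
    return "phishing" if features[1] > 3 else "normal"

system = WebPageClassificationSystem(toy_acquire, toy_extend, toy_classify, p=3, q=2)
print(system.run("http://example.com"))  # normal
```

Keeping the dimension check inside `run` makes a mismatch between p, q and the extension module fail early rather than at classification time.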
According to the webpage classification system provided by the embodiment of the invention, the N-dimensional current characteristics of the webpage to be classified are expanded by using the stacking model, so that the accuracy of webpage classification is improved on the premise of not depending on a search engine or third-party service.
On the basis of the above embodiment, as a preferred implementation, the extension module 402 includes:
a dividing unit, configured to acquire a training set and divide the training set into m training samples, wherein each webpage in the training set has N-dimensional features;
a prediction unit, configured to select one training sample to train a basic classification model and predict the webpage to be classified with the trained basic classification model, until all m training samples have been selected and m intermediate prediction results are obtained; to vote on the m intermediate prediction results to obtain a one-dimensional added feature; to repeat this workflow until all basic classification models have been trained, obtaining p-dimensional added features; and then to start the workflow of the merging unit;
a merging unit, configured to merge the p-dimensional added features with the current features of the webpage to be classified to form its new current features, and to restart the workflow of the prediction unit until the q-layer stacking of the stacking model is completed, obtaining the (N + n)-dimensional features of the webpage to be classified.
On the basis of the above embodiment, as a preferred implementation, the system further includes:
a standardization module, configured to standardize the N-dimensional features using the Z-score method.
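Z-score standardization rescales each feature dimension to zero mean and unit standard deviation. The sketch below makes several assumptions that the text does not specify: the statistics are estimated from the training set, the population standard deviation is used, and a constant column is mapped to 0. The function names are hypothetical.

```python
import math

def zscore_fit(rows):
    """Estimate the per-dimension mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        for col, mean in zip(columns, means)
    ]
    return means, stds

def zscore_apply(row, means, stds):
    """Standardize one feature vector; constant columns map to 0.0 (assumption)."""
    return [
        (v - mean) / std if std > 0 else 0.0
        for v, mean, std in zip(row, means, stds)
    ]

train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
means, stds = zscore_fit(train)
print(zscore_apply([2.0, 10.0], means, stds))  # [0.0, 0.0]
```

The same `means` and `stds` would be reused for the webpage to be classified so that training and prediction features share one scale.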
The present application further provides a webpage classification device. Referring to Fig. 5, a structural diagram of a webpage classification device provided by an embodiment of the invention, the device includes:
a memory 501 for storing a webpage classification program;
a processor 502, configured to implement the steps provided in the above embodiments when executing the webpage classification program. The webpage classification device may of course also include components such as network interfaces and a power supply.
According to the webpage classification device provided by the embodiment of the invention, the N-dimensional current characteristics of the webpage to be classified are expanded by using the stacking model, so that the accuracy of webpage classification is improved on the premise of not depending on a search engine or third-party service.
The present application also provides a computer-readable storage medium having a webpage classification program stored thereon, which, when executed by a processor, can implement the steps provided by the above embodiments. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be cross-referenced.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method. It should be noted that those skilled in the art may make improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (8)

1. A method for classifying web pages, comprising:
acquiring N-dimensional current characteristics of a webpage to be classified; wherein N is a positive integer;
inputting the N-dimensional current features into a trained stacking model to perform feature expansion to obtain the (N + n)-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;
obtaining a classification result of the webpage to be classified by using a classification algorithm according to the (N + n)-dimensional features;
wherein inputting the N-dimensional current features into the trained stacking model to perform feature expansion to obtain the (N + n)-dimensional features of the webpage to be classified comprises the following steps:
S1: acquiring a training set, and dividing the training set into m training samples; wherein each web page in the training set comprises N-dimensional features;
S2: selecting a basic classification model from the p basic classification models as a target basic classification model, selecting one of the training samples to train the target basic classification model, and predicting the webpage to be classified by using the trained target basic classification model, until all m training samples have been selected and m intermediate prediction results are obtained, and voting on the m intermediate prediction results to obtain a one-dimensional added feature;
S3: repeating S2 until all p basic classification models have been used as the target basic classification model and trained, to obtain p-dimensional added features, and combining the p-dimensional added features with the current features of the web page to be classified to serve as the current features of the web page to be classified;
S4: repeating S2 to S3 until the q-layer stacking of the stacking model is completed, and obtaining the (N + n)-dimensional features of the webpage to be classified.
2. The method for classifying web pages according to claim 1, wherein if p is 3, the basic classification models comprise a GBDT classification model, an XGBoost classification model and a LightGBM classification model.
3. The method for classifying web pages according to claim 1, after obtaining the N-dimensional features of the web pages to be classified, further comprising:
standardizing the N-dimensional features using the Z-score method.
4. The method of classifying web pages according to claim 1, wherein the classification algorithm comprises the GBDT algorithm.
5. The method for classifying web pages according to any one of claims 1 to 4, wherein the N-dimensional current characteristics comprise URL characteristics and HTML characteristics of the web pages to be classified, the URL characteristics comprise the number of top-level domain names and similar known brands, and the HTML characteristics comprise the number of empty anchor links, whether the brand of a title is consistent with the brand in the URL, whether the brand with the largest number of occurrences in the HTML is consistent with the brand in the URL, the number of internal and external resources and the Word2vec characteristics.
6. A system for classifying web pages, comprising:
an acquisition module, configured to acquire the N-dimensional current features of the web page to be classified; wherein N is a positive integer;
an extension module, configured to input the N-dimensional current features into a trained stacking model to perform feature extension to obtain the (N + n)-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;
a classification module, configured to obtain a classification result of the webpage to be classified by using a classification algorithm according to the (N + n)-dimensional features;
wherein the extension module comprises:
a dividing unit, configured to acquire a training set and divide the training set into m training samples; wherein each web page in the training set comprises N-dimensional features;
a prediction unit, configured to select a basic classification model from the p basic classification models as a target basic classification model, select one of the training samples to train the target basic classification model, and predict the webpage to be classified by using the trained target basic classification model, until all m training samples have been selected and m intermediate prediction results are obtained; to vote on the m intermediate prediction results to obtain a one-dimensional added feature; to repeat this workflow until all p basic classification models have been used as the target basic classification model and trained, obtaining p-dimensional added features; and to start the workflow of the merging unit;
a merging unit, configured to merge the p-dimensional added features with the current features of the web page to be classified to serve as the current features of the web page to be classified, and to start the workflow of the prediction unit until the q-layer stacking of the stacking model is completed, obtaining the (N + n)-dimensional features of the webpage to be classified.
7. A web page classification apparatus, comprising:
a memory for storing a web page classification program;
a processor for implementing the steps of the web page classification method according to any one of claims 1 to 5 when executing the web page classification program.
8. A computer-readable storage medium, on which a web page classification program is stored, which, when executed by a processor, implements the web page classification method according to any one of claims 1 to 5.
CN201711285419.6A 2017-12-07 2017-12-07 Webpage classification method and system and webpage classification equipment Active CN108038173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711285419.6A CN108038173B (en) 2017-12-07 2017-12-07 Webpage classification method and system and webpage classification equipment

Publications (2)

Publication Number Publication Date
CN108038173A CN108038173A (en) 2018-05-15
CN108038173B true CN108038173B (en) 2021-11-26

Family

ID=62096244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711285419.6A Active CN108038173B (en) 2017-12-07 2017-12-07 Webpage classification method and system and webpage classification equipment

Country Status (1)

Country Link
CN (1) CN108038173B (en)

Also Published As

Publication number Publication date
CN108038173A (en) 2018-05-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant