CN108038173B

CN108038173B - Webpage classification method and system and webpage classification equipment

Info

Publication number: CN108038173B
Application number: CN201711285419.6A
Authority: CN
Inventors: 刘文印; 黎宇坤; 陈旭; 袁华平; 杨振国
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2017-12-07
Filing date: 2017-12-07
Publication date: 2021-11-26
Anticipated expiration: 2037-12-07
Also published as: CN108038173A

Abstract

The invention discloses a webpage classification method, which comprises the following steps: acquiring N-dimensional current features of a webpage to be classified, wherein N is a positive integer; inputting the N-dimensional current features into a trained stacking model to perform feature expansion, obtaining N + n-dimensional features of the webpage to be classified, wherein the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers; and obtaining a classification result of the webpage to be classified by applying a classification algorithm to the N + n-dimensional features. The method expands the N-dimensional current features of the webpage to be classified by using the stacking model, and improves the accuracy of webpage classification without depending on a search engine or a third-party service. The invention also discloses a webpage classification system, webpage classification equipment and a computer-readable storage medium, which achieve the same technical effect.

Description

Webpage classification method and system and webpage classification equipment

Technical Field

The present invention relates to the field of network security technologies, and in particular, to a method and a system for classifying web pages, a web page classification device, and a computer-readable storage medium.

Background

Phishing is a form of network fraud in which criminals imitate the URL address and page content of a real website by various means in order to steal users' private information, such as important account numbers, bank or credit card numbers, and passwords. Phishers usually design the pages of a phishing website to look identical to the real website's interface and entice visitors to submit their accounts and passwords.

In recent years, many researchers have designed practical solutions to the anti-phishing problem. These solutions fall mainly into the following categories: (1) methods based on blacklists and whitelists; (2) extracting text, image or URL features from the webpage and using a search engine to help detect phishing websites; (3) detecting a phishing website by using the visual similarity between the phishing webpage and a known webpage; (4) discovering phishing websites by using abnormal DNS information of the webpage; (5) extracting text, image or special URL features from the HTML and then detecting the phishing website through a heuristic algorithm or a machine learning algorithm. Among these, the blacklist and whitelist method requires the lists to be maintained manually and continuously; the search engine method is limited by the performance of the search engine and cannot achieve real-time detection; the visual similarity method is easily affected by the accuracy of target identification; and the DNS method requires a third-party service to provide DNS information, so its development cost is high.

Therefore, how to improve the accuracy of webpage classification without depending on a search engine or a third-party service is a problem to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a webpage classification method, a webpage classification system, webpage classification equipment and a computer readable storage medium, which improve the accuracy of webpage classification on the premise of not depending on a search engine or a third-party service.

In order to achieve the above object, an embodiment of the present invention provides a method for classifying web pages, including:

acquiring N-dimensional current characteristics of a webpage to be classified; wherein N is a positive integer;

inputting the N-dimensional current features into a trained stacking model to perform feature expansion, obtaining N + n-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;

and obtaining a classification result of the webpage to be classified by utilizing a classification algorithm according to the N + n-dimensional features.

Wherein inputting the N-dimensional current features into the trained stacking model to perform feature expansion, obtaining the N + n-dimensional features of the webpage to be classified, comprises the following steps:

S1: acquiring a training set, and dividing the training set into m parts of training samples; wherein each webpage in the training set comprises N-dimensional features;

S2: selecting one part of the training samples to train the basic classification model, and predicting the webpage to be classified by using the trained basic classification model, until all m parts of training samples have been selected and m intermediate prediction results are obtained; then voting on the m intermediate prediction results to obtain a one-dimensional added feature;

S3: repeating S2 until all the basic classification models have been trained, obtaining p-dimensional added features, and merging the p-dimensional added features with the current features of the webpage to be classified to serve as the current features of the webpage to be classified;

S4: repeating S2-S3 until the q-layer stacking of the stacking model is completed, obtaining the N + n-dimensional features of the webpage to be classified.

Wherein, if p is 3, the basic classification models comprise a GBDT classification model, an XGBoost classification model and a LightGBM classification model.

After acquiring the N-dimensional features of the web pages to be classified, the method further includes:

standardizing the N-dimensional features by using the Z-score method.
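As an illustration of the Z-score step, the following minimal sketch standardizes each feature column to zero mean and unit variance. The function name zscore_columns, the use of the population standard deviation, and the choice to map constant columns to 0.0 are assumptions made for this example; the text does not specify them.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column of a list of feature rows: (value - mean) / std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; the source does not say which one.
    stds = [pstdev(col) for col in columns]
    return [
        # Assumption: a constant column (std == 0) is mapped to 0.0.
        [(value - mu) / sd if sd else 0.0 for value, mu, sd in zip(row, means, stds)]
        for row in rows
    ]


if __name__ == "__main__":
    print(zscore_columns([[1.0, 5.0], [3.0, 5.0]]))
```

In practice the column means and standard deviations would be computed on the training set and reused for the webpage to be classified.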

Wherein the classification algorithm comprises a GBDT algorithm.

Wherein the N-dimensional current features comprise URL features and HTML features of the webpage to be classified; the URL features comprise the number of top-level domain names and similarity to well-known brands, and the HTML features comprise the number of empty anchor links, whether the brand in the title is consistent with the brand in the URL, whether the brand that occurs most frequently in the HTML is consistent with the brand in the URL, the numbers of internal and external resources, and Word2vec features.

In order to achieve the above object, an embodiment of the present invention provides a web page classification system, including:

the acquisition module is used for acquiring the N-dimensional current characteristics of the web pages to be classified; wherein N is a positive integer;

the extension module is used for inputting the N-dimensional current features into a trained stacking model to perform feature expansion, obtaining N + n-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;

and the classification module is used for obtaining the classification result of the webpage to be classified by utilizing a classification algorithm according to the N + n-dimensional features.

Wherein the expansion module comprises:

a dividing unit, configured to acquire a training set and divide the training set into m parts of training samples; wherein each webpage in the training set comprises N-dimensional features;

a prediction unit, configured to select one part of the training samples to train the basic classification model and predict the webpage to be classified by using the trained basic classification model, until all m parts of training samples have been selected and m intermediate prediction results are obtained; to vote on the m intermediate prediction results to obtain a one-dimensional added feature; to repeat this workflow until all the basic classification models have been trained and p-dimensional added features are obtained; and then to start the workflow of the merging unit;

a merging unit, configured to merge the p-dimensional added features with the current features of the webpage to be classified to serve as the current features of the webpage to be classified, and to restart the workflow of the prediction unit until the q-layer stacking of the stacking model is completed, obtaining the N + n-dimensional features of the webpage to be classified.

In order to achieve the above object, an embodiment of the present invention provides a web page classification device, including:

a memory for storing a web page classification program;

and the processor is used for realizing the steps of the webpage classification method when executing the webpage classification program.

To achieve the above object, an embodiment of the present invention provides a computer-readable storage medium, on which a web page classification program is stored, and the web page classification program, when executed by a processor, implements the web page classification method.

According to the above scheme, the webpage classification method provided by the embodiment of the invention comprises the following steps: acquiring N-dimensional current features of a webpage to be classified, wherein N is a positive integer; inputting the N-dimensional current features into a trained stacking model to perform feature expansion, obtaining N + n-dimensional features of the webpage to be classified, wherein the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers; and obtaining a classification result of the webpage to be classified by utilizing a classification algorithm according to the N + n-dimensional features.

According to the webpage classification method provided by the embodiment of the invention, the N-dimensional current characteristics of the webpage to be classified are expanded by using the stacking model, and the accuracy of webpage classification is improved on the premise of not depending on a search engine or third-party service. The invention also discloses a webpage classification system, webpage classification equipment and a computer readable storage medium, which can also realize the technical effect.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a method for classifying web pages according to an embodiment of the present invention;

FIG. 2 is a flowchart of another method for classifying web pages according to an embodiment of the present invention;

FIG. 3 is a flowchart of feature extension in a Stacking model of another webpage classification method disclosed in the embodiments of the present invention;

FIG. 4 is a block diagram of a web page classification system according to an embodiment of the present invention;

FIG. 5 is a structural diagram of a web page classification device disclosed in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a webpage classification method, which improves the accuracy of webpage classification on the premise of not depending on a search engine or third-party service.

Referring to fig. 1, a flowchart of a method for classifying web pages disclosed in the embodiment of the present invention is shown in fig. 1, and includes:

S101: acquiring N-dimensional current features of a webpage to be classified; wherein N is a positive integer;

The webpage classification method provided by the embodiment of the invention can be applied to the detection of phishing websites. In a specific implementation, the N-dimensional current features comprise URL features and HTML features of the webpage to be classified. The URL features are described as follows:

For IP addresses: the main domain name of a phishing website is often just an IP address, for example http://62.141.45.54/portaleTitolaris8/. Whether the main domain name of the website is an IP address can be determined by regular-expression matching. If so, the feature is 1; otherwise it is 0.
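The regular-expression check for an IP-address host might look like the sketch below. The function name host_is_ip and the restriction to dotted-quad IPv4 addresses are assumptions for illustration; the text does not give the actual pattern.

```python
import re
from urllib.parse import urlsplit

# One IPv4 octet, limited to the range 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
_IPV4_RE = re.compile(rf"^{_OCTET}(?:\.{_OCTET}){{3}}$")


def host_is_ip(url):
    """Return 1 if the host part of the URL is a literal IPv4 address, else 0."""
    host = urlsplit(url).hostname or ""
    return 1 if _IPV4_RE.match(host) else 0


if __name__ == "__main__":
    print(host_is_ip("http://62.141.45.54/portaleTitolaris8/"))
    print(host_is_ip("http://www.baidu.com/"))
```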

The suspicious symbols include '@' and '-', among others. If the '@' symbol appears in the URL, the browser ignores the string to the left of the '@' symbol when resolving the web address, so the real host is whatever follows it. Symbols such as '-' are rarely used in the URLs of normal websites because their meaning is unclear.

For https, https is a secure http data transmission method, providing authentication and encrypted communications.

The length information mainly includes the total length of the URL address and the length of the main domain name.

For the number of '.' characters in the main domain name: the main domain name of a normal website is usually short so that users can remember it easily, and the number of '.' characters generally does not exceed 3, for example www.baidu.com.
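The simple lexical URL features described above (suspicious symbols, https, lengths, and the number of dots in the main domain name) can be sketched together as follows. The function name and dictionary keys are hypothetical, and the "main domain name" is assumed here to be the URL's host name.

```python
from urllib.parse import urlsplit


def lexical_url_features(url):
    """Collect a few lexical URL features; names and exact choices are illustrative."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # assumption: main domain name == host name
    return {
        "has_at": int("@" in url),
        "hyphens": url.count("-"),
        "is_https": int(parts.scheme.lower() == "https"),
        "url_length": len(url),
        "host_length": len(host),
        "host_dots": host.count("."),
    }


if __name__ == "__main__":
    print(lexical_url_features("https://www.baidu.com/"))
```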

For sensitive words: certain sensitive words are often used in the URL addresses of phishing websites. In the present embodiment, a list of sensitive words may be created, for example ['secure', 'account', 'webscr', 'login', 'ebayisapi', 'sign', 'banking', 'confirm', 'submit', 'update'], and the number of these sensitive words appearing in the URL address may be counted as the feature.
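A minimal sketch of the sensitive-word count is shown below. The word list is only a subset of the example list, and counting each listed word at most once (rather than counting every occurrence) is an assumption; the function name is hypothetical.

```python
from typing import Sequence

# Subset of the example list from the text, used only for illustration.
EXAMPLE_WORDS = ("secure", "account", "banking", "confirm", "update")


def count_sensitive_words(url: str, words: Sequence[str] = EXAMPLE_WORDS) -> int:
    """Count how many of the listed words occur somewhere in the URL."""
    lowered = url.lower()
    # Assumption: each word contributes at most 1, however often it occurs.
    return sum(1 for word in words if word in lowered)


if __name__ == "__main__":
    print(count_sensitive_words("http://secure-account.example.com/update"))
```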

For unusual top-level domains: top-level domain names are divided into two categories. The first is the top-level domain names of countries and regions (country code top-level domains, ccTLDs for short), for example 'cn' for China and 'jp' for Japan. The second is international top-level domain names (generic top-level domains, gTLDs), such as 'com' for commercial enterprises, 'net' for network providers and 'org' for non-profit organizations. Stuffgate counted 735 top-level domain names used by the top 1 million Alexa-ranked websites. A URL is very suspicious if multiple common top-level domain names appear in its main domain name, for example http://www.ebay.com.urgd.com/path, or if a top-level domain name appears in the URL path, that is, in the part following the main domain name. In this embodiment, 3 features can be extracted: whether the top-level domain name is in Stuffgate's top-level domain name list, the number of top-level domain names in the main domain name, and the number of top-level domain names in the URL path.
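The three top-level-domain features can be sketched as follows. COMMON_TLDS is a tiny hypothetical stand-in for the Stuffgate list, and splitting the path on non-alphanumeric characters is an assumption about how top-level domain names are found in the path.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-in for the Stuffgate list of top-level domain names.
COMMON_TLDS = frozenset({"com", "net", "org", "cn", "jp"})


def tld_features(url, tlds=COMMON_TLDS):
    """Return the three TLD features described in the text as a dict."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # Assumption: path tokens are the alphanumeric runs of the URL path.
    path_tokens = [t for t in re.split(r"[^a-z0-9]+", parts.path.lower()) if t]
    return {
        "tld_in_list": int(labels[-1] in tlds),
        "tlds_in_host": sum(1 for label in labels if label in tlds),
        "tlds_in_path": sum(1 for token in path_tokens if token in tlds),
    }


if __name__ == "__main__":
    print(tld_features("http://www.ebay.com.urgd.com/path"))
```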

For similarity to well-known brands: phishers tend to exploit relatively popular brands, for example by slightly altering the spelling of paypal to confuse users. The Levenshtein distance (edit distance) measures the similarity between two character strings. It is the minimum number of edits required to convert one string into the other, where an edit is the replacement, insertion or deletion of a character. Similar brand names can be found by computing the Levenshtein distance between a given well-known brand and a string in the URL.
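The Levenshtein distance can be computed with the standard dynamic-programming recurrence, as in the sketch below. The helper similar_brand, its max_distance threshold, and the misspelling 'paypa1' are hypothetical illustrations, not values from the text.

```python
def levenshtein(a, b):
    """Minimum number of single-character substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similar_brand(token, brands, max_distance=1):
    """Return the first brand that is close to, but not equal to, the token."""
    for brand in brands:
        if 0 < levenshtein(token.lower(), brand) <= max_distance:
            return brand
    return None


if __name__ == "__main__":
    print(levenshtein("paypa1", "paypal"))
    print(similar_brand("paypa1", ["ebay", "paypal"]))
```

An exact match is deliberately not reported, since a URL token that equals the brand name is not an imitation of it.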

The HTML features are described as follows:

For the number of inner and outer links: an inner link is a link whose main domain name is the same as the main domain name of the page's URL address. Phishing websites typically use external resources, especially resources of the website being counterfeited, to trick the user into believing that the webpage is legitimate; as a result, a phishing website has very few inner links and many outer links. Inner and outer links are identified by reducing every link in the HTML to its main domain name and comparing it with the main domain name of the URL address.
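A rough sketch of the inner/outer link count using only the standard library follows. It assumes that the "main domain name" is the host name, that only the href attributes of anchor tags are considered, and that relative links count as inner links; all names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collect the non-empty href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def count_inner_outer_links(page_url, html):
    """Return (inner, outer) link counts relative to the page's host name."""
    collector = _HrefCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    inner = outer = 0
    for href in collector.hrefs:
        # Relative links are resolved against the page URL, so they count as inner.
        host = urlsplit(urljoin(page_url, href)).hostname
        if host == page_host:
            inner += 1
        else:
            outer += 1
    return inner, outer


if __name__ == "__main__":
    sample = ('<a href="/help">h</a>'
              '<a href="https://www.paypal.com/signin">s</a>'
              '<a href="http://evil.example/x">x</a>')
    print(count_inner_outer_links("http://evil.example/login", sample))
```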

For empty anchor links: these take two forms, an <a> tag whose href is empty and an <a> tag whose href is '#'; clicking such a link produces no reaction. Phishing websites are designed this way so that the webpage appears to contain many hyperlinks, which confuses the user. The number of empty anchor links can be counted as the feature.
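Counting empty anchors might be sketched as below. Treating an href that is empty or equal to '#' as an empty anchor is an assumption about the two forms mentioned in the text; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class _EmptyAnchorCounter(HTMLParser):
    """Count <a> tags whose href is empty or just '#' (assumed forms)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None and href.strip() in ("", "#"):
            self.count += 1


def count_empty_anchors(html):
    counter = _EmptyAnchorCounter()
    counter.feed(html)
    return counter.count


if __name__ == "__main__":
    print(count_empty_anchors('<a href="#">a</a><a href="">b</a><a href="/x">c</a>'))
```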

For the login window: phishing websites often induce the user to reveal sensitive personal information through a login window, and the following logic is used to judge whether the webpage contains one. First, all <form> tags in the page are found; then the <input> tags inside them are found; finally, each <input> tag is matched against keywords such as password and pass. If neither password nor pass matches, keywords such as login and sign in are matched within all <form> tags.
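The two-stage matching logic can be sketched with the standard-library HTML parser as follows. The fallback keywords 'login' and 'sign in', and the decision to search both the attributes and the visible text of the form, are assumptions made for this example.

```python
from html.parser import HTMLParser


class _LoginFormDetector(HTMLParser):
    """Track <input> tags and text that occur inside <form> elements."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.password_input = False
        self.form_text = []

    def handle_starttag(self, tag, attrs):
        attr_text = " ".join(f"{k} {v or ''}" for k, v in attrs).lower()
        if tag == "form":
            self.form_depth += 1
            self.form_text.append(attr_text)
        elif self.form_depth and tag == "input":
            # 'pass' also covers 'password'.
            if "pass" in attr_text:
                self.password_input = True
            self.form_text.append(attr_text)

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth:
            self.form_depth -= 1

    def handle_data(self, data):
        if self.form_depth:
            self.form_text.append(data.lower())


def has_login_window(html, fallback_keywords=("login", "sign in")):
    """Return 1 if the page appears to contain a login form, else 0."""
    detector = _LoginFormDetector()
    detector.feed(html)
    if detector.password_input:
        return 1
    # Assumed fallback: look for login-related keywords anywhere in the forms.
    text = " ".join(detector.form_text)
    return int(any(keyword in text for keyword in fallback_keywords))


if __name__ == "__main__":
    print(has_login_window('<form action="/a"><input type="password" name="p"></form>'))
```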

For the length features of the HTML content: phishing websites aim to steal users' login information, so they are simple in design and their HTML code is not very detailed; the direct consequence is that the code of a phishing website is usually shorter than that of a normal website. The length feature is refined to the length of the content of individual tags, namely '<style>', '<script>', '<link>', '<!-- -->' and '<form>'. These tags are selected for the following reasons. For example, the code in the '<style>' tag mainly sets the style (CSS) of the webpage, and the designer of a phishing website, developing quickly, usually does not put much effort into styling; likewise, the '<!-- -->' tag holds code comments, and since a phishing website is developed once and never maintained afterwards, its developers usually do not write comments.

For hidden/restricted information: this typically appears on the '<div>', '<button>' and '<input>' tags. <div> tag: the content inside the <div> is hidden and is not displayed in the rendered page. <button> tag: <button disabled="disabled"> disables the click function of the button. <input> tag: <input type="hidden"> hides the input box, <input disabled="disabled"> disables the input function of the input box, and <input value="hello"> pre-fills the input box with irrelevant information.

For whether the title brand is consistent with the brand in the URL, the title of a general webpage will contain the brand name of the webpage, and so will the phishing website. The brand names of known websites are one-to-one with their main domain name, and if the brand of the web page is known but the main domain name in the URL does not match the brand, then the web page is likely to be a phishing web page.

Whether the brand that occurs most frequently in the HTML is consistent with the brand in the URL follows a principle similar to that of inner and outer links. Most links in a normal website page point to its own brand, so the brand that appears most often in a normal website's links should be consistent with the brand of the URL. A phishing website, by contrast, uses a large number of link resources of the target website, so the brand that appears most often in its links is inconsistent with the brand of the URL. First, all links in the HTML code are extracted, and the brand names appearing in the main domain names of the links are counted together with their numbers of occurrences to obtain a brand dictionary. Then the brand name B with the largest number of occurrences in the brand dictionary is found and compared with the brand name A in the URL: if the two brand names are consistent, the feature is set to 0 and the number of occurrences of brand name B is output as a feature; otherwise the feature is set to 1.
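The brand-dictionary comparison might be sketched as below. The brand_of heuristic (taking the label immediately before the last label of the host name) is a hypothetical simplification, as are returning a (flag, count) pair and treating a page without any link brand as inconsistent.

```python
from collections import Counter
from urllib.parse import urlsplit


def brand_of(url):
    """Hypothetical brand heuristic: the host label just before the last label."""
    labels = (urlsplit(url).hostname or "").lower().split(".")
    return labels[-2] if len(labels) >= 2 else ""


def top_link_brand_feature(page_url, link_urls):
    """Return (flag, count): flag 0 and brand B's count if B matches the URL brand A."""
    brand_dictionary = Counter(b for b in map(brand_of, link_urls) if b)
    if not brand_dictionary:
        return 1, 0  # assumption: no link brand at all is treated as inconsistent
    brand_b, occurrences = brand_dictionary.most_common(1)[0]
    if brand_b == brand_of(page_url):
        return 0, occurrences
    return 1, 0


if __name__ == "__main__":
    links = ["https://www.paypal.com/a",
             "https://www.paypal.com/b",
             "http://paypal-secure.example.net/c"]
    print(top_link_brand_feature("http://paypal-secure.example.net/login", links))
    print(top_link_brand_feature("https://www.paypal.com/", links))
```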

For internal and external resources, based on the idea that phishing websites prefer to use external resources, the number of times that they use internal resources and external resources is counted as a feature for four kinds of tags, '< link >', '< img >', '< script >' and '< script >', respectively.

For the number of times of appearing URL brand names in HTML, internal resources are frequently used in HTML codes of normal websites, so that the number of times of appearing URL brand names in HTML is large. In contrast, phishing websites are less.

For the warning window: some phishing websites pop up a warning window and ask the user to enter personal information in it.

For redirection, many phishing site designers will first create a normal website, and when a user accesses the website link, the link accessed by the user is redirected to the phishing site by a redirection method. The HTML code of the redirected web page typically has a 'redirect' string.

For Word2Vec features: word2vec is a widely used method in natural language processing. Drawing on ideas from deep learning, it is trained so that the processing of text content is reduced to vector operations in a K-dimensional vector space, and similarity in the vector space can be used to represent semantic similarity between texts. word2vec, open-sourced by Google in 2013, is an efficient tool for representing words as real-valued vectors; its function is to map a word to a vector of a specified dimension. In this embodiment, word2vec is used to represent the HTML code as a K-dimensional vector, and the K-dimensional vector is taken as a feature, as follows:

1. removing all contents of < style >, < script > tags in the HTML codes;

2. removing the remaining tags but keeping the text content inside them; for example, <div><h2>This is my homepage</h2></div> is converted into "This is my homepage";

3. performing Chinese and English matching on the text content according to a certain rule, and dividing the text into English content and Chinese content;

4. for the Chinese content in the text, performing Chinese word segmentation by using a jieba tool, and performing word segmentation on English content according to a blank space, thereby obtaining a word list with k words in the HTML code;

5. for each word in the word list, converting the word into a 300-dimensional vector by using word2vec;

6. adding all the vectors and dividing each dimension by k, to obtain a 300-dimensional vector that represents the text content of the HTML code.
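Steps 1 to 6 can be sketched end to end as follows, using only the standard library. The embeddings argument is a hypothetical word-to-vector dictionary standing in for a trained word2vec model, the default Chinese segmenter simply splits into single characters as a stand-in for jieba, and words missing from the dictionary are assumed to contribute a zero vector so that the sum is still divided by k.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Steps 1-2: drop <style>/<script> content, keep the text of other tags."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


_CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def tokenize(text, segment_chinese=list):
    """Steps 3-4: split Chinese runs with a segmenter, English on whitespace."""
    words = []
    for run in _CJK_RUN.findall(text):
        words.extend(segment_chinese(run))  # stand-in for a jieba segmenter
    words.extend(_CJK_RUN.sub(" ", text).split())
    return words


def html_to_vector(html, embeddings, dim, segment_chinese=list):
    """Steps 5-6: look up each word's vector, sum them, and divide by k."""
    parser = _TextExtractor()
    parser.feed(html)
    words = tokenize(" ".join(parser.chunks), segment_chinese)
    total = [0.0] * dim
    for word in words:
        vector = embeddings.get(word)  # assumption: unknown words add nothing
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    k = len(words)
    return [t / k for t in total] if k else total


if __name__ == "__main__":
    page = ("<style>p{}</style><div><h2>This is my homepage</h2></div>"
            "<script>var a;</script>")
    toy = {"This": [1.0, 0.0], "homepage": [0.0, 2.0]}  # toy 2-dimensional vectors
    print(html_to_vector(page, toy, dim=2))
```

The toy example uses 2-dimensional vectors for readability; the embodiment uses 300 dimensions.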

Feature extraction is performed by the method described above, and the URL feature and the HTML feature are combined, so that 338-dimensional features are finally extracted.

S102: inputting the N-dimensional current features into a trained stacking model to perform feature expansion, obtaining N + n-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;

In a specific implementation, inputting the N-dimensional current features into the trained stacking model to perform feature expansion, obtaining the N + n-dimensional features of the webpage to be classified, comprises the following steps:

in specific implementation, a training sample is selected to train the basic classification model, that is, any one of the training samples is selected once, so that m prediction results are obtained.

Steps S2-S3 constitute one layer of stacking of the stacking model. Steps S2-S3 are repeated, and when q layers of stacking are completed, the N + n-dimensional features of the webpage to be classified are obtained and the feature expansion is complete. This embodiment does not limit the number of stacked layers of the stacking model; the number of layers is preferably 2, at which the system accuracy is highest.
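The layered feature expansion can be sketched as below. This is an illustration under several assumptions: a nearest-centroid classifier stands in for the GBDT, XGBoost and LightGBM base models, the training set is split round-robin into m parts, and the training rows are expanded in the same way as the page being classified so that the next layer can be trained on matching dimensions (the text does not describe this last point). All function names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Vote over the m intermediate predictions."""
    return Counter(labels).most_common(1)[0][0]


def fit_nearest_centroid(X, y):
    """Stand-in base learner; returns a predict(row) function."""
    centroids = {}
    for label in set(y):
        rows = [r for r, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        return min(centroids,
                   key=lambda label: sum((a - b) ** 2
                                         for a, b in zip(row, centroids[label])))
    return predict


def expand_features(train_X, train_y, query_X, base_learners, m, q):
    """Append p voted predictions per layer for q layers (n = p * q new dims)."""
    train_X = [list(r) for r in train_X]
    query_X = [list(r) for r in query_X]
    for _ in range(q):                                   # S4: one pass per layer
        # S1: divide the training set into m parts (round-robin, an assumption)
        parts = [(train_X[i::m], train_y[i::m]) for i in range(m)]
        train_new = [[] for _ in train_X]
        query_new = [[] for _ in query_X]
        for fit in base_learners:                        # S3: each base model
            predictors = [fit(X, y) for X, y in parts]   # S2: one model per part
            for new_cols, rows in ((train_new, train_X), (query_new, query_X)):
                for cols, row in zip(new_cols, rows):
                    cols.append(majority_vote([p(row) for p in predictors]))
        train_X = [r + n for r, n in zip(train_X, train_new)]
        query_X = [r + n for r, n in zip(query_X, query_new)]
    return query_X


if __name__ == "__main__":
    X = [[0.0], [0.1], [0.2], [1.0], [1.1], [0.9]]
    y = [0, 0, 0, 1, 1, 1]
    learners = [fit_nearest_centroid, fit_nearest_centroid]  # p = 2
    print(expand_features(X, y, [[0.05], [0.95]], learners, m=3, q=2))
```

With p = 2 stand-in learners and q = 2 layers, each 1-dimensional input row gains n = 4 added dimensions, mirroring how the embodiment goes from 338 to 344 dimensions with p = 3 and q = 2.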

S103: obtaining a classification result of the webpage to be classified by utilizing a classification algorithm according to the N + n-dimensional features.

In a specific implementation, the classification algorithm includes a GBDT algorithm, but a person skilled in the art may select other classification algorithms according to practical situations, and the classification algorithm is not limited in this embodiment. And for the detection of the phishing website, outputting a detection result whether the webpage to be classified is the phishing website.

According to the webpage classification method provided by the embodiment of the invention, the N-dimensional current characteristics of the webpage to be classified are expanded by using the stacking model, and the accuracy of webpage classification is improved on the premise of not depending on a search engine or third-party service.

The embodiment of the invention discloses a webpage classification method, and compared with the previous embodiment, the technical scheme is further explained and optimized in the embodiment. Specifically, the method comprises the following steps:

referring to fig. 2 and fig. 3, fig. 2 is a flowchart of another web page classification method provided by the embodiment of the present invention, and fig. 3 is a flowchart of feature extension in a Stacking model of another web page classification method disclosed by the embodiment of the present invention, as shown in fig. 2, including:

S211: acquiring 338-dimensional current features of the webpage to be classified;

S212: standardizing the 338-dimensional current features by using the Z-score method;

S221: acquiring a training set, and dividing the training set into m parts of training samples; wherein each webpage in the training set comprises 338-dimensional features;

S222: selecting one part of the training samples to train the basic classification model, and predicting the webpage to be classified by using the trained basic classification model, until all m parts of training samples have been selected and m intermediate prediction results are obtained; then voting on the m intermediate prediction results to obtain a one-dimensional added feature;

In this embodiment, the basic classification models include a GBDT classification model, an XGBoost classification model and a LightGBM classification model. Of course, those skilled in the art can select other classification models according to actual situations, and all of them should be within the scope of the present invention.

S223: repeating S222 until all the basic classification models have been trained, obtaining three-dimensional added features, and merging the three-dimensional added features with the current features of the webpage to be classified to serve as the current features of the webpage to be classified;

S224: repeating S222-S223 until two-layer stacking of the stacking model is completed, obtaining 344-dimensional features of the webpage to be classified;

S203: obtaining a classification result of the webpage to be classified by using the GBDT algorithm according to the 344-dimensional features.

For the above embodiment, a relatively large data set S is selected, which contains the webpage source code of 10000 normal websites and 10000 phishing websites. The normal websites come from websites with Alexa ranks between 10000 and 12000 and some of the links within those websites, which ensures a uniform length distribution of URLs; the phishing websites come from Phishtank and cover the period from June 2009 to June 2017. In addition, for comparison, a small data set T was collected as the test set. The normal websites in the test set come from websites with Alexa ranks between 10000 and 12000 and some of the links within those websites, and the phishing websites are 1000 links verified on Phishtank between July 12, 2017 and July 15, 2017.

Accuracy, missed alarm rate and false alarm rate are used as evaluation indexes. If P is the number of phishing websites in the test set, L is the number of legal websites in the test set, alpha is the number of phishing websites predicted correctly, and beta is the number of legal websites predicted correctly, the calculation formula of the accuracy is as follows:

$$\text{Accuracy} = \frac{\alpha + \beta}{P + L}$$

the calculation formula of the missed alarm rate is as follows:

$$\text{Missed alarm rate} = \frac{P - \alpha}{P}$$

the calculation formula of the false alarm rate is as follows:

$$\text{False alarm rate} = \frac{L - \beta}{L}$$

in order to prove the effectiveness of the stacking model in this embodiment, 3000 phishing websites and 3000 legal websites are respectively randomly selected from the data set S as training sets, the trained models are tested on the test set, and the performance of each model is shown in table 1 below.

TABLE 1 Performance of the various models

There have been many research and practical applications in phishing website detection, testing using prior art methods applied to our data set, with the results shown in table 2 below:

TABLE 2 comparison of the Performance of the methods

Method	Missed alarm rate (%)	False alarm rate (%)	Accuracy (%)
Cantina	70	7.5	61.25
Varshney	7.6	48	72.2
Rakesh	7.8	9.5	91.35
This embodiment	3.4	3.7	96.45
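The three evaluation indexes can be computed as in the sketch below, which assumes the usual definitions: the missed alarm rate is the share of phishing websites not detected, the false alarm rate is the share of legal websites flagged as phishing, and the accuracy is the share of all websites predicted correctly. The counts in the usage example are hypothetical and assume a balanced test set of 1000 phishing and 1000 legal websites, which the text does not state; under that assumption they are consistent with the last row of Table 2.

```python
def evaluate(P, L, alpha, beta):
    """P, L: phishing/legal sites in the test set; alpha, beta: correct predictions."""
    return {
        "missed_alarm_rate": (P - alpha) / P,
        "false_alarm_rate": (L - beta) / L,
        "accuracy": (alpha + beta) / (P + L),
    }


if __name__ == "__main__":
    # Hypothetical counts: 966 of 1000 phishing and 963 of 1000 legal sites correct.
    for name, value in evaluate(P=1000, L=1000, alpha=966, beta=963).items():
        print(f"{name}: {value * 100:.2f}%")
```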

In the following, a web page classification system provided by an embodiment of the present invention is introduced, and a web page classification system described below and a web page classification method described above may be referred to each other.

Referring to fig. 4, a structure diagram of a web page classification system according to an embodiment of the present invention is shown in fig. 4, and includes:

an obtaining module 401, configured to obtain an N-dimensional current feature of a webpage to be classified; wherein N is a positive integer;

an extension module 402, configured to input the N-dimensional current features into a trained stacking model for feature expansion, so as to obtain N + n-dimensional features of the webpage to be classified; the stacking model is a model formed by stacking p basic classification models through q layers, n is the product of p and q, and n, p and q are positive integers;

and a classification module 403, configured to obtain a classification result of the webpage to be classified by using a classification algorithm according to the N + n-dimensional features.

According to the webpage classification system provided by the embodiment of the invention, the N-dimensional current characteristics of the webpage to be classified are expanded by using the stacking model, so that the accuracy of webpage classification is improved on the premise of not depending on a search engine or third-party service.

On the basis of the above embodiment, as a preferred implementation, the extension module 402 includes:

On the basis of the above embodiments, as a preferred implementation, the system further includes:

a standardization module, configured to standardize the N-dimensional features by using the Z-score method.
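Z-score standardization rescales each feature to zero mean and unit standard deviation, z = (x − μ) / σ. The sketch below is illustrative only: the function name `z_score` is hypothetical, the population standard deviation is assumed (the text does not say whether the population or sample form is used), and a constant column is left at zero to avoid division by zero.

```python
from statistics import mean, pstdev
from typing import List


def z_score(rows: List[List[float]]) -> List[List[float]]:
    """Standardize each feature column of `rows` to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Assumption: a constant column has sigma 0, so divide by 1.0 instead.
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [[(v - mu) / s for v, mu, s in zip(row, mus, sigmas)] for row in rows]


if __name__ == "__main__":
    # Three pages, two features each (toy values).
    print(z_score([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]))
```

In practice the mean and standard deviation would be computed on the training samples and then reused for the web page to be classified, so that both are on the same scale.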

The present application further provides a web page classification device. Referring to Fig. 5, a structural diagram of a web page classification device according to an embodiment of the present invention, the device includes:

a memory 501 for storing a web page classification program;

a processor 502, configured to implement the steps provided in the above embodiments when executing the web page classification program. The web page classification device may, of course, also include components such as network interfaces and a power supply.

According to the webpage classification device provided by the embodiment of the invention, the N-dimensional current characteristics of the webpage to be classified are expanded by using the stacking model, so that the accuracy of webpage classification is improved on the premise of not depending on a search engine or third-party service.

The present application also provides a computer-readable storage medium having a web page classification program stored thereon, which, when executed by a processor, can implement the steps provided by the above embodiments. The storage medium may include any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts may be cross-referenced among the embodiments. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method. It should be noted that those skilled in the art may make improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A method for classifying web pages, comprising:

obtaining a classification result of the web page to be classified by using a classification algorithm according to the (N + n)-dimensional features;

S2: selecting one basic classification model from the p basic classification models as a target basic classification model; selecting one part of the training samples to train the target basic classification model and predicting the web page to be classified by using the trained target basic classification model, until all m parts of the training samples have been selected, to obtain m intermediate prediction results; and voting on the m intermediate prediction results to obtain a one-dimensional added feature;

S3: repeating S2 until all p basic classification models have been used as the target basic classification model and trained, to obtain p-dimensional added features, and merging the p-dimensional added features with the current features of the web page to be classified to serve as the current features of the web page to be classified;

2. The method for classifying web pages according to claim 1, wherein when p is 3, the basic classification models comprise a GBDT classification model, an XGBoost classification model and a LightGBM classification model.

3. The method for classifying web pages according to claim 1, after obtaining the N-dimensional features of the web pages to be classified, further comprising:

standardizing the N-dimensional features by using the Z-score method.

4. The method of classifying web pages according to claim 1, wherein the classification algorithm comprises the GBDT algorithm.

5. The method for classifying web pages according to any one of claims 1 to 4, wherein the N-dimensional current features comprise URL features and HTML features of the web page to be classified, the URL features comprise the number of top-level domain names and similar known brands, and the HTML features comprise the number of empty anchor links, whether the brand in the title is consistent with the brand in the URL, whether the brand that occurs most often in the HTML is consistent with the brand in the URL, the numbers of internal and external resources, and Word2vec features.

6. A system for classifying web pages, comprising:

a classification module, configured to obtain a classification result of the web page to be classified by using a classification algorithm according to the (N + n)-dimensional features;

wherein the expansion module comprises:

a prediction unit, configured to: select one basic classification model from the p basic classification models as a target basic classification model; select one part of the training samples to train the target basic classification model and predict the web page to be classified by using the trained target basic classification model, until all m parts of the training samples have been selected, to obtain m intermediate prediction results; vote on the m intermediate prediction results to obtain a one-dimensional added feature; repeat the working process of the prediction unit until all p basic classification models have been used as the target basic classification model and trained, to obtain p-dimensional added features; and then start the working process of the merging unit;

7. A web page classification apparatus, comprising:

a memory for storing a web page classification program;

a processor for implementing the steps of the web page classification method according to any one of claims 1 to 5 when executing the web page classification program.

8. A computer-readable storage medium, on which a web page classification program is stored, which, when executed by a processor, implements the web page classification method according to any one of claims 1 to 5.