CN108920492B

CN108920492B - Webpage classification method, system, terminal and storage medium

Info

Publication number: CN108920492B
Application number: CN201810465784.3A
Authority: CN
Inventors: 张君晖
Original assignee: Guangzhou Sunteng Information Technology Co ltd
Current assignee: Guangzhou Sunteng Information Technology Co ltd
Priority date: 2018-05-16
Filing date: 2018-05-16
Publication date: 2021-04-09
Anticipated expiration: 2038-05-16
Also published as: CN108920492A

Abstract

The invention discloses a webpage classification method, a system, a terminal and a storage medium, wherein the method comprises the steps of obtaining webpage link information; inputting the obtained webpage link information into a text classification model for classification processing, and outputting a site classification result corresponding to the webpage link information; the text classification model is trained based on a Boosting integration method. The system comprises an acquisition module and a processing module. The terminal comprises a memory for storing a program and a processor for loading a program to perform the method steps. By using the method and the device, the webpage can be classified quickly and accurately. The webpage classification method, the webpage classification system, the webpage classification terminal and the webpage classification storage medium can be widely applied to the field of text classification.

Description

Webpage classification method, system, terminal and storage medium

Technical Field

The present invention relates to data classification processing technologies, and in particular, to a method, a system, a terminal, and a storage medium for classifying web pages.

Background

Explanation of technical words:

f1 Score: f1 score is an index used for measuring the accuracy of the two classification models in statistics, and gives consideration to the accuracy and the recall rate of the classification models; in particular, the F1 score may be viewed as a weighted average of model accuracy and recall.

Boosting: boosting, which is a method used to improve the accuracy of weak classification algorithms, is performed by constructing a series of prediction functions and then combining them in a certain way into a prediction function.

The advertisement bidding system needs to process requests of billions level every day, wherein each bidding request contains page information, equipment information, user information and the like, and the information is landed on a server in a log form, and then through the analysis of an algorithm, required data is extracted and persisted in a database. However, sometimes the bid request lacks keywords and related descriptions of the requested page, and in order to solve this problem, a method commonly used in the industry at present is to extract links of the requested page and give the links of the requested page to crawlers to crawl information of these pages. However, with the advertisement bidding system, there are also the following problems: 1. under the condition of massive bidding requests, the data volume of the extracted request page is very huge, and the data volume of the result crawled from the request page is also very huge, so that the problem of high difficulty in data storage and management exists; 2. some categories of advertisement pages have poor delivery effect, so when crawlers of the advertisement pages need to be paused, a relatively complex flow is needed to remove the urls; 3. different advertisement pages have different requirements on the crawler capacity, so that the processing capacity of the crawler is difficult to balance and the bandwidth is difficult to reasonably utilize on the premise that the advertisement pages are not classified.

Disclosure of Invention

In order to solve the above technical problems, an object of the present invention is to provide a method, a system, a terminal and a storage medium for classifying web pages, which have high classification processing rate and accuracy.

The first technical scheme adopted by the invention is as follows: a webpage classification method comprises the following steps:

acquiring webpage link information;

inputting the obtained webpage link information into a text classification model for classification processing, and outputting a site classification result corresponding to the webpage link information;

the text classification model is trained based on a Boosting integration method.

The second technical scheme adopted by the invention is as follows: a web page classification system comprising:

the acquisition module is used for acquiring webpage link information;

the processing module is used for inputting the acquired webpage link information into a text classification model for classification processing and then outputting a site classification result corresponding to the webpage link information;

The third technical scheme adopted by the invention is as follows: a terminal, comprising:

at least one processor;

at least one memory for storing at least one program;

when the at least one program is executed by the at least one processor, the at least one processor may implement the method for classifying web pages according to the first aspect.

The fourth technical scheme adopted by the invention is as follows: a storage medium having stored therein processor-executable instructions for performing a method of classifying web pages as described in the first aspect when the processor-executable instructions are executed by a processor.

The invention has the advantages that: according to the method, the obtained webpage link information is input into a text classification model trained based on a Boosting integration method for classification, and then a site classification result corresponding to the webpage link information is output, so that classification storage of webpage data is facilitated, management and subsequent query processing are facilitated, information crawling processing is not directly performed on webpages with poor advertising page delivery effects, waste of resources is avoided, processing efficiency is improved, processing capacity of crawlers can be balanced, bandwidth is reasonably utilized, and downstream algorithm tasks are supported. In addition, the method adopts the text classification model trained based on the Boosting integration method to realize the webpage classification, so that the processing rate and the accuracy of the classification are very high.

Drawings

FIG. 1 is a flowchart illustrating the steps of a method for classifying web pages according to the present invention;

FIG. 2 is a schematic diagram of a text classification model used in a web page classification method according to the present invention;

FIG. 3 is a schematic diagram of a web page classification system according to the present invention;

fig. 4 is a schematic structural diagram of a terminal according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and preferred embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

As shown in fig. 1, an embodiment of the present invention provides a method for classifying web pages, which includes the following steps:

s101, acquiring webpage link information;

specifically, in this embodiment, the web page is an advertisement page;

s102, inputting the acquired webpage link information into a text classification model for classification, and outputting a site classification result corresponding to the webpage link information;

According to the method, the constructed text classification model trained based on the Boosting integration method is on line, advertisement page link information contained in bidding requests in real-time logs is classified by using the text classification model, site classification is carried out in real time, and therefore advertisement page links of different site categories can be handed to crawlers of corresponding categories for information crawling processing, classified storage of webpage data is facilitated, management, follow-up query processing and the like are facilitated, information crawling processing can be directly omitted for webpages of the categories with poor advertisement page launching effects, waste of resources is avoided, processing efficiency is improved, load capacity of balanced crawlers can be greatly improved, bandwidth is reasonably utilized, and downstream algorithm tasks are supported; in addition, the method adopts the text classification model trained based on the Boosting integration method to realize the webpage classification, so that the processing rate and the accuracy of the classification are very high.

In a preferred embodiment, the web page link information includes a website title keyword list corresponding to the web page link, where the keyword list includes at least one keyword.

In a preferred embodiment, the text classification model is a text classification model constructed by the following construction step S100:

and S1001, acquiring a training data set.

Specifically, as for the step S1001, it preferably includes:

s10011, data acquisition step: selecting data stored in the last r days from a persistence layer database as an original data set;

the data feature components (site, title, manual _ site _ cat, and predict _ site _ cat) of the original data set are respectively a site link site, a site title, a manually classified site category manual _ site _ cat, and a model predicted site category predict _ site _ cat, that is, an original data in the original data set includes four kinds of information, i.e., a site link, a site title, a manually classified site category, and a model predicted site category;

s10012, data preprocessing: removing original data of which the site is null and the title is null or both the preset _ site _ cat and the manual _ site _ cat are null;

s10013, Chinese word segmentation step: sequentially performing Chinese word segmentation and stop word removal processing on title to obtain a processed data set A, wherein a data characteristic component (cat: title _ keywords) of the processed data set A is a data (sample) in the data set A, and the data (sample) comprises a category cat of a site and a keyword list (site title keyword list) obtained after the stop word removal of the site title;

s10014, data set preparation: dividing the processed data set A obtained in the step S10013 into a training data set X and a testing data set TE according to a preset proportion, such as 3: 1;

specifically, the data of 3/4 in the processed data set a is divided as a training data set X, the data of 1/4 is divided as a test data set T; the text classification method comprises the steps that title _ keywords are used as input data, a cat is used as output data, namely, the title _ keywords in a training data set X are used as training input data, the cat in the training data set X is used as training output data, and a text classification model is trained by utilizing the training input data and the training output data; and the title _ keywords in the test data set T are used as test input data, the cate in the test data set T is used as test output data, and the test input data and the test output data are used for testing the text classification model obtained after training.

S1002, LiAnd inputting the training data set X into the text classification model by a Boosting integration method for training to obtain a trained text classification model H (X). As shown in fig. 2, it is a schematic diagram of a text classification model used in this embodiment, and the text classification model is implemented by classifying a plurality of basic classifiers H₁、H₂、……、H_nAfter iterative training, carrying out integrated processing on the basic classifier obtained after each iterative training to obtain a classification model; and n is the total number of the basic classifiers, namely the number of the basic classifiers adopted in the text classification model.

Specifically, as for the step S1002, it preferably includes:

s10021, training a current basic classifier by using a current sub-sample set, and calculating the error rate of the basic classifier obtained after training;

the sub-sample set is obtained by distributing corresponding weights to the samples contained in the training data set X; preferably, after corresponding weights are distributed to samples included in the training data set X, N is randomly extracted from the training data set X₁Taking the samples as a sub-sample set, wherein each sample has a corresponding weight; the initial weight of the sample is 1/m, i.e. in the first iteration, the weight of the sample distributed to X is 1/m, at this time, N randomly extracted from X₁Taking the samples as a first subsample set; wherein m is the total number of samples contained in the training data set X;

specifically, in the training data set X, each sample has a corresponding weight, where the set of weight vectors is referred to as D, specifically, D₁,D₂,……,D_TThe T (T ═ 1,2,3, …, T) th round weight vector sets are respectively, and for example, D1 is the weight of the sample in 1 st round iteration;

in the 1 st iteration, m samples exist in a training data set X, after the weight of each sample is initialized to be 1/m, a first sub-sample set S is obtained through random extraction₁I.e. at this time, in iteration 1, S₁I of (1)A sample x_iThe corresponding weight is D₁(x_i)＝1/m；

Then, the first subsample set S₁Input to a first basic classifier H₁Training is carried out, i.e. using the first set of subsamples S₁For the first basic classifier H₁Training is carried out, after one round of training, the first basic classifier H obtained after the training is finished is obtained₁Calculating the error rate; if the current is the t-th iteration, the current t-th sub-sample set S is used_tInput to the current j (j ═ 1,2,3, …, n) th basic classifier H_jTraining is carried out, namely, the current t-th sub-sample set S is utilized_tFor the current j basic classifier H_jTraining is carried out, after one round of training, the jth basic classifier H obtained after the training is finished is subjected to_jCalculating the error rate;

since the number of basic classifiers included in the text classification model is n, and usually the number T of iteration rounds is greater than n, when performing iteration training of the (n + 1) th round, training is returned to the 1 st basic classifier, which is equivalent to training the (n + 1) th basic classifier, that is, in the T-th round of iteration, training is performed on the basic classifier H of the iteration training_jA basic classifier, referred to as the t-th round;

in a preferred embodiment, the error rate is calculated as follows:

in the formula, epsilon (t) is the error rate corresponding to the basic classifier of the tth round after training; if t is n +1, namely epsilon (t) is the error rate corresponding to the basic classifier of the n +1 th round after training;

D_t(x_p) Is the sample x with the p-th classification result as error in the sub-sample set_pI.e. in the training of the t iteration, the p-th classification result in the current sub-sample set is the wrong sample x_pThe weight of (c); wherein p is ∈ [1,2, …, N]N is the classification node in the subsetThe number of samples with errors;

t belongs to [1,2, …, T ], and T is iteration round number;

h_tis the basic classifier of the t-th round;

h_t(x_p) Is a classification result predictor, specifically, it is using sample x_pIn the training, h_tOutputting a classification result prediction value;

y_pis the true value of the classification result;

[h_t(x_p)≠y_p]＝-1；

s10022, when the calculated error rate is converged to the threshold value range, ending the training, and executing the step S10026;

s10023, when the calculated error rate is not converged in the threshold range, executing the step S10024;

s10024, updating the weights of the samples contained in the current sub-sample set according to the calculated error rate, so as to increase the weight of the sample with the classification result as error;

specifically, updating the weight of each sample contained in the current sub-sample set according to the calculated error rate, and increasing the weight of the sample with the classification result of error;

in a preferred embodiment, the weights of the samples included in the current sub-sample set are updated, wherein the weight update formula is as follows:

wherein the content of the first and second substances,

α_tthe weight corresponding to the basic classifier expressed as the t-th round after training;

k is the number of categories (i.e., the number of classification categories) output by the text classification model, and if the text classification model can realize classification of 3 categories, k is 3;

Z_tis to make

A normalization factor of (c);

D_t(x_i) Is the ith sample x in the subset_iI.e. the ith sample x in the current subset during the training of the t iteration_iThe weight of (c); wherein i belongs to [1,2₁]，N₁The total number of samples of the sub-sample set;

D_t+1(x_i) Is the ith sample x in the subset_iThe weight of the (t + 1) th iteration, i.e. the ith sample x in the sub-sample set_iAn updated weight;

h_t(x_i) Is a classification result predictor, specifically, it is using sample x_iIn the training, h_tOutputting a classification result prediction value;

y_iis the true value of the classification result;

from the above formula, for the sample with correct classification result, in the next iteration, i.e. the t +1 th iteration, the corresponding weight is

And then [ h ]_t(x_i)＝y_i]1 is ═ 1; for the sample with the wrong classification result, in the next iteration, namely the t +1 th iteration, the corresponding weight is

And then [ h ]_t(x_i)≠y_i]＝-1；

S10025, after the updated weight is distributed to the samples in the training data set, obtaining the next sub-sample set, inputting the next sub-sample set into the next basic classifier, and returning to execute the step S10021;

specifically, the updated sample weight is distributed to the corresponding sample in the training data set to replace the current weight of the corresponding sample in XAfter re-updating, re-randomly extracting N from X₁Obtaining a next sub-sample set by the samples, inputting the next sub-sample set into a next basic classifier, and training the next basic classifier by using the next sub-sample set to realize the next round of iterative training;

s10026, integrating a plurality of basic classifiers to obtain a trained text classification model;

specifically, after T rounds of iteration, when the error rate converges to within the threshold range, the model training process is ended, at this time, multiple rounds of iterative training on n basic classifiers are completed, and then the basic classifiers obtained by each round of iterative training are integrated, that is, the basic classifiers of the 1 st, 2 nd, 3 rd, … … th and T rounds are integrated to obtain the final required text classification model; the basic classifiers obtained by each iteration training are integrated to obtain the final required text classification model H (x), and the adopted integration formula is as follows:

wherein Y ∈ Y indicates that the prediction result belongs to the label set Y.

For the n basic classifiers used in the text classification model, they may be the same or different basic classifiers, and in order to further improve the classification accuracy, it is preferable to select a basic classifier with an error rate range of [0,1/k ] as the basic classifier needed to be used in the text classification model of this embodiment, where k is the number of classes (i.e., the number of classification classes) output by the text classification model; for example, the classification effect (i.e. the error rate range of the two basic classifiers, namely the SVM and the TextGrocery) meets the condition of being within [0,1/k ], so that the SVM and/or the TextGrocery can be selected as the basic classifiers; the experimental result shows that the integrated classifier obtained by integrating the SVM and/or the TextGrocery weak classifiers has better classification effect and higher accuracy than the SVM/TextGrocery alone, which shows that the classification effect of the integrated classifier is better than that of the single basic classifier. Therefore, the classification scheme of the invention can improve the accuracy of text classification by a Boosting integration method on the basis of single textGrocery.

In a preferred embodiment, the constructing step S100 further includes:

s1003, performing ten-fold cross validation on the text classification model H (X) after training in the step S1002 by using a training data set X;

specifically, after the training data set X is scattered into ten parts, ten-fold cross validation is performed on the text classification model H (X) by using the ten parts of data so as to validate the accuracy and stability of the model;

s1004, when the model H (x) passes the verification, testing the text classification model passing the verification by using a test data set TE;

specifically, the test data set TE is input into the text classification model after passing the verification, and the accuracy, the recall rate, and F1-score of the text classification model are calculated and compared with the accuracy, the recall rate, and F1-score of the plurality of basic classifiers described in the above step S10026, so as to test the accuracy of the model and ensure the accuracy of the model classification.

The webpage classification scheme can be applied to an advertisement bidding system for site classification of advertisement pages, can also be applied to other systems for site classification of webpages in other fields (such as games, shopping and the like), and has wide application range and high compatibility.

As shown in fig. 3, an embodiment of the present invention further provides a web page classification system, which includes:

an obtaining module 201, configured to obtain webpage link information;

the processing module 202 is configured to input the obtained webpage link information into a text classification model for classification processing, and output a site classification result corresponding to the webpage link information;

In a preferred embodiment, the system further comprises a construction module for constructing the text classification model.

The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.

As shown in fig. 4, an embodiment of the present invention further provides a terminal, including:

at least one processor 301;

at least one memory 302 for storing at least one program;

when executed by the at least one processor 301, the at least one program causes the at least one processor 301 to implement a method step of web page classification as described in the above method embodiments.

The contents in the foregoing method embodiments are all applicable to this terminal embodiment, the functions specifically implemented by this terminal embodiment are the same as those in the foregoing method embodiments, and the beneficial effects achieved by this terminal embodiment are also the same as those achieved by the foregoing method embodiments.

Embodiments of the present invention further provide a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the steps of a method for classifying web pages as described in the above method embodiments.

The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A webpage classification method is characterized by comprising the following steps:

acquiring webpage link information;

inputting the obtained webpage link information into a text classification model for classification processing, and then outputting a site classification result corresponding to the webpage link information, wherein the site classification result is used for a crawler of a corresponding category to perform information crawling processing on the webpage link information of the corresponding category;

the text classification model is trained based on a Boosting integration method, and the webpage link information comprises a website title keyword list corresponding to the webpage link;

the text classification model is constructed by the following construction steps:

acquiring a training data set;

inputting a training data set into a text classification model by using a Boosting integration method for training so as to obtain a trained text classification model;

the method for acquiring the text classification model by using the Boosting integration method comprises the following steps of inputting a training data set into the text classification model for training to obtain the trained text classification model:

s1, training the current basic classifier by using the current sub-sample set, and calculating the error rate of the basic classifier obtained after training;

s2, when the calculated error rate is converged to the threshold value range, ending the training and executing the step S6;

s3, when the calculated error rate is not converged in the threshold value range, executing the step S4;

s4, updating the weights of the samples contained in the current sub-sample set according to the calculated error rate, so as to increase the weight of the sample with the classification result as error;

s5, distributing the updated weight to the sample in the training data set, obtaining the next sub-sample set, inputting the next sub-sample set to the next basic classifier, and returning to execute the step S1;

s6, integrating a plurality of basic classifiers to obtain a trained text classification model;

the basic classifier adopted in the text classification model is a basic classifier with the error rate range of [0,1/k ], and k is the number of classes output by the text classification model;

the updating of the weights of the samples contained in the current sub-sample set is performed, where the weight updating formula is as follows:

wherein the content of the first and second substances,

Z_tis to make

A normalization factor of (c); d_t(x_i) Is the ith sample x in the subset_iWeight of the tth iteration; i is an element of [1,2₁]，N₁The total number of samples of the sub-sample set; h is_t(x_i) Is the classification result prediction value; y is_iIs the true value of the classification result.

2. The method for classifying web pages according to claim 1, wherein the constructing step further comprises:

performing ten-fold cross validation on the trained text classification model by using a training data set;

and after the model passes the verification, testing the verified text classification model by using the test data set.

3. The method for classifying web pages according to claim 1, wherein the error rate is calculated as follows:

wherein epsilon (t) is the error rate corresponding to the basic classifier of the tth round after training; d_t(x_p) Is the sample x with the p-th classification result as error in the sub-sample set_pWeight of the tth iteration; p is equal to [1,2, …, N ∈]N is the number of samples with wrong classification results in the sub-sample set; t is e [1,2, …, T]T is the number of iteration rounds; h is_tIs the basic classifier of the t-th round; h is_t(x_p) Is the classification result prediction value; y is_pIs the true value of the classification result; [ h ] of_t(x_p)≠y_p]＝-1。

4. A system for classifying web pages, comprising:

the acquisition module is used for acquiring webpage link information;

the processing module is used for inputting the acquired webpage link information into a text classification model for classification processing and then outputting a site classification result corresponding to the webpage link information, wherein the site classification result is used for a crawler of a corresponding category to perform information crawling processing on the webpage link information of the corresponding category;

acquiring a training data set;

wherein the content of the first and second substances,

Z_tis to make

A normalization factor of (c); d_t(x_i) Is the ith sample x in the subset_iWeight of the tth iteration; i ∈ [1,2,.....,N₁]，N₁the total number of samples of the sub-sample set; h is_t(x_i) Is the classification result prediction value; y is_iIs the true value of the classification result.

5. A terminal, comprising:

at least one processor;

at least one memory for storing at least one program;

when executed by the at least one processor, cause the at least one processor to implement a method of classifying web pages as claimed in any one of claims 1 to 3.

6. A storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform a method of classifying web pages as claimed in any one of claims 1 to 3.