CN108920492B - Webpage classification method, system, terminal and storage medium - Google Patents

Webpage classification method, system, terminal and storage medium

Info

Publication number
CN108920492B
CN108920492B (application CN201810465784.3A)
Authority
CN
China
Prior art keywords
classification model
text classification
training
sample
error rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810465784.3A
Other languages
Chinese (zh)
Other versions
CN108920492A (en
Inventor
张君晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Sunteng Information Technology Co ltd
Original Assignee
Guangzhou Sunteng Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Sunteng Information Technology Co ltd filed Critical Guangzhou Sunteng Information Technology Co ltd
Priority to CN201810465784.3A priority Critical patent/CN108920492B/en
Publication of CN108920492A publication Critical patent/CN108920492A/en
Application granted granted Critical
Publication of CN108920492B publication Critical patent/CN108920492B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage classification method, system, terminal and storage medium. The method comprises the steps of: obtaining webpage link information; inputting the obtained webpage link information into a text classification model for classification processing; and outputting a site classification result corresponding to the webpage link information, wherein the text classification model is trained based on a Boosting integration method. The system comprises an acquisition module and a processing module. The terminal comprises a memory for storing a program and a processor for loading the program to perform the method steps. By using the method and the device, webpages can be classified quickly and accurately. The webpage classification method, system, terminal and storage medium can be widely applied in the field of text classification.

Description

Webpage classification method, system, terminal and storage medium
Technical Field
The present invention relates to data classification processing technologies, and in particular, to a method, a system, a terminal, and a storage medium for classifying web pages.
Background
Explanation of technical words:
F1 score: the F1 score is an index used in statistics to measure the accuracy of a binary classification model; it takes into account both the precision and the recall of the classification model. Specifically, the F1 score can be viewed as a weighted average of the model's precision and recall.
Boosting: boosting, which is a method used to improve the accuracy of weak classification algorithms, is performed by constructing a series of prediction functions and then combining them in a certain way into a prediction function.
The advertisement bidding system needs to process billions of requests every day. Each bidding request contains page information, device information, user information and the like; this information lands on the server in the form of logs, and the required data is then extracted through algorithmic analysis and persisted in a database. However, a bidding request sometimes lacks the keywords and related description of the requested page. To solve this problem, a method commonly used in the industry at present is to extract the links of the requested pages and hand them to crawlers, which crawl the information of these pages. With the advertisement bidding system, however, the following problems remain:
1. Under massive bidding requests, the volume of extracted request pages is very large, and so is the volume of results crawled from them, so data storage and management are very difficult;
2. Some categories of advertisement pages have poor delivery effect, so when crawling of these advertisement pages needs to be paused, a relatively complex process is required to remove their URLs;
3. Different advertisement pages place different requirements on crawler capacity, so if the advertisement pages are not classified, it is difficult to balance the processing capacity of the crawlers and to utilize bandwidth reasonably.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method, a system, a terminal and a storage medium for classifying web pages, which have high classification processing rate and accuracy.
The first technical scheme adopted by the invention is as follows: a webpage classification method comprises the following steps:
acquiring webpage link information;
inputting the obtained webpage link information into a text classification model for classification processing, and outputting a site classification result corresponding to the webpage link information;
the text classification model is trained based on a Boosting integration method.
The second technical scheme adopted by the invention is as follows: a web page classification system comprising:
the acquisition module is used for acquiring webpage link information;
the processing module is used for inputting the acquired webpage link information into a text classification model for classification processing and then outputting a site classification result corresponding to the webpage link information;
the text classification model is trained based on a Boosting integration method.
The third technical scheme adopted by the invention is as follows: a terminal, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the method for classifying web pages according to the first aspect.
The fourth technical scheme adopted by the invention is as follows: a storage medium having stored therein processor-executable instructions for performing a method of classifying web pages as described in the first aspect when the processor-executable instructions are executed by a processor.
The invention has the following advantages: the obtained webpage link information is input into a text classification model trained based on a Boosting integration method for classification, and a site classification result corresponding to the webpage link information is output. This facilitates classified storage of webpage data, and thereby management and subsequent query processing; webpages in categories whose advertisement pages have poor delivery effect are simply not crawled, which avoids wasting resources and improves processing efficiency; and the processing capacity of the crawlers can be balanced, bandwidth utilized reasonably, and downstream algorithm tasks supported. In addition, because the method uses a text classification model trained based on the Boosting integration method to classify webpages, both the processing rate and the accuracy of classification are very high.
Drawings
FIG. 1 is a flowchart illustrating the steps of a method for classifying web pages according to the present invention;
FIG. 2 is a schematic diagram of a text classification model used in a web page classification method according to the present invention;
FIG. 3 is a schematic diagram of a web page classification system according to the present invention;
fig. 4 is a schematic structural diagram of a terminal according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and preferred embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
As shown in fig. 1, an embodiment of the present invention provides a method for classifying web pages, which includes the following steps:
s101, acquiring webpage link information;
specifically, in this embodiment, the web page is an advertisement page;
s102, inputting the acquired webpage link information into a text classification model for classification, and outputting a site classification result corresponding to the webpage link information;
the text classification model is trained based on a Boosting integration method.
According to the method, the constructed text classification model trained based on the Boosting integration method is brought online, and the advertisement page link information contained in the bidding requests in real-time logs is classified by this model, so that site classification is carried out in real time. Advertisement page links of different site categories can then be handed to crawlers of the corresponding categories for information crawling, which facilitates classified storage of webpage data as well as management and subsequent query processing. Webpages in categories whose advertisement pages have poor delivery effect can simply be skipped during crawling, avoiding a waste of resources and improving processing efficiency; the load of the crawlers can be balanced, bandwidth utilized reasonably, and downstream algorithm tasks supported. In addition, because the method uses a text classification model trained based on the Boosting integration method to classify webpages, both the processing rate and the accuracy of classification are very high.
In a preferred embodiment, the web page link information includes a website title keyword list corresponding to the web page link, where the keyword list includes at least one keyword.
In a preferred embodiment, the text classification model is a text classification model constructed by the following construction step S100:
and S1001, acquiring a training data set.
Specifically, as for the step S1001, it preferably includes:
s10011, data acquisition step: selecting data stored in the last r days from a persistence layer database as an original data set;
the data feature components of the original data set are (site, title, manual_site_cat, predict_site_cat), respectively the site link site, the site title title, the manually classified site category manual_site_cat, and the model-predicted site category predict_site_cat; that is, one piece of original data in the original data set includes four kinds of information: the site link, the site title, the manually classified site category, and the model-predicted site category;
s10012, data preprocessing: removing original data in which site is null, title is null, or both predict_site_cat and manual_site_cat are null;
s10013, Chinese word segmentation step: sequentially performing Chinese word segmentation and stop-word removal on title to obtain a processed data set A, whose data feature components are (cat, title_keywords); that is, one piece of data (one sample) in data set A includes the category cat of the site and the keyword list title_keywords (the site title keyword list) obtained after word segmentation and stop-word removal of the site title;
s10014, data set preparation: dividing the processed data set A obtained in step S10013 into a training data set X and a test data set TE according to a preset proportion, such as 3:1;
specifically, 3/4 of the data in the processed data set A is divided off as the training data set X and 1/4 as the test data set TE. title_keywords serves as input data and cat as output data; that is, the title_keywords in the training data set X are used as training input data and the cat in X as training output data, and the text classification model is trained with them; the title_keywords in the test data set TE are used as test input data and the cat in TE as test output data, and the text classification model obtained after training is tested with them.
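A minimal sketch of the data set preparation step, under the assumption that each processed sample is a (cat, title_keywords) pair as described above; the function and variable names are illustrative, not from the patent:

```python
import random

def split_dataset(samples, train_ratio=0.75, seed=42):
    """Shuffle the processed data set A, then split it 3:1 into X and TE."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical processed data set A of 100 (cat, title_keywords) samples.
data_a = [("news", ["sports", "cup"])] * 75 + [("shop", ["discount", "sale"])] * 25
X, TE = split_dataset(data_a)
train_inputs = [keywords for _, keywords in X]   # title_keywords as training input data
train_outputs = [cat for cat, _ in X]            # cat as training output data
```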
S1002, inputting the training data set X into the text classification model by using a Boosting integration method for training, so as to obtain a trained text classification model H(x). Fig. 2 is a schematic diagram of the text classification model used in this embodiment: the text classification model is a classification model obtained by iteratively training a plurality of basic classifiers H_1, H_2, …, H_n and then integrating the basic classifiers obtained after each round of iterative training; n is the total number of basic classifiers, i.e., the number of basic classifiers adopted in the text classification model.
Specifically, as for the step S1002, it preferably includes:
s10021, training a current basic classifier by using a current sub-sample set, and calculating the error rate of the basic classifier obtained after training;
the sub-sample set is obtained by distributing corresponding weights to the samples contained in the training data set X; preferably, after corresponding weights are distributed to the samples contained in X, N_1 samples are randomly extracted from X as a sub-sample set, where each sample carries its corresponding weight. The initial weight of each sample is 1/m, i.e., in the first iteration, each sample in X is assigned the weight 1/m, and N_1 samples randomly extracted from X at this point form the first sub-sample set; m is the total number of samples contained in the training data set X;
specifically, in the training data set X, each sample has a corresponding weight; the set of weight vectors is referred to as D, specifically D_1, D_2, …, D_T, where D_t (t = 1,2,3,…,T) is the weight vector of the t-th round; for example, D_1 gives the weights of the samples in the 1st round of iteration;
in the 1st iteration, there are m samples in the training data set X; after the weight of each sample is initialized to 1/m, the first sub-sample set S_1 is obtained by random extraction. That is, in the 1st iteration, the i-th sample x_i of S_1 has the corresponding weight D_1(x_i) = 1/m;
Then, the first sub-sample set S_1 is input to the first basic classifier H_1 for training, i.e., S_1 is used to train H_1; after one round of training, the error rate of the trained H_1 is calculated. In general, in the t-th iteration, the current t-th sub-sample set S_t is input to the current j-th (j = 1,2,3,…,n) basic classifier H_j for training, i.e., S_t is used to train H_j; after one round of training, the error rate of the trained H_j is calculated;
since the text classification model contains n basic classifiers and the number of iteration rounds T is usually greater than n, when the (n+1)-th round of iterative training is performed, training returns to the 1st basic classifier, which is equivalent to the (n+1)-th round of training; that is, in the t-th round of iteration, the basic classifier H_j undergoing iterative training is referred to as the basic classifier of the t-th round;
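The round-to-classifier correspondence described above (returning to the 1st basic classifier once round n has passed) can be sketched as a simple modular mapping; the helper name is ours, not the patent's:

```python
def classifier_index(t: int, n: int) -> int:
    """Map iteration round t (1-based) to the index j of the basic classifier trained in that round."""
    return (t - 1) % n + 1

# With n = 3 basic classifiers: rounds 1..3 train classifiers 1..3,
# and round 4 returns to classifier 1.
```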
in a preferred embodiment, the error rate is calculated as follows:

ε(t) = Σ_{p=1}^{N} D_t(x_p)

In the formula, ε(t) is the error rate corresponding to the basic classifier of the t-th round after training; if t = n + 1, then ε(t) is the error rate corresponding to the basic classifier of the (n+1)-th round after training;
D_t(x_p) is the weight of the p-th misclassified sample x_p in the sub-sample set, i.e., during the training of the t-th iteration, the weight of the p-th sample x_p in the current sub-sample set whose classification result is wrong; p ∈ [1,2,…,N], where N is the number of misclassified samples in the sub-sample set;
t ∈ [1,2,…,T], where T is the number of iteration rounds;
h_t is the basic classifier of the t-th round;
h_t(x_p) is the classification result prediction value, i.e., the classification result that h_t outputs for sample x_p during training;
y_p is the true value of the classification result;
[h_t(x_p) ≠ y_p] = −1;
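Since ε(t) sums the weights of the misclassified samples, the calculation above reduces to a weighted count of prediction mismatches. A minimal sketch, with illustrative names:

```python
def error_rate(weights, predictions, truths):
    """epsilon(t): sum of the weights of the samples the classifier got wrong."""
    return sum(w for w, h, y in zip(weights, predictions, truths) if h != y)

# Four samples of weight 0.25 each, one misclassified -> error rate 0.25.
eps = error_rate([0.25] * 4, ["a", "b", "a", "a"], ["a", "b", "b", "a"])
```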
s10022, when the calculated error rate is converged to the threshold value range, ending the training, and executing the step S10026;
s10023, when the calculated error rate is not converged in the threshold range, executing the step S10024;
s10024, updating the weights of the samples contained in the current sub-sample set according to the calculated error rate, so as to increase the weights of the misclassified samples;
specifically, the weight of each sample contained in the current sub-sample set is updated according to the calculated error rate, increasing the weights of the samples whose classification results are wrong;
in a preferred embodiment, the weights of the samples included in the current sub-sample set are updated according to the following weight update formula:

D_{t+1}(x_i) = ( D_t(x_i) / Z_t ) · exp( −α_t · [h_t(x_i) = y_i] )

wherein

α_t = ln( (1 − ε(t)) / ε(t) ) + ln(k − 1)

α_t is the weight corresponding to the basic classifier of the t-th round after training;
k is the number of categories output by the text classification model (i.e., the number of classification categories); for example, if the text classification model can distinguish 3 categories, then k = 3;
Z_t is a normalization factor chosen so that the updated weights D_{t+1}(x_i) sum to 1;
D_t(x_i) is the weight of the i-th sample x_i in the sub-sample set at the t-th iteration, i.e., the weight of the i-th sample x_i in the current sub-sample set during the training of the t-th iteration; i ∈ [1,2,…,N_1], where N_1 is the total number of samples in the sub-sample set;
D_{t+1}(x_i) is the weight of the i-th sample x_i in the sub-sample set at the (t+1)-th iteration, i.e., the updated weight of x_i;
h_t(x_i) is the classification result prediction value, i.e., the classification result that h_t outputs for sample x_i during training;
y_i is the true value of the classification result.

It can be seen from the above formula that, for a sample with a correct classification result, [h_t(x_i) = y_i] = 1, so its weight in the next iteration, i.e., the (t+1)-th iteration, is D_{t+1}(x_i) = D_t(x_i) · exp(−α_t) / Z_t, which decreases; for a sample with a wrong classification result, [h_t(x_i) ≠ y_i] = −1, i.e., the indicator takes the value −1, so its weight in the (t+1)-th iteration is D_{t+1}(x_i) = D_t(x_i) · exp(α_t) / Z_t, which increases.
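The update above can be sketched as follows. The exact expression for α_t behind the original figure is not fully recoverable from the text; the version with the ln(k − 1) term below follows the standard multi-class (SAMME-style) Boosting convention and is our assumption, as are all names:

```python
import math

def update_weights(weights, predictions, truths, k):
    """One round of the Boosting weight update: shrink correct samples, grow wrong ones."""
    # Error rate: sum of the weights of the misclassified samples.
    error = sum(w for w, h, y in zip(weights, predictions, truths) if h != y)
    # Assumed SAMME-style classifier weight alpha_t for k output categories.
    alpha = math.log((1 - error) / error) + math.log(k - 1)
    unnormalized = [w * math.exp(-alpha if h == y else alpha)
                    for w, h, y in zip(weights, predictions, truths)]
    z = sum(unnormalized)                      # normalization factor Z_t
    return [w / z for w in unnormalized], alpha
```

After the update the weights again sum to 1, and every misclassified sample carries more weight than any correctly classified one.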
S10025, after the updated weights are distributed to the samples in the training data set, obtaining the next sub-sample set, inputting the next sub-sample set into the next basic classifier, and returning to execute step S10021;
specifically, the updated sample weights are distributed to the corresponding samples in the training data set, replacing the current weights of those samples in X; after the weights are updated, N_1 samples are again randomly extracted from X to obtain the next sub-sample set, which is input into the next basic classifier; the next basic classifier is then trained with this sub-sample set, realizing the next round of iterative training;
s10026, integrating a plurality of basic classifiers to obtain a trained text classification model;
specifically, after T rounds of iteration, when the error rate converges to within the threshold range, the model training process ends. At this time, the multiple rounds of iterative training on the n basic classifiers are complete, and the basic classifiers obtained by each round of iterative training, i.e., the basic classifiers of rounds 1, 2, 3, …, T, are integrated to obtain the final required text classification model H(x), using the following integration formula:

H(x) = argmax_{y∈Y} Σ_{t=1}^{T} α_t · [h_t(x) = y]

wherein y ∈ Y indicates that the prediction result belongs to the label set Y.
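The integration formula above is a weighted vote: each basic classifier contributes its weight α_t to the category it predicts, and the category with the highest total score is output. A minimal sketch, with illustrative names and toy classifiers:

```python
def ensemble_predict(classifiers, alphas, x, labels):
    """H(x) = argmax over y of sum_t alpha_t * [h_t(x) == y]."""
    scores = {y: 0.0 for y in labels}
    for h, alpha in zip(classifiers, alphas):
        scores[h(x)] += alpha          # the classifier's weight goes to its predicted label
    return max(scores, key=scores.get)

# Two classifiers vote "news" (weights 1.0 and 0.5), one votes "shop" (weight 2.0):
# "shop" wins with a total of 2.0 versus 1.5.
h1, h2, h3 = (lambda x: "news"), (lambda x: "news"), (lambda x: "shop")
result = ensemble_predict([h1, h2, h3], [1.0, 0.5, 2.0], "toy input", ["news", "shop"])
```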
For the n basic classifiers used in the text classification model, they may be the same or different basic classifiers. To further improve classification accuracy, it is preferable to select basic classifiers whose error rate falls within the range [0, 1/k], where k is the number of categories output by the text classification model (i.e., the number of classification categories), as the basic classifiers used in the text classification model of this embodiment. For example, the classification effect of the two basic classifiers SVM and TextGrocery (i.e., their error rate range) satisfies the condition of lying within [0, 1/k], so SVM and/or TextGrocery can be selected as basic classifiers. Experimental results show that the integrated classifier obtained by integrating SVM and/or TextGrocery weak classifiers has a better classification effect and higher accuracy than SVM or TextGrocery alone, which shows that the classification effect of the integrated classifier is better than that of a single basic classifier. Therefore, on the basis of a single TextGrocery, the classification scheme of the invention can improve the accuracy of text classification through the Boosting integration method.
In a preferred embodiment, the constructing step S100 further includes:
s1003, performing ten-fold cross validation on the text classification model H(x) trained in step S1002 by using the training data set X;
specifically, after the training data set X is split into ten parts, ten-fold cross validation is performed on the text classification model H(x) using these ten parts of data, so as to validate the accuracy and stability of the model;
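The ten-fold procedure above can be sketched as follows: the training data is split into ten folds, and each fold serves once as validation data while the model is trained on the other nine. Names are illustrative:

```python
def k_fold_splits(samples, k=10):
    """Yield (train, validation) pairs for k-fold cross validation."""
    folds = [samples[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, validation

# 100 samples -> 10 splits, each with 90 training and 10 validation samples.
splits = list(k_fold_splits(list(range(100)), k=10))
```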
s1004, when the model H(x) passes the verification, testing the verified text classification model by using the test data set TE;
specifically, the test data set TE is input into the verified text classification model, and the accuracy, recall rate and F1 score of the text classification model are calculated and compared with those of the plurality of basic classifiers described in step S10026 above, so as to test the accuracy of the model and ensure the accuracy of the model's classification.
The webpage classification scheme can be applied to an advertisement bidding system for site classification of advertisement pages, can also be applied to other systems for site classification of webpages in other fields (such as games, shopping and the like), and has wide application range and high compatibility.
As shown in fig. 3, an embodiment of the present invention further provides a web page classification system, which includes:
an obtaining module 201, configured to obtain webpage link information;
the processing module 202 is configured to input the obtained webpage link information into a text classification model for classification processing, and output a site classification result corresponding to the webpage link information;
the text classification model is trained based on a Boosting integration method.
In a preferred embodiment, the system further comprises a construction module for constructing the text classification model.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
As shown in fig. 4, an embodiment of the present invention further provides a terminal, including:
at least one processor 301;
at least one memory 302 for storing at least one program;
when the at least one program is executed by the at least one processor 301, the at least one processor 301 is caused to implement the steps of the web page classification method described in the above method embodiments.
The contents in the foregoing method embodiments are all applicable to this terminal embodiment, the functions specifically implemented by this terminal embodiment are the same as those in the foregoing method embodiments, and the beneficial effects achieved by this terminal embodiment are also the same as those achieved by the foregoing method embodiments.
Embodiments of the present invention further provide a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the steps of a method for classifying web pages as described in the above method embodiments.
The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A webpage classification method is characterized by comprising the following steps:
acquiring webpage link information;
inputting the obtained webpage link information into a text classification model for classification processing, and then outputting a site classification result corresponding to the webpage link information, wherein the site classification result is used for a crawler of a corresponding category to perform information crawling processing on the webpage link information of the corresponding category;
the text classification model is trained based on a Boosting integration method, and the webpage link information comprises a website title keyword list corresponding to the webpage link;
the text classification model is constructed by the following construction steps:
acquiring a training data set;
inputting a training data set into a text classification model by using a Boosting integration method for training so as to obtain a trained text classification model;
the method for acquiring the text classification model by using the Boosting integration method comprises the following steps of inputting a training data set into the text classification model for training to obtain the trained text classification model:
s1, training the current basic classifier by using the current sub-sample set, and calculating the error rate of the basic classifier obtained after training;
s2, when the calculated error rate is converged to the threshold value range, ending the training and executing the step S6;
s3, when the calculated error rate is not converged in the threshold value range, executing the step S4;
s4, updating the weights of the samples contained in the current sub-sample set according to the calculated error rate, so as to increase the weight of the sample with the classification result as error;
s5, distributing the updated weight to the sample in the training data set, obtaining the next sub-sample set, inputting the next sub-sample set to the next basic classifier, and returning to execute the step S1;
s6, integrating a plurality of basic classifiers to obtain a trained text classification model;
the basic classifier adopted in the text classification model is a basic classifier with the error rate range of [0,1/k ], and k is the number of classes output by the text classification model;
the updating of the weights of the samples contained in the current sub-sample set is performed according to the following weight update formula:

D_{t+1}(x_i) = ( D_t(x_i) / Z_t ) · exp( −α_t · [h_t(x_i) = y_i] ), with α_t = ln( (1 − ε(t)) / ε(t) ) + ln(k − 1)

wherein Z_t is a normalization factor chosen so that the updated weights D_{t+1}(x_i) sum to 1; D_t(x_i) is the weight of the i-th sample x_i in the sub-sample set at the t-th iteration; i ∈ [1,2,…,N_1], where N_1 is the total number of samples in the sub-sample set; h_t(x_i) is the classification result prediction value; y_i is the true value of the classification result.
2. The method for classifying web pages according to claim 1, wherein the constructing step further comprises:
performing ten-fold cross validation on the trained text classification model by using a training data set;
and after the model passes the verification, testing the verified text classification model by using the test data set.
3. The method for classifying web pages according to claim 1, wherein the error rate is calculated as follows:

ε(t) = Σ_{p=1}^{N} D_t(x_p)

wherein ε(t) is the error rate corresponding to the basic classifier of the t-th round after training; D_t(x_p) is the weight of the p-th misclassified sample x_p in the sub-sample set at the t-th iteration; p ∈ [1,2,…,N], where N is the number of samples in the sub-sample set whose classification results are wrong; t ∈ [1,2,…,T], where T is the number of iteration rounds; h_t is the basic classifier of the t-th round; h_t(x_p) is the classification result prediction value; y_p is the true value of the classification result; [h_t(x_p) ≠ y_p] = −1.
4. A system for classifying web pages, comprising:
the acquisition module is used for acquiring webpage link information;
the processing module is used for inputting the acquired webpage link information into a text classification model for classification processing and then outputting a site classification result corresponding to the webpage link information, wherein the site classification result is used for a crawler of a corresponding category to perform information crawling processing on the webpage link information of the corresponding category;
the text classification model is trained based on a Boosting integration method, and the webpage link information comprises a website title keyword list corresponding to the webpage link;
the text classification model is constructed by the following construction steps:
acquiring a training data set;
inputting a training data set into a text classification model by using a Boosting integration method for training so as to obtain a trained text classification model;
wherein inputting the training data set into the text classification model for training by using the Boosting integration method, to obtain the trained text classification model, comprises the following steps:
S1, training the current base classifier with the current sub-sample set, and calculating the error rate of the base classifier obtained after training;
S2, when the calculated error rate converges into the threshold range, ending the training and executing step S6;
S3, when the calculated error rate does not converge into the threshold range, executing step S4;
S4, updating the weights of the samples contained in the current sub-sample set according to the calculated error rate, so as to increase the weights of the misclassified samples;
S5, assigning the updated weights to the samples in the training data set to obtain the next sub-sample set, inputting the next sub-sample set into the next base classifier, and returning to step S1;
S6, integrating the plurality of base classifiers to obtain the trained text classification model;
wherein the base classifier adopted in the text classification model is a base classifier whose error rate lies in the range [0, 1/k], k being the number of classes output by the text classification model;
and wherein, in the updating of the weights of the samples contained in the current sub-sample set, the weight updating formula is as follows:

D_{t+1}(x_i) = D_t(x_i)·exp(−α_t·y_i·h_t(x_i)) / Z_t

wherein α_t = (1/2)·ln((1−ε(t))/ε(t)); Z_t is the normalization factor that makes D_t(x_i)·exp(−α_t·y_i·h_t(x_i)) a distribution over the sub-sample set; D_t(x_i) is the weight of the i-th sample x_i in the sub-sample set at the t-th iteration; i ∈ [1, 2, …, N_1], where N_1 is the total number of samples in the sub-sample set; h_t(x_i) is the predicted classification result; and y_i is the true classification result.
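Steps S1 to S6 can be sketched as a compact AdaBoost-style training loop. The following is illustrative only, not the claimed method: the decision-stump base learner, the binary {−1, +1} setting (the claims cover k classes), the convergence threshold, and all function names are assumptions:

```python
import math

def stump_learner(samples, labels, weights):
    # Toy base learner (an assumption, not from the patent): a 1-D decision
    # stump chosen to minimize the weighted error under the current weights.
    best = None
    for thr in sorted(set(samples)):
        for sign in (1, -1):
            err = sum(w for w, x, y in zip(weights, samples, labels)
                      if (sign if x >= thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x >= thr else -sign

def boost_train(samples, labels, base_learner, max_rounds=10, threshold=0.1):
    # Sketch of S1-S6: train base classifiers on reweighted samples until the
    # weighted error rate falls inside the threshold range, then integrate.
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []                                    # (alpha_t, h_t) pairs
    for _ in range(max_rounds):
        h = base_learner(samples, labels, weights)   # S1: train current classifier
        preds = [h(x) for x in samples]
        eps = sum(w for w, y, p in zip(weights, labels, preds) if p != y)
        if eps == 0:                                 # perfect round: keep it and stop
            ensemble.append((1.0, h))
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        if eps <= threshold:                         # S2: converged, go to S6
            break
        # S3/S4: raise the weights of misclassified samples
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, preds)]
        z = sum(weights)                             # normalization factor Z_t
        weights = [w / z for w in weights]           # S5: weights for the next round
    def classify(x):                                 # S6: weighted vote of all rounds
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return classify

# Toy 1-D data: negatives below 3, positives at 3 and above.
model = boost_train([1, 2, 3, 4], [-1, -1, 1, 1], stump_learner)
```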
5. A terminal, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of classifying web pages as claimed in any one of claims 1 to 3.
6. A storage medium having stored therein processor-executable instructions which, when executed by a processor, cause the processor to perform the method of classifying web pages as claimed in any one of claims 1 to 3.
CN201810465784.3A 2018-05-16 2018-05-16 Webpage classification method, system, terminal and storage medium Active CN108920492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810465784.3A CN108920492B (en) 2018-05-16 2018-05-16 Webpage classification method, system, terminal and storage medium


Publications (2)

Publication Number Publication Date
CN108920492A CN108920492A (en) 2018-11-30
CN108920492B true CN108920492B (en) 2021-04-09

Family

ID=64402649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465784.3A Active CN108920492B (en) 2018-05-16 2018-05-16 Webpage classification method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN108920492B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353803B (en) * 2018-12-24 2024-04-05 三六零科技集团有限公司 Advertiser classification method and device and computing equipment
CN110825998A (en) * 2019-08-09 2020-02-21 国家计算机网络与信息安全管理中心 Website identification method and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447161A (en) * 2015-11-26 2016-03-30 广东工业大学 Data feature based intelligent information classification method
CN107560850B (en) * 2017-08-26 2019-04-12 中南大学 Shafting fault recognition method based on Threshold Denoising and AdaBoost
CN107360200A (en) * 2017-09-20 2017-11-17 广东工业大学 A kind of fishing detection method based on classification confidence and web site features
CN107909396A (en) * 2017-11-11 2018-04-13 霍尔果斯普力网络科技有限公司 The anti-cheat monitoring method that a kind of Internet advertising is launched

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Web Crawling; Sun Shiyou et al.; "Big Map: Theory and Practice of Surveying, Mapping and Geographic Information Big Data"; China Environmental Science Press; Dec. 31, 2017; pp. 78-84 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510665 Room 401, No.3, East Tangdong Road, Tianhe District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 510665 b-420, ocean Creative Park, No.5, Tangdong East Road, Tianhe District, Guangzhou City, Guangdong Province

Applicant before: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information

Address after: 510665 Room 401, No.3, East Tangdong Road, Tianhe District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 510665 Room 401, 3 Tangdong East Road, Tianhe District, Guangzhou City, Guangdong Province

Applicant before: GUANGZHOU SUNTENG INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant