CN106445994A - Mixed algorithm-based web page classification method and apparatus - Google Patents

Info

Publication number
CN106445994A
Authority
CN
China
Prior art keywords
classification
sorted
webpage
characteristic vector
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610557554.0A
Other languages
Chinese (zh)
Inventor
邹立斌
李青海
简宋全
侯大勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jing Dian Computing Machine Science And Technology Ltd
Original Assignee
Guangzhou Jing Dian Computing Machine Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jing Dian Computing Machine Science And Technology Ltd filed Critical Guangzhou Jing Dian Computing Machine Science And Technology Ltd
Priority to CN201610557554.0A priority Critical patent/CN106445994A/en
Publication of CN106445994A publication Critical patent/CN106445994A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web page classification method and apparatus based on a hybrid algorithm. The method comprises the steps of: a, processing web pages to be classified to obtain web page data; b, processing the web page data and converting the feature vectors into numerical form using a vector space model; c, building an SVM classification model and classifying the web pages to be classified with an SVM classifier; d, passing the feature vectors output by the SVM classifier that meet a classification condition to a Naive Bayes classifier for classification; and e, classifying the feature vectors of the web pages to be classified with the Naive Bayes classifier. The apparatus comprises a web page processing unit, a data conversion unit, an SVM classification unit, a data transmission unit and a Bayes classification unit. An SVM is used first for binary classification and a Naive Bayes method is then used for multi-class classification, so that classification is faster and more accurate.

Description

A web page classification method and apparatus based on a hybrid algorithm
Technical field
The present invention relates to the technical field of web page classification, and in particular to a web page classification method and apparatus based on a hybrid algorithm.
Background
With the rapid development of the Internet and related technologies, a massive and disorderly body of network information resources has appeared. How to extract knowledge from this mass of unstructured data and find the content that people are interested in has become an urgent problem. The appearance of search engines such as Google, Baidu and Yahoo has begun to ease this problem, but these search tools are aimed at all users: they generally return one general-purpose result to every user, and so cannot satisfy search requests tied to a particular period, a particular field or a particular purpose. What people are really interested in is often drowned in the vast ocean of information, so how to organise and process this mass of information effectively, and how to better distribute and use the required network information resources, have become urgent problems.
The support vector machine (SVM) is a machine learning method based on statistical learning theory, with structural risk minimisation as its theoretical basis. Its main idea, for two-class problems, is to find a hyperplane in a high-dimensional space that separates the two classes so as to guarantee the minimum misclassification rate. Its drawback is that SVM training takes a long time when classifying massive data.
Naive Bayes is a class of algorithms that classify using knowledge of probability and statistics, but when used alone its accuracy is not high enough.
In view of the above defects, the inventors arrived at the present invention after long research and practice.
Summary of the invention
To solve the above technical defects, the technical solution adopted by the present invention is to provide a web page classification method based on a hybrid algorithm, comprising:
Step a: searching for web pages to be classified, and processing the web pages to be classified to obtain web page data;
Step b: processing the web page data, converting the web page data into a text representation using a vector space model, calculating the weights of the terms, and converting the feature vectors of the web pages to be classified into numerical form;
Step c: using the feature vectors in numerical form as training data, building an SVM classification model, and classifying the feature vectors of the web pages to be classified with the SVM classifier;
Step d: passing the feature vectors output by the SVM classifier that meet the classification condition to a Naive Bayes classifier for classification;
Step e: classifying the feature vectors of the web pages to be classified with the Naive Bayes classifier.
Preferably, step c comprises:
Step c1: using the feature vectors in numerical form as training data, determining a classification formula, and building the SVM classification model;
Step c2: evaluating the classification formula of the SVM classifier on the feature vectors of the web pages to be classified, and checking whether each feature vector makes the classification formula hold, thereby dividing the feature vectors into two classes.
Preferably, step e comprises:
Step e1: selecting a part of the feature vectors output by the SVM classifier as training samples, and determining the feature attributes of each feature vector in the training samples and the category of the web page to be classified that corresponds to each feature vector;
Step e2: counting, over the training samples, the frequency with which each category of web page occurs and the conditional probability estimate of each feature attribute under each category;
Step e3: analysing the feature attributes of a web page to be classified output by the SVM classifier, and calculating the category probability that the web page belongs to each category;
Step e4: determining the largest of the category probabilities of the web page to be classified; the category corresponding to that probability is the category of the web page.
Preferably, in step e3, the category probability of the web page to be classified is calculated as:
$$P(y_i \mid x) = C \cdot P(y_i) \prod_{j=1}^{m} P(a_j \mid y_i)$$
where $x$ is the feature vector of the web page to be classified, $i$ is the index of the category, $j$ is the index of the feature attribute, $m$ is the total number of feature attributes, $C$ is a constant, $y_i$ is the i-th category, $a_j$ is the j-th feature attribute, $P(y_i)$ is the frequency with which the i-th category occurs, $P(a_j \mid y_i)$ is the conditional probability estimate of the j-th feature attribute in the i-th category, and $P(y_i \mid x)$ is the category probability of the web page to be classified.
Preferably, the web page data is semi-structured data.
Preferably, in step b, the weight of a term is calculated as:
$$\omega_i(d) = tf_i(d) \times \log\frac{N}{n_i}$$
where $\omega_i(d)$ is the weight of the i-th term in text $d$, $tf_i(d)$ is the frequency with which the i-th term occurs in text $d$, $N$ is the number of all texts, and $n_i$ is the number of texts in which the i-th term occurs.
Preferably, in step c, the kernel function of the SVM classification model is a radial basis function (RBF) kernel.
The invention further provides a web page classification apparatus based on a hybrid algorithm, corresponding to the web page classification method described above, comprising:
a web page processing unit, which searches for web pages to be classified and processes them to obtain web page data;
a data conversion unit, which processes the web page data, converts the web page data into a text representation using a vector space model, calculates the weights of the terms, and converts the feature vectors of the web pages to be classified into numerical form;
an SVM classification unit, which uses the feature vectors in numerical form as training data, builds an SVM classification model, and classifies the feature vectors of the web pages to be classified with the SVM classifier;
a data transmission unit, which passes the feature vectors output by the SVM classifier that meet the classification condition to a Naive Bayes classifier for classification;
a Bayes classification unit, which classifies the feature vectors of the web pages to be classified with the Naive Bayes classifier.
Preferably, the SVM classification unit comprises:
a model building module, which uses the feature vectors in numerical form as training data, determines a classification formula, and builds the SVM classification model;
a model classification module, which evaluates the classification formula of the SVM classifier on the feature vectors of the web pages to be classified and checks whether each feature vector makes the classification formula hold, thereby dividing the feature vectors into two classes.
Preferably, the Bayes classification unit comprises:
a feature determination module, which selects a part of the feature vectors output by the SVM classifier as training samples, and determines the feature attributes of each feature vector in the training samples and the category of the web page to be classified that corresponds to each feature vector;
a probability statistics module, which counts, over the training samples, the frequency with which each category of web page occurs and the conditional probability estimate of each feature attribute under each category;
a probability calculation module, which analyses the feature attributes of a web page to be classified output by the SVM classifier and calculates the category probability that the web page belongs to each category;
a category determination module, which determines the largest of the category probabilities of the web page to be classified; the category corresponding to that probability is the category of the web page.
Compared with the prior art, the beneficial effects of the present invention are as follows. The web page classification method and apparatus based on a hybrid algorithm combine an SVM classification model with Naive Bayes, which supports incremental training: binary classification is performed first with the SVM, and multi-class classification is then performed with the Naive Bayes method, so classification is faster and more accurate. After new data is included, the method can adjust and correct its judgments automatically, improving accuracy without retraining. It has the advantages of improving classification efficiency and accuracy and of reducing algorithmic complexity. This hybrid classification method can serve user needs in web page classification and real-time marketing quickly and accurately.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below.
Fig. 1 is a flow chart of the web page classification method based on a hybrid algorithm of the present invention;
Fig. 2 is a flow chart of step c of the web page classification method based on a hybrid algorithm of the present invention;
Fig. 3 is a flow chart of step e of the web page classification method based on a hybrid algorithm of the present invention;
Fig. 4 is a structural schematic diagram of the web page classification apparatus based on a hybrid algorithm of the present invention;
Fig. 5 is a structural schematic diagram of the SVM classification unit of the web page classification apparatus based on a hybrid algorithm of the present invention;
Fig. 6 is a structural schematic diagram of the Bayes classification unit of the web page classification apparatus based on a hybrid algorithm of the present invention.
Detailed description
The above and other technical features and advantages of the present invention are described in more detail below with reference to the drawings.
Embodiment 1
As shown in Fig. 1, which is a flow chart of the web page classification method based on a hybrid algorithm of the present invention, the method comprises:
Step a: searching for web pages to be classified, and processing the web pages to be classified to obtain web page data.
Obtaining web page data from a web page to be classified means fetching data inside the page through the page's URL, which can be done with HttpClient. This information may include page views (PV), number of visits, unique visitors (UV), number of new visitors, proportion of new visitors, IP, bounce rate, average visit duration, average number of pages visited, number of conversions, conversion rate, and so on.
The web page data is semi-structured data, usually in HTML format. To represent a Chinese web page, related pages are retrieved with an information collection system, and the title and the body text in the HTML file are processed separately (the title also being treated as part of the text), so that the representation of the web page can be converted into the representation of a text. Compared with ordinary plain text, semi-structured data has a certain structure, but it is not data of a relational database with a strict theoretical model. XML, for example, is well suited to storing semi-structured data: different categories of information are simply saved in different XML nodes.
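The conversion from an HTML page to a text can be sketched as follows. This is a minimal illustration using Python's standard `html.parser` module rather than the HttpClient mentioned above; the class name `TitleAndBodyText` and the function `page_to_text` are hypothetical names introduced here, and fetching the page from its URL is left out.

```python
from html.parser import HTMLParser


class TitleAndBodyText(HTMLParser):
    """Collect the <title> text and the visible <body> text of an HTML page."""

    SKIPPED = {"script", "style"}  # content that is not readable text

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._stack = []  # names of the currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIPPED for t in self._stack):
            return
        if "title" in self._stack:
            self.title_parts.append(text)
        elif "body" in self._stack:
            self.body_parts.append(text)


def page_to_text(html):
    """Return title and body text as one string (title counted as text)."""
    parser = TitleAndBodyText()
    parser.feed(html)
    parser.close()
    title = " ".join(parser.title_parts)
    body = " ".join(parser.body_parts)
    return (title + " " + body).strip()
```

The title and body are gathered separately and then joined, mirroring the statement that the title is also treated as part of the text. Word segmentation of the resulting Chinese text is a separate step that this sketch does not cover.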
Step b: processing the web page data, converting the web page data into a text representation using a vector space model, calculating the weights of the terms, and converting the feature vectors of the web pages to be classified into numerical form.
In the vector space model, the feature vector representing the features of a web page to be classified is composed of weighted terms; that is, each element of the feature vector is a term.
In the vector space model, the text space is regarded as a vector space spanned by a set of orthogonal term vectors. Assuming the total number of features of all texts is n, an n-dimensional vector space is formed, in which each text is represented as an n-dimensional feature vector:
$$V(d) = (t_1, \omega_1(d);\ t_2, \omega_2(d);\ \dots;\ t_n, \omega_n(d))$$
where $V(d)$ is the feature vector of text $d$; $t_1$, $t_2$, $t_n$ are the 1st, 2nd and n-th terms (vectors); and $\omega_1(d)$, $\omega_2(d)$, $\omega_n(d)$ are the weights of $t_1$, $t_2$, $t_n$ in text $d$.
The weight of a term is calculated as:
$$\omega_i(d) = tf_i(d) \times \log\frac{N}{n_i}$$
where $\omega_i(d)$ is the weight of the i-th term in text $d$, $tf_i(d)$ is the frequency with which the i-th term occurs in text $d$, $N$ is the number of all texts, and $n_i$ is the number of texts in which the i-th term occurs.
This formula involves no complicated mathematical derivation; it is simple and fast to compute, easy to understand, and its results match practice. Because the weights can be calculated simply, quickly and accurately, each text can be quickly represented as an n-dimensional feature vector.
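As an illustration of how the term weights turn each text into an n-dimensional numeric vector, the sketch below assumes the common TF-IDF form, term frequency times log(N / n_i), which matches the quantities defined above (term frequency in the text, N, n_i); the exact formula used in the source may differ. The function name `tfidf_vectors` is hypothetical, and the texts are assumed to be already split into terms.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_texts):
    """Return (vocabulary, vectors) with weight = tf_i(d) * log(N / n_i).

    The weighting form is an assumption; see the lead-in above.
    """
    n_texts = len(tokenized_texts)  # N: number of all texts
    doc_freq = Counter()            # n_i: number of texts containing term i
    for tokens in tokenized_texts:
        doc_freq.update(set(tokens))
    vocabulary = sorted(doc_freq)   # fixed term order = vector dimensions
    vectors = []
    for tokens in tokenized_texts:
        tf = Counter(tokens)        # frequency of each term in this text
        vectors.append(
            [tf[t] * math.log(n_texts / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors
```

A term that occurs in every text gets weight zero under this form, since log(N / N) = 0, so only terms that distinguish texts contribute to the vector.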
Step c: using the feature vectors in numerical form as training data, building an SVM classification model, and classifying the feature vectors of the web pages to be classified with the SVM classifier.
SVM is the abbreviation of support vector machine.
After the SVM classifier classifies the feature vectors of the web pages to be classified, the feature vectors are divided into two classes: one is the samples that meet the classification condition, and the other is the samples outside the categories of interest. For example, the acquired web page data is stored in a database; it cannot be determined directly which of these data meet the classification condition, so they need to be filtered with the SVM.
Step d: passing the feature vectors output by the SVM classifier that meet the classification condition to the Naive Bayes classifier for classification.
The samples that meet the classification condition are the samples that are needed. The classification condition is determined according to the actual situation, with specific thresholds set as required, for example the number of unique visitors (UV) exceeding a given value, or the average number of pages visited reaching a given value.
Step e: classifying the feature vectors of the web pages to be classified with the Naive Bayes classifier.
In this way, the SVM classification model is combined with Naive Bayes, which supports incremental training: binary classification is performed first with the SVM, and multi-class classification is then performed with the Naive Bayes method, so classification is faster and more accurate. After new data is included, the method can adjust and correct its judgments automatically, improving accuracy without retraining. It has the advantages of improving classification efficiency and accuracy and of reducing algorithmic complexity. This hybrid classification method can serve user needs in web page classification and real-time marketing quickly and accurately.
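The data flow of steps c to e can be sketched as a two-stage function. This is only an outline under stated assumptions: `svm_accepts` and `bayes_label` are hypothetical callables standing in for a trained SVM classifier and a trained Naive Bayes classifier, and returning `None` for rejected vectors is a choice made here for illustration.

```python
def classify_pages(feature_vectors, svm_accepts, bayes_label):
    """Stage 1: binary SVM filter; stage 2: Naive Bayes on accepted vectors.

    svm_accepts(x) -> bool and bayes_label(x) -> category are hypothetical
    stand-ins for the two trained classifiers.
    """
    results = []
    for x in feature_vectors:
        if svm_accepts(x):
            # Only vectors meeting the classification condition reach stage 2.
            results.append(bayes_label(x))
        else:
            results.append(None)  # filtered out by the binary stage
    return results
```

Because the second stage only sees the vectors that pass the filter, the multi-class classifier works on a smaller set than the full collection of pages.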
Embodiment 2
The web page classification method based on a hybrid algorithm is as described above; this embodiment differs in that, in step c, the kernel function of the SVM classification model is a radial basis function (RBF) kernel. Because web page categories are numerous, the RBF kernel is used: it is simple and easy to implement and can speed up the processing of web page classification, and thus speed up both building the SVM classification model and classifying web pages with it.
The distance between vectors in the radial basis function is calculated as:
$$D = \sqrt{\sum_{i} \left(\omega_i(d_m) - \omega_i(d_n)\right)^2}$$
where $D$ is the distance between the vectors, $\omega_i(d_m)$ is the component of vector $d_m$ in the i-th dimension, and $\omega_i(d_n)$ is the component of vector $d_n$ in the i-th dimension.
This calculation is simple and easy to implement, and can speed up the processing of web page classification, and thus speed up both building the SVM classification model and classifying web pages with it.
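A short sketch of the distance and of an RBF kernel built on it follows. The distance is the Euclidean distance between two weight vectors, which is what the definitions above describe. The kernel form exp(-gamma * D^2) and the parameter `gamma` are assumptions (this is the standard RBF form; the source does not state its kernel expression).

```python
import math


def euclidean_distance(u, v):
    """Distance D between two weight vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rbf_kernel(u, v, gamma=1.0):
    """Standard RBF kernel exp(-gamma * D^2); gamma is an assumed parameter."""
    d = euclidean_distance(u, v)
    return math.exp(-gamma * d * d)
```

The kernel equals 1 for identical vectors and decays towards 0 as the distance grows, with `gamma` controlling how quickly.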
Embodiment 3
The web page classification method based on a hybrid algorithm is as described above; this embodiment differs in that, as shown in Fig. 2, step c comprises:
Step c1: using the feature vectors in numerical form as training data, determining a classification formula, and building the SVM classification model.
The categories of the web pages to be classified are judged, and the feature vectors of the judged pages are used as training data; the classification formula of the SVM classification model is determined from this training data.
Here, the web pages used as training data are a subset of all the web pages to be classified.
The SVM is built as follows:
Let the training samples be $(x_i, y_i)$, $i = 1, \dots, n$, $x_i \in R^d$, where $y_i \in \{-1, +1\}$ is the class label. The general form of a linear discriminant function in d-dimensional space is $g(x) = w \cdot x + b$, and the equation of the separating hyperplane is $w \cdot x + b = 0$. The discriminant function is normalised so that all samples of both classes satisfy $|g(x)| \ge 1$, that is, so that $|g(x)| = 1$ for the samples nearest to the separating hyperplane. The margin is then equal to $2 / \|w\|$, so maximising the margin is equivalent to minimising $\|w\|$ (or $\|w\|^2$).
This finally gives the classification formula:
$$y_i\left[(w \cdot x_i) + b\right] - 1 \ge 0, \quad i = 1, 2, \dots, n$$
Step c2: evaluating the classification formula of the SVM classifier on the feature vectors of the web pages to be classified, and checking whether each feature vector makes the classification formula hold, thereby dividing the feature vectors into two classes.
The feature vectors that make the classification formula hold form one class of samples, and those that do not form the other class.
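The quantities in the linear case can be checked numerically with a few helper functions. This is a sketch only: the weight vector `w` and bias `b` are assumed to be given (no training is performed here), the function names are hypothetical, and `predict` uses the usual sign rule for an unlabelled vector, which is an assumption about how the formula is applied at classification time.

```python
import math


def decision_value(w, b, x):
    """g(x) = w . x + b for a linear discriminant function."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def satisfies_margin(w, b, x, y):
    """True if the labelled sample (x, y) satisfies y * (w . x + b) - 1 >= 0."""
    return y * decision_value(w, b, x) - 1 >= 0


def margin_width(w):
    """Margin 2 / ||w|| of the normalised discriminant function."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


def predict(w, b, x):
    """Assign +1 or -1 by the sign of g(x) (assumed classification rule)."""
    return 1 if decision_value(w, b, x) >= 0 else -1
```

For example, with `w = [1.0, 1.0]` and `b = -3.0`, the sample `[3, 1]` labelled +1 lies exactly on the margin, while `[2, 1.5]` labelled +1 falls inside it and violates the constraint.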
Embodiment 4
The web page classification method based on a hybrid algorithm is as described above; this embodiment differs in that, as shown in Fig. 3, step e comprises:
Step e1: selecting a part of the feature vectors output by the SVM classifier as training samples, and determining the feature attributes of each feature vector in the training samples and the category of the web page to be classified that corresponds to each feature vector.
For example, if a feature vector includes several feature attributes, it can be expressed as $x = \{a_1, \dots, a_m\}$, where each $a$ is a feature attribute of $x$.
The web pages to be classified fall into several categories, which can be expressed as the category set $C = \{y_1, \dots, y_n\}$.
Step e2: counting, over the training samples, the frequency with which each category of web page occurs and the conditional probability estimate of each feature attribute under each category.
The conditional probability estimates of the feature attributes under each category are:
$$P(a_1 \mid y_1), P(a_2 \mid y_1), \dots, P(a_m \mid y_1);\ P(a_1 \mid y_2), P(a_2 \mid y_2), \dots, P(a_m \mid y_2);\ \dots;\ P(a_1 \mid y_n), P(a_2 \mid y_n), \dots, P(a_m \mid y_n)$$
where $y_1, y_2, \dots, y_n$ are the 1st to n-th categories and $a_1, a_2, \dots, a_m$ are the 1st to m-th feature attributes. $P(a_m \mid y_n)$ is the conditional probability estimate of the m-th feature attribute in the n-th category, that is, the probability that the m-th feature attribute occurs given that the n-th category has occurred.
The conditional probability estimates are determined statistically from actual data.
Step e3: analysing the feature attributes of a web page to be classified output by the SVM classifier, and calculating the category probability that the web page belongs to each category.
The category probability of the web page to be classified is calculated as:
$$P(y_i \mid x) = C \cdot P(y_i) \prod_{j=1}^{m} P(a_j \mid y_i)$$
where $x$ is the feature vector of the web page to be classified, $i$ is the index of the category, $j$ is the index of the feature attribute, $m$ is the total number of feature attributes, $C$ is a constant, $y_i$ is the i-th category, $a_j$ is the j-th feature attribute, $P(y_i)$ is the frequency with which the i-th category occurs, $P(a_j \mid y_i)$ is the conditional probability estimate of the j-th feature attribute in the i-th category, and $P(y_i \mid x)$ is the category probability of the web page to be classified.
In this way, the probability that a web page to be classified belongs to each category can be calculated quickly, so that the best category for the page is determined rapidly and the efficiency of the decision is improved; the formula is also simple and convenient to compute, which saves system resources.
Step e4: determining the largest of the category probabilities of the web page to be classified; the category corresponding to that probability is the category of the web page.
That is, if:
$$P(y_k \mid x) = \max\{P(y_1 \mid x), P(y_2 \mid x), \dots, P(y_n \mid x)\}$$
then $x \in y_k$, that is, the category of the web page to be classified is the k-th category.
In this way, classifying with the Naive Bayes model is faster and more accurate. After new data is included, the model can adjust and correct its judgments automatically, improving accuracy without retraining. It has the advantages of improving classification efficiency and accuracy and of reducing algorithmic complexity.
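Steps e1 to e4, together with the incremental adjustment described above, can be sketched with a counting-based classifier. The sketch assumes categorical feature attributes, drops the constant C (it is the same for every category, so it does not change which category is largest), and adds add-one smoothing, which the source does not mention. The class name `CountingNaiveBayes` and its methods are hypothetical.

```python
from collections import Counter, defaultdict


class CountingNaiveBayes:
    """Naive Bayes over categorical attributes, trained purely by counting."""

    def __init__(self):
        self.class_counts = Counter()            # occurrences of each category y_i
        self.attr_counts = defaultdict(Counter)  # (y_i, j) -> counts of values of a_j
        self.attr_values = defaultdict(set)      # j -> distinct values seen for a_j

    def update(self, x, y):
        """Add one labelled sample; can be called at any time (incremental)."""
        self.class_counts[y] += 1
        for j, a in enumerate(x):
            self.attr_counts[(y, j)][a] += 1
            self.attr_values[j].add(a)

    def scores(self, x):
        """Return P(y_i) * prod_j P(a_j | y_i) for every category (step e3)."""
        total = sum(self.class_counts.values())
        result = {}
        for y, n_y in self.class_counts.items():
            score = n_y / total  # P(y_i): frequency of the category
            for j, a in enumerate(x):
                # Add-one smoothing (an assumption) avoids zero probabilities.
                count = self.attr_counts[(y, j)][a]
                score *= (count + 1) / (n_y + len(self.attr_values[j]))
            result[y] = score
        return result

    def classify(self, x):
        """Return the category with the largest score (step e4)."""
        scores = self.scores(x)
        return max(scores, key=scores.get)
```

Because the model is only a set of counts, including new data means calling `update` again; nothing has to be retrained from scratch, which is the property the description relies on.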
Embodiment 5
This embodiment is the web page classification apparatus based on a hybrid algorithm that corresponds to the web page classification method described above. As shown in Fig. 4, which is a structural schematic diagram of the web page classification apparatus based on a hybrid algorithm of the present invention, the apparatus comprises:
Web page processing unit 1, which searches for web pages to be classified and processes them to obtain web page data.
Obtaining web page data from a web page to be classified means fetching data inside the page through the page's URL, which can be done with HttpClient. This information may include page views (PV), number of visits, unique visitors (UV), number of new visitors, proportion of new visitors, IP, bounce rate, average visit duration, average number of pages visited, number of conversions, conversion rate, and so on.
The web page data is semi-structured data, usually in HTML format. To represent a Chinese web page, related pages are retrieved with an information collection system, and the title and the body text in the HTML file are processed separately (the title also being treated as part of the text), so that the representation of the web page can be converted into the representation of a text. Compared with ordinary plain text, semi-structured data has a certain structure, but it is not data of a relational database with a strict theoretical model. XML, for example, is well suited to storing semi-structured data: different categories of information are simply saved in different XML nodes.
Data conversion unit 2, which processes the web page data, converts the web page data into a text representation using a vector space model, calculates the weights of the terms, and converts the feature vectors of the web pages to be classified into numerical form.
In the vector space model, the feature vector representing the features of a web page to be classified is composed of weighted terms; that is, each element of the feature vector is a term.
In the vector space model, the text space is regarded as a vector space spanned by a set of orthogonal term vectors. Assuming the total number of features of all texts is n, an n-dimensional vector space is formed, in which each text is represented as an n-dimensional feature vector:
$$V(d) = (t_1, \omega_1(d);\ t_2, \omega_2(d);\ \dots;\ t_n, \omega_n(d))$$
where $V(d)$ is the feature vector of text $d$; $t_1$, $t_2$, $t_n$ are the 1st, 2nd and n-th terms (vectors); and $\omega_1(d)$, $\omega_2(d)$, $\omega_n(d)$ are the weights of $t_1$, $t_2$, $t_n$ in text $d$.
The weight of a term is calculated as:
$$\omega_i(d) = tf_i(d) \times \log\frac{N}{n_i}$$
where $\omega_i(d)$ is the weight of the i-th term in text $d$, $tf_i(d)$ is the frequency with which the i-th term occurs in text $d$, $N$ is the number of all texts, and $n_i$ is the number of texts in which the i-th term occurs.
This formula involves no complicated mathematical derivation; it is simple and fast to compute, easy to understand, and its results match practice. Because the weights can be calculated simply, quickly and accurately, each text can be quickly represented as an n-dimensional feature vector.
SVM classification unit 3, which uses the feature vectors in numerical form as training data, builds an SVM classification model, and classifies the feature vectors of the web pages to be classified with the SVM classifier.
SVM is the abbreviation of support vector machine.
After the SVM classifier classifies the feature vectors of the web pages to be classified, the feature vectors are divided into two classes: one is the samples that meet the classification condition, and the other is the samples outside the categories of interest. For example, the acquired web page data is stored in a database; it cannot be determined directly which of these data meet the classification condition, so they need to be filtered with the SVM.
Data transmission unit 4, which passes the feature vectors output by the SVM classifier that meet the classification condition to the Naive Bayes classifier for classification.
The samples that meet the classification condition are the samples that are needed. The classification condition is determined according to the actual situation, with specific thresholds set as required, for example the number of unique visitors (UV) exceeding a given value, or the average number of pages visited reaching a given value.
Bayes classification unit 5, which classifies the feature vectors of the web pages to be classified with the Naive Bayes classifier.
In this way, the SVM classification model is combined with Naive Bayes, which supports incremental training: binary classification is performed first with the SVM, and multi-class classification is then performed with the Naive Bayes model, so classification is faster and more accurate. After new data is included, the apparatus can adjust and correct its judgments automatically, improving accuracy without retraining. It has the advantages of improving classification efficiency and accuracy and of reducing algorithmic complexity. This hybrid classification apparatus can serve user needs in web page classification and real-time marketing quickly and accurately.
Embodiment 6
The web page classification apparatus based on a hybrid algorithm is as described above. This embodiment differs in that, in the SVM classification unit 3, the kernel function of the SVM classification model is a radial basis function (RBF) kernel. Because web page categories are numerous, the RBF kernel is chosen: it is simple and easy to implement and speeds up the processing of web page classification, which in turn speeds up both building the SVM classification model and classifying the web pages to be classified with it.
Wherein, the formula for the distance between vectors in the RBF is:
Wherein, $D$ is the distance between the vectors, $\omega_i(d_m)$ is the component of vector $d_m$ in the i-th dimension, and $\omega_i(d_n)$ is the component of vector $d_n$ in the i-th dimension.
Wherein, this calculation method is simple and easy to implement, which speeds up the processing of web page classification and, in turn, the building of the SVM classification model and its use to classify the web pages to be classified.
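As an illustration of an RBF kernel built on a vector distance, the sketch below assumes that the distance $D$ is the Euclidean distance over the components $\omega_i$, and uses the common Gaussian form $\exp(-\gamma D^2)$. Both the Euclidean choice and the parameter `gamma` are assumptions, since the text does not state them.

```python
import math
from typing import Sequence


def euclidean_distance(dm: Sequence[float], dn: Sequence[float]) -> float:
    """Distance D between two vectors (assumed Euclidean)."""
    if len(dm) != len(dn):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dm, dn)))


def rbf_kernel(dm: Sequence[float], dn: Sequence[float], gamma: float = 0.5) -> float:
    """Gaussian RBF kernel exp(-gamma * D^2); gamma is a hypothetical default."""
    distance = euclidean_distance(dm, dn)
    return math.exp(-gamma * distance * distance)
```

The kernel equals 1 for identical vectors and decays toward 0 as the distance grows.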
Embodiment 7
The web page classification apparatus based on a hybrid algorithm is as described above. This embodiment differs in that, as shown in Fig. 5, the SVM classification unit 3 includes:
Model building module 31, which uses the feature vectors in numeric form as training data, determines the classification formula, and builds the SVM classification model;
The categories of some web pages to be classified are judged, and the feature vectors of those judged web pages are used as training data; the classification formula of the SVM classification model is determined from this training data.
Here, the web pages used as training data are a portion of all the web pages to be classified.
The process of building the SVM is as follows:
Let the training samples be $(x_i, y_i)$, $i = 1, \dots, n$, with $x \in R^d$ and class label $y \in \{-1, +1\}$. The general form of the linear discriminant function in $d$-dimensional space is $g(x) = w \cdot x + b$, and the equation of the separating hyperplane is $w \cdot x + b = 0$. The discriminant function is normalized so that all samples of both classes satisfy $|g(x)| \ge 1$, that is, the samples nearest the separating hyperplane satisfy $|g(x)| = 1$. The classification margin is then $2/\|w\|$, so maximizing the margin is equivalent to minimizing $\|w\|$ (or $\|w\|^2$).
This finally gives the classification formula $y_i[(w \cdot x_i) + b] - 1 \ge 0$, $i = 1, 2, \dots, n$.
Model classification module 32, which evaluates the classification formula of the SVM classifier on the feature vectors of the web pages to be classified and checks whether each feature vector satisfies the formula, thereby dividing the feature vectors into two classes.
Wherein, the feature vectors that satisfy the classification formula form one class of samples, and those that do not satisfy it form the other class.
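The formula check can be sketched as below for a linear discriminant. The parameters `w` and `b` are hypothetical values chosen for illustration, and evaluating the formula with $y = +1$ for unlabeled vectors is one possible reading of the text rather than a confirmed detail (a conventional SVM would instead use the sign of $g(x)$).

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def g(w: Vector, b: float, x: Vector) -> float:
    """Linear discriminant function g(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def formula_holds(w: Vector, b: float, x: Vector, y: int = 1) -> bool:
    """Check the classification formula y * (w . x + b) - 1 >= 0."""
    return y * g(w, b, x) - 1 >= 0


def margin(w: Vector) -> float:
    """Classification margin 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


def split(w: Vector, b: float, vectors: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Divide vectors into those that satisfy the formula (with y = +1) and the rest."""
    accepted = [x for x in vectors if formula_holds(w, b, x)]
    rejected = [x for x in vectors if not formula_holds(w, b, x)]
    return accepted, rejected


w, b = (1.0, 1.0), -3.0  # hypothetical trained parameters
accepted, rejected = split(w, b, [(3.0, 3.0), (0.0, 1.0), (2.0, 1.5)])
```

With these parameters only the first vector satisfies the formula; the third lies inside the margin band and is placed in the other class under this reading.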
Embodiment 8
The web page classification apparatus based on a hybrid algorithm is as described above. This embodiment differs in that, as shown in Fig. 6, the Bayes classification unit 5 includes:
Feature determination module 51, which selects a portion of the feature vectors output by the SVM classifier as training samples, and determines the feature attributes of each feature vector in the training samples and the category of the web page to which each feature vector corresponds.
For example, if a feature vector contains multiple feature attributes, it can be expressed as $x = \{a_1, \dots, a_m\}$, where each $a$ is a feature attribute of $x$.
The web pages to be classified fall into multiple categories, which can be expressed as the category set $C = \{y_1, \dots, y_n\}$.
Probability statistics module 52, which counts the frequency with which each category of web page occurs in the training samples and estimates the conditional probability of each feature attribute under each category.
The conditional probability estimates of each feature attribute under each category are:
$P(a_1 \mid y_1), P(a_2 \mid y_1), \dots, P(a_m \mid y_1);\ P(a_1 \mid y_2), P(a_2 \mid y_2), \dots, P(a_m \mid y_2);\ \dots;\ P(a_1 \mid y_n), P(a_2 \mid y_n), \dots, P(a_m \mid y_n)$
Wherein, $y_1, y_2, \dots, y_n$ are the 1st to n-th categories, and $a_1, a_2, \dots, a_m$ are the 1st to m-th feature attributes. $P(a_m \mid y_n)$ is the conditional probability estimate of the m-th feature attribute in the n-th category, that is, the probability that the m-th feature attribute occurs given that the n-th category has occurred.
Wherein, the conditional probability estimates are determined statistically from the actual data.
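A count-based way to obtain the category frequencies and conditional probability estimates might look like the sketch below. The class name `NaiveBayesCounts`, its methods, and the toy attributes are hypothetical; no smoothing is applied, which the text does not mention either way. Because the estimates are ratios of counts, a new sample only increments counters, which is one way to read the incremental training mentioned earlier.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


class NaiveBayesCounts:
    """Keeps the counts from which P(y_i) and P(a_j | y_i) are estimated."""

    def __init__(self) -> None:
        self.total = 0
        self.class_counts: Counter = Counter()
        self.attr_counts: Dict[str, Counter] = defaultdict(Counter)

    def update(self, attributes: Iterable[str], category: str) -> None:
        """Add one training sample; earlier samples are not revisited."""
        self.total += 1
        self.class_counts[category] += 1
        for attribute in set(attributes):
            self.attr_counts[category][attribute] += 1

    def prior(self, category: str) -> float:
        """Frequency with which the category occurs in the training samples."""
        return self.class_counts[category] / self.total if self.total else 0.0

    def conditional(self, attribute: str, category: str) -> float:
        """Fraction of samples in the category that contain the attribute."""
        count = self.class_counts[category]
        return self.attr_counts[category][attribute] / count if count else 0.0


model = NaiveBayesCounts()
model.update({"price", "cart"}, "shopping")
model.update({"price", "review"}, "shopping")
model.update({"headline", "review"}, "news")
```

Calling `model.update(...)` again with a newly labeled page adjusts every estimate that depends on it without rebuilding the counts.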
Probability calculation module 53, which analyzes the feature attributes of the web pages to be classified that are output by the SVM classifier, and calculates the probability that each such web page belongs to each category.
Wherein, the formula for the class probability of the web page to be classified is:
$$P(y_i \mid x) = P(y_i) \times C \times \prod_{j=1}^{m} P(a_j \mid y_i)$$
Wherein, $x$ is the feature vector of the web page to be classified, $i$ is the index of the category, $j$ is the index of the feature attribute, $m$ is the total number of feature attributes, $C$ is a constant, $y_i$ is the i-th category, $a_j$ is the j-th feature attribute, $P(y_i)$ is the frequency with which the i-th category occurs, $P(a_j \mid y_i)$ is the conditional probability estimate of the j-th feature attribute in the i-th category, and $P(y_i \mid x)$ is the class probability of the web page to be classified.
In this way, the probability that a web page to be classified belongs to each category can be calculated quickly, so its best category can be determined rapidly, improving efficiency; the formula is simple and convenient to compute, which saves system resources.
Category determination module 54, which determines the largest of the class probabilities of the web page to be classified; the category corresponding to that probability is the category of the web page to be classified.
That is, if:
$P(y_i \mid x) = \max\{P(y_1 \mid x), P(y_2 \mid x), \dots, P(y_n \mid x)\}$
then $x \in y_i$, that is, the category of the web page to be classified is the i-th category.
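The class probability and the maximum rule can be illustrated with the sketch below. The function names and the toy probabilities are hypothetical. The constant $C$ is passed as `c` and defaults to 1.0; since it is the same for every category, it does not change which category attains the maximum.

```python
from typing import Dict, Iterable


def class_scores(
    attributes: Iterable[str],
    priors: Dict[str, float],
    conditionals: Dict[str, Dict[str, float]],
    c: float = 1.0,
) -> Dict[str, float]:
    """Compute P(y_i) * C * product over j of P(a_j | y_i) for every category."""
    attribute_list = list(attributes)
    scores: Dict[str, float] = {}
    for category, prior in priors.items():
        score = prior * c
        for attribute in attribute_list:
            # An attribute never seen in this category contributes probability 0.
            score *= conditionals[category].get(attribute, 0.0)
        scores[category] = score
    return scores


def best_category(scores: Dict[str, float]) -> str:
    """Return the category whose class probability is largest."""
    return max(scores, key=scores.get)


priors = {"news": 0.5, "shopping": 0.5}
conditionals = {
    "news": {"headline": 0.8, "price": 0.1},
    "shopping": {"headline": 0.2, "price": 0.9},
}
scores = class_scores(["price", "headline"], priors, conditionals)
label = best_category(scores)
```

For pages with many attributes, summing logarithms instead of multiplying probabilities is a common way to avoid numerical underflow; the product form is kept here to mirror the formula.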
In this way, classifying with the Naive Bayes model makes classification faster; after new data is included, the model can adjust, correct, and judge automatically, improving accuracy without retraining. This improves computational efficiency and classification accuracy and reduces algorithmic complexity.
The foregoing are merely preferred embodiments of the present invention and are illustrative rather than restrictive. Those skilled in the art will understand that many changes, modifications, and even equivalents may be made within the spirit and scope defined by the claims of the present invention, and all of these fall within the protection scope of the present invention.

Claims (10)

1. A web page classification method based on a hybrid algorithm, characterized by comprising:
Step a, searching for web pages to be classified, and processing the web pages to be classified to obtain web page data;
Step b, processing the web page data, converting the web page data into a text representation using a vector space model, calculating the weights of the terms, and converting the feature vectors of the web pages to be classified into numeric form;
Step c, using the feature vectors in numeric form as training data to build an SVM classification model, and classifying the feature vectors of the web pages to be classified with the SVM classifier;
Step d, delivering the feature vectors output by the SVM classifier that meet the classification condition to a Naive Bayes classifier for classification;
Step e, classifying the feature vectors of the web pages to be classified using the Naive Bayes classifier.
2. The web page classification method as claimed in claim 1, characterized in that step c comprises:
Step c1, using the feature vectors in numeric form as training data, determining the classification formula, and building the SVM classification model;
Step c2, evaluating the classification formula of the SVM classifier on the feature vectors of the web pages to be classified, and checking whether each feature vector satisfies the classification formula, thereby dividing the feature vectors into two classes.
3. The web page classification method as claimed in claim 1 or 2, characterized in that step e comprises:
Step e1, selecting a portion of the feature vectors output by the SVM classifier as training samples, and determining the feature attributes of each feature vector in the training samples and the category of the web page to which each feature vector corresponds;
Step e2, counting the frequency with which each category of web page occurs in the training samples and estimating the conditional probability of each feature attribute under each category;
Step e3, analyzing the feature attributes of the web pages to be classified that are output by the SVM classifier, and calculating the probability that each such web page belongs to each category;
Step e4, determining the largest of the class probabilities of the web page to be classified, the category corresponding to that probability being the category of the web page to be classified.
4. The web page classification method as claimed in claim 3, characterized in that, in step e3, the formula for the class probability of the web page to be classified is:
$$P(y_i \mid x) = P(y_i) \times C \times \prod_{j=1}^{m} P(a_j \mid y_i)$$
Wherein, $x$ is the feature vector of the web page to be classified, $i$ is the index of the category, $j$ is the index of the feature attribute, $m$ is the total number of feature attributes, $C$ is a constant, $y_i$ is the i-th category, $a_j$ is the j-th feature attribute, $P(y_i)$ is the frequency with which the i-th category occurs, $P(a_j \mid y_i)$ is the conditional probability estimate of the j-th feature attribute in the i-th category, and $P(y_i \mid x)$ is the class probability of the web page to be classified.
5. The web page classification method as claimed in claim 1 or 2, characterized in that the web page data is semi-structured data.
6. The web page classification method as claimed in claim 1 or 2, characterized in that, in step b, the formula for the weight of a term is:
$$\omega_i(d) = \frac{tf_i(d) \times \log(N / n_i)}{\sqrt{\sum \left( tf_i(d) \times \log(N / n_i) \right)^2}}$$
Wherein, $\omega_i(d)$ is the weight of the i-th term in text $d$, $tf_i(d)$ is the frequency with which the i-th term occurs in text $d$, $N$ is the total number of texts, and $n_i$ is the number of texts in which the i-th term occurs.
7. The web page classification method as claimed in claim 1 or 2, characterized in that, in step c, the kernel function of the SVM classification model is an RBF kernel function.
8. A web page classification apparatus based on a hybrid algorithm, corresponding to the web page classification method of any of the preceding claims, characterized by comprising:
Web page processing unit, which searches for web pages to be classified and processes the web pages to be classified to obtain web page data;
Data conversion unit, which processes the web page data, converts the web page data into a text representation using a vector space model, calculates the weights of the terms, and converts the feature vectors of the web pages to be classified into numeric form;
SVM classification unit, which uses the feature vectors in numeric form as training data to build an SVM classification model, and classifies the feature vectors of the web pages to be classified with the SVM classifier;
Data supply unit, which delivers the feature vectors output by the SVM classifier that meet the classification condition to a Naive Bayes classifier for classification;
Bayes classification unit, which classifies the feature vectors of the web pages to be classified using the Naive Bayes classifier.
9. The web page classification apparatus as claimed in claim 8, characterized in that the SVM classification unit comprises:
Model building module, which uses the feature vectors in numeric form as training data, determines the classification formula, and builds the SVM classification model;
Model classification module, which evaluates the classification formula of the SVM classifier on the feature vectors of the web pages to be classified and checks whether each feature vector satisfies the classification formula, thereby dividing the feature vectors into two classes.
10. The web page classification apparatus as claimed in claim 8 or 9, characterized in that the Bayes classification unit comprises:
Feature determination module, which selects a portion of the feature vectors output by the SVM classifier as training samples, and determines the feature attributes of each feature vector in the training samples and the category of the web page to which each feature vector corresponds;
Probability statistics module, which counts the frequency with which each category of web page occurs in the training samples and estimates the conditional probability of each feature attribute under each category;
Probability calculation module, which analyzes the feature attributes of the web pages to be classified that are output by the SVM classifier, and calculates the probability that each such web page belongs to each category;
Category determination module, which determines the largest of the class probabilities of the web page to be classified, the category corresponding to that probability being the category of the web page to be classified.
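For illustration, the term weight of claim 6 can be computed as in the sketch below. It assumes the usual cosine normalization (division by the square root of the sum of squared raw weights over the terms of the text) and the natural logarithm. The function name `term_weights` and the toy counts are hypothetical.

```python
import math
from typing import Dict


def term_weights(
    term_freqs: Dict[str, int],
    doc_freqs: Dict[str, int],
    n_docs: int,
) -> Dict[str, float]:
    """Normalized weights: tf_i(d) * log(N / n_i), divided by the vector's length."""
    raw = {
        term: tf * math.log(n_docs / doc_freqs[term])
        for term, tf in term_freqs.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    if norm == 0.0:
        # Every term occurs in every text, so no term is discriminative.
        return {term: 0.0 for term in raw}
    return {term: value / norm for term, value in raw.items()}


# One text d with its term frequencies, in a collection of N = 4 texts.
weights = term_weights(
    term_freqs={"svm": 2, "page": 5, "bayes": 1},
    doc_freqs={"svm": 1, "page": 4, "bayes": 2},
    n_docs=4,
)
```

The term "page" occurs in all four texts, so its weight is zero despite its high frequency. Because of the normalization, the base of the logarithm does not affect the result.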
CN201610557554.0A 2016-07-13 2016-07-13 Mixed algorithm-based web page classification method and apparatus Pending CN106445994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610557554.0A CN106445994A (en) 2016-07-13 2016-07-13 Mixed algorithm-based web page classification method and apparatus

Publications (1)

Publication Number Publication Date
CN106445994A true CN106445994A (en) 2017-02-22

Family

ID=58185021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610557554.0A Pending CN106445994A (en) 2016-07-13 2016-07-13 Mixed algorithm-based web page classification method and apparatus

Country Status (1)

Country Link
CN (1) CN106445994A (en)


* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110184817A1 (en) * 2010-01-28 2011-07-28 Yahoo!, Inc. Sensitivity Categorization of Web Pages
CN102426585A (en) * 2011-08-09 2012-04-25 中国科学技术信息研究所 Webpage automatic classification method based on Bayesian network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘巍: ""基于内容的不良网页信息过滤方法研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506472A (en) * 2017-09-05 2017-12-22 淮阴工学院 A kind of student browses Web page classification method
CN107506472B (en) * 2017-09-05 2020-09-08 淮阴工学院 Method for classifying browsed webpages of students
CN110019781A (en) * 2017-09-15 2019-07-16 北京京东尚科信息技术有限公司 Difference comments information classification approach and device, storage medium, electronic equipment
CN108134784A (en) * 2017-12-19 2018-06-08 东软集团股份有限公司 web page classification method and device, storage medium and electronic equipment
CN108134784B (en) * 2017-12-19 2021-08-31 东软集团股份有限公司 Webpage classification method and device, storage medium and electronic equipment
CN108897754A (en) * 2018-05-07 2018-11-27 广东省电信规划设计院有限公司 Recognition methods, system and the calculating equipment of work order type based on big data
CN108897754B (en) * 2018-05-07 2020-12-11 广东省电信规划设计院有限公司 Big data-based work order type identification method and system and computing device
CN108763961A (en) * 2018-06-04 2018-11-06 中国电子信息产业集团有限公司第六研究所 A kind of private data stage division and device based on big data
CN109446618A (en) * 2018-10-18 2019-03-08 重庆大学 A kind of ancient building component based on VR builds analogy method
CN111967503A (en) * 2020-07-24 2020-11-20 西安电子科技大学 Method for constructing multi-type abnormal webpage classification model and abnormal webpage detection method
CN111967503B (en) * 2020-07-24 2023-10-13 西安电子科技大学 Construction method of multi-type abnormal webpage classification model and abnormal webpage detection method
CN113407118A (en) * 2021-06-24 2021-09-17 九江职业技术学院 Data storage device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170222

RJ01 Rejection of invention patent application after publication