CN103810264B - Web page text classification method based on feature selection - Google Patents

Web page text classification method based on feature selection

Info

Publication number
CN103810264B
CN103810264B (application CN201410038614.9A)
Authority
CN
China
Prior art keywords
webpage
class
training set
web page
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410038614.9A
Other languages
Chinese (zh)
Other versions
CN103810264A (en)
Inventor
周红芳
郭杰
王鹏
张国荣
段文聪
王心怡
何馨依
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN201410038614.9A priority Critical patent/CN103810264B/en
Publication of CN103810264A publication Critical patent/CN103810264A/en
Application granted granted Critical
Publication of CN103810264B publication Critical patent/CN103810264B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In the web page text classification method based on feature selection, a data set made up of a large number of web pages is first divided into a training set and a test set. Then, labels are assigned different weights according to how well the information in each web page tag field represents the page content, and the weight of each feature word in every training-set web page is calculated as the product of the normalized term frequency and the inverse document frequency. On the basis of these weights, the internal distribution rate and the external deviation between classes are combined to compute the feature vector of each web page in the training set, and then the feature vector of each class in the training set. Finally, the term frequencies of the feature words in each test-set web page and the similarity between the web page to be classified and each class in the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the classification result.

Description

Web page text classification method based on feature selection
Technical field
The invention belongs to the technical field of data mining methods and relates to a web page text classification method based on feature selection.
Background technology
With the rapid development of computer and communication technology and the rapid popularization and application of the internet, the number of web pages on the network is growing geometrically. Faced with this explosively growing mass of network information, quickly and efficiently obtaining useful and interesting information from it has become more and more important. Effectively organizing and managing web page resources and shortening the time users need to obtain information is therefore an urgent problem, and web page classification technology has arisen in response, gradually becoming a research hotspot in machine learning after text classification.
Traditionally, web pages were classified by manual judgment: after analyzing the content of a page, a person selected a suitable category for it. This manual approach has several shortcomings. First, as the number of web page texts grows sharply, manual classification becomes unrealistic and consumes large amounts of human resources. Second, manual classification cannot guarantee high accuracy, because subjective factors such as personal experience differ from person to person, so classification results may be inconsistent. An effective method for managing web page texts is therefore urgently needed, and automatic web page text classification technology has begun to show its superiority.
Automatic web page text classification derives from automatic text classification, and its goal is consistent with text classification: under a predefined web page classification scheme, a web page to be classified is accurately assigned to one or more corresponding categories. Common web page text classification algorithms include KNN, Naive Bayes (NB), support vector machines (SVM), genetic algorithms (GA), the Rocchio algorithm, and so on. These automatic classification techniques still have many problems: the dimensionality of the web page text feature space is too high, which requires large storage and slows classification; web pages contain noise such as site marks and advertisements, which seriously interferes with determining a page's class and lowers classification accuracy; and information in different positions of a web page represents the page with different strength, which also affects accuracy. An effective web page text classification method is therefore urgently needed to reduce classification time and improve classification accuracy.
Summary of the invention
The object of the invention is to provide a web page text classification method based on feature selection, solving the problems of slow classification speed and low accuracy in the prior art.
The technical scheme of the invention is a web page text classification method based on feature selection. First, a data set made up of a large number of web pages is divided into a training set and a test set. Then, labels are assigned different weights according to how well the information in each web page tag field represents the page content, and the weight of each feature word in every training-set web page (the product of the normalized term frequency and the inverse document frequency) is calculated. On the basis of these weights, the internal distribution rate and the external deviation between classes are combined to compute the feature vector of each web page in the training set, and then the feature vector of each class in the training set. Finally, the term frequencies of the feature words in each test-set web page and the similarity between the web page to be classified and each class of the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the classification result.
Further features of the invention are as follows:
Feature words are the words obtained after preprocessing a web page that can represent the content of the page.
The web pages in the training set belong to several different classes, and the web pages in each class are processed to compute the feature vector of that class. Then the term frequency of each feature word in every test-set web page and the similarity between the web page to be classified and the feature vector of each class in the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the result of classifying the web page. The training set of the data set goes through a series of calculations to construct the web page classifier, and the test set is used to test how well the classifier classifies web pages.
The specific steps are as follows:
1. Divide the data set made up of a large number of web pages into a training set and a test set; the training set usually takes about 80% of the data set and the test set takes the remainder.
2. Preprocess the data set (both training set and test set): segment each web page into individual words, remove noise information unrelated to classification from the page, and remove stop words, i.e. words with no concrete meaning or used so widely that they carry no discriminative value.
3. Combine the positional feature of each feature word and calculate the term frequency of the feature words in each training-set web page.
4. Combine the internal distribution rate and the external deviation between classes of each feature word and calculate the weight (TFIDF) of the feature words in each training-set web page.
5. From the feature word weights in each web page, compute the text feature vector of each training-set web page.
6. From the text feature vectors of the web pages in each class, compute the feature vector of each class in the training set.
7. Combine the positional feature of each feature word and calculate the term frequency of the feature words in each test-set web page.
8. Classify web pages with the vector space model: use the cosine of the angle between two feature vectors to compute the similarity between the web page to be classified and each class in the training set, and take the class with the greatest similarity as the class of the web page to be classified. A simplified sketch of this flow follows the list.
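The following is a minimal, self-contained toy in Python of the train / class-vector / cosine-assignment flow of steps 1-8. It uses plain term counts as weights and omits the position and ED/ID refinements described later; the data, names, and structures are illustrative assumptions, not part of the patent.

```python
# Highly simplified toy of the steps 1-8 pipeline: split data, build one
# feature vector per class, assign each test page to the most similar class.
import math
from collections import Counter, defaultdict

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy "web pages": (class label, feature words after preprocessing).
pages = [
    ("sports", ["match", "team", "score", "coach"]),
    ("sports", ["team", "goal", "score", "season"]),
    ("finance", ["stock", "market", "bank", "price"]),
    ("finance", ["bank", "loan", "market", "rate"]),
    ("sports", ["coach", "match", "season"]),   # held out as the "test set"
]
train, test = pages[:4], pages[4:]

# Steps 5-6: build one feature vector per class by merging page term counts.
class_vectors = defaultdict(Counter)
for label, tokens in train:
    class_vectors[label].update(tokens)

# Steps 7-8: assign each test page to the most similar class.
for label, tokens in test:
    vec = Counter(tokens)
    predicted = max(class_vectors, key=lambda c: cosine(vec, class_vectors[c]))
    print(f"true={label}  predicted={predicted}")
```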
When calculating the term frequency of a feature word, its position is taken into account. Based on practical experience and previous research, the invention considers that the title, which represents the central content of the page, has the highest weight; the description and keywords, which summarize and emphasize the page and play a key role, come second; and the body text of the page has the lowest weight.
When calculating the weight of feature word t_k, its external deviation ED_kj and internal distribution rate ID_kj are combined. The external deviation ED_kj is computed as follows:

ED_{kj} = \frac{N(t_k, C_j)}{\sum_{x=1}^{m} N(t_k, C_x)}    (1)

where N(t_k, C_j) is the number of documents in class C_j that contain feature word t_k, \sum_{x=1}^{m} N(t_k, C_x) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set.
The internal distribution rate ID_kj is computed as follows:

ID_{kj} = \frac{M(t_k, C_j)}{M(C_j)}    (2)

where M(t_k, C_j) is the total number of times feature word t_k appears in class C_j, and M(C_j) is the total number of times all words appear in class C_j.
The weight is computed as follows:

W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)\right]^2}} \times ED_{kj} \times ID_{kj}    (3)

where tf_ik(d_i) is the new term frequency obtained after correction according to the position of feature word t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the external deviation of t_k, and ID_kj is the internal distribution rate of t_k.
A large number of web pages means at least 6000 web pages.
The present invention has the following advantages:
1. In classification accuracy, compared with the traditional TFIDF algorithm and the genetic algorithm (GA), the classification method of the invention outperforms both contrast algorithms. The main reasons are: (1) when calculating the term frequency of a feature word, the influence of its position in the web page is taken into account and corrected, which effectively improves classification accuracy; (2) when calculating the weight of a feature word, its internal distribution rate and its external deviation between classes are combined, which further improves classification accuracy.
2. In classification time, because the method of the invention considers the position of feature words in the web page and their distribution within and between classes when computing term weights, it significantly reduces the execution time compared with the genetic algorithm, which has a similarly good classification effect.
3. On the whole, the recall of the invention is higher than that of the traditional TFIDF algorithm and the genetic algorithm.
Brief description of the drawings
Fig. 1 is a comparison chart of the classification precision of the web page text classification method based on feature selection of the invention and of the prior art;
Fig. 2 is a comparison chart of the classification recall of the web page text classification method based on feature selection of the invention and of the prior art.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
When computing term weights, the classification method of the invention combines the position of feature words with their distribution within and between classes, so that feature words that contribute nothing to classification are not given large weights, which ultimately improves classification accuracy.
The related definitions used in the invention are as follows:
Definition 1 (Term frequency). The term frequency (TF) is the number of times feature word t_k appears in document d_i, denoted tf_ik(d_i). With stop words and individual high-frequency words excluded, the more times t_k appears in d_i, the stronger its ability to characterize d_i.
Definition 2 (Document frequency). The document frequency (DF) is the number of documents in document set D in which feature word t_k appears, denoted N(t_k, D). The larger N(t_k, D), the weaker the representativeness of t_k for a document d_i in D.
Definition 3 (Inverse document frequency). The inverse document frequency (IDF) is a measure of how frequently feature word t_k appears in document set D, denoted IDF_k:

IDF_k = \log\left(\frac{N(D)}{N(t_k, D)}\right)

where N(D) is the total number of documents in the training set and N(t_k, D) is the number of documents in D that contain t_k. IDF_k decreases as N(t_k, D) increases: the smaller the number of documents in D containing t_k, the more representative t_k is for a document d_i in D.
Definition 4 (Normalization). To reduce the suppression of low-frequency feature words by individual high-frequency words, each component is normalized. The normalized TFIDF is computed as follows:

W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + L\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + L\right)\right]^2}}

where L is an empirical value, generally L = 0.01, tf_ik(d_i) is the number of times feature word t_k appears in document d_i, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in D that contain t_k, and n is the total number of feature words in document d_i.
Definition 5 (VSM representation of a web page). A web page d is represented as V(d) = (t_1, w_1(d); …; t_k, w_k(d); …; t_n, w_n(d)), where t_k is a feature word of the page and w_k(d) is the frequency weight with which t_k appears.
Definition 6 (External deviation). The external deviation (ED) reflects that a feature word may appear in some classes but not in others; it is a measure of between-class uncertainty, denoted ED_kj:

ED_{kj} = \frac{N(t_k, C_j)}{\sum_{x=1}^{m} N(t_k, C_x)}

where N(t_k, C_j) is the number of documents in class C_j that contain feature word t_k, \sum_{x=1}^{m} N(t_k, C_x) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set. As the formula shows, the larger ED_kj, the more t_k is concentrated in class C_j and the stronger its characterization of C_j.
Definition 7 (Internal distribution rate). The internal distribution rate (ID) is the probability of a feature word appearing across all documents of a class; it measures how evenly t_k is distributed within a specific class, denoted ID_kj:

ID_{kj} = \frac{M(t_k, C_j)}{M(C_j)}

where M(t_k, C_j) is the total number of times feature word t_k appears in class C_j and M(C_j) is the total number of times all words appear in class C_j. As the formula shows, the larger ID_kj, the more uniformly t_k is distributed within class C_j and the stronger its characterization of C_j.
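A small Python sketch of how ED_kj and ID_kj from Definitions 6 and 7 could be computed from raw training-set counts; the representation of the training set (a mapping from class name to a list of token lists) and the toy data are assumptions made for this example.

```python
from collections import Counter

def external_deviation(word: str, cls: str, train: dict) -> float:
    """ED_kj: documents of class cls containing word, divided by documents of
    all classes containing word (Definition 6)."""
    def docs_with_word(c):
        return sum(1 for doc in train[c] if word in doc)
    total = sum(docs_with_word(c) for c in train)
    return docs_with_word(cls) / total if total else 0.0

def internal_distribution(word: str, cls: str, train: dict) -> float:
    """ID_kj: occurrences of word in class cls, divided by occurrences of all
    words in that class (Definition 7)."""
    counts = Counter()
    for doc in train[cls]:
        counts.update(doc)
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Toy training set: class name -> list of tokenized documents.
train = {
    "sports":  [["match", "team", "score"], ["team", "goal"]],
    "finance": [["stock", "market"], ["bank", "market", "team"]],
}
print(external_deviation("team", "sports", train))    # 2 of the 3 docs containing "team"
print(internal_distribution("team", "sports", train)) # 2 of the 5 word occurrences in "sports"
```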
Feature words are the words obtained after preprocessing a web page that can represent the content of the page.
The specific steps are as follows:
1. Divide the large number of web pages into a training set and a test set; the training set usually takes about 80% of the total number of web pages and the test set takes the remaining part.
2. Preprocess the web pages (both training set and test set): segment each page into individual words, remove noise information unrelated to classification from the page, and remove stop words (words with no concrete meaning or used so widely that they carry no discriminative value).
3. Combine the positional feature of each feature word (all words in a page that can represent its content) and calculate the term frequency of the feature words in each training-set web page.
4. Combine the internal distribution rate and the external deviation between classes of each feature word and calculate the weight (TFIDF) of the feature words in each training-set web page.
5. From the feature word weights in each web page, compute the text feature vector of each training-set web page.
6. From the text feature vectors of the web pages in each class, compute the feature vector of each class in the training set.
7. Combine the positional feature of each feature word and calculate the term frequency of the feature words in each test-set web page.
8. Classify web pages with the vector space model: use the cosine of the angle between two feature vectors to compute the similarity between the web page to be classified and each class in the training set, and take the class with the greatest similarity as the class of the web page to be classified.
A web page differs from ordinary text: it is a semi-structured file containing a large number of links and tags, and the information in different tag fields represents the page content with different strength and thus plays different roles in classification. The invention corrects the term frequency of feature word t_k according to its position: on the basis of the original term frequency, the count is multiplied by a weight corresponding to the position, giving the new term frequency. In the experiments, the Title is treated as a direct description of the page's subject, representing the central content of the page, and is assigned weight 4; the Description is a brief introduction to the page and the Keywords represent the key words of the page content, both summarizing and emphasizing the page, so these two parts are assigned weight 2; PlainText is the ordinary body text of the page, whose effect is secondary to the former two, and is assigned weight 1.
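A minimal sketch of this position-corrected term frequency: raw counts from the title are multiplied by 4, those from the description and keywords by 2, and those from the body by 1. The field names and the dict-based page representation are assumptions made for this example.

```python
from collections import Counter

# Position weights used in the experiments: title 4, description/keywords 2, body 1.
POSITION_WEIGHTS = {"title": 4, "description": 2, "keywords": 2, "body": 1}

def position_weighted_tf(page: dict) -> Counter:
    """Return tf_ik(d_i): term counts with each occurrence multiplied by the
    weight of the tag field it appears in.  `page` maps a field name to the
    list of feature words found in that field (an assumed representation)."""
    tf = Counter()
    for field, words in page.items():
        weight = POSITION_WEIGHTS.get(field, 1)
        for word in words:
            tf[word] += weight
    return tf

page = {
    "title": ["football", "league"],
    "keywords": ["football", "score"],
    "body": ["football", "match", "score", "score"],
}
print(position_weighted_tf(page))
# football: 4 + 2 + 1 = 7,  score: 2 + 2 = 4,  league: 4,  match: 1
```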
The invention considers both the distribution over the classes of the documents that contain feature word t_k and the distribution of t_k over the documents of a given class: when computing the weight of t_k, its external deviation ED_kj and internal distribution rate ID_kj are combined. The external deviation ED_kj is computed as follows:

ED_{kj} = \frac{N(t_k, C_j)}{\sum_{x=1}^{m} N(t_k, C_x)}

where N(t_k, C_j) is the number of documents in class C_j that contain t_k, \sum_{x=1}^{m} N(t_k, C_x) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set. The internal distribution rate ID_kj is computed as follows:

ID_{kj} = \frac{M(t_k, C_j)}{M(C_j)}

where M(t_k, C_j) is the total number of times t_k appears in class C_j and M(C_j) is the total number of times all words appear in class C_j.
The weight that combines the position of the feature word with its external deviation and internal distribution rate is computed as follows:

W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)\right]^2}} \times ED_{kj} \times ID_{kj}

where tf_ik(d_i) is the term frequency corrected according to the position of t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the external deviation of t_k, and ID_kj is the internal distribution rate of t_k.
When calculating the term frequency of a feature word, its position is taken into account. Based on practical experience and previous research, the invention considers that the title, which represents the central content of the page, has the highest weight; the description and keywords, which summarize and emphasize the page and play a key role, come second; and the body text of the page has the lowest weight.
In the invention, the more often a word appears in a given text, the stronger its ability to discriminate the attributes of that text's content; the wider the range of texts in which a word appears, i.e. the more similar its occurrence counts are across all classes, the weaker its ability to distinguish text content. Since a web page is a semi-structured file containing a large number of links and tags, and the information in different tag fields represents the page content with different strength and thus plays different roles in classification, the invention defines the information that best reflects the page content as position 1 and gives it the highest weight, the information that reflects the page content fairly well as position 2 with a high weight, and the information that reflects the page content less well than the former two as position 3 with a lower weight, that is:

weight(p = 1) > weight(p = 2) > weight(p = 3)    (6)

where p is the positional feature. In the experiments, the Title is a direct description of the page's subject and represents its central content, so it is placed at position 1 and assigned weight 4; the Description is a brief introduction to the page and the Keywords represent the key words of its content, both summarizing and emphasizing the page, so these two parts are placed at position 2 and assigned weight 2; Plain Text is the ordinary body text of the page, secondary in effect to the former two, so it is placed at position 3 and assigned weight 1. The invention corrects the term frequency of feature word t_k according to its position in the page: on the basis of the original term frequency, the count is multiplied by the weight corresponding to its positional feature, giving the new term frequency w_k(d).
Second, since the distribution of feature words within and between classes is rarely considered in web page text classification algorithms, the invention combines the external deviation and the internal distribution rate of each feature word to adjust its weight once more.
Finally, the invention proposes a TFIDF feature weighting method that combines the position of the feature word with its external deviation and internal distribution rate, with the following formula:

W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)\right]^2}} \times ED_{kj} \times ID_{kj}

where tf_ik(d_i) is the new term frequency obtained after correction according to the position of feature word t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the external deviation of t_k, and ID_kj is the internal distribution rate of t_k.
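As a concrete illustration of this weighting formula, the following Python sketch computes the weight of one feature word in one document from the quantities defined above. The function name and argument layout are assumptions made for this example, and the square root in the normalizing denominator follows the standard normalized TFIDF form.

```python
import math

def feature_weight(tf_pos: dict, doc_freqs: dict, t_k: str,
                   n_docs: int, ed_kj: float, id_kj: float) -> float:
    """Weight of feature word t_k in one document, per formula (3).

    tf_pos    -- position-corrected term frequencies tf_ik(d_i) for every
                 feature word of the document
    doc_freqs -- N(t, D): number of training documents containing each word t
    n_docs    -- N(D): total number of documents in the training set
    ed_kj     -- external deviation ED_kj of t_k for the document's class
    id_kj     -- internal distribution rate ID_kj of t_k for that class
    """
    def idf(t: str) -> float:
        return math.log(n_docs / doc_freqs[t] + 0.01)

    numerator = tf_pos[t_k] * idf(t_k)
    # Normalization over all n feature words of the document (the square root
    # is assumed from the standard normalized TFIDF form).
    denominator = math.sqrt(sum((tf * idf(t)) ** 2 for t, tf in tf_pos.items()))
    return (numerator / denominator) * ed_kj * id_kj

# Tiny usage example with made-up numbers:
print(feature_weight(
    tf_pos={"football": 8.0, "market": 1.0},
    doc_freqs={"football": 120, "market": 450},
    t_k="football", n_docs=7200, ed_kj=0.8, id_kj=0.05))
```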
Generally, the weights obtained from the above formula already give good classification results. However, when several classes contain the same feature word at the same time and the computed feature weight is rather large, the accuracy of the classification results is affected to some degree. The invention therefore corrects the weights obtained from the above formula once more; the corrected weight is denoted W'_ik(d_i). The correction first computes, for each feature word, the sum of its weights over all classes, denoted sum (note: when the feature word does not appear in a class, its weight there is 0), and then divides the weight obtained from the above formula by sum, thereby reducing its influence on the classification results, i.e.

W'_{ik}(d_i) = \frac{W_{ik}(d_i)}{sum}

The weight computed according to formula (7) reduces the influence on the classification results of the same feature word appearing in different classes with an overly large weight, while not affecting the influence on classification of feature words that are exclusive to one class.
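A small sketch of this correction, assuming the per-class weights of a feature word have already been computed: the document weight is divided by the sum of the word's weights over all classes, with 0 contributed by classes in which the word does not occur. The function name and the example values are assumptions.

```python
def corrected_weight(w_ik: float, class_weights: dict) -> float:
    """W'_ik(d_i): divide the document weight by the sum of the feature word's
    weights over all classes (0 where the word is absent), reducing the impact
    of words that appear with large weight in several classes."""
    total = sum(class_weights.values())
    return w_ik / total if total else 0.0

# A word appearing strongly in two classes is down-weighted more than a word
# exclusive to one class:
print(corrected_weight(0.30, {"sports": 0.30, "entertainment": 0.25}))  # about 0.545
print(corrected_weight(0.30, {"sports": 0.30}))                          # 1.0
```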
For the classifier, the invention chooses the vector space model: the similarity between the web page to be classified and each class is computed first, and then the class with the greatest similarity is taken as the class of the page. The similarity is computed as the cosine of the angle between the two feature vectors:

Sim(d_i, C_j) = \frac{\sum_{k=1}^{n} W_{ik} \times W_{jk}}{\sqrt{\sum_{k=1}^{n} W_{ik}^2} \times \sqrt{\sum_{k=1}^{n} W_{jk}^2}}    (14)

where W_ik and W_jk are the weights of the k-th feature word in document d_i and in class C_j respectively, and n is the total number of feature words.
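The cosine similarity written out as a self-contained Python sketch; the dict-of-weights representation and the example vectors are assumptions made for this example.

```python
import math

def cosine_similarity(doc_vec: dict, class_vec: dict) -> float:
    """Sim(d_i, C_j): cosine of the angle between the two weight vectors,
    treating feature words missing from a vector as weight 0."""
    shared = doc_vec.keys() & class_vec.keys()
    dot = sum(doc_vec[t] * class_vec[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_c = math.sqrt(sum(w * w for w in class_vec.values()))
    return dot / (norm_d * norm_c) if norm_d and norm_c else 0.0

doc = {"football": 0.9, "score": 0.4}
cls = {"football": 0.8, "match": 0.5, "score": 0.3}
print(round(cosine_similarity(doc, cls), 3))
```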
Embodiment: the web page text classification method based on feature selection proposed by the invention is implemented as follows.
The web pages used in the invention come from the internet corpus SougouCS of the Sogou Laboratory. In the experiments, because some categories contain very few web pages, only 12 categories were chosen: automobile, finance, IT, health, sports, tourism, education, culture, military, real estate, entertainment, and fashion. The sorted web pages were divided into a training set and a test set, with 600 training pages and 200 test pages in each class.
There are 12 classes in this embodiment, each with 600 training pages and 200 test pages, so the total number of web pages is 12 × (600 + 200) = 9600.
The web pages are preprocessed: each page is segmented into words, noise information unrelated to classification is removed, and stop words are removed. For example, if the body text of a page is "I am a student", segmentation produces the sequence of words "I / am / a / student", and after noise information and stop words are removed the remaining result is "student".
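A minimal preprocessing sketch, assuming the jieba segmenter (a common choice for Chinese text, not specified by the patent) and the Chinese original of the example above, 我是一个学生 ("I am a student"); the stop-word list here is a tiny illustrative stand-in for a real one.

```python
import jieba  # assumed Chinese word segmenter; not mandated by the patent

# Tiny illustrative stop-word list; a real system would load a full list.
STOP_WORDS = {"我", "是", "一个", "的", "了"}

def preprocess(text: str) -> list:
    """Segment the page text into words and drop stop words."""
    return [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]

print(preprocess("我是一个学生"))  # expected: ['学生'] ("student"), as in the example above
```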
Combining the positional feature of each feature word, the term frequency of the feature words in each training-set web page is calculated: the number of times each feature word occurs in each page of the training set is counted, and the count is multiplied by 4 if the word occurs in the title, by 2 if it occurs in the description or keywords, and by 1 if it occurs in the body text.
Combining the internal distribution rate and the external deviation of each feature word, the weight (TFIDF) of the feature words in each training-set web page is calculated: the external deviation of each feature word is computed according to formula (1), its internal distribution rate according to formula (2), and finally its composite weight according to formula (3).
For each web page in the training set, the n feature words with the highest weights (n can take any value and is generally rather large; n = 100 in the invention) together with their weights form the text feature vector of the page. The text feature vectors of all pages in a class are merged and sorted by weight in descending order, and the first n feature words (again n = 100 in the invention) together with their weights form the feature vector of that class. Once the feature vectors of all classes have been obtained, training is complete.
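A sketch of building the per-page and per-class feature vectors from already-computed word weights, keeping the top n = 100 words. The input representation (one weight dict per page, grouped by class) is an assumption, and since the patent does not say how duplicate words are combined when page vectors are merged, summation is used here as an assumed choice.

```python
from collections import defaultdict

TOP_N = 100  # n = 100 in the invention

def page_vector(word_weights: dict, top_n: int = TOP_N) -> dict:
    """Keep the top_n highest-weighted feature words of one page."""
    top = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)

def class_vector(page_vectors: list, top_n: int = TOP_N) -> dict:
    """Merge the page vectors of one class and keep the top_n words overall."""
    merged = defaultdict(float)
    for vec in page_vectors:
        for word, weight in vec.items():
            merged[word] += weight  # merging by summation is an assumption
    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)

# Usage: one dict of word weights per training page, grouped by class.
sports_pages = [{"match": 0.9, "team": 0.7}, {"team": 0.8, "goal": 0.5}]
print(class_vector([page_vector(p) for p in sports_pages], top_n=3))
```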
Combining the positional feature of each feature word, the term frequency of the feature words in each test-set web page is calculated: the number of times each feature word occurs in each page of the test set is counted, and the count is multiplied by 4 if the word occurs in the title, by 2 if it occurs in the description or keywords, and by 1 if it occurs in the body text.
Web pages are then classified with the vector space model: the similarity between the web page to be classified and each class in the training set is computed according to formula (14), and the class with the greatest similarity is taken as the class of the page. After this step, web page classification according to the invention is finished; the classification results are shown in the confusion matrix of the following table.
Table 1. Classification results of the invention
As can be seen from Table 1, the number of web pages correctly classified by the invention is generally large, but there are also categories, such as health, culture, and fashion, with relatively low numbers of correct classifications. This is because these categories share too many feature words with some other categories, i.e. the boundaries between these categories are blurred. For example, 31 web pages of the fashion class were assigned to the entertainment class in the classification results.
To verify the accuracy of the invention, it was compared with the traditional TFIDF algorithm and with the genetic algorithm (GA). The classification performance is evaluated with precision and recall, computed per class as

precision = \frac{\text{number of pages correctly assigned to the class}}{\text{number of pages assigned to the class}}, \qquad recall = \frac{\text{number of pages correctly assigned to the class}}{\text{number of pages that actually belong to the class}}
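A small sketch of this per-class precision and recall computed from (true class, predicted class) pairs; the function name and the toy result list are assumptions made for this example.

```python
def precision_recall(results: list, cls: str) -> tuple:
    """Per-class precision and recall from (true, predicted) pairs."""
    assigned = sum(1 for true, pred in results if pred == cls)         # pages put into cls
    actual = sum(1 for true, pred in results if true == cls)           # pages really in cls
    correct = sum(1 for true, pred in results if true == pred == cls)  # correct assignments
    precision = correct / assigned if assigned else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall

results = [("sports", "sports"), ("sports", "entertainment"),
           ("fashion", "entertainment"), ("fashion", "fashion")]
print(precision_recall(results, "fashion"))  # (1.0, 0.5)
```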
The precision comparison is shown in Fig. 1 and the recall comparison in Fig. 2. As Figs. 1 and 2 show, the classification effect of the invention is better than that of the traditional TFIDF algorithm and the genetic algorithm, and for most classes both precision and recall are improved. This indicates that the distribution of feature words within and between classes has a definite influence on the weight calculation, so considering these two factors effectively improves classification precision and recall. It also shows that taking the position of feature words within the web page into account when computing weights significantly improves the accuracy of web page classification.

Claims (5)

1. A web page text classification method based on feature selection, characterized in that: first, a data set made up of a large number of web pages is divided into a training set and a test set; then, labels are assigned different weights according to how well the information in each web page tag field represents the page content, and the weight of each feature word in every training-set web page is calculated, the weight being the product of the normalized term frequency and the inverse document frequency; on the basis of these weights, the internal distribution rate and the external deviation between classes are combined to compute the feature vector of each web page in the training set, and then the feature vector of each class in the training set; finally, the term frequencies of the feature words in each test-set web page and the similarity between the web page to be classified and each class in the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the classification result;
the web pages in the training set belong to several different classes, and the web pages in each class are processed to compute the feature vector of that class; then the term frequency of the feature words in each test-set web page and the similarity between the web page to be classified and the feature vector of each class in the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the result of classifying the web page; the training set of the data set goes through a series of calculations to construct the web page classifier, and the test set is used to test how well the classifier classifies web pages;
when calculating the weight of feature word t_k, the external deviation ED_kj and the internal distribution rate ID_kj of t_k are combined, wherein the external deviation ED_kj is computed as follows:
ED_{kj} = \frac{N(t_k, C_j)}{\sum_{x=1}^{m} N(t_k, C_x)}    (1)
in the formula, N(t_k, C_j) is the number of documents in class C_j that contain feature word t_k, \sum_{x=1}^{m} N(t_k, C_x) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set;
the internal distribution rate ID_kj is computed as follows:
ID_{kj} = \frac{M(t_k, C_j)}{M(C_j)}    (2)
in the formula, M(t_k, C_j) is the total number of times feature word t_k appears in class C_j, and M(C_j) is the total number of times all words appear in class C_j;
the weight is computed as follows:
W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)\right]^2}} \times ED_{kj} \times ID_{kj}    (3)
wherein tf_ik(d_i) is the new term frequency obtained after correction according to the position of feature word t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the external deviation of t_k, and ID_kj is the internal distribution rate of t_k.
2. The web page text classification method based on feature selection as claimed in claim 1, characterized in that the feature words are the words obtained after preprocessing a web page that can represent the content of the page.
3. The web page text classification method based on feature selection as claimed in any one of claims 1-2, characterized in that the specific steps are as follows:
1) dividing the data set made up of a large number of web pages into a training set and a test set, the training set usually taking about 80% of the data set and the test set taking the remainder;
2) preprocessing the data set: segmenting each web page into individual words, removing noise information unrelated to classification from the page, and removing stop words, i.e. words with no concrete meaning or used so widely that they carry no discriminative value;
3) combining the positional feature of each feature word and calculating the term frequency of the feature words in each training-set web page;
4) combining the internal distribution rate and the external deviation between classes of each feature word and calculating the weight of the feature words in each training-set web page;
5) from the feature word weights in each web page, computing the text feature vector of each training-set web page;
6) from the text feature vectors of the web pages in each class, computing the feature vector of each class in the training set;
7) combining the positional feature of each feature word and calculating the term frequency of the feature words in each test-set web page;
8) classifying web pages with the vector space model: using the cosine of the angle between two feature vectors to compute the similarity between the web page to be classified and each class in the training set, and taking the class with the greatest similarity as the class of the web page to be classified.
4. The web page text classification method based on feature selection as claimed in claim 1, characterized in that the title, which represents the central content of the web page, has the highest weight; the description and keywords, which summarize and emphasize the page and play a key role, come second; and the body text of the page has the lowest weight.
5. The web page text classification method based on feature selection as claimed in claim 1, characterized in that a large number of web pages means at least 6000 web pages.
CN201410038614.9A 2014-01-27 2014-01-27 Web page text classification method based on feature selection Expired - Fee Related CN103810264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410038614.9A CN103810264B (en) 2014-01-27 2014-01-27 Web page text classification method based on feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410038614.9A CN103810264B (en) 2014-01-27 2014-01-27 Web page text classification method based on feature selection

Publications (2)

Publication Number Publication Date
CN103810264A CN103810264A (en) 2014-05-21
CN103810264B true CN103810264B (en) 2017-06-06

Family

ID=50707034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410038614.9A Expired - Fee Related CN103810264B (en) 2014-01-27 2014-01-27 Web page text classification method based on feature selection

Country Status (1)

Country Link
CN (1) CN103810264B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word
CN104123659A (en) * 2014-07-30 2014-10-29 杭州野工科技有限公司 Commodity networked gene based brand intellectual property protection platform
CN104239436B (en) * 2014-08-27 2018-01-02 南京邮电大学 It is a kind of that method is found based on the network hotspot event of text classification and cluster analysis
CN106294392B (en) * 2015-05-20 2019-12-06 阿里巴巴集团控股有限公司 Webpage display method and device
CN104866573B (en) * 2015-05-22 2018-02-13 齐鲁工业大学 A kind of method of text classification
CN105205090A (en) * 2015-05-29 2015-12-30 湖南大学 Web page text classification algorithm research based on web page link analysis and support vector machine
CN105303296B (en) * 2015-09-29 2019-04-23 国网浙江省电力公司电力科学研究院 A kind of power equipment life-cycle method for evaluating state
CN105488029A (en) * 2015-11-30 2016-04-13 西安闻泰电子科技有限公司 KNN based evidence taking method for instant communication tool of intelligent mobile phone
CN107544980B (en) * 2016-06-24 2020-07-24 北京国双科技有限公司 Method and device for searching webpage
CN108614825B (en) * 2016-12-12 2022-04-15 中移(杭州)信息技术有限公司 Webpage feature extraction method and device
CN108268457A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of file classification method and device based on SVM
CN108268458B (en) * 2016-12-30 2020-12-08 广东精点数据科技股份有限公司 KNN algorithm-based semi-structured data classification method and device
CN108694325B (en) * 2017-04-10 2020-12-29 北大方正集团有限公司 Method and device for identifying specified type of website
CN107577708A (en) * 2017-07-31 2018-01-12 北京北信源软件股份有限公司 Class base construction method and system based on SparkMLlib document classifications
CN109858006B (en) * 2017-11-30 2021-04-09 亿度慧达教育科技(北京)有限公司 Subject identification training method and device
CN108764671B (en) * 2018-05-16 2022-04-15 山东师范大学 Creativity evaluation method and device based on self-built corpus
CN109101477B (en) * 2018-06-04 2023-01-31 东南大学 Enterprise field classification and enterprise keyword screening method
CN109472293A (en) * 2018-10-12 2019-03-15 国家电网有限公司 A kind of grid equipment file data error correction method based on machine learning
CN109299275A (en) * 2018-11-09 2019-02-01 长春理工大学 A kind of file classification method eliminated based on parallelization noise
CN110929028A (en) * 2019-11-01 2020-03-27 深圳前海微众银行股份有限公司 Log classification method and device
CN111368552B (en) * 2020-02-26 2023-09-26 北京市公安局 Specific-field-oriented network user group division method and device
CN111382273B (en) * 2020-03-09 2023-04-14 广州智赢万世市场管理有限公司 Text classification method based on feature selection of attraction factors

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
* Research on Automatic Chinese Text Classification Based on the VSM Model and Feature Selection Algorithms; 朱坤红; China Masters' Theses Full-text Database (Electronic Journal); 2012-04-30; main text pp. 22-28 *
* Research on Web Page Text Classification Technology Based on Support Vector Machines; 黄乐; China Masters' Theses Full-text Database (Electronic Journal); 2012-10-31; main text pp. 15-35 *

Also Published As

Publication number Publication date
CN103810264A (en) 2014-05-21

Similar Documents

Publication Publication Date Title
CN103810264B (en) Web page text classification method based on feature selection
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
Cohen et al. End to end long short term memory networks for non-factoid question answering
CN105205090A (en) Web page text classification algorithm research based on web page link analysis and support vector machine
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN106202294B (en) Related news computing method and device based on keyword and topic model fusion
CN103365997B (en) A kind of opining mining method based on integrated study
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN104199833B (en) The clustering method and clustering apparatus of a kind of network search words
US20080208840A1 (en) Diverse Topic Phrase Extraction
CN106445919A (en) Sentiment classifying method and device
CN106599054A (en) Method and system for title classification and push
CN105917364B (en) Ranking discussion topics in question-and-answer forums
WO2021184674A1 (en) Text keyword extraction method, electronic device, and computer readable storage medium
CN107169086B (en) Text classification method
CN110516074B (en) Website theme classification method and device based on deep learning
CN106021572A (en) Binary feature dictionary construction method and device
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN106997379A (en) A kind of merging method of the close text based on picture text click volume
Li et al. Text classification method based on convolution neural network
CN106649264B (en) A kind of Chinese fruit variety information extraction method and device based on chapter information
CN107908649B (en) Text classification control method
Ma et al. A microblog recommendation algorithm based on multi-tag correlation
CN103324942B (en) A kind of image classification method, Apparatus and system
Gao et al. Text categorization based on improved Rocchio algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170606

Termination date: 20210127

CF01 Termination of patent right due to non-payment of annual fee