CN103810264B - Web page text classification method based on feature selection - Google Patents

Web page text classification method based on feature selection

Info

Publication number
CN103810264B
CN103810264B (application CN201410038614.9A)
Authority
CN
China
Prior art keywords
webpage
class
training set
web page
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410038614.9A
Other languages
Chinese (zh)
Other versions
CN103810264A (en)
Inventor
周红芳
郭杰
王鹏
张国荣
段文聪
王心怡
何馨依
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN201410038614.9A priority Critical patent/CN103810264B/en
Publication of CN103810264A publication Critical patent/CN103810264A/en
Application granted granted Critical
Publication of CN103810264B publication Critical patent/CN103810264B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In the web page text classification method based on feature selection, a data set made up of a large number of web pages is first divided into a training set and a test set. Then, labels are assigned different weights according to how well the information in each web page tag field represents the page content, and the weight of each feature word in every training-set web page is calculated as the product of the normalized term frequency and the inverse document frequency. On the basis of these weights, the internal distribution rate and the external deviation between classes are combined to compute the feature vector of each web page in the training set, and then the feature vector of each class in the training set. Finally, the term frequencies of the feature words in each test-set web page and the similarity between the web page to be classified and each class in the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the classification result.

Description

Web page text classification method based on feature selection
Technical field
The invention belongs to the technical field of data mining methods and relates to a web page text classification method based on feature selection.
Background technology
With the rapid development of computer and communication technology and the rapid popularization and application of the internet, the number of web pages on the network is growing geometrically. Faced with this explosively growing mass of network information, quickly and efficiently obtaining useful and interesting information from it has become more and more important. Effectively organizing and managing web page resources and shortening the time users need to obtain information is therefore an urgent problem, and web page classification technology has arisen in response, gradually becoming a research hotspot in machine learning after text classification.
Traditionally, web pages were classified by manual judgment: after analyzing the content of a page, a person selected a suitable category for it. This manual approach has several shortcomings. First, as the number of web page texts grows sharply, manual classification becomes unrealistic and consumes large amounts of human resources. Second, manual classification cannot guarantee high accuracy, because subjective factors such as personal experience differ from person to person, so classification results may be inconsistent. An effective method for managing web page texts is therefore urgently needed, and automatic web page text classification technology has begun to show its superiority.
Automatic web page text classification derives from automatic text classification, and its goal is consistent with text classification: under a predefined web page classification scheme, a web page to be classified is accurately assigned to one or more corresponding categories. Common web page text classification algorithms include KNN, Naive Bayes (NB), support vector machines (SVM), genetic algorithms (GA), the Rocchio algorithm, and so on. These automatic classification techniques still have many problems: the dimensionality of the web page text feature space is too high, which requires large storage and slows classification; web pages contain noise such as site marks and advertisements, which seriously interferes with determining a page's class and lowers classification accuracy; and information in different positions of a web page represents the page with different strength, which also affects accuracy. An effective web page text classification method is therefore urgently needed to reduce classification time and improve classification accuracy.
Summary of the invention
The object of the invention is to provide a web page text classification method based on feature selection, solving the problems of slow classification speed and low accuracy in the prior art.
The technical scheme of the invention is a web page text classification method based on feature selection. First, a data set made up of a large number of web pages is divided into a training set and a test set. Then, labels are assigned different weights according to how well the information in each web page tag field represents the page content, and the weight of each feature word in every training-set web page (the product of the normalized term frequency and the inverse document frequency) is calculated. On the basis of these weights, the internal distribution rate and the external deviation between classes are combined to compute the feature vector of each web page in the training set, and then the feature vector of each class in the training set. Finally, the term frequencies of the feature words in each test-set web page and the similarity between the web page to be classified and each class of the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the classification result.
Further features of the invention are as follows:
Feature words are the words obtained after preprocessing a web page that can represent the content of the page.
The web pages in the training set belong to several different classes, and the web pages in each class are processed to compute the feature vector of that class. Then the term frequency of each feature word in every test-set web page and the similarity between the web page to be classified and the feature vector of each class in the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the result of classifying the web page. The training set of the data set goes through a series of calculations to construct the web page classifier, and the test set is used to test how well the classifier classifies web pages.
The specific steps are as follows:
1. Divide the data set made up of a large number of web pages into a training set and a test set; the training set usually takes about 80% of the data set and the test set takes the remainder.
2. Preprocess the data set (both training set and test set): segment each web page into individual words, remove noise information unrelated to classification from the page, and remove stop words, i.e. words with no concrete meaning or used so widely that they carry no discriminative value.
3. Combine the positional feature of each feature word and calculate the term frequency of the feature words in each training-set web page.
4. Combine the internal distribution rate and the external deviation between classes of each feature word and calculate the weight (TFIDF) of the feature words in each training-set web page.
5. From the feature word weights in each web page, compute the text feature vector of each training-set web page.
6. From the text feature vectors of the web pages in each class, compute the feature vector of each class in the training set.
7. Combine the positional feature of each feature word and calculate the term frequency of the feature words in each test-set web page.
8. Classify web pages with the vector space model: use the cosine of the angle between two feature vectors to compute the similarity between the web page to be classified and each class in the training set, and take the class with the greatest similarity as the class of the web page to be classified. A simplified sketch of this flow follows the list.
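The following is a minimal, self-contained toy in Python of the train / class-vector / cosine-assignment flow of steps 1-8. It uses plain term counts as weights and omits the position and ED/ID refinements described later; the data, names, and structures are illustrative assumptions, not part of the patent.

```python
# Highly simplified toy of the steps 1-8 pipeline: split data, build one
# feature vector per class, assign each test page to the most similar class.
import math
from collections import Counter, defaultdict

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy "web pages": (class label, feature words after preprocessing).
pages = [
    ("sports", ["match", "team", "score", "coach"]),
    ("sports", ["team", "goal", "score", "season"]),
    ("finance", ["stock", "market", "bank", "price"]),
    ("finance", ["bank", "loan", "market", "rate"]),
    ("sports", ["coach", "match", "season"]),   # held out as the "test set"
]
train, test = pages[:4], pages[4:]

# Steps 5-6: build one feature vector per class by merging page term counts.
class_vectors = defaultdict(Counter)
for label, tokens in train:
    class_vectors[label].update(tokens)

# Steps 7-8: assign each test page to the most similar class.
for label, tokens in test:
    vec = Counter(tokens)
    predicted = max(class_vectors, key=lambda c: cosine(vec, class_vectors[c]))
    print(f"true={label}  predicted={predicted}")
```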
When calculating the term frequency of a feature word, its position is taken into account. Based on practical experience and previous research, the invention considers that the title, which represents the central content of the page, has the highest weight; the description and keywords, which summarize and emphasize the page and play a key role, come second; and the body text of the page has the lowest weight.
When calculating the weight of feature word t_k, its external deviation ED_kj and internal distribution rate ID_kj are combined. The external deviation ED_kj is computed as follows:

ED_{kj} = \frac{N(t_k, C_j)}{\sum_{x=1}^{m} N(t_k, C_x)}    (1)

where N(t_k, C_j) is the number of documents in class C_j that contain feature word t_k, \sum_{x=1}^{m} N(t_k, C_x) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set.
The internal distribution rate ID_kj is computed as follows:

ID_{kj} = \frac{M(t_k, C_j)}{M(C_j)}    (2)

where M(t_k, C_j) is the total number of times feature word t_k appears in class C_j, and M(C_j) is the total number of times all words appear in class C_j.
The weight is computed as follows:

W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)\right]^2}} \times ED_{kj} \times ID_{kj}    (3)

where tf_ik(d_i) is the new term frequency obtained after correction according to the position of feature word t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the external deviation of t_k, and ID_kj is the internal distribution rate of t_k.
A large number of web pages means at least 6000 web pages.
The present invention has the following advantages:
1. In classification accuracy, compared with the traditional TFIDF algorithm and the genetic algorithm (GA), the classification method of the invention outperforms both contrast algorithms. The main reasons are: (1) when calculating the term frequency of a feature word, the influence of its position in the web page is taken into account and corrected, which effectively improves classification accuracy; (2) when calculating the weight of a feature word, its internal distribution rate and its external deviation between classes are combined, which further improves classification accuracy.
2. In classification time, because the method of the invention considers the position of feature words in the web page and their distribution within and between classes when computing term weights, it significantly reduces the execution time compared with the genetic algorithm, which has a similarly good classification effect.
3. On the whole, the recall of the invention is higher than that of the traditional TFIDF algorithm and the genetic algorithm.
Brief description of the drawings
Fig. 1 is a comparison chart of the classification precision of the web page text classification method based on feature selection of the invention and of the prior art;
Fig. 2 is a comparison chart of the classification recall of the web page text classification method based on feature selection of the invention and of the prior art.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
When computing term weights, the classification method of the invention combines the position of feature words with their distribution within and between classes, so that feature words that contribute nothing to classification are not given large weights, which ultimately improves classification accuracy.
The related definitions used in the invention are as follows:
Definition 1 (Term frequency). The term frequency (TF) is the number of times feature word t_k appears in document d_i, denoted tf_ik(d_i). With stop words and individual high-frequency words excluded, the more times t_k appears in d_i, the stronger its ability to characterize d_i.
Definition 2 (Document frequency). The document frequency (DF) is the number of documents in document set D in which feature word t_k appears, denoted N(t_k, D). The larger N(t_k, D), the weaker the representativeness of t_k for a document d_i in D.
Definition 3 (Inverse document frequency). The inverse document frequency (IDF) is a measure of how frequently feature word t_k appears in document set D, denoted IDF_k:

IDF_k = \log\left(\frac{N(D)}{N(t_k, D)}\right)

where N(D) is the total number of documents in the training set and N(t_k, D) is the number of documents in D that contain t_k. IDF_k decreases as N(t_k, D) increases: the smaller the number of documents in D containing t_k, the more representative t_k is for a document d_i in D.
Definition 4 (Normalization). To reduce the suppression of low-frequency feature words by individual high-frequency words, each component is normalized. The normalized TFIDF is computed as follows:

W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + L\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + L\right)\right]^2}}

where L is an empirical value, generally L = 0.01, tf_ik(d_i) is the number of times feature word t_k appears in document d_i, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in D that contain t_k, and n is the total number of feature words in document d_i.
Definition 5 (VSM representation of a web page). A web page d is represented as V(d) = (t_1, w_1(d); …; t_k, w_k(d); …; t_n, w_n(d)), where t_k is a feature word of the page and w_k(d) is the frequency weight with which t_k appears.
Definition 6 (External deviation). The external deviation (ED) reflects that a feature word may appear in some classes but not in others; it is a measure of between-class uncertainty, denoted ED_kj:

ED_{kj} = \frac{N(t_k, C_j)}{\sum_{x=1}^{m} N(t_k, C_x)}

where N(t_k, C_j) is the number of documents in class C_j that contain feature word t_k, \sum_{x=1}^{m} N(t_k, C_x) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set. As the formula shows, the larger ED_kj, the more t_k is concentrated in class C_j and the stronger its characterization of C_j.
Definition 7 (Internal distribution rate). The internal distribution rate (ID) is the probability of a feature word appearing across all documents of a class; it measures how evenly t_k is distributed within a specific class, denoted ID_kj:

ID_{kj} = \frac{M(t_k, C_j)}{M(C_j)}

where M(t_k, C_j) is the total number of times feature word t_k appears in class C_j and M(C_j) is the total number of times all words appear in class C_j. As the formula shows, the larger ID_kj, the more uniformly t_k is distributed within class C_j and the stronger its characterization of C_j.
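A small Python sketch of how ED_kj and ID_kj from Definitions 6 and 7 could be computed from raw training-set counts; the representation of the training set (a mapping from class name to a list of token lists) and the toy data are assumptions made for this example.

```python
from collections import Counter

def external_deviation(word: str, cls: str, train: dict) -> float:
    """ED_kj: documents of class cls containing word, divided by documents of
    all classes containing word (Definition 6)."""
    def docs_with_word(c):
        return sum(1 for doc in train[c] if word in doc)
    total = sum(docs_with_word(c) for c in train)
    return docs_with_word(cls) / total if total else 0.0

def internal_distribution(word: str, cls: str, train: dict) -> float:
    """ID_kj: occurrences of word in class cls, divided by occurrences of all
    words in that class (Definition 7)."""
    counts = Counter()
    for doc in train[cls]:
        counts.update(doc)
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Toy training set: class name -> list of tokenized documents.
train = {
    "sports":  [["match", "team", "score"], ["team", "goal"]],
    "finance": [["stock", "market"], ["bank", "market", "team"]],
}
print(external_deviation("team", "sports", train))    # 2 of the 3 docs containing "team"
print(internal_distribution("team", "sports", train)) # 2 of the 5 word occurrences in "sports"
```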
Feature words are the words obtained after preprocessing a web page that can represent the content of the page.
The specific steps are as follows:
1. Divide the large number of web pages into a training set and a test set; the training set usually takes about 80% of the total number of web pages and the test set takes the remaining part.
2. Preprocess the web pages (both training set and test set): segment each page into individual words, remove noise information unrelated to classification from the page, and remove stop words (words with no concrete meaning or used so widely that they carry no discriminative value).
3. Combine the positional feature of each feature word (all words in a page that can represent its content) and calculate the term frequency of the feature words in each training-set web page.
4. Combine the internal distribution rate and the external deviation between classes of each feature word and calculate the weight (TFIDF) of the feature words in each training-set web page.
5. From the feature word weights in each web page, compute the text feature vector of each training-set web page.
6. From the text feature vectors of the web pages in each class, compute the feature vector of each class in the training set.
7. Combine the positional feature of each feature word and calculate the term frequency of the feature words in each test-set web page.
8. Classify web pages with the vector space model: use the cosine of the angle between two feature vectors to compute the similarity between the web page to be classified and each class in the training set, and take the class with the greatest similarity as the class of the web page to be classified.
A web page differs from ordinary text: it is a semi-structured file containing a large number of links and tags, and the information in different tag fields represents the page content with different strength and thus plays different roles in classification. The invention corrects the term frequency of feature word t_k according to its position: on the basis of the original term frequency, the count is multiplied by a weight corresponding to the position, giving the new term frequency. In the experiments, the Title is treated as a direct description of the page's subject, representing the central content of the page, and is assigned weight 4; the Description is a brief introduction to the page and the Keywords represent the key words of the page content, both summarizing and emphasizing the page, so these two parts are assigned weight 2; PlainText is the ordinary body text of the page, whose effect is secondary to the former two, and is assigned weight 1.
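A minimal sketch of this position-corrected term frequency: raw counts from the title are multiplied by 4, those from the description and keywords by 2, and those from the body by 1. The field names and the dict-based page representation are assumptions made for this example.

```python
from collections import Counter

# Position weights used in the experiments: title 4, description/keywords 2, body 1.
POSITION_WEIGHTS = {"title": 4, "description": 2, "keywords": 2, "body": 1}

def position_weighted_tf(page: dict) -> Counter:
    """Return tf_ik(d_i): term counts with each occurrence multiplied by the
    weight of the tag field it appears in.  `page` maps a field name to the
    list of feature words found in that field (an assumed representation)."""
    tf = Counter()
    for field, words in page.items():
        weight = POSITION_WEIGHTS.get(field, 1)
        for word in words:
            tf[word] += weight
    return tf

page = {
    "title": ["football", "league"],
    "keywords": ["football", "score"],
    "body": ["football", "match", "score", "score"],
}
print(position_weighted_tf(page))
# football: 4 + 2 + 1 = 7,  score: 2 + 2 = 4,  league: 4,  match: 1
```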
The invention considers both the distribution over the classes of the documents that contain feature word t_k and the distribution of t_k over the documents of a given class: when computing the weight of t_k, its external deviation ED_kj and internal distribution rate ID_kj are combined. The external deviation ED_kj is computed as follows:

ED_{kj} = \frac{N(t_k, C_j)}{\sum_{x=1}^{m} N(t_k, C_x)}

where N(t_k, C_j) is the number of documents in class C_j that contain t_k, \sum_{x=1}^{m} N(t_k, C_x) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set. The internal distribution rate ID_kj is computed as follows:

ID_{kj} = \frac{M(t_k, C_j)}{M(C_j)}

where M(t_k, C_j) is the total number of times t_k appears in class C_j and M(C_j) is the total number of times all words appear in class C_j.
The weight that combines the position of the feature word with its external deviation and internal distribution rate is computed as follows:

W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)\right]^2}} \times ED_{kj} \times ID_{kj}

where tf_ik(d_i) is the term frequency corrected according to the position of t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the external deviation of t_k, and ID_kj is the internal distribution rate of t_k.
When calculating the term frequency of a feature word, its position is taken into account. Based on practical experience and previous research, the invention considers that the title, which represents the central content of the page, has the highest weight; the description and keywords, which summarize and emphasize the page and play a key role, come second; and the body text of the page has the lowest weight.
In the invention, the more often a word appears in a given text, the stronger its ability to discriminate the attributes of that text's content; the wider the range of texts in which a word appears, i.e. the more similar its occurrence counts are across all classes, the weaker its ability to distinguish text content. Since a web page is a semi-structured file containing a large number of links and tags, and the information in different tag fields represents the page content with different strength and thus plays different roles in classification, the invention defines the information that best reflects the page content as position 1 and gives it the highest weight, the information that reflects the page content fairly well as position 2 with a high weight, and the information that reflects the page content less well than the former two as position 3 with a lower weight, that is:

weight(p = 1) > weight(p = 2) > weight(p = 3)    (6)

where p is the positional feature. In the experiments, the Title is a direct description of the page's subject and represents its central content, so it is placed at position 1 and assigned weight 4; the Description is a brief introduction to the page and the Keywords represent the key words of its content, both summarizing and emphasizing the page, so these two parts are placed at position 2 and assigned weight 2; Plain Text is the ordinary body text of the page, secondary in effect to the former two, so it is placed at position 3 and assigned weight 1. The invention corrects the term frequency of feature word t_k according to its position in the page: on the basis of the original term frequency, the count is multiplied by the weight corresponding to its positional feature, giving the new term frequency w_k(d).
Second, since the distribution of feature words within and between classes is rarely considered in web page text classification algorithms, the invention combines the external deviation and the internal distribution rate of each feature word to adjust its weight once more.
Finally, the invention proposes a TFIDF feature weighting method that combines the position of the feature word with its external deviation and internal distribution rate, with the following formula:

W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)\right]^2}} \times ED_{kj} \times ID_{kj}

where tf_ik(d_i) is the new term frequency obtained after correction according to the position of feature word t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the external deviation of t_k, and ID_kj is the internal distribution rate of t_k.
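As a concrete illustration of this weighting formula, the following Python sketch computes the weight of one feature word in one document from the quantities defined above. The function name and argument layout are assumptions made for this example, and the square root in the normalizing denominator follows the standard normalized TFIDF form.

```python
import math

def feature_weight(tf_pos: dict, doc_freqs: dict, t_k: str,
                   n_docs: int, ed_kj: float, id_kj: float) -> float:
    """Weight of feature word t_k in one document, per formula (3).

    tf_pos    -- position-corrected term frequencies tf_ik(d_i) for every
                 feature word of the document
    doc_freqs -- N(t, D): number of training documents containing each word t
    n_docs    -- N(D): total number of documents in the training set
    ed_kj     -- external deviation ED_kj of t_k for the document's class
    id_kj     -- internal distribution rate ID_kj of t_k for that class
    """
    def idf(t: str) -> float:
        return math.log(n_docs / doc_freqs[t] + 0.01)

    numerator = tf_pos[t_k] * idf(t_k)
    # Normalization over all n feature words of the document (the square root
    # is assumed from the standard normalized TFIDF form).
    denominator = math.sqrt(sum((tf * idf(t)) ** 2 for t, tf in tf_pos.items()))
    return (numerator / denominator) * ed_kj * id_kj

# Tiny usage example with made-up numbers:
print(feature_weight(
    tf_pos={"football": 8.0, "market": 1.0},
    doc_freqs={"football": 120, "market": 450},
    t_k="football", n_docs=7200, ed_kj=0.8, id_kj=0.05))
```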
Generally, the weights obtained from the above formula already give good classification results. However, when several classes contain the same feature word at the same time and the computed feature weight is rather large, the accuracy of the classification results is affected to some degree. The invention therefore corrects the weights obtained from the above formula once more; the corrected weight is denoted W'_ik(d_i). The correction first computes, for each feature word, the sum of its weights over all classes, denoted sum (note: when the feature word does not appear in a class, its weight there is 0), and then divides the weight obtained from the above formula by sum, thereby reducing its influence on the classification results, i.e.

W'_{ik}(d_i) = \frac{W_{ik}(d_i)}{sum}

The weight computed according to formula (7) reduces the influence on the classification results of the same feature word appearing in different classes with an overly large weight, while not affecting the influence on classification of feature words that are exclusive to one class.
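A small sketch of this correction, assuming the per-class weights of a feature word have already been computed: the document weight is divided by the sum of the word's weights over all classes, with 0 contributed by classes in which the word does not occur. The function name and the example values are assumptions.

```python
def corrected_weight(w_ik: float, class_weights: dict) -> float:
    """W'_ik(d_i): divide the document weight by the sum of the feature word's
    weights over all classes (0 where the word is absent), reducing the impact
    of words that appear with large weight in several classes."""
    total = sum(class_weights.values())
    return w_ik / total if total else 0.0

# A word appearing strongly in two classes is down-weighted more than a word
# exclusive to one class:
print(corrected_weight(0.30, {"sports": 0.30, "entertainment": 0.25}))  # about 0.545
print(corrected_weight(0.30, {"sports": 0.30}))                          # 1.0
```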
For the classifier, the invention chooses the vector space model: the similarity between the web page to be classified and each class is computed first, and then the class with the greatest similarity is taken as the class of the page. The similarity is computed as the cosine of the angle between the two feature vectors:

Sim(d_i, C_j) = \frac{\sum_{k=1}^{n} W_{ik} \times W_{jk}}{\sqrt{\sum_{k=1}^{n} W_{ik}^2} \times \sqrt{\sum_{k=1}^{n} W_{jk}^2}}    (14)

where W_ik and W_jk are the weights of the k-th feature word in document d_i and in class C_j respectively, and n is the total number of feature words.
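The cosine similarity written out as a self-contained Python sketch; the dict-of-weights representation and the example vectors are assumptions made for this example.

```python
import math

def cosine_similarity(doc_vec: dict, class_vec: dict) -> float:
    """Sim(d_i, C_j): cosine of the angle between the two weight vectors,
    treating feature words missing from a vector as weight 0."""
    shared = doc_vec.keys() & class_vec.keys()
    dot = sum(doc_vec[t] * class_vec[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_c = math.sqrt(sum(w * w for w in class_vec.values()))
    return dot / (norm_d * norm_c) if norm_d and norm_c else 0.0

doc = {"football": 0.9, "score": 0.4}
cls = {"football": 0.8, "match": 0.5, "score": 0.3}
print(round(cosine_similarity(doc, cls), 3))
```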
Embodiment: the web page text classification method based on feature selection proposed by the invention is implemented as follows.
The web pages used in the invention come from the internet corpus SougouCS of the Sogou Laboratory. In the experiments, because some categories contain very few web pages, only 12 categories were chosen: automobile, finance, IT, health, sports, tourism, education, culture, military, real estate, entertainment, and fashion. The sorted web pages were divided into a training set and a test set, with 600 training pages and 200 test pages in each class.
There are 12 classes in this embodiment, each with 600 training pages and 200 test pages, so the total number of web pages is 12 × (600 + 200) = 9600.
The web pages are preprocessed: each page is segmented into words, noise information unrelated to classification is removed, and stop words are removed. For example, if the body text of a page is "I am a student", segmentation produces the sequence of words "I / am / a / student", and after noise information and stop words are removed the remaining result is "student".
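A minimal preprocessing sketch, assuming the jieba segmenter (a common choice for Chinese text, not specified by the patent) and the Chinese original of the example above, 我是一个学生 ("I am a student"); the stop-word list here is a tiny illustrative stand-in for a real one.

```python
import jieba  # assumed Chinese word segmenter; not mandated by the patent

# Tiny illustrative stop-word list; a real system would load a full list.
STOP_WORDS = {"我", "是", "一个", "的", "了"}

def preprocess(text: str) -> list:
    """Segment the page text into words and drop stop words."""
    return [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]

print(preprocess("我是一个学生"))  # expected: ['学生'] ("student"), as in the example above
```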
Combining the positional feature of each feature word, the term frequency of the feature words in each training-set web page is calculated: the number of times each feature word occurs in each page of the training set is counted, and the count is multiplied by 4 if the word occurs in the title, by 2 if it occurs in the description or keywords, and by 1 if it occurs in the body text.
Combining the internal distribution rate and the external deviation of each feature word, the weight (TFIDF) of the feature words in each training-set web page is calculated: the external deviation of each feature word is computed according to formula (1), its internal distribution rate according to formula (2), and finally its composite weight according to formula (3).
For each web page in the training set, the n feature words with the highest weights (n can take any value and is generally rather large; n = 100 in the invention) together with their weights form the text feature vector of the page. The text feature vectors of all pages in a class are merged and sorted by weight in descending order, and the first n feature words (again n = 100 in the invention) together with their weights form the feature vector of that class. Once the feature vectors of all classes have been obtained, training is complete.
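A sketch of building the per-page and per-class feature vectors from already-computed word weights, keeping the top n = 100 words. The input representation (one weight dict per page, grouped by class) is an assumption, and since the patent does not say how duplicate words are combined when page vectors are merged, summation is used here as an assumed choice.

```python
from collections import defaultdict

TOP_N = 100  # n = 100 in the invention

def page_vector(word_weights: dict, top_n: int = TOP_N) -> dict:
    """Keep the top_n highest-weighted feature words of one page."""
    top = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)

def class_vector(page_vectors: list, top_n: int = TOP_N) -> dict:
    """Merge the page vectors of one class and keep the top_n words overall."""
    merged = defaultdict(float)
    for vec in page_vectors:
        for word, weight in vec.items():
            merged[word] += weight  # merging by summation is an assumption
    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)

# Usage: one dict of word weights per training page, grouped by class.
sports_pages = [{"match": 0.9, "team": 0.7}, {"team": 0.8, "goal": 0.5}]
print(class_vector([page_vector(p) for p in sports_pages], top_n=3))
```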
Combining the positional feature of each feature word, the term frequency of the feature words in each test-set web page is calculated: the number of times each feature word occurs in each page of the test set is counted, and the count is multiplied by 4 if the word occurs in the title, by 2 if it occurs in the description or keywords, and by 1 if it occurs in the body text.
Web pages are then classified with the vector space model: the similarity between the web page to be classified and each class in the training set is computed according to formula (14), and the class with the greatest similarity is taken as the class of the page. After this step, web page classification according to the invention is finished; the classification results are shown in the confusion matrix of the following table.
Table 1. Classification results of the invention
As can be seen from Table 1, the number of web pages correctly classified by the invention is generally large, but there are also categories, such as health, culture, and fashion, with relatively low numbers of correct classifications. This is because these categories share too many feature words with some other categories, i.e. the boundaries between these categories are blurred. For example, 31 web pages of the fashion class were assigned to the entertainment class in the classification results.
To verify the accuracy of the invention, it was compared with the traditional TFIDF algorithm and with the genetic algorithm (GA). The classification performance is evaluated with precision and recall, computed per class as

precision = \frac{\text{number of pages correctly assigned to the class}}{\text{number of pages assigned to the class}}, \qquad recall = \frac{\text{number of pages correctly assigned to the class}}{\text{number of pages that actually belong to the class}}
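A small sketch of this per-class precision and recall computed from (true class, predicted class) pairs; the function name and the toy result list are assumptions made for this example.

```python
def precision_recall(results: list, cls: str) -> tuple:
    """Per-class precision and recall from (true, predicted) pairs."""
    assigned = sum(1 for true, pred in results if pred == cls)         # pages put into cls
    actual = sum(1 for true, pred in results if true == cls)           # pages really in cls
    correct = sum(1 for true, pred in results if true == pred == cls)  # correct assignments
    precision = correct / assigned if assigned else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall

results = [("sports", "sports"), ("sports", "entertainment"),
           ("fashion", "entertainment"), ("fashion", "fashion")]
print(precision_recall(results, "fashion"))  # (1.0, 0.5)
```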
The precision comparison is shown in Fig. 1 and the recall comparison in Fig. 2. As Figs. 1 and 2 show, the classification effect of the invention is better than that of the traditional TFIDF algorithm and the genetic algorithm, and for most classes both precision and recall are improved. This indicates that the distribution of feature words within and between classes has a definite influence on the weight calculation, so considering these two factors effectively improves classification precision and recall. It also shows that taking the position of feature words within the web page into account when computing weights significantly improves the accuracy of web page classification.

Claims (5)

1. A web page text classification method based on feature selection, characterized in that: first, a data set made up of a large number of web pages is divided into a training set and a test set; then, labels are assigned different weights according to how well the information in each web page tag field represents the page content, and the weight of each feature word in every training-set web page is calculated, the weight being the product of the normalized term frequency and the inverse document frequency; on the basis of these weights, the internal distribution rate and the external deviation between classes are combined to compute the feature vector of each web page in the training set, and then the feature vector of each class in the training set; finally, the term frequencies of the feature words in each test-set web page and the similarity between the web page to be classified and each class in the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the classification result;
the web pages in the training set belong to several different classes, and the web pages in each class are processed to compute the feature vector of that class; then the term frequency of the feature words in each test-set web page and the similarity between the web page to be classified and the feature vector of each class in the training set are calculated, and the class with the greatest similarity is taken as the class of the web page to be classified, giving the result of classifying the web page; the training set of the data set goes through a series of calculations to construct the web page classifier, and the test set is used to test how well the classifier classifies web pages;
when calculating the weight of feature word t_k, the external deviation ED_kj and the internal distribution rate ID_kj of t_k are combined, wherein the external deviation ED_kj is computed as follows:
ED_{kj} = \frac{N(t_k, C_j)}{\sum_{x=1}^{m} N(t_k, C_x)}    (1)
in the formula, N(t_k, C_j) is the number of documents in class C_j that contain feature word t_k, \sum_{x=1}^{m} N(t_k, C_x) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set;
the internal distribution rate ID_kj is computed as follows:
ID_{kj} = \frac{M(t_k, C_j)}{M(C_j)}    (2)
in the formula, M(t_k, C_j) is the total number of times feature word t_k appears in class C_j, and M(C_j) is the total number of times all words appear in class C_j;
the weight is computed as follows:
W_{ik}(d_i) = \frac{tf_{ik}(d_i) \times \log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)}{\sqrt{\sum_{k=1}^{n} \left(tf_{ik}(d_i)\right)^2 \times \left[\log\left(\frac{N(D)}{N(t_k, D)} + 0.01\right)\right]^2}} \times ED_{kj} \times ID_{kj}    (3)
wherein tf_ik(d_i) is the new term frequency obtained after correction according to the position of feature word t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the external deviation of t_k, and ID_kj is the internal distribution rate of t_k.
2. The web page text classification method based on feature selection as claimed in claim 1, characterized in that the feature words are the words obtained after preprocessing a web page that can represent the content of the page.
3. The web page text classification method based on feature selection as claimed in any one of claims 1-2, characterized in that the specific steps are as follows:
1) dividing the data set made up of a large number of web pages into a training set and a test set, the training set usually taking about 80% of the data set and the test set taking the remainder;
2) preprocessing the data set: segmenting each web page into individual words, removing noise information unrelated to classification from the page, and removing stop words, i.e. words with no concrete meaning or used so widely that they carry no discriminative value;
3) combining the positional feature of each feature word and calculating the term frequency of the feature words in each training-set web page;
4) combining the internal distribution rate and the external deviation between classes of each feature word and calculating the weight of the feature words in each training-set web page;
5) from the feature word weights in each web page, computing the text feature vector of each training-set web page;
6) from the text feature vectors of the web pages in each class, computing the feature vector of each class in the training set;
7) combining the positional feature of each feature word and calculating the term frequency of the feature words in each test-set web page;
8) classifying web pages with the vector space model: using the cosine of the angle between two feature vectors to compute the similarity between the web page to be classified and each class in the training set, and taking the class with the greatest similarity as the class of the web page to be classified.
4. The web page text classification method based on feature selection as claimed in claim 1, characterized in that the title, which represents the central content of the web page, has the highest weight; the description and keywords, which summarize and emphasize the page and play a key role, come second; and the body text of the page has the lowest weight.
5. The web page text classification method based on feature selection as claimed in claim 1, characterized in that a large number of web pages means at least 6000 web pages.
CN201410038614.9A 2014-01-27 2014-01-27 Web page text classification method based on feature selection Expired - Fee Related CN103810264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410038614.9A CN103810264B (en) 2014-01-27 2014-01-27 Web page text classification method based on feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410038614.9A CN103810264B (en) 2014-01-27 2014-01-27 Web page text classification method based on feature selection

Publications (2)

Publication Number Publication Date
CN103810264A CN103810264A (en) 2014-05-21
CN103810264B true CN103810264B (en) 2017-06-06

Family

ID=50707034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410038614.9A Expired - Fee Related CN103810264B (en) 2014-01-27 2014-01-27 Web page text classification method based on feature selection

Country Status (1)

Country Link
CN (1) CN103810264B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word
CN104123659A (en) * 2014-07-30 2014-10-29 杭州野工科技有限公司 Commodity networked gene based brand intellectual property protection platform
CN104239436B (en) * 2014-08-27 2018-01-02 南京邮电大学 It is a kind of that method is found based on the network hotspot event of text classification and cluster analysis
CN106294392B (en) * 2015-05-20 2019-12-06 阿里巴巴集团控股有限公司 Webpage display method and device
CN104866573B (en) * 2015-05-22 2018-02-13 齐鲁工业大学 A kind of method of text classification
CN105205090A (en) * 2015-05-29 2015-12-30 湖南大学 Web page text classification algorithm research based on web page link analysis and support vector machine
CN105303296B (en) * 2015-09-29 2019-04-23 国网浙江省电力公司电力科学研究院 A kind of power equipment life-cycle method for evaluating state
CN105488029A (en) * 2015-11-30 2016-04-13 西安闻泰电子科技有限公司 KNN based evidence taking method for instant communication tool of intelligent mobile phone
CN107544980B (en) * 2016-06-24 2020-07-24 北京国双科技有限公司 Method and device for searching webpage
CN108614825B (en) * 2016-12-12 2022-04-15 中移(杭州)信息技术有限公司 Webpage feature extraction method and device
CN108268457A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of file classification method and device based on SVM
CN108268458B (en) * 2016-12-30 2020-12-08 广东精点数据科技股份有限公司 KNN algorithm-based semi-structured data classification method and device
CN108694325B (en) * 2017-04-10 2020-12-29 北大方正集团有限公司 Method and device for identifying specified type of website
CN107577708A (en) * 2017-07-31 2018-01-12 北京北信源软件股份有限公司 Class base construction method and system based on SparkMLlib document classifications
CN109858006B (en) * 2017-11-30 2021-04-09 亿度慧达教育科技(北京)有限公司 Subject identification training method and device
CN108764671B (en) * 2018-05-16 2022-04-15 山东师范大学 Creativity evaluation method and device based on self-built corpus
CN109101477B (en) * 2018-06-04 2023-01-31 东南大学 Enterprise field classification and enterprise keyword screening method
CN109472293A (en) * 2018-10-12 2019-03-15 国家电网有限公司 A kind of grid equipment file data error correction method based on machine learning
CN109299275A (en) * 2018-11-09 2019-02-01 长春理工大学 A kind of file classification method eliminated based on parallelization noise
CN110929028A (en) * 2019-11-01 2020-03-27 深圳前海微众银行股份有限公司 Log classification method and device
CN111368552B (en) * 2020-02-26 2023-09-26 北京市公安局 Specific-field-oriented network user group division method and device
CN111382273B (en) * 2020-03-09 2023-04-14 广州智赢万世市场管理有限公司 Text classification method based on feature selection of attraction factors

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
* Research on Automatic Chinese Text Classification Based on the VSM Model and Feature Selection Algorithms; 朱坤红; China Masters' Theses Full-text Database (Electronic Journal); 2012-04-30; main text pp. 22-28 *
* Research on Web Page Text Classification Technology Based on Support Vector Machines; 黄乐; China Masters' Theses Full-text Database (Electronic Journal); 2012-10-31; main text pp. 15-35 *

Also Published As

Publication number Publication date
CN103810264A (en) 2014-05-21

Similar Documents

Publication Publication Date Title
CN103810264B (en) Web page text classification method based on feature selection
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
Cohen et al. End to end long short term memory networks for non-factoid question answering
CN105205090A (en) Web page text classification algorithm research based on web page link analysis and support vector machine
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN106202294B (en) Related news computing method and device based on keyword and topic model fusion
CN103365997B (en) A kind of opining mining method based on integrated study
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN104199833B (en) The clustering method and clustering apparatus of a kind of network search words
US20080208840A1 (en) Diverse Topic Phrase Extraction
CN106445919A (en) Sentiment classifying method and device
CN106599054A (en) Method and system for title classification and push
CN105917364B (en) Ranking discussion topics in question-and-answer forums
WO2021184674A1 (en) Text keyword extraction method, electronic device, and computer readable storage medium
CN107169086B (en) Text classification method
CN110516074B (en) Website theme classification method and device based on deep learning
CN106021572A (en) Binary feature dictionary construction method and device
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN106997379A (en) A kind of merging method of the close text based on picture text click volume
Li et al. Text classification method based on convolution neural network
CN106649264B (en) A kind of Chinese fruit variety information extraction method and device based on chapter information
CN107908649B (en) Text classification control method
Ma et al. A microblog recommendation algorithm based on multi-tag correlation
CN103324942B (en) A kind of image classification method, Apparatus and system
Gao et al. Text categorization based on improved Rocchio algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170606

Termination date: 20210127

CF01 Termination of patent right due to non-payment of annual fee