CN103810264B - Web page text classification method based on feature selection - Google Patents
Web page text classification method based on feature selection
- Publication number: CN103810264B (application CN201410038614.9A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- class
- training set
- web page
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
A web page text classification method based on feature selection. First, a data set consisting of a large number of web pages is divided into a training set and a test set. Then, labels are assigned different weights according to the ability of the information in each web page tag field to represent the page content, and the weight of each feature word in every training-set page is calculated (the product of the normalized term frequency and the inverse document frequency). On the basis of the resulting weights, the within-class distribution rate and the between-class deviation are combined to calculate the feature vector of each training-set page, and then the feature vector of each class in the training set. Finally, the term frequencies of the feature words in each test-set page are calculated, together with the similarity between the page to be classified and each class in the training set; the class with the highest similarity is taken as the class of the page to be classified, yielding the classification result.
Description
Technical field
The invention belongs to the technical field of data mining methods and relates to a web page text classification method based on feature selection.
Background technology
With the rapid development of computers and communication technology and the rapid popularization and application of the internet, the number of web pages on the network is growing geometrically. Faced with this explosively growing mass of network information, how to obtain useful and interesting information from it quickly and efficiently is becoming more and more important. Effectively organizing and managing web page resources and shortening the time users need to obtain information has therefore become a problem demanding urgent solution. Web page classification technology arose in response and, after text classification, is gradually becoming a research hotspot in the machine learning field.
Traditionally, web pages were classified by human judgment: after analyzing the content of a page, a person manually selected a suitable category. This manual approach has many shortcomings, however. First, as the number of web page texts grows drastically, manual classification becomes unrealistic and would require enormous human resources. Second, manual classification of web page texts cannot guarantee high accuracy: mainly because everyone's background knowledge and other subjective factors differ, the classification results may be inconsistent. An effective method of managing web page texts is therefore urgently needed, and this is where automatic web page text classification begins to show its superiority.
Automatic web page text classification derives from automatic text classification, and its goal is the same as that of text classification: under a predefined web page classification scheme, to assign each page to be classified accurately to one or more corresponding categories. Common web page text classification algorithms include the KNN algorithm, the Naive Bayes (NB) algorithm, support vector machines (SVM), genetic algorithms (GA), the Rocchio algorithm, and others. These automatic classification techniques still suffer from many problems: the dimensionality of the web page text feature space is too high, requiring a large amount of storage and slowing classification; pages contain noise information such as site logos and advertisements, which severely interferes with determining a page's class and reduces classification accuracy; and information at different positions in a page represents the page with different strength, which also affects classification accuracy. There is thus an urgent need for an effective web page text classification method that reduces classification time and improves classification accuracy.
Summary of the invention
The object of the invention is to provide a web page text classification method based on feature selection that solves the problems of slow classification and low accuracy in the prior art.
The technical scheme of the invention is a web page text classification method based on feature selection. First, a data set consisting of a large number of web pages is divided into a training set and a test set. Then, labels are assigned different weights according to the ability of the information in each web page tag field to represent the page content, and the weight of each feature word in every training-set page is calculated (the product of the normalized term frequency and the inverse document frequency). On the basis of the resulting weights, the within-class distribution rate and between-class deviation are combined to calculate the feature vector of each training-set page, and then the feature vector of each class in the training set. Finally, the term frequencies of the feature words in each test-set page are calculated, together with the similarity between the page to be classified and each class in the training set; the class with the highest similarity is taken as the class of the page to be classified, yielding the classification result.
Further features of the invention are as follows:
Feature words are the words, obtained after preprocessing a web page, that can represent the page's content.
The training set contains several different classes of web pages, and the pages in each class are processed to calculate the feature vector of each class. Then the term frequencies of the feature words in each test-set page are calculated, along with the similarity between the page to be classified and the feature vector of each class in the training set; the class with the highest similarity is taken as the class of the page to be classified, which gives the result of classifying the page. The training set in the data set, through a series of calculations, constructs the web page classifier, and the test set is used to test how well the classifier classifies web pages.
The specific steps are as follows:
1. Divide the data set, consisting of a large number of web pages, into a training set and a test set; the training set typically takes about 80% of the data set and the test set takes the remainder.
2. Preprocess the data set (both the training set and the test set): segment the text of each page into individual words, remove noise information irrelevant to classification, and remove stop words, i.e. words with no concrete meaning or words used so widely that they carry no discriminative power.
3. Combining the position feature of the feature words, calculate the term frequency of the feature words in each training-set page.
4. Combining the within-class distribution rate and between-class deviation of the feature words, calculate the weight (TFIDF) of the feature words in each training-set page.
5. From the feature-word weights in each page, calculate the text feature vector of each training-set page.
6. From the text feature vectors of the pages in each class, calculate the feature vector of each class in the training set.
7. Combining the position feature of the feature words, calculate the term frequency of the feature words in each test-set page.
8. Classify with the vector space model: using the cosine of the angle between two feature vectors, calculate the similarity between the page to be classified and each class in the training set, and take the class with the highest similarity as the class of the page to be classified.
When calculating the term frequency of a feature word, the influence of its position is considered. Based on practical experience and the research results of predecessors, the invention holds that the title, which expresses the central idea of the page, carries the highest weight; the description and keywords, which summarize and emphasize the page and play a key role, come second; and the body text of the page carries the lowest weight.
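The position-corrected term frequency described above can be sketched as follows (a minimal illustration; the field names, tokenized inputs, and helper names are assumptions, while the weights of 4 for the title, 2 for description/keywords, and 1 for the body follow the description):

```python
POSITION_WEIGHTS = {"title": 4, "description": 2, "keywords": 2, "body": 1}

def position_weighted_tf(page_fields):
    """page_fields maps a tag field to its list of tokens; returns the
    position-corrected term frequency tf'_ik for every word."""
    tf = {}
    for field, tokens in page_fields.items():
        w = POSITION_WEIGHTS.get(field, 1)
        for token in tokens:
            tf[token] = tf.get(token, 0) + w  # each occurrence counts w times
    return tf

page = {
    "title": ["classification"],
    "keywords": ["webpage", "classification"],
    "body": ["classification", "method", "webpage"],
}
print(position_weighted_tf(page))
# "classification" ends up with 4 (title) + 2 (keywords) + 1 (body) = 7
```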
When calculating the weight of feature word t_k, its between-class deviation ED_kj and within-class distribution rate ID_kj are combined. The between-class deviation ED_kj is calculated as:

ED_kj = N(t_k, C_j) / Σ_{j=1}^{m} N(t_k, C_j)

where N(t_k, C_j) is the number of documents in class C_j containing feature word t_k, Σ_{j=1}^{m} N(t_k, C_j) is the number of documents across all classes containing t_k, and m is the number of classes in the training set.
The within-class distribution rate ID_kj is calculated as:

ID_kj = M(t_k, C_j) / M(C_j)

where M(t_k, C_j) is the total number of occurrences of feature word t_k in class C_j, and M(C_j) is the total number of occurrences of all words in class C_j.
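The two quantities can be computed directly from the counts defined above; a small sketch (function and variable names are illustrative):

```python
def external_deviation(doc_counts, j):
    """ED_kj = N(t_k, C_j) / sum over all m classes of N(t_k, C_j),
    where doc_counts[j] is the number of documents of class j that
    contain the feature word t_k."""
    total = sum(doc_counts)
    return doc_counts[j] / total if total else 0.0

def internal_distribution(occurrences_in_class, total_words_in_class):
    """ID_kj = M(t_k, C_j) / M(C_j)."""
    return occurrences_in_class / total_words_in_class

# t_k appears in 8 documents of class 0 and 2 documents of class 1:
print(external_deviation([8, 2], 0))   # 0.8, so t_k is concentrated in class 0
print(internal_distribution(30, 600))  # 0.05
```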
The weight is calculated as:

W_ik(d_i) = tf_ik(d_i) · log(N(D)/N(t_k, D) + L) · ED_kj · ID_kj / sqrt( Σ_{k=1}^{n} [tf_ik(d_i) · log(N(D)/N(t_k, D) + L)]² )

where tf_ik(d_i) is the new term frequency of feature word t_k after correction for its position in the page, L is an empirical constant (L = 0.01), N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D containing t_k, n is the total number of feature words in document d_i, ED_kj is the between-class deviation of t_k, and ID_kj is its within-class distribution rate.
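Ignoring the normalizing denominator for brevity, the core of the weight can be sketched as the product of the position-corrected frequency, the smoothed inverse document frequency, and the two distribution factors (the grouping of terms is an assumption, since the printed formula is an image in the original patent; L = 0.01 follows the later definition):

```python
import math

def combined_weight(tf_ik, n_docs, n_docs_with_term, ed_kj, id_kj, L=0.01):
    """Unnormalized composite weight: position-corrected term frequency
    times the smoothed IDF, scaled by ED_kj and ID_kj."""
    idf = math.log(n_docs / n_docs_with_term + L)
    return tf_ik * idf * ed_kj * id_kj

# e.g. corrected frequency 7, 7200 training pages, 120 containing the word:
w = combined_weight(tf_ik=7, n_docs=7200, n_docs_with_term=120,
                    ed_kj=0.8, id_kj=0.05)
print(round(w, 4))
```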
A large number of web pages means at least 6000 pages.
The invention has the following advantages:
1. In classification accuracy, compared with the traditional TFIDF algorithm and the genetic algorithm (GA), the classification method of the invention outperforms both contrast algorithms. The main reasons are: (1) when calculating the term frequency of a feature word, the influence of the word's position in the page on its frequency is taken into account and the frequency is corrected accordingly, which effectively improves classification accuracy; (2) when calculating the weight of a feature word, the within-class distribution rate and between-class deviation are combined, further improving classification accuracy.
2. In classification time, because the method of the invention takes into account the position of feature words within a page and the distribution of feature words within and between classes when calculating weights, it substantially shortens the execution time compared with the genetic algorithm, which has similarly good classification quality.
3. On the whole, the recall of the invention is higher than that of the traditional TFIDF algorithm and the genetic algorithm.
Brief description of the drawings
Fig. 1 is a chart comparing the classification accuracy of the web page text classification method based on feature selection of the invention with the prior art;
Fig. 2 is a chart comparing the classification recall of the web page text classification method based on feature selection of the invention with the prior art.
Specific embodiment
The invention is described in detail below with reference to the accompanying drawings and specific embodiments.
When calculating feature-word weights, the classification method of the invention combines the position of each feature word with the distribution of feature words within and between classes, which prevents feature words that contribute nothing to classification from being given large weights and ultimately improves classification accuracy.
The definitions used in the invention are as follows:
Definition 1 (Term frequency). The term frequency (TF, Term Frequency) is the number of times feature word t_k appears in document d_i, denoted tf_ik(d_i). With stop words and individual high-frequency words excluded, the more often t_k appears in d_i, the stronger its ability to characterize d_i.
Definition 2 (Document frequency). The document frequency (DF, Document Frequency) is the number of documents in document set D in which feature word t_k appears, denoted N(t_k, D). The larger the number of documents N(t_k, D) in which t_k appears, the weaker t_k's ability to represent a document d_i in D.
Definition 3 (Inverse document frequency). The inverse document frequency (IDF, Inverse Document Frequency) measures how frequently feature word t_k occurs in document set D, denoted IDF_k:

IDF_k = log( N(D) / N(t_k, D) )

where N(D) is the total number of documents in the training set and N(t_k, D) is the number of documents in D containing t_k. IDF_k decreases as N(t_k, D) increases: the smaller the number of documents in D containing t_k, the more representative t_k is of a document d_i in D.
Definition 4 (Normalization). To reduce the suppression of low-frequency feature words by individual high-frequency words, each component is normalized. The normalized TFIDF is calculated as:

w_ik(d_i) = tf_ik(d_i) · log(N(D)/N(t_k, D) + L) / sqrt( Σ_{k=1}^{n} [tf_ik(d_i) · log(N(D)/N(t_k, D) + L)]² )

where L is an empirical constant, usually taken as L = 0.01; tf_ik(d_i) is the number of times feature word t_k appears in document d_i; N(D) is the total number of documents in the training set; N(t_k, D) is the number of documents in document set D containing t_k; and n is the total number of feature words in d_i.
Definition 5 (VSM representation of a web page). A web page d is represented as V(d) = (t_1, w_1(d); …; t_k, w_k(d); …; t_n, w_n(d)), where t_k is a feature word of the page and w_k(d) represents the frequency with which t_k appears.
Definition 6 (Between-class deviation). The between-class deviation (ED, external deviation) expresses the fact that a feature word may appear in some classes and not in others; it is a measure of between-class uncertainty, denoted ED_kj:

ED_kj = N(t_k, C_j) / Σ_{j=1}^{m} N(t_k, C_j)

where N(t_k, C_j) is the number of documents in class C_j containing feature word t_k, Σ_{j=1}^{m} N(t_k, C_j) is the number of documents across all classes containing t_k, and m is the number of classes in the training set. As the formula shows, the larger ED_kj, the more feature word t_k is concentrated in class C_j and the stronger its ability to characterize C_j.
Definition 7 (Within-class distribution rate). The within-class distribution rate (ID, internal distribution) is the probability of the feature word appearing over all documents of a class; it measures how evenly feature word t_k is distributed within a specific class, denoted ID_kj:

ID_kj = M(t_k, C_j) / M(C_j)

where M(t_k, C_j) is the total number of occurrences of t_k in class C_j and M(C_j) is the total number of occurrences of all words in C_j. As the formula shows, the larger ID_kj, the more evenly t_k is distributed in class C_j and the stronger its ability to characterize C_j.
Feature words are the words, obtained after preprocessing a web page, that can represent the page's content.
The specific steps are as follows:
1. Divide the large collection of web pages into a training set and a test set; the training set typically takes about 80% of the total number of pages and the test set takes the remaining part.
2. Preprocess the pages (both the training set and the test set): segment the text of each page into individual words, remove noise information irrelevant to classification, and remove stop words (words with no concrete meaning or words used so widely that they carry no discriminative power).
3. Combining the position feature of the feature words (all words in a page that can represent its content), calculate the term frequency of the feature words in each training-set page.
4. Combining the within-class distribution rate and between-class deviation of the feature words, calculate the weight (TFIDF) of the feature words in each training-set page.
5. From the feature-word weights in each page, calculate the text feature vector of each training-set page.
6. From the text feature vectors of the pages in each class, calculate the feature vector of each class in the training set.
7. Combining the position feature of the feature words, calculate the term frequency of the feature words in each test-set page.
8. Classify with the vector space model: using the cosine of the angle between two feature vectors, calculate the similarity between the page to be classified and each class in the training set, and take the class with the highest similarity as the class of the page to be classified.
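The eight steps above can be exercised end to end on a toy example (a sketch under stated assumptions, not the patent's implementation: the normalizing denominator of the weight formula is omitted, the smoothing constant L = 0.01 and all helper names are assumptions, and the position weights follow the description):

```python
import math
from collections import Counter, defaultdict

POS_W = {"title": 4, "description": 2, "keywords": 2, "body": 1}
STOP = {"the", "a", "is"}

def page_tf(fields):
    """Step 2 + 3: tokenize each tag field, drop stop words, and count
    occurrences with the position weight applied."""
    tf = Counter()
    for field, text in fields.items():
        for tok in text.lower().split():
            if tok not in STOP:
                tf[tok] += POS_W.get(field, 1)
    return tf

def train(classes, n=3, L=0.01):
    """Steps 4-6: weight every feature word by tf * smoothed IDF * ED * ID,
    then keep the n heaviest words of each class as its feature vector."""
    pages = [(c, page_tf(p)) for c, ps in classes.items() for p in ps]
    N = len(pages)
    df = Counter(t for _, tf in pages for t in tf)  # N(t_k, D)
    ndocs = defaultdict(Counter)                    # N(t_k, C_j)
    occ = defaultdict(Counter)                      # M(t_k, C_j)
    for c, tf in pages:
        for t, k in tf.items():
            ndocs[t][c] += 1
            occ[t][c] += k
    class_vec = {}
    for c in classes:
        Mc = sum(sum(tf.values()) for cc, tf in pages if cc == c)  # M(C_j)
        merged = Counter()
        for cc, tf in pages:
            if cc != c:
                continue
            for t, k in tf.items():
                ed = ndocs[t][c] / sum(ndocs[t].values())
                iD = occ[t][c] / Mc
                merged[t] += k * math.log(N / df[t] + L) * ed * iD
        class_vec[c] = dict(merged.most_common(n))
    return class_vec

def classify(fields, class_vec):
    """Steps 7-8: cosine similarity against each class vector."""
    tf = page_tf(fields)
    def cos(u, v):
        dot = sum(u.get(t, 0) * w for t, w in v.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return max(class_vec, key=lambda c: cos(tf, class_vec[c]))

classes = {
    "automobile": [{"title": "engine", "body": "engine price road"},
                   {"title": "dealer", "body": "engine dealer road"}],
    "finance":    [{"title": "stock", "body": "stock price market"},
                   {"title": "market", "body": "stock market fund"}],
}
vecs = train(classes)
print(classify({"title": "engine", "body": "road dealer"}, vecs))
```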
A web page differs from ordinary text: it is a semi-structured file containing a large number of links and tags, and the information in different tag fields represents the page content with different strength, so its role in classification also differs. The invention corrects the term frequency of feature word t_k according to the position where it occurs: on the basis of the original frequency, it multiplies by a weight corresponding to the position, obtaining the new frequency. In the experiments, the Title is taken to be a direct description of the page's subject, expressing its central idea, and is given weight 4; the Description is a brief introduction to the page and the Keywords represent the key words of the page content, and since these two parts summarize and emphasize the page and play a key role they are given weight 2; the PlainText is the ordinary body text of the page, its effect is less than the former two, and it is given weight 1.
The invention considers both the distribution over the classes of the documents containing feature word t_k and the distribution of t_k over the documents within a given class: when calculating the weight of t_k, its between-class deviation ED_kj and within-class distribution rate ID_kj are combined. The between-class deviation ED_kj is calculated as:

ED_kj = N(t_k, C_j) / Σ_{j=1}^{m} N(t_k, C_j)     (1)

where N(t_k, C_j) is the number of documents in class C_j containing feature word t_k, Σ_{j=1}^{m} N(t_k, C_j) is the number of documents across all classes containing t_k, and m is the number of classes in the training set. The within-class distribution rate ID_kj is calculated as:

ID_kj = M(t_k, C_j) / M(C_j)     (2)

where M(t_k, C_j) is the total number of occurrences of t_k in class C_j and M(C_j) is the total number of occurrences of all words in class C_j.
Combining the position of the feature word with its between-class deviation and within-class distribution rate, the weight is calculated as:

W_ik(d_i) = tf_ik(d_i) · log(N(D)/N(t_k, D) + L) · ED_kj · ID_kj / sqrt( Σ_{k=1}^{n} [tf_ik(d_i) · log(N(D)/N(t_k, D) + L)]² )     (3)

where tf_ik(d_i) is the new term frequency after correction for the position of t_k in the page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D containing t_k, n is the total number of feature words in document d_i, ED_kj is the between-class deviation of t_k, and ID_kj is its within-class distribution rate.
When calculating the term frequency of a feature word, the influence of its position is considered. Based on practical experience and the research results of predecessors, the invention holds that the title, which expresses the central idea of the page, carries the highest weight; the description and keywords, which summarize and emphasize the page and play a key role, come second; and the body text of the page carries the lowest weight.
In the invention, if a word occurs frequently in a text, its ability to discriminate the content of that text is stronger; if the word occurs across a wide range of texts, i.e. with a comparable number of occurrences in every class, its ability to discriminate text content is lower. Considering that a web page is a semi-structured file containing a large number of links and tags, and that the information in different tag fields represents the page content with different strength and therefore plays a different role in classification, the invention defines the information that best reflects the page content as position 1 and gives it the highest weight; the information that reflects the page content fairly well as position 2, with a high weight; and the information that reflects the page content less than the former two as position 3, with a lower weight. That is:

weight(p = 1) > weight(p = 2) > weight(p = 3)     (6)

where p is the position feature. In the specific experiments, the invention takes the Title, a direct description of the page's subject that expresses its central idea, as position 1, with weight 4; the Description, a brief introduction to the page, and the Keywords, the key words of the page content (both summarize and emphasize the page and play a key role), as position 2, with weight 2; and the PlainText, the ordinary body text of the page, whose effect on the page is less than the former two, as position 3, with weight 1. The invention corrects the term frequency of feature word t_k according to its position in the page: on the basis of the original frequency, it multiplies by the weight corresponding to the position feature, obtaining the new term frequency w_k(d).
Second, considering that web page text classification algorithms seldom take into account the distribution of feature words within and between classes, the invention further adjusts the feature-word weights by combining the between-class deviation and within-class distribution rate of the feature words.
Finally, the invention proposes a TFIDF feature weighting method that combines the position, between-class deviation, and within-class distribution rate of the feature words; the formula is as follows:

W_ik(d_i) = tf_ik(d_i) · log(N(D)/N(t_k, D) + L) · ED_kj · ID_kj / sqrt( Σ_{k=1}^{n} [tf_ik(d_i) · log(N(D)/N(t_k, D) + L)]² )

where tf_ik(d_i) is the new term frequency after correction for the position of t_k in the page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D containing t_k, n is the total number of feature words in document d_i, ED_kj is the between-class deviation of t_k, and ID_kj is its within-class distribution rate.
In general the weight obtained from the above formula already yields good classification results; but when several classes contain the same feature word at the same time and its computed weight is relatively large, the accuracy of the classification result suffers to some extent. The invention therefore corrects the weights obtained above once more, denoting the corrected weight W'_ik(d_i). The correction first computes the sum, denoted sum, of the feature word's weight over every class (note: when the feature word does not occur in a class, its weight there is 0), and then divides the weight obtained from the above formula by sum, which reduces its influence on the classification result. That is:

W'_ik(d_i) = W_ik(d_i) / sum     (7)

The weight calculated by formula (7) reduces the influence on the classification result of a feature word that appears in several classes with too large a weight, while leaving unaffected the influence of feature words exclusive to one class.
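A sketch of this cross-class correction (names are illustrative; a word absent from a class contributes 0 to the sum, as noted above):

```python
def corrected_weight(w_ik, per_class_weights):
    """W'_ik = W_ik / sum, where per_class_weights holds the same feature
    word's weight in every class (0 where the word is absent)."""
    total = sum(per_class_weights)
    return w_ik / total if total else 0.0

# A feature word carrying weight in three classes has its influence on any
# single class scaled down:
print(round(corrected_weight(1.2, [1.2, 0.9, 0.9]), 4))  # 0.4
```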
For the classifier the invention selects the vector space model: the similarity between the page to be classified and each class is calculated first, and the class with the highest similarity is then taken as the class of the page. The similarity is expressed by the cosine of the angle between the two feature vectors:

Sim(d_i, C_j) = Σ_{k=1}^{n} W_ik · W_jk / ( sqrt(Σ_{k=1}^{n} W_ik²) · sqrt(Σ_{k=1}^{n} W_jk²) )     (14)

where W_ik and W_jk are the weights of the k-th feature word of document d_i and of class C_j respectively, and n is the total number of feature words.
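The cosine similarity and the final class choice can be sketched as follows (vectors are dictionaries over feature words; the words and weights are made up for illustration):

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two sparse feature-word vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

page = {"engine": 0.8, "price": 0.6}
classes = {"automobile": {"engine": 0.9, "dealer": 0.4},
           "finance": {"price": 0.5, "stock": 0.9}}
# the class with the highest similarity wins:
best = max(classes, key=lambda c: cosine_sim(page, classes[c]))
print(best)  # automobile
```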
Embodiment: a specific implementation of the web page text classification method based on feature selection proposed by the invention is as follows.
The web pages used in the invention come from the SougouCS internet corpus of the Sogou laboratory. Because some categories in the corpus contain very few pages, only 12 categories were chosen for the experiments: automobile, finance, IT, health, sports, tourism, education, culture, military, real estate, entertainment, and fashion. The collected pages were divided into a training set and a test set, with 600 training pages and 200 test pages in each class. With 12 classes in this embodiment, the total number of pages is 12 × (600 + 200) = 9600.
The pages are preprocessed: each page is segmented into words, noise information irrelevant to classification is removed, and stop words are removed. For example, if the body text of a page is "I am a student", segmentation yields the phrase sequence "I / am / a / student", and after removing noise information and stop words the result is "student".
Combining the position feature of the feature words, the term frequency of the feature words in each training-set page is calculated: the number of occurrences of each feature word in the page is counted, and if the feature word occurs in the title, the count is multiplied by 4; if it occurs in the description or the keywords, the count is multiplied by 2; and if it occurs in the body text, the count is multiplied by 1.
Combining the within-class distribution rate and between-class deviation of the feature words, the weight (TFIDF) of the feature words in each training-set page is calculated: the between-class deviation of each feature word is calculated by formula (1), its within-class distribution rate by formula (2), and its composite weight by formula (3).
For each training-set page, the n feature words with the highest weights, together with their weights, form the page's text feature vector (n may take any value and is generally fairly large; in the invention n = 100). The text feature vectors of all pages in a class are then merged and sorted by weight in descending order, and the top n feature words (again n = 100 in the invention) and their weights form the feature vector of that class. Once the feature vectors of all classes have been obtained, training is complete.
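The construction of page and class feature vectors from the top-n weighted words can be sketched as follows (n = 2 here so the example stays small; how the weights of a word shared by several pages combine when merging is not spelled out above, so simple accumulation is an assumption):

```python
def top_n(weights, n):
    """Keep the n highest-weighted feature words."""
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n])

def class_vector(page_vectors, n):
    """Merge the per-page vectors of one class, then keep the top n again."""
    merged = {}
    for vec in page_vectors:
        for word, w in top_n(vec, n).items():
            merged[word] = merged.get(word, 0.0) + w  # accumulate across pages
    return top_n(merged, n)

pages = [{"engine": 0.9, "price": 0.2, "road": 0.5},
         {"engine": 0.7, "dealer": 0.6, "road": 0.1}]
print(class_vector(pages, 2))  # engine and dealer survive the final cut
```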
Combining the position feature of the feature words, the term frequency of the feature words in each test-set page is calculated: the number of occurrences of each feature word in the page is counted, and if the feature word occurs in the title, the count is multiplied by 4; if it occurs in the description or the keywords, the count is multiplied by 2; and if it occurs in the body text, the count is multiplied by 1.
Classification is performed with the vector space model: the similarity between the page to be classified and each class of the training set is calculated by formula (14), and the class with the highest similarity is taken as the class of the page. After this step, web page classification according to the invention is complete; the classification results are shown in the confusion matrix of the following table:
Table 1. Classification results of the invention
Table 1 shows that the number of correctly classified pages is generally high, although categories such as health, culture, and fashion have relatively low counts of correct classifications. This is because these categories share too many feature words with certain other categories, i.e. the boundaries between these classes are blurred. For the fashion class, for example, 31 pages were assigned to the entertainment class in the classification results.
To verify the accuracy of the invention, it was compared with the traditional TFIDF algorithm and the genetic algorithm (GA). The classification performance is evaluated with precision and recall, calculated for each class as:

precision = number of pages correctly assigned to the class / total number of pages assigned to the class
recall = number of pages correctly assigned to the class / total number of pages actually belonging to the class

The precision comparison is shown in Fig. 1 and the recall comparison in Fig. 2. As Figs. 1 and 2 show, the classification quality of the invention is better than that of the traditional TFIDF algorithm and the genetic algorithm, and for most classes both precision and recall improve. This indicates that the distribution of feature words within and between classes has a definite influence on the weight calculation, so considering these two factors effectively improves the precision and recall of classification. It also indicates that considering the position of feature words in a page when calculating weights can significantly improve classification accuracy.
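The per-class precision and recall used in this evaluation can be computed as follows (the formulas are the standard ones; the counts in the example are illustrative, not taken from Table 1):

```python
def precision(correct, assigned):
    """Correct assignments / all pages assigned to the class."""
    return correct / assigned if assigned else 0.0

def recall(correct, actual):
    """Correct assignments / all pages actually in the class."""
    return correct / actual if actual else 0.0

# e.g. 169 of the 200 test pages of one class classified correctly, with
# 185 pages in total assigned to that class:
print(round(precision(169, 185), 3))  # 0.914
print(recall(169, 200))               # 0.845
```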
Claims (5)
1. A feature-selection-based web page text classification method, characterized in that: first, a data set consisting of a large number of web pages is divided into two parts, a training set and a test set; then, each tag is assigned a different weight according to the ability of the information in that web page tag field to represent the page content, and the weight of each feature word in each web page of the training set is calculated, the weight being the product of the normalized term frequency and the inverse document frequency; on the basis of the resulting weights, the intra-class distribution rate and the inter-class deviation are combined to calculate the feature vector of each web page in the training set, and then the feature vector of each class in the training set is calculated; finally, the term frequency of the feature words in each web page of the test set is calculated, together with the similarity between the web page to be classified and each class in the training set, and the class with the greatest similarity is taken as the class of the web page to be classified, yielding the classification result;
the web pages in the training set fall into several different classes, and the feature vector of each class is calculated from the web pages in that class; then the term frequency of the feature words in each web page of the test set is calculated, together with the similarity between the web page to be classified and the feature vector of each class in the training set, and the class with the greatest similarity is taken as the class of the web page to be classified, yielding the classification result; a series of calculations on the training set of the data set constructs the web page classifier, and the test set is used to test how well the classifier classifies web pages;
when calculating the weight of feature word t_k, its inter-class deviation ED_kj and intra-class distribution rate ID_kj are combined, where the inter-class deviation ED_kj is computed as follows:
ED_kj = N(t_k, C_j) / Σ_{i=1}^{m} N(t_k, C_i)
where N(t_k, C_j) is the number of documents in class C_j that contain feature word t_k, Σ_{i=1}^{m} N(t_k, C_i) is the number of documents in all classes that contain t_k, and m is the number of classes in the training set;
the intra-class distribution rate ID_kj is computed as follows:
ID_kj = M(t_k, C_j) / M(C_j)
where M(t_k, C_j) is the total number of times feature word t_k occurs in class C_j, and M(C_j) is the total number of occurrences of all words in class C_j;
the weight is computed as follows:
w_ik(d_i) = ( tf_ik(d_i) × log(N(D) / N(t_k, D)) / sqrt( Σ_{k=1}^{n} [ tf_ik(d_i) × log(N(D) / N(t_k, D)) ]² ) ) × ED_kj × ID_kj
where tf_ik(d_i) is the new term frequency after adjustment according to the position of feature word t_k in the web page, N(D) is the total number of documents in the training set, N(t_k, D) is the number of documents in document set D that contain t_k, n is the total number of feature words in document d_i, ED_kj is the inter-class deviation of t_k, and ID_kj is the intra-class distribution rate of t_k.
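A minimal sketch of the weight computation in claim 1, under stated assumptions: ED_kj is taken as the share of documents containing t_k that lie in class C_j, ID_kj as M(t_k, C_j) / M(C_j), and the per-document normalization over the n feature words is omitted for brevity. The function names are illustrative, not from the patent:

```python
import math

def inter_class_deviation(docs_with_term_per_class, j):
    """ED_kj: fraction of documents containing t_k that fall in class C_j.
    docs_with_term_per_class[i] = N(t_k, C_i), one entry per class."""
    total = sum(docs_with_term_per_class)
    return docs_with_term_per_class[j] / total if total else 0.0

def intra_class_rate(m_tk_cj, m_cj):
    """ID_kj = M(t_k, C_j) / M(C_j): occurrences of t_k in C_j
    over total word occurrences in C_j."""
    return m_tk_cj / m_cj if m_cj else 0.0

def term_weight(tf_pos, n_docs_total, n_docs_with_term, ed, idr):
    """Position-adjusted TF times inverse document frequency,
    scaled by ED_kj and ID_kj (normalization omitted)."""
    return tf_pos * math.log(n_docs_total / n_docs_with_term) * ed * idr

# Feature word appears in 8 of the documents of class 0, 1 each in two others.
ed = inter_class_deviation([8, 1, 1], 0)   # 0.8
idr = intra_class_rate(5, 50)              # 0.1
w = term_weight(2.0, 1000, 100, ed, idr)
```

A word concentrated in one class (high ED_kj) that also occurs often inside that class (high ID_kj) thus gets its TF-IDF weight boosted relative to evenly spread words, which is the stated aim of combining the two factors.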
2. The feature-selection-based web page text classification method according to claim 1, characterized in that the feature words are the words, obtained after preprocessing a web page, that are able to represent the page's content.
3. The feature-selection-based web page text classification method according to any one of claims 1-2, characterized in that the specific steps are as follows:
1) the data set consisting of a large number of web pages is divided into two parts, a training set and a test set; the training set typically takes about 80% of the data set, and the test set takes the remainder;
2) the data set is preprocessed: each web page is segmented, splitting its text into individual words; noise information irrelevant to classification is removed from the pages; and stop words, i.e., words with no substantive meaning or words used too broadly, are removed;
3) combining the position features of the feature words, the term frequency of the feature words in each web page of the training set is calculated;
4) combining the intra-class distribution rate and the inter-class deviation of the feature words, the weight of the feature words in each web page of the training set is calculated;
5) according to the weight of the feature words in each web page, the text feature vector of each web page in the training set is calculated;
6) according to the text feature vectors of the web pages in each class, the feature vector of each class in the training set is calculated;
7) combining the position features of the feature words, the term frequency of the feature words in each web page of the test set is calculated;
8) web page classification is performed using a vector space model: the cosine-of-the-angle formula between two feature vectors is used to calculate the similarity between the web page to be classified and each class in the training set, and the class with the greatest similarity is taken as the class of the web page to be classified.
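Steps 6) and 8) above can be sketched as follows. This is a minimal illustration with hypothetical class names and term weights, assuming the feature vector of a class is the average of its pages' term-weight vectors (the claim does not fix the aggregation):

```python
import math
from collections import Counter

def class_centroid(page_vectors):
    """Step 6: feature vector of a class as the average of its pages' vectors."""
    total = Counter()
    for vec in page_vectors:
        total.update(vec)          # Counter sums per-term weights
    return {term: w / len(page_vectors) for term, w in total.items()}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(page_vector, class_vectors):
    """Step 8: assign the page to the class with the most similar centroid."""
    return max(class_vectors,
               key=lambda c: cosine_similarity(page_vector, class_vectors[c]))

class_vectors = {
    "sport": class_centroid([{"match": 1.0, "team": 0.8},
                             {"team": 1.0, "goal": 0.5}]),
    "finance": class_centroid([{"stock": 1.0, "market": 0.9}]),
}
label = classify({"team": 0.7, "goal": 0.2}, class_vectors)  # "sport"
```

Comparing against one centroid per class, rather than every training page, keeps the number of similarity computations per test page equal to the number of classes.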
4. The feature-selection-based web page text classification method according to claim 1, characterized in that the title, which represents the central content of the web page, has the highest weight; the brief description and keywords, which summarize and emphasize the page and play a key role, have the second-highest weight; and the body text of the web page has the lowest weight.
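A sketch of how such tag-field weights might enter the position-adjusted term frequency tf_ik(d_i). The multiplier values are hypothetical; claim 4 fixes only their ordering (title > description/keywords > body):

```python
# Hypothetical multipliers; only the ordering is specified by the claim.
TAG_WEIGHTS = {"title": 3.0, "meta": 2.0, "body": 1.0}

def position_adjusted_tf(counts_by_field):
    """tf_ik(d_i): raw occurrence counts of a feature word, scaled by the
    weight of the tag field each occurrence appeared in, then summed."""
    return sum(TAG_WEIGHTS.get(field, 1.0) * count
               for field, count in counts_by_field.items())

# One occurrence in the title, two in the meta fields, five in the body.
tf = position_adjusted_tf({"title": 1, "meta": 2, "body": 5})  # 3 + 4 + 5 = 12.0
```

Under this scheme a single title occurrence counts as much as three body occurrences, which is how the position of a feature word influences the weight calculation described earlier.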
5. The feature-selection-based web page text classification method according to claim 1, characterized in that the large number of web pages is at least 6000.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410038614.9A CN103810264B (en) | 2014-01-27 | 2014-01-27 | The web page text sorting technique of feature based selection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103810264A CN103810264A (en) | 2014-05-21 |
CN103810264B true CN103810264B (en) | 2017-06-06 |
Family
ID=50707034
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410038614.9A Expired - Fee Related CN103810264B (en) | 2014-01-27 | 2014-01-27 | The web page text sorting technique of feature based selection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103810264B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609450A (en) * | 2009-04-10 | 2009-12-23 | 南京邮电大学 | Web page classification method based on training set |
Non-Patent Citations (2)
Title |
---|
"Research on Automatic Chinese Text Classification Based on the VSM Model and a Feature Selection Algorithm"; Zhu Kunhong; China Master's Theses Full-text Database (electronic journal); 2012-04-30; main text pp. 22-28 * |
"Research on Web Page Text Classification Technology Based on Support Vector Machines"; Huang Le; China Master's Theses Full-text Database (electronic journal); 2012-10-31; main text pp. 15-35 * |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170606; Termination date: 20210127 |