CN101609450A

CN101609450A - Web page classification method based on training set

Info

Publication number: CN101609450A
Application number: CNA2009100307095A
Authority: CN
Inventors: 王攀; 张顺颐; 汤琛; 于伟涛
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2009-04-10
Filing date: 2009-04-10
Publication date: 2009-12-23

Abstract

An automatic web page classification method based on a training set. The classification process combines feature selection, feature weight determination, and text vector comparison. Automatic classification under a classification hierarchy takes category models built in advance as the training set and assigns each document to be classified to the corresponding category. With the development of multimedia technology, web content has become rich and varied: it includes not only text but also a great deal of structural information and other forms such as sound, graphics, and images. Because text-based pages still account for the larger share, classification based on page text remains dominant. The method has reliable theoretical support, good extensibility and accuracy, and is easy to connect to operator-related application interfaces.

Description

Web page classification method based on training set

Technical field

The present invention concerns automatic content classification for arbitrary Chinese web pages. It mainly studies how to build a training set and use vector comparison to classify unknown pages accurately, and it designs an automatic web page classification model and algorithm. It involves technical fields such as document feature extraction and feature weight calculation.

Background technology

With the rapid development and spread of Internet technology, the amount of information on the Web is growing quickly, and people have entered an information-rich era. Faced with so much Web information, people often feel at a loss, and how to find the resources they need effectively has become a common concern. Keyword search engines (such as Baidu and Google), the network information retrieval tools users rely on most, suffer from shortcomings such as low precision and heavy information redundancy. Because automatic classification technology for Chinese web pages is immature, most directory search engines, such as Yahoo, classify manually. Although precision improves, manual classification has drawbacks such as poor timeliness, inconsistent results, and small database scale; relying on it alone is costly and impractical. Automatic classification of Chinese web pages has therefore become an important technology for organizing massive amounts of online information quickly and effectively.

Automatically classifying unknown Chinese web pages is difficult for the following reasons:

First, Chinese web pages are written in Chinese. Unlike English, which has natural spaces between words, Chinese text must be segmented into words, and segmentation quality significantly affects classification quality.

Second, web page formats vary. Multiple formats coexist, and even pages of the same format follow several standards. Because writing style and content also vary greatly, parsing pages of different formats and styles is the difficult part of page preprocessing.

Third, the classification scheme is fuzzy. Knowledge on the Internet develops very rapidly and new knowledge structures keep emerging; if the training corpus is not updated in time, pages cannot be classified or classification accuracy drops sharply.

Fourth, web page denoising. Pages contain a large amount of noise unrelated to the page topic, and improving the performance of denoising algorithms is a problem that requires study.

Fifth, web page structure information. Web pages carry rich structural information; besides plain text, other content also contributes to classification. The Head and Title tags mark the page title and paragraph subheadings, and the name and content attribute values of the meta tag describe the page topic. Content pointed to by hyperlinks in a page may be related to the page topic or may be noise, and distinguishing and extracting it is the difficulty.

The design and implementation of an automatic classification system for Chinese web pages therefore involve many problems and considerable difficulty, which motivates this work.

Summary of the invention

Technical problem: the object of the invention is to establish a web page classification method based on a training set, that is, a method that compares a web page of unknown category with the training set to obtain its category. The invention designs the vector representation model of a web page together with the feature extraction algorithm and the vector distance comparison algorithm. Once pages are classified, users' web browsing behavior can be analyzed in greater depth.

Technical solution: the web page classification method based on a training set of the present invention comprises three parts: web page content processing, web page vector representation, and web page vector comparison.

Web page content processing part:

A1.) automatically obtain the source code of the web page from its URL,

A2.) use regular expressions to filter out noise such as pictures and hyperlinks from the page content and extract the effective text,

A3.) perform word segmentation on the filtered page text,

A4.) filter the segmented text, removing entries such as function words and auxiliary words and keeping the keywords that summarize the text content;

Web page vector representation part:

This part is divided into two processes: dimension reduction of the vector's feature words and determination of the feature values of the feature words.

Feature word dimension reduction:

B1.) pool all segmented words in the training set: after the preceding step, the training set is delivered as segmented texts stored separately by category, and all texts are merged by batch processing as needed; this yields the keyword entries of all categories,

B2.) screen entries by length: restrict all entries to lengths between 2 and 5; entries outside this range are considered to contribute little to classification or even to interfere with it, and are rejected,

B3.) enforce entry uniqueness: every entry in the total vocabulary text is limited to a single occurrence, to speed up computation and reduce counting errors,

B4.) calculate the frequency with which each entry occurs in each category, then sum all frequencies,

B5.) calculate the four relation frequencies for each entry and category pair, then obtain the dimension-reduction weight of each pair by the χ² calculation,

B6.) sort the weights in descending order and take the first 1000 entries as feature items, which completes the determination of the feature items;

Feature value determination for feature words:

B7.) obtain the feature items,

B8.) dynamically create a data table according to the number of feature items,

B9.) count, within the training set, the number of documents that contain each feature item,

B10.) count the total number of texts, the total number of categories, and the number of texts in each category,

B11.) calculate the frequency of each feature item in each text and handle the result in matrix form,

B12.) for each text, calculate the feature values of the feature items, completing the vector representation of the text,

B13.) the vector representation algorithm ends;

Web page vector comparison part:

C1.) obtain the feature vector of the test text X,

C2.) take one text feature vector Ti from the training set,

C3.) calculate the similarity sim(X, Ti) of the two feature vectors,

C4.) judge whether all vectors in the training set have been processed; if so, go to C5), otherwise return to step C2),

C5.) quicksort the calculated similarities and take the K texts with the highest similarity,

C6.) add up the similarities of these K texts by category,

C7.) take the maximum similarity total Si and its corresponding category Ci,

C8.) mark the text as likely belonging to category Ci,

C9.) the classification algorithm ends.

Beneficial effects: the web page classification method based on a training set compares a web page of unknown category with the training set to obtain its category, and provides the vector representation model of a web page together with the feature extraction algorithm and the vector distance comparison algorithm. Once pages are classified, users' web browsing behavior can be analyzed in greater depth.

The user is the direct consumer of the network and also the final judge of network service quality. Traditional network services, while bringing users great convenience, still have shortcomings, such as being unable to actively provide users with the information they need. User behavior analysis can summarize the information hidden beneath user behavior, such as the user's hobbies, field, and access frequency. By learning from user behavior, network services can target specific users and return the essential information a user needs preferentially or proactively.

The web page classification method based on a training set can classify and analyze a user's browsing records, revealing which kinds of information the user often follows online, what kind of work the user does, and on which websites the user likes to spend money. This is important both for improving network service quality and for improving network management.

Traditional web page classification relies on manual processing. Its accuracy is well assured, but its inefficiency becomes apparent when the number of pages is very large, and it cannot achieve real-time results. The web page classification method based on a training set automates the classification process, and the methods it adopts give a certain assurance of accuracy.

Description of drawings

Fig. 1 is a functional diagram of web page classification, showing each processing stage of classification.

Fig. 2 is a flow chart of the method for determining the feature items of a web page vector, showing its detailed processing steps.

Fig. 3 is a flow chart of the method for determining the feature values of the feature items of a web page vector, showing its detailed processing steps.

Fig. 4 is a flow chart of the vector comparison method, showing its detailed processing steps.

Embodiment

The present invention proposes an effective technical framework for automatic web page classification and designs the classification algorithm in detail, as shown in Fig. 1. As the figure shows, the system is divided into three parts: web page content processing, web page vector representation, and web page vector comparison.

Two terms need explanation here. The training set is a collection of the source code of a large number of web pages of known category; the source code is stored as text, filed separately by category, and every text is eventually converted into a corresponding vector. Feature extraction is the process of determining each element of a web page vector: an element is a keyword entry that reflects the page content, and its value is the computed weight of that entry's importance to classification. Every web page has its own vector representation.

The key methods of the invention lie in the vector representation part and the vector comparison part. The vector representation part mainly comprises two methods: the method for determining the feature items of a web page vector and the method for determining the feature values of the feature items. The main method of the vector comparison part compares the vector of the page under test with the vectors of the training set.

Method for determining the feature items of a web page vector: the basic criterion for feature selection is how strongly a feature affects the classification result, measured with a statistic. The result of feature selection must also preserve the character of the original feature space while reducing its dimension to a workable range. On this basis we chose the χ² statistical method, assuming that the association between Chinese keywords in web pages and categories follows a χ² distribution. The higher the statistic, the less independent the keyword is of the category and the stronger their correlation, that is, the more the keyword contributes to that category. All keywords of the processed training set are pooled into one text, and four relation frequencies are calculated for each keyword and category: 1. n₁₁, the frequency with which keyword i occurs in category j; 2. n₁₂, the frequency with which keyword i occurs in categories other than j; 3. n₂₁, the frequency with which all entries other than keyword i occur in category j; 4. n₂₂, the frequency with which all entries other than keyword i occur in categories other than j. The statistic is then calculated by the formula:

\chi^{2} = \frac{n \times (n_{11} \times n_{22} - n_{12} \times n_{21})^{2}}{(n_{11} + n_{12}) \times (n_{21} + n_{22}) \times (n_{11} + n_{21}) \times (n_{12} + n_{22})}

Here n is the sum of the frequencies of all keywords. The χ² value is calculated for every pair of keyword i and category j, and the 1000 words with the largest values are taken as features, which completes the determination of the vector's features.
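As a minimal sketch of this calculation, the function below evaluates the statistic from the four relation frequencies using the standard 2×2 contingency form, whose denominator is the product of the four marginal sums. The names `chi_square` and `top_features` are illustrative and not from the source. The text does not say how the per-category values of one keyword are combined into a single ranking, so `top_features` simply assumes it is handed one score per entry, and returning 0.0 for an empty marginal is likewise an assumed convention.

```python
def chi_square(n11: int, n12: int, n21: int, n22: int) -> float:
    """Chi-square statistic for one (keyword, category) pair.

    n11: occurrences of the keyword inside the category
    n12: occurrences of the keyword in all other categories
    n21: occurrences of all other entries inside the category
    n22: occurrences of all other entries in all other categories
    """
    n = n11 + n12 + n21 + n22
    denominator = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    if denominator == 0:
        # Assumed convention: an empty marginal gives no evidence of association.
        return 0.0
    return n * (n11 * n22 - n12 * n21) ** 2 / denominator


def top_features(scores: dict[str, float], limit: int = 1000) -> list[str]:
    """Return the `limit` entries with the largest score, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [entry for entry, _ in ranked[:limit]]
```

A keyword that occurs only inside one category scores high, while a keyword spread evenly over all categories scores 0, which matches the reading that a higher value means a stronger tie between keyword and category.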

Method for determining the feature values of the feature items: once the feature items are selected, each is assigned a weight that describes the content of the document and the importance of the feature in the text. Because a web page is a special kind of document, with its own structure and with category information attached to its features, we calculate a more precise statistic on the basis of the TF*IDF weighting method to describe the importance of a feature item to the page content. The method combines TF_ij, the frequency with which feature item t_j occurs in text d_i, with the inverse document frequency of that feature item, and calculates the feature weight by the formula:

w_{ij} = \frac{\sqrt{TF_{ij} \times \log (\frac{N}{n_{j}} + 0.01)}}{\sqrt{\sum_{j = 1}^{n} (TF_{ij} \times \log (\frac{N}{n_{j}} + 0.01))^{2}}}

Here N is the total number of texts in the training set and n_j is the number of training-set texts in which feature item t_j appears.
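The weight formula can be sketched as follows for a single text. This follows the formula as printed, including the square root over the numerator term; the function name `feature_weights`, the use of the natural logarithm, and the reading of N as the total number of training texts are assumptions, since the text does not state them.

```python
import math


def feature_weights(tf_row: list[float], doc_freq: list[int], total_docs: int) -> list[float]:
    """Weights of every feature in one text.

    tf_row[j]   : frequency of feature t_j in this text (TF_ij)
    doc_freq[j] : number of training texts containing t_j (n_j), 1 <= n_j <= N
    total_docs  : number of texts in the training set (N, assumed meaning)
    """
    # TF * IDF term for each feature; the log base is an assumption.
    raw = [tf * math.log(total_docs / n_j + 0.01) for tf, n_j in zip(tf_row, doc_freq)]
    norm = math.sqrt(sum(value * value for value in raw))
    if norm == 0.0:
        return [0.0] * len(raw)  # the text contains none of the selected features
    # The printed formula also takes a square root of the numerator term.
    return [math.sqrt(value) / norm for value in raw]
```

The 0.01 offset keeps the logarithm positive even for a feature that appears in every text, so such a feature still receives a small non-zero weight.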

Vector comparison method: using the K-nearest-neighbor algorithm, the vector under test is compared with every text in the training set and their similarities are calculated to find the K most similar training texts. Each text category is then scored on this basis: a category's score is the sum of the similarities between the test text and those of the K training texts that belong to the category. The categories are sorted by score, and the one with the largest score is taken as the comparison result. The formula is:

y (\vec{x}, c_{j}) = \sum_{\vec{d}_{i} \in kNN} sim (\vec{x}, \vec{d}_{i}) \, y (\vec{d}_{i}, c_{j}) - b_{j}

where \vec{x} is the vector of the web page under test, \vec{d}_{i} is a web page vector in the training set, c_{j} is an element of the category set,

sim (\vec{x}, \vec{d}_{i}) = \frac{\vec{x} \cdot \vec{d}_{i}}{| \vec{x} | | \vec{d}_{i} |},

and y (\vec{d}_{i}, c_{j}) takes a value in {0, 1} (1 when \vec{d}_{i} belongs to c_{j}, 0 otherwise).
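A small worked example may make the scoring rule concrete. The numbers are made up for illustration: K = 3, x stands for the page under test, d_1 to d_3 are its three nearest training texts, c_1 and c_2 are two hypothetical categories, and the offset b_j is taken as 0.

```latex
sim(\vec{x}, \vec{d}_{1}) = 0.9,\ \vec{d}_{1} \in c_{1}; \quad
sim(\vec{x}, \vec{d}_{2}) = 0.8,\ \vec{d}_{2} \in c_{2}; \quad
sim(\vec{x}, \vec{d}_{3}) = 0.7,\ \vec{d}_{3} \in c_{2}

y(\vec{x}, c_{1}) = 0.9 \times 1 + 0.8 \times 0 + 0.7 \times 0 = 0.9

y(\vec{x}, c_{2}) = 0.9 \times 0 + 0.8 \times 1 + 0.7 \times 1 = 1.5
```

Under these assumed values the page is assigned to c_2 even though the single most similar text belongs to c_1, because the score sums similarity over all neighbors of a category rather than taking only the closest one.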

The implementation of each functional part of this design is described in detail below.

1. Web page content processing module

Function: this part first obtains the source code of the specified web page, uses regular expressions to extract the Chinese portion of the source code, then performs word segmentation on the extracted Chinese text and stores the result as text.

Interface: this part supplies the segmented web page text to the next functional module.
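The filtering stages of this module could look roughly like the sketch below, which uses only Python's standard `re` module. The regular expressions, the function names, and the stop-word set are illustrative assumptions; the source does not give its actual patterns. Fetching the page and word segmentation are left out: a segmenter (for example the third-party jieba package) would sit between the two functions.

```python
import re

# Blocks whose content is never page text.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Hyperlinks together with their anchor text, treated here as noise.
_LINKS = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)
# Any remaining tag, including <img>.
_TAGS = re.compile(r"<[^>]+>")
# Runs of Chinese characters (basic CJK Unified Ideographs block only).
_CHINESE = re.compile(r"[\u4e00-\u9fff]+")


def extract_chinese_text(html: str) -> list[str]:
    """Return the runs of Chinese characters left after noise removal."""
    for pattern in (_SCRIPT_STYLE, _LINKS, _TAGS):
        html = pattern.sub(" ", html)
    return _CHINESE.findall(html)


def drop_stop_words(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Remove function words and particles from an already segmented text."""
    return [token for token in tokens if token not in stop_words]
```

Regular expressions are a pragmatic choice for this kind of coarse filtering, but they do not handle malformed or deeply nested markup reliably, so an HTML parser may be preferable where that matters.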

2. Web page vector representation module

Function: this part first derives the representation of the web page vector from the training set. Each submitted web page text is then converted into a vector by calculation and stored in the database.

Interface: this part supplies the web page vector comparison module with the data to be compared. Each record in the database represents one vector: rows stand for different web page texts, columns stand for the features of the vector, and each stored value is the calculated weight of a feature word in that web page text.

This layer mainly comprises two methods: the method for determining the feature items of a web page vector and the method for determining the feature values of the feature items. The elements of the vector, that is, the keywords in the page content, are determined first; a weight is then calculated and assigned to each feature word according to its importance to the page content, which completes the vector representation of the web page.

◆ Method for determining the feature items of a web page vector. The processing steps are shown in Fig. 2.

(1) Pool all segmented words in the training set. After the preceding step, the training set is delivered as segmented texts stored separately by category, and all texts are merged by batch processing as needed. This yields the keyword entries of all categories. Using so many keywords as feature items, however, makes the computation too heavy, and a larger number of feature items does not make the result more accurate. The feature items therefore undergo dimension reduction, which cuts the feature words down to a manageable number to speed up computation.

(2) Screen entries by length. Restrict all entries to lengths between 2 and 5; entries outside this range are considered to contribute little to classification or even to interfere with it, and are rejected.

(3) Enforce entry uniqueness. Because the volume of text is huge, the same entry is likely to occur many times, but each entry needs to enter the calculation only once. Every entry in the total vocabulary text must therefore be limited to a single occurrence, to speed up computation and reduce counting errors.

(4) Calculate the frequency with which each entry occurs in each category, then sum all frequencies.

(5) Calculate the four relation frequencies for each entry and category pair, then obtain the weight of each pair by the χ² calculation.

(6) Sort the weights in descending order and take the first 1000 entries as feature items, which completes the determination of the feature items.

(7) The feature item selection algorithm ends.
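Steps (1) to (5) can be sketched as a single pass that produces the four relation frequencies for every entry and category pair. The function name `contingency_counts` and the input layout (a dict mapping each category name to its segmented texts) are assumptions for illustration, and length is measured in characters.

```python
from collections import Counter


def contingency_counts(docs_by_category: dict[str, list[list[str]]]):
    """Yield (entry, category, n11, n12, n21, n22) for every entry/category pair."""
    # Steps (1) and (2): pool all entries, keeping only lengths 2 to 5.
    per_category = {
        category: Counter(
            token for doc in docs for token in doc if 2 <= len(token) <= 5
        )
        for category, docs in docs_by_category.items()
    }
    # Step (3): each distinct entry appears once in the vocabulary.
    vocabulary = sorted(set().union(*per_category.values()))
    # Step (4): frequency per category, then the totals.
    category_total = {c: sum(counts.values()) for c, counts in per_category.items()}
    entry_total = Counter()
    for counts in per_category.values():
        entry_total.update(counts)
    grand_total = sum(category_total.values())
    # Step (5): the four relation frequencies for each pair.
    for entry in vocabulary:
        for category, counts in per_category.items():
            n11 = counts[entry]
            n12 = entry_total[entry] - n11
            n21 = category_total[category] - n11
            n22 = grand_total - n11 - n12 - n21
            yield entry, category, n11, n12, n21, n22
```

Each yielded tuple carries exactly the inputs the χ² calculation needs, so ranking in step (6) only has to score the tuples and sort.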

◆ Method for determining the feature values of the feature items. The method flow is shown in Fig. 3.

(1) Obtain the feature items.

(2) Dynamically create a data table according to the number of feature items.

(3) Count, within the training set, the number of documents that contain each feature item.

(4) Count the total number of texts, the total number of categories, and the number of texts in each category.

(5) Calculate the frequency of each feature item in each text and handle the result in matrix form.

(6) For each text, calculate the feature values of the feature items, completing the vector representation of the text.

(7) The vector representation algorithm ends.
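Steps (3) and (5) amount to building a text-by-feature frequency matrix and a per-feature document count. The sketch below uses plain nested lists instead of the database table the text describes; the function name `term_frequency_matrix` and that storage choice are assumptions.

```python
def term_frequency_matrix(docs: list[list[str]], features: list[str]):
    """Return (tf, doc_freq) for segmented texts and the selected features.

    tf[i][j]    : occurrences of features[j] in docs[i]
    doc_freq[j] : number of texts that contain features[j]
    """
    column = {feature: j for j, feature in enumerate(features)}
    tf = [[0] * len(features) for _ in docs]
    for i, doc in enumerate(docs):
        for token in doc:
            j = column.get(token)
            if j is not None:  # tokens outside the feature list are ignored
                tf[i][j] += 1
    doc_freq = [sum(1 for row in tf if row[j] > 0) for j in range(len(features))]
    return tf, doc_freq
```

Each row of `tf` corresponds to one text and each column to one feature, mirroring the row and column layout described for the database records.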

3. Web page vector comparison module

Function: this part compares the vector of the page under test, as produced by the previous module, with all vectors in the training set, and applies the algorithm below to the vector under test and every web page vector in the training set. The category of the most similar vectors in the training set is taken as the category of the document under test.

Interface: the final classification result is stored in the database.

This part is the core of the classification system and contains the vector comparison method.

◆ Vector comparison method. The method flow is shown in Fig. 4.

(1) Obtain the feature vector of the test text X.

(2) Take one text feature vector Ti from the training set.

(3) Calculate the similarity sim(X, Ti) of the two feature vectors.

(4) Judge whether all vectors have been processed; if so, go to (5), otherwise return to step (2).

(5) Quicksort the calculated similarities and take the K texts with the highest similarity.

(6) Add up the similarities of these K texts by category.

(7) Take the maximum similarity total Si and its corresponding category Ci.

(8) Mark the text as likely belonging to category Ci.

(9) The classification algorithm ends.
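The nine steps can be condensed into the sketch below. It is an illustration under stated assumptions, not the patented implementation: the names `cosine` and `classify` are invented here, Python's built-in `sorted` stands in for the quicksort the text mentions, no per-category offset is subtracted from the scores, and ties between categories are broken arbitrarily.

```python
import math
from collections import defaultdict


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(x: list[float], training: list[tuple[list[float], str]], k: int) -> str:
    """Return the category with the largest summed similarity among the k nearest texts."""
    # Steps (2) to (4): similarity against every training vector.
    scored = [(cosine(x, vector), category) for vector, category in training]
    # Step (5): keep the k most similar texts.
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Step (6): add up the similarities per category.
    totals: dict[str, float] = defaultdict(float)
    for similarity, category in nearest:
        totals[category] += similarity
    # Steps (7) and (8): the category with the largest total wins.
    return max(totals, key=totals.get)
```

Comparing against every training vector costs time proportional to the size of the training set for each page, which is the usual trade-off of nearest-neighbor classification; `classify` expects a non-empty training list and `k` of at least 1.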

4. Application of the automatic web page classification system

Automatic web page classification has broad significance and practical value. Its main applications are:

◆ research on automatic classification and clustering of Chinese web pages;

◆ research on the characteristics of Chinese Web pages;

◆ research on information retrieval technology;

◆ laying the groundwork for specialized topical search engines;

◆ analysis of how Internet information is obtained and used.

This method is applied in the design of the reverse engine part of the automatic web page classification system we developed. The system uses a B/S architecture supported by forward and reverse engines to look up related URLs by category and to determine the category of a given URL. In sampling tests against the global URL ranking and the Chinese URL ranking of the Alexa ranking site, the coverage of the system reached 50% and 97% respectively. Classification accuracy still needs to be improved, the training set needs to be defined more rigorously, and the category scheme remains to be extended to three or four levels, so that coverage becomes more complete and extensive.

The operating environment is simple to set up: the system runs under Windows with the .NET 2.0 framework and Oracle9i or later installed and an Internet connection. The system is easy to use; with its simple B/S architecture, users can perform searches by following the prompts. The system can also keep the URL database current, updating it on a schedule according to the user's own requirements.

Claims

1. A web page classification method based on a training set, characterized in that the method comprises three parts: web page content processing, web page vector representation, and web page vector comparison:

Web page content processing part:

A3.) perform word segmentation on the filtered page text,

Web page vector representation part:

This part is divided into two processes: dimension reduction of the vector's feature words and determination of the feature values of the feature words. Feature word dimension reduction:

B6.) sort the weights in descending order and take the first 1000 entries as feature items, which completes the determination of the feature items; feature value determination for feature words:

B7.) obtain the feature items,

B13.) the vector representation algorithm ends;

Web page vector comparison part:

C1.) obtain the feature vector of the test text X,

C2.) take one text feature vector Ti from the training set,

C3.) calculate the similarity sim(X, Ti) of the two feature vectors,

C4.) judge whether all vectors in the training set have been processed; if so, go to C5), otherwise return to step C2),

C6.) add up the similarities of these K texts by category,

C7.) take the maximum similarity total Si and its corresponding category Ci,

C8.) mark the text as likely belonging to category Ci,

C9.) the classification algorithm ends.