CN104881458A

CN104881458A - Labeling method and device for web page topics

Info

Publication number: CN104881458A
Application number: CN201510266108.XA
Authority: CN
Inventors: 李扬曦; 杜翠兰; 李睿; 佟玲玲; 翟羽佳; 王晶; 刘洋; 秦韬; 付戈
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2015-05-22
Filing date: 2015-05-22
Publication date: 2015-09-02
Anticipated expiration: 2035-05-22
Also published as: CN104881458B

Abstract

The invention discloses a labeling method and device for web page topics. The method includes the steps that based on titles and main bodies of web pages, topic feature vectors of the web pages are acquired; classification processing is performed on the topic feature vectors through a classifier which is obtained through training in advance; whether types which the topic feature vectors belong to exist is judged; if yes, the web pages are labeled as the types which the topic feature vectors belong to; otherwise, the web pages are labeled as web pages to be labeled; furthermore, clustering processing is performed on the multiple web pages to be labeled; the type of each cluster is obtained through analysis; the web pages to be labeled are labeled as the types of the clusters which the web pages belong to. By the adoption of a supervised classification method and unsupervised clustering method cascading mode, the topics are automatically acquired from the web pages, the web pages are labeled, and the labeling efficiency and accuracy of the web page topics are effectively improved.

Description

Labeling method and device for web page topics

Technical field

The present invention relates to the technical field of data processing, and in particular to a labeling method and device for web page topics.

Background art

Analyzing the content of Internet web pages to extract and label web page topics is an important foundation for applications such as Internet data management and mining. At present, web page topic labeling mostly uses keyword matching: a web page is labeled by matching its title against a set of predetermined keywords. This direct matching is too simplistic, however; if the keywords in web page titles change, the method can no longer label topics accurately, so the accuracy of web page labeling cannot be guaranteed. Another approach to web page topic labeling uses clustering: web pages are clustered, and keywords extracted from the web pages grouped into one class serve as the label for that class. But clustering algorithms are relatively time-consuming, so this approach is impractical when many web pages need to be labeled, and labeling with an unsupervised learning algorithm alone is not very accurate.

Summary of the invention

The invention provides a labeling method and device for web page topics, in order to solve the problem in the prior art that web page topic labeling has low accuracy.

To address the above technical problem, the present invention adopts the following technical solutions.

The invention provides a labeling method for web page topics, comprising: obtaining a topic feature vector of a web page based on the title and body text of the web page; classifying the topic feature vector using a classifier obtained by training in advance; judging whether a type to which the topic feature vector belongs exists; if so, labeling the web page as the type to which the topic feature vector belongs; if not, marking the web page as a web page to be labeled; further, clustering a plurality of web pages to be labeled; analyzing the type of each cluster; and labeling each web page to be labeled as the type of the cluster to which it belongs.

Obtaining the topic feature vector of the web page based on its title and body text comprises: extracting the title and the body text of the web page separately; building a title feature vector from the title; building a body-text feature vector from the body text; and splicing the title feature vector and the body-text feature vector into the topic feature vector.

Building the title feature vector from the title comprises: segmenting the title into words using a title dictionary built in advance, to obtain title tokens; mapping the title tokens into the title dictionary; and weighting the title dictionary based on the weight values of the title tokens, to construct the title feature vector of the web page.

Building the body-text feature vector from the body text comprises: segmenting the body text into words using a body-text dictionary built in advance, to obtain a plurality of body-text tokens, and recording the order in which each body-text token appears in the body text; mapping the body-text tokens into the body-text dictionary; and weighting the body-text dictionary based on the weight value and order of appearance of each body-text token, to build the body-text feature vector of the web page.

Classifying the topic feature vector using the classifier obtained by training in advance comprises: predefining a plurality of web page types; for each type, scoring the topic feature vector of the web page once with the classifier; comparing the score for each type with a preset score threshold; and determining each type whose score is greater than the score threshold to be a type to which the topic feature vector belongs; wherein the topic feature vector belongs to one or more types.

Analyzing the type of a cluster comprises: extracting the title and the body text of each web page to be labeled in the cluster; segmenting all the titles using the title dictionary built in advance, to obtain a plurality of title tokens; segmenting all the body texts using the body-text dictionary built in advance, to obtain a plurality of body-text tokens; and taking the token that occurs most frequently among the title tokens and body-text tokens as the type of the cluster.

The present invention also provides a labeling device for web page topics, comprising: an obtaining module, configured to obtain a topic feature vector of a web page based on the title and body text of the web page; a classification module, configured to classify the topic feature vector using a classifier obtained by training in advance; a judging module, configured to judge whether a type to which the topic feature vector belongs exists; a labeling module, configured to label the web page as the type to which the topic feature vector belongs when the judging module determines that such a type exists; a marking module, configured to mark the web page as a web page to be labeled when the judging module determines that no type to which the topic feature vector belongs exists; a clustering module, configured to cluster a plurality of web pages to be labeled; and an analysis module, configured to analyze the type of each cluster; the labeling module being further configured to label each web page to be labeled as the type of the cluster to which it belongs.

The obtaining module comprises: an extraction unit, configured to extract the title and the body text of the web page separately; a first construction unit, configured to build a title feature vector from the title; a second construction unit, configured to build a body-text feature vector from the body text; and a splicing unit, configured to splice the title feature vector and the body-text feature vector into the topic feature vector.

The first construction unit is specifically configured to: segment the title into words using a title dictionary built in advance, to obtain title tokens; map the title tokens into the title dictionary; and weight the title dictionary based on the weight values of the title tokens, to construct the title feature vector of the web page. The second construction unit is specifically configured to: segment the body text into words using a body-text dictionary built in advance, to obtain a plurality of body-text tokens, and record the order in which each body-text token appears in the body text; map the body-text tokens into the body-text dictionary; and weight the body-text dictionary based on the weight value and order of appearance of each body-text token, to build the body-text feature vector of the web page.

The classification module is specifically configured to: predefine a plurality of web page types; call the classifier so that, for each type, the classifier scores the topic feature vector of the web page once; compare the score for each type with a preset score threshold; and determine each type whose score is greater than the score threshold to be a type to which the topic feature vector belongs; wherein the topic feature vector belongs to one or more types. The analysis module is specifically configured to: extract the title and the body text of each web page to be labeled in a cluster; segment all the titles using the title dictionary built in advance, to obtain a plurality of title tokens; segment all the body texts using the body-text dictionary built in advance, to obtain a plurality of body-text tokens; and take the token that occurs most frequently among the title tokens and body-text tokens as the type of the cluster.

The beneficial effects of the present invention are as follows:

The present invention cascades a supervised classification method with an unsupervised clustering method to obtain topics automatically from web pages and label the web pages, effectively improving the efficiency and accuracy of web page topic labeling.

Brief description of the drawings

Fig. 1 is a flowchart of the labeling method for web page topics according to an embodiment of the invention;

Fig. 2 is a flowchart of the labeling method for web page topics according to another embodiment of the invention;

Fig. 3 is a flowchart of the steps for building the title feature vector of a web page according to an embodiment of the invention;

Fig. 4 is a flowchart of the steps for building the body-text feature vector of a web page according to an embodiment of the invention;

Fig. 5 is a schematic diagram of splicing the title feature vector and the body-text feature vector according to an embodiment of the invention;

Fig. 6 is a flowchart of the steps for classifying the topic feature vector according to an embodiment of the invention;

Fig. 7 is a structural diagram of the labeling device for web page topics according to an embodiment of the invention;

Fig. 8 is a structural diagram of the obtaining module according to an embodiment of the invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

This embodiment provides a labeling method for web page topics. Fig. 1 is a flowchart of the method according to an embodiment of the invention. The steps of this embodiment are performed for each web page.

Step S110: based on the title and body text of the web page, obtain the topic feature vector of the web page.

Because the title and body text of a web page differ in length and writing style, this embodiment extracts the title and the body text of the web page separately; builds a title feature vector from the title; builds a body-text feature vector from the body text; and splices the title feature vector and the body-text feature vector into the topic feature vector of the web page. Both the title feature vector and the body-text feature vector comprise word vectors that reflect the topic of the web page.

Building the feature vectors separately with different dictionaries describes the web page content more accurately, which in turn improves the accuracy of web page topic labeling.

Step S120: classify the topic feature vector using the classifier obtained by training in advance.

The classifier classifies the topic feature vector, that is, determines its type. Because the topic feature vector reflects the topic of the web page, determining the type of the topic feature vector amounts to determining the type of the web page. The types include news, economy, entertainment, technology, and so on.

To improve the accuracy of web page classification, this embodiment uses a supervised classification method: the classifier is obtained by training on a classification labeling system and training data prepared in advance.

The classification labeling system is the set of predefined web page types, for example news, economy, entertainment, and technology. The training data comprises a plurality of web pages whose types have already been determined according to the classification labeling system. Based on the classification labeling system and the training data, the classifier is trained using a support vector machine (SVM).

Step S130: judge whether a type to which the topic feature vector belongs exists. If so, perform step S140; if not, perform step S150.

Whether such a type exists is judged from the classification result of the classifier. If a type to which the topic feature vector belongs exists, the classification result is that type; if no such type exists, the classification result is a null value.

Step S140: label the web page as the type to which the topic feature vector belongs.

Step S150: mark the web page as a web page to be labeled.

A web page whose type the classifier can determine is labeled with the corresponding type. A web page whose type the classifier cannot determine is put into the set of web pages to be labeled and handled by the subsequent method, to ensure the accuracy of web page labeling.
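The branch in steps S130 to S150 can be sketched as follows. This is only an illustrative sketch: the names `route_page` and `toy_classifier`, and the representation of the classifier as a function returning a list of types (empty for a null result), are assumptions made for the example and do not come from the embodiment.

```python
from typing import Callable, Dict, List, Sequence


def route_page(page_id: str,
               topic_vector: Sequence[float],
               classify: Callable[[Sequence[float]], List[str]],
               labels: Dict[str, List[str]],
               pending: List[str]) -> None:
    """Steps S130-S150: label the page if a type exists, otherwise defer it."""
    types = classify(topic_vector)
    if types:
        # S140: the classifier found at least one type for this page.
        labels[page_id] = types
    else:
        # S150: null result, so the page joins the set of pages to be labeled.
        pending.append(page_id)


def toy_classifier(vector: Sequence[float]) -> List[str]:
    """Hypothetical stand-in classifier: 'news' when the first component is positive."""
    return ["news"] if vector[0] > 0 else []


labels: Dict[str, List[str]] = {}
pending: List[str] = []
route_page("page-1", [0.8, 0.1], toy_classifier, labels, pending)
route_page("page-2", [-0.3, 0.4], toy_classifier, labels, pending)
print(labels)   # {'page-1': ['news']}
print(pending)  # ['page-2']
```

Keeping the deferred pages in a separate list mirrors the set of web pages to be labeled that the clustering stage later consumes.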

Fig. 2 is a flowchart of the labeling method for web page topics according to another embodiment of the present invention. This embodiment describes the processing performed on the web pages to be labeled.

Step S210: cluster a plurality of web pages to be labeled.

At each preset time interval, the number of web pages marked as web pages to be labeled is determined. If this number is greater than a preset quantity threshold, the web pages to be labeled are clustered; if it is less than or equal to the quantity threshold, the number is determined again after another preset time interval.

This embodiment uses an unsupervised clustering method. When clustering, a preset similarity algorithm, for example the k-means algorithm, is used to compute the pairwise similarity of the web pages to be labeled, and any two web pages to be labeled whose similarity is greater than a preset similarity threshold are assigned to the same cluster.
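The pairwise rule described above (two pages whose similarity exceeds the threshold go into the same cluster) can be illustrated with the sketch below. It is not k-means: it assumes cosine similarity as the similarity measure and merges pages transitively with a small union-find structure, both of which are choices made only for this example. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_similarity(vectors: List[Sequence[float]],
                        threshold: float) -> List[List[int]]:
    """Put pages i and j in the same cluster when similarity(i, j) > threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


pending_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(group_by_similarity(pending_vectors, threshold=0.8))  # [[0, 1], [2]]
```

The pairwise loop is quadratic in the number of pending pages, which is one reason the embodiment clusters only the pages the classifier could not label.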

Step S220: analyze the type of each cluster.

The Canopy algorithm may be used to analyze the type of each cluster.

In one embodiment, the following steps may be performed for each cluster: extract the title and the body text of each web page to be labeled in the cluster; segment all the titles using the title dictionary, to obtain a plurality of title tokens; segment all the body texts using the body-text dictionary, to obtain a plurality of body-text tokens; and take the token that occurs most frequently among the title tokens and body-text tokens as the type of the cluster. The most frequent token may be either a title token or a body-text token.
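The frequency-based analysis can be sketched as below, assuming the titles and body texts of a cluster have already been segmented into token lists. The function name `cluster_type` is hypothetical, and how ties between equally frequent tokens are resolved is not specified by the embodiment.

```python
from collections import Counter
from typing import Iterable, List, Optional


def cluster_type(title_tokens: Iterable[List[str]],
                 body_tokens: Iterable[List[str]]) -> Optional[str]:
    """Return the most frequent token over all titles and body texts of a cluster."""
    counts: Counter = Counter()
    for tokens in title_tokens:
        counts.update(tokens)
    for tokens in body_tokens:
        counts.update(tokens)
    if not counts:
        return None  # empty cluster or no tokens survived segmentation
    return counts.most_common(1)[0][0]


titles = [["football", "final"], ["football"]]
bodies = [["match", "football", "goal"], ["goal", "team"]]
print(cluster_type(titles, bodies))  # football
```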

Step S230: label each web page to be labeled as the type of the cluster to which it belongs.

In other words, whatever the type of a cluster is, the web pages to be labeled in that cluster are labeled with that type.

In one embodiment, the classifier is retrained at regular intervals using the clustering results, to increase classification precision. Further, after labeling is complete, the new types obtained by clustering and the web pages of those new types can be added to the classification labeling system and the training data, so that the classifier can then be trained on the new types and their web pages.

Determining the type of a web page by combining the classifier with clustering improves the accuracy and standardization of web page labeling.

Step S110 is described in detail as follows.

Fig. 3 is a flowchart of the steps for building the title feature vector of a web page according to an embodiment of the invention.

Step S310: build a title dictionary in advance.

Step 1: collect web page titles to form a title corpus.

Step 2: segment the title texts in the title corpus into words, and retain only the words in the segmentation result that meet a condition, for example words that carry actual meaning. A preset word segmentation algorithm may be used; such an algorithm usually includes a lexicon, by which a title text is divided into one or more words.

Step 3: compute the inverse document frequency (IDF) value of each retained word, and form the title dictionary from the words whose IDF value is greater than a preset first IDF threshold. The larger a word's IDF value, the more representative the word; the smaller the IDF value, the less representative.

The IDF value of a word w is computed as follows:

$$ \mathrm{IDF}(w) = \log \frac{N}{n_d} \tag{1.1} $$

In formula (1.1), $N$ is the number of titles collected in the whole corpus, and $n_d$ is the number of titles in which the word w appears. log denotes the logarithm; its base is 10 or e, chosen as needed.
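Formula (1.1) and the dictionary filter of Step 3 can be sketched as follows. The sketch assumes the corpus is already segmented into token lists and uses the natural logarithm; the function names and the toy corpus are illustrative only.

```python
import math
from typing import List


def idf(word: str, corpus: List[List[str]]) -> float:
    """Formula (1.1): log(N / n_d), with N titles and n_d titles containing the word."""
    n_total = len(corpus)
    n_d = sum(1 for title in corpus if word in title)
    if n_d == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    return math.log(n_total / n_d)  # natural log; base 10 is equally permitted


def build_dictionary(corpus: List[List[str]], idf_threshold: float) -> List[str]:
    """Keep the words whose IDF value is greater than the threshold."""
    vocabulary = sorted({word for title in corpus for word in title})
    return [word for word in vocabulary if idf(word, corpus) > idf_threshold]


corpus = [["a", "b"], ["a", "c"], ["a", "d"], ["b", "e"]]
print(build_dictionary(corpus, idf_threshold=0.5))  # ['b', 'c', 'd', 'e']
```

In the toy corpus the word "a" appears in three of four titles, so its IDF value is log(4/3), about 0.29, and it is filtered out as unrepresentative.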

Step S320: segment the title using the title dictionary to obtain title tokens.

The title is segmented using the words in the title dictionary, yielding one or more title tokens.

Step S330: map the title tokens into the title dictionary.

Each title token is mapped into the title dictionary. The title dictionary comprises a plurality of words, and a mapping is established between each title token and the identical word in the title dictionary.

Once the mappings are established, a vector whose length equals the length of the title dictionary is obtained: the dimension of the vector equals the number of words in the title dictionary, and each dimension corresponds to one word in the dictionary.

Step S340: weight the title dictionary based on the weight values of the title tokens, to construct the title feature vector of the web page.

Weighting the title dictionary means weighting the above vector whose length equals the length of the title dictionary. The words in the title dictionary that have a mapping, that is, the words in the vector mapped to a title token, are weighted with their TFIDF (term frequency-inverse document frequency) values, and the weighted vector is the title feature vector. TFIDF is a weighting technique commonly used in information retrieval and data mining.

During weighting, the value of each dimension of the vector is the TFIDF value, in this title, of the word corresponding to that dimension. The TFIDF value of a word w is computed as follows:

$$ \mathrm{TFIDF}(w) = \mathrm{TF} \cdot \mathrm{IDF} = \frac{c_w}{c} \cdot \log \frac{N}{n_d} \tag{1.2} $$

In formula (1.2), the IDF value is computed as in formula (1.1); the TF value is the frequency with which the word w appears in the current title; $c_w$ is the number of times w appears in the current title; and $c$ is the number of words (tokens) in the current title.
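Steps S330 and S340 can be sketched as below, assuming the title has already been segmented and the IDF values of the dictionary words have been precomputed. The function name, the dictionary, and the IDF values are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def title_feature_vector(title_tokens: List[str],
                         dictionary: List[str],
                         idf_values: Dict[str, float]) -> List[float]:
    """One dimension per dictionary word; mapped words get TF * IDF, others 0."""
    counts = Counter(title_tokens)
    total = len(title_tokens)  # c in formula (1.2)
    vector = []
    for word in dictionary:
        if total and word in counts:
            tf = counts[word] / total  # c_w / c
            vector.append(tf * idf_values[word])
        else:
            vector.append(0.0)  # dictionary word absent from this title
    return vector


dictionary = ["cup", "final", "goal"]
idf_values = {"cup": 1.0, "final": 2.0, "goal": 3.0}  # illustrative values
print(title_feature_vector(["cup", "final", "cup", "cup"], dictionary, idf_values))
# [0.75, 0.5, 0.0]
```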

Fig. 4 is a flowchart of the steps for building the body-text feature vector of a web page according to an embodiment of the invention.

Step S410: build a body-text dictionary in advance.

Body texts are collected to form a body-text corpus. The body texts in the corpus are segmented into words, and only the words in the segmentation result that meet a condition are retained, for example words that carry actual meaning. The IDF value of each retained word is computed, and the words whose IDF value is greater than a preset second IDF threshold form the body-text dictionary. The body-text dictionary is built in the same way as the title dictionary, and the IDF value is computed according to formula (1.1).

Step S420: segment the body text using the body-text dictionary to obtain a plurality of body-text tokens, and record the order in which each body-text token appears in the body text.

The body text is segmented using the words in the body-text dictionary. Following the body text from beginning to end, the order of appearance of each token (word) is recorded: the first token to appear is numbered 1, the second is numbered 2, and so on; a token that repeats is not recorded again.

Step S430: map the body-text tokens into the body-text dictionary.

The body text of a web page tends to open with a few brief words that highlight the topic and attract attention; that is, important words tend to appear near the beginning of the body text.

The body-text dictionary comprises a plurality of words, and a mapping is established between each body-text token and the identical word in the body-text dictionary.

Once the mappings are established, a vector whose length equals the length of the body-text dictionary is obtained: the dimension of the vector equals the number of words in the body-text dictionary, and each dimension corresponds to one word in the dictionary.

Step S440: weight the body-text dictionary based on the weight value and order of appearance of each body-text token, to build the body-text feature vector of the web page.

Weighting the body-text dictionary means weighting the above vector whose length equals the length of the body-text dictionary. The words in the body-text dictionary that have a mapping, that is, the words in the vector mapped to a body-text token, are weighted using the TFIDF value and the order of appearance of the mapped body-text token, and the weighted vector is the body-text feature vector. Each dimension of the body-text feature vector corresponds to one word in the dictionary, and the value of each dimension is the weight value $weight_{zw}$ obtained from the order of appearance, in this body text, of the word corresponding to that dimension and the TFIDF value of that word:

$$ weight_{zw}(w) = \left(1 - \frac{rank(w)}{\sum_{w \in W} rank(w)}\right) \cdot \mathrm{TFIDF}(w) \tag{1.3} $$

In formula (1.3), $weight_{zw}(w)$ is the weight value (dimension value) of the word w in the body-text feature vector; $rank(w)$ is the sequence number with which w appears in the body text; $\sum_{w \in W} rank(w)$ is the sum of the sequence numbers of all words; and TFIDF(w) is computed as in formula (1.2), with the title-related quantities replaced by the corresponding body-text quantities. The body-text feature vector is obtained in this way. The word symbol w in formula (1.3) is the same as in formula (1.2) only to make the computation of TFIDF(w) in formula (1.3) easier to follow.
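A sketch of formula (1.3) follows, assuming the body text is already segmented and the IDF values are precomputed. The rank of a token is the order of its first appearance, as in step S420. The function name and the sample values are illustrative.

```python
from collections import Counter
from typing import Dict, List


def body_feature_vector(body_tokens: List[str],
                        dictionary: List[str],
                        idf_values: Dict[str, float]) -> List[float]:
    """Formula (1.3): (1 - rank(w) / sum of ranks) * TFIDF(w) per dictionary word."""
    rank: Dict[str, int] = {}
    for token in body_tokens:
        if token not in rank:  # repeated tokens keep their first rank
            rank[token] = len(rank) + 1
    rank_sum = sum(rank.values())
    counts = Counter(body_tokens)
    total = len(body_tokens)

    vector = []
    for word in dictionary:
        if word in rank:
            tfidf = counts[word] / total * idf_values[word]
            vector.append((1 - rank[word] / rank_sum) * tfidf)
        else:
            vector.append(0.0)
    return vector


dictionary = ["a", "b", "c", "d"]
idf_values = {"a": 2.0, "b": 4.0, "c": 4.0, "d": 1.0}  # illustrative values
print(body_feature_vector(["a", "b", "a", "c"], dictionary, idf_values))
```

Here the ranks are a=1, b=2, c=3 with a sum of 6, so words that appear earlier keep a larger share of their TFIDF value. One property of the formula as written is that a body text containing a single distinct token receives weight 0, since its rank equals the rank sum.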

Generally, a title uses a brief statement to indicate the content and topic of a web page, so the title is short and the body text is long. This embodiment takes into account that the title feature vector is usually shorter than the body-text feature vector but more important, and therefore proposes splicing the title feature vector and the body-text feature vector in a weighted manner into a feature vector that expresses the topic of the web page, namely the topic feature vector, for example using the splicing scheme shown in Fig. 5. This avoids the bias in learning that would result from an imbalance between the contributions of the title feature vector and the body-text feature vector.

Before splicing, the dimension value TFIDF(w) of each word w in the title feature vector is weighted with a title weight $w_{bt}$, that is:

$$ weight_{bt}(w) = w_{bt} \cdot \mathrm{TFIDF}(w) \tag{1.4} $$

Before splicing, no weight is applied to the dimension values of the words in the body-text feature vector.

When splicing, the weighted title feature vector and the unweighted body-text feature vector are joined end to end, forming a vector whose length equals the sum of the lengths of the title feature vector and the body-text feature vector, with the weighted title feature vector placed before the unweighted body-text feature vector.
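The weighting of formula (1.4) followed by end-to-end splicing can be sketched as below; the function name `splice` and the sample values are assumptions made for illustration.

```python
from typing import List, Sequence


def splice(title_vector: Sequence[float],
           body_vector: Sequence[float],
           title_weight: float) -> List[float]:
    """Scale the title vector by w_bt (formula 1.4), then append the body vector."""
    weighted_title = [title_weight * value for value in title_vector]
    return weighted_title + list(body_vector)  # title part comes first


topic_vector = splice([0.5, 0.0], [0.1, 0.2, 0.3], title_weight=2.0)
print(topic_vector)       # [1.0, 0.0, 0.1, 0.2, 0.3]
print(len(topic_vector))  # 5
```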

This embodiment obtains $w_{bt}$ by grid search; the range of candidate values of $w_{bt}$ is given by formula (1.5). For each candidate $w_{bt}$, the classifier is cross-validated on the training data and the classification accuracy is computed; the $w_{bt}$ with the highest accuracy is taken as the value finally used.

$$ w_{bt} \in \{1,\ 1 + 0.01,\ \ldots,\ 1 + 0.01 \cdot n\}, \quad 1 + 0.01 \cdot n < \frac{N_{zw}}{N_{bt}} \tag{1.5} $$

In formula (1.5), $N_{bt}$ is the dimension of the title feature vector and $N_{zw}$ is the dimension of the body-text feature vector.
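The candidate grid of formula (1.5) and the selection of the best weight can be sketched as follows. The callable `cv_accuracy` is a hypothetical placeholder for "cross-validate the classifier with this title weight and return its accuracy"; the embodiment does not specify how that evaluation is implemented.

```python
from typing import Callable, List


def candidate_title_weights(n_bt: int, n_zw: int) -> List[float]:
    """Enumerate 1, 1.01, 1.02, ... while the value stays below N_zw / N_bt."""
    candidates = []
    k = 0
    # 1 + 0.01*k < n_zw / n_bt  is checked in integers as
    # (100 + k) * n_bt < 100 * n_zw, which avoids floating-point edge cases.
    while (100 + k) * n_bt < 100 * n_zw:
        candidates.append((100 + k) / 100)
        k += 1
    return candidates


def pick_title_weight(candidates: List[float],
                      cv_accuracy: Callable[[float], float]) -> float:
    """Return the candidate with the highest cross-validated accuracy."""
    return max(candidates, key=cv_accuracy)


grid = candidate_title_weights(n_bt=100, n_zw=105)
print(grid)  # [1.0, 1.01, 1.02, 1.03, 1.04]

# Stand-in accuracy function that peaks at 1.03, for demonstration only.
print(pick_title_weight(grid, lambda w: -abs(w - 1.03)))  # 1.03
```

Note that under formula (1.5) the grid is empty when the body-text dictionary is not larger than the title dictionary, so a caller would need a fallback for that case.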

Step S120 is described in detail as follows.

Fig. 6 is a flowchart of the steps for classifying the topic feature vector according to an embodiment of the invention.

Step S610: for each type, the classifier scores the topic feature vector of the web page once.

For each type, the topic feature vector of the web page has one score; that is, if there are multiple types, there are multiple scores. A score measures whether the web page matches the type corresponding to that score.

The classifier comprises a plurality of classifier functions, each corresponding to one type. Substituting the topic feature vector into each classifier function yields the score for each type.

For example, let a=[a1, a2, a3] be the classifier and y=a1*x1+a2*x2+a3*x3 be the classifier function for the news type (there may of course be classifier functions for other types). Substituting a title feature vector into the news classifier function yields the value y, which is the score; when the score is greater than 0, the web page corresponding to the title feature vector is of the news type, and otherwise it is not. Suppose a=[1, -2, 3]; substituting the three-dimensional title feature vector x=[1, 2, 3] into the news classifier function gives y=6, so y>0 and the web page corresponding to the title feature vector x=[1, 2, 3] is a news web page.
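The arithmetic of the example can be checked with a short sketch; the function name `score` is chosen only for illustration.

```python
from typing import Sequence


def score(coefficients: Sequence[float], vector: Sequence[float]) -> float:
    """Linear classifier function y = a1*x1 + a2*x2 + a3*x3 (a dot product)."""
    return sum(a_i * x_i for a_i, x_i in zip(coefficients, vector))


a = [1, -2, 3]   # coefficients of the news classifier function
x = [1, 2, 3]    # feature vector from the example
y = score(a, x)  # 1*1 + (-2)*2 + 3*3
print(y)         # 6
print(y > 0)     # True, so the page is treated as a news page
```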

Step S620: compare the score for each type with the preset score threshold.

Step S630: determine each type whose score is greater than the score threshold to be a type to which the topic feature vector belongs; the topic feature vector belongs to one or more types.

Specifically, the scores may be sorted in descending order. Whether the largest score is greater than the preset score threshold is judged; if so, the web page is labeled as the type corresponding to the largest score; if not, the web page is marked as a web page to be labeled. Next, whether the second-largest score is greater than the preset score threshold is judged; if so, the web page is labeled as the type corresponding to that score; if not, the web page is marked as a web page to be labeled. This continues until every score has been compared with the score threshold.
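Steps S620 and S630 can be sketched as below. The sketch follows the reading of step S630 in which every type whose score exceeds the threshold is assigned, and an empty result means the page is marked as a web page to be labeled; the function name and the sample scores are illustrative.

```python
from typing import Dict, List


def assign_types(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the types whose score exceeds the threshold, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked if value > threshold]


scores = {"news": 0.9, "economy": 0.2, "technology": 0.6}
print(assign_types(scores, threshold=0.5))  # ['news', 'technology']
print(assign_types(scores, threshold=1.0))  # [] -> page is marked as to be labeled
```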

Present invention also offers a kind of annotation equipment of Web page subject, as shown in Figure 7, is the structural drawing of the annotation equipment of Web page subject according to an embodiment of the invention.

This device comprises:

Obtain module 710, for based on the title of webpage and text, obtain the theme feature vector of webpage.

Sort module 720, the sorter obtained for utilizing training in advance, carries out classification process to theme feature vector.

, there is the type belonging to theme feature vector for judging whether in judge module 730.

Labeling module 740, for it is determined that the presence of the type belonging to theme feature vector at judge module, type webpage label is the theme belonging to proper vector.

Web Page Tags, for when judge module judges the type do not existed belonging to theme feature vector, is webpage to be marked by mark module 750.

Cluster module 760, for carrying out clustering processing to multiple webpage to be marked.

Analysis module 770, for analyzing the type of each cluster set.

Labeling module 780, also for the type by the cluster set of webpage label to be marked belonging to it.

In one embodiment, obtain module 710 and comprise: extraction unit 711, for extracting title in webpage and text respectively; First construction unit 712, for according to title, builds title feature vector; Second construction unit 713, for according to text, builds text proper vector; Concatenation unit 714, for being spliced into theme feature vector by title feature vector sum text proper vector.As shown in Figure 8.

First construction unit 712 for: utilize the title dictionary that builds in advance, word segmentation processing carried out to title, obtain title participle; Title participle is mapped in title dictionary; Based on the weighted value of title participle, process is weighted to title dictionary, constructs the title feature vector of webpage.

Second construction unit 713 for: utilize the text dictionary that builds in advance, word segmentation processing carried out to text, obtains multiple text participle, and record each text participle appearance order in the body of the email; Multiple text participle is mapped in text dictionary respectively; Based on weighted value and the appearance order of each text participle, align cliction allusion quotation and be weighted process, build the text proper vector of webpage.

In another embodiment, sort module 720 is specifically for pre-defined multiple type of webpage; Calling classification device, to make sorter for every type, once marks to the theme feature vector of webpage; The scoring score value of the correspondence of every type is compared with the mark threshold value preset respectively; By type corresponding for the scoring score value being greater than mark threshold value, be judged to be the type belonging to theme feature vector; Wherein, the type belonging to theme feature vector is one or more.

In another embodiment, the analysis module 770 is specifically configured to: extract the title and the body text of each web page to be labeled in a cluster; segment all the titles into words using the pre-built title dictionary, obtaining multiple title word segments; segment all the body texts into words using the pre-built body dictionary, obtaining multiple body word segments; and, among the title word segments and the body word segments, take the word segment with the highest frequency of occurrence as the type of the cluster.
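A minimal sketch of this frequency-based analysis is given below. It assumes the segmentation routines are supplied by the caller (segment_title and segment_body are hypothetical parameters) and that title and body word segments are counted in one pooled tally; the description does not say how ties are resolved.

```python
from collections import Counter

def cluster_type(pages, segment_title, segment_body):
    """Return the most frequent word segment over all titles and body texts in a cluster.

    pages is a list of (title, body) pairs. Returns None if no word segment is found.
    """
    counts = Counter()
    for title, body in pages:
        counts.update(segment_title(title))
        counts.update(segment_body(body))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Example with whitespace splitting standing in for dictionary-based segmentation.
pages = [("cheap flights", "flights to rome"), ("flights deals", "book flights")]
label = cluster_type(pages, str.split, str.split)
```

In practice the two segmenters would be the same dictionary-based routines used when building the feature vectors.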

The functions of the device described in this embodiment have already been described in the method embodiments shown in Fig. 1 to Fig. 6. For details not covered in the description of this embodiment, refer to the related description in the foregoing embodiments, which is not repeated here.

Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions and substitutions are possible; therefore, the scope of the present invention should not be limited to the above embodiments.

Claims

1. A labeling method for web page topics, characterized by comprising:

obtaining a topic feature vector of a web page based on the title and the body text of the web page;

performing classification on the topic feature vector using a classifier obtained by training in advance;

determining whether a type to which the topic feature vector belongs exists;

if so, labeling the web page with the type to which the topic feature vector belongs;

if not, marking the web page as a web page to be labeled; and further, performing clustering on multiple web pages to be labeled; analyzing the type of each cluster; and labeling each web page to be labeled with the type of the cluster to which it belongs.

2. The method according to claim 1, characterized in that obtaining the topic feature vector of the web page based on the title and the body text of the web page comprises:

extracting the title and the body text of the web page separately;

building a title feature vector from the title;

building a body feature vector from the body text;

concatenating the title feature vector and the body feature vector into the topic feature vector.

3. The method according to claim 2, characterized in that building the title feature vector of the web page from the title comprises:

segmenting the title into words using a pre-built title dictionary, obtaining title word segments;

mapping the title word segments into the title dictionary;

weighting the title dictionary based on the weight values of the title word segments, thereby constructing the title feature vector of the web page.

4. The method according to claim 2, characterized in that building the body feature vector of the web page from the body text comprises:

segmenting the body text into words using a pre-built body dictionary, obtaining multiple body word segments, and recording the order in which each body word segment appears in the body text;

mapping the multiple body word segments into the body dictionary;

weighting the body dictionary based on the weight value and the order of appearance of each body word segment, thereby building the body feature vector of the web page.

5. The method according to claim 1, characterized in that performing classification on the topic feature vector using the classifier obtained by training in advance comprises:

predefining multiple web page types;

scoring, by the classifier, the topic feature vector of the web page once for each type;

comparing the score for each type with a preset score threshold;

determining each type whose score is greater than the score threshold to be a type to which the topic feature vector belongs; wherein the topic feature vector belongs to one or more types.

6. The method according to claim 1, characterized in that analyzing the type of a cluster comprises:

extracting the title and the body text of each web page to be labeled in the cluster;

segmenting all the titles into words using a pre-built title dictionary, obtaining multiple title word segments;

segmenting all the body texts into words using a pre-built body dictionary, obtaining multiple body word segments;

among the multiple title word segments and the multiple body word segments, taking the word segment with the highest frequency of occurrence as the type of the cluster.

7. A labeling device for web page topics, characterized by comprising:

an acquisition module, configured to obtain a topic feature vector of a web page based on the title and the body text of the web page;

a classification module, configured to perform classification on the topic feature vector using a classifier obtained by training in advance;

a judging module, configured to determine whether a type to which the topic feature vector belongs exists;

a labeling module, configured to label the web page with the type to which the topic feature vector belongs when the judging module determines that such a type exists;

a marking module, configured to mark the web page as a web page to be labeled when the judging module determines that no type to which the topic feature vector belongs exists;

a clustering module, configured to perform clustering on multiple web pages to be labeled;

an analysis module, configured to analyze the type of each cluster;

the labeling module being further configured to label each web page to be labeled with the type of the cluster to which it belongs.

8. The device according to claim 7, characterized in that the acquisition module comprises:

an extraction unit, configured to extract the title and the body text of the web page separately;

a first construction unit, configured to build a title feature vector from the title;

a second construction unit, configured to build a body feature vector from the body text;

a concatenation unit, configured to concatenate the title feature vector and the body feature vector into the topic feature vector.

9. The device according to claim 8, characterized in that

the first construction unit is specifically configured to:

map the title word segments into the title dictionary;

weight the title dictionary based on the weight values of the title word segments, thereby constructing the title feature vector of the web page;

the second construction unit is specifically configured to:

10. The device according to claim 7, characterized in that

the classification module is specifically configured to:

predefine multiple web page types; invoke the classifier so that, for each type, the classifier scores the topic feature vector of the web page once;

determine each type whose score is greater than the score threshold to be a type to which the topic feature vector belongs; wherein the topic feature vector belongs to one or more types;

the analysis module is specifically configured to: