CN104881458A - Labeling method and device for web page topics - Google Patents

Info

Publication number
CN104881458A
Authority
CN
China
Prior art keywords
text
title
webpage
feature vector
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510266108.XA
Other languages
Chinese (zh)
Other versions
CN104881458B (en
Inventor
李扬曦
杜翠兰
李睿
佟玲玲
翟羽佳
王晶
刘洋
秦韬
付戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center filed Critical National Computer Network and Information Security Management Center
Priority to CN201510266108.XA priority Critical patent/CN104881458B/en
Publication of CN104881458A publication Critical patent/CN104881458A/en
Application granted granted Critical
Publication of CN104881458B publication Critical patent/CN104881458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses a labeling method and device for web page topics. The method includes the steps that based on titles and main bodies of web pages, topic feature vectors of the web pages are acquired; classification processing is performed on the topic feature vectors through a classifier which is obtained through training in advance; whether types which the topic feature vectors belong to exist is judged; if yes, the web pages are labeled as the types which the topic feature vectors belong to; otherwise, the web pages are labeled as web pages to be labeled; furthermore, clustering processing is performed on the multiple web pages to be labeled; the type of each cluster is obtained through analysis; the web pages to be labeled are labeled as the types of the clusters which the web pages belong to. By the adoption of a supervised classification method and unsupervised clustering method cascading mode, the topics are automatically acquired from the web pages, the web pages are labeled, and the labeling efficiency and accuracy of the web page topics are effectively improved.

Description

Labeling method and device for web page topics
Technical field
The present invention relates to the technical field of data processing, and in particular to a labeling method and device for web page topics.
Background art
Analyzing the content of internet web pages and extracting and labeling web page topics is an important foundation for applications such as internet data management and mining. At present, web page topic labeling mostly uses keyword matching, in which the web page title is matched against a set of predetermined keywords to label the page. This direct matching is too simple: if the keywords in the title change, the method can no longer label the topic accurately, so the accuracy of web page labeling cannot be guaranteed. Another approach clusters the web pages and extracts keywords from the pages grouped into one class as the label of that class. However, clustering algorithms are relatively time-consuming, so this approach is impractical when many web pages need to be labeled, and labeling that relies only on an unsupervised learning algorithm is not very accurate.
Summary of the invention
The present invention provides a labeling method and device for web page topics, to solve the problem in the prior art that web page topic labeling has low accuracy.
To address the above technical problem, the present invention adopts the following technical solutions.
The present invention provides a labeling method for web page topics, comprising: obtaining a topic feature vector of a web page based on the title and body text of the web page; classifying the topic feature vector using a classifier obtained by training in advance; judging whether a type to which the topic feature vector belongs exists; if so, labeling the web page with the type to which the topic feature vector belongs; if not, marking the web page as a web page to be labeled; further, clustering a plurality of web pages to be labeled; analyzing the type of each cluster; and labeling each web page to be labeled with the type of the cluster to which it belongs.
Wherein, obtaining the topic feature vector of the web page based on the title and body text of the web page comprises: extracting the title and the body text of the web page separately; building a title feature vector from the title; building a body feature vector from the body text; and splicing the title feature vector and the body feature vector into the topic feature vector.
Wherein, building the title feature vector from the title comprises: segmenting the title into words using a title dictionary built in advance, to obtain title word segments; mapping the title word segments into the title dictionary; and weighting the title dictionary based on the weight values of the title word segments, to construct the title feature vector of the web page.
Wherein, building the body feature vector from the body text comprises: segmenting the body text into words using a body dictionary built in advance, to obtain a plurality of body word segments, and recording the order in which each body word segment appears in the body text; mapping the body word segments into the body dictionary; and weighting the body dictionary based on the weight value and appearance order of each body word segment, to build the body feature vector of the web page.
Wherein, classifying the topic feature vector using the classifier obtained by training in advance comprises: predefining a plurality of web page types; for each type, scoring the topic feature vector of the web page once with the classifier; comparing the score of each type with a preset score threshold; and determining each type whose score is greater than the score threshold to be a type to which the topic feature vector belongs; wherein the topic feature vector belongs to one or more types.
Wherein, analyzing the type of a cluster comprises: extracting the title and the body text of each web page to be labeled in the cluster; segmenting all the titles into words using the title dictionary built in advance, to obtain a plurality of title word segments; segmenting all the body texts into words using the body dictionary built in advance, to obtain a plurality of body word segments; and, among the title word segments and the body word segments, taking the word segment that occurs most frequently as the type of the cluster.
The present invention also provides a labeling device for web page topics, comprising: an obtaining module, configured to obtain a topic feature vector of a web page based on the title and body text of the web page; a classification module, configured to classify the topic feature vector using a classifier obtained by training in advance; a judging module, configured to judge whether a type to which the topic feature vector belongs exists; a labeling module, configured to label the web page with the type to which the topic feature vector belongs when the judging module determines that such a type exists; a marking module, configured to mark the web page as a web page to be labeled when the judging module determines that no such type exists; a clustering module, configured to cluster a plurality of web pages to be labeled; and an analysis module, configured to analyze the type of each cluster; the labeling module being further configured to label each web page to be labeled with the type of the cluster to which it belongs.
Wherein, the obtaining module comprises: an extraction unit, configured to extract the title and the body text of the web page separately; a first construction unit, configured to build a title feature vector from the title; a second construction unit, configured to build a body feature vector from the body text; and a splicing unit, configured to splice the title feature vector and the body feature vector into the topic feature vector.
Wherein, the first construction unit is specifically configured to: segment the title into words using a title dictionary built in advance, to obtain title word segments; map the title word segments into the title dictionary; and weight the title dictionary based on the weight values of the title word segments, to construct the title feature vector of the web page. The second construction unit is specifically configured to: segment the body text into words using a body dictionary built in advance, to obtain a plurality of body word segments, and record the order in which each body word segment appears in the body text; map the body word segments into the body dictionary; and weight the body dictionary based on the weight value and appearance order of each body word segment, to build the body feature vector of the web page.
Wherein, the classification module is specifically configured to: predefine a plurality of web page types; call the classifier so that, for each type, the classifier scores the topic feature vector of the web page once; compare the score of each type with a preset score threshold; and determine each type whose score is greater than the score threshold to be a type to which the topic feature vector belongs; wherein the topic feature vector belongs to one or more types. The analysis module is specifically configured to: extract the title and the body text of each web page to be labeled in a cluster; segment all the titles into words using the title dictionary built in advance, to obtain a plurality of title word segments; segment all the body texts into words using the body dictionary built in advance, to obtain a plurality of body word segments; and, among the title word segments and the body word segments, take the word segment that occurs most frequently as the type of the cluster.
The beneficial effects of the present invention are as follows:
The present invention cascades a supervised classification method with an unsupervised clustering method, automatically obtains topics from web pages and labels the web pages, and effectively improves the efficiency and accuracy of web page topic labeling.
Brief description of the drawings
Fig. 1 is a flow chart of the labeling method for web page topics according to an embodiment of the present invention;
Fig. 2 is a flow chart of the labeling method for web page topics according to another embodiment of the present invention;
Fig. 3 is a flow chart of the steps for building the title feature vector of a web page according to an embodiment of the present invention;
Fig. 4 is a flow chart of the steps for building the body feature vector of a web page according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of splicing the title feature vector and the body feature vector according to an embodiment of the present invention;
Fig. 6 is a flow chart of the steps for classifying the topic feature vector according to an embodiment of the present invention;
Fig. 7 is a structural diagram of the labeling device for web page topics according to an embodiment of the present invention;
Fig. 8 is a structural diagram of the obtaining module according to an embodiment of the present invention.
Detailed description of the embodiments
The present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and do not limit it.
This embodiment provides a labeling method for web page topics. Fig. 1 is a flow chart of the labeling method for web page topics according to an embodiment of the present invention. The steps of this embodiment are performed for each web page.
Step S110: obtain the topic feature vector of the web page based on its title and body text.
Because the title and the body text of a web page differ in length and wording style, this embodiment extracts the title and the body text separately; builds a title feature vector from the title; builds a body feature vector from the body text; and splices the title feature vector and the body feature vector into the topic feature vector of the web page. Both the title feature vector and the body feature vector contain word vectors that reflect the topic of the web page.
Building the feature vectors separately with different dictionaries describes the web page content more accurately and thereby improves the accuracy of web page topic labeling.
Step S120: classify the topic feature vector using a classifier obtained by training in advance.
The classifier classifies the topic feature vector and determines its type. Because the topic feature vector reflects the topic of the web page, determining the type of the topic feature vector amounts to determining the type of the web page. The types include, for example, news, economy, entertainment, and science and technology.
To improve the accuracy of web page classification, this embodiment uses a supervised classification method: the classifier is obtained by training on a classification labeling system and training data prepared in advance.
The classification labeling system is the set of predefined web page types, for example news, economy, entertainment, and science and technology. The training data comprises a plurality of web pages whose types have already been determined according to the classification labeling system. Based on the classification labeling system and the training data, a support vector machine (SVM) is used to train the classifier.
Step S130: judge whether a type to which the topic feature vector belongs exists. If so, perform step S140; if not, perform step S150.
The judgment is made according to the classification result of the classifier. If a type to which the topic feature vector belongs exists, the classification result is that type; if no such type exists, the classification result is a null value.
Step S140: label the web page with the type to which the topic feature vector belongs.
Step S150: mark the web page as a web page to be labeled.
A web page whose type the classifier can determine is labeled with the corresponding type. A web page whose type the classifier cannot determine is placed in the set of web pages to be labeled and handled by the subsequent method, to ensure the accuracy of web page labeling.
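The decision in steps S130 to S150 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `label_page`, the `classify` callable, and the `pending` list are hypothetical names, and `classify` is assumed to return a list of types that is empty when no type can be determined.

```python
def label_page(page_id, topic_vector, classify, pending):
    """Label a page with the classifier's types, or queue it for clustering.

    classify: hypothetical callable returning a list of types (empty when
    the classifier cannot determine any type, i.e. the null result).
    pending: list collecting the ids of pages marked "to be labeled".
    """
    types = classify(topic_vector)
    if types:                    # step S140: a type exists
        return types
    pending.append(page_id)      # step S150: defer to the clustering stage
    return None


# Toy classifier (assumption): "news" when the first dimension is positive.
def toy_classify(vector):
    return ["news"] if vector[0] > 0 else []


pending = []
first = label_page("page-1", [0.7, 0.1], toy_classify, pending)
second = label_page("page-2", [-0.2, 0.4], toy_classify, pending)
```

Pages collected in `pending` are the input of the clustering stage described for Fig. 2.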
Fig. 2 is a flow chart of the labeling method for web page topics according to another embodiment of the present invention. This embodiment describes the processing performed on web pages to be labeled.
Step S210: cluster a plurality of web pages to be labeled.
At every preset time interval, the number of web pages marked as web pages to be labeled is determined. If this number is greater than a preset quantity threshold, the web pages to be labeled are clustered; if it is less than or equal to the quantity threshold, the number is determined again after another preset time interval.
This embodiment uses an unsupervised clustering method. When clustering, a preset similarity algorithm (for example, the k-means algorithm) is used to compute the pairwise similarity between the web pages to be labeled, and two web pages to be labeled whose similarity is greater than a preset similarity threshold are placed in the same cluster.
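A literal reading of the similarity-threshold grouping described above can be sketched as follows. Cosine similarity and the union-find merging are assumptions chosen for illustration; the text names k-means as one possible algorithm, which this sketch does not implement. The names `cosine` and `threshold_clusters` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def threshold_clusters(vectors, sim_threshold):
    """Group vector indices so that any pair above the threshold shares a cluster."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > sim_threshold:
                parent[find(i)] = find(j)   # merge the two clusters

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
groups = threshold_clusters(pages, sim_threshold=0.8)
```

Note that merging is transitive in this sketch: two pages can end up in the same cluster through a chain of similar pages even if their own similarity is below the threshold. The pairwise comparison is quadratic in the number of pages.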
Step S220: analyze the type of each cluster.
The Canopy algorithm may be used to analyze the type of each cluster.
In one embodiment, the following steps may be performed for each cluster: extract the title and the body text of each web page to be labeled in the cluster; segment all the titles into words using the title dictionary, to obtain a plurality of title word segments; segment all the body texts into words using the body dictionary, to obtain a plurality of body word segments; and, among the title word segments and the body word segments, take the word segment that occurs most frequently as the type of the cluster. The most frequent word segment may be either a title word segment or a body word segment.
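The most-frequent-word-segment rule can be sketched as follows, assuming the titles and body texts of the cluster have already been segmented. `cluster_type` is a hypothetical name, and counting title and body occurrences in a single tally is one reading of the step above.

```python
from collections import Counter


def cluster_type(pages):
    """Return the most frequent word segment over all titles and bodies.

    pages: list of (title_segments, body_segments) pairs for one cluster.
    Returns None for an empty cluster.
    """
    counts = Counter()
    for title_segments, body_segments in pages:
        counts.update(title_segments)
        counts.update(body_segments)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


cluster = [
    (["football", "final"], ["football", "match", "goal"]),
    (["match", "report"], ["football", "goal"]),
]
label = cluster_type(cluster)   # "football" occurs 3 times, more than any other
```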
Step S230: label each web page to be labeled with the type of the cluster to which it belongs.
In other words, whatever the type of a cluster is, that type is used as the label of every web page to be labeled in the cluster.
In one embodiment, the classifier is retrained at regular intervals using the clustering results, to increase classification precision. Further, after labeling is completed, the new types obtained by clustering and the web pages of those new types may be added to the classification labeling system and the training data, so that the classifier can be trained on the new types and their web pages.
Determining the type of a web page by combining the classifier with clustering improves the accuracy and performance of web page labeling.
Step S110 is described in detail below.
Fig. 3 is a flow chart of the steps for building the title feature vector of a web page according to an embodiment of the present invention.
Step S310: build the title dictionary in advance.
Step 1: collect web page titles to form a title corpus.
Step 2: segment the title text in the title corpus into words and retain only the qualifying words in the segmentation result, for example words that carry practical meaning. A preset segmentation algorithm may be used; a segmentation algorithm usually includes a dictionary, by means of which the title text is divided into one or more words.
Step 3: compute the IDF (Inverse Document Frequency) value of each retained word, and form the title dictionary from the words whose IDF value is greater than a preset first IDF threshold. The larger the IDF value of a word, the more representative the word; the smaller the IDF value, the less representative.
The IDF value of a word w is computed as follows:
$$\mathrm{IDF}(w) = \log \frac{N}{n_d} \tag{1.1}$$
In formula (1.1), $N$ is the number of titles collected in the whole corpus, and $n_d$ is the number of titles in which the word w appears. $\log$ denotes the logarithm; its base is 10 or e, chosen as required.
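The dictionary construction of step S310 with formula (1.1) can be sketched as follows. The sketch assumes the titles are already segmented and filtered, uses the natural logarithm (one of the two bases allowed above), and uses a hypothetical function name `build_dictionary` and an illustrative threshold value.

```python
import math
from collections import Counter


def build_dictionary(segmented_docs, idf_threshold):
    """Return {word: IDF} for the words whose IDF exceeds idf_threshold.

    segmented_docs: list of word lists, one per title in the corpus.
    """
    n_docs = len(segmented_docs)                 # N in formula (1.1)
    doc_freq = Counter()                         # n_d for each word
    for words in segmented_docs:
        doc_freq.update(set(words))              # count each title once per word
    idf = {w: math.log(n_docs / n_d) for w, n_d in doc_freq.items()}
    return {w: value for w, value in idf.items() if value > idf_threshold}


titles = [["stock", "market", "news"], ["football", "news"], ["stock", "prices"]]
title_dict = build_dictionary(titles, idf_threshold=0.5)
```

In this toy corpus, "stock" and "news" each appear in two of the three titles (IDF = log 1.5, about 0.41) and are dropped, while words that appear in a single title (IDF = log 3, about 1.10) are kept.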
Step S320: segment the title into words using the title dictionary, to obtain title word segments.
The title is segmented using the words in the title dictionary, yielding one or more title word segments.
Step S330: map the title word segments into the title dictionary.
Each title word segment is mapped into the title dictionary. The title dictionary contains a plurality of words, and a mapping relation is established between each title word segment and the identical word in the title dictionary.
Once the mapping relations are established, a vector whose length equals the length of the title dictionary is obtained: the dimension of the vector equals the number of words in the title dictionary, and each dimension corresponds to one word in the dictionary.
Step S340: weight the title dictionary based on the weight values of the title word segments, to construct the title feature vector of the web page.
Weighting the title dictionary means weighting the above vector whose length equals the length of the title dictionary. Each word in the title dictionary that has a mapping relation, that is, each word in the vector that is mapped to a title word segment, is weighted with its TFIDF (term frequency-inverse document frequency) value; the vector obtained after weighting is the title feature vector. TFIDF is a weighting technique commonly used in information retrieval and data mining.
During weighting, the value of each dimension of the vector is the TFIDF value, in this title, of the word corresponding to that dimension. The TFIDF value of a word w is computed as follows:
$$\mathrm{TFIDF}(w) = \mathrm{TF} \cdot \mathrm{IDF} = \frac{c_w}{c} \cdot \log \frac{N}{n_d} \tag{1.2}$$
In formula (1.2), the IDF value is computed as in formula (1.1); the TF value is the frequency with which the word w appears in the current title, where $c_w$ is the number of times w appears in the current title and $c$ is the number of words (word segments) in the current title.
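Steps S330 and S340 with formula (1.2) can be sketched as follows. The names `title_feature_vector`, `vocabulary`, and `idf` are hypothetical, the IDF values are made-up numbers, and taking c as the total number of word segments in the title (including any that are not in the dictionary) is an assumption.

```python
from collections import Counter


def title_feature_vector(title_segments, vocabulary, idf):
    """Build a TFIDF vector with one dimension per dictionary word.

    vocabulary: ordered list of the words in the title dictionary.
    idf: {word: IDF value} for every word in vocabulary.
    """
    counts = Counter(title_segments)     # c_w for each word in the title
    c = len(title_segments)              # c: number of word segments in the title
    if c == 0:
        return [0.0] * len(vocabulary)
    # Words with no mapping keep the value 0; mapped words get TF * IDF.
    return [counts[w] / c * idf[w] for w in vocabulary]


vocabulary = ["market", "football", "prices"]
idf = {"market": 2.0, "football": 1.0, "prices": 0.5}    # illustrative values
vector = title_feature_vector(["market", "market", "prices", "today"], vocabulary, idf)
```

Here "market" has TF = 2/4 and contributes 1.0, "prices" has TF = 1/4 and contributes 0.125, and "football" does not occur, so its dimension stays 0.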
Fig. 4 is a flow chart of the steps for building the body feature vector of a web page according to an embodiment of the present invention.
Step S410: build the body dictionary in advance.
Body texts are collected to form a body corpus. The body text in the body corpus is segmented into words, and only the qualifying words in the segmentation result, such as words that carry practical meaning, are retained. The IDF value of each retained word is computed, and the words whose IDF value is greater than a preset second IDF threshold form the body dictionary. The body dictionary is built in the same way as the title dictionary, and the IDF value is computed with reference to formula (1.1).
Step S420: segment the body text into words using the body dictionary, to obtain a plurality of body word segments, and record the order in which each body word segment appears in the body text.
The body text is segmented using the words in the body dictionary. Following the body text from beginning to end, the appearance order of each word segment (word) is recorded: the first word segment to appear is recorded as 1, the second as 2, and so on; a repeated word segment is not recorded again.
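The recording of appearance order in step S420 can be sketched as follows, assuming the body text is already segmented. `appearance_ranks` is a hypothetical name.

```python
def appearance_ranks(body_segments):
    """Map each distinct word segment to the order of its first appearance (1-based)."""
    ranks = {}
    for segment in body_segments:
        if segment not in ranks:            # repeated segments are not recorded again
            ranks[segment] = len(ranks) + 1
    return ranks


ranks = appearance_ranks(["storm", "hits", "coast", "storm", "damage"])
```

For the sample input, the second occurrence of "storm" is skipped, so "damage" receives order 4.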
Step S430: map the body word segments into the body dictionary.
The body text of a web page tends to use a few brief opening words to highlight the topic and attract attention; that is, important words tend to appear near the beginning of the body text.
The body dictionary contains a plurality of words, and a mapping relation is established between each body word segment and the identical word in the body dictionary.
Once the mapping relations are established, a vector whose length equals the length of the body dictionary is obtained: the dimension of the vector equals the number of words in the body dictionary, and each dimension corresponds to one word in the dictionary.
Step S440: weight the body dictionary based on the weight value and appearance order of each body word segment, to build the body feature vector of the web page.
Weighting the body dictionary means weighting the above vector whose length equals the length of the body dictionary. Each word in the body dictionary that has a mapping relation, that is, each word in the vector that is mapped to a body word segment, is weighted using its TFIDF value and the appearance order of the mapped body word segment; the vector obtained after weighting is the body feature vector. Each dimension of the body feature vector corresponds to one word in the dictionary, and the value of each dimension is the weight $\mathrm{weight}_{zw}$ obtained from the appearance order, in this body text, of the word corresponding to that dimension and from the TFIDF value of that word:
$$\mathrm{weight}_{zw}(w) = \left(1 - \frac{\mathrm{rank}(w)}{\sum_{w \in W} \mathrm{rank}(w)}\right) \cdot \mathrm{TFIDF}(w) \tag{1.3}$$
In formula (1.3), $\mathrm{weight}_{zw}(w)$ is the weight (dimension value) of the word w in the body feature vector; $\mathrm{rank}(w)$ is the sequence number of the appearance of w in the body text; $\sum_{w \in W} \mathrm{rank}(w)$ is the sum of the sequence numbers of all words; and $\mathrm{TFIDF}(w)$ is computed with reference to formula (1.2), with the title-related quantities replaced by the corresponding body-related quantities. The body feature vector is obtained in this way. The word symbol w in formula (1.3) is kept the same as in formula (1.2) only to make the computation of $\mathrm{TFIDF}(w)$ in formula (1.3) easier to follow.
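Formula (1.3) can be sketched as follows. The names `body_feature_vector`, `ranks`, and `tfidf` are hypothetical, the TFIDF values are made-up numbers, and summing the sequence numbers over every recorded word segment of the body text is an assumption about the set W.

```python
def body_feature_vector(ranks, tfidf, vocabulary):
    """Weight each dictionary word by its appearance order and TFIDF value.

    ranks: {word: appearance order in the body text (1-based)}.
    tfidf: {word: TFIDF value in the body text} for every word in ranks.
    vocabulary: ordered list of the words in the body dictionary.
    """
    rank_sum = sum(ranks.values())          # sum of rank(w) over all recorded words
    vector = []
    for w in vocabulary:
        if w in ranks:
            # Earlier words (small rank) keep more of their TFIDF value.
            vector.append((1 - ranks[w] / rank_sum) * tfidf[w])
        else:
            vector.append(0.0)              # word not mapped to any body word segment
    return vector


ranks = {"storm": 1, "coast": 2, "damage": 3}
tfidf = {"storm": 0.6, "coast": 0.3, "damage": 0.2}       # illustrative values
vector = body_feature_vector(ranks, tfidf, ["storm", "coast", "damage", "football"])
```

With a rank sum of 6, the factors are 5/6, 4/6, and 3/6, giving weights of about 0.5, 0.2, and 0.1. One property of the formula as written: a body text with a single distinct word gives that word a weight of 0, because rank(w) then equals the rank sum.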
Generally speaking, the title states the content and topic of the web page in a brief sentence, so the title is short and the body text is long. Considering that the title feature vector is usually shorter than the body feature vector but more important, this embodiment proposes splicing the title feature vector and the body feature vector in a weighted manner into a feature vector that expresses the topic of the web page, namely the topic feature vector, for example in the manner shown in Fig. 5. This avoids an imbalance between the contributions of the title feature vector and the body feature vector during learning.
Before splicing, the dimension value $\mathrm{TFIDF}(w)$ of each word w in the title feature vector is weighted with the title weight $w_{bt}$, that is:
$$\mathrm{weight}_{bt}(w) = w_{bt} \cdot \mathrm{TFIDF}(w) \tag{1.4}$$
Before splicing, no such weight is applied to the dimension values of the words in the body feature vector.
When splicing, the weighted title feature vector and the unweighted body feature vector are joined end to end, forming a vector whose length equals the sum of the lengths of the title feature vector and the body feature vector, in which the weighted title feature vector comes before the unweighted body feature vector.
This embodiment obtains $w_{bt}$ by grid search; the range of candidate values of $w_{bt}$ is given by formula (1.5). For each candidate $w_{bt}$, the classifier is cross-validated on the training data and the classification accuracy is computed; the $w_{bt}$ with the highest accuracy is taken as the value finally used.
$$w_{bt} \in \{1,\ 1 + 0.01,\ \ldots,\ 1 + 0.01 \cdot n\}; \quad 1 + 0.01 \cdot n < \frac{N_{zw}}{N_{bt}} \tag{1.5}$$
In formula (1.5), $N_{bt}$ is the dimension of the title feature vector and $N_{zw}$ is the dimension of the body feature vector.
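The weighted splicing of formula (1.4) and the candidate grid of formula (1.5) can be sketched as follows. `splice` and `candidate_title_weights` are hypothetical names; the cross-validation that scores each candidate depends on the classifier and is not shown. The sketch applies the upper bound of formula (1.5) strictly, so the grid is empty when the body dictionary is not larger than the title dictionary.

```python
def splice(title_vector, body_vector, w_bt):
    """Concatenate the weighted title vector with the unweighted body vector."""
    return [w_bt * value for value in title_vector] + list(body_vector)


def candidate_title_weights(n_bt, n_zw, step=0.01):
    """List 1, 1 + step, ... while the value stays below n_zw / n_bt (n_bt > 0)."""
    upper = n_zw / n_bt
    candidates = []
    k = 0
    while 1 + step * k < upper:
        candidates.append(round(1 + step * k, 10))   # round away float drift
        k += 1
    return candidates


topic_vector = splice([1.0, 2.0], [0.5], w_bt=1.5)
grid = candidate_title_weights(n_bt=200, n_zw=205)   # upper bound is 1.025
```

A grid search would build the topic feature vectors with each value in `grid`, cross-validate the classifier on the training data, and keep the value with the highest accuracy.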
Step S120 is described in detail below.
Fig. 6 is a flow chart of the steps for classifying the topic feature vector according to an embodiment of the present invention.
Step S610: for each type, the classifier scores the topic feature vector of the web page once.
For each type, the topic feature vector of the web page receives one score; that is, if there are several types, there are several scores. A score measures whether the web page matches the type corresponding to that score.
The classifier comprises a plurality of classifier functions, each corresponding to one type; substituting the topic feature vector into each classifier function yields the score for each type.
For example, let a = [a1, a2, a3] be the classifier and y = a1*x1 + a2*x2 + a3*x3 be the classifier function for the news type (classifier functions for other types may of course exist as well). Substituting the title feature vector into the news classifier function yields the value y, which is the score. When the score is greater than 0, the web page corresponding to the title feature vector is of the news type; otherwise it is not. Suppose a = [1, -2, 3]; substituting the three-dimensional title feature vector x = [1, 2, 3] into the news classifier function gives y = 6, so y > 0 and the web page corresponding to x = [1, 2, 3] is a news web page.
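The worked example above can be reproduced with a short sketch. `score` is a hypothetical name, and the bias term is an assumption that defaults to 0 so the function matches the formula in the text.

```python
def score(weights, x, bias=0.0):
    """Linear classifier function: y = a1*x1 + a2*x2 + ... + bias."""
    return sum(a * xi for a, xi in zip(weights, x)) + bias


a = [1, -2, 3]          # classifier weights for the news type, from the example
x = [1, 2, 3]           # feature vector from the example
y = score(a, x)         # 1*1 + (-2)*2 + 3*3 = 6
is_news = y > 0
```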
Step S620: compare the score of each type with the preset score threshold.
Step S630: determine each type whose score is greater than the score threshold to be a type to which the topic feature vector belongs; the topic feature vector belongs to one or more types.
Specifically, the scores may be sorted in descending order. It is judged whether the largest score is greater than the preset score threshold; if so, the web page is labeled with the type corresponding to the largest score; if not, the web page is marked as a web page to be labeled. It is then judged whether the second-largest score is greater than the preset score threshold; if so, the web page is also labeled with the type corresponding to that score. The comparison continues in the same way until every score has been compared with the score threshold. Because the scores are sorted in descending order, the web page is marked as a web page to be labeled only when the largest score, and therefore every score, does not exceed the threshold.
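The threshold comparison of steps S620 and S630 can be sketched as follows. `assign_types` is a hypothetical name and the scores are made-up values; an empty result corresponds to marking the web page as a web page to be labeled.

```python
def assign_types(scores, threshold):
    """Return the types whose score exceeds the threshold, highest score first.

    scores: {type name: score from that type's classifier function}.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [type_name for type_name, value in ranked if value > threshold]


labels = assign_types({"news": 0.8, "economy": 0.3, "technology": 0.6}, threshold=0.5)
unlabeled = assign_types({"news": 0.1, "economy": 0.2}, threshold=0.5)
```

The first call yields two types, which illustrates that a web page may belong to more than one type; the second yields no type, so the page would be passed on to the clustering stage.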
Present invention also offers a kind of annotation equipment of Web page subject, as shown in Figure 7, is the structural drawing of the annotation equipment of Web page subject according to an embodiment of the invention.
This device comprises:
Obtaining module 710, for obtaining the theme feature vector of a web page based on the title and text of the web page.
Classification module 720, for classifying the theme feature vector using a pre-trained classifier.
Judging module 730, for judging whether there is a type to which the theme feature vector belongs.
Labeling module 740, for labeling the web page with the type to which the theme feature vector belongs when the judging module determines that such a type exists.
Marking module 750, for marking the web page as a web page to be labeled when the judging module determines that no type to which the theme feature vector belongs exists.
Cluster module 760, for clustering a plurality of web pages to be labeled.
Analysis module 770, for analyzing the type of each cluster set.
Labeling module 780, also for labeling each web page to be labeled with the type of the cluster set to which it belongs.
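The way the modules cooperate can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `label_pages`, the `"id"` key, and the four callables standing in for the obtaining, classification/judging, cluster and analysis modules are hypothetical names, and the patent does not prescribe this interface.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def label_pages(
    pages: Sequence[dict],
    build_vector: Callable[[dict], Vector],             # stands in for module 710
    classify: Callable[[Vector], List[str]],            # stands in for modules 720 and 730
    cluster: Callable[[List[dict]], List[List[dict]]],  # stands in for module 760
    analyze: Callable[[List[dict]], str],               # stands in for module 770
) -> Dict[str, List[str]]:
    """Return a mapping from page id to the types assigned to that page."""
    labels: Dict[str, List[str]] = {}
    pending: List[dict] = []
    for page in pages:
        types = classify(build_vector(page))
        if types:
            labels[page["id"]] = types  # role of labeling module 740
        else:
            pending.append(page)        # role of marking module 750
    if pending:
        for group in cluster(pending):
            group_type = analyze(group)
            for page in group:
                labels[page["id"]] = [group_type]  # role of labeling module 780
    return labels
```

Passing the modules in as callables keeps the sketch independent of any particular classifier or clustering algorithm, neither of which is fixed by this passage.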
In one embodiment, as shown in Figure 8, the obtaining module 710 comprises: an extraction unit 711, for extracting the title and the text of the web page respectively; a first construction unit 712, for building a title feature vector according to the title; a second construction unit 713, for building a text feature vector according to the text; and a splicing unit 714, for splicing the title feature vector and the text feature vector into the theme feature vector.
The first construction unit 712 is configured to: perform word segmentation on the title using a pre-built title dictionary to obtain title word segments; map the title word segments into the title dictionary; and weight the title dictionary based on the weight values of the title word segments to construct the title feature vector of the web page.
The second construction unit 713 is configured to: perform word segmentation on the text using a pre-built text dictionary to obtain a plurality of text word segments, and record the order in which each text word segment appears in the text; map the text word segments into the text dictionary respectively; and weight the text dictionary based on the weight value and appearance order of each text word segment to build the text feature vector of the web page.
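A minimal Python sketch of the vector construction and splicing is given below. It assumes the title and text have already been segmented into lists of words. The default weight of 1.0 and the `1 / (1 + position)` decay that favors earlier words are assumptions made for illustration, not the weighting defined by the patent, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence


def title_vector(tokens: Sequence[str], dictionary: List[str],
                 weights: Dict[str, float]) -> List[float]:
    """Map title word segments onto the title dictionary and apply per-word weights."""
    present = set(tokens)
    return [weights.get(word, 1.0) if word in present else 0.0
            for word in dictionary]


def text_vector(tokens: Sequence[str], dictionary: List[str],
                weights: Dict[str, float]) -> List[float]:
    """Map text word segments onto the text dictionary, weighting earlier words more.

    The 1 / (1 + position) decay is an assumed weighting, not the patent's formula.
    """
    index = {word: i for i, word in enumerate(dictionary)}
    vector = [0.0] * len(dictionary)
    for position, token in enumerate(tokens):
        if token in index:
            vector[index[token]] += weights.get(token, 1.0) / (1 + position)
    return vector


def theme_vector(title_vec: List[float], text_vec: List[float]) -> List[float]:
    """Splice (concatenate) the title and text feature vectors."""
    return title_vec + text_vec
```

The resulting theme feature vector has one dimension per title dictionary entry followed by one dimension per text dictionary entry, so title and text evidence stay in separate positions.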
In another embodiment, the classification module 720 is specifically configured to: predefine a plurality of web page types; call the classifier so that the classifier scores the theme feature vector of the web page once for each type; compare the score corresponding to each type with a preset score threshold; and determine each type whose score is greater than the score threshold to be a type to which the theme feature vector belongs; wherein the theme feature vector belongs to one or more types.
In another embodiment, the analysis module 770 is specifically configured to: extract the title and the text of each web page to be labeled in a cluster set respectively; perform word segmentation on all the titles using the pre-built title dictionary to obtain a plurality of title word segments; perform word segmentation on all the texts using the pre-built text dictionary to obtain a plurality of text word segments; and, among the title word segments and the text word segments, take the word segment with the highest frequency of occurrence as the type of the cluster set.
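The frequency-based choice of a cluster type can be sketched in Python as follows. The function name `cluster_type` is hypothetical, and the sketch assumes the titles and texts of the cluster have already been segmented into lists of words.

```python
from collections import Counter
from typing import Iterable, List, Optional


def cluster_type(title_tokens: Iterable[List[str]],
                 text_tokens: Iterable[List[str]]) -> Optional[str]:
    """Return the most frequent word segment across all titles and texts of a cluster."""
    counts: Counter = Counter()
    for tokens in title_tokens:
        counts.update(tokens)
    for tokens in text_tokens:
        counts.update(tokens)
    if not counts:
        return None  # an empty cluster has no type
    return counts.most_common(1)[0][0]
```

When several word segments share the highest count, `Counter.most_common` returns the one encountered first; the passage above does not say how such ties should be resolved.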
The functions of the device described in this embodiment have already been described in the method embodiments shown in Figures 1 to 6. For parts not detailed in the description of this embodiment, refer to the related description in the preceding embodiments, which is not repeated here.
Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions and substitutions are possible; therefore, the scope of the present invention should not be limited to the above embodiments.

Claims (10)

1. A labeling method for web page topics, characterized by comprising:
obtaining a theme feature vector of a web page based on the title and text of the web page;
classifying the theme feature vector using a pre-trained classifier;
judging whether there is a type to which the theme feature vector belongs;
if so, labeling the web page with the type to which the theme feature vector belongs;
if not, marking the web page as a web page to be labeled; and further, clustering a plurality of web pages to be labeled; analyzing the type of each cluster set; and labeling each web page to be labeled with the type of the cluster set to which it belongs.
2. the method for claim 1, is characterized in that, based on title and the text of webpage, obtains the theme feature vector of described webpage, comprising:
Extract the title in webpage and text respectively;
According to described title, build title feature vector;
According to described text, build text proper vector;
Text proper vector described in described title feature vector sum is spliced into described theme feature vector.
3. The method of claim 2, characterized in that building the title feature vector of the web page according to the title comprises:
performing word segmentation on the title using a pre-built title dictionary to obtain title word segments;
mapping the title word segments into the title dictionary;
weighting the title dictionary based on the weight values of the title word segments to construct the title feature vector of the web page.
4. The method of claim 2, characterized in that building the text feature vector of the web page according to the text comprises:
performing word segmentation on the text using a pre-built text dictionary to obtain a plurality of text word segments, and recording the order in which each text word segment appears in the text;
mapping the plurality of text word segments into the text dictionary respectively;
weighting the text dictionary based on the weight value and appearance order of each text word segment to build the text feature vector of the web page.
5. the method for claim 1, is characterized in that, utilizes the sorter that training in advance obtains, and carries out classification process, comprising described theme feature vector:
Pre-defined multiple type of webpage;
Described sorter, for every type, is once marked to the theme feature vector of described webpage;
The scoring score value of the correspondence of every type is compared with the mark threshold value preset respectively;
By type corresponding for the scoring score value being greater than described mark threshold value, be judged to be the type belonging to described theme feature vector; Wherein, the type belonging to described theme feature vector is one or more.
6. the method for claim 1, is characterized in that, analyzes the type of cluster set, comprising:
Extract title and the text of each webpage to be marked in cluster set respectively;
Utilize the title dictionary built in advance, word segmentation processing is carried out to all titles, obtains multiple title participle;
Utilize the text dictionary built in advance, word segmentation processing is carried out to all texts, obtains multiple text participle;
In multiple described title participle and multiple described text participle, obtain the participle that the frequency of occurrences is maximum, using the type as described cluster set.
7. A labeling device for web page topics, characterized by comprising:
an obtaining module, for obtaining a theme feature vector of a web page based on the title and text of the web page;
a classification module, for classifying the theme feature vector using a pre-trained classifier;
a judging module, for judging whether there is a type to which the theme feature vector belongs;
a labeling module, for labeling the web page with the type to which the theme feature vector belongs when the judging module determines that such a type exists;
a marking module, for marking the web page as a web page to be labeled when the judging module determines that no type to which the theme feature vector belongs exists;
a cluster module, for clustering a plurality of web pages to be labeled;
an analysis module, for analyzing the type of each cluster set;
the labeling module being further for labeling each web page to be labeled with the type of the cluster set to which it belongs.
8. The device of claim 7, characterized in that the obtaining module comprises:
an extraction unit, for extracting the title and the text of the web page respectively;
a first construction unit, for building a title feature vector according to the title;
a second construction unit, for building a text feature vector according to the text;
a splicing unit, for splicing the title feature vector and the text feature vector into the theme feature vector.
9. The device of claim 8, characterized in that
the first construction unit is specifically configured to:
perform word segmentation on the title using a pre-built title dictionary to obtain title word segments;
map the title word segments into the title dictionary;
weight the title dictionary based on the weight values of the title word segments to construct the title feature vector of the web page;
the second construction unit is specifically configured to:
perform word segmentation on the text using a pre-built text dictionary to obtain a plurality of text word segments, and record the order in which each text word segment appears in the text;
map the plurality of text word segments into the text dictionary respectively;
weight the text dictionary based on the weight value and appearance order of each text word segment to build the text feature vector of the web page.
10. The device of claim 7, characterized in that
the classification module is specifically configured to:
predefine a plurality of web page types; call the classifier so that the classifier scores the theme feature vector of the web page once for each type;
compare the score corresponding to each type with a preset score threshold;
determine each type whose score is greater than the score threshold to be a type to which the theme feature vector belongs; wherein the theme feature vector belongs to one or more types;
the analysis module is specifically configured to:
extract the title and the text of each web page to be labeled in a cluster set respectively;
perform word segmentation on all the titles using a pre-built title dictionary to obtain a plurality of title word segments;
perform word segmentation on all the texts using a pre-built text dictionary to obtain a plurality of text word segments;
among the plurality of title word segments and the plurality of text word segments, take the word segment with the highest frequency of occurrence as the type of the cluster set.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510266108.XA CN104881458B (en) 2015-05-22 2015-05-22 Labeling method and device for web page topics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510266108.XA CN104881458B (en) 2015-05-22 2015-05-22 Labeling method and device for web page topics

Publications (2)

Publication Number Publication Date
CN104881458A true CN104881458A (en) 2015-09-02
CN104881458B CN104881458B (en) 2019-05-28

Family

ID=53948951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510266108.XA Active CN104881458B (en) 2015-05-22 2015-05-22 Labeling method and device for web page topics

Country Status (1)

Country Link
CN (1) CN104881458B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
US20120191776A1 (en) * 2011-01-20 2012-07-26 Linkedin Corporation Methods and systems for recommending a context based on content interaction
CN103177024A (en) * 2011-12-23 2013-06-26 微梦创科网络科技(中国)有限公司 Method and device of topic information show
CN102831193A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 Topic detecting device and topic detecting method based on distributed multistage cluster
CN103235824A (en) * 2013-05-06 2013-08-07 上海河广信息科技有限公司 Method and system for determining web page texts users interested in according to browsed web pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
程博: "Web文本分类方法研究与系统实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550292A (en) * 2015-12-11 2016-05-04 北京邮电大学 Web page classification method based on von Mises-Fisher probability model
CN105550292B (en) * 2015-12-11 2018-06-08 北京邮电大学 A kind of Web page classification method based on von Mises-Fisher probabilistic models
CN105760526A (en) * 2016-03-01 2016-07-13 网易(杭州)网络有限公司 News classification method and device
CN105760526B (en) * 2016-03-01 2019-05-07 网易(杭州)网络有限公司 A kind of method and apparatus of news category
CN105975573A (en) * 2016-05-04 2016-09-28 北京广利核系统工程有限公司 KNN-based text classification method
CN105975573B (en) * 2016-05-04 2019-08-13 北京广利核系统工程有限公司 A kind of file classification method based on KNN
CN106021418A (en) * 2016-05-13 2016-10-12 北京奇虎科技有限公司 News event clustering method and device
CN106021418B (en) * 2016-05-13 2019-09-06 北京奇虎科技有限公司 The clustering method and device of media event
CN106844328A (en) * 2016-08-23 2017-06-13 华南师范大学 A kind of new extensive document subject matter semantic analysis and system
CN106844328B (en) * 2016-08-23 2020-04-21 华南师范大学 Large-scale document theme semantic analysis method and system
CN107784037A (en) * 2016-08-31 2018-03-09 北京搜狗科技发展有限公司 Information processing method and device, the device for information processing
CN107784037B (en) * 2016-08-31 2022-02-01 北京搜狗科技发展有限公司 Information processing method and device, and device for information processing
CN108090099B (en) * 2016-11-22 2022-02-25 科大讯飞股份有限公司 Text processing method and device
CN108090099A (en) * 2016-11-22 2018-05-29 科大讯飞股份有限公司 A kind of text handling method and device
CN108241662A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The optimization method and device of data mark
CN109471937A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 A kind of file classification method and terminal device based on machine learning
CN109359301A (en) * 2018-10-19 2019-02-19 国家计算机网络与信息安全管理中心 A kind of the various dimensions mask method and device of web page contents
CN109299271A (en) * 2018-10-30 2019-02-01 腾讯科技(深圳)有限公司 Training sample generation, text data, public sentiment event category method and relevant device
CN110287314B (en) * 2019-05-20 2021-08-06 中国科学院计算技术研究所 Long text reliability assessment method and system based on unsupervised clustering
CN110287314A (en) * 2019-05-20 2019-09-27 中国科学院计算技术研究所 Long text credibility evaluation method and system based on Unsupervised clustering

Also Published As

Publication number Publication date
CN104881458B (en) 2019-05-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant