CN104881458B - Method and device for labeling web page subjects - Google Patents


Info

Publication number
CN104881458B
Authority
CN
China
Prior art keywords
text
title
feature vector
webpage
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510266108.XA
Other languages
Chinese (zh)
Other versions
CN104881458A (en)
Inventor
李扬曦
杜翠兰
李睿
佟玲玲
翟羽佳
王晶
刘洋
秦韬
付戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN201510266108.XA
Publication of CN104881458A
Application granted
Publication of CN104881458B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374: Thesaurus
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for labeling web page subjects. The method includes: obtaining a theme feature vector of a webpage based on the title and text of the webpage; classifying the theme feature vector with a classifier obtained by prior training; judging whether a type to which the theme feature vector belongs exists; if so, labeling the webpage with the type to which the theme feature vector belongs; if not, marking the webpage as a webpage to be marked. Further, clustering is performed on multiple webpages to be marked; the type of each cluster set is analyzed; and each webpage to be marked is labeled with the type of the cluster set to which it belongs. By cascading a supervised classification method with an unsupervised clustering method, the invention automatically obtains the subject from a webpage and labels the webpage, effectively improving the efficiency and accuracy of web page subject labeling.

Description

A method and device for labeling web page subjects
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for labeling web page subjects.
Background art
Extracting and labeling web page subjects by analyzing the content of internet webpages is an important foundation for applications such as internet data management and mining. At present, web page subject labeling mostly uses keyword matching: webpages are labeled by matching them against preset web page titles and selected keywords. However, such direct matching is too simplistic. Moreover, if the keywords in a web page title change, this method can no longer label the subject accurately, and the accuracy of webpage labeling cannot be guaranteed. Another approach clusters webpages and extracts keywords from the webpages grouped into the same class to serve as the label for that class. However, clustering algorithms are time-consuming, so this kind of algorithm is impractical when the number of webpages to be labeled is large, and labeling that relies solely on an unsupervised learning algorithm is not very accurate.
Summary of the invention
The present invention provides a method and device for labeling web page subjects, to solve the problem of low accuracy of web page subject labeling in the prior art.
To address the above technical problem, the present invention adopts the following technical solutions.
The present invention provides a method for labeling web page subjects, including: obtaining a theme feature vector of a webpage based on the title and text of the webpage; classifying the theme feature vector with a classifier obtained by prior training; judging whether a type to which the theme feature vector belongs exists; if so, labeling the webpage with the type to which the theme feature vector belongs; if not, marking the webpage as a webpage to be marked. Further, clustering is performed on multiple webpages to be marked; the type of each cluster set is analyzed; and each webpage to be marked is labeled with the type of the cluster set to which it belongs.
Obtaining the theme feature vector of the webpage based on its title and text includes: extracting the title and the text from the webpage separately; constructing a title feature vector from the title; constructing a text feature vector from the text; and splicing the title feature vector and the text feature vector into the theme feature vector.
Constructing the title feature vector from the title includes: performing word segmentation on the title using a pre-built title dictionary to obtain title tokens (segmented words); mapping the title tokens into the title dictionary; and weighting the title dictionary based on the weight values of the title tokens to construct the title feature vector of the webpage.
Constructing the text feature vector from the text includes: performing word segmentation on the text using a pre-built text dictionary to obtain multiple text tokens, and recording the order in which each text token appears in the text; mapping the text tokens into the text dictionary; and weighting the text dictionary based on the weight value and appearance order of each text token to construct the text feature vector of the webpage.
Classifying the theme feature vector with the classifier obtained by prior training includes: predefining multiple webpage types; for each type, having the classifier score the theme feature vector of the webpage once; comparing the score for each type with a preset labeling threshold; and determining each type whose score is greater than the labeling threshold to be a type to which the theme feature vector belongs. The theme feature vector may belong to one or more types.
Analyzing the type of a cluster set includes: extracting the title and text of each webpage to be marked in the cluster set; performing word segmentation on all titles using the pre-built title dictionary to obtain multiple title tokens; performing word segmentation on all texts using the pre-built text dictionary to obtain multiple text tokens; and taking the token that occurs most frequently among the title tokens and text tokens as the type of the cluster set.
The present invention also provides a device for labeling web page subjects, including: an obtaining module, configured to obtain a theme feature vector of a webpage based on the title and text of the webpage; a classification module, configured to classify the theme feature vector with a classifier obtained by prior training; a judgment module, configured to judge whether a type to which the theme feature vector belongs exists; a labeling module, configured to label the webpage with the type to which the theme feature vector belongs when the judgment module determines that such a type exists; a marking module, configured to mark the webpage as a webpage to be marked when the judgment module determines that no such type exists; a clustering module, configured to cluster multiple webpages to be marked; and an analysis module, configured to analyze the type of each cluster set. The labeling module is further configured to label each webpage to be marked with the type of the cluster set to which it belongs.
The obtaining module includes: an extraction unit, configured to extract the title and the text from the webpage separately; a first construction unit, configured to construct a title feature vector from the title; a second construction unit, configured to construct a text feature vector from the text; and a splicing unit, configured to splice the title feature vector and the text feature vector into the theme feature vector.
The first construction unit is specifically configured to: perform word segmentation on the title using a pre-built title dictionary to obtain title tokens (segmented words); map the title tokens into the title dictionary; and weight the title dictionary based on the weight values of the title tokens to construct the title feature vector of the webpage. The second construction unit is specifically configured to: perform word segmentation on the text using a pre-built text dictionary to obtain multiple text tokens, and record the order in which each text token appears in the text; map the text tokens into the text dictionary; and weight the text dictionary based on the weight value and appearance order of each text token to construct the text feature vector of the webpage.
The classification module is specifically configured to: predefine multiple webpage types; call the classifier so that, for each type, the classifier scores the theme feature vector of the webpage once; compare the score for each type with a preset labeling threshold; and determine each type whose score is greater than the labeling threshold to be a type to which the theme feature vector belongs. The theme feature vector may belong to one or more types. The analysis module is specifically configured to: extract the title and text of each webpage to be marked in a cluster set; perform word segmentation on all titles using the pre-built title dictionary to obtain multiple title tokens; perform word segmentation on all texts using the pre-built text dictionary to obtain multiple text tokens; and take the token that occurs most frequently among the title tokens and text tokens as the type of the cluster set.
The beneficial effects of the present invention are as follows:
By cascading a supervised classification method with an unsupervised clustering method, the present invention automatically obtains the subject from a webpage and labels the webpage, effectively improving the efficiency and accuracy of web page subject labeling.
Brief description of the drawings
Fig. 1 is a flow chart of a method for labeling web page subjects according to an embodiment of the invention;
Fig. 2 is a flow chart of a method for labeling web page subjects according to another embodiment of the invention;
Fig. 3 is a flow chart of the steps for constructing the web page title feature vector according to an embodiment of the invention;
Fig. 4 is a flow chart of the steps for constructing the web page text feature vector according to an embodiment of the invention;
Fig. 5 is a schematic diagram of splicing the title feature vector and the text feature vector according to an embodiment of the invention;
Fig. 6 is a flow chart of the steps for classifying the theme feature vector according to an embodiment of the invention;
Fig. 7 is a structure diagram of a device for labeling web page subjects according to an embodiment of the invention;
Fig. 8 is a structure diagram of the obtaining module according to an embodiment of the invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and do not limit it.
This embodiment provides a method for labeling web page subjects. Fig. 1 is a flow chart of the method according to an embodiment of the invention. The steps of this embodiment are executed for each webpage.
Step S110: based on the title and text of the webpage, obtain the theme feature vector of the webpage.
Because the title and the text of a webpage differ in length and wording style, this embodiment extracts the title and the text from the webpage separately; constructs a title feature vector from the title; constructs a text feature vector from the text; and splices the title feature vector and the text feature vector into the theme feature vector of the webpage. Both the title feature vector and the text feature vector contain word vectors that reflect the theme of the webpage.
Constructing the feature vectors separately with different dictionaries describes the web page content more accurately, which in turn improves the accuracy of web page subject labeling.
Step S120: classify the theme feature vector with a classifier obtained by prior training.
The classifier classifies the theme feature vector to determine its type. Because the theme feature vector reflects the web page subject, determining the type of the theme feature vector amounts to determining the type of the webpage. The types include, for example, news, economy, entertainment, and science and technology.
To improve the accuracy of webpage classification, this embodiment uses a supervised classification method: the classifier is obtained by training on a classification annotation system and training data prepared in advance.
The classification annotation system is the set of predefined webpage types, for example: news, economy, entertainment, and science and technology. The training data consists of multiple webpages whose types have been labeled according to the classification annotation system. Based on the classification annotation system and the training data, the classifier is trained using a support vector machine.
Step S130: judge whether a type to which the theme feature vector belongs exists. If so, execute step S140; if not, execute step S150.
Whether such a type exists is judged from the classification result of the classifier. If a type to which the theme feature vector belongs exists, the classification result is that type; if no such type exists, the classification result is a null value.
Step S140: label the webpage with the type to which the theme feature vector belongs.
Step S150: mark the webpage as a webpage to be marked.
A webpage whose type the classifier can determine is labeled with the corresponding type. A webpage whose type the classifier cannot determine is placed in the set of webpages to be marked and handled by the subsequent method, to ensure the accuracy of webpage labeling.
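The branching in steps S130 to S150 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `route_webpage`, the toy classifier, and the use of an empty list to stand in for the null classification result are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def route_webpage(
    theme_vector: Sequence[float],
    classify: Callable[[Sequence[float]], List[str]],
    pending: List[Sequence[float]],
) -> List[str]:
    """Label a page with its type(s), or queue it as a webpage to be marked."""
    types = classify(theme_vector)   # S120: an empty list stands in for the null result
    if types:                        # S130: a type exists
        return types                 # S140: label the page with the type(s)
    pending.append(theme_vector)     # S150: mark the page as "to be marked"
    return []


def toy_classify(vector: Sequence[float]) -> List[str]:
    """Hypothetical classifier: calls a page 'news' when its components sum above 1."""
    return ["news"] if sum(vector) > 1.0 else []


pending_pages: List[Sequence[float]] = []
labels_a = route_webpage([0.9, 0.8], toy_classify, pending_pages)
labels_b = route_webpage([0.1, 0.2], toy_classify, pending_pages)
```

Here the first page is labeled directly, while the second ends up in `pending_pages` for the clustering stage.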
Fig. 2 is a flow chart of a method for labeling web page subjects according to another embodiment of the invention. This embodiment processes the webpages to be marked.
Step S210: cluster multiple webpages to be marked.
At every preset time period, the number of webpages marked as webpages to be marked is determined. If the number is greater than a preset quantity threshold, the webpages to be marked are clustered; if the number is less than or equal to the quantity threshold, the number is determined again after another preset time period.
This embodiment uses an unsupervised clustering method. When clustering, a preset similarity algorithm (for example, the kmeans algorithm) is used to compute the pairwise similarity between the webpages to be marked, and two webpages to be marked whose similarity is greater than a preset similarity threshold are placed in the same cluster set.
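The pairwise-similarity grouping described above can be sketched as follows. The example makes several assumptions that the text does not state: cosine similarity is used as the similarity measure, pages are represented by their theme feature vectors, and pages are merged transitively with a small union-find structure. It illustrates threshold-based grouping only and is not the kmeans algorithm mentioned in the text.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def threshold_clusters(vectors: List[Sequence[float]], threshold: float) -> List[List[int]]:
    """Put two pages in the same cluster set when their similarity exceeds the threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(i)] = find(j)  # merge the two cluster sets

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = threshold_clusters(pages, threshold=0.8)
```

With these toy vectors, pages 0 and 1 form one cluster set and pages 2 and 3 form another.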
Step S220: analyze the type of each cluster set.
The Canopy algorithm can be used to analyze the type of each cluster set.
In one embodiment, the following steps can be executed for each cluster set: extract the title and text of each webpage to be marked in the cluster set; perform word segmentation on all titles using the title dictionary to obtain multiple title tokens (segmented words); perform word segmentation on all texts using the text dictionary to obtain multiple text tokens; and take the token that occurs most frequently among the title tokens and text tokens as the type of the cluster set. The most frequent token may be a title token or a text token.
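Choosing the most frequent segmented word as the type of a cluster set can be sketched with a simple counter. The function name `cluster_type` and the sample words are illustrative assumptions; the segmentation itself is taken as already done.

```python
from collections import Counter
from typing import Iterable, List


def cluster_type(title_tokens: Iterable[List[str]], text_tokens: Iterable[List[str]]) -> str:
    """Return the most frequent segmented word across all titles and texts of a cluster set."""
    counts: Counter = Counter()
    for tokens in title_tokens:
        counts.update(tokens)
    for tokens in text_tokens:
        counts.update(tokens)
    if not counts:
        raise ValueError("cluster set contains no segmented words")
    return counts.most_common(1)[0][0]


titles = [["football", "match"], ["football", "league"]]
texts = [["goal", "football", "team"], ["team", "coach"]]
type_of_cluster = cluster_type(titles, texts)
```

In this toy cluster set, "football" occurs three times and becomes the type.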
Step S230: label each webpage to be marked with the type of the cluster set to which it belongs.
In other words, whatever the type of a cluster set is, the webpages to be marked in that cluster set have that type and are labeled accordingly.
In one embodiment, the classifier is retrained at regular intervals using the clustering results, to improve classification accuracy. Further, after labeling is completed, the new types obtained through clustering and the webpages of those types can be added to the classification annotation system and the training data, so that the classifier can also be trained on the new types and their webpages.
Determining the type of a webpage by combining a classifier with clustering improves the accuracy and performance of webpage labeling.
For step S110:
Fig. 3 is a flow chart of the steps for constructing the web page title feature vector according to an embodiment of the invention.
Step S310: construct the title dictionary in advance.
Step 1: collect web page titles to form a title corpus.
Step 2: perform word segmentation on the title texts in the title corpus, and retain only the words in the segmentation result that meet the requirements, for example, words that carry practical meaning. A preset segmentation algorithm can be used; a segmentation algorithm generally includes a dictionary, according to which the title text is divided into one or more words.
Step 3: calculate the IDF (Inverse Document Frequency) value of each retained word, and form the title dictionary from the words whose IDF value is greater than a preset first IDF threshold. The larger a word's IDF value, the more representative the word; the smaller the IDF value, the less representative.
The IDF value of a word w is calculated as follows:
IDF(w) = log(N / n_d)    (1.1)
In formula (1.1), N is the number of titles collected in the entire corpus, and n_d is the number of titles in which the word w appears. log denotes the logarithm; its base is 10 or e, chosen according to specific needs.
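The dictionary construction in steps 1 to 3 can be sketched as follows, assuming the IDF takes the standard form IDF(w) = log(N / n_d), which matches the symbol definitions above, and using the natural logarithm. The function names, the toy corpus, and the threshold value 0.5 are assumptions for illustration; the corpus is taken as already segmented.

```python
import math
from typing import Dict, List


def idf_values(titles: List[List[str]]) -> Dict[str, float]:
    """IDF(w) = log(N / n_d): N titles in the corpus, n_d titles containing w."""
    n = len(titles)
    doc_freq: Dict[str, int] = {}
    for tokens in titles:
        for word in set(tokens):  # count each title at most once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n / n_d) for word, n_d in doc_freq.items()}


def build_dictionary(titles: List[List[str]], idf_threshold: float) -> List[str]:
    """Keep the words whose IDF value is greater than the threshold."""
    idf = idf_values(titles)
    return sorted(word for word, value in idf.items() if value > idf_threshold)


corpus = [
    ["stock", "market", "news"],
    ["football", "news"],
    ["market", "news"],
    ["news", "film"],
]
idf = idf_values(corpus)
title_dictionary = build_dictionary(corpus, idf_threshold=0.5)
```

The word "news" appears in every title, so its IDF is 0 and it is excluded, which matches the remark that words with a small IDF value are less representative.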
Step S320: perform word segmentation on the title using the title dictionary to obtain title tokens (segmented words).
Using the words in the title dictionary, word segmentation is performed on the title to obtain one or more title tokens.
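The text does not say which dictionary-based segmentation algorithm is used. As one possible illustration, the sketch below uses forward maximum matching, a common dictionary-based approach; this choice, the function name `segment`, and the unspaced sample string are assumptions.

```python
from typing import List, Set


def segment(title: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=0)
    tokens: List[str] = []
    i = 0
    while i < len(title):
        for size in range(min(max_len, len(title) - i), 0, -1):
            candidate = title[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return tokens


words = {"stock", "market", "stockmarket", "news"}
title_tokens = segment("stockmarketxnews", words)
```

Characters that match no dictionary word, such as the stray "x" here, are dropped, so only dictionary words survive as segmented words.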
Step S330: map the title tokens into the title dictionary.
Each title token is mapped into the title dictionary. Specifically, the title dictionary contains multiple words, and a mapping relation is established between each title token and the identical word in the title dictionary.
After the mapping relations are established, a vector whose length equals the length of the title dictionary is obtained: the number of dimensions of the vector equals the number of words in the title dictionary, and each dimension corresponds to one word in the dictionary.
Step S340: based on the weight values of the title tokens, weight the title dictionary to construct the title feature vector of the webpage.
Weighting the title dictionary means weighting the above vector whose length equals the length of the title dictionary. For each word in the title dictionary that has a mapping relation with a title token, the corresponding dimension of the vector is weighted with the word's TFIDF (term frequency-inverse document frequency) value; the vector obtained after weighting is the title feature vector. TFIDF is a common weighting technique for information retrieval and data mining.
During weighting, the value of each dimension of the vector is the TFIDF value, in the title, of the word corresponding to that dimension. The TFIDF value of a word w is calculated as follows:
TFIDF(w) = TF(w) * IDF(w) = (c_w / c) * IDF(w)    (1.2)
In formula (1.2), the IDF value is calculated as in formula (1.1); the TF value is the frequency with which the word w appears in the current title; c_w is the number of times the word w appears in the current title; and c is the number of words (tokens) in the current title.
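Steps S330 and S340 can be sketched together as follows, assuming TFIDF(w) = (c_w / c) * IDF(w) as implied by the symbol definitions above. The function name, the sample dictionary, and the IDF values are made up for the example.

```python
from typing import Dict, List


def title_feature_vector(
    title_tokens: List[str], dictionary: List[str], idf: Dict[str, float]
) -> List[float]:
    """One dimension per dictionary word; the value is the word's TFIDF in this title."""
    c = len(title_tokens)                     # number of segmented words in the title
    vector: List[float] = []
    for word in dictionary:
        c_w = title_tokens.count(word)        # occurrences of the word in the title
        tf = c_w / c if c else 0.0
        vector.append(tf * idf.get(word, 0.0))
    return vector


sample_dictionary = ["film", "football", "market", "stock"]
sample_idf = {"film": 2.0, "football": 2.0, "market": 1.0, "stock": 2.0}
title_vector = title_feature_vector(["stock", "market"], sample_dictionary, sample_idf)
```

Dictionary words that do not occur in the title keep the value 0, so the vector always has the same length as the dictionary.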
Fig. 4 is a flow chart of the steps for constructing the web page text feature vector according to an embodiment of the invention.
Step S410: construct the text dictionary in advance.
Body texts are collected to form a text corpus; word segmentation is performed on the body texts in the text corpus, and only the words in the segmentation result that meet the requirements are retained, for example, words that carry practical meaning; the IDF value of each retained word is calculated; and the text dictionary is formed from the words whose IDF value is greater than a preset second IDF threshold. The text dictionary is built in the same way as the title dictionary. For the calculation of the IDF value, refer to formula (1.1).
Step S420: perform word segmentation on the text using the constructed text dictionary to obtain multiple text tokens (segmented words), and record the order in which each text token appears in the text.
Using the words in the text dictionary, the text is segmented. Following the text from front to back, the appearance order of each token (word) is recorded: the first token to appear is recorded as 1, the second as 2, and so on; repeated occurrences of a token are not recorded again.
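Recording the appearance order can be sketched as follows; the function name `appearance_order` and the sample words are illustrative.

```python
from typing import Dict, List


def appearance_order(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct segmented word to the order of its first appearance (1-based)."""
    order: Dict[str, int] = {}
    for token in tokens:
        if token not in order:            # repeated words are not recorded again
            order[token] = len(order) + 1
    return order


ranks = appearance_order(["market", "stock", "market", "news"])
```

The second "market" is ignored, so "news" receives order 3 rather than 4.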
Step S430: map the text tokens into the text dictionary.
Web page text tends to open with a brief passage that highlights the theme and attracts attention; that is, important words tend to appear near the beginning of the text.
The text dictionary contains multiple words, and a mapping relation is established between each text token and the identical word in the text dictionary.
After the mapping relations are established, a vector whose length equals the length of the text dictionary is obtained: the number of dimensions of the vector equals the number of words in the text dictionary, and each dimension corresponds to one word in the dictionary.
Step S440: based on the weight value and appearance order of each text token, weight the text dictionary to construct the text feature vector of the webpage.
Weighting the text dictionary means weighting the above vector whose length equals the length of the text dictionary. For each word in the text dictionary that has a mapping relation with a text token, the corresponding dimension of the vector is weighted using the word's TFIDF value and the appearance order of the mapped text token; the vector obtained after weighting is the text feature vector. Each dimension of the text feature vector corresponds to one word in the dictionary, and the value of each dimension is the weight value weight_zw obtained from the appearance order of the corresponding word in the text and the TFIDF value of that word:
In formula (1.3), weight_zw(w) is the weight value (dimension value) of the word w in the text feature vector; rank(w) is the sequence number of the appearance of w in the text; Σ_{w∈W} rank(w) is the sum of the sequence numbers of all words; and TFIDF(w) is calculated as in formula (1.2), with the title-related descriptions replaced by the corresponding text-related descriptions. The text feature vector can be obtained with the above method. Formula (1.3) uses the same symbol w for a word as formula (1.2), purely to make the calculation of TFIDF(w) in formula (1.3) easier to follow.
In general, a title states the content and theme of a webpage in a brief sentence, so the title is short and the text is long. Accordingly, the title feature vector is usually shorter than the text feature vector, yet more important. This embodiment therefore proposes splicing the title feature vector and the text feature vector in a weighted manner into a feature vector that expresses the web page subject, namely the theme feature vector, using the splicing method shown in Fig. 5. This avoids a bias in learning caused by an imbalance between the contributions of the title feature vector and the text feature vector.
Before splicing, the dimension value TFIDF(w) of each word w in the title feature vector is weighted with the title weight w_bt, that is:
weight_bt(w) = w_bt * TFIDF(w)    (1.4)
Before splicing, the dimension values of the words in the text feature vector are not weighted.
During splicing, the weighted title feature vector and the unweighted text feature vector are joined end to end, forming a vector whose length equals the sum of the lengths of the title feature vector and the text feature vector, with the weighted title feature vector placed before the unweighted text feature vector.
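The weighted end-to-end splicing of formula (1.4) can be sketched as follows. The function name `splice` and the sample vectors are assumptions; the value of the title weight is simply passed in.

```python
from typing import List, Sequence


def splice(title_vec: Sequence[float], text_vec: Sequence[float], w_bt: float) -> List[float]:
    """Scale the title feature vector by the title weight, then append the text feature vector."""
    weighted_title = [w_bt * value for value in title_vec]  # formula (1.4)
    return weighted_title + list(text_vec)                  # title part first, text part second


theme_vector = splice([0.5, 1.0], [0.2, 0.0, 0.4], w_bt=2.0)
```

The result has length 2 + 3 = 5, with the weighted title dimensions ahead of the unweighted text dimensions.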
This embodiment obtains w_bt by grid search; for the range from which w_bt is selected, refer to formula (1.5). For each candidate w_bt, the classifier performs cross-validation on the training data and the classification accuracy is calculated; the w_bt with the highest accuracy is taken as the final value of w_bt.
In formula (1.5), N_bt is the dimension of the title feature vector and N_zw is the dimension of the text feature vector.
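The grid search over the title weight can be sketched generically. Because the candidate range of formula (1.5) is not available here, the candidates are passed in by the caller, and the cross-validation step is represented by a callback; `cv_accuracy` and the toy accuracy function are hypothetical stand-ins, not the patent's classifier.

```python
from typing import Callable, Iterable, Tuple


def grid_search_title_weight(
    candidates: Iterable[float], cv_accuracy: Callable[[float], float]
) -> Tuple[float, float]:
    """Return the candidate title weight with the highest cross-validated accuracy."""
    best_weight, best_accuracy = None, float("-inf")
    for w_bt in candidates:
        accuracy = cv_accuracy(w_bt)   # cross-validate the classifier with this weight
        if accuracy > best_accuracy:
            best_weight, best_accuracy = w_bt, accuracy
    if best_weight is None:
        raise ValueError("no candidate weights supplied")
    return best_weight, best_accuracy


def toy_accuracy(w_bt: float) -> float:
    """Hypothetical accuracy curve that peaks at w_bt = 3."""
    return 1.0 - abs(w_bt - 3.0) / 10.0


best_w, best_acc = grid_search_title_weight([1.0, 2.0, 3.0, 4.0, 5.0], toy_accuracy)
```

In practice `cv_accuracy` would rebuild the theme feature vectors with the given weight and run cross-validation on the training data.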
For step S120, specifically:
Fig. 6 is a flow chart of the steps for classifying the theme feature vector according to an embodiment of the invention.
Step S610: for each type, the classifier scores the theme feature vector of the webpage once.
For each type, the theme feature vector of the webpage has one score; that is, if there are multiple types, there are multiple scores. A score measures whether the webpage matches the type to which the score corresponds.
The classifier includes multiple classifier functions, each corresponding to one type; substituting the theme feature vector into each classifier function yields the score for each type.
For example, let a = [a1, a2, a3] be the classifier and y = a1*x1 + a2*x2 + a3*x3 be the classifier function for the news type (other types have their own classifier functions). Substituting the theme feature vector into the news classifier function gives the value y, i.e., the score. A score greater than 0 indicates that the corresponding webpage is news; otherwise it is not. Suppose a = [1, -2, 3]; substituting the three-dimensional feature vector x = [1, 2, 3] into the news classifier function gives y = 1*1 + (-2)*2 + 3*3 = 6. Since y > 0, the webpage corresponding to x = [1, 2, 3] is a news webpage.
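The worked example above is a plain dot product and can be checked in a few lines; the function name `score` is illustrative.

```python
from typing import Sequence


def score(a: Sequence[float], x: Sequence[float]) -> float:
    """Linear classifier function: y = a1*x1 + a2*x2 + ... + an*xn."""
    return sum(a_i * x_i for a_i, x_i in zip(a, x))


news_coefficients = [1, -2, 3]
feature_vector = [1, 2, 3]
y = score(news_coefficients, feature_vector)   # 1 - 4 + 9
is_news = y > 0
```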
Step S620: compare the score for each type with a preset labeling threshold.
Step S630: determine each type whose score is greater than the labeling threshold to be a type to which the theme feature vector belongs. The theme feature vector may belong to one or more types.
Specifically, the scores can be sorted in descending order. First, judge whether the largest score is greater than the preset labeling threshold; if so, label the webpage with the type corresponding to the largest score; if not, mark the webpage as a webpage to be marked. Then judge whether the second-largest score is greater than the preset labeling threshold; if so, label the webpage with the type corresponding to the second-largest score; if not, mark the webpage as a webpage to be marked. Continue in this way until every score has been compared with the labeling threshold.
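Steps S620 and S630 can be sketched as a filter over the per-type scores. The function name `assign_types`, the sample scores, and the threshold 0.5 are assumptions; an empty result corresponds to a webpage to be marked.

```python
from typing import Dict, List


def assign_types(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return every type whose score exceeds the labeling threshold, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [page_type for page_type, value in ranked if value > threshold]


page_scores = {"news": 0.9, "economy": 0.4, "entertainment": -0.2, "technology": 0.6}
page_types = assign_types(page_scores, threshold=0.5)
unlabeled = assign_types({"news": 0.1, "economy": 0.2}, threshold=0.5)
```

The first page receives two types, which matches the statement that a theme feature vector may belong to one or more types; the second page receives none and would be marked as a webpage to be marked.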
The present invention also provides a kind of annotation equipments of Web page subject, as shown in fig. 7, for according to one embodiment of the invention The structure chart of the annotation equipment of Web page subject.
The device includes:
An obtaining module 710, configured to obtain the theme feature vector of a web page based on the title and text of the web page.
A classification module 720, configured to classify the theme feature vector using a classifier obtained by training in advance.
A judgment module 730, configured to judge whether there is a type to which the theme feature vector belongs.
A labeling module 740, configured to label the web page with the type to which the theme feature vector belongs when the judgment module determines that such a type exists.
A marking module 750, configured to mark the web page as a web page to be labeled when the judgment module determines that no such type exists.
A clustering module 760, configured to cluster a plurality of web pages to be labeled.
An analysis module 770, configured to analyze the type of each cluster set.
The labeling module 780 is further configured to label each web page to be labeled with the type of the cluster set to which it belongs.
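This passage does not name a clustering algorithm. As one possible illustration of how the clustering module 760 might group web pages to be labeled, the sketch below uses a simple single-pass scheme with cosine similarity over theme feature vectors. The choice of algorithm, the function names, and the `min_similarity` value are assumptions, not part of the described device.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(vectors: List[Sequence[float]],
                        min_similarity: float = 0.8) -> List[List[int]]:
    """Assign each vector to the first cluster whose representative is similar enough.

    Each cluster is a list of indices into `vectors`; its first member
    serves as the representative. A vector that matches no existing
    cluster starts a new one.
    """
    clusters: List[List[int]] = []
    for idx, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= min_similarity:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters
```

Each resulting index list plays the role of one cluster set, whose type the analysis module 770 would then determine.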
In one embodiment, as shown in Fig. 8, the obtaining module 710 includes: an extraction unit 711, configured to extract the title and the text from the web page respectively; a first construction unit 712, configured to construct a title feature vector according to the title; a second construction unit 713, configured to construct a text feature vector according to the text; and a splicing unit 714, configured to splice the title feature vector and the text feature vector into the theme feature vector.
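The splicing step can be illustrated as follows. The claims state that the two vectors are spliced in a weighted manner but this passage gives no weights, so the scaling scheme and the default values of `title_weight` and `text_weight` below are assumptions made only for this sketch.

```python
from typing import List, Sequence


def splice(title_vec: Sequence[float],
           text_vec: Sequence[float],
           title_weight: float = 0.6,
           text_weight: float = 0.4) -> List[float]:
    """Scale each part by its weight, then concatenate title part followed by text part."""
    weighted_title = [title_weight * v for v in title_vec]
    weighted_text = [text_weight * v for v in text_vec]
    return weighted_title + weighted_text
```

The resulting theme feature vector has one dimension per title-dictionary entry followed by one per text-dictionary entry, so its length is the sum of the two input lengths.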
The first construction unit 712 is configured to: perform word segmentation on the title using a title dictionary constructed in advance, to obtain title word segments; map the title word segments into the title dictionary; and, based on the weight values of the title word segments, weight the title dictionary to construct the title feature vector of the web page.
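A minimal sketch of this construction is given below. The text does not specify the segmentation algorithm, so forward maximum matching against the dictionary is used here as a stand-in; the presence-times-weight vector layout, the default weight of 1.0, and all names are likewise assumptions.

```python
from typing import Dict, List


def segment(text: str, dictionary: List[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word."""
    vocab = set(dictionary)
    max_len = max((len(w) for w in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return tokens


def title_feature_vector(title: str,
                         dictionary: List[str],
                         weights: Dict[str, float]) -> List[float]:
    """One dimension per dictionary entry: the word's weight if it occurs, else 0.0."""
    present = set(segment(title, dictionary))
    return [weights.get(w, 1.0) if w in present else 0.0 for w in dictionary]
```

For example, with the hypothetical dictionary `["football", "match", "stock"]`, the unspaced title `"footballmatchlive"` is segmented into `football` and `match`, and only those two dimensions of the vector are non-zero.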
The second construction unit 713 is configured to: perform word segmentation on the text using a text dictionary constructed in advance, to obtain a plurality of text word segments, and record the order in which each text word segment appears in the text; map the text word segments into the text dictionary; and, based on the weight value and order of appearance of each text word segment, weight the text dictionary to construct the text feature vector of the web page.
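The text feature vector differs from the title feature vector in that the order of appearance also contributes. The passage does not say how order enters the weighting, so the sketch below assumes, purely for illustration, that a word's weight is divided by one plus the position of its first occurrence, so that earlier words count more. The input is taken to be already segmented, and all names are hypothetical.

```python
from typing import Dict, List


def text_feature_vector(tokens: List[str],
                        dictionary: List[str],
                        weights: Dict[str, float]) -> List[float]:
    """Weight each dictionary word by its weight value and its first position.

    `tokens` are the text word segments in order of appearance.
    The 1 / (1 + position) decay is an assumption of this sketch.
    """
    first_pos: Dict[str, int] = {}
    for pos, tok in enumerate(tokens):
        first_pos.setdefault(tok, pos)  # keep only the first occurrence

    vector: List[float] = []
    for word in dictionary:
        if word in first_pos:
            vector.append(weights.get(word, 1.0) / (1 + first_pos[word]))
        else:
            vector.append(0.0)
    return vector
```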
In another embodiment, the classification module 720 is specifically configured to: pre-define a plurality of web page types; call the classifier so that the classifier scores the theme feature vector of the web page once for each type; compare the score corresponding to each type with the preset labeling threshold; and determine the type corresponding to each score greater than the labeling threshold to be a type to which the theme feature vector belongs, wherein the theme feature vector may belong to one or more types.
In another embodiment, the analysis module 770 is specifically configured to: extract the title and text of each web page to be labeled in a cluster set respectively; perform word segmentation on all the titles using the title dictionary constructed in advance, to obtain a plurality of title word segments; perform word segmentation on all the texts using the text dictionary constructed in advance, to obtain a plurality of text word segments; and obtain, from among the title word segments and the text word segments, the word segment that occurs most frequently, and use it as the type of the cluster set.
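Choosing the most frequent word segment as the cluster type amounts to a frequency count over the pooled title and text word segments. The sketch below assumes the segmentation has already been done and that each list pools the segments of every page in the cluster set; the function name and the `None` return for an empty cluster are assumptions.

```python
from collections import Counter
from typing import List, Optional


def cluster_type(title_tokens: List[str], text_tokens: List[str]) -> Optional[str]:
    """Return the word segment that occurs most often across titles and texts."""
    counts = Counter(title_tokens)
    counts.update(text_tokens)
    if not counts:
        return None  # empty cluster set: nothing to name it after
    return counts.most_common(1)[0][0]
```

If two word segments are tied for the highest count, `most_common` returns the one that was counted first; the source does not say how ties should be resolved.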
The functions of the device described in this embodiment have already been described in the method embodiments shown in Figs. 1 to 6. For details not elaborated in the description of this embodiment, refer to the related descriptions in the foregoing embodiments, which are not repeated here.
Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions, and substitutions are also possible; therefore, the scope of the present invention should not be limited to the above embodiments.

Claims (7)

1. A method for labeling the topic of a web page, characterized by comprising:
extracting the title and the text from a web page respectively;
constructing a title feature vector according to the title;
constructing a text feature vector according to the text;
splicing the title feature vector and the text feature vector into a theme feature vector in a weighted manner;
classifying the theme feature vector using a classifier obtained by training in advance;
judging whether there is a type to which the theme feature vector belongs;
if so, labeling the web page with the type to which the theme feature vector belongs;
if not, marking the web page as a web page to be labeled; further, clustering a plurality of web pages to be labeled; analyzing the type of each cluster set; and labeling each web page to be labeled with the type of the cluster set to which it belongs;
wherein analyzing the type of a cluster set comprises:
extracting the title and text of each web page to be labeled in the cluster set respectively;
performing word segmentation on all the titles using a title dictionary constructed in advance, to obtain a plurality of title word segments;
performing word segmentation on all the texts using a text dictionary constructed in advance, to obtain a plurality of text word segments;
obtaining, from among the plurality of title word segments and the plurality of text word segments, the word segment that occurs most frequently, and using it as the type of the cluster set.
2. The method according to claim 1, characterized in that constructing the title feature vector of the web page according to the title comprises:
performing word segmentation on the title using a title dictionary constructed in advance, to obtain title word segments;
mapping the title word segments into the title dictionary;
weighting the title dictionary based on the weight values of the title word segments, to construct the title feature vector of the web page.
3. The method according to claim 1, characterized in that constructing the text feature vector of the web page according to the text comprises:
performing word segmentation on the text using a text dictionary constructed in advance, to obtain a plurality of text word segments, and recording the order in which each text word segment appears in the text;
mapping the plurality of text word segments into the text dictionary respectively;
weighting the text dictionary based on the weight value and order of appearance of each text word segment, to construct the text feature vector of the web page.
4. The method according to claim 1, characterized in that classifying the theme feature vector using the classifier obtained by training in advance comprises:
pre-defining a plurality of web page types;
scoring, by the classifier, the theme feature vector of the web page once for each type;
comparing the score corresponding to each type with a preset labeling threshold respectively;
determining the type corresponding to each score greater than the labeling threshold to be a type to which the theme feature vector belongs, wherein the theme feature vector belongs to one or more types.
5. A device for labeling the topic of a web page, characterized by comprising:
an obtaining module, configured to obtain a theme feature vector of a web page based on the title and text of the web page;
wherein the obtaining module comprises:
an extraction unit, configured to extract the title and the text from the web page respectively;
a first construction unit, configured to construct a title feature vector according to the title;
a second construction unit, configured to construct a text feature vector according to the text;
a splicing unit, configured to splice the title feature vector and the text feature vector into the theme feature vector in a weighted manner;
a classification module, configured to classify the theme feature vector using a classifier obtained by training in advance;
a judgment module, configured to judge whether there is a type to which the theme feature vector belongs;
a labeling module, configured to label the web page with the type to which the theme feature vector belongs when the judgment module determines that such a type exists;
a marking module, configured to mark the web page as a web page to be labeled when the judgment module determines that there is no type to which the theme feature vector belongs;
a clustering module, configured to cluster a plurality of web pages to be labeled;
an analysis module, configured to analyze the type of each cluster set;
wherein the labeling module is further configured to label each web page to be labeled with the type of the cluster set to which it belongs;
and the analysis module is specifically configured to:
extract the title and text of each web page to be labeled in a cluster set respectively;
perform word segmentation on all the titles using a title dictionary constructed in advance, to obtain a plurality of title word segments;
perform word segmentation on all the texts using a text dictionary constructed in advance, to obtain a plurality of text word segments;
obtain, from among the plurality of title word segments and the plurality of text word segments, the word segment that occurs most frequently, and use it as the type of the cluster set.
6. The device according to claim 5, characterized in that
the first construction unit is specifically configured to:
perform word segmentation on the title using a title dictionary constructed in advance, to obtain title word segments;
map the title word segments into the title dictionary;
weight the title dictionary based on the weight values of the title word segments, to construct the title feature vector of the web page;
and the second construction unit is specifically configured to:
perform word segmentation on the text using a text dictionary constructed in advance, to obtain a plurality of text word segments, and record the order in which each text word segment appears in the text;
map the plurality of text word segments into the text dictionary respectively;
weight the text dictionary based on the weight value and order of appearance of each text word segment, to construct the text feature vector of the web page.
7. The device according to claim 5, characterized in that
the classification module is specifically configured to:
pre-define a plurality of web page types; call the classifier so that the classifier scores the theme feature vector of the web page once for each type;
compare the score corresponding to each type with a preset labeling threshold respectively;
determine the type corresponding to each score greater than the labeling threshold to be a type to which the theme feature vector belongs, wherein the theme feature vector belongs to one or more types.
CN201510266108.XA 2015-05-22 2015-05-22 A kind of mask method and device of Web page subject Active CN104881458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510266108.XA CN104881458B (en) 2015-05-22 2015-05-22 A kind of mask method and device of Web page subject

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510266108.XA CN104881458B (en) 2015-05-22 2015-05-22 A kind of mask method and device of Web page subject

Publications (2)

Publication Number Publication Date
CN104881458A CN104881458A (en) 2015-09-02
CN104881458B true CN104881458B (en) 2019-05-28

Family

ID=53948951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510266108.XA Active CN104881458B (en) 2015-05-22 2015-05-22 A kind of mask method and device of Web page subject

Country Status (1)

Country Link
CN (1) CN104881458B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550292B (en) * 2015-12-11 2018-06-08 北京邮电大学 A kind of Web page classification method based on von Mises-Fisher probabilistic models
CN105760526B (en) * 2016-03-01 2019-05-07 网易(杭州)网络有限公司 A kind of method and apparatus of news category
CN105975573B (en) * 2016-05-04 2019-08-13 北京广利核系统工程有限公司 A kind of file classification method based on KNN
CN106021418B (en) * 2016-05-13 2019-09-06 北京奇虎科技有限公司 The clustering method and device of media event
CN106844328B (en) * 2016-08-23 2020-04-21 华南师范大学 Large-scale document theme semantic analysis method and system
CN107784037B (en) * 2016-08-31 2022-02-01 北京搜狗科技发展有限公司 Information processing method and device, and device for information processing
CN108090099B (en) * 2016-11-22 2022-02-25 科大讯飞股份有限公司 Text processing method and device
CN108241662B (en) * 2016-12-23 2021-12-28 北京国双科技有限公司 Data annotation optimization method and device
CN109471937A (en) * 2018-10-11 2019-03-15 平安科技(深圳)有限公司 A kind of file classification method and terminal device based on machine learning
CN109359301A (en) * 2018-10-19 2019-02-19 国家计算机网络与信息安全管理中心 A kind of the various dimensions mask method and device of web page contents
CN109299271B (en) * 2018-10-30 2022-04-05 腾讯科技(深圳)有限公司 Training sample generation method, text data method, public opinion event classification method and related equipment
CN110287314B (en) * 2019-05-20 2021-08-06 中国科学院计算技术研究所 Long text reliability assessment method and system based on unsupervised clustering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
CN102831193A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 Topic detecting device and topic detecting method based on distributed multistage cluster
CN103177024A (en) * 2011-12-23 2013-06-26 微梦创科网络科技(中国)有限公司 Method and device of topic information show
CN103235824A (en) * 2013-05-06 2013-08-07 上海河广信息科技有限公司 Method and system for determining web page texts users interested in according to browsed web pages

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9172762B2 (en) * 2011-01-20 2015-10-27 Linkedin Corporation Methods and systems for recommending a context based on content interaction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
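"Web文本分类方法研究与系统实现" (Research and System Implementation of Web Text Classification Methods); 程博 (Cheng Bo); 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology); 20110415 (No. 4); main text p. 39, Section 5.1.3, and pp. 41-43, Sections 5.2-5.3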

Also Published As

Publication number Publication date
CN104881458A (en) 2015-09-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant