CN104881458B - Method and device for annotating web page topics - Google Patents
- Publication number
- CN104881458B (application CN201510266108.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- title
- feature vector
- webpage
- type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and device for annotating web page topics. The method includes: obtaining a theme feature vector of a web page based on the title and text of the web page; classifying the theme feature vector with a pre-trained classifier; determining whether a type to which the theme feature vector belongs exists; if so, labeling the web page with that type; if not, marking the web page as a web page to be marked. Further, the method clusters multiple web pages to be marked, analyzes the type of each cluster set, and labels each web page to be marked with the type of the cluster set to which it belongs. By cascading a supervised classification method with an unsupervised clustering method, the invention automatically obtains topics from web pages and annotates the pages, effectively improving the efficiency and accuracy of web page topic annotation.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for annotating web page topics.
Background
Extracting and annotating web page topics by analyzing the content of Internet web pages is an important foundation for applications such as Internet data management and mining. At present, web page topic annotation mostly relies on keyword matching: preset keywords are matched against the web page title and part of the page content to annotate the page. This direct matching is too simple. Moreover, if the keywords in a web page title change, the method can no longer annotate the topic accurately, so the accuracy of web page annotation cannot be guaranteed. Another approach clusters the web pages and extracts keywords from the pages grouped into the same class to serve as the annotation for that class. However, clustering algorithms are time-consuming, so this approach is impractical when many web pages need to be annotated, and annotation that relies only on an unsupervised learning algorithm has low accuracy.
Summary of the invention
The present invention provides a method and device for annotating web page topics, to solve the problem of low web page topic annotation accuracy in the prior art.
To address the above technical problem, the present invention adopts the following technical solutions.
The present invention provides a method for annotating web page topics, comprising: obtaining a theme feature vector of a web page based on the title and text of the web page; classifying the theme feature vector using a pre-trained classifier; determining whether a type to which the theme feature vector belongs exists; if so, labeling the web page with the type to which the theme feature vector belongs; if not, marking the web page as a web page to be marked; further, clustering multiple web pages to be marked; analyzing the type of each cluster set; and labeling each web page to be marked with the type of the cluster set to which it belongs.
Obtaining the theme feature vector of the web page based on the title and text of the web page comprises: extracting the title and the text from the web page separately; constructing a title feature vector from the title; constructing a text feature vector from the text; and splicing the title feature vector and the text feature vector into the theme feature vector.
Constructing the title feature vector from the title comprises: segmenting the title into words using a pre-built title dictionary to obtain title tokens (segmented words); mapping the title tokens into the title dictionary; and weighting the title dictionary based on the weight values of the title tokens to construct the title feature vector of the web page.
Constructing the text feature vector from the text comprises: segmenting the text into words using a pre-built text dictionary to obtain multiple text tokens, and recording the order in which each text token appears in the text; mapping the text tokens into the text dictionary; and weighting the text dictionary based on the weight value and order of appearance of each text token to construct the text feature vector of the web page.
Classifying the theme feature vector using the pre-trained classifier comprises: predefining multiple web page types; for each type, scoring the theme feature vector of the web page once with the classifier; comparing the score for each type with a preset annotation threshold; and determining the types whose scores are greater than the annotation threshold to be the types to which the theme feature vector belongs, where the theme feature vector may belong to one or more types.
Analyzing the type of a cluster set comprises: extracting the title and text of each web page to be marked in the cluster set; segmenting all titles using the pre-built title dictionary to obtain multiple title tokens; segmenting all texts using the pre-built text dictionary to obtain multiple text tokens; and, among the title tokens and text tokens, taking the token with the highest frequency of occurrence as the type of the cluster set.
The present invention also provides a device for annotating web page topics, comprising: an obtaining module, configured to obtain the theme feature vector of a web page based on the title and text of the web page; a classification module, configured to classify the theme feature vector using a pre-trained classifier; a judgment module, configured to determine whether a type to which the theme feature vector belongs exists; a labeling module, configured to label the web page with the type to which the theme feature vector belongs when the judgment module determines that such a type exists; a marking module, configured to mark the web page as a web page to be marked when the judgment module determines that no such type exists; a clustering module, configured to cluster multiple web pages to be marked; and an analysis module, configured to analyze the type of each cluster set. The labeling module is further configured to label each web page to be marked with the type of the cluster set to which it belongs.
The obtaining module comprises: an extraction unit, configured to extract the title and the text from the web page separately; a first construction unit, configured to construct the title feature vector from the title; a second construction unit, configured to construct the text feature vector from the text; and a splicing unit, configured to splice the title feature vector and the text feature vector into the theme feature vector.
The first construction unit is specifically configured to: segment the title into words using a pre-built title dictionary to obtain title tokens; map the title tokens into the title dictionary; and weight the title dictionary based on the weight values of the title tokens to construct the title feature vector of the web page. The second construction unit is specifically configured to: segment the text into words using a pre-built text dictionary to obtain multiple text tokens, and record the order in which each text token appears in the text; map the text tokens into the text dictionary; and weight the text dictionary based on the weight value and order of appearance of each text token to construct the text feature vector of the web page.
The classification module is specifically configured to: predefine multiple web page types; invoke the classifier so that, for each type, the classifier scores the theme feature vector of the web page once; compare the score for each type with a preset annotation threshold; and determine the types whose scores are greater than the annotation threshold to be the types to which the theme feature vector belongs, where the theme feature vector may belong to one or more types. The analysis module is specifically configured to: extract the title and text of each web page to be marked in a cluster set; segment all titles using the pre-built title dictionary to obtain multiple title tokens; segment all texts using the pre-built text dictionary to obtain multiple text tokens; and, among the title tokens and text tokens, take the token with the highest frequency of occurrence as the type of the cluster set.
The beneficial effects of the present invention are as follows:
By cascading a supervised classification method with an unsupervised clustering method, the present invention automatically obtains topics from web pages and annotates the pages, effectively improving the efficiency and accuracy of web page topic annotation.
Brief description of the drawings
Fig. 1 is a flowchart of a method for annotating web page topics according to one embodiment of the present invention;
Fig. 2 is a flowchart of a method for annotating web page topics according to another embodiment of the present invention;
Fig. 3 is a flowchart of the steps for constructing the web page title feature vector according to one embodiment of the present invention;
Fig. 4 is a flowchart of the steps for constructing the web page text feature vector according to one embodiment of the present invention;
Fig. 5 is a schematic diagram of splicing the title feature vector and the text feature vector according to one embodiment of the present invention;
Fig. 6 is a flowchart of the steps for classifying the theme feature vector according to one embodiment of the present invention;
Fig. 7 is a structural diagram of a device for annotating web page topics according to one embodiment of the present invention;
Fig. 8 is a structural diagram of the obtaining module according to one embodiment of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and do not limit it.
This embodiment provides a method for annotating web page topics. Fig. 1 is a flowchart of the method according to one embodiment of the present invention. The steps of this embodiment are executed for each web page.
Step S110: obtain the theme feature vector of the web page based on the title and text of the web page.
Because the title and the text of a web page differ in length and writing style, this embodiment extracts the title and the text from the web page separately, constructs a title feature vector from the title, constructs a text feature vector from the text, and splices the two into the theme feature vector of the web page. Both the title feature vector and the text feature vector contain word vectors that reflect the topic of the web page.
Constructing the feature vectors with separate dictionaries describes the page content more accurately and thus improves the accuracy of web page topic annotation.
Step S120: classify the theme feature vector using a pre-trained classifier.
The classifier classifies the theme feature vector to determine its type. Because the theme feature vector reflects the topic of the web page, determining the type of the theme feature vector amounts to determining the type of the web page. The types include, for example, news, economy, entertainment, and science and technology.
To improve the accuracy of web page classification, this embodiment uses a supervised classification method: the classifier is obtained by training with a classification annotation system and training data prepared in advance.
The classification annotation system is the set of predefined web page types, for example news, economy, entertainment, and science and technology. The training data comprises multiple web pages whose types have already been determined according to the classification annotation system. Based on the classification annotation system and the training data, the classifier is trained using a support vector machine.
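The training step can be sketched as follows. The text only states that a support vector machine is used, so the one-vs-rest linear formulation, the hinge-loss sub-gradient updates, the hyperparameters (`epochs`, `lr`, `reg`) and the toy data below are all assumptions made for illustration, not details from the patent.

```python
def train_linear_svm(vectors, targets, epochs=50, lr=0.1, reg=0.01):
    """Train one linear SVM by sub-gradient descent on the hinge loss.

    targets holds +1 (page has the type) or -1 (page does not).
    The hyperparameter defaults are illustrative assumptions.
    """
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, targets):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # L2 regularisation shrinks the weights a little on every step.
            weights = [w * (1.0 - lr * reg) for w in weights]
            if margin < 1.0:  # sample is inside the margin or misclassified
                weights = [w + lr * y * v for w, v in zip(weights, x)]
                bias += lr * y
    return weights, bias


def train_classifier(vectors, labels):
    """Train one scoring function per type (one-vs-rest, an assumption)."""
    return {
        page_type: train_linear_svm(
            vectors, [1 if label == page_type else -1 for label in labels]
        )
        for page_type in sorted(set(labels))
    }


def score(model, vector):
    """Score a theme feature vector with one per-type linear model."""
    weights, bias = model
    return sum(w * v for w, v in zip(weights, vector)) + bias


# Toy training data: two-dimensional theme feature vectors with known types.
training_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
training_labels = ["news", "news", "economy", "economy"]
classifier = train_classifier(training_vectors, training_labels)
```

A positive score from `score(classifier["news"], vector)` suggests the page matches the news type. In practice an established SVM library would likely replace this hand-written trainer.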
Step S130: determine whether a type to which the theme feature vector belongs exists. If so, execute step S140; if not, execute step S150.
Whether such a type exists is determined from the classification result of the classifier. If a type to which the theme feature vector belongs exists, the classification result is that type; if no such type exists, the classification result is a null value.
Step S140: label the web page with the type to which the theme feature vector belongs.
Step S150: mark the web page as a web page to be marked.
A web page whose type the classifier can determine is labeled with the corresponding type. A web page whose type the classifier cannot determine is placed in the set of web pages to be marked and handled by the subsequent method, to guarantee the accuracy of web page annotation.
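Steps S130 to S150 amount to a simple routing decision, sketched below. The function name `route_page`, the page identifiers and the use of a dictionary and a list as storage are hypothetical choices for illustration.

```python
def route_page(page_id, predicted_types, labels, pending):
    """Store the types of a classified page, or queue the page for clustering."""
    if predicted_types:            # step S140: at least one type was found
        labels[page_id] = list(predicted_types)
    else:                          # step S150: the classification result is empty
        pending.append(page_id)


labels, pending = {}, []
route_page("page-1", ["news"], labels, pending)
route_page("page-2", [], labels, pending)
```

After these two calls, `labels` holds the labeled page and `pending` holds the page that still needs a type.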
Fig. 2 is a flowchart of a method for annotating web page topics according to another embodiment of the present invention. This embodiment processes the web pages to be marked.
Step S210: cluster multiple web pages to be marked.
At each preset time interval, the number of web pages marked as web pages to be marked is determined. If the number is greater than a preset quantity threshold, the web pages to be marked are clustered; if it is less than or equal to the quantity threshold, the number is checked again after the next preset interval.
This embodiment uses an unsupervised clustering method. During clustering, a preset similarity algorithm, for example the k-means algorithm, is used to compute the similarity between each pair of web pages to be marked, and any two web pages to be marked whose similarity is greater than a preset similarity threshold are placed in the same cluster set.
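The pairwise grouping rule described above can be sketched as follows. The text does not fix the similarity measure, so cosine similarity is an assumption here, as is merging groups transitively with a union-find structure; the vectors and threshold are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_by_similarity(vectors, threshold):
    """Put any two pages whose similarity exceeds the threshold in one cluster set."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the representative, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)   # merge the two cluster sets

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


pending_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
cluster_sets = cluster_by_similarity(pending_vectors, threshold=0.8)
```

Each cluster set is returned as a list of indices into `pending_vectors`. The all-pairs comparison is quadratic in the number of pages, which is one reason to cluster only the pages the classifier could not label.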
Step S220: analyze the type of each cluster set.
The Canopy algorithm can be used to analyze the type of each cluster set.
In one embodiment, the following steps can be executed for each cluster set: extract the title and text of each web page to be marked in the cluster set; segment all titles using the title dictionary to obtain multiple title tokens; segment all texts using the text dictionary to obtain multiple text tokens; and, among the title tokens and text tokens, take the token with the highest frequency of occurrence as the type of the cluster set. The most frequent token may be a title token or a text token.
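The frequency-based variant of step S220 can be sketched with a counter. The sketch assumes that segmentation has already been done, so each page is given as a pair of token lists; the function name and sample tokens are hypothetical.

```python
from collections import Counter


def cluster_type(pages):
    """Return the most frequent token across all titles and texts of a cluster set.

    pages is a list of (title_tokens, text_tokens) pairs.
    """
    counts = Counter()
    for title_tokens, text_tokens in pages:
        counts.update(title_tokens)
        counts.update(text_tokens)
    return counts.most_common(1)[0][0] if counts else None


cluster_pages = [
    (["football", "final"], ["football", "goal"]),
    (["football"], ["match", "goal"]),
]
detected_type = cluster_type(cluster_pages)
```

Title and text tokens are counted together, matching the statement that the winning token may come from either source. How ties are resolved is not specified in the text; here the result for tied tokens follows `Counter.most_common` ordering.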
Step S230: label each web page to be marked with the type of the cluster set to which it belongs.
In other words, whatever the type of a cluster set is, every web page to be marked in that cluster set is labeled with that type.
In one embodiment, the classifier is retrained at regular intervals using the clustering results, to increase classification precision. Further, after annotation is complete, the new types obtained by clustering and the web pages of those new types can be added to the classification annotation system and the training data, so that the classifier can also be trained on the new types and their web pages.
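Feeding cluster results back into the training material might look like the sketch below. The representation of the classification annotation system as a list of type names and of the training data as `(vector, type)` pairs is an assumption, as is the function name.

```python
def absorb_cluster(type_system, training_data, cluster_type, cluster_vectors):
    """Add a clustered type and its pages to the data used for retraining."""
    if cluster_type not in type_system:
        type_system.append(cluster_type)          # new type found by clustering
    for vector in cluster_vectors:
        training_data.append((vector, cluster_type))


type_system = ["news", "economy"]
training_data = [([1.0, 0.0], "news"), ([0.0, 1.0], "economy")]
absorb_cluster(type_system, training_data, "football", [[0.5, 0.5], [0.4, 0.6]])
```

The classifier would then be retrained on the extended `training_data` at the next scheduled interval.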
Determining the type of a web page by combining the classifier with clustering improves the accuracy and performance of web page annotation.
For step S110:
Fig. 3 is a flowchart of the steps for constructing the web page title feature vector according to one embodiment of the present invention.
Step S310: build the title dictionary in advance.
Step 1: collect web page titles to form a title corpus.
Step 2: segment the title text in the title corpus into words, and retain only the words in the segmentation result that meet the conditions, for example words that carry practical meaning. A preset word segmentation algorithm can be used; a segmentation algorithm generally includes a dictionary, which is used to divide the title text into one or more words.
Step 3: compute the IDF (inverse document frequency) value of each retained word, and form the title dictionary from the words whose IDF value is greater than a preset first IDF threshold. The larger a word's IDF value, the more representative the word; the smaller the IDF value, the less representative.
The IDF value of a word w is computed as follows:
$$\mathrm{IDF}(w) = \log\frac{N}{n_d} \tag{1.1}$$
In formula (1.1), N is the number of titles collected in the entire corpus and n_d is the number of titles in which the word w appears. log denotes the logarithm, with base 10 or e chosen according to requirements.
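Step 3 can be sketched as follows, assuming the standard form IDF(w) = log(N / n_d) built from the quantities N and n_d defined above, with the natural logarithm. The toy corpus and the threshold value are illustrative, and the titles are assumed to be segmented already.

```python
import math


def idf_values(tokenized_titles):
    """Return {word: IDF} with IDF(w) = log(N / n_d), natural logarithm."""
    n_docs = len(tokenized_titles)
    doc_freq = {}
    for tokens in tokenized_titles:
        for word in set(tokens):      # count each title at most once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / n) for w, n in doc_freq.items()}


def build_dictionary(tokenized_titles, idf_threshold):
    """Keep the words whose IDF exceeds the threshold, in sorted order."""
    idf = idf_values(tokenized_titles)
    return sorted(w for w, value in idf.items() if value > idf_threshold)


corpus = [
    ["stock", "market", "news"],
    ["football", "news"],
    ["stock", "prices"],
    ["movie", "news"],
]
title_dictionary = build_dictionary(corpus, idf_threshold=0.5)
```

In this toy corpus "news" appears in three of four titles, so its IDF falls below the threshold and it is left out of the dictionary. Sorting the dictionary is a convenience so that each word keeps a stable vector position.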
Step S320: segment the title into words using the title dictionary to obtain title tokens.
Using the words in the title dictionary, the title is segmented to obtain one or more title tokens.
Step S330: map the title tokens into the title dictionary.
Each title token is mapped into the title dictionary. Specifically, the title dictionary contains multiple words, and a mapping relationship is established between each title token and the identical word in the title dictionary.
Once the mapping is established, a vector whose length equals the length of the title dictionary is obtained: the number of dimensions of the vector equals the number of words in the title dictionary, and each dimension corresponds to one word in the dictionary.
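The mapping step can be pictured as marking which dictionary positions are hit by a token. The 0/1 indicator representation below is an assumption used only to show the shape of the vector before weighting.

```python
def map_to_dictionary(tokens, dictionary):
    """Return a dictionary-length vector with 1 where a token matches a word."""
    index = {word: i for i, word in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for token in tokens:
        if token in index:            # tokens outside the dictionary are ignored
            vector[index[token]] = 1
    return vector


mapped = map_to_dictionary(["stock", "unknown"], ["football", "market", "stock"])
```

The weighting step that follows replaces these indicator values with real-valued weights.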
Step S340: weight the title dictionary based on the weight values of the title tokens to construct the title feature vector of the web page.
Weighting the title dictionary means weighting the vector described above, whose length equals the length of the title dictionary. For the words in the title dictionary that have a mapping relationship with a title token, the corresponding vector entries are weighted with the TFIDF (term frequency-inverse document frequency) value; the vector obtained after weighting is the title feature vector. TFIDF is a common weighting technique in information retrieval and information mining.
After weighting, the value of each dimension of the vector is the TFIDF value, within the title, of the word corresponding to that dimension. The TFIDF value of a word w is computed as follows:
$$\mathrm{TFIDF}(w) = \mathrm{TF}(w)\times\mathrm{IDF}(w) = \frac{c_w}{c}\times\mathrm{IDF}(w) \tag{1.2}$$
In formula (1.2), the IDF value is computed as in formula (1.1); the TF value is the frequency with which the word w occurs in the current title, c_w is the number of times w occurs in the current title, and c is the number of words (tokens) in the current title.
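The weighting in step S340 can be sketched as below, assuming TFIDF(w) = (c_w / c) × IDF(w) as the definitions of c_w and c suggest. The dictionary and IDF values are toy inputs; in a real pipeline they would come from the dictionary-building step.

```python
import math


def title_feature_vector(title_tokens, dictionary, idf):
    """One dimension per dictionary word; value is TF * IDF, or 0.0 when absent."""
    total = len(title_tokens)                 # c: number of tokens in the title
    vector = []
    for word in dictionary:
        count = title_tokens.count(word)      # c_w: occurrences of the word
        tf = count / total if total else 0.0
        vector.append(tf * idf.get(word, 0.0))
    return vector


dictionary = ["football", "market", "stock"]
idf = {"football": math.log(4), "market": math.log(4), "stock": math.log(2)}
title_vector = title_feature_vector(["stock", "market", "stock"], dictionary, idf)
```

Words that do not occur in the title keep the value 0.0, so the vector always has the same length as the dictionary.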
Fig. 4 is a flowchart of the steps for constructing the web page text feature vector according to one embodiment of the present invention.
Step S410: build the text dictionary in advance.
Body text is collected to form a text corpus. The body text in the corpus is segmented into words, and only the words in the segmentation result that meet the conditions are retained, for example words that carry practical meaning. The IDF value of each retained word is computed, and the words whose IDF value is greater than a preset second IDF threshold form the text dictionary. The text dictionary is built in the same way as the title dictionary, and the IDF value is computed according to formula (1.1).
Step S420: segment the text into words using the text dictionary to obtain multiple text tokens, and record the order in which each text token appears in the text.
Using the words in the text dictionary, the text is segmented. Following the text from beginning to end, the order of appearance of each token (word) is recorded: the first token to appear is recorded as 1, the second as 2, and so on. Repeated occurrences of a token are not recorded.
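Recording the order of first appearance can be sketched in a few lines; the function name and sample tokens are hypothetical.

```python
def appearance_order(tokens):
    """Map each distinct token to the rank of its first appearance (1-based)."""
    order = {}
    for token in tokens:
        if token not in order:        # repeated tokens keep their first rank
            order[token] = len(order) + 1
    return order


ranks = appearance_order(["stock", "market", "stock", "prices"])
```

Here "stock" keeps rank 1 even though it appears again later, and "prices" receives rank 3 rather than 4.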
Step S430: map the text tokens into the text dictionary.
The text of a web page tends to open with brief wording that highlights the topic and attracts attention; that is, important words tend to appear near the beginning of the text.
The text dictionary contains multiple words, and a mapping relationship is established between each text token and the identical word in the text dictionary.
Once the mapping is established, a vector whose length equals the length of the text dictionary is obtained: the number of dimensions of the vector equals the number of words in the text dictionary, and each dimension corresponds to one word in the dictionary.
Step S440: weight the text dictionary based on the weight value and order of appearance of each text token to construct the text feature vector of the web page.
Weighting the text dictionary means weighting the vector described above, whose length equals the length of the text dictionary. For the words in the text dictionary that have a mapping relationship with a text token, the corresponding vector entries are weighted using both the TFIDF value and the order of appearance of the mapped text token; the vector obtained after weighting is the text feature vector. Each dimension of the text feature vector corresponds to one word in the dictionary, and the value of each dimension is the weight value weight_zw obtained from the order of appearance of the corresponding word and the TFIDF value of that word in the text:
[Formula (1.3) is not reproduced in this text.]
In formula (1.3), weight_zw(w) is the weight value (dimension value) of the word w in the text feature vector, rank(w) is the sequence number of w's appearance in the text, Σ_{w∈W} rank(w) is the sum of the sequence numbers of all words, and TFIDF(w) is computed as in formula (1.2), with the title-related quantities replaced by the corresponding text-related quantities. The text feature vector is obtained in this way. Formula (1.3) uses the same symbol w for a word as formula (1.2) only to make the computation of TFIDF(w) in formula (1.3) easier to follow.
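Because formula (1.3) itself is not available in this text, the sketch below uses an assumed stand-in: weight(w) = (1 − rank(w) / Σ rank) × TFIDF(w). It was chosen only because it combines the quantities named above and gives earlier words larger weights; the patent's actual formula may well differ.

```python
def text_feature_vector(body_tokens, dictionary, idf):
    """Position-weighted TFIDF vector for the body text (assumed weighting)."""
    rank = {}
    for token in body_tokens:                 # 1-based rank of first appearance
        if token not in rank:
            rank[token] = len(rank) + 1
    rank_sum = sum(rank.values())
    total = len(body_tokens)
    vector = []
    for word in dictionary:
        if word not in rank:
            vector.append(0.0)
            continue
        tfidf = (body_tokens.count(word) / total) * idf.get(word, 0.0)
        # ASSUMPTION: stand-in position factor, not the patent's formula (1.3).
        position = 1.0 - rank[word] / rank_sum
        vector.append(position * tfidf)
    return vector


body = ["stock", "market", "stock", "market"]
text_vector = text_feature_vector(body, ["market", "stock"], {"market": 1.0, "stock": 1.0})
```

Both words have the same TFIDF here, yet "stock" receives the larger weight because it appears first. One limitation of this stand-in is that a text with a single distinct word gets a position factor of zero, so a real implementation would need a different factor for that case.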
In general, a title states the content and topic of a web page in a brief sentence, so the title is short and the text is long. This embodiment takes into account that the title feature vector is usually shorter than the text feature vector yet more important, and therefore splices the title feature vector and the text feature vector in a weighted manner into a feature vector that expresses the topic of the web page, namely the theme feature vector, as shown in Fig. 5. This prevents an imbalance between the contributions of the title feature vector and the text feature vector during learning.
Before splicing, the dimension value TFIDF(w) of each word w in the title feature vector is weighted by the title weight w_bt, that is:
$$\mathrm{weight}_{bt}(w) = w_{bt}\times\mathrm{TFIDF}(w) \tag{1.4}$$
Before splicing, the dimension values of the words in the text feature vector are not weighted.
For splicing, the weighted title feature vector and the unweighted text feature vector are joined end to end, forming a vector whose length equals the sum of the lengths of the title feature vector and the text feature vector, with the weighted title feature vector placed before the unweighted text feature vector.
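The weighted splicing of formula (1.4) reduces to scaling the title vector and concatenating, as in this minimal sketch; the function name and sample values are hypothetical.

```python
def theme_feature_vector(title_vector, text_vector, title_weight):
    """Scale the title vector by w_bt, then append the unweighted text vector."""
    weighted_title = [title_weight * value for value in title_vector]
    return weighted_title + list(text_vector)


theme_vector = theme_feature_vector([0.5, 0.0], [0.1, 0.2, 0.3], title_weight=2.0)
```

The result has length 2 + 3 = 5, with the weighted title dimensions first.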
This embodiment obtains w_bt by grid search; the candidate range of w_bt is given by formula (1.5). For each candidate w_bt, the classifier is cross-validated on the training data and the classification accuracy is computed; the w_bt that yields the highest accuracy is taken as the final w_bt value.
[Formula (1.5) is not reproduced in this text.]
In formula (1.5), N_bt is the dimension of the title feature vector and N_zw is the dimension of the text feature vector.
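The grid search can be sketched as follows. Several parts are assumptions made to keep the example self-contained: the candidate list is hypothetical because the range in formula (1.5) is not available, leave-one-out is used as the cross-validation scheme, and a nearest-centroid classifier stands in for the support vector machine.

```python
def weighted_concat(title_vec, text_vec, w_bt):
    """Weighted title vector followed by the unweighted text vector."""
    return [w_bt * v for v in title_vec] + list(text_vec)


def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (assumption): pick the type with the closest centroid."""
    sums, counts = {}, {}
    for vec, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sorted(sums), key=squared_distance)


def loo_accuracy(samples, labels, w_bt):
    """Leave-one-out cross-validation accuracy for one candidate title weight."""
    vectors = [weighted_concat(t, b, w_bt) for t, b in samples]
    hits = 0
    for i, (x, y) in enumerate(zip(vectors, labels)):
        train_x = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += nearest_centroid_predict(train_x, train_y, x) == y
    return hits / len(vectors)


def grid_search_title_weight(samples, labels, candidates):
    """Return the candidate w_bt with the highest cross-validated accuracy."""
    return max(candidates, key=lambda w: loo_accuracy(samples, labels, w))


# Toy data: the title separates the two types, the text is misleading.
samples = [
    ([1.0, 0.0], [3.0, 0.0]),
    ([1.0, 0.0], [0.0, 3.0]),
    ([0.0, 1.0], [3.0, 0.0]),
    ([0.0, 1.0], [0.0, 3.0]),
]
labels = ["news", "news", "economy", "economy"]
best_weight = grid_search_title_weight(samples, labels, [0.0, 1.0, 5.0])
```

In this toy data only a large title weight lets the informative title dimensions outweigh the misleading text dimensions, which is the imbalance the weighted splicing is meant to correct.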
For step S120 specifically:
Fig. 6 is a flowchart of the steps for classifying the theme feature vector according to one embodiment of the present invention.
Step S610: for each type, the classifier scores the theme feature vector of the web page once.
For each type, the theme feature vector of the web page receives one score; if there are multiple types, there are multiple scores. A score measures whether the web page matches the type to which the score corresponds.
The classifier comprises multiple classifier functions, one per type. Substituting the theme feature vector into each classifier function yields the score for each type.
For example, let a = [a1, a2, a3] be the classifier and y = a1*x1 + a2*x2 + a3*x3 be the classifier function for the news type; classifier functions for other types may of course exist as well. Substituting the title feature vector into the news classifier function yields the value y, which is the score: when the score is greater than 0, the web page corresponding to the title feature vector belongs to the news type; otherwise it does not. Suppose a = [1, -2, 3]. Substituting the three-dimensional title feature vector x = [1, 2, 3] into the news classifier function gives y = 6, so y > 0 and the web page corresponding to x = [1, 2, 3] is a news page.
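The worked example above can be checked directly. The function name `linear_score` is hypothetical; the numbers are the ones given in the text.

```python
def linear_score(weights, features):
    """Evaluate y = a1*x1 + a2*x2 + ... for one classifier function."""
    return sum(a * x for a, x in zip(weights, features))


news_weights = [1, -2, 3]        # a = [1, -2, 3] from the example
x = [1, 2, 3]                    # feature vector from the example
y = linear_score(news_weights, x)   # 1*1 + (-2)*2 + 3*3 = 6
is_news = y > 0
```

With these values y is 6, so the page is treated as a news page under the "score greater than 0" rule of the example.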
Step S620: compare the score for each type with a preset annotation threshold.
Step S630: determine the types whose scores are greater than the annotation threshold to be the types to which the theme feature vector belongs; the theme feature vector may belong to one or more types.
Specifically, the scores can be sorted in descending order. First, determine whether the largest score is greater than the preset annotation threshold: if so, label the web page with the type corresponding to the largest score; if not, mark the web page as a web page to be marked. Then determine whether the second-largest score is greater than the preset annotation threshold: if so, also label the web page with the type corresponding to the second-largest score; if not, mark the web page as a web page to be marked. And so on, until every score has been compared with the annotation threshold.
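Steps S620 and S630 can be sketched as a threshold filter over the sorted scores. The sketch reads the procedure as "a page is left to be marked only when no score exceeds the threshold", which is an interpretation consistent with step S130; the `PENDING` marker, type names and scores are hypothetical.

```python
PENDING = "to-be-marked"   # hypothetical marker for pages without a type


def assign_types(scores, threshold):
    """Return every type whose score exceeds the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    types = [page_type for page_type, value in ranked if value > threshold]
    return types if types else [PENDING]


multi = assign_types({"news": 0.8, "economy": 0.6, "technology": 0.1}, threshold=0.5)
none_found = assign_types({"news": 0.2, "economy": 0.1}, threshold=0.5)
```

The first page receives two types because two scores exceed the threshold, and the second page is left for the clustering stage.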
The present invention also provides a kind of annotation equipments of Web page subject, as shown in fig. 7, for according to one embodiment of the invention
The structure chart of the annotation equipment of Web page subject.
The device includes:
Module 710 is obtained, web-based title and text is used for, obtains the theme feature vector of webpage.
Categorization module 720, for carrying out classification processing to theme feature vector using the classifier that training obtains in advance.
Judgment module 730, for judging whether there is type belonging to theme feature vector.
Labeling module 740, for determining in judgment module there are in the case where type belonging to theme feature vector, by net
Page is labeled as type belonging to theme feature vector.
Mark module 750, for determining to incite somebody to action there is no in the case where type belonging to theme feature vector in judgment module
Web Page Tags are webpage to be marked.
Cluster module 760, for carrying out clustering processing to multiple webpages to be marked.
Analysis module 770, for analyzing the type of each cluster set.
Labeling module 780 is also used to the type by webpage label to be marked for the cluster set belonging to it.
In one embodiment, obtain module 710 include: extraction unit 711, for extract respectively the title in webpage and
Text;First construction unit 712, for constructing title feature vector according to title;Second construction unit 713, for according to just
Text constructs text feature vector;Concatenation unit 714, for title feature vector sum text feature vector to be spliced the spy that is the theme
Levy vector.As shown in Figure 8.
First construction unit 712 is used for: using the title dictionary constructed in advance, being carried out word segmentation processing to title, is marked
Topic participle;Title participle is mapped in title dictionary;Based on the weighted value of title participle, place is weighted to title dictionary
Reason, constructs the title feature vector of webpage.
The second construction unit 713 is configured to: perform word segmentation on the text using a text dictionary constructed in advance, to obtain multiple text word segments, and record the order of appearance of each text word segment in the text; map the multiple text word segments into the text dictionary respectively; and weight the text dictionary based on the weight value and the order of appearance of each text word segment, to construct the text feature vector of the webpage.
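The text feature vector differs from the title feature vector in that the order of appearance also enters the weighting. The source does not give the formula, so the sketch below assumes that earlier segments count more, using a position factor of `1 / (1 + position)`; that factor, the function name `build_text_vector` and the `token_weights` table are all hypothetical.

```python
from typing import Dict, List, Sequence


def build_text_vector(
    text_tokens: Sequence[str],
    text_dictionary: Sequence[str],
    token_weights: Dict[str, float],
) -> List[float]:
    """Accumulate weight * position factor for each segment found in the dictionary.

    text_tokens must be in order of appearance; position 0 is the first segment.
    """
    index = {word: i for i, word in enumerate(text_dictionary)}
    vec = [0.0] * len(text_dictionary)
    for position, tok in enumerate(text_tokens):
        if tok in index:
            # Assumed decay: earlier segments contribute more than later ones.
            position_factor = 1.0 / (1.0 + position)
            vec[index[tok]] += token_weights.get(tok, 1.0) * position_factor
    return vec
```

Any monotonically decreasing position factor would fit the description equally well; the reciprocal form is used only because it is simple to verify by hand.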
In another embodiment, the categorization module 720 is specifically configured to: predefine multiple webpage types; call the classifier so that, for each type, the classifier scores the theme feature vector of the webpage once; compare the score corresponding to each type with a preset labeling threshold; and determine each type whose score is greater than the labeling threshold as a type to which the theme feature vector belongs; wherein the theme feature vector belongs to one or more types.
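The per-type scoring and thresholding can be sketched as follows. The source does not name the classifier, so a linear scorer (a dot product with one weight vector per type) is assumed purely for illustration; `classify`, `type_weights` and `threshold` are hypothetical names.

```python
from typing import Dict, List, Sequence


def classify(
    theme_vector: Sequence[float],
    type_weights: Dict[str, Sequence[float]],
    threshold: float,
) -> List[str]:
    """Score the vector once per predefined type; keep types scoring above the threshold."""
    matched: List[str] = []
    for page_type, weights in type_weights.items():
        # Assumed linear score; any trained per-type scorer could be used instead.
        score = sum(w * x for w, x in zip(weights, theme_vector))
        if score > threshold:
            matched.append(page_type)
    # An empty list means no type exists, so the page becomes a webpage to be marked.
    return matched
```

Because each type is compared with the threshold independently, a webpage can receive several types at once, and an empty result is what routes the webpage to the clustering stage.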
In another embodiment, the analysis module 770 is specifically configured to: extract the title and the text of each webpage to be marked in the cluster set; perform word segmentation on all the titles using the title dictionary constructed in advance, to obtain multiple title word segments; perform word segmentation on all the texts using the text dictionary constructed in advance, to obtain multiple text word segments; and, among the multiple title word segments and the multiple text word segments, obtain the word segment with the highest frequency of occurrence as the type of the cluster set.
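Deriving the type of a cluster set amounts to a frequency count over the pooled word segments. A minimal sketch, assuming segmentation has already been done and that each page contributes one list of title segments and one list of text segments; the function name `cluster_type` is hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def cluster_type(
    title_tokens_per_page: Sequence[Sequence[str]],
    text_tokens_per_page: Sequence[Sequence[str]],
) -> Optional[str]:
    """Return the most frequent word segment across all titles and texts in a cluster."""
    counts: Counter = Counter()
    for tokens in title_tokens_per_page:
        counts.update(tokens)
    for tokens in text_tokens_per_page:
        counts.update(tokens)
    if not counts:
        return None  # empty cluster or nothing matched the dictionaries
    # most_common breaks ties by first-seen order; the source does not specify tie handling.
    return counts.most_common(1)[0][0]
```

The source counts title and text segments together without extra weighting, and the sketch follows that reading.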
The functions of the device described in this embodiment have already been described in the method embodiments shown in Fig. 1 to Fig. 6. Therefore, for anything not detailed in the description of this embodiment, refer to the related description in the foregoing embodiments, which is not repeated here.
Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions and substitutions are also possible. Therefore, the scope of the present invention should not be limited to the above embodiments.
Claims (7)
1. A method for labeling a webpage theme, characterized by comprising:
extracting the title and the text of a webpage respectively;
constructing a title feature vector according to the title;
constructing a text feature vector according to the text;
splicing the title feature vector and the text feature vector into a theme feature vector in a weighted manner;
classifying the theme feature vector using a classifier obtained by training in advance;
judging whether there is a type to which the theme feature vector belongs;
if so, labeling the webpage as the type to which the theme feature vector belongs;
if not, tagging the webpage as a webpage to be marked; further, clustering multiple webpages to be marked;
analyzing the type of each cluster set; and labeling each webpage to be marked as the type of the cluster set to which it belongs;
wherein analyzing the type of a cluster set comprises:
extracting the title and the text of each webpage to be marked in the cluster set respectively;
performing word segmentation on all the titles using a title dictionary constructed in advance, to obtain multiple title word segments;
performing word segmentation on all the texts using a text dictionary constructed in advance, to obtain multiple text word segments;
among the multiple title word segments and the multiple text word segments, obtaining the word segment with the highest frequency of occurrence as the type of the cluster set.
2. The method according to claim 1, characterized in that constructing the title feature vector of the webpage according to the title comprises:
performing word segmentation on the title using the title dictionary constructed in advance, to obtain title word segments;
mapping the title word segments into the title dictionary;
weighting the title dictionary based on the weight values of the title word segments, to construct the title feature vector of the webpage.
3. The method according to claim 1, characterized in that constructing the text feature vector of the webpage according to the text comprises:
performing word segmentation on the text using the text dictionary constructed in advance, to obtain multiple text word segments, and recording the order of appearance of each text word segment in the text;
mapping the multiple text word segments into the text dictionary respectively;
weighting the text dictionary based on the weight value and the order of appearance of each text word segment, to construct the text feature vector of the webpage.
4. The method according to claim 1, characterized in that classifying the theme feature vector using the classifier obtained by training in advance comprises:
predefining multiple webpage types;
scoring, by the classifier, the theme feature vector of the webpage once for each type;
comparing the score corresponding to each type with a preset labeling threshold;
determining each type whose score is greater than the labeling threshold as a type to which the theme feature vector belongs;
wherein the theme feature vector belongs to one or more types.
5. A device for labeling a webpage theme, characterized by comprising:
an acquisition module, configured to obtain the theme feature vector of a webpage based on the title and the text of the webpage;
wherein the acquisition module comprises:
an extraction unit, configured to extract the title and the text of the webpage respectively;
a first construction unit, configured to construct a title feature vector according to the title;
a second construction unit, configured to construct a text feature vector according to the text;
a concatenation unit, configured to splice the title feature vector and the text feature vector into the theme feature vector in a weighted manner;
a categorization module, configured to classify the theme feature vector using a classifier obtained by training in advance;
a judgment module, configured to judge whether there is a type to which the theme feature vector belongs;
a labeling module, configured to label the webpage as the type to which the theme feature vector belongs when the judgment module determines that such a type exists;
a mark module, configured to tag the webpage as a webpage to be marked when the judgment module determines that there is no type to which the theme feature vector belongs;
a cluster module, configured to cluster multiple webpages to be marked;
an analysis module, configured to analyze the type of each cluster set;
wherein the labeling module is further configured to label each webpage to be marked as the type of the cluster set to which it belongs;
and the analysis module is specifically configured to:
extract the title and the text of each webpage to be marked in the cluster set respectively;
perform word segmentation on all the titles using a title dictionary constructed in advance, to obtain multiple title word segments;
perform word segmentation on all the texts using a text dictionary constructed in advance, to obtain multiple text word segments;
among the multiple title word segments and the multiple text word segments, obtain the word segment with the highest frequency of occurrence as the type of the cluster set.
6. The device according to claim 5, characterized in that
the first construction unit is specifically configured to:
perform word segmentation on the title using the title dictionary constructed in advance, to obtain title word segments;
map the title word segments into the title dictionary;
weight the title dictionary based on the weight values of the title word segments, to construct the title feature vector of the webpage;
and the second construction unit is specifically configured to:
perform word segmentation on the text using the text dictionary constructed in advance, to obtain multiple text word segments, and record the order of appearance of each text word segment in the text;
map the multiple text word segments into the text dictionary respectively;
weight the text dictionary based on the weight value and the order of appearance of each text word segment, to construct the text feature vector of the webpage.
7. The device according to claim 5, characterized in that
the categorization module is specifically configured to:
predefine multiple webpage types; call the classifier so that, for each type, the classifier scores the theme feature vector of the webpage once;
compare the score corresponding to each type with a preset labeling threshold;
determine each type whose score is greater than the labeling threshold as a type to which the theme feature vector belongs;
wherein the theme feature vector belongs to one or more types.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510266108.XA CN104881458B (en) | 2015-05-22 | 2015-05-22 | A kind of mask method and device of Web page subject |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104881458A (en) | 2015-09-02 |
CN104881458B (en) | 2019-05-28 |
Family
ID=53948951
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201510266108.XA | Active | CN104881458B (en) | 2015-05-22 | 2015-05-22 | A kind of mask method and device of Web page subject |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104881458B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105550292B (en) * | 2015-12-11 | 2018-06-08 | 北京邮电大学 | A kind of Web page classification method based on von Mises-Fisher probabilistic models |
CN105760526B (en) * | 2016-03-01 | 2019-05-07 | 网易(杭州)网络有限公司 | A kind of method and apparatus of news category |
CN105975573B (en) * | 2016-05-04 | 2019-08-13 | 北京广利核系统工程有限公司 | A kind of file classification method based on KNN |
CN106021418B (en) * | 2016-05-13 | 2019-09-06 | 北京奇虎科技有限公司 | The clustering method and device of media event |
CN106844328B (en) * | 2016-08-23 | 2020-04-21 | 华南师范大学 | Large-scale document theme semantic analysis method and system |
CN107784037B (en) * | 2016-08-31 | 2022-02-01 | 北京搜狗科技发展有限公司 | Information processing method and device, and device for information processing |
CN108090099B (en) * | 2016-11-22 | 2022-02-25 | 科大讯飞股份有限公司 | Text processing method and device |
CN108241662B (en) * | 2016-12-23 | 2021-12-28 | 北京国双科技有限公司 | Data annotation optimization method and device |
CN109471937A (en) * | 2018-10-11 | 2019-03-15 | 平安科技(深圳)有限公司 | A kind of file classification method and terminal device based on machine learning |
CN109359301A (en) * | 2018-10-19 | 2019-02-19 | 国家计算机网络与信息安全管理中心 | A kind of the various dimensions mask method and device of web page contents |
CN109299271B (en) * | 2018-10-30 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Training sample generation method, text data method, public opinion event classification method and related equipment |
CN110287314B (en) * | 2019-05-20 | 2021-08-06 | 中国科学院计算技术研究所 | Long text reliability assessment method and system based on unsupervised clustering |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727500A (en) * | 2010-01-15 | 2010-06-09 | 清华大学 | Text classification method of Chinese web page based on steam clustering |
CN102831193A (en) * | 2012-08-03 | 2012-12-19 | 人民搜索网络股份公司 | Topic detecting device and topic detecting method based on distributed multistage cluster |
CN103177024A (en) * | 2011-12-23 | 2013-06-26 | 微梦创科网络科技(中国)有限公司 | Method and device of topic information show |
CN103235824A (en) * | 2013-05-06 | 2013-08-07 | 上海河广信息科技有限公司 | Method and system for determining web page texts users interested in according to browsed web pages |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9172762B2 (en) * | 2011-01-20 | 2015-10-27 | Linkedin Corporation | Methods and systems for recommending a context based on content interaction |
Non-Patent Citations (1)
Title |
---|
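"Web文本分类方法研究与系统实现" (Research and System Implementation of Web Text Classification Methods); 程博 (Cheng Bo); 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology); 2011-04-15 (No. 4); main text p. 39, Section 5.1.3, and pp. 41-43, Sections 5.2-5.3 |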
Also Published As
Publication number | Publication date |
---|---|
CN104881458A (en) | 2015-09-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |