CN103177036A - Method and system for label automatic extraction - Google Patents


Info

Publication number
CN103177036A
Authority
CN
China
Prior art keywords
classification
webpage
training
vocabulary
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201110440739
Other languages
Chinese (zh)
Inventor
陈运文
宋海涛
刘作涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengle Information Technolpogy Shanghai Co Ltd
Original Assignee
Shengle Information Technolpogy Shanghai Co Ltd
Application filed by Shengle Information Technolpogy Shanghai Co Ltd filed Critical Shengle Information Technolpogy Shanghai Co Ltd
Priority to CN 201110440739 priority Critical patent/CN103177036A/en
Publication of CN103177036A publication Critical patent/CN103177036A/en
Pending legal-status Critical Current

Abstract

The invention relates to the field of network technology, and in particular to a method and system for automatic label extraction. The method comprises: crawling Chinese words and training webpages to generate a Chinese dictionary and a training sample database, respectively; generating a trained classification model from the Chinese dictionary and the training webpages in the training sample database; and performing label extraction on a webpage to be labeled according to the Chinese dictionary and the trained classification model to generate labels. The Chinese words and training webpages are crawled periodically to build the Chinese dictionary and the training sample database, the training webpages in the database are used to generate the trained classification model, and the model and the dictionary are then used to extract labels automatically from the webpage to be labeled, so that the extraction results are accurate and the process is efficient.

Description

Method and system for automatic label extraction
Technical field
The present invention relates to the field of network technology, and in particular to a method and system for automatic label extraction.
Background
With the rapid development of the Internet, it has become the most important platform for publishing information. To make effective use of the massive amount of information on the Internet, people use labels (tags) to describe published content. A label is an accurate, summarizing description of content a user has published, and text labels let readers quickly identify the topic of the document they are browsing. For example, a user can manually add labels when publishing a blog post; these labels are usually keywords closely related to the content of the document. Other users browsing the blog can then quickly identify the topic of the post from its labels. As another example, when searching for related information, a user can use the labels attached to texts to retrieve the group of documents that share the same label, which makes the search results more accurate.
In the course of realizing the present invention, the inventors found at least the following problems in the prior art. On the one hand, users are often unwilling to add labels to documents on their own initiative, while having website editors add labels manually is extremely inefficient and wastes a great deal of manpower. On the other hand, because labels are entered manually, they vary widely: documents with the same topic and content may carry completely different labels. This makes concrete applications of labels difficult, such as accurately clustering documents with the same topic and content. A system that extracts text labels automatically is therefore urgently needed.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a method and system for automatic label extraction that generate labels automatically with high processing efficiency.
In one aspect, an embodiment of the present invention provides a method for automatic label extraction, the method comprising:
crawling Chinese words and training webpages to generate a Chinese dictionary and a training sample database, respectively;
generating a trained classification model from the Chinese dictionary and the training webpages in the training sample database;
performing label extraction on a webpage to be labeled according to the Chinese dictionary and the trained classification model, and generating labels.
Preferably, crawling Chinese words and training webpages to generate a Chinese dictionary and a training sample database, respectively, comprises:
automatically crawling trending Chinese words (hot words) from the network to generate the Chinese dictionary;
crawling, from preset URL indexes and according to predefined categories, the training webpages corresponding to those categories to generate the training sample database.
Preferably, crawling the training webpages corresponding to the categories from the preset URL indexes according to the predefined categories comprises:
determining multiple classification categories, and setting for each category a URL index as the source of training samples;
extracting training samples from the URL index.
Preferably, generating the trained classification model from the Chinese dictionary and the training webpages in the training sample database comprises:
performing word segmentation on the text of the training webpages according to the Chinese dictionary to obtain feature words;
obtaining the categories of the feature words;
generating the trained classification model according to the classification results of the feature words.
Preferably, obtaining the categories of the feature words comprises:
obtaining the categories of the feature words with a maximum entropy classification model.
Preferably, performing label extraction on the webpage to be labeled according to the Chinese dictionary and the trained classification model, and generating labels, comprises:
performing word segmentation on the webpage to be labeled according to the Chinese dictionary to obtain feature words;
obtaining the weights of the feature words, and taking the results with the highest weights as the first label;
obtaining the category of the webpage to be labeled from the obtained feature words and the trained classification model, and taking the classification result as the second label;
obtaining attribute information of the webpage to be labeled, and taking the attribute information as the third label.
Preferably, obtaining the category of the webpage to be labeled from the obtained feature words and the trained classification model comprises:
obtaining the category of each feature word according to the trained classification model;
aggregating the categories of all feature words to obtain the category of the webpage to be labeled.
Taking the classification result as the second label comprises:
taking each category whose classification result exceeds a set threshold as the second label.
In another aspect, an embodiment of the present invention further provides a system for automatic label extraction, the system comprising:
a crawling module, configured to crawl Chinese words and training webpages and to generate a Chinese dictionary and a training sample database, respectively;
a training module, configured to generate a trained classification model from the Chinese dictionary and the training webpages in the training sample database;
a label extraction module, configured to perform label extraction on a webpage to be labeled according to the Chinese dictionary and the trained classification model, and to generate labels.
Preferably, the crawling module comprises a first crawling module and a second crawling module, wherein:
the first crawling module is configured to automatically crawl trending Chinese words and generate the Chinese dictionary;
the second crawling module is configured to crawl, from preset URL indexes and according to predefined categories, the training webpages corresponding to those categories, and to generate the training sample database.
Preferably, the training module comprises:
a first word segmentation unit, configured to perform word segmentation on the text of the training webpages according to the Chinese dictionary to obtain feature words;
a first feature extraction unit, configured to obtain the categories of the feature words;
a classification model generation unit, configured to generate the trained classification model according to the classification results of the feature words.
Preferably, the label extraction module comprises:
a second word segmentation unit, configured to perform word segmentation on the webpage to be labeled according to the Chinese dictionary to obtain feature words;
a first extraction module, configured to obtain the weights of the feature words and to take the results with the highest weights as the first label;
a second extraction module, configured to obtain the category of the webpage to be labeled from the obtained feature words and the trained classification model, and to take the classification result as the second label;
a third extraction module, configured to obtain attribute information of the webpage to be labeled and to take the attribute information as the third label.
Preferably, the system further comprises:
a first update module, configured to crawl hot words and update the Chinese dictionary;
a second update module, configured to generate new training samples, merge them with the original training samples, and update the training sample database.
The beneficial effects of the embodiments of the present invention are as follows. The embodiments periodically crawl Chinese words and training webpages to generate a Chinese dictionary and a training sample database, use the training webpages in the training sample database to generate a trained model, and use the trained model and the Chinese dictionary to extract labels automatically from the webpage to be labeled. The extracted labels are accurate and the process is efficient.
In addition, the labels extracted by the embodiments describe a webpage from several angles, namely its content, the category it belongs to, and its attributes, so the extracted labels describe the webpage comprehensively and accurately and are convenient to use. Furthermore, the embodiments periodically update the trending Chinese words and the training samples, so that new words and new categories are incorporated into the trained classification model as they appear. The system is therefore more adaptive and the label extraction results are more accurate.
Description of drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some of the embodiments recorded in the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a first embodiment of the automatic label extraction method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a second embodiment of the automatic label extraction method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a first embodiment of the automatic label extraction system provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a second embodiment of the automatic label extraction system provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention provide a method and system for automatic label extraction that generate labels automatically with high processing efficiency.
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flowchart of the first embodiment of the automatic label extraction method provided by an embodiment of the present invention. The method comprises:
S101: crawl Chinese words and training webpages to generate a Chinese dictionary and a training sample database, respectively.
S102: generate a trained classification model from the Chinese dictionary and the training webpages in the training sample database.
S103: perform label extraction on a webpage to be labeled according to the Chinese dictionary and the trained classification model, and generate labels.
Fig. 2 is a flowchart of the second embodiment of the automatic label extraction method provided by an embodiment of the present invention.
S201: crawl trending Chinese words and generate the Chinese dictionary.
Trending Chinese words are crawled from the network. For example, hot words can be crawled periodically from the Sina Weibo trending list and the Baidu trending list, and the trending Chinese words crawled each day are aggregated to generate the Chinese dictionary. The Chinese dictionary stores meaningful, established Chinese words, to ensure that new words can be segmented correctly. Here, established Chinese words include not only commonly used words but also hot words such as personal names, place names, and Internet slang. When new Chinese words appear, periodically crawling Chinese words from the network ensures that the newest words can be segmented out during word segmentation.
S202: crawl, from preset URL indexes and according to predefined categories, the training webpages corresponding to those categories, and generate the training sample database.
Specifically, step S202 may comprise:
S202A: determine multiple classification categories, and set for each category a URL index as the source of training samples.
Typically, the categories a text may belong to are preset. For example, a number of topics can be selected in advance as classification categories, such as sports, entertainment, fashion, photography, pets, data, and history. A URL index can be preset for each category and serves as the source of training samples.
S202B: extract training samples from the URL index.
Specifically, multiple webpages can be obtained from the URL index. Because a webpage is generally a document defined in HTML (HyperText Markup Language), training samples can be extracted by extracting the anchor text from the HTML text of the webpage. After the anchor text is obtained, it is filtered by its length and label content to remove low-quality content; the result forms the training samples, which are stored in the training sample database.
Anchor text, also called a hypertext link, is a passage in a webpage that is emphasized by underlining and links to another webpage; clicking the anchor text opens the page it points to. Anchor text establishes the relationship between a text keyword and a URL, and generally takes the form <a href="URL">anchor text</a>. Anchor text can serve as an assessment of the content of the page on which it appears, so the content attributes of a webpage can be obtained from its anchor text.
Specifically, after the HTML text of a webpage is opened, the anchor text is extracted, for example:
<a href="http://my.ku6.com/watch?v=3wbxh6WJuaXdOORZ">The Glamorous Imperial Concubine 31-32 TV ...</a>
The text between <a href="..."> and </a> is the anchor text.
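The anchor-text extraction and length filtering of step S202B can be sketched as follows. This is a minimal illustration using Python's standard html.parser module, not the implementation described in the patent; the names AnchorTextParser and extract_training_samples and the length bounds min_len and max_len are assumptions introduced here, since the text does not give concrete filtering thresholds.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, anchor text) pairs from <a href="...">...</a> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._parts = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href.strip()
                self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._parts).strip()))
            self._href = None


def extract_training_samples(html, min_len=4, max_len=60):
    """Return (href, anchor text) pairs that pass a hypothetical length filter."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return [(href, text) for href, text in parser.anchors
            if min_len <= len(text) <= max_len]
```

Very short anchors such as navigation links ("more", "next") are dropped by the length filter, which is one plausible reading of "removing low-quality content"; a real filter would likely also inspect the label content.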
S203: generate the trained classification model from the Chinese dictionary and the training webpages in the training sample database.
Specifically, step S203 may comprise:
S203A: perform word segmentation on the text of the training webpages according to the Chinese dictionary to obtain feature words.
Here, the objects of word segmentation are the training webpages in the training sample database. Chinese word segmentation groups meaningful Chinese characters together to form meaningful words. The words here include not only commonly used Chinese words but also personal names, place names, Internet slang, and so on. For example, the Chinese dictionary in the embodiment of the present invention may contain personal names such as Fan Bingbing and Yao Ming, as well as Internet terms such as microblog, newbie, flasher, and Tuita. Because the Chinese dictionary includes the latest hot words and is updated periodically, newly emerging Chinese words can be segmented out.
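The text does not say which segmentation algorithm is used. As one simple possibility, the sketch below applies forward maximum matching against the Chinese dictionary; the algorithm choice, the function name segment, and the max_word_len parameter are all assumptions made for illustration.

```python
def segment(text, dictionary, max_word_len=6):
    """Greedy forward maximum matching against a set of known words.

    At each position the longest dictionary word is taken; a character that
    starts no dictionary word is emitted on its own.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With a dictionary containing 范冰冰 (Fan Bingbing), 发布 (publish), and 微博 (microblog), the sentence 范冰冰发布微博 is split into those three words. If 微博 were missing from the dictionary it would fall apart into single characters, which illustrates why the dictionary is refreshed with hot words.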
S203B: obtain the categories of the feature words.
In the embodiment of the present invention, a maximum entropy model is used for classification training. The training webpages are segmented in step S203A, yielding multiple feature words, and for each word the feature indicating that it belongs to the corresponding category is obtained.
When the model is applied to the label extraction problem, the word frequency of each word (that is, the number of times the word occurs in a given document) is used as the feature value. For a word w and the training sample b in which it occurs, the feature function is:
$$f_{w,a'}(a,b) = \begin{cases} \mathrm{num}(b,w), & a = a' \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where num(b, w) is the number of times word w occurs in document b.
In the maximum entropy classification model, the features of each word with respect to each category are extracted in advance and stored for use during classification.
S203C: generate the trained classification model according to the classification results of the feature words.
In the training sample set, m categories are labeled in advance: A = {a1, a2, a3, ..., am} is the set of categories a document may belong to, and the corresponding training samples are B = {b1, b2, b3, ..., bn}.
According to maximum entropy theory, subject to the known constraints, the distribution of unknown events should be made as uniform as possible (so that the information entropy of the problem is maximized).
Therefore, over the set formed by all training samples, the probability P*(a|b) that any document b_i ∈ B belongs to any category a_j ∈ A can be obtained with the following formula:
$$P^{*}(a \mid b) = \frac{1}{\pi(b)} \exp\left\{ \sum_{i=1}^{k} \lambda_i f_i(a,b) \right\} \tag{2}$$
where P*(a|b) is nonzero, k is the number of feature functions (that is, the number of word segmentation results in the document), and π(b) is the normalization factor:
$$\pi(b) = \sum_{a} \exp\left( \sum_{i=1}^{k} \lambda_i f_i(a,b) \right) \tag{3}$$
where the λ_i are the model parameters, which are the most important quantities to be computed in the training stage of the maximum entropy text model, and the f_i(a, b) are the feature functions.
Because the category of each document in the training samples is known, the value P* on the left-hand side of this equation is known, and by training on a large number of samples the values of the parameters λ_i on the right-hand side can be solved for. Once the λ_i are known, the probability distribution function is determined and the construction of the maximum entropy model is complete. In subsequent steps, this trained classification model is used to classify webpages of unknown category.
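Equations (1) to (3) can be made concrete with a small sketch. The text does not state how the λ_i are solved for, so the training loop below uses plain stochastic gradient ascent on the conditional log-likelihood as an assumed stand-in; the function names, the number of epochs, and the learning rate are likewise illustrative choices, not values from the patent.

```python
import math
from collections import Counter


def class_probabilities(word_counts, lambdas, classes):
    """Eq. (2)-(3): P*(a|b) is proportional to exp(sum_i lambda_i * f_i(a, b)).

    Features are (word, class) pairs whose value is the word count, as in Eq. (1).
    """
    scores = {a: sum(lambdas.get((w, a), 0.0) * n for w, n in word_counts.items())
              for a in classes}
    top = max(scores.values())                   # subtract the max for stability
    exp_scores = {a: math.exp(s - top) for a, s in scores.items()}
    pi_b = sum(exp_scores.values())              # normalization factor pi(b)
    return {a: e / pi_b for a, e in exp_scores.items()}


def train(samples, classes, epochs=200, learning_rate=0.1):
    """Fit the lambdas from (list_of_words, known_class) training pairs."""
    lambdas = {}
    for _ in range(epochs):
        for words, label in samples:
            counts = Counter(words)
            probs = class_probabilities(counts, lambdas, classes)
            for w, n in counts.items():
                for a in classes:
                    # Gradient: observed feature value minus its expected value.
                    observed = n if a == label else 0
                    grad = observed - probs[a] * n
                    lambdas[(w, a)] = lambdas.get((w, a), 0.0) + learning_rate * grad
    return lambdas
```

After training on segmented documents with known categories, calling class_probabilities on the word counts of a new document yields one probability per category, and these probabilities sum to 1 by construction of π(b).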
S204: perform word segmentation on the webpage to be labeled according to the Chinese dictionary to obtain feature words.
Here, the object of Chinese word segmentation is the webpage to be labeled, that is, the webpage on which label extraction is to be performed. The segmentation process is the same as that applied to the training webpages: meaningful Chinese characters in the text are grouped together to form meaningful words. The words here include not only commonly used Chinese words but also personal names, place names, Internet slang, and so on. For example, the Chinese dictionary in the embodiment of the present invention may contain personal names such as Fan Bingbing and Yao Ming, as well as Internet terms such as microblog, newbie, flasher, and Tuita. Because the Chinese dictionary includes the latest hot words and is updated periodically, newly emerging Chinese words can be segmented out.
S205: obtain the weights of the feature words, and take the results with the highest weights as the first label.
The weight of each feature word is obtained as follows: count the number of times Term_frequency that the feature word occurs in the text, compute the importance of the feature word, and take the product of the occurrence count and the importance as the weight of the feature word. This can be expressed with the following formula:
Term_weight = Term_frequency * Term_important (4)
where Term_weight is the weight of the feature word, Term_frequency is the number of times the feature word occurs in the text, and Term_important is the importance of the feature word, for which a default value can be set, for example 10.
Optionally, the method further comprises:
determining the attributes of the feature word, adjusting the importance of the feature word according to those attributes, and using the adjusted result as the importance of the feature word. Specifically:
(1) When the feature word is a named entity, its importance is the default value plus 1. Here, a named entity is a specific word mined from the Chinese words obtained by targeted crawling, and may be a personal name, a place name, a brand name, the title of a film, television program, or song, and so on. For example, when "Fan Bingbing" in the segmentation result matches an entry in the Chinese dictionary, it is identified as a named entity and its importance is increased by 1.
(2) When the feature word is a hot word, its importance is the default value plus 2.
(3) When the part of speech of the feature word is a function word, such as a conjunction, adverb, interjection, auxiliary word, or modal particle, its importance is the default value multiplied by 0.1.
Finally, the weight of each feature word is obtained as the product of its occurrence count and its importance, the feature words are sorted by weight in descending order, and the three feature words with the highest weights are taken as the first label extracted from the text.
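The weighting of Eq. (4) and the three adjustment rules can be sketched as below. The text does not say what happens when a word is both a named entity and a hot word, or whether a function word can also be a hot word; this sketch assumes the two bonuses add up and that the function-word rule takes precedence. The function names and the set-based lookups are illustrative.

```python
from collections import Counter

DEFAULT_IMPORTANCE = 10  # example default value given in the text


def term_importance(word, named_entities, hot_words, function_words):
    """Adjust the default importance according to the word's attributes."""
    if word in function_words:
        return DEFAULT_IMPORTANCE * 0.1   # rule (3): damp function words
    importance = DEFAULT_IMPORTANCE
    if word in named_entities:
        importance += 1                   # rule (1): named entity
    if word in hot_words:
        importance += 2                   # rule (2): hot word
    return importance


def first_labels(words, named_entities=(), hot_words=(), function_words=(), top_n=3):
    """Rank words by Term_weight = Term_frequency * Term_important (Eq. 4)."""
    frequency = Counter(words)
    weights = {w: n * term_importance(w, named_entities, hot_words, function_words)
               for w, n in frequency.items()}
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:top_n]
```

Because function words are scaled down to one tenth of the default importance, a particle that occurs three times still ranks below a content word that occurs once.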
S206: obtain the category of the webpage to be labeled from the obtained feature words and the trained classification model, and take the classification result as the second label.
Using the feature words of the webpage to be labeled obtained in step S204, the probability that each feature word belongs to each category is obtained under the maximum entropy model, and the category with the highest probability for a word is taken as the category of that feature word. The probabilities of all feature words in the webpage are then accumulated, and the category with the highest probability is taken as the category of the document.
Specifically, as in the training process, let A = {a1, a2, a3, ..., am} be the set of categories a document may belong to, and let c be a webpage whose category is currently unknown. The probability P*(a_j|c) that document c belongs to any category a_j ∈ A can then be obtained with the following formula:
$$P^{*}(a_j \mid c) = \frac{1}{\pi(c)} \exp\left\{ \sum_{i=1}^{k} \lambda_i f_i(a_j,c) \right\} \tag{5}$$
where P*(a_j|c) is nonzero, k is the number of feature functions, and π(c) is the normalization factor:
$$\pi(c) = \sum_{a \in A} \exp\left( \sum_{i=1}^{k} \lambda_i f_i(a,c) \right) \tag{6}$$
The λ_i are the parameters computed during training, and the f_i(a_j, c) are the feature functions. Specifically, each "word-category" pair is selected as a feature, with the word frequency as its feature value. The probabilities computed for all categories by this formula are then compared against a threshold: document c is assigned to every category a_j for which P*(a_j|c) > ε, where ε is a preset threshold. For any given document, the probability of belonging to any category may be very low, so a threshold is set in advance and only categories whose probability exceeds ε are regarded as categories of document c.
The category of the document can be used as the second label, which indicates the type of the document.
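The threshold rule can be sketched in a few lines. The sketch assumes the per-category probabilities P*(a_j|c) have already been computed and are passed in as a dictionary; the function name second_labels and the example value of epsilon are hypothetical, since the text gives no concrete threshold.

```python
def second_labels(class_probs, epsilon=0.5):
    """Return the categories whose probability exceeds epsilon, most probable first."""
    selected = [(a, p) for a, p in class_probs.items() if p > epsilon]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [a for a, _ in selected]
```

A document may therefore receive several second labels, or none at all when no category is probable enough, which matches the remark that a document's probability for every category can be low.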
S207: obtain attribute information of the webpage to be labeled, and take the attribute information as the third label.
Optionally, attribute information of the webpage to be labeled can additionally be obtained and used as the third label.
Step S207 may specifically be: obtaining the attribute information from the features passed in with the document, for example the type the user selects when publishing a microblog post, such as the chosen content type.
Step S207 may also be: obtaining the feature tag of the document and determining the attribute information of the webpage according to preset rules. For example, when a picture is published, the HTML tag is <img src="xxx">xxx</img>. The attribute information of the webpage can be determined from such features according to preset rules. Specifically, the rules can be set as shown in the following table:
Table 1
Feature tag    Attribute
3              Picture
4              Picture group
5              Video
6              Music
8              Question and answer
12             Document
S208: present the first label, the second label, and the third label as the label list of the webpage to be labeled.
For example, for a webpage about Fan Bingbing, the labels finally extracted automatically are: Fan Bingbing, entertainment, picture. The labels consist of a keyword, a category, and an attribute, and describe the webpage from three angles: its content topic (keyword), the category it belongs to (sports, entertainment, or another category), and its attribute (picture, video, document, etc.). The label extraction result is accurate and comprehensive.
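The rule lookup of Table 1 and the assembly of the final label list in step S208 can be sketched together. The mapping below copies the feature-tag codes and attributes from Table 1; the function names and the choice to skip an unknown feature tag silently are assumptions made for this illustration.

```python
# Feature-tag codes and attributes as listed in Table 1.
ATTRIBUTE_RULES = {
    3: "Picture",
    4: "Picture group",
    5: "Video",
    6: "Music",
    8: "Question and answer",
    12: "Document",
}


def third_label(feature_tag):
    """Look up the attribute for a feature tag; None if no rule matches."""
    return ATTRIBUTE_RULES.get(feature_tag)


def build_label_list(first, second, feature_tag):
    """Concatenate keyword, category, and attribute labels (step S208)."""
    labels = list(first) + list(second)
    attribute = third_label(feature_tag)
    if attribute is not None:
        labels.append(attribute)
    return labels
```

For the Fan Bingbing example, a keyword list containing the name, a category list containing "entertainment", and feature tag 3 produce a three-item label list ending in "Picture".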
In the embodiment of the present invention, the Chinese words and the training samples are updated periodically. This ensures that the newest Chinese words can be segmented out during word segmentation, and the continually refreshed training samples are added to the trained classification model, which makes the classification results more accurate.
Fig. 3 is a schematic diagram of the first embodiment of the automatic label extraction system of an embodiment of the present invention. The system comprises:
a crawling module 100, configured to crawl Chinese words and training webpages and to generate a Chinese dictionary and a training sample database, respectively;
a Chinese dictionary 200, configured to store Chinese words;
a training sample database 300, configured to store training samples;
a training module 400, configured to generate a trained classification model from the Chinese dictionary and the training webpages in the training sample database;
a label extraction module 500, configured to perform label extraction on a webpage to be labeled according to the Chinese dictionary and the trained classification model, and to generate labels.
Fig. 4 is a schematic diagram of the second embodiment of the automatic label extraction system of an embodiment of the present invention.
Specifically, the crawling module 100 comprises a first crawling module 110 and a second crawling module 120, wherein:
the first crawling module 110 is configured to automatically crawl trending Chinese words;
the second crawling module 120 is configured to crawl, from preset URL indexes and according to predefined categories, the training webpages corresponding to those categories.
Specifically, the training module 400 comprises:
a first word segmentation unit 410, configured to perform word segmentation on the text of the training webpages according to the Chinese dictionary to obtain feature words;
a first feature extraction unit 420, configured to obtain the categories of the feature words;
a classification model generation unit 430, configured to generate the trained classification model according to the classification results of the feature words.
Specifically, the label extraction module 500 comprises:
a second word segmentation unit 510, configured to perform word segmentation on the webpage to be labeled according to the Chinese dictionary to obtain feature words;
a first extraction module 520, configured to obtain the weights of the feature words and to take the results with the highest weights as the first label;
a second extraction module 530, configured to obtain the category of the webpage to be labeled from the obtained feature words and the trained classification model, and to take the classification result as the second label;
a third extraction module 540, configured to obtain attribute information of the webpage to be labeled and to take the attribute information as the third label.
Specifically, the system further comprises:
an update module 600, comprising a first update module 610 and a second update module 620, wherein:
the first update module 610 is configured to crawl hot words and update the Chinese dictionary;
the second update module 620 is configured to generate new training samples, merge them with the original training samples, and update the training sample database.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The above is only a specific embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (12)

1. A label extraction method, characterized in that the method comprises:
Grabbing Chinese vocabulary and training web pages, and generating a Chinese dictionary and a training sample database respectively;
Generating a training classification model according to the Chinese dictionary and the training web pages in the training sample database;
Performing label extraction on a web page to be extracted according to the Chinese dictionary and the training classification model, and generating labels.
2. The method according to claim 1, characterized in that grabbing Chinese vocabulary and training web pages and generating a Chinese dictionary and a training sample database respectively comprises:
Automatically grabbing hot Chinese vocabulary from the network, and generating the Chinese dictionary;
Grabbing, from a preset URL index and according to predefined categories, the training web pages corresponding to the categories, and generating the training sample database.
3. The method according to claim 2, characterized in that grabbing, from a preset URL index and according to predefined categories, the training web pages corresponding to the categories comprises:
Determining a plurality of classification categories, and setting, for each classification category, a URL index as the source of training samples;
Extracting training samples from the URL index.
4. The method according to claim 1, characterized in that generating a training classification model according to the Chinese dictionary and the training web pages in the training sample database comprises:
Performing word segmentation on the text in the training web pages according to the Chinese dictionary, to obtain feature words;
Obtaining the classification of the feature words;
Generating the training classification model according to the classification results of the feature words.
5. The method according to claim 4, characterized in that obtaining the classification of the feature words comprises:
Obtaining the classification of the feature words using a maximum entropy classification model.
6. The method according to claim 1, characterized in that performing label extraction on a web page to be extracted according to the Chinese dictionary and the training classification model, and generating labels, comprises:
Performing word segmentation on the web page to be extracted according to the Chinese dictionary, to obtain feature words;
Obtaining the weights of the feature words, and using the feature word with the highest weight as the first label;
Obtaining the classification of the web page to be extracted according to the obtained feature words and the training classification model, and using the classification result as the second label;
Obtaining the attribute information of the web page to be extracted, and using the attribute information as the third label.
7. The method according to claim 6, characterized in that obtaining the classification of the web page to be extracted according to the obtained feature words and the training classification model comprises:
Obtaining, according to the training classification model, the category to which each feature word belongs;
Accumulating the categories to which all feature words belong, to obtain the category to which the web page to be extracted belongs;
And using the classification result as the second label comprises:
Using each category whose classification result is greater than a set threshold as the second label.
8. An automatic label extraction system, characterized in that the system comprises:
A grabbing module, configured to grab Chinese vocabulary and training web pages, and generate a Chinese dictionary and a training sample database respectively;
A training module, configured to generate a training classification model according to the Chinese dictionary and the training web pages in the training sample database;
A label extraction module, configured to perform label extraction on a web page to be extracted according to the Chinese dictionary and the training classification model, and generate labels.
9. The system according to claim 8, characterized in that the grabbing module comprises a first grabbing module and a second grabbing module, wherein,
The first grabbing module is configured to automatically grab hot Chinese vocabulary and generate the Chinese dictionary;
The second grabbing module is configured to grab, from a preset URL index and according to predefined categories, the training web pages corresponding to the categories, and generate the training sample database.
10. The system according to claim 8, characterized in that the training module comprises:
A first word segmentation unit, configured to perform word segmentation on the text of the training web pages according to the Chinese dictionary, to obtain feature words;
A first feature extraction unit, configured to obtain the classification of the feature words;
A classification model generation unit, configured to generate the training classification model according to the classification results of the feature words.
11. The system according to claim 8, characterized in that the label extraction module comprises:
A second word segmentation unit, configured to perform word segmentation on the web page to be extracted according to the Chinese dictionary, to obtain feature words;
A first extraction module, configured to obtain the weights of the feature words and use the feature word with the highest weight as the first label;
A second extraction module, configured to obtain the classification of the web page to be extracted according to the obtained feature words and the training classification model, and use the classification result as the second label;
A third extraction module, configured to obtain the attribute information of the web page to be extracted and use the attribute information as the third label.
12. The system according to claim 8, characterized in that the system further comprises:
A first update module, configured to grab hot vocabulary and update the Chinese dictionary;
A second update module, configured to generate new training samples, merge them with the original training samples, and update the training sample database.
CN 201110440739 2011-12-23 2011-12-23 Method and system for label automatic extraction Pending CN103177036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110440739 CN103177036A (en) 2011-12-23 2011-12-23 Method and system for label automatic extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110440739 CN103177036A (en) 2011-12-23 2011-12-23 Method and system for label automatic extraction

Publications (1)

Publication Number Publication Date
CN103177036A true CN103177036A (en) 2013-06-26

Family

ID=48636917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110440739 Pending CN103177036A (en) 2011-12-23 2011-12-23 Method and system for label automatic extraction

Country Status (1)

Country Link
CN (1) CN103177036A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593690A (en) * 2013-11-25 2014-02-19 北京光年无限科技有限公司 User intelligent tagging system
CN103886034A (en) * 2014-03-05 2014-06-25 北京百度网讯科技有限公司 Method and equipment for building indexes and matching inquiry input information of user
CN104199969A (en) * 2014-09-22 2014-12-10 北京国双科技有限公司 Webpage data analysis method and device
CN104933296A (en) * 2015-05-28 2015-09-23 汤海京 Big data processing method based on multi-dimensional data fusion and big data processing equipment based on multi-dimensional data fusion
CN105740231A (en) * 2016-01-28 2016-07-06 浪潮软件股份有限公司 Data content associating method and apparatus
CN106682048A (en) * 2015-11-11 2017-05-17 财团法人资讯工业策进会 Web content extraction system and method
CN106874507A (en) * 2017-02-28 2017-06-20 百度在线网络技术(北京)有限公司 Method, device and server for pushed information
CN107430504A (en) * 2015-04-08 2017-12-01 利斯托株式会社 Data-translating system and method
CN107977375A (en) * 2016-10-25 2018-05-01 央视国际网络无锡有限公司 A kind of video tab generation method and device
CN108874996A (en) * 2018-06-13 2018-11-23 北京知道创宇信息技术有限公司 website classification method and device
CN109299271A (en) * 2018-10-30 2019-02-01 腾讯科技(深圳)有限公司 Training sample generation, text data, public sentiment event category method and relevant device
CN111832275A (en) * 2020-09-21 2020-10-27 北京百度网讯科技有限公司 Text creation method, device, equipment and storage medium

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593690B (en) * 2013-11-25 2017-08-08 北京光年无限科技有限公司 User's intelligent tagging systems
CN103593690A (en) * 2013-11-25 2014-02-19 北京光年无限科技有限公司 User intelligent tagging system
CN103886034A (en) * 2014-03-05 2014-06-25 北京百度网讯科技有限公司 Method and equipment for building indexes and matching inquiry input information of user
CN103886034B (en) * 2014-03-05 2019-03-19 北京百度网讯科技有限公司 A kind of method and apparatus of inquiry input information that establishing index and matching user
CN104199969A (en) * 2014-09-22 2014-12-10 北京国双科技有限公司 Webpage data analysis method and device
US10621245B2 (en) 2014-09-22 2020-04-14 Beijing Gridsum Technology Co., Ltd. Webpage data analysis method and device
WO2016045567A1 (en) * 2014-09-22 2016-03-31 北京国双科技有限公司 Webpage data analysis method and device
CN104199969B (en) * 2014-09-22 2017-10-03 北京国双科技有限公司 Web data analysis method and device
CN107430504A (en) * 2015-04-08 2017-12-01 利斯托株式会社 Data-translating system and method
CN104933296A (en) * 2015-05-28 2015-09-23 汤海京 Big data processing method based on multi-dimensional data fusion and big data processing equipment based on multi-dimensional data fusion
CN106682048A (en) * 2015-11-11 2017-05-17 财团法人资讯工业策进会 Web content extraction system and method
CN105740231A (en) * 2016-01-28 2016-07-06 浪潮软件股份有限公司 Data content associating method and apparatus
CN107977375A (en) * 2016-10-25 2018-05-01 央视国际网络无锡有限公司 A kind of video tab generation method and device
CN106874507A (en) * 2017-02-28 2017-06-20 百度在线网络技术(北京)有限公司 Method, device and server for pushed information
CN106874507B (en) * 2017-02-28 2020-12-25 百度在线网络技术(北京)有限公司 Method and device for pushing information and server
CN108874996A (en) * 2018-06-13 2018-11-23 北京知道创宇信息技术有限公司 website classification method and device
CN109299271A (en) * 2018-10-30 2019-02-01 腾讯科技(深圳)有限公司 Training sample generation, text data, public sentiment event category method and relevant device
CN111832275A (en) * 2020-09-21 2020-10-27 北京百度网讯科技有限公司 Text creation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103177036A (en) Method and system for label automatic extraction
CN101216825B (en) Indexing key words extraction/ prediction method
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
Venugopalan et al. Exploring sentiment analysis on twitter data
CN104408093A (en) News event element extracting method and device
CN101231641B (en) Method and system for automatic analysis of hotspot subject propagation process in the internet
CN105630768B (en) A kind of product name recognition method and device based on stacking condition random field
CN101751458A (en) Network public sentiment monitoring system and method
CN103874994A (en) Method and apparatus for automatically summarizing the contents of electronic documents
CN105844424A (en) Product quality problem discovery and risk assessment method based on network comments
CN103914478A (en) Webpage training method and system and webpage prediction method and system
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN103049440A (en) Recommendation processing method and processing system for related articles
CN102880723A (en) Searching method and system for identifying user retrieval intention
CN104484431A (en) Multi-source individualized news webpage recommending method based on field body
CN102609427A (en) Public opinion vertical search analysis system and method
CN102722498A (en) Search engine and implementation method thereof
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN102169496A (en) Anchor text analysis-based automatic domain term generating method
CN102200975A (en) Vertical search engine system and method using semantic analysis
CN103377249A (en) Keyword putting method and system
CN102737021A (en) Search engine and realization method thereof
CN103886020A (en) Quick search method of real estate information
CN103365961A (en) Accurate search-oriented website structurization labeling method and system
CN111160019A (en) Public opinion monitoring method, device and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C05 Deemed withdrawal (patent law before 1993)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130626