CN103853823B

CN103853823B - Online encyclopedia oriented entity attribute extraction method and system

Info

Publication number: CN103853823B
Application number: CN201410065743.7A
Authority: CN
Inventors: 程学旗; 贾岩涛; 张泽慧; 王元卓; 冯凯; 熊锦华; 许洪波
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2014-02-26
Filing date: 2014-02-26
Publication date: 2017-01-18
Anticipated expiration: 2034-02-26
Also published as: CN103853823A

Abstract

The invention provides an online encyclopedia oriented entity attribute extraction method and system. The method comprises the steps: selecting a page from an online encyclopedia webpage text set T to be extracted, and extracting entity attribute expression rules of the page, so as to obtain a current rule set; extracting entity attribute of the online encyclopedia webpage text set T to be extracted by using the current rule set, extracting entity attribute expression rules of the T according to the entity attribute obtained through extracting, taking a rule set obtained through extracting as a current rule set, and repeating the process for k times, so as to obtain a final rule set; carrying out entity attribute extraction on the T by using the final rule set. The entity attribute extraction method provided by the invention can adapt to the change of text structures, is applicable to various online encyclopedias and has the effects of high recall rate and high accuracy.

Description

Entity attribute extraction method and system for online encyclopedias

Technical field

The present invention relates to the field of information technology, and more particularly to an entity attribute extraction method and system for online encyclopedias.

Background

An online encyclopedia, also called a web encyclopedia, is an encyclopedia published on the Internet for users to consult. Online encyclopedias are either open or non-open. Users can query all kinds of information conveniently and promptly through an online encyclopedia. Moreover, because Internet users take part in building open encyclopedias, their information is more open and transparent, and richer and more complete. Well-known open online encyclopedias include Wikipedia, Popular Encyclopedia, Baidupedia (Baidu Baike), and Interactive Encyclopedia (Hudong Baike).

An online encyclopedia describes entities of all kinds for users to look up. An entity is an objective thing in the real world, that is, anything in the real world that can be distinguished and identified. An entity may be a tangible object or an abstract event. Entity attributes are the basic characteristics of an entity; they help people understand the entity comprehensively and objectively, and the more attributes are available, the more detailed the description of the entity. Entity attribute extraction is therefore of great significance.

Online encyclopedias describe entities comprehensively and in detail, and there is a one-to-one correspondence between an entity and its description page. In addition, the page structure of an online encyclopedia follows certain regularities: each entity page has a separate part that describes the entity's attributes, and this attribute description part is usually semi-structured text that is easy to extract from. At present, entity attributes are mainly extracted from online encyclopedias with rule-based (template-based) methods. However, because the text structure differs from one online encyclopedia to another, so do the rules needed to extract entity attributes. Existing entity attribute extraction methods therefore usually target a single online encyclopedia and cannot be applied to others.

Summary of the invention

To solve the above problem, the present invention provides an entity attribute extraction method for online encyclopedias, the method comprising:

Step 1): selecting a page from an online encyclopedia web page text set t to be extracted, and extracting the entity attribute expression rules of the page to obtain a current rule set;

Step 2): performing entity attribute extraction on the online encyclopedia web page text set t using the current rule set, extracting the entity attribute expression rules of t according to the extracted entity attributes, taking the extracted rule set as the current rule set, and repeating this process k times to obtain a final rule set, where k is a non-negative integer;

Step 3): performing entity attribute extraction on t using the final rule set.
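The alternation in step 2) can be sketched in Python as follows. This is only an illustrative sketch: the function name bootstrap_rules, the two callables it receives, and the representation of a rule as an opaque string are assumptions made for the example and are not defined by the patent.

```python
from typing import Callable, Sequence, Set, Tuple

Attribute = Tuple[str, str]  # (attribute name, attribute value)
Rule = str                   # opaque rule representation (assumption)


def bootstrap_rules(
    pages: Sequence[str],
    seed_rules: Set[Rule],
    extract_attributes: Callable[[Sequence[str], Set[Rule]], Set[Attribute]],
    extract_rules: Callable[[Sequence[str], Set[Attribute]], Set[Rule]],
    k: int,
) -> Set[Rule]:
    """Repeat 'extract attributes, then extract rules' k times (step 2)."""
    if k < 0:
        raise ValueError("k must be a non-negative integer")
    rules = set(seed_rules)  # current rule set from step 1
    for _ in range(k):
        # Extract entity attributes from the whole set t with the current rules.
        attributes = extract_attributes(pages, rules)
        # Extract rules from t with those attributes; they become the current rules.
        rules = extract_rules(pages, attributes)
    return rules  # final rule set used in step 3
```

With k = 0 the seed rule set from step 1) is returned unchanged, which matches k being allowed to be zero.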

In one embodiment, step 1) comprises:

Step 11): selecting a page from the online encyclopedia web page text set t to be extracted;

Step 12): annotating the entity attributes of the page to obtain an entity attribute set u;

Step 13): extracting the entity attribute expression rules of the page according to the entity attribute set u to obtain a rule set r.

In one embodiment, step 13) further comprises:

assigning a weight to each entity attribute expression rule in r according to the position at which the rule occurs in the page, wherein the weight of an entity attribute expression rule that occurs in the attribute description part of the page is greater than the weight of an entity attribute expression rule that occurs in the non-attribute description part of the page and does not occur in the attribute description part.

In one embodiment, step 2) comprises:

Step 21): performing entity attribute extraction on the online encyclopedia web page text set t to be extracted using the rule set r;

Step 22): obtaining an entity attribute set u' from the extracted entity attributes according to the position at which each entity attribute occurs in the page and the weight of the entity attribute expression rule that extracted it;

Step 23): extracting the entity attribute expression rules of t according to the entity attribute set u' to obtain a rule set r';

Step 24): updating r to r' and returning to step 21) until this process has been repeated k times, to obtain a final rule set, where k is a non-negative integer.

In one embodiment, step 22) comprises:

Step a): assigning a weight to each entity attribute according to the position at which the entity attribute occurs in the page and the weight of the entity attribute expression rule that extracted it;

Step b): selecting the n entity attributes with the highest weight values to obtain the entity attribute set u', where n is a positive integer.

In a further embodiment, step a) comprises:

assigning the weight α₁*β to an entity attribute that occurs in the attribute description part of the page; and

assigning the weight α₂*β to an entity attribute that occurs in the non-attribute description part of the page and does not occur in the attribute description part;

where β is the weight of the entity attribute expression rule that extracted the entity attribute, and α₂ < α₁.

In one embodiment, step 22) further comprises: merging the entity attribute set u into u'.

In a further embodiment, step 24) further comprises: updating u to u' when returning to step 21).

In one embodiment, step 23) further comprises:

assigning a weight to each entity attribute expression rule in the rule set r' according to the position at which the rule occurs in the page, wherein the weight of an entity attribute expression rule that occurs in the attribute description part of the page is greater than the weight of an entity attribute expression rule that occurs in the non-attribute description part of the page and does not occur in the attribute description part.

In a further embodiment, step 24) further comprises:

from the extracted entity attribute expression rules, taking the m entity attribute expression rules with the highest weight values as the final rule set, where m is a positive integer.

According to an embodiment of the present invention, an entity attribute extraction system for online encyclopedias is also provided, the system comprising:

a rule obtaining device, configured to select a page from an online encyclopedia web page text set t to be extracted and to extract the entity attribute expression rules of the page to obtain a current rule set;

a new rule generating device, configured to perform entity attribute extraction on the online encyclopedia web page text set t to be extracted using the current rule set, to extract the entity attribute expression rules of t according to the extracted entity attributes, to take the extracted rule set as the current rule set, and to repeat this process k times to obtain a final rule set, where k is a non-negative integer; and

an entity attribute extraction device, configured to perform entity attribute extraction on t using the final rule set.

The present invention can adaptively supplement and refine the rules used to extract entity attributes, and then use the supplemented and refined rules to extract all entity attributes of a page. The entity attribute extraction method provided by the present invention adapts to changes in text structure, is applicable to various online encyclopedias, and achieves relatively high recall and accuracy.

Brief description of the drawings

Fig. 1 is a flowchart of an entity attribute extraction method for online encyclopedias according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for extracting a rule set according to an entity attribute set, according to an embodiment of the present invention;

Fig. 3 is a flowchart of a method for generating a new rule set according to an embodiment of the present invention; and

Fig. 4 is a flowchart of a method for extracting an entity attribute set using a rule set, according to an embodiment of the present invention.

Detailed description

The present invention is described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

According to an embodiment of the present invention, an entity attribute extraction method for online encyclopedias is provided. Briefly, and with reference to Fig. 1, the method comprises: obtaining online encyclopedia web page text, preprocessing the web page text, annotating entity attributes, obtaining rules, generating new rules, and extracting the entity attributes of the online encyclopedia web page text. These steps are described in turn below:

Step S101: obtaining a certain number of online encyclopedia web page texts

Those skilled in the art will understand that online encyclopedia web page text (also called page text) may be obtained using a web crawler or a third-party application API.

Step S102: preprocessing the obtained web page text

In one embodiment, preprocessing comprises removing unwanted data from the web page text, such as all spaces and the content inside <ref> tags. Preprocessing yields a web page text set t to be extracted of a certain scale.
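The preprocessing of step S102 might look like the following Python sketch. The function name preprocess and the regular expressions are assumptions for illustration; in particular, treating "spaces" as the ASCII space and the ideographic space U+3000 while keeping line breaks is a choice made here, not something the patent specifies.

```python
import re

# Self-closing tags such as <ref name="a" /> (assumed wikitext form).
_REF_SELF_CLOSING = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)
# Paired tags such as <ref name="a">...</ref>, including their content.
_REF_PAIRED = re.compile(r"<ref\b[^>]*>.*?</ref\s*>", re.IGNORECASE | re.DOTALL)


def preprocess(text: str) -> str:
    """Drop <ref> content and spaces from one page of web page text."""
    # Remove self-closing tags first so they are not mistaken for opening tags.
    text = _REF_SELF_CLOSING.sub("", text)
    text = _REF_PAIRED.sub("", text)
    # Remove ASCII and ideographic spaces; line breaks are kept (assumption).
    return text.replace(" ", "").replace("\u3000", "")
```

Applying preprocess to every obtained page would give the set t used in the later steps.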

Step S103: annotating entity attributes

In one embodiment, a page μ may be chosen arbitrarily from the web page text set t to be extracted, and the entity attributes in the attribute description part of page μ are annotated to obtain an entity attribute set u. An entity attribute may comprise an attribute name and an attribute value (for example, represented as a key-value pair). In one embodiment, the annotation principle is that the entity attribute set u should cover, as far as possible, all entity attributes that occur in the attribute description part.

Step S104: obtaining rules

According to the entity attribute set u obtained in the previous step, the entity attribute expression rules (hereinafter simply "rules") in page μ are extracted to obtain a rule set r. As shown in Fig. 2, this step comprises the following sub-steps:

4-1) Extracting the rules in page μ using the entity attribute set u, and removing duplicate rules, to obtain the rule set r.

Those skilled in the art will understand that various existing techniques may be used to extract the rules in a page according to its entity attributes.
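The patent leaves the rule extraction technique open. As one possible illustration only, a rule can be modelled as the text surrounding an annotated attribute name and value on a line. The Rule type, its three fields, and the function extract_rules_from_page below are hypothetical and are not the patent's definition of an entity attribute expression rule.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Rule(NamedTuple):
    """Hypothetical rule: the text around an attribute name and value."""
    prefix: str     # text before the attribute name
    separator: str  # text between the name and the value
    suffix: str     # text after the value


def extract_rules_from_page(
    page_text: str, attributes: Iterable[Tuple[str, str]]
) -> Set[Rule]:
    """Derive rules from the lines where an annotated (name, value) pair occurs."""
    pairs = list(attributes)
    rules: Set[Rule] = set()  # a set removes duplicate rules, as in sub-step 4-1
    for line in page_text.splitlines():
        for name, value in pairs:
            name_start = line.find(name)
            if name_start < 0:
                continue
            name_end = name_start + len(name)
            value_start = line.find(value, name_end)
            if value_start < 0:
                continue
            rules.add(Rule(
                prefix=line[:name_start],
                separator=line[name_end:value_start],
                suffix=line[value_start + len(value):],
            ))
    return rules
```

For a page containing the lines "|name=Li Bai" and "|born=701" with both pairs annotated, this sketch yields the single rule Rule("|", "=", "").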

4-2) Assigning a weight β to each rule in the rule set r according to the position at which the rule occurs in the page.

The weight of a rule that occurs in the attribute description part of the page is greater than the weight of a rule that occurs only in the non-attribute description part. This follows from the higher confidence of the attribute description part and the lower confidence of the non-attribute description part: as those skilled in the art know, rules that occur in the attribute description part (that is, inside the attribute box) are generally more accurate, whereas rules that occur in the non-attribute description part (that is, outside the attribute box) are generally less accurate. In one embodiment, weights are assigned to rules according to the following cases:

If a rule occurs in the attribute description part of page μ, the rule is assigned the weight β₁;

If a rule occurs only in the non-attribute description part of page μ, the rule is assigned the weight β₂;

If a rule occurs in both the attribute description part and the non-attribute description part of page μ, it is treated as occurring in the attribute description part and is assigned the weight β₁. Here 0 < β₂ < β₁ ≤ 1; β₁ and β₂ may be any numbers in (0, 1] that satisfy β₂ < β₁.
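The three cases above reduce to a single check, sketched below in Python. The names Section and rule_weight are illustrative assumptions; the default values 1.0 and 0.8 are the β₁ and β₂ values reported in the experiment later in this description.

```python
from enum import Enum
from typing import Set


class Section(Enum):
    ATTRIBUTE = "attribute description part"
    OTHER = "non-attribute description part"


def rule_weight(sections: Set[Section], beta1: float = 1.0, beta2: float = 0.8) -> float:
    """Return beta1 if the rule occurs in the attribute description part, else beta2."""
    if not 0 < beta2 < beta1 <= 1:
        raise ValueError("weights must satisfy 0 < beta2 < beta1 <= 1")
    if not sections:
        raise ValueError("the rule must occur somewhere in the page")
    # Occurring in both parts is treated as occurring in the attribute part.
    return beta1 if Section.ATTRIBUTE in sections else beta2
```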

Step S105: generating new rules

In general, this step first performs entity attribute extraction on the web page text set t to be extracted using the rule set r obtained from page μ, yielding a new entity attribute set u'; u' is then used to extract the rules in all web pages of t, yielding a new rule set r'. With r' taken as r and u' taken as u, this process (that is, entity attribute extraction and rule extraction over the set t) is repeated. After k rounds (k being a non-negative integer), the resulting rule set is filtered to obtain a new rule set r_l. As shown in Fig. 3, the step of generating new rules comprises:

5-1) Performing entity attribute extraction on the web page text set t to be extracted using the rule set r.

In the first round, the rule set r is the rule set obtained in step S104. As shown in Fig. 4, the entity attribute extraction process comprises the following sub-steps:

a) Performing entity attribute extraction on the web page text set t to be extracted using the rule set r, and removing duplicates, to obtain an entity attribute set u_r.

b) Assigning a weight α to each entity attribute in u_r according to the position at which the entity attribute occurs.

In one embodiment, weights are assigned to entity attributes according to the following cases:

If an entity attribute occurs in the attribute description part of the page (that is, the entity attribute description part), the entity attribute is assigned the weight α₁*β;

If an entity attribute occurs only in the non-attribute description part of the page, the entity attribute is assigned the weight α₂*β;

If an entity attribute occurs in both the attribute description part and the non-attribute description part of the page, it is treated as occurring in the attribute description part and is assigned the weight α₁*β. Here β is the weight of the rule (that is, a rule in the rule set r) that extracted the entity attribute, and 0 < α₂ < α₁ ≤ 1.

Those skilled in the art will understand the reason for setting entity attribute weights in this way: because the attribute description part of a page has higher confidence, entity attributes that occur in the attribute description part are more accurate, whereas those that occur in the non-attribute description part are less accurate. α₁ and α₂ may be any numbers in (0, 1] that satisfy α₂ < α₁.
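The attribute weight is thus the product of a position factor and the weight of the extracting rule. A minimal Python sketch follows; the function name attribute_weight and its parameters are assumptions for illustration, and the defaults 1.0 and 0.8 are the α₁ and α₂ values reported in the experiment later in this description.

```python
def attribute_weight(
    in_attribute_part: bool,
    rule_beta: float,
    alpha1: float = 1.0,
    alpha2: float = 0.8,
) -> float:
    """Weight of an extracted attribute: alpha (by position) times beta (of its rule)."""
    if not 0 < alpha2 < alpha1 <= 1:
        raise ValueError("weights must satisfy 0 < alpha2 < alpha1 <= 1")
    if not 0 < rule_beta <= 1:
        raise ValueError("rule weight beta must lie in (0, 1]")
    # in_attribute_part is True when the attribute occurs in the attribute
    # description part, including when it occurs in both parts.
    alpha = alpha1 if in_attribute_part else alpha2
    return alpha * rule_beta
```

With the experimental values, an attribute found outside the attribute description part by a rule of weight β₂ = 0.8 receives 0.8 * 0.8 = 0.64, while one found inside it by a rule of weight β₁ = 1 receives 1.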

5-2) From the entity attribute set u_r, selecting the n entity attributes with the highest confidence (that is, the highest weight values) according to the weight α as the new entity attribute set u'.

In one embodiment, the n entity attributes with the highest weight values may be merged with the entity attribute set u to form the new entity attribute set u'.
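Selecting the n highest-weighted attributes and optionally merging them with u can be sketched as follows. The function name select_attributes and the dictionary representation of weighted attributes are assumptions for illustration; how ties at the cut-off are broken is not specified in the patent, and here the earlier-inserted attribute wins.

```python
import heapq
from typing import Dict, Iterable, Set, Tuple

Attribute = Tuple[str, str]  # (attribute name, attribute value)


def select_attributes(
    weights: Dict[Attribute, float],
    n: int,
    previous: Iterable[Attribute] = (),
) -> Set[Attribute]:
    """Build u' from the n highest-weighted attributes, merged with the previous set u."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # heapq.nlargest iterates over the dictionary keys and ranks them by weight.
    top_n = heapq.nlargest(n, weights, key=weights.get)
    return set(top_n) | set(previous)
```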

5-3) Extracting the rules in all pages to be extracted according to the new entity attribute set u', comprising the following sub-steps:

a) Performing rule extraction on the web page text set t to be extracted using u', and removing duplicates from the extracted rules, to obtain a new rule set r'.

b) Assigning a weight β to each rule in the rule set r'.

As in step S104, assigning weights to rules comprises:

If a rule occurs in the attribute description part of a page, the rule is assigned the weight β₁;

If a rule occurs only in the non-attribute description part of a page, the rule is assigned the weight β₂;

If a rule occurs in both the attribute description part and the non-attribute description part, it is treated as occurring in the attribute description part and is assigned the weight β₁. Here 0 < β₂ < β₁ ≤ 1.

5-4) Taking the new entity attribute set u' as u and the new rule set r' as r, and repeating steps 5-1) to 5-3) for k rounds, to obtain a rule set r_k.

5-5) From the rule set r_k, selecting the m rules with the highest weight values according to the weight β as the final rule set r_l.

In one embodiment, the rules in r_k may be sorted in descending order of weight value, and the top m% of the rules taken as the final rule set r_l.
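Both filtering variants of step 5-5) can be sketched in Python as below. The function names, the dictionary mapping each rule to its weight β, and rounding the percentage cut-off up to a whole rule are assumptions for illustration; the patent does not say how ties or fractional counts are handled.

```python
import math
from typing import Dict, Hashable, List


def top_m_rules(rule_weights: Dict[Hashable, float], m: int) -> List[Hashable]:
    """Keep the m rules with the highest weight (ties keep insertion order)."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    ranked = sorted(rule_weights, key=rule_weights.get, reverse=True)
    return ranked[:m]


def top_percent_rules(rule_weights: Dict[Hashable, float], percent: float) -> List[Hashable]:
    """Keep the top `percent` percent of rules, rounding the count up (assumption)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must lie in (0, 100]")
    ranked = sorted(rule_weights, key=rule_weights.get, reverse=True)
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]
```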

Step S106: performing entity attribute extraction on the web page text set t to be extracted using the rule set r_l, and removing impurities, to obtain the final entity attribute set

It should be understood that impurities here means unwanted data: impurities in the attribute names and attribute values of the result are removed, and duplicate entity attributes are removed.
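The patent does not list which data counts as impurity. The Python sketch below assumes, purely for illustration, that impurities are leftover wiki markup (link brackets, bold or italic quote runs, and HTML-style tags) and that pairs with an empty name or value are dropped; the function name clean_attributes is likewise hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumed impurities: [[ and ]], runs of two or more apostrophes, and <...> tags.
_MARKUP = re.compile(r"\[\[|\]\]|'{2,}|<[^>]+>")


def clean_attributes(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Strip markup from names and values, then drop empty and duplicate pairs."""
    seen = set()
    cleaned = []
    for name, value in pairs:
        name = _MARKUP.sub("", name).strip()
        value = _MARKUP.sub("", value).strip()
        if not name or not value:
            continue  # nothing usable left after cleaning (assumption)
        if (name, value) in seen:
            continue  # duplicate entity attribute
        seen.add((name, value))
        cleaned.append((name, value))
    return cleaned
```

Piped links of the form [[target|label]] are not resolved by this sketch and would need separate handling.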

According to an embodiment of the present invention, an entity attribute extraction system for online encyclopedias is also provided, comprising a rule obtaining device, a new rule generating device, and an entity attribute extraction device.

The rule obtaining device is configured to select a page from an online encyclopedia web page text set t to be extracted and to extract the entity attribute expression rules of the page to obtain a current rule set.

The new rule generating device is configured to perform entity attribute extraction on the online encyclopedia web page text set t to be extracted using the current rule set, to extract the entity attribute expression rules of t according to the extracted entity attributes, to take the extracted rule set as the current rule set, and to repeat this process k times to obtain a final rule set.

The entity attribute extraction device is configured to perform entity attribute extraction on t using the final rule set.

To verify the effectiveness of the entity attribute extraction method and system for online encyclopedias provided by the present invention, the inventors ran an experiment on the Chinese Wikipedia data set of June 25, 2013, with the following parameters:

1000 entity pages were chosen arbitrarily as the page set to be extracted. In a Wikipedia page, the entity attribute description part is located at the beginning of the web page text; it starts with the mark "{{" and ends with the mark "}}". β₁ was set to 1, β₂ to 0.8, α₁ to 1, and α₂ to 0.8; n was set to 100. Because the experimental data set is small, k was set to 5 and m to 50.
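Locating the attribute description part by its "{{" and "}}" marks can be sketched as follows. The function name split_attribute_section is hypothetical, and counting nested "{{ ... }}" pairs so that an inner template does not end the section early is an assumption added here; the patent only states the start and end marks.

```python
from typing import Tuple


def split_attribute_section(page_text: str) -> Tuple[str, str]:
    """Return (attribute description part, remaining text) for one page."""
    start = page_text.find("{{")
    if start < 0:
        return "", page_text  # no attribute description part found
    depth = 0
    i = start
    while i < len(page_text) - 1:
        pair = page_text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # The outermost "{{" has been closed: the section ends here.
                return page_text[start:i], page_text[:start] + page_text[i:]
        else:
            i += 1
    return "", page_text  # unbalanced marks: treat the page as having no section
```

The first element of the result would be weighted with β₁ and α₁, and the second with β₂ and α₂, under the weighting cases described above.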

The experiment gave the following result: the number of entity attributes extracted from each page exceeded the number of entity attributes in the attribute description part of that page, and the average accuracy reached more than 90%.

It should be noted and understood that various modifications and improvements may be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. The scope of the claimed technical solution is therefore not limited by any specific exemplary teaching given.

Claims

1. An entity attribute extraction method for online encyclopedias, comprising:

Step 1): selecting a page from an online encyclopedia web page text set t to be extracted, extracting the entity attribute expression rules of the page, and assigning a weight to each entity attribute expression rule according to the position at which the rule occurs in the page, to obtain a current rule set; wherein the weight of an entity attribute expression rule that occurs in the attribute description part of the page is greater than the weight of an entity attribute expression rule that occurs in the non-attribute description part of the page and does not occur in the attribute description part;

Step 2): performing entity attribute extraction on the online encyclopedia web page text set t to be extracted using the current rule set, extracting the entity attribute expression rules of t according to the extracted entity attributes, taking the extracted rule set as the current rule set, and repeating this process k times to obtain a final rule set, where k is a non-negative integer;

2. The method according to claim 1, wherein step 1) comprises:

3. The method according to claim 2, wherein step 2) comprises:

Step 21): performing entity attribute extraction on the online encyclopedia web page text set t to be extracted using the rule set r;

Step 22): obtaining an entity attribute set u' from the extracted entity attributes according to the position at which each entity attribute occurs in the page and the weight of the entity attribute expression rule that extracted it;

Step 24): updating r to r' and returning to step 21) until this process has been repeated k times, to obtain a final rule set, where k is a non-negative integer.

4. The method according to claim 3, wherein step 22) comprises:

Step a): assigning a weight to each entity attribute according to the position at which the entity attribute occurs in the page and the weight of the entity attribute expression rule that extracted it;

5. The method according to claim 4, wherein step a) comprises:

6. The method according to claim 3, wherein step 22) further comprises:

merging the entity attribute set u into u'.

7. The method according to claim 6, wherein step 24) further comprises:

updating u to u' when returning to step 21).

8. The method according to claim 3, wherein step 23) further comprises:

assigning a weight to each entity attribute expression rule in the rule set r' according to the position at which the rule occurs in the page, wherein the weight of an entity attribute expression rule that occurs in the attribute description part of the page is greater than the weight of an entity attribute expression rule that occurs in the non-attribute description part of the page and does not occur in the attribute description part.

9. The method according to claim 8, wherein step 24) further comprises:

from the extracted entity attribute expression rules, taking the m entity attribute expression rules with the highest weight values as the final rule set, where m is a positive integer.

10. An entity attribute extraction system for online encyclopedias, comprising:

a rule obtaining device, configured to select a page from an online encyclopedia web page text set t to be extracted, to extract the entity attribute expression rules of the page, and to assign a weight to each entity attribute expression rule according to the position at which the rule occurs in the page, to obtain a current rule set; wherein the weight of an entity attribute expression rule that occurs in the attribute description part of the page is greater than the weight of an entity attribute expression rule that occurs in the non-attribute description part of the page and does not occur in the attribute description part;

a new rule generating device, configured to perform entity attribute extraction on the online encyclopedia web page text set t to be extracted using the current rule set, to extract the entity attribute expression rules of t according to the extracted entity attributes, to take the extracted rule set as the current rule set, and to repeat this process k times to obtain a final rule set, where k is a non-negative integer; and