CN103853823A - Online encyclopedia oriented entity attribute extraction method and system

Info

Publication number: CN103853823A
Application number: CN201410065743.7A
Authority: CN
Inventors: 程学旗; 贾岩涛; 张泽慧; 王元卓; 冯凯; 熊锦华; 许洪波
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2014-02-26
Filing date: 2014-02-26
Publication date: 2014-06-11
Anticipated expiration: 2034-02-26
Also published as: CN103853823B

Abstract

The invention provides an entity attribute extraction method and system for online encyclopedias. The method comprises: selecting a page from a set T of online encyclopedia web page texts to be extracted, and extracting the entity attribute expression rules of that page to obtain a current rule set; using the current rule set to extract entity attributes from T, extracting the entity attribute expression rules of T according to the extracted entity attributes, taking the resulting rule set as the current rule set, and repeating this process k times to obtain a final rule set; and performing entity attribute extraction on T using the final rule set. The method adapts to changes in text structure, is applicable to various online encyclopedias, and achieves a high recall rate and high accuracy.

Description

An entity attribute extraction method and system for online encyclopedias

Technical field

The present invention relates to the field of information technology, and in particular to an entity attribute extraction method and system for online encyclopedias.

Background

An online encyclopedia, also called a network encyclopedia, is an encyclopedia that Internet users can consult online. Network encyclopedias come in two kinds, open and non-open. Users can use an online encyclopedia to look up all kinds of information conveniently and promptly. Because Internet users take part in building open encyclopedias, the information in them is more open and transparent, and also richer and more complete. Well-known open network encyclopedias include Wikipedia, Popular Encyclopedia, Baidu Baike and Hudong Baike (Interactive Encyclopedia).

Online encyclopedias describe all kinds of entities for users to query. An entity is an objective thing in the real world, that is, anything in the real world that can be distinguished and identified. An entity can be a tangible object or an abstract event. Entity attributes are the basic characteristics of an entity. They help people understand the entity comprehensively and objectively, and the more entity attributes are available, the more detailed the description of the entity. Entity attribute extraction is therefore of great significance.

Online encyclopedias describe entities comprehensively and in detail, and there is a one-to-one relation between an entity in an online encyclopedia and the page that describes it. In addition, the page structure of an online encyclopedia follows certain regularities: each entity page has a separate part that describes the entity's attributes, and this attribute description part is usually semi-structured text, which is convenient for extraction. At present, rule-based (template-based) methods are mainly used to extract entity attributes from online encyclopedias. However, because the text structure differs from one online encyclopedia to another, the rules used to extract entity attributes also differ. Existing entity attribute extraction methods are therefore usually tailored to one particular online encyclopedia and cannot be applied to others.

Summary of the invention

To solve the above problem, the invention provides an entity attribute extraction method for online encyclopedias, the method comprising:

Step 1), selecting a page from the set T of online encyclopedia web page texts to be extracted, and extracting the entity attribute expression rules of this page to obtain a current rule set;

Step 2), using the current rule set to perform entity attribute extraction on T, extracting the entity attribute expression rules of T according to the extracted entity attributes, taking the extracted rule set as the current rule set, and repeating this process k times to obtain a final rule set, where k is a non-negative integer;

Step 3), using the final rule set to perform entity attribute extraction on T.
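Steps 1) to 3) amount to a bootstrapping loop. The following Python sketch shows only the control flow of step 2). The callables `apply_rules` and `induce_rules` are hypothetical stand-ins for the extraction and rule-induction procedures, and representing a rule as a string is an assumption made for illustration.

```python
from typing import Callable, List, Set, Tuple

Attribute = Tuple[str, str]  # (attribute name, attribute value)
Rule = str                   # opaque rule representation (assumption)


def bootstrap_rules(
    pages: List[str],
    seed_rules: Set[Rule],
    k: int,
    apply_rules: Callable[[Set[Rule], List[str]], Set[Attribute]],
    induce_rules: Callable[[Set[Attribute], List[str]], Set[Rule]],
) -> Set[Rule]:
    """Repeat 'extract attributes, then re-induce rules' k times over T."""
    if k < 0:
        raise ValueError("k must be a non-negative integer")
    current = set(seed_rules)  # rule set obtained from the seed page (step 1)
    for _ in range(k):
        attributes = apply_rules(current, pages)   # extract with current rules
        current = induce_rules(attributes, pages)  # new rules become current
    return current  # final rule set, used for the extraction in step 3
```

With k = 0 the loop body never runs, so the seed rule set is returned unchanged, which matches k being allowed to be zero.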

In one embodiment, step 1) comprises:

Step 11), selecting a page from the set T of online encyclopedia web page texts to be extracted;

Step 12), annotating the entity attributes of this page to obtain an entity attribute set U;

Step 13), extracting the entity attribute expression rules of this page according to the entity attribute set U to obtain a rule set R.

In one embodiment, step 13) further comprises:

Assigning a weight to each entity attribute expression rule in R according to the position at which the rule appears in the page, wherein the weight of a rule that appears in the attribute description part of the page is greater than the weight of a rule that appears in the non-attribute-description part of the page and does not appear in the attribute description part.

In one embodiment, step 2) comprises:

Step 21), using the rule set R to perform entity attribute extraction on the set T of online encyclopedia web page texts to be extracted;

Step 22), obtaining an entity attribute set U' from the extracted entity attributes, according to the position at which each entity attribute appears in the page and the weight of the entity attribute expression rule that extracted it;

Step 23), extracting the entity attribute expression rules of T according to the entity attribute set U' to obtain a rule set R';

Step 24), updating R to R' and returning to step 21) until this process has been repeated k times, obtaining the final rule set, where k is a non-negative integer.

In one embodiment, step 22) comprises:

Step a), assigning a weight to each entity attribute according to the position at which it appears in the page and the weight of the entity attribute expression rule that extracted it;

Step b), selecting the n entity attributes with the highest weights to obtain the entity attribute set U', where n is a positive integer.

In a further embodiment, step a) comprises:

Assigning the weight α₁ * β to an entity attribute that appears in the attribute description part of the page; and

Assigning the weight α₂ * β to an entity attribute that appears in the non-attribute-description part of the page and does not appear in the attribute description part;

Where β is the weight of the entity attribute expression rule that extracted the entity attribute, and α₂ < α₁.

In one embodiment, step 22) further comprises: merging the entity attribute set U into U'.

In a further embodiment, step 24) further comprises: updating U to U' when returning to step 21).

In one embodiment, step 23) further comprises:

Assigning a weight to each entity attribute expression rule in the rule set R' according to the position at which the rule appears in the page, wherein the weight of a rule that appears in the attribute description part of the page is greater than the weight of a rule that appears in the non-attribute-description part of the page and does not appear in the attribute description part.

In a further embodiment, step 24) further comprises:

Among the extracted entity attribute expression rules, taking the m rules with the highest weights as the final rule set, where m is a positive integer.

According to one embodiment of the invention, an entity attribute extraction system for online encyclopedias is also provided, the system comprising:

A rule acquisition device, for selecting a page from the set T of online encyclopedia web page texts to be extracted, and extracting the entity attribute expression rules of this page to obtain a current rule set;

A new rule generation device, for using the current rule set to perform entity attribute extraction on T, extracting the entity attribute expression rules of T according to the extracted entity attributes, taking the extracted rule set as the current rule set, and repeating this process k times to obtain a final rule set, where k is a non-negative integer; and

An entity attribute extraction device, for using the final rule set to perform entity attribute extraction on T.

The invention can adaptively supplement and refine the rules used to extract entity attributes, and then use the supplemented and refined rules to extract all the entity attributes of the pages. The entity attribute extraction method provided by the invention adapts to changes in text structure, is applicable to various online encyclopedias, and achieves high recall and accuracy.

Brief description of the drawings

Fig. 1 is a flowchart of an entity attribute extraction method for online encyclopedias according to an embodiment of the invention;

Fig. 2 is a flowchart of a method for extracting a rule set from an entity attribute set according to an embodiment of the invention;

Fig. 3 is a flowchart of a method for generating a new rule set according to an embodiment of the invention; and

Fig. 4 is a flowchart of a method for extracting an entity attribute set using a rule set according to an embodiment of the invention.

Detailed description

The invention is described below with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

According to one embodiment of the invention, an entity attribute extraction method for online encyclopedias is provided. With reference to Fig. 1, and in brief, the method comprises: obtaining online encyclopedia web page texts, preprocessing the web page texts, annotating entity attributes, obtaining rules, generating new rules, and extracting the entity attributes of the online encyclopedia web page texts. These steps are described in turn below:

Step S101: obtain a certain number of online encyclopedia web page texts

Those skilled in the art will understand that the web page texts (also called page texts) of an online encyclopedia can be obtained using a web crawler or a third-party API.

Step S102: preprocess the obtained web page texts

In one embodiment, preprocessing comprises removing unneeded data from the web page texts, such as all spaces and the content of <ref> tags. After preprocessing, a set T of web page texts to be extracted, of a certain size, is obtained.
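A minimal Python sketch of such a preprocessing step follows. The regular expressions, the function name `preprocess`, and the choice to keep line breaks (so that line-oriented rules still work later) are assumptions for illustration. Removing every space is only reasonable for text, such as Chinese, that does not separate words with spaces.

```python
import re

# <ref ...>...</ref> pairs, including their content (attribute handling is simplified)
REF_PAIR = re.compile(r"<ref\b[^>/]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# self-closing references such as <ref name="x" />
REF_SELF_CLOSING = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)
# ASCII spaces, tabs and full-width (ideographic) spaces; newlines are kept
SPACES = re.compile(r"[ \t\u3000]+")


def preprocess(page_text: str) -> str:
    """Remove <ref> tags with their content, then remove all spaces."""
    text = REF_SELF_CLOSING.sub("", page_text)
    text = REF_PAIR.sub("", text)
    return SPACES.sub("", text)


def build_text_set(raw_pages):
    """Build the set T (here a list) of preprocessed page texts."""
    return [preprocess(page) for page in raw_pages]
```

The self-closing pattern is applied first so that a tag such as `<ref name="x" />` is never mistaken for the opening half of a pair.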

Step S103: annotate entity attributes

In one embodiment, a page μ can be chosen arbitrarily from the set T of web page texts to be extracted, and the entity attributes in the attribute description part of page μ are annotated to obtain the entity attribute set U. An entity attribute can comprise an attribute name and an attribute value (for example, represented as a key-value pair). In one embodiment, the annotation guideline is that U should cover, as far as possible, all the entity attributes that appear in the attribute description part.

Step S104: obtain rules

According to the entity attribute set U obtained in the previous step, the entity attribute expression rules (rules for short) in page μ are extracted to obtain the rule set R. As shown in Fig. 2, this step comprises the following sub-steps:

4-1) Use the entity attribute set U to extract the rules in page μ and remove duplicate rules to obtain the rule set R.

Those skilled in the art will understand that various existing techniques can be used to extract the rules in a page according to its entity attributes.
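Since the rule extraction technique is left open, the following Python sketch shows just one possible choice, not the method of the invention. It assumes a rule is the literal context (prefix, separator, suffix) surrounding an annotated attribute name and value on a single line. The names `Rule`, `induce_rules` and `apply_rule` are hypothetical.

```python
import re
from typing import NamedTuple, Set, Tuple

Attribute = Tuple[str, str]  # (attribute name, attribute value)


class Rule(NamedTuple):
    prefix: str     # text before the attribute name on the line
    separator: str  # text between the name and the value
    suffix: str     # text after the value


def induce_rules(page: str, attributes: Set[Attribute]) -> Set[Rule]:
    """Derive line-level context rules from annotated attributes; the set removes duplicates."""
    rules = set()
    for line in page.splitlines():
        for name, value in attributes:
            i = line.find(name)
            if i < 0:
                continue
            j = line.find(value, i + len(name))
            if j <= i + len(name):  # value missing or no separator: skip
                continue
            rules.add(Rule(line[:i], line[i + len(name):j], line[j + len(value):]))
    return rules


def apply_rule(rule: Rule, page: str) -> Set[Attribute]:
    """Extract (name, value) pairs from every line that fits the rule's context."""
    pattern = re.compile(
        re.escape(rule.prefix) + "(.+?)" + re.escape(rule.separator)
        + "(.+?)" + re.escape(rule.suffix)
    )
    found = set()
    for line in page.splitlines():
        match = pattern.fullmatch(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return found
```

For example, annotating only the birth date on a page whose lines look like `|name=value` yields a single rule that also extracts the other attributes written in the same form.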

4-2) According to the position at which each rule appears in the page, assign a weight β to each rule in the rule set R.

The weight of a rule that appears in the attribute description part of the page is greater than the weight of a rule that appears only in the non-attribute-description part. This follows from the attribute description part having higher confidence than the non-attribute-description part: as those skilled in the art know, rules that appear in the attribute description part (that is, inside the attribute box) are generally more accurate, whereas rules that appear in the non-attribute-description part (outside the attribute box) are generally less accurate. In one embodiment, rule weights are assigned according to the following cases:

If a rule appears in the attribute description part of page μ, the rule is assigned the weight β₁;

If a rule appears in the non-attribute-description part of page μ, the rule is assigned the weight β₂;

If a rule appears in both the attribute description part and the non-attribute-description part of page μ, it is treated as appearing in the attribute description part and assigned the weight β₁. Here 0 < β₂ < β₁ ≤ 1; β₁ and β₂ can be any numbers in (0, 1] that satisfy β₂ < β₁.
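The three cases reduce to a single test: a rule gets β₁ if at least one of its occurrences lies in the attribute description part, and β₂ otherwise. A small Python sketch, with the hypothetical function name `rule_weight` and default values 1 and 0.8 taken from the experiment reported later in this description:

```python
from typing import Iterable


def rule_weight(
    occurrences_in_attribute_part: Iterable[bool],
    beta1: float = 1.0,
    beta2: float = 0.8,
) -> float:
    """Return beta1 if any occurrence of the rule is in the attribute
    description part, else beta2.

    Each element of `occurrences_in_attribute_part` describes one occurrence
    of the rule: True for inside the attribute description part, False for
    outside it.
    """
    if not 0 < beta2 < beta1 <= 1:
        raise ValueError("weights must satisfy 0 < beta2 < beta1 <= 1")
    return beta1 if any(occurrences_in_attribute_part) else beta2
```

A rule seen both inside and outside the attribute box, `rule_weight([False, True])`, therefore receives β₁, as in the third case.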

Step S105: generate new rules

In general terms, this step first uses the rule set R obtained from page μ to perform entity attribute extraction on the set T of web page texts to be extracted, obtaining a new entity attribute set U', and then uses U' to extract the rules in all the pages of T, obtaining a new rule set R'. R' is then used as R and U' as U, and the process (entity attribute extraction and rule extraction over the set T) is repeated. After k rounds (k being a non-negative integer), the resulting rule set is filtered to obtain the final rule set R_l. As shown in Fig. 3, generating new rules comprises:

5-1) Use the rule set R to perform entity attribute extraction on the set T of web page texts to be extracted.

In the first round, the rule set R is the one obtained in step S104. As shown in Fig. 4, the entity attribute extraction process comprises the following sub-steps:

A) Use the rule set R to extract entity attributes from the set T and remove duplicates, obtaining the entity attribute set U_r.

B) According to the position at which each entity attribute appears, assign a weight α to each entity attribute in U_r.

In one embodiment, entity attribute weights are assigned according to the following cases:

If an entity attribute appears in the attribute description part of the page (that is, the part that describes entity attributes), it is assigned the weight α₁ * β;

If an entity attribute appears in the non-attribute-description part of the page, it is assigned the weight α₂ * β;

If an entity attribute appears in both the attribute description part and the non-attribute-description part of the page, it is treated as appearing in the attribute description part and assigned the weight α₁ * β. Here β is the weight of the rule (a rule in the rule set R) that extracted the entity attribute, and 0 < α₂ < α₁ ≤ 1.

Those skilled in the art will understand the reason for setting entity attribute weights in this way: because the attribute description part of a page has higher confidence, entity attributes that appear in it are more accurate, whereas entity attributes that appear in the non-attribute-description part are less accurate. α₁ and α₂ can be any numbers in (0, 1] that satisfy α₂ < α₁.
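The weight of an entity attribute is thus the product of a position factor and the weight of the rule that found it. The Python sketch below uses the hypothetical function name `attribute_weight` and takes a single rule weight β; how to choose β when several rules extract the same attribute is not stated in this description and is left out here.

```python
def attribute_weight(
    in_attribute_part: bool,
    rule_beta: float,
    alpha1: float = 1.0,
    alpha2: float = 0.8,
) -> float:
    """Return alpha * beta for one extracted entity attribute.

    `in_attribute_part` is True when the attribute appears in the attribute
    description part (even if it also appears elsewhere in the page).
    `rule_beta` is the weight of the rule that extracted the attribute.
    """
    if not 0 < alpha2 < alpha1 <= 1:
        raise ValueError("weights must satisfy 0 < alpha2 < alpha1 <= 1")
    if not 0 < rule_beta <= 1:
        raise ValueError("rule weight must lie in (0, 1]")
    alpha = alpha1 if in_attribute_part else alpha2
    return alpha * rule_beta
```

With α₂ = 0.8, an attribute found outside the attribute box by a rule of weight β = 0.8 gets a weight of about 0.64, so it ranks below any attribute found inside the box by a rule of weight 1.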

5-2) From the entity attribute set U_r, select the n entity attributes with the highest confidence (that is, the highest weights) according to the weight α as the new entity attribute set U'.

In one embodiment, the n highest-weighted entity attributes can be merged with the entity attribute set U to form the new entity attribute set U'.
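Selecting the n highest-weighted attributes and optionally merging in U can be sketched in Python as follows. The function name `select_top_attributes` and the dictionary that maps each attribute to its weight are assumptions, and ties between equal weights are broken arbitrarily here.

```python
import heapq
from typing import Dict, Optional, Set, Tuple

Attribute = Tuple[str, str]  # (attribute name, attribute value)


def select_top_attributes(
    weights: Dict[Attribute, float],
    n: int,
    previous: Optional[Set[Attribute]] = None,
) -> Set[Attribute]:
    """Return U': the n highest-weighted attributes of U_r, merged with U if given."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    top = heapq.nlargest(n, weights, key=weights.get)  # keys ranked by weight
    selected = set(top)
    if previous:
        selected |= previous  # optional merge with the earlier set U
    return selected
```

Merging in U keeps the manually annotated seed attributes available in every round, even when none of them ranks among the top n.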

5-3) According to the new entity attribute set U', extract the rules in all the pages to be extracted, which comprises the following sub-steps:

A) Use U' to perform rule extraction on the set T of web page texts to be extracted, and remove duplicates from the extracted rules, obtaining the new rule set R'.

B) Assign a weight β to each rule in the rule set R'.

As in step S104, rule weights are assigned as follows:

If a rule appears in the attribute description part of the page, it is assigned the weight β₁;

If a rule appears in the non-attribute-description part of the page, it is assigned the weight β₂;

If a rule appears in both the attribute description part and the non-attribute-description part, it is treated as appearing in the attribute description part and assigned the weight β₁. Here 0 < β₂ < β₁ ≤ 1.

5-4) Use the new entity attribute set U' as U and the new rule set R' as R, and repeat steps 5-1) to 5-3) for k rounds to obtain the rule set R_k.

5-5) From the rule set R_k, take the m rules with the highest weights according to the weight β as the final rule set R_l.

In one embodiment, the rules in R_k can be sorted in descending order of weight, and the top m% of the rules taken as the final rule set R_l.
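Both filtering variants of step 5-5) can be sketched in Python as below. Rules are represented as strings for brevity, the function names are hypothetical, and rounding the m% cut-off up to a whole rule is an assumption, since the rounding is not specified.

```python
import math
from typing import Dict, List


def top_m_rules(rule_weights: Dict[str, float], m: int) -> List[str]:
    """Return the m highest-weighted rules of R_k, in descending order of weight."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    ranked = sorted(rule_weights, key=rule_weights.get, reverse=True)
    return ranked[:m]


def top_percent_rules(rule_weights: Dict[str, float], percent: float) -> List[str]:
    """Return the top `percent`% of the rules of R_k by weight (count rounded up)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must lie in (0, 100]")
    ranked = sorted(rule_weights, key=rule_weights.get, reverse=True)
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]
```

Because the weighting scheme only produces the two values β₁ and β₂, many rules tie; which of the tied rules survive the cut depends on the sort order and is not defined by the description.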

Step S106: use the rule set R_l to perform entity attribute extraction on the set T of web page texts to be extracted and remove noise, obtaining the final entity attribute set.

It should be understood that noise here means unneeded data: noise is removed from the attribute names and attribute values in the result, and duplicate entity attributes are removed.
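What counts as noise is not enumerated above. The Python sketch below assumes, purely for illustration, that noise consists of wiki link brackets, bold and italic quote marks, and stray delimiter characters around names and values. The function names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

Attribute = Tuple[str, str]  # (attribute name, attribute value)

# [[target|label]] -> label, [[target]] -> target
WIKI_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def clean_text(text: str) -> str:
    """Strip assumed noise: link brackets, bold/italic marks, edge delimiters."""
    text = WIKI_LINK.sub(r"\1", text)
    text = text.replace("'''", "").replace("''", "")
    return text.strip(" \t|=:：")


def clean_attributes(attributes: Iterable[Attribute]) -> List[Attribute]:
    """Clean names and values, drop empty pairs, and remove duplicates in order."""
    seen = set()
    result = []
    for name, value in attributes:
        pair = (clean_text(name), clean_text(value))
        if pair[0] and pair[1] and pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result
```

Cleaning before deduplication matters: two extractions that differ only in markup collapse to the same pair and are then counted once.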

According to one embodiment of the invention, an entity attribute extraction system for online encyclopedias is also provided, comprising a rule acquisition device, a new rule generation device and an entity attribute extraction device.

The rule acquisition device is used to select a page from the set T of online encyclopedia web page texts to be extracted, and to extract the entity attribute expression rules of this page to obtain a current rule set.

The new rule generation device is used to perform entity attribute extraction on T using the current rule set, to extract the entity attribute expression rules of T according to the extracted entity attributes, to take the extracted rule set as the current rule set, and to repeat this process k times to obtain a final rule set.

The entity attribute extraction device is used to perform entity attribute extraction on T using the final rule set.

To verify the effectiveness of the entity attribute extraction method and system for online encyclopedias provided by the invention, the inventors ran an experiment on the Chinese Wikipedia data set of June 25, 2013, with the following parameters:

1000 entity pages were chosen arbitrarily as the set of pages to be extracted. In a Wikipedia page, the entity attribute description part is located at the beginning of the web page text; it starts with the "{{" marker and ends with the "}}" marker. β₁ was set to 1, β₂ to 0.8, α₁ to 1, α₂ to 0.8, and n to 100. Because the experimental data set is small, k was set to 5 and m to 50.
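The following Python sketch locates the attribute description part from the "{{" and "}}" markers and records the experiment's parameters. Counting nesting depth so that inner templates do not end the part early is an assumption added here, and the names `attribute_part_span` and `EXPERIMENT_PARAMS` are hypothetical.

```python
from typing import Optional, Tuple

# Parameter values as reported for the experiment
EXPERIMENT_PARAMS = {
    "pages": 1000,
    "beta1": 1.0, "beta2": 0.8,
    "alpha1": 1.0, "alpha2": 0.8,
    "n": 100, "k": 5, "m": 50,
}


def attribute_part_span(text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of the first balanced {{ ... }} block, or None."""
    start = text.find("{{")
    if start < 0:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return (start, i)
        else:
            i += 1
    return None  # opening marker without a matching close


def in_attribute_part(offset: int, text: str) -> bool:
    """Tell whether a character offset lies inside the attribute description part."""
    span = attribute_part_span(text)
    return span is not None and span[0] <= offset < span[1]
```

A helper like `in_attribute_part` is what decides between the weights β₁ and β₂, or α₁ and α₂, for a given rule or attribute occurrence.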

The experiment gave the following result: the number of entity attributes extracted from each page exceeded the number of entity attributes in that page's attribute description part, and the average accuracy reached more than 90%.

It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as set out in the appended claims. The scope of the claimed technical solution is therefore not limited by any specific exemplary teaching given here.

Claims

1. An entity attribute extraction method for online encyclopedias, comprising:

2. The method according to claim 1, wherein step 1) comprises:

3. The method according to claim 2, wherein step 13) further comprises:

4. The method according to claim 3, wherein step 2) comprises:

5. The method according to claim 4, wherein step 22) comprises:

6. The method according to claim 5, wherein step a) comprises:

7. The method according to claim 4, wherein step 22) further comprises:

Merging the entity attribute set U into U'.

8. The method according to claim 7, wherein step 24) further comprises:

Updating U to U' when returning to step 21).

9. The method according to claim 4, wherein step 23) further comprises:

10. The method according to claim 9, wherein step 24) further comprises:

11. An entity attribute extraction system for online encyclopedias, comprising: