CN101226547A

CN101226547A - Web entity recognition method for entity recognition system

Info

Publication number: CN101226547A
Application number: CNA200810056102XA
Authority: CN
Inventors: 孟小峰; 刘伟; 凌妍妍
Original assignee: Individual
Current assignee: Individual
Priority date: 2008-01-11
Filing date: 2008-01-11
Publication date: 2008-07-23

Abstract

The invention relates to a Web entity recognition method for use in an entity recognition system, wherein the entity recognition system comprises an input module, a domain-based attribute analysis module, an entity recognition module and an output module. The method is characterized by comprising the following steps: A, inputting a record set; B, analyzing the similarity computation rules of all attributes in a given domain and the correlations between the attributes; C, determining whether any two given records represent the same entity; D, outputting the entity set.

Description

A Web entity recognition method for use in an entity recognition system

Technical field

The present invention relates to the field of computer databases, and in particular to a Web entity recognition method for use in an entity recognition system.

Background art

In terms of applications, with the rapid development of the Web, the Web has come to contain a massive amount of information. According to a conservative estimate, the whole Web currently holds more than 200,000 TB of information and is still growing rapidly, and this information covers every field of the real world (such as commerce, entertainment and sports). This has gradually made the Web one of the most important channels through which people obtain useful information. However, the sheer volume of information also often prevents people from finding the information they want on the Web quickly and accurately. How to obtain useful information efficiently from today's huge Web has become a new challenge. To address this problem, many researchers are working on automated methods that help people obtain information from the Web effectively. However, the Web contains a large amount of duplicate information, where duplicate information refers to descriptions of the same real-world entity given separately by different Web data sources. Identifying such duplicate information is of great significance for Web data integration; typical application scenarios are deduplication, merging and truth identification.

Deduplication means that, of the duplicate information with which multiple Web data sources describe the same entity, only one copy is kept. For example, a user queries two book-selling Web data sources, Dangdang and Joyo, for books about "java" and wishes to buy the cheapest one. This requires identifying, in the record sets returned by the two Web data sources, the records that represent the same book, and selecting the cheapest.

Merging means that the information with which multiple Web data sources describe the same entity is combined, with the parts that differ between sources all retained. For example, a user queries the information of a certain person from multiple Web data sources that provide personal information. Each Web data source provides information on a different aspect of the person: some provide the person's work information (name, sex, age, employer, position, e-mail address, employer location, employer postal code, etc.), and some provide the person's personal-life information (name, sex, age, native place, home telephone number, home address, blood type, spouse, etc.). This requires identifying the records that represent the same person and merging them into a single record, thereby obtaining the full information about this person.

Truth identification means that, when the Web data sources give differing descriptions of some aspect of the same entity, the true one is selected from among them. For example, many Web data sources report Yi Jianlian's age in the news, and many versions exist (18, 19, 24, etc.). We need to pick out which age is the true one.

In technical terms, data on the Web are highly heterogeneous (heterogeneity refers to different forms of expression of data, such as different date formats, or full and abbreviated forms of a name) and large in scale. As a result, different Web data sources express descriptions of the same entity in different forms, which makes entity recognition very difficult in terms of both accuracy and efficiency.

Many entity recognition methods have been proposed. Although these methods achieve fairly high accuracy, they are mainly aimed at a small number of heterogeneous data sources, in particular two, and have serious efficiency problems when faced with the large number of highly heterogeneous data sources on the Web. For example, if there are 100 Web data sources, an existing entity recognition method must be run once between every two of the data sources, and therefore C_{100}^{2} = 4950 runs are needed in total.
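The figure of 4950 is the number of unordered pairs that can be drawn from 100 data sources, i.e. the binomial coefficient 100 × 99 / 2. A minimal sketch of this arithmetic follows; the function name `pairwise_runs` is hypothetical and used only for illustration.

```python
from math import comb


def pairwise_runs(num_sources: int) -> int:
    """Number of source pairs a pairwise entity recognition method must process."""
    # Each unordered pair of data sources requires one run of the pairwise method.
    return comb(num_sources, 2)


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, "sources ->", pairwise_runs(n), "pairwise runs")
```

The count grows quadratically with the number of data sources, which is the efficiency problem the passage describes.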

In order to improve the efficiency of entity recognition over large-scale Web data sources, the method we propose can process all the Web data sources within one domain at a time (a domain being a real-world field such as economics, sports or music).

Summary of the invention

In order to solve the above problems of conventional methods, one object of the present invention is to propose a Web entity recognition method for use in an entity recognition system.

In one aspect of the present invention, there is provided a Web entity recognition method for use in an entity recognition system, the entity recognition system comprising an input module, a domain-based attribute analysis module, an entity recognition module and an output module, characterized in that the method comprises the steps of: A, inputting a record set; B, analyzing the similarity computation rules of all attributes in a given domain and the correlations between the attributes; C, determining whether any two given records are the same entity; and D, outputting the entity set.

In this aspect of the invention, step B further comprises the steps of: B1, inputting a given domain; B2, performing domain-based attribute collection; B3, performing attribute classification; B4, defining the attribute similarity computation rules; B5, outputting the similarity computation rules; B6, performing attribute correlation analysis; and B7, outputting the correlation model between attributes.

In this aspect of the invention, step B2 further comprises the steps of: B2-1, collecting Web data sources, obtaining ample Web data sources in the domain from a specific website; B2-2, collecting attributes, extracting, for each collected Web data source, all the attributes it contains; and B2-3, merging attributes, merging the attribute sets obtained from the individual Web data sources, with attributes that express the same semantics in different Web data sources counted as one.

In this aspect of the invention, in step B3 the attributes are classified into useful attributes and useless attributes, and the useful attributes are further divided into primary key attributes, filter attributes, important attributes and secondary attributes.

In this aspect of the invention, in step B4, attribute similarity refers to the similarity of two records on a common attribute. To judge whether two records are the same entity, their similarities on each common attribute must be considered together.

In this aspect of the invention, attribute similarity is expressed by a ternary value: YES, MAYBE or NO.

In this aspect of the invention, YES means that the two records have the same value on the attribute; NO means that the values of the two records on the attribute definitely differ in semantics; MAYBE means that, because the values of the two records on the attribute are expressed in different forms, it cannot be determined whether their semantics are the same.

In this aspect of the invention, in step B6, attribute correlation analysis means that, for all the attributes of a given domain, the correlations between attributes are obtained by a training method.

In this aspect of the invention, step C further comprises the steps of: C1, judging whether the two records share a primary key attribute; if not, going to step C2; if so: when the values are the same, judging that they are the same entity; when the values differ, judging that they are not the same entity; C2, judging whether the two records share a filter attribute; if not, going to step C3; if so: when the values differ, judging that they are not the same entity; when the values are the same, going to step C3; C3, examining the important attributes shared by the two records and computing the similarity on each important attribute, the similarity being one of YES, MAYBE or NO; C4, according to the attribute correlations, using the attribute correlation model to raise the similarity on attributes whose value is MAYBE, so that the two records may also be judged YES on these attributes; and C5, if the two records are judged YES on all important attributes, considering that the two records represent the same entity.

Description of drawings

The above and other objects, features and advantages of the present invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings:

Fig. 1 shows the overall architecture diagram of the entity recognition system according to the present invention;

Fig. 2 shows the flowchart of the domain-based attribute analysis method according to the present invention;

Fig. 3 shows the flowchart of the domain-based attribute collection method according to the present invention;

Fig. 4 shows a schematic diagram of the attribute classification according to the present invention;

Fig. 5 shows a schematic diagram of the attribute similarity rules according to the present invention;

Fig. 6 shows the flowchart of the relevant attribute selection method according to the present invention; and

Fig. 7 shows the flowchart of the entity recognition method according to the present invention.

Detailed description of the embodiments

First, the entity recognition system according to the present invention is described with reference to Fig. 1, which shows the overall architecture diagram of the system.

As shown in Fig. 1, the system mainly comprises four modules, namely a domain-based attribute analysis module, an entity recognition module, an input module and an output module.

The input module is used to input a record set.

The domain-based attribute analysis module is used to analyze the similarity computation rules of all attributes in a given domain and the correlations between the attributes.

The entity recognition module is used to determine whether any two given records are the same entity.

The output module is used to output the entity set.

The domain-based attribute analysis module and the entity recognition module are each described in detail below.

The functions of the domain-based attribute analysis module mainly comprise: determining how attribute similarity is computed, and determining the correlations between attributes.

Fig. 2 shows the flowchart of the domain-based attribute analysis method according to the present invention. At step S201, a given domain is input, such as books, music or films. At step S202, domain-based attribute collection is performed, which is described in more detail below with reference to Fig. 3. At step S203, attribute classification is performed, which is described in more detail below with reference to Fig. 4. At step S204, the attribute similarity computation rules are defined, and at step S205, the similarity computation rules are output; these are described in more detail below with reference to Fig. 5. After this, at step S206, attribute correlation analysis is performed, and at step S207, the correlation model between attributes is output; these are described in more detail below with reference to Fig. 6.

Step S202 is now described in detail with reference to Fig. 3. Fig. 3 shows the flowchart of the domain-based attribute collection method according to the present invention.

At step S301, Web data sources are collected: ample Web data sources in the domain are obtained from the Completeplanet website (www.Completeplanet.com);

At step S302, attributes are collected: for each collected Web data source, all the attributes it contains are extracted;

At step S303, attributes are merged: the attribute sets obtained from the individual Web data sources are merged, with attributes that express the same semantics in different Web data sources counted as one.

After this, all the collected attributes of the domain are classified. They are first divided into useful attributes and useless attributes, and the useful attributes are further divided into primary key attributes, filter attributes, important attributes and secondary attributes, as shown in Fig. 4. Fig. 4 shows a schematic diagram of the attribute classification according to the present invention.

Useless attribute: an attribute that plays no role in entity recognition.

Useful attribute: an attribute that plays a role in entity recognition.

Primary key attribute: an attribute that by itself suffices to judge whether two records are the same entity.

Filter attribute: an attribute by which two records can be judged not to be the same entity, but by which they cannot be confirmed to be the same entity.

Important attribute: an attribute that can raise or lower the likelihood that two records are the same entity, but cannot settle the question.

Secondary attribute: an attribute that can raise the likelihood that two records are the same entity, but cannot settle the question.
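The classification above can be pictured as a simple lookup from attribute name to class. The following is a minimal sketch only: the enum `AttrClass`, the helper functions and, in particular, the example assignment for a "books" domain are illustrative assumptions, since the description does not give a concrete classification table.

```python
from enum import Enum


class AttrClass(Enum):
    USELESS = "useless"
    PRIMARY_KEY = "primary key"
    FILTER = "filter"
    IMPORTANT = "important"
    SECONDARY = "secondary"


# Hypothetical classification for a "books" domain (not taken from the patent).
BOOK_ATTRIBUTES = {
    "isbn": AttrClass.PRIMARY_KEY,      # alone decides same / not same
    "publication_year": AttrClass.FILTER,  # a difference rules a match out
    "title": AttrClass.IMPORTANT,
    "author": AttrClass.IMPORTANT,
    "publisher": AttrClass.SECONDARY,
    "shipping_info": AttrClass.USELESS,
}


def useful_attributes(classification):
    """Names of all attributes that play a role in entity recognition."""
    return {name for name, cls in classification.items()
            if cls is not AttrClass.USELESS}


def attributes_of(classification, cls):
    """Sorted names of the attributes assigned to one class."""
    return sorted(name for name, c in classification.items() if c is cls)


if __name__ == "__main__":
    print(sorted(useful_attributes(BOOK_ATTRIBUTES)))
    print(attributes_of(BOOK_ATTRIBUTES, AttrClass.IMPORTANT))
```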

Step S204 is now described in detail with reference to Fig. 5. Fig. 5 shows a schematic diagram of the attribute similarity rules according to the present invention.

Attribute similarity refers to the similarity of two records on a common attribute. To judge whether two records are the same entity, their similarities on each common attribute must be considered together. Attribute similarity is expressed by a ternary value: YES, MAYBE or NO. YES means that the two records have the same value on the attribute; NO means that the values of the two records on the attribute definitely differ in semantics; MAYBE means that, because the values of the two records on the attribute are expressed in different forms, it cannot be determined whether their semantics are the same. The similarity of two records on an attribute is determined by a series of rules, as shown in Fig. 5.

◆ Character-level rules compare the similarity of two attribute values from the perspective of characters.

Character-level abbreviations take two forms: prefix, and prefix-suffix combination.

● Prefix rule: one attribute value is a prefix of the other. For example, Univ is an abbreviation of University.

● Prefix-suffix combination rule: one attribute value is the combination of a prefix and a suffix of the other. For example, Dept is an abbreviation of Department.

◆ Plural rule: one attribute value is the plural form of the other. For example, computers is the plural form of computer.

◆ Word-level rules compare the similarity of two attribute values from the perspective of words. Word-level abbreviations take two forms: concatenation of word prefixes, and combination of word initials.

● Prefix concatenation rule: one attribute value is the concatenation of prefixes of the words of the other. For example, Caltech is an abbreviation of California Institute of Technology.

● Initial combination rule: one attribute value is the combination of the initials of the words of the other. For example, UCSD is an abbreviation of University of California, San Diego.

◆ Added-word rule: one attribute value consists of some of the words of the other, kept in their original order. For example, "Computer Science University California, San Diego" and "Department of Computer Science in University of California, San Diego".

◆ Reordering rule: the two attribute values contain the same words, but in a different order. For example, "Michael Jordan" and "Jordan Michael".
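The rules listed above can be sketched as small string predicates. The code below is one possible reading, not the patented implementation: all function names are hypothetical, the stop-word list used when forming initials is an assumption, and so is the choice that a rule match yields MAYBE while only identical word sequences yield YES (the description does not say how rule matches map onto the ternary value).

```python
import re
from enum import Enum


class Similarity(Enum):
    YES = "YES"
    MAYBE = "MAYBE"
    NO = "NO"


STOPWORDS = {"of", "the", "and", "in"}  # assumption: skipped when forming initials


def words(text):
    """Lower-cased alphanumeric tokens of a value."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_prefix(short, long):
    s, l = short.lower(), long.lower()
    return 0 < len(s) < len(l) and l.startswith(s)


def is_prefix_suffix(short, long):
    s, l = short.lower(), long.lower()
    if len(s) >= len(l):
        return False
    # Try every split of the short value into a non-empty prefix and suffix.
    return any(l.startswith(s[:i]) and l.endswith(s[i:]) for i in range(1, len(s)))


def is_plural(singular, plural):
    s, p = singular.lower(), plural.lower()
    return p in (s + "s", s + "es")


def is_prefix_concatenation(short, phrase):
    """short is built from prefixes of some words of phrase, in order."""
    s, ws = short.lower(), words(phrase)

    def match(pos, start):
        if pos == len(s):
            return True
        for j in range(start, len(ws)):
            k = 0
            while pos + k < len(s) and k < len(ws[j]) and s[pos + k] == ws[j][k]:
                k += 1
                if match(pos + k, j + 1):
                    return True
        return False

    return len(ws) > 1 and match(0, 0)


def is_initialism(short, phrase):
    initials = "".join(w[0] for w in words(phrase) if w not in STOPWORDS)
    return len(initials) > 1 and short.lower() == initials


def is_word_subsequence(shorter, longer):
    """Added-word rule: words of shorter appear in longer in the same order."""
    ws, wl = words(shorter), words(longer)
    remaining = iter(wl)
    return 0 < len(ws) < len(wl) and all(w in remaining for w in ws)


def is_rearrangement(a, b):
    wa, wb = words(a), words(b)
    return wa != wb and sorted(wa) == sorted(wb)


RULES = (is_prefix, is_prefix_suffix, is_plural,
         is_prefix_concatenation, is_initialism, is_word_subsequence)


def compare_values(a, b):
    """Ternary similarity of two attribute values (mapping is an assumption)."""
    if words(a) and words(a) == words(b):
        return Similarity.YES
    if is_rearrangement(a, b) or any(r(a, b) or r(b, a) for r in RULES):
        return Similarity.MAYBE
    return Similarity.NO
```

For instance, `compare_values("Univ", "University")` gives MAYBE under these assumptions, because the prefix rule fires but the values are not identical.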

Computing the similarity on each attribute usually requires one or more of these rules.

Attribute correlation analysis means that, for all the attributes of a given domain, the correlations between attributes are obtained by a training method. The correlation of attributes means that the similarity on one attribute can be inferred from the similarity on another attribute. For example, for two book records, if they are identical on the title attribute, the likelihood that they are also identical on the author attribute is very high.

Relevant attributes are selected from a given attribute set by the method shown in Fig. 6. At step S601, attribute filtering is performed. At step S602, relevant attribute selection is performed. At step S603, the attribute correlation model is output.

For technical details, see the document "Searching for Interacting Features" (http://www.ijcai.org/papers07/contents.php).
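As a rough illustration of what a trained correlation model might contain, the sketch below estimates, from labeled record pairs, how often two records agree on one attribute given that they agree on another. This is a plain conditional-frequency estimate chosen for brevity; it is not the method of the cited document, and the function name, the input format and the `min_support` parameter are all assumptions.

```python
from itertools import permutations


def attribute_correlations(pair_agreements, min_support=1):
    """Estimate P(target agrees | source agrees) for each ordered attribute pair.

    pair_agreements: list of dicts, one per training record pair, mapping an
    attribute name to True/False (whether the two records agree on it).
    """
    attrs = sorted({a for pair in pair_agreements for a in pair})
    model = {}
    for src, tgt in permutations(attrs, 2):
        # Training pairs that carry both attributes and agree on the source.
        support = [p for p in pair_agreements
                   if src in p and tgt in p and p[src]]
        if len(support) >= min_support:
            model[(src, tgt)] = sum(p[tgt] for p in support) / len(support)
    return model


if __name__ == "__main__":
    training = [
        {"title": True, "author": True},
        {"title": True, "author": True},
        {"title": True, "author": False},
        {"title": False, "author": False},
    ]
    for key, value in attribute_correlations(training).items():
        print(key, round(value, 3))
```

A high value for the pair (title, author) would correspond to the book example above: agreement on the title makes agreement on the author likely.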

The entity recognition module is described in detail below. The function of this module is as follows: given a record set, the attribute similarity judgment rules are used to obtain the similarity of any two records on each attribute, and the attribute correlation model is then used to judge whether the two records are the same entity; this process is repeated until all records have been processed. The flow is shown in Fig. 7, which gives the flowchart of the entity recognition method according to the present invention.

At step S701, it is judged whether the two records share a primary key attribute. If not, the flow goes to step S702. If so: when the values are the same, the records are judged to be the same entity; when the values differ, they are judged not to be the same entity.

At step S702, it is judged whether the two records share a filter attribute. If not, the flow goes to step S703. If so: when the values differ, the records are judged not to be the same entity; when the values are the same, the flow goes to step S703.

At step S703, the important attributes shared by the two records are examined, and the similarity on each important attribute is computed; the similarity is one of YES, MAYBE or NO.

At step S704, according to the attribute correlations, the attribute correlation model is used to raise the similarity on attributes whose value is MAYBE, so that the two records may also be judged YES on these attributes.

At step S705, if the two records are judged YES on all important attributes, the two records are considered to represent the same entity.
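Steps S701 to S705 can be sketched as a single decision function. The code below is a sketch under stated assumptions rather than the patented implementation: records are plain dicts, `simple_compare` is a stand-in for the rule-based comparison, the correlation model is a dict of scores keyed by (source attribute, target attribute), the `threshold` for upgrading MAYBE to YES is invented for illustration, and returning False when no important attribute is shared is an assumption the description does not address.

```python
from enum import Enum


class Similarity(Enum):
    YES = "YES"
    MAYBE = "MAYBE"
    NO = "NO"


def simple_compare(a, b):
    """Stand-in comparison: equal -> YES, one a prefix of the other -> MAYBE."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return Similarity.YES
    if a and b and (a.startswith(b) or b.startswith(a)):
        return Similarity.MAYBE
    return Similarity.NO


def shared(r1, r2, names):
    """Attributes from names that both records carry."""
    return [n for n in names if n in r1 and n in r2]


def same_entity(r1, r2, primary_keys, filters, important,
                correlation, threshold=0.9):
    # S701: a shared primary key attribute decides the question on its own.
    keys = shared(r1, r2, primary_keys)
    if keys:
        return all(r1[k] == r2[k] for k in keys)

    # S702: a shared filter attribute with different values rules a match out.
    if any(r1[f] != r2[f] for f in shared(r1, r2, filters)):
        return False

    # S703: ternary similarity on every shared important attribute.
    sims = {a: simple_compare(r1[a], r2[a]) for a in shared(r1, r2, important)}
    if not sims:
        return False  # assumption: no evidence means no match

    # S704: upgrade MAYBE to YES when a correlated attribute is already YES.
    yes_attrs = [a for a, s in sims.items() if s is Similarity.YES]
    for attr, sim in list(sims.items()):
        if sim is Similarity.MAYBE and any(
                correlation.get((src, attr), 0.0) >= threshold
                for src in yes_attrs):
            sims[attr] = Similarity.YES

    # S705: same entity only if every important attribute is YES.
    return all(s is Similarity.YES for s in sims.values())
```

With a correlation score of 0.95 from title to author, two book records with identical titles and author values "Bruce Eckel" and "Bruce" would be judged the same entity; without that correlation entry the MAYBE on the author attribute remains and the result is False.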

From the above description it can be seen that a domain-based entity recognition method and system are proposed. The input of the system is a large number of records from different Web data sources belonging to one domain, each record representing an entity in the real world. The output of the system is a number of record sets, the records in each set representing the same entity. The greatest difference between our method and previous methods is that it can process any two records from the same domain and is not limited to specific data sources.

Other advantages and modifications will be readily apparent to those of ordinary skill in the art. Therefore, the invention in its broader aspects is not limited to the specific details and exemplary embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit and scope of the general inventive concept as defined by the appended claims and their equivalents.
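The description states that the output is a number of record sets but does not say how pairwise decisions are combined into sets. One common choice, assumed here purely for illustration, is to treat "same entity" as transitive and take connected components with a union-find structure. The function `group_records` and its `same_entity` callback are hypothetical names.

```python
def group_records(records, same_entity):
    """Group records into sets using a pairwise same-entity predicate.

    Assumption: matches are merged transitively (connected components).
    """
    parent = list(range(len(records)))

    def find(i):
        # Find the representative of i, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if same_entity(records[i], records[j]):
                parent[find(j)] = find(i)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    # Toy predicate: two records match when their first letters agree.
    print(group_records(["a1", "a2", "b1"], lambda x, y: x[0] == y[0]))
```

This sketch still compares every pair of records; any blocking or indexing to reduce the number of comparisons is outside what the description specifies.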

Claims

1. A Web entity recognition method for use in an entity recognition system, the entity recognition system comprising an input module, a domain-based attribute analysis module, an entity recognition module and an output module, characterized in that the method comprises the steps of:

A, inputting a record set;

B, analyzing the similarity computation rules of all attributes in a given domain and the correlations between the attributes;

C, determining whether any two given records are the same entity; and

D, outputting the entity set.

2. The method according to claim 1, wherein step B further comprises the steps of:

B1, inputting a given domain;

B2, performing domain-based attribute collection;

B3, performing attribute classification;

B4, defining the attribute similarity computation rules;

B5, outputting the similarity computation rules;

B6, performing attribute correlation analysis; and

B7, outputting the correlation model between attributes.

3. The method according to claim 2, wherein step B2 further comprises the steps of:

B2-1, collecting Web data sources, obtaining ample Web data sources in the domain from a specific website;

B2-2, collecting attributes, extracting, for each collected Web data source, all the attributes it contains; and

B2-3, merging attributes, merging the attribute sets obtained from the individual Web data sources, with attributes that express the same semantics in different Web data sources counted as one.

4. The method according to claim 2, wherein in step B3 the attributes are classified into useful attributes and useless attributes, and the useful attributes are further divided into primary key attributes, filter attributes, important attributes and secondary attributes.

5. The method according to claim 2, wherein in step B4, attribute similarity refers to the similarity of two records on a common attribute, and judging whether two records are the same entity requires considering together their similarities on each common attribute.

6. The method according to claim 5, wherein attribute similarity is expressed by a ternary value: YES, MAYBE or NO.

7. The method according to claim 6, wherein

YES means that the two records have the same value on the attribute;

NO means that the values of the two records on the attribute definitely differ in semantics;

MAYBE means that, because the values of the two records on the attribute are expressed in different forms, it cannot be determined whether their semantics are the same.

8. The method according to claim 2, wherein in step B6, attribute correlation analysis means that, for all the attributes of a given domain, the correlations between attributes are obtained by a training method.

9. The method according to claim 2, wherein step C further comprises the steps of:

C1, judging whether the two records share a primary key attribute; if not, going to step C2; if so: when the values are the same, judging that they are the same entity; when the values differ, judging that they are not the same entity;

C2, judging whether the two records share a filter attribute; if not, going to step C3; if so: when the values differ, judging that they are not the same entity; when the values are the same, going to step C3;

C3, examining the important attributes shared by the two records and computing the similarity on each important attribute, the similarity being one of YES, MAYBE or NO;

C4, according to the attribute correlations, using the attribute correlation model to raise the similarity on attributes whose value is MAYBE, so that the two records may also be judged YES on these attributes; and

C5, if the two records are judged YES on all important attributes, considering that the two records represent the same entity.