CN101226547A - Web entity recognition method for entity recognition system - Google Patents

Web entity recognition method for entity recognition system

Info

Publication number
CN101226547A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA200810056102XA
Other languages
Chinese (zh)
Inventor
孟小峰
刘伟
凌妍妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CNA200810056102XA priority Critical patent/CN101226547A/en
Publication of CN101226547A publication Critical patent/CN101226547A/en
Pending legal-status Critical Current

Abstract

The invention relates to a Web entity identification method in an entity identification system, wherein the entity identification system comprises an input module, an attribute analysis module based on domains, an entity identification module and an output module. The method of the invention is characterized by comprising the following steps: a, inputting a recording set; b, analyzing similar computation rules in a given domain and correlation between attributes; c, determining whether any given two records belong to one same entity; d, outputting the entity set.

Description

A Web entity recognition method for use in an entity recognition system
Technical field
The present invention relates to the field of computer databases, and in particular to a Web entity recognition method for use in an entity recognition system.
Background art
On the application side, the rapid development of the Web has placed a massive amount of information online. By a conservative estimate, the Web as a whole already holds more than 200,000 TB of information and is still growing quickly, and this information covers every area of the real world (commerce, entertainment, sports, and so on). The Web has therefore gradually become one of the most important channels through which people obtain useful information. However, the sheer volume of information also often prevents people from finding what they want quickly and accurately. Obtaining useful information efficiently from today's huge Web has become a new challenge. To address it, many researchers are working on automated methods that help people acquire information from the Web effectively. The Web, however, contains a great deal of duplicate information, where duplicate information means descriptions of the same real-world entity given by different Web data sources. Identifying this duplicate information is very important for Web data integration; the typical application scenarios are deduplication, merging, and truth identification.
Deduplication means keeping only one copy of the duplicate information with which multiple Web data sources describe the same entity. For example, a user queries two book-selling Web data sources (Dangdang and Joyo) for books about "java" and wants to buy the cheapest one. The records that denote the same book must then be identified in the record sets returned by the two Web data sources, and the cheapest one selected.
Merging means combining the information with which multiple Web data sources describe the same entity while keeping the parts unique to each source. For example, a user queries several Web data sources that provide personal information about a particular person. Each Web data source covers a different aspect of that person: some provide work information (name, sex, age, employer, position, e-mail, workplace address, workplace postal code, etc.), while others provide personal-life information (name, sex, age, native place, home telephone number, home address, blood type, spouse, etc.). The records that denote the same person must be identified and merged into a single record to obtain all the information about that person.
Truth identification covers the case where Web data sources give differing descriptions of the same entity and the true one must be selected. For example, many Web data sources report Yi Jianlian's age, and several versions exist (18, 19, 24, etc.). We need to pick out which age is the real one.
On the technical side, Web data is highly heterogeneous (heterogeneity means that the same data takes different forms of expression, such as different date formats, or full versus abbreviated names) and large in scale. As a result, different Web data sources describe the same entity in different forms, which makes entity recognition very difficult in terms of both accuracy and efficiency.
Many entity recognition methods have been proposed. Although they achieve fairly high accuracy, they mainly target a small number of heterogeneous data sources, typically two, and have serious efficiency problems on the large number of highly heterogeneous data sources on the Web. For example, with 100 Web data sources, an existing entity recognition method must be run once for every pair of data sources, that is, $C_{100}^{2} = 4950$ times in total.
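The figure of 4950 is the number of unordered pairs among 100 sources, n(n-1)/2. A minimal sketch for checking the arithmetic follows; the function name `pairwise_runs` is illustrative and does not come from the patent.

```python
from math import comb


def pairwise_runs(num_sources: int) -> int:
    """Number of data-source pairs a pairwise method has to process: C(n, 2)."""
    return comb(num_sources, 2)


print(pairwise_runs(100))  # 4950
```

The count grows quadratically with the number of sources, which is the efficiency problem the background section describes.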
To improve the efficiency of entity recognition over large-scale Web data sources, the method we propose processes all the Web data sources in a domain (a real-world field such as economics, sports, or music) in a single pass.
Summary of the invention
To solve the above problems of conventional methods, one object of the present invention is to propose a Web entity recognition method for use in an entity recognition system.
In one aspect of the invention, a Web entity recognition method is provided for use in an entity recognition system that comprises an input module, a domain-based attribute analysis module, an entity recognition module, and an output module. The method comprises the steps of: A. inputting a record set; B. analyzing the similarity computation rules of all attributes in a given domain and the correlations between attributes; C. determining whether any two given records are the same entity; and D. outputting the entity set.
In this aspect of the invention, step B further comprises the steps of: B1. inputting a given domain; B2. collecting attributes for the domain; B3. classifying the attributes; B4. defining the attribute similarity computation rules; B5. outputting the similarity computation rules; B6. analyzing attribute correlations; and B7. outputting the correlation model between attributes.
In this aspect of the invention, step B2 further comprises the steps of: B2-1. collecting Web data sources, obtaining a sufficient number of Web data sources in the domain from a specific website; B2-2. collecting attributes, extracting all the attributes contained in each collected Web data source; and B2-3. merging attributes, merging the attribute sets obtained from the individual Web data sources so that attributes expressing the same semantics in different Web data sources are counted as one.
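Step B2-3 can be pictured as mapping each source-specific attribute label to a canonical name and taking the union. The sketch below is only an illustration under stated assumptions: the patent does not say how semantically equal attributes are detected, so the hand-written `SYNONYMS` table, the source names, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical synonym table: source-specific label -> canonical attribute name.
# The patent does not specify how equivalent attributes are found.
SYNONYMS = {"writer": "author", "authors": "author", "book title": "title"}


def canonical(label: str) -> str:
    """Normalize a label and map it to its canonical attribute name."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)


def merge_attributes(source_schemas: dict[str, list[str]]) -> dict[str, set[str]]:
    """Return canonical attribute -> set of sources that expose it."""
    merged = defaultdict(set)
    for source, labels in source_schemas.items():
        for label in labels:
            merged[canonical(label)].add(source)
    return dict(merged)


schemas = {
    "source_a": ["Title", "Author", "Price"],
    "source_b": ["Book Title", "Writer", "ISBN"],
}
merged = merge_attributes(schemas)
print(sorted(merged))  # ['author', 'isbn', 'price', 'title']
```

Here "Book Title" and "Title", and "Writer" and "Author", each collapse to one attribute, so the two sources yield four attributes in total rather than six.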
In this aspect of the invention, in step B3 the attributes are classified into useful attributes and useless attributes, and useful attributes are further divided into primary-key attributes, filter attributes, important attributes, and secondary attributes.
In this aspect of the invention, in step B4, attribute similarity means the similarity of two records on a shared attribute. Judging whether two records are the same entity requires combining their similarities on every shared attribute.
In this aspect of the invention, attribute similarity is expressed as one of the three values YES, MAYBE, or NO.
In this aspect of the invention, YES means that the two records have the same value on the attribute; NO means that the values of the two records on the attribute definitely differ in meaning; MAYBE means that, because the values of the two records on the attribute are expressed in different forms, it cannot be determined whether they have the same meaning.
In this aspect of the invention, in step B6, attribute correlation analysis means obtaining the correlations between attributes by training over all the attributes in a given domain.
In this aspect of the invention, step C further comprises the steps of: C1. judging whether the two records share a primary-key attribute; if not, going to step C2; if so, judging them to be the same entity when the values are identical and not the same entity when the values differ; C2. judging whether the two records share a filter attribute; if not, going to step C3; if so, judging them not to be the same entity when the values differ and going to step C3 when the values are identical; C3. examining the important attributes shared by the two records and computing the similarity on each of them, the similarity being YES, MAYBE, or NO; C4. according to the attribute correlations, using the attribute correlation model to raise the similarity on attributes whose value is MAYBE, so that the two records can also be judged YES on those attributes; and C5. if the two records are judged YES on all important attributes, considering the two records to denote the same entity.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows the overall architecture of the entity recognition system according to the invention;
Fig. 2 shows the flowchart of the domain-based attribute analysis method according to the invention;
Fig. 3 shows the flowchart of the domain-based attribute collection method according to the invention;
Fig. 4 shows a schematic diagram of the attribute classification according to the invention;
Fig. 5 shows a schematic diagram of the attribute similarity rules according to the invention;
Fig. 6 shows the flowchart of the correlated-attribute selection method according to the invention; and
Fig. 7 shows the flowchart of the entity recognition method according to the invention.
Embodiments
First, the overall architecture of the entity recognition system according to the invention is described with reference to Fig. 1.
As shown in Fig. 1, the system mainly comprises four modules: an input module, a domain-based attribute analysis module, an entity recognition module, and an output module.
The input module is used to input a record set.
The domain-based attribute analysis module is used to analyze the similarity computation rules of all attributes in a given domain and the correlations between attributes.
The entity recognition module is used to determine whether any two given records are the same entity.
The output module is used to output the entity set.
The domain-based attribute analysis module and the entity recognition module are each described in detail below.
The main functions of the domain-based attribute analysis module are to determine how attribute similarity is computed and to determine the correlations between attributes.
Fig. 2 shows the flowchart of the domain-based attribute analysis method according to the invention. At step S201, a given domain is input, such as books, music, or films. At step S202, attributes are collected for the domain, as described in more detail below with reference to Fig. 3. At step S203, the attributes are classified, as described in more detail below with reference to Fig. 4. At step S204, the attribute similarity computation rules are defined, and at step S205 the similarity computation rules are output, as described in more detail below with reference to Fig. 5. Then, at step S206, attribute correlations are analyzed, and at step S207 the correlation model between attributes is output, as described in more detail below with reference to Fig. 6.
Step S202 is now described in detail with reference to Fig. 3, which shows the flowchart of the domain-based attribute collection method according to the invention.
At step S301, Web data sources are collected: a sufficient number of Web data sources in the domain are obtained from the Completeplanet website (www.Completeplanet.com);
At step S302, attributes are collected: for each collected Web data source, all the attributes it contains are extracted;
At step S303, attributes are merged: the attribute sets obtained from the individual Web data sources are merged, and attributes expressing the same semantics in different Web data sources are counted as one. After this, all the attributes collected for the domain are classified. They are first divided into useful attributes and useless attributes, and useful attributes are further divided into primary-key attributes, filter attributes, important attributes, and secondary attributes, as shown in Fig. 4, which is a schematic diagram of the attribute classification according to the invention.
Useless attribute: an attribute that plays no role in entity recognition.
Useful attribute: an attribute that plays a role in entity recognition.
Primary-key attribute: an attribute that by itself suffices to judge whether two records are the same entity.
Filter attribute: an attribute that can establish that two records are not the same entity, but cannot establish that they are the same entity.
Important attribute: an attribute that can raise or lower the likelihood that two records are the same entity, but cannot decide it.
Secondary attribute: an attribute that can raise the likelihood that two records are the same entity, but cannot decide it.
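The classification above can be held in a small lookup table. The sketch below is an illustration only: the patent names the classes but gives no concrete assignment, so the book-domain attributes and the class chosen for each of them are assumptions.

```python
from enum import Enum


class AttrClass(Enum):
    """Attribute classes named in the text (Fig. 4)."""
    USELESS = "useless"
    PRIMARY_KEY = "primary_key"
    FILTER = "filter"
    IMPORTANT = "important"
    SECONDARY = "secondary"


# Hypothetical assignment for a book domain; not taken from the patent.
BOOK_ATTRS = {
    "isbn": AttrClass.PRIMARY_KEY,       # equal ISBN alone decides identity
    "publication_year": AttrClass.FILTER,  # a different year rules a match out
    "title": AttrClass.IMPORTANT,
    "author": AttrClass.IMPORTANT,
    "publisher": AttrClass.SECONDARY,
    "page_views": AttrClass.USELESS,
}


def attrs_of(cls: AttrClass, table: dict = BOOK_ATTRS) -> list:
    """Return the attribute names assigned to one class, sorted by name."""
    return sorted(name for name, c in table.items() if c is cls)


print(attrs_of(AttrClass.IMPORTANT))  # ['author', 'title']
```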
Step S204 is now described in detail with reference to Fig. 5, which is a schematic diagram of the attribute similarity rules according to the invention.
Attribute similarity means the similarity of two records on a shared attribute. Judging whether two records are the same entity requires combining their similarities on every shared attribute. Attribute similarity is expressed as one of the three values YES, MAYBE, or NO. YES means that the two records have the same value on the attribute; NO means that the values of the two records on the attribute definitely differ in meaning; MAYBE means that, because the values of the two records on the attribute are expressed in different forms, it cannot be determined whether they have the same meaning. The similarity of two records on an attribute is determined by a series of rules, as shown in Fig. 5.
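The three-valued similarity can be sketched as follows. This is a simplified reading, not the patent's procedure: it assumes that exact equality gives YES, that any matching form rule gives MAYBE, and that everything else gives NO. The names `Similarity`, `is_prefix`, and `attribute_similarity` are illustrative.

```python
from enum import Enum
from typing import Callable, Iterable


class Similarity(Enum):
    YES = "YES"      # identical values
    MAYBE = "MAYBE"  # different forms, possibly the same meaning
    NO = "NO"        # treated as different in meaning


Rule = Callable[[str, str], bool]


def is_prefix(a: str, b: str) -> bool:
    """Example form rule: the shorter value is a prefix of the longer one."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return long_.startswith(short)


def attribute_similarity(a: str, b: str, rules: Iterable[Rule]) -> Similarity:
    """Assumed mapping: equal -> YES, some rule fires -> MAYBE, else NO."""
    if a == b:
        return Similarity.YES
    if any(rule(a, b) for rule in rules):
        return Similarity.MAYBE
    return Similarity.NO


print(attribute_similarity("Univ", "University", [is_prefix]))  # Similarity.MAYBE
```

Treating "no rule fired" as NO is the strongest assumption here; a real system might need additional evidence before declaring two values definitely different.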
◆ Character-level rules compare the similarity of two attribute values from the perspective of characters.
Character-level abbreviations take two forms: prefix, and prefix-suffix combination.
● The prefix rule applies when one attribute value is a prefix of the other. For example, Univ is an abbreviation of University.
● The prefix-suffix combination rule applies when one attribute value is the combination of a prefix and a suffix of the other. For example, Dept is an abbreviation of Department.
◆ The plural rule applies when one attribute value is the plural form of the other. For example, computers is the plural form of computer.
◆ Word-level rules compare the similarity of two attribute values from the perspective of words. Word-level abbreviations take two forms: concatenation of word prefixes, and combination of word initials.
● The prefix concatenation rule applies when one attribute value is the concatenation of prefixes of the words of the other. For example, Caltech is an abbreviation of California Institute of Technology.
● The initial combination rule applies when one attribute value is the combination of the initial letters of the words of the other. For example, UCSD is an abbreviation of University of California, San Diego.
◆ The added-word rule applies when one attribute value consists of some of the words of the other, in their original order. For example, "Computer Science University California, San Diego" and "Department of Computer Science in University of California, San Diego".
◆ The reordering rule applies when the two attribute values contain the same words but in a different order. For example, "Michael Jordan" and "Jordan Michael".
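The rules listed above can each be sketched as a small predicate. The code below is one possible reading under stated assumptions, not the patent's implementation: all function names are illustrative, the plural rule only handles "s" and "es" endings, and the initial combination rule skips a hypothetical stop-word list so that "of" does not contribute a letter.

```python
import re

STOP_WORDS = {"of", "in", "the", "and"}  # assumed; not specified in the text


def words(s: str) -> list:
    """Lower-cased alphanumeric words of a value."""
    return re.findall(r"[a-z0-9]+", s.lower())


def prefix_rule(abbr: str, full: str) -> bool:
    a, f = abbr.lower(), full.lower()
    return 0 < len(a) < len(f) and f.startswith(a)


def prefix_suffix_rule(abbr: str, full: str) -> bool:
    """abbr = non-empty prefix of full + non-empty suffix of full."""
    a, f = abbr.lower(), full.lower()
    if not 1 < len(a) < len(f):
        return False
    return any(f.startswith(a[:i]) and f.endswith(a[i:]) for i in range(1, len(a)))


def plural_rule(a: str, b: str) -> bool:
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return long_ in (short + "s", short + "es")


def prefix_concat_rule(abbr: str, full: str) -> bool:
    """abbr = concatenation of word prefixes of full, in word order."""
    a, ws = abbr.lower(), words(full)

    def match(pos: int, start: int) -> bool:
        if pos == len(a):
            return True
        for i in range(start, len(ws)):
            k = 1
            # Try ever longer prefixes of word i while they still match abbr.
            while k <= len(ws[i]) and a.startswith(ws[i][:k], pos):
                if match(pos + k, i + 1):
                    return True
                k += 1
        return False

    return bool(a) and match(0, 0)


def initials_rule(abbr: str, full: str) -> bool:
    initials = "".join(w[0] for w in words(full) if w not in STOP_WORDS)
    return abbr.lower() == initials


def added_words_rule(short: str, long_: str) -> bool:
    """The words of short appear in long_ in the same order (subsequence)."""
    remaining = iter(words(long_))
    sw = words(short)
    return bool(sw) and all(w in remaining for w in sw)


def reorder_rule(a: str, b: str) -> bool:
    wa, wb = words(a), words(b)
    return wa != wb and sorted(wa) == sorted(wb)


print(prefix_suffix_rule("Dept", "Department"))  # True
print(prefix_concat_rule("Caltech", "California Institute of Technology"))  # True
```

Each predicate only reports that two values could be different spellings of the same thing, which matches the MAYBE value in the text rather than a definite YES.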
Computing the similarity on an attribute usually requires one or more of these rules. Attribute correlation analysis means obtaining the correlations between attributes by training over all the attributes in a given domain. The correlation between attributes means that the similarity on one attribute can be inferred from the similarity on another. For example, if two book records are identical on the title attribute, they are also very likely to be identical on the author attribute.
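The title and author example can be made concrete with a simple conditional frequency estimated from labeled record pairs. This is a deliberately simple stand-in and not the training method the patent relies on; the training data, the field names, and the function `conditional_agreement` are all hypothetical.

```python
def conditional_agreement(pairs: list, given: str, target: str) -> float:
    """Estimate P(target agrees | given agrees) from training record pairs.

    Each pair is a dict mapping an attribute name to True when the two
    records of that pair have the same value on the attribute.
    """
    hits = [p.get(target, False) for p in pairs if p.get(given)]
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical training pairs for a book domain.
training_pairs = [
    {"title": True, "author": True},
    {"title": True, "author": True},
    {"title": True, "author": False},
    {"title": False, "author": False},
]

p = conditional_agreement(training_pairs, "title", "author")
print(round(p, 3))  # 0.667
```

A high estimate suggests that agreement on the title can be used as evidence when the author values are only a MAYBE match.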
Correlated attributes are selected from a given attribute set by the method shown in Fig. 6. At step S601, the attributes are filtered. At step S602, the correlated attributes are selected. At step S603, the attribute correlation model is output.
Technical details can be found in the paper "Searching for Interacting Features" (http://www.ijcai.org/papers07/contents.php).
The entity recognition module is described in detail below. Its function is as follows: given a record set, it uses the attribute similarity rules to obtain the similarity of any two records on each attribute, then uses the attribute correlation model to judge whether the two records are the same entity, and repeats this process until all records have been processed. The flow is shown in Fig. 7, which is the flowchart of the entity recognition method according to the invention.
At step S701, it is judged whether the two records share a primary-key attribute. If not, the flow goes to step S702. If so, the records are judged to be the same entity when the values are identical, and not the same entity when the values differ.
At step S702, it is judged whether the two records share a filter attribute. If not, the flow goes to step S703. If so, the records are judged not to be the same entity when the values differ, and the flow goes to step S703 when the values are identical.
At step S703, the important attributes shared by the two records are examined and the similarity on each of them is computed; the similarity is YES, MAYBE, or NO.
At step S704, according to the attribute correlations, the attribute correlation model is used to raise the similarity on attributes whose value is MAYBE, so that the two records can also be judged YES on those attributes.
At step S705, if the two records are judged YES on all important attributes, the two records are considered to denote the same entity.
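Steps S701 to S705 can be sketched as a single decision function. The sketch makes several assumptions the text leaves open: records are dicts of strings, the first shared primary-key attribute decides, a pair with no shared important attribute is not matched, and the similarity function and correlation model are passed in as callables. `simple_similarity` and `upgrade_maybe` are hypothetical stand-ins for the similarity rules and the trained correlation model.

```python
from enum import Enum


class Sim(Enum):
    YES = "YES"
    MAYBE = "MAYBE"
    NO = "NO"


def same_entity(r1, r2, primary_keys, filters, important, similarity, upgrade):
    # S701: a shared primary-key attribute decides on its own.
    for attr in primary_keys:
        if attr in r1 and attr in r2:
            return r1[attr] == r2[attr]
    # S702: a shared filter attribute can only rule a match out.
    for attr in filters:
        if attr in r1 and attr in r2 and r1[attr] != r2[attr]:
            return False
    # S703: three-valued similarity on every shared important attribute.
    sims = {a: similarity(r1[a], r2[a]) for a in important if a in r1 and a in r2}
    # S704: the correlation model may raise MAYBE to YES.
    sims = upgrade(sims)
    # S705: a match requires YES on all shared important attributes.
    return bool(sims) and all(s is Sim.YES for s in sims.values())


def simple_similarity(a: str, b: str) -> Sim:
    """Stand-in rule set: equal ignoring case -> YES, prefix -> MAYBE."""
    a, b = a.lower(), b.lower()
    if a == b:
        return Sim.YES
    if a.startswith(b) or b.startswith(a):
        return Sim.MAYBE
    return Sim.NO


def upgrade_maybe(sims: dict) -> dict:
    """Stand-in correlation model: a YES title lifts a MAYBE author."""
    out = dict(sims)
    if out.get("title") is Sim.YES and out.get("author") is Sim.MAYBE:
        out["author"] = Sim.YES
    return out


r1 = {"title": "Thinking in Java", "author": "Bruce Eckel", "year": "2006"}
r2 = {"title": "thinking in java", "author": "Bruce", "year": "2006"}
print(same_entity(r1, r2, ["isbn"], ["year"], ["title", "author"],
                  simple_similarity, upgrade_maybe))  # True
```

In this example neither record has an ISBN, the years agree, the titles match exactly after case folding, and the author MAYBE is raised to YES by the stand-in correlation model, so the pair is accepted.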
As the above description shows, a domain-based entity recognition method and system are proposed. The input of the system is a large number of records from different Web data sources that belong to one domain, each record denoting an entity in the real world. The output of the system is a number of record sets, where the records in each set denote the same entity. The biggest difference between our method and previous methods is that it can handle any two records from the same domain and is not limited to specific data sources.
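Turning pairwise decisions into the output record sets can be sketched with a union-find structure. The text does not say how pairwise results are combined, so grouping by transitive closure is an assumption, and `group_entities` and the sample predicate are illustrative names.

```python
from itertools import combinations


def group_entities(records: list, same_entity) -> list:
    """Group records into sets using a pairwise same-entity predicate.

    Assumption: matches are combined transitively (union-find).
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_entity(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


records = [
    {"isbn": "1", "source": "a"},
    {"isbn": "2", "source": "a"},
    {"isbn": "1", "source": "b"},
]
entity_sets = group_entities(records, lambda x, y: x["isbn"] == y["isbn"])
print(sorted(len(g) for g in entity_sets))  # [1, 2]
```

This sketch still compares every pair of records, so it illustrates the shape of the output rather than an efficient way to produce it.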
Other advantages and modifications will be readily apparent to those of ordinary skill in the art. Therefore, the invention in its broader aspects is not limited to the specific details and exemplary embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit and scope of the general inventive concept as defined by the appended claims and their equivalents.

Claims (9)

1. A Web entity recognition method for use in an entity recognition system, the entity recognition system comprising an input module, a domain-based attribute analysis module, an entity recognition module, and an output module, characterized in that the method comprises the steps of:
A. inputting a record set;
B. analyzing the similarity computation rules of all attributes in a given domain and the correlations between attributes;
C. determining whether any two given records are the same entity; and
D. outputting the entity set.
2. The method according to claim 1, wherein step B further comprises the steps of:
B1. inputting a given domain;
B2. collecting attributes for the domain;
B3. classifying the attributes;
B4. defining the attribute similarity computation rules;
B5. outputting the similarity computation rules;
B6. analyzing attribute correlations; and
B7. outputting the correlation model between attributes.
3. The method according to claim 2, wherein step B2 further comprises the steps of:
B2-1. collecting Web data sources, obtaining a sufficient number of Web data sources in the domain from a specific website;
B2-2. collecting attributes, extracting all the attributes contained in each collected Web data source; and
B2-3. merging attributes, merging the attribute sets obtained from the individual Web data sources so that attributes expressing the same semantics in different Web data sources are counted as one.
4. The method according to claim 2, wherein in step B3 the attributes are classified into useful attributes and useless attributes, and useful attributes are further divided into primary-key attributes, filter attributes, important attributes, and secondary attributes.
5. The method according to claim 2, wherein in step B4, attribute similarity means the similarity of two records on a shared attribute, and judging whether two records are the same entity requires combining their similarities on every shared attribute.
6. The method according to claim 5, wherein attribute similarity is expressed as one of the three values YES, MAYBE, or NO.
7. The method according to claim 6, wherein
YES means that the two records have the same value on the attribute;
NO means that the values of the two records on the attribute definitely differ in meaning;
MAYBE means that, because the values of the two records on the attribute are expressed in different forms, it cannot be determined whether they have the same meaning.
8. The method according to claim 2, wherein in step B6, attribute correlation analysis means obtaining the correlations between attributes by training over all the attributes in a given domain.
9. The method according to claim 2, wherein step C further comprises the steps of:
C1. judging whether the two records share a primary-key attribute; if not, going to step C2; if so, judging them to be the same entity when the values are identical and not the same entity when the values differ;
C2. judging whether the two records share a filter attribute; if not, going to step C3; if so, judging them not to be the same entity when the values differ and going to step C3 when the values are identical;
C3. examining the important attributes shared by the two records and computing the similarity on each of them, the similarity being YES, MAYBE, or NO;
C4. according to the attribute correlations, using the attribute correlation model to raise the similarity on attributes whose value is MAYBE, so that the two records can also be judged YES on those attributes; and
C5. if the two records are judged YES on all important attributes, considering the two records to denote the same entity.
CNA200810056102XA 2008-01-11 2008-01-11 Web entity recognition method for entity recognition system Pending CN101226547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA200810056102XA CN101226547A (en) 2008-01-11 2008-01-11 Web entity recognition method for entity recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA200810056102XA CN101226547A (en) 2008-01-11 2008-01-11 Web entity recognition method for entity recognition system

Publications (1)

Publication Number Publication Date
CN101226547A true CN101226547A (en) 2008-07-23

Family

ID=39858542

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA200810056102XA Pending CN101226547A (en) 2008-01-11 2008-01-11 Web entity recognition method for entity recognition system

Country Status (1)

Country Link
CN (1) CN101226547A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236635A (en) * 2010-04-22 2011-11-09 上海百果信息科技有限公司 Method for realizing multi-system information association by capturing and comparing key elements
CN103257983B (en) * 2012-09-10 2016-06-15 苏州大学 A kind of Deep web entity recognition methods based on uniqueness constraint
CN105989080A (en) * 2015-02-11 2016-10-05 富士通株式会社 Apparatus and method for determining entity attribute values
CN105138636A (en) * 2015-08-21 2015-12-09 浪潮软件集团有限公司 Graph construction method and device for entity relationship
CN105138636B (en) * 2015-08-21 2018-07-24 浪潮软件集团有限公司 Graph construction method and device for entity relationship
CN105550336A (en) * 2015-12-22 2016-05-04 北京搜狗科技发展有限公司 Mining method and device of single entity instance
CN105550336B (en) * 2015-12-22 2018-12-18 北京搜狗科技发展有限公司 The method for digging and device of single entities example
CN106940702A (en) * 2016-01-05 2017-07-11 富士通株式会社 Entity refers to the method and apparatus with entity in semantic knowledge-base in connection short text
CN107423359A (en) * 2017-06-16 2017-12-01 兴业数字金融服务(上海)股份有限公司 A kind of financial product pictorial information recognition methods based on domain analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20080723