CN108363688A - Named entity linking method fusing prior information - Google Patents

Named entity linking method fusing prior information

Info

Publication number
CN108363688A
Authority
CN
China
Prior art keywords
entity
article
candidate
idf
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810103629.7A
Other languages
Chinese (zh)
Other versions
CN108363688B (en)
Inventor
汤斯亮
杨希远
陈博
林升
吴飞
庄越挺
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810103629.7A (patent CN108363688B)
Publication of CN108363688A
Application granted
Publication of CN108363688B
Legal status: Active
Anticipated expiration: not listed

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition


Abstract

The invention discloses a named entity linking method that fuses prior information. The method comprises the following steps: (1) extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps; (2) represent every article in the Wikipedia data dump by its term frequency-inverse document frequency (tf-idf) features, and extract the commonness feature of each string relative to each candidate entity; (3) perform query expansion on each entity mention, and use the string-to-candidate-entity table from (1) to generate candidate entities for the mention; (4) extract the features of the article containing the mention, namely the inverse document frequencies of its words and the important word hit rate; (5) using the features extracted in (2) and (4), compute the degree of correlation between each mention and each of its candidate entities, and take the most correlated candidate as the entity linking result. The invention overcomes the limitation of scarce training corpora and provides users with reliable entity-linking recommendations; the entity commonness feature contributes the prior information.

Description

Named entity linking method fusing prior information
Technical field
The present invention relates to natural language processing, and more particularly to a named entity linking method that fuses prior information.
Background technology
Natural language processing (NLP) is an interdisciplinary field combining linguistics and computer science. Named entity linking (NEL) is a basic task in natural language processing. It aims to eliminate the ambiguity caused by linguistic phenomena such as aliases, anaphora, and polysemy by establishing the correspondence between the proper nouns (entity names) occurring in a text and the entities they refer to in a knowledge base. The problem is defined as follows: given a piece of text and the mentions within it (a mention being a string to be linked), find in a specified knowledge base the entities that these mentions refer to.
Entity linking, a technology that establishes links between text and a knowledge base, plays an important role in information extraction. Relation extraction is one example of an information extraction technique that requires entity linking: its purpose is to extract the relations between different entities from text, and locating the entity mentions in the text and finding their corresponding entities via entity linking is a prerequisite for that further analysis.
In addition, entity linking effectively enriches the original text with extra information, so it can also be used in other natural language processing and text mining problems, helping systems understand the text more fully and achieve better results.
Entity linking is generally implemented in several steps, of which the two most important are candidate generation and candidate ranking. The candidate generation step uses the name of the current mention to find out which entities it may refer to, producing the candidates; the candidate ranking step then uses the mention's context and features of the candidate entities themselves to select the best candidate as the final linking result.
The common practice for candidate generation is to build a dictionary in advance that records which entities each name may correspond to; when entity linking is executed, the candidate entities can be looked up in the dictionary using the name of the current mention. This dictionary is generally built from information provided by the knowledge base.
For the candidate ranking step, common practice includes collective and non-collective methods. A collective method considers multiple mentions in the same context simultaneously when linking, and tries to make the correlation among their target entities in the result as large as possible. A non-collective method considers each mention individually. The method we use is non-collective: non-collective methods are generally faster than collective ones, at the cost of slightly lower accuracy.
A traditional non-collective method can use a series of hand-designed features, including: surface features, which measure the similarity between the mention's name string and the candidate entity's name, such as the number of words the two share; context features, which measure how well the candidate entity matches the mention's context semantically, such as the tf-idf similarity between the mention's document and the candidate entity's description, or whether all words of the candidate entity's Wikipedia page title occur in the mention's document; and other features, such as the number of country names appearing both in the mention's document and in the candidate entity's description, or the number of country names appearing both among all the candidate entity's aliases and in the mention's document.
Because such heuristic features demand considerable expert knowledge, and the original feature engineering becomes invalid once the knowledge base or corpus changes, we aim to obtain good results with as few features as possible.
Summary of the invention
The purpose of the invention is to link the entities identified in natural text to a target knowledge base (Freebase), providing a basis for follow-up work such as information extraction. To that end, a named entity linking method that fuses prior information is proposed.
We therefore designed the IWHR (Important Word Hit Rate) feature, and combine it with two other features, commonness and tf-idf, to judge how well an entity matches a mention. The traditional surface features are instead guaranteed by the candidate generation process, and the commonness feature is how our method adds prior information.
Furthermore, because common named entity linking models need training corpora to determine their parameters, and such corpora are extremely difficult to obtain, our method combines the three features in a training-free manner, while giving suggested parameter settings.
In addition, to compensate for the fact that a non-collective method does not consider the other mentions in the context, we add query expansion before entity linking, with a special optimisation for person and place names in the same article that may refer to the same entity.
Tf-idf (term frequency-inverse document frequency) is commonly used to measure the similarity between articles; here we introduce it to measure the similarity between the mention's context and the candidate entity's context.
Commonness reflects the probability that a candidate entity is the one referred to; introducing this feature is equivalent to adding prior information, and it can guide the decision when the mention's context is insufficient. Its calculation formula is as follows:
Let A_{s,e} be the set of anchor texts whose surface string is s and which link to the page of entity e, and A_s the set of anchor texts whose surface string is s. Then:

commonness(e, s) = |A_{s,e}| / |A_s|    (1)
IWHR compensates for a weakness of tf-idf by focusing on the important words that occur in the contexts. Its calculation formula is as follows:
Let e be a candidate Wikipedia entity, m the string to be linked, W_d the word set of the article containing m, and W_e the word set of e's page. Then IWHR(e, m) is calculated according to formula (2), where T is a manually set idf threshold:

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|    (2)
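As a sketch of formula (2) in code (the function name, variable names, and toy idf values below are our own inventions for illustration, not from the patent):

```python
def iwhr(entity_words, mention_article_words, idf, threshold):
    """Important Word Hit Rate (formula (2)): among the 'important'
    words of the mention's article (idf above threshold), the fraction
    that also occur on the candidate entity's page."""
    important = {w for w in mention_article_words if idf.get(w, 0.0) > threshold}
    if not important:
        return 0.0
    return len(important & set(entity_words)) / len(important)

# Toy example; the idf values are invented for illustration.
idf = {"turkey": 3.0, "ankara": 4.0, "the": 0.1, "bird": 2.5}
score = iwhr({"turkey", "ankara", "istanbul"},
             {"turkey", "ankara", "the", "bird"}, idf, 2.0)
# important words: {turkey, ankara, bird}; hits: {turkey, ankara} -> 2/3
```

Note that a mention whose article contains no word with idf above T gets a hit rate of zero, which the log-linear scoring below then penalises heavily.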
The present invention is realised through the following technical solution.
The named entity linking method fusing prior information includes the following steps:
S1: extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps;
S2: represent every article in the Wikipedia data dump by its term frequency-inverse document frequency (tf-idf) features, and extract and store the commonness feature of each string relative to each candidate entity;
S3: perform query expansion using the person-name and place-name lists obtained in S1, then use the string-to-candidate-entity table from S1 to generate candidate entities for the entity mention;
S4: compute the important word hit rate IWHR of the entity mention relative to each candidate entity;
S5: using the tf-idf, commonness, and IWHR features extracted in S2 and S4, compute the degree of correlation between the entity mention and each of its candidate entities, and take the most correlated candidate as the entity linking result.
Each of the above steps can be implemented as follows.
S1 specifically comprises the following steps:
S11: parse the Wikipedia data dump; extract the articles D_e describing entities, the anchor texts A_e in each article, each article's entity id W_id, the redirect pages Repages, and the disambiguation pages dispages; from these, generate the string-to-candidate-entity table str2entity;
S12: extract all person and place names from the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
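Building str2entity from anchor texts (S11) might be sketched as follows; the anchor pairs are invented toy data, and the nested-counter layout is our own assumption (the counts also serve later as the numerator of commonness):

```python
from collections import defaultdict

# Hypothetical (surface string, linked entity) pairs, as they would be
# harvested from the Wikipedia dump's article markup, together with
# redirect and disambiguation pages.
anchors = [
    ("Clinton", "Hillary_Clinton"),
    ("Clinton", "Bill_Clinton"),
    ("Clinton", "Hillary_Clinton"),
    ("Big Apple", "New_York_City"),
]

# str2entity maps each surface string to its candidate entities, with
# per-entity anchor counts.
str2entity = defaultdict(lambda: defaultdict(int))
for surface, entity in anchors:
    str2entity[surface][entity] += 1

# Candidates for "Clinton", most frequently linked first:
candidates = sorted(str2entity["Clinton"],
                    key=str2entity["Clinton"].get, reverse=True)
```

A real dump pass would also fold in redirect and disambiguation pages as additional surface forms of their target entities.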
S2 specifically comprises the following steps:
S21: segment every Wikipedia article with the natural language processing tool Stanford CoreNLP, remove stop words with a stop-word dictionary, and obtain the vocabulary;
S22: based on the vocabulary, calculate the inverse document frequency idf of every word in every article; the idf of a word is calculated as

idf(word) = log(N / df(word))

where N, the document count of the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the vocabulary, calculate the term frequency tf of every word in every article; the tf of a word in an article is calculated as

tf(word) = count(word) / (total number of words in the article);

S24: from the results of S22 and S23, calculate the term frequency-inverse document frequency tf-idf vector of every article of Wikipedia; the tf-idf value of a word is

tfidf_word(word) = tf(word) × idf(word);

S25: based on the tfidf_word(word) values obtained in S24, keep the 20 largest tf-idf values of each article, with their words, as that article's tf-idf feature, denoted tfidf(document);
S26: calculate the commonness feature of each string relative to each candidate entity according to the following formula:

commonness(e, m) = |A_{m,e}| / |A_m|

where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and whose linked entity is e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: store the tf-idf features calculated for every article and the commonness feature of each string relative to each candidate entity.
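A minimal sketch of S22-S25, assuming articles that are already tokenised and stop-word-free per S21 (the documents and all names below are hypothetical):

```python
import math
from collections import Counter

# Hypothetical mini-corpus; real input would be the full Wikipedia dump.
docs = {
    "d1": ["entity", "linking", "wikipedia", "entity"],
    "d2": ["wikipedia", "dump", "anchor"],
    "d3": ["entity", "anchor", "anchor", "linking"],
}

# S22: idf(word) = log(N / df(word)) over the whole corpus.
n_docs = len(docs)
df = Counter(w for words in docs.values() for w in set(words))
idf = {w: math.log(n_docs / df[w]) for w in df}

def tfidf_features(words, top_k=20):
    """S23-S25: tf-idf values of one article, keeping the top_k words."""
    counts = Counter(words)
    total = len(words)
    vec = {w: (c / total) * idf[w] for w, c in counts.items()}
    return dict(sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

feats = {name: tfidf_features(words) for name, words in docs.items()}
```

Truncating each article to its top 20 words (S25) keeps the stored feature vectors small while preserving the terms that dominate the cosine similarity of S53.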
S3 specifically comprises the following steps:
S31: given the entity mention s, query Pname and Plocation; if the string appears in either list, go to step S32, otherwise go to step S33;
S32: check whether a string s' occurs before s in the article such that s is a substring of s'; if it exists, replace s with s' and go to S33; if not, go directly to S33;
S33: query str2entity with the string s to obtain all candidate entities of the string.
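The query expansion of S31-S32 might be sketched as follows; the fixed window sizes and the token-span scan are our own simplification of the substring check described in S32, and all names are illustrative:

```python
def expand_query(mention, article_text, offset, person_names, place_names):
    """S31-S32: if the mention is a known person or place name, scan the
    text before its position for a longer string that contains it (e.g.
    'Clinton' -> 'Hillary Clinton') and link that string instead."""
    if mention not in person_names and mention not in place_names:
        return mention  # S31: not a person/place name, skip expansion
    tokens = article_text[:offset].split()
    for size in (2, 3):  # try two-word spans first, then three-word
        for i in range(len(tokens) - size + 1):
            span = " ".join(tokens[i:i + size])
            if mention in span and span != mention:
                return span  # S32: s is a substring of an earlier s'
    return mention

text = "Hillary Clinton spoke today. Later Clinton left early."
expanded = expand_query("Clinton", text, text.rfind("Clinton"),
                        {"Clinton"}, set())
```

A production version would restrict candidate spans to ones that are themselves names, rather than accepting any earlier token run containing the mention.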
S4 specifically comprises the following steps:
S41: segment the article containing the entity mention with the Stanford CoreNLP tool, remove stop words, and obtain the vocabulary;
S42: obtain the idf value idf(w) of each word in the article containing the mention, using the idf formula of S22;
S43: calculate the important word hit rate IWHR(e, m) according to the following formula:

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|

where e is a candidate entity, m is the entity mention, W_d is the word set of the article containing m, W_e is the word set of the article of e, and T is the chosen idf threshold.
In S5, the degree of correlation between the mention and each of its candidate entities is computed, and the most correlated candidate is taken as the entity linking result, as follows:
S51: extract the article d_m containing the entity mention m and the article d_e of each candidate entity e;
S52: obtain the tf-idf features of articles d_m and d_e from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate entity as the cosine similarity of the two feature vectors:

tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (||tfidf(d_m)|| × ||tfidf(d_e)||)

where ||·|| denotes the norm of a vector;
S54: obtain from the results of S2 and S4 the commonness and IWHR features of the mention m relative to the candidate entity e;
S55: for each candidate entity e, compute the similarity between the mention and the candidate entity according to the following formula:

similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))

where a, b, c are constants;
S56: compute the final entity linking result e_result:

e_result = argmax_e(similarity(e, m)).

The constants a, b, c can be learned by a neural network or tuned by hand; the suggested settings are 1.0, 6.0, and 1.0 respectively.
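The scoring of S55-S56 can be sketched as a log-linear combination over precomputed features; the candidate names and feature values are invented, and the eps guard against log(0) is our own addition (the patent applies the plain logarithm):

```python
import math

def link(candidate_feats, a=1.0, b=6.0, c=1.0, eps=1e-10):
    """S55-S56: score each candidate with the log-linear combination and
    return the argmax. eps guards log(0) when a feature is zero."""
    def score(f):
        return (a * math.log(f["commonness"] + eps)
                + b * math.log(f["tfidf_sim"] + eps)
                + c * math.log(f["iwhr"] + eps))
    return max(candidate_feats, key=lambda e: score(candidate_feats[e]))

candidates = {  # invented feature values for one mention of "Turkey"
    "Turkey_(country)": {"commonness": 0.6, "tfidf_sim": 0.30, "iwhr": 0.5},
    "Turkey_(bird)":    {"commonness": 0.4, "tfidf_sim": 0.05, "iwhr": 0.2},
}
best = link(candidates)
```

The large suggested weight b = 6.0 makes the tf-idf context similarity dominate, with commonness and IWHR acting as prior and keyword corrections.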
The present invention uses only three features, entity commonness, tf-idf, and important word hit rate, which breaks through the limitation of scarce corpora and provides users with reliable entity-linking recommendations; the entity commonness feature adds the prior information.
Description of the drawings
Fig. 1 is the workflow of extracting resources from the Wikipedia and Freebase data dumps;
Fig. 2 is the workflow of the main steps of the named entity linking method fusing prior information.
Detailed description of embodiments
The present invention is further elaborated below with reference to the accompanying drawings and specific embodiments.
The present invention targets the named entity linking task. By combining the three features commonness, tf-idf, and IWHR, it realises a named entity linking method that fuses prior information. The method jointly considers the mention's context, the prior popularity of entities, and the importance of the keywords of the text, achieving high accuracy while also keeping entity linking efficient. Because few features are used, few parameters need to be fitted, which also makes migrating to another knowledge base or corpus convenient.
As illustrated in Figs. 1 and 2, a named entity linking method fusing prior information includes the following steps.
S1: extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps. This step is implemented as follows:
S11: parse the Wikipedia data dump; extract the articles D_e describing entities, the anchor texts A_e in each article, each article's entity id W_id, the redirect pages Repages, and the disambiguation pages dispages; from these, generate the string-to-candidate-entity table str2entity;
S12: extract all person and place names from the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
S2: represent every article in the Wikipedia data dump by its term frequency-inverse document frequency (tf-idf) features, and extract and store the commonness feature of each string relative to each candidate entity. This step is implemented as follows:
S21: segment every Wikipedia article with the natural language processing tool Stanford CoreNLP, remove stop words with a stop-word dictionary, and obtain the vocabulary;
S22: based on the vocabulary, calculate the inverse document frequency idf of every word in every article; the idf of a word is calculated as

idf(word) = log(N / df(word))

where N, the document count of the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the vocabulary, calculate the term frequency tf of every word in every article; the tf of a word in an article is calculated as

tf(word) = count(word) / (total number of words in the article);

S24: from the results of S22 and S23, calculate the term frequency-inverse document frequency tf-idf vector of every article of Wikipedia; the tf-idf value of a word is

tfidf_word(word) = tf(word) × idf(word);

S25: based on the tfidf_word(word) values obtained in S24, keep the 20 largest tf-idf values of each article, with their words, as that article's tf-idf feature, denoted tfidf(document);
S26: calculate the commonness feature of each string relative to each candidate entity according to the following formula:

commonness(e, m) = |A_{m,e}| / |A_m|

where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and whose linked entity is e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: store the tf-idf features calculated for every article and the commonness feature of each string relative to each candidate entity.
S3: perform query expansion using the person-name and place-name lists obtained in S1, then use the string-to-candidate-entity table from S1 to generate candidate entities for the entity mention. This step is implemented as follows:
S31: given the entity mention s, query Pname and Plocation; if the string appears in either list, go to step S32, otherwise go to step S33;
S32: check whether a string s' occurs before s in the article such that s is a substring of s'; if it exists, replace s with s' and go to S33; if not, go directly to S33;
S33: query str2entity with the string s to obtain all candidate entities of the string.
S4: compute the important word hit rate IWHR of the entity mention relative to each candidate entity. This step is implemented as follows:
S41: segment the article containing the entity mention with the Stanford CoreNLP tool, remove stop words, and obtain the vocabulary;
S42: obtain the idf value idf(w) of each word in the article containing the mention, using the idf formula of S22;
S43: calculate the important word hit rate IWHR(e, m) according to the following formula:

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|

where e is a candidate entity, m is the entity mention, W_d is the word set of the article containing m, W_e is the word set of the article of e, and T is the chosen idf threshold.
S5: using the tf-idf, commonness, and IWHR features extracted in S2 and S4, compute the degree of correlation between the entity mention and each of its candidate entities, and take the most correlated candidate as the entity linking result. This step is implemented as follows:
S51: extract the article d_m containing the entity mention m and the article d_e of each candidate entity e;
S52: obtain the tf-idf features of articles d_m and d_e from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate entity as the cosine similarity of the two feature vectors:

tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (||tfidf(d_m)|| × ||tfidf(d_e)||)

where ||·|| denotes the norm of a vector;
S54: obtain from the results of S2 and S4 the commonness and IWHR features of the mention m relative to the candidate entity e;
S55: for each candidate entity e, compute the similarity between the mention and the candidate entity according to the following formula:

similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))

where a, b, c are constants;
S56: compute the final entity linking result e_result:

e_result = argmax_e(similarity(e, m)).

The constants a, b, c can be set by hand to 1.0, 6.0, and 1.0 respectively.
The method is applied below to the following embodiment, so that those skilled in the art can better understand the specific realisation of the present invention.
Embodiment
Taking the documents of the entity discovery and linking subtask of the 2017 Text Analysis Conference as an example, the above method is applied to link the named entities in the text (the resource acquisition process is relatively complex and is not described again). The specific parameters and practices in each step are as follows:
1. Obtain the mentions to be linked from the original document set with a named entity recognition tool or by manual annotation; concretely, provide for each mention a triple of the article it occurs in, its start position, and the position of its last character;
2. Write a script to extract all content of the document set (removing XML tags), one file per article;
3. Segment every article with the natural language processing tool Stanford CoreNLP, remove stop words, and count each article's total number of words;
4. Count the occurrences of each word in every article and calculate the tf value of each word in every article:

tf(word) = count(word) / (total number of words in the article);

5. From the vocabulary and word idf values counted over Wikipedia and the tf values calculated above, calculate the tf-idf value of each word in every article:

tfidf(word) = tf(word) × idf(word);

6. In every article, sort the tf-idf values in descending order and take the first 20, with their corresponding words, as the article's tf-idf features;
7. For each mention identified in step 1, perform query expansion if it is a person or place name. The judgment is: if the mention is inside the person-name list, it is taken as a person name; if inside the place-name list, as a place name. The expansion works as follows: determine whether, before s in the article containing s, there is a string s' such that s is an abbreviation or a part of s' (for example s' is Hillary Clinton and s is Clinton); if such a case exists, replace s with s';
8. For each mention, query the string-candidate entity-commonness lists to obtain the candidate entities of the string and the corresponding commonness features;
9. For each mention, calculate the tf-idf similarity between the mention's article and each of its candidate entities' articles:

tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (||tfidf(d_m)|| × ||tfidf(d_e)||);

10. For each mention, calculate the IWHR similarity between the mention's article and each of its candidate entities. Let e be a candidate Wikipedia entity, m the string to be identified, W_d the word set of the article containing m, and W_e the word set of e's page; then IWHR(e, m) is calculated according to the following equality (2), where T is a manually set idf threshold:

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|    (2)

11. For each mention and each of its candidate entities, calculate the degree of correlation:

similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))    (5)

with (a, b, c) set to (1.0, 6.0, 1.0);
12. Take the e that maximises the above formula as the linking result of m, namely the following equality:

e_result = argmax_e(similarity(e, m))    (6)
The following table shows part of the final linking results for the selected documents.
WORD Beg End KBid
Turkey 2279 2284 m.01znc_
Microsoft 2620 2628 m.04sv4
Nam Dinh 3703 3710 m.07m1dj
the Beatles 2078 2088 m.07c0j
Gaisano mall 2642 2653 m.09rxbx2

Claims (8)

1. A named entity linking method fusing prior information, characterised by comprising the following steps:
S1: extracting a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps;
S2: representing every article in the Wikipedia data dump by its term frequency-inverse document frequency (tf-idf) features, and extracting and storing the commonness feature of each string relative to each candidate entity;
S3: performing query expansion using the person-name and place-name lists obtained in S1, then using the string-to-candidate-entity table from S1 to generate candidate entities for the entity mention;
S4: computing the important word hit rate IWHR of the entity mention relative to each candidate entity;
S5: using the tf-idf, commonness, and IWHR features extracted in S2 and S4, computing the degree of correlation between the entity mention and each of its candidate entities, and taking the most correlated candidate as the entity linking result.
2. The named entity linking method fusing prior information according to claim 1, characterised in that S1 specifically comprises:
S11: parsing the Wikipedia data dump; extracting the articles D_e describing entities, the anchor texts A_e in each article, each article's entity id W_id, the redirect pages Repages, and the disambiguation pages dispages; and generating from these the string-to-candidate-entity table str2entity;
S12: extracting all person and place names from the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
3. The named entity linking method fusing prior information according to claim 1, characterised in that S2 specifically comprises:
S21: segmenting every Wikipedia article with the natural language processing tool Stanford CoreNLP, removing stop words with a stop-word dictionary, and obtaining the vocabulary;
S22: based on the vocabulary, calculating the inverse document frequency idf of every word in every article, the idf of a word being

idf(word) = log(N / df(word)),

where N, the document count of the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the vocabulary, calculating the term frequency tf of every word in every article, the tf of a word in an article being

tf(word) = count(word) / (total number of words in the article);

S24: from the results of S22 and S23, calculating the term frequency-inverse document frequency tf-idf vector of every article of Wikipedia, the tf-idf value of a word being

tfidf_word(word) = tf(word) × idf(word);

S25: based on the tfidf_word(word) values obtained in S24, keeping the 20 largest tf-idf values of each article as that article's tf-idf feature, denoted tfidf(document);
S26: calculating the commonness feature of each string relative to each candidate entity according to the formula

commonness(e, m) = |A_{m,e}| / |A_m|,

where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and whose linked entity is e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: storing the tf-idf features calculated for every article and the commonness feature of each string relative to each candidate entity.
4. The named entity linking method fusing prior information according to claim 1, characterised in that S3 specifically comprises:
S31: given the entity mention s, querying Pname and Plocation; if the string appears in either list, going to step S32, otherwise going to step S33;
S32: checking whether a string s' occurs before s in the article such that s is a substring of s'; if it exists, replacing s with s' and going to S33; if not, going directly to S33;
S33: querying str2entity with the string s to obtain all candidate entities of the string.
5. The named entity linking method fusing prior information according to claim 1, characterised in that S4 specifically comprises:
S41: segmenting the article containing the entity mention with the Stanford CoreNLP tool, removing stop words, and obtaining the vocabulary;
S42: obtaining the idf value idf(w) of each word in the article containing the mention, using the idf formula of S22;
S43: calculating the important word hit rate IWHR(e, m) according to the formula

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|,

where e is a candidate entity, m is the entity mention, W_d is the word set of the article containing m, W_e is the word set of the article of e, and T is the set idf threshold.
6. The named entity linking method fusing prior information according to claim 5, wherein the idf threshold T is set as
7. The named entity linking method fusing prior information according to claim 1, wherein in S5, the correlation degree between the entity mention and each of its candidate entities is computed, and the candidate with the highest correlation degree is taken as the entity linking result, as follows:
S51:Extract the article d_m containing the entity mention m and the article d_e containing a candidate entity e;
S52:Obtain the tf-idf features of articles d_m and d_e from the results stored in S2;
S53:For each candidate entity e, compute the tf-idf similarity between the entity mention and the candidate entity as the cosine of their tf-idf feature vectors:
tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (||tfidf(d_m)|| × ||tfidf(d_e)||)
Wherein || || denotes the norm of a vector;
S54:Obtain the commonness and IWHR features of the entity mention m relative to candidate entity e from the results of S2 and S4;
S55:For each candidate entity e, compute the similarity between the entity mention and the candidate entity according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
Wherein a, b, and c are constants;
S56:Compute the final entity linking result e_result:
e_result = argmax_e(similarity(e, m)).
8. The named entity linking method fusing prior information according to claim 7, wherein a, b, and c are set to 1.0, 6.0, and 1.0, respectively.
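Claims 7 and 8 combine the three features into a log-linear score and select the argmax; a minimal sketch, assuming all three feature values are strictly positive (the patent does not state how zero-valued features are smoothed before taking the logarithm):

```python
import math

def similarity(commonness, tfidf_sim, iwhr, a=1.0, b=6.0, c=1.0):
    """S55: weighted sum of log-features; the defaults are the claim-8
    weights a=1.0, b=6.0, c=1.0."""
    return a * math.log(commonness) + b * math.log(tfidf_sim) + c * math.log(iwhr)

def link(features_by_entity):
    """S56: return the candidate entity with the highest similarity.
    `features_by_entity` maps entity -> (commonness, tfidf_sim, iwhr)."""
    return max(features_by_entity, key=lambda e: similarity(*features_by_entity[e]))
```

With b = 6.0, the tf-idf cosine term dominates: a candidate whose article closely matches the mention's context can outrank one with a higher commonness prior.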
CN201810103629.7A 2018-02-01 2018-02-01 Named entity linking method fusing prior information Active CN108363688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810103629.7A CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810103629.7A CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information

Publications (2)

Publication Number Publication Date
CN108363688A true CN108363688A (en) 2018-08-03
CN108363688B CN108363688B (en) 2020-04-28

Family

ID=63004109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810103629.7A Active CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information

Country Status (1)

Country Link
CN (1) CN108363688B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325230A (en) * 2018-09-21 2019-02-12 广西师范大学 A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining
CN110147401A (en) * 2019-05-22 2019-08-20 苏州大学 Merge the knowledge base abstracting method of priori knowledge and context-sensitive degree
CN110866385A (en) * 2018-08-17 2020-03-06 广州阿里巴巴文学信息技术有限公司 Method and device for releasing external piece of electronic book and readable storage medium
CN111814477A (en) * 2020-07-06 2020-10-23 重庆邮电大学 Dispute focus discovery method and device based on dispute focus entity and terminal
CN113157861A (en) * 2021-04-12 2021-07-23 山东新一代信息产业技术研究院有限公司 Entity alignment method fusing Wikipedia
CN113392220A (en) * 2020-10-23 2021-09-14 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2251795A2 (en) * 2009-05-12 2010-11-17 Comcast Interactive Media, LLC Disambiguation and tagging of entities
CN104462126A (en) * 2013-09-22 2015-03-25 富士通株式会社 Entity linkage method and device
US20170237628A1 (en) * 2016-02-17 2017-08-17 CENX, Inc. Service information model for managing a telecommunications network
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2251795A2 (en) * 2009-05-12 2010-11-17 Comcast Interactive Media, LLC Disambiguation and tagging of entities
CN104462126A (en) * 2013-09-22 2015-03-25 富士通株式会社 Entity linkage method and device
US20170237628A1 (en) * 2016-02-17 2017-08-17 CENX, Inc. Service information model for managing a telecommunications network
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866385A (en) * 2018-08-17 2020-03-06 广州阿里巴巴文学信息技术有限公司 Method and device for releasing external piece of electronic book and readable storage medium
CN110866385B (en) * 2018-08-17 2024-04-05 阿里巴巴(中国)有限公司 Method and device for publishing outside of electronic book and readable storage medium
CN109325230A (en) * 2018-09-21 2019-02-12 广西师范大学 A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining
CN110147401A (en) * 2019-05-22 2019-08-20 苏州大学 Merge the knowledge base abstracting method of priori knowledge and context-sensitive degree
CN111814477A (en) * 2020-07-06 2020-10-23 重庆邮电大学 Dispute focus discovery method and device based on dispute focus entity and terminal
CN111814477B (en) * 2020-07-06 2022-06-21 重庆邮电大学 Dispute focus discovery method and device based on dispute focus entity and terminal
CN113392220A (en) * 2020-10-23 2021-09-14 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN113392220B (en) * 2020-10-23 2024-03-26 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN113157861A (en) * 2021-04-12 2021-07-23 山东新一代信息产业技术研究院有限公司 Entity alignment method fusing Wikipedia
CN113157861B (en) * 2021-04-12 2022-05-24 山东浪潮科学研究院有限公司 Entity alignment method fusing Wikipedia

Also Published As

Publication number Publication date
CN108363688B (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN108363688A (en) Named entity linking method fusing prior information
CN104376406B (en) A kind of enterprise innovation resource management and analysis method based on big data
CN108959258B (en) Specific field integrated entity linking method based on representation learning
Sunilkumar et al. A survey on semantic similarity
CN103399901A (en) Keyword extraction method
CN104679728A (en) Text similarity detection device
CN104063387A (en) Device and method abstracting keywords in text
CN101782898A (en) Method for analyzing tendentiousness of affective words
CN110162630A (en) A kind of method, device and equipment of text duplicate removal
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
Das et al. Part of speech tagging in odia using support vector machine
CN106611041A (en) New text similarity solution method
Gahbiche-Braham et al. Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier.
CN102609424A (en) Method and equipment for extracting assessment information
Shajalal et al. Semantic textual similarity in bengali text
Thattinaphanich et al. Thai named entity recognition using Bi-LSTM-CRF with word and character representation
Rahman et al. NLP-based automatic answer script evaluation
Popescu et al. HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
Rahman et al. An Automated Approach for Answer Script Evaluation Using Natural Language Processing
Pal et al. Word sense disambiguation in Bengali: An unsupervised approach
CN111259661A (en) New emotion word extraction method based on commodity comments
Bloodgood et al. Using global constraints and reranking to improve cognates detection
CN114912446A (en) Keyword extraction method and device and storage medium
Li et al. Chinese frame identification using t-crf model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant