CN108363688B - Named entity linking method fusing prior information - Google Patents


Info

Publication number
CN108363688B
Authority
CN
China
Prior art keywords
entity
article
word
candidate
idf
Prior art date
Legal status
Active
Application number
CN201810103629.7A
Other languages
Chinese (zh)
Other versions
CN108363688A (en)
Inventor
汤斯亮
杨希远
陈博
林升
吴飞
庄越挺
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810103629.7A priority Critical patent/CN108363688B/en
Publication of CN108363688A publication Critical patent/CN108363688A/en
Application granted granted Critical
Publication of CN108363688B publication Critical patent/CN108363688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/205 Natural language analysis; Parsing (G06F — Electric digital data processing)
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N5/022 Knowledge engineering; Knowledge acquisition (G06N — Computing arrangements based on specific computational models)


Abstract

The invention discloses a named entity linking method that fuses prior information. The method comprises the following steps: (1) extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump; (2) represent each article in the Wikipedia data dump as a term frequency/inverse document frequency (tf-idf) feature, and extract the commonness feature of each string relative to each candidate entity; (3) query-expand the entity mention and generate candidate entities for it using the string-to-candidate-entity table from step (1); (4) extract features of the article containing the mention, namely the inverse document frequencies of its words and the important word hit rate; (5) using the features extracted in steps (2) and (4), compute the degree of association between the mention and each of its candidate entities, and take the candidate with the highest association as the linking result. The invention breaks through the limitation of scarce training corpora and provides users with reliable entity linking recommendations, with prior information added through the entity commonness feature.

Description

Named entity linking method fusing prior information
Technical Field
The invention relates to natural language processing, and in particular to a named entity linking method that fuses prior information.
Background
Natural Language Processing (NLP) is an interdisciplinary field combining linguistics and computer science. Named Entity Linking (NEL) is a fundamental task in NLP. Its goal is to resolve the ambiguity caused by linguistic phenomena such as aliases, coreference, and polysemy by establishing a correspondence between proper nouns (entity names) appearing in a text and the entities they refer to in a knowledge base. The problem is defined as: given a piece of text and the mentions in it (a mention being a string to be linked), find the entities that the mentions refer to in a specified knowledge base.
Entity linking, the technique of establishing links between text and a knowledge base, plays a very important role in information extraction. Relation Extraction is a typical information extraction task that requires entity linking: its goal is to extract relations between different entities from text, so locating the mentions in the text and finding the entities they correspond to through entity linking is a prerequisite for further analysis.
In addition, entity linking effectively adds information to the original text, so it can also be used in other natural language processing and text mining problems, helping to understand the text more deeply and obtain better results.
Entity linking is typically implemented in several steps, the two most important of which are candidate generation and candidate disambiguation (candidate ranking). The candidate generation step uses the surface string of the current mention to find the entities it may refer to as candidates; the candidate disambiguation step then selects the best candidate as the final linking result, based on the context of the mention and characteristics of the candidate entities themselves.
A common practice for candidate generation is to construct a dictionary in advance that records which entities each name may correspond to; at linking time, candidate entities are looked up in this dictionary by the mention's surface string. The dictionary is typically built from information provided in the knowledge base.
For candidate disambiguation, common practice includes collaborative and non-collaborative approaches. Collaborative methods consider multiple mentions in the same context simultaneously and prefer linking results in which the respective target entities are as strongly associated as possible; non-collaborative methods link each mention individually. The approach used here is non-collaborative: such methods are generally faster than collaborative ones, at some cost in accuracy.
Traditional non-collaborative methods design a series of features, including: surface features, which measure the similarity between the mention string and the candidate entity name, such as the number of words the two share; context features, which measure the semantic match between the candidate entity and the mention's context, such as the tf-idf similarity between the mention's document and the candidate entity's description, or whether words from the candidate entity's Wikipedia page title appear in the mention's document; and other features, such as the number of country names co-occurring in the mention's document and the candidate entity's description, or in the mention's document and all aliases of the candidate entity.
Since such heuristic features require expert knowledge, the original feature engineering becomes ineffective once the knowledge base or corpus changes; we therefore aim to achieve good results with as few features as possible.
Disclosure of Invention
The invention aims to provide a named entity linking method that fuses prior information, in order to link entities recognized in natural text to a target knowledge base (Freebase) and thereby provide a basis for subsequent tasks such as information extraction.
To this end, the IWHR (Important Word Hit Rate) feature is introduced and combined with the commonness and tf-idf features to judge the degree of match between a mention and an entity; surface-name matching is ensured by the candidate generation process, and the commonness feature injects prior information into the method.
Because the parameters of a typical named entity linking model must be determined from training corpora, which are difficult to obtain, the proposed method combines the three features without any training and gives recommended parameter settings.
In addition, to compensate for the fact that non-collaborative methods do not consider context, a query expansion step is added before entity linking, specifically optimized for person and place names that may refer to the same entity within the same article.
tf-idf (term frequency/inverse document frequency) is a standard measure of the similarity between articles; here it is used to measure the similarity between the mention's context and the candidate entity's context.
Commonness reflects the probability of a candidate entity given a mention; introducing this feature amounts to adding prior information, which can be used to judge the mention's referent when the context is insufficient. The calculation is as follows: let A_{s,e} be the set of anchor texts whose surface string is s and that link to the page corresponding to entity e, and let A_s be the set of anchor texts whose surface string is s. Then:
commonness(e, s) = |A_{s,e}| / |A_s|   (1)
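As a minimal sketch of formula (1), assuming anchor-text counts have already been collected from the Wikipedia dump (the `anchor_counts` table and its contents below are hypothetical):

```python
from collections import Counter, defaultdict

# Hypothetical anchor statistics from a Wikipedia dump:
# anchor_counts[s][e] = number of anchor texts whose surface string is s
# and whose link target is entity e, i.e. |A_{s,e}|.
anchor_counts = defaultdict(Counter)
anchor_counts["Clinton"]["Hillary_Clinton"] = 60
anchor_counts["Clinton"]["Bill_Clinton"] = 40

def commonness(entity, surface):
    """commonness(e, s) = |A_{s,e}| / |A_s|."""
    counts = anchor_counts[surface]
    total = sum(counts.values())          # |A_s|
    return counts[entity] / total if total else 0.0

print(commonness("Hillary_Clinton", "Clinton"))  # 0.6
```

With these counts, the string "Clinton" links to Hillary_Clinton with prior probability 0.6, which is exactly what the disambiguation step falls back on when context is weak.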
IWHR remedies a deficiency of tf-idf by paying more attention to the important words appearing in the article contexts. The calculation is as follows: let e be a candidate Wikipedia entity, m the string to be linked, W_d the set of words of the article containing m, and W_e the set of words of the article of e; then IWHR(e, m) is computed according to equation (2), where T is a manually set idf threshold:
IWHR(e, m) = |{w ∈ W_d ∩ W_e : idf(w) > T}| / |{w ∈ W_d : idf(w) > T}|   (2)
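A minimal sketch of the hit-rate computation in equation (2), assuming word lists and idf values are already available; the toy idf table and articles below are illustrative only:

```python
def iwhr(mention_words, entity_words, idf, threshold):
    """Important word hit rate: of the words in the mention's article
    whose idf exceeds the threshold, the fraction that also appear in
    the candidate entity's article."""
    important = {w for w in mention_words if idf.get(w, 0.0) > threshold}
    if not important:
        return 0.0
    return len(important & set(entity_words)) / len(important)

# Toy example: "turkey" and "ankara" are important (high idf), but only
# "turkey" also occurs in the entity's article, so the hit rate is 1/2.
idf = {"turkey": 5.1, "ankara": 6.3, "visit": 1.9, "the": 0.1}
mention_article = ["the", "visit", "turkey", "ankara"]
entity_article = ["turkey", "republic", "anatolia"]
print(iwhr(mention_article, entity_article, idf, threshold=3.0))  # 0.5
```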
The invention is realized by the following technical scheme:
the named entity linking method fusing prior information comprises the following steps:
s1: extracting character strings, namely a candidate entity table, a human name list and a place name list from the Wikipedia data dump and the Freebase data dump;
s2: representing each article in the Wikipedia data dump as a word frequency-inverse document frequency tf-idf characteristic, extracting and storing a commonality characteristic of each character string relative to a candidate entity;
s3: inquiring and expanding the name list and the place name list obtained in the S1, and generating candidate entities for entity mentions by using the character string-candidate entity table obtained in the S1;
s4: calculating the important word collision rate IWHR of the entity relative to the candidate entity;
s5: and calculating the association degrees of the entity and each candidate entity thereof according to the tf-idf feature, common feature and IWHR feature extracted in the S2 and the S4, and taking the highest association degree as an entity link result.
The steps can be realized in the following way:
S1 specifically comprises the following steps:
S11: parse the Wikipedia data dump and extract, for each entity in Wikipedia, its article D_e, the anchor texts A_e in articles, the entity number W_id corresponding to each article, and the redirect and disambiguation pages, and from these generate the string-to-candidate-entity table str2entity;
S12: extract all person names and place names in the Freebase data dump to form a person-name list Pname and a place-name list Plocation.
S2 specifically comprises the following steps:
S21: segment each Wikipedia article into words with the natural language processing tool Stanford CoreNLP, and remove stop words using a stop-word dictionary to obtain a word list;
S22: based on the word list, compute the inverse document frequency idf of every word in each article; the idf of a word is:
idf(word) = log( N / df(word) )
where N, the number of documents in the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the word list, compute the term frequency tf of every word in each article; the tf of a word is:
tf(word) = count(word, document) / |document|
i.e. the number of occurrences of the word in the article divided by the article's total word count;
S24: from the results of S22 and S23, compute the tf-idf value of every word in each Wikipedia article:
tfidf_word(word) = tf(word) × idf(word)
S25: based on the tfidf_word(word) values obtained in S24, keep the tf-idf values of the top 20 words, in descending order, as the article's tf-idf feature, denoted tfidf(document);
S26: compute the commonness feature of each string relative to each candidate entity:
commonness(e, m) = |A_{m,e}| / |A_m|
where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and that link to entity e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: store the computed tf-idf feature of each article and the commonness feature of each string relative to each candidate entity.
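Steps S21-S25 can be sketched as follows, with a toy corpus standing in for the Wikipedia dump; tokenisation and stop-word removal (Stanford CoreNLP in the method) are assumed to have happened already:

```python
import math
from collections import Counter

def tfidf_features(doc, corpus, top_k=20):
    """tf(w) = count(w)/|doc|, idf(w) = log(N/df(w)); return the top_k
    (word, tf-idf) pairs in descending order, as in S25."""
    n_docs = len(corpus)
    df = Counter()
    for d in corpus:
        df.update(set(d))                 # document frequency of each word
    counts = Counter(doc)
    scores = {w: (c / len(doc)) * math.log(n_docs / df[w])
              for w, c in counts.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

corpus = [["entity", "linking", "wikipedia"],
          ["entity", "graph"],
          ["music", "album"]]
print(tfidf_features(corpus[0], corpus, top_k=2))
```

Words appearing in every article get idf 0 and drop to the bottom of the ranking, so the retained top-20 words are the article's most distinctive terms.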
S3 specifically comprises the following steps:
S31: look up the mention string s in Pname and Plocation; if it is in either list, go to step S32, otherwise go to step S33;
S32: check whether a string s' exists in the article containing s such that s is a substring of s'; if so, replace s with s' and then go to S33, otherwise go directly to S33;
S33: query str2entity with the string s to obtain all of its candidate entities.
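Steps S31-S33 can be sketched as below; the string-to-entity table and name lists are hypothetical stand-ins for the resources extracted in S1, and picking the longest matching expansion is an assumption (the method only requires that s be a substring of s'):

```python
def generate_candidates(mention, article_text, str2entity,
                        person_names, place_names):
    """S31-S33: if the mention is a known person or place name, try to
    expand it to a longer surface string s' that occurs in the same
    article and contains the mention as a substring, then look up
    candidates in the string-to-entity table."""
    s = mention
    if s in person_names or s in place_names:
        longer = [cand for cand in str2entity
                  if s in cand and cand != s and cand in article_text]
        if longer:
            s = max(longer, key=len)      # prefer the longest expansion
    return s, str2entity.get(s, [])

# Hypothetical resources:
str2entity = {"Clinton": ["Bill_Clinton", "Hillary_Clinton"],
              "Hilary Clinton": ["Hillary_Clinton"]}
article = "Hilary Clinton gave a speech. Clinton said ..."
s, cands = generate_candidates("Clinton", article, str2entity,
                               person_names={"Clinton"}, place_names=set())
print(s, cands)  # Hilary Clinton ['Hillary_Clinton']
```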
S4 specifically comprises the following steps:
S41: segment the article containing the entity mention into words with Stanford CoreNLP and remove stop words to obtain a word list;
S42: obtain the idf value idf(w) of every word in the mention's article using the idf formula of S22;
S43: compute the important word hit rate IWHR(e, m) according to equation (2):
IWHR(e, m) = |{w ∈ W_d ∩ W_e : idf(w) > T}| / |{w ∈ W_d : idf(w) > T}|
where e is a candidate entity, m is the entity mention, W_d is the set of words of the article containing m, W_e is the set of words of the article of e, and T is a manually set idf threshold.
In S5, the degree of association between the mention and each of its candidate entities is computed, and the candidate with the highest association is taken as the linking result; the specific steps are as follows:
S51: extract the article d_m containing the mention m and the article d_e of each candidate entity e;
S52: retrieve the tf-idf features of d_m and d_e from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate as the cosine similarity of the two feature vectors:
tfidfsimilarity(e, m) = ( tfidf(d_m) · tfidf(d_e) ) / ( |tfidf(d_m)| × |tfidf(d_e)| )
where |·| denotes the norm of a vector;
S54: retrieve the commonness and IWHR features of the mention m relative to the candidate entity e from the results of S2 and S4;
S55: for each candidate entity e, compute the similarity between the mention and the candidate according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
where a, b, and c are constants;
S56: compute the final linking result e_result:
e_result = argmax_e similarity(e, m)
The constants a, b, and c can be learned with a neural network or debugged and set manually; the recommended values are 1.0, 6.0, and 1.0 respectively.
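Steps S53-S56 can be sketched as a log-linear scoring function; the feature values below are invented, and the small eps guard against log(0) is an assumption, since the method does not say how zero-valued features are handled:

```python
import math

def cosine(u, v):
    """tf-idf similarity of two sparse {word: weight} vectors (S53)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(candidates, features, a=1.0, b=6.0, c=1.0, eps=1e-12):
    """S55-S56: score each candidate and return the argmax."""
    def score(e):
        cm, tfidf_sim, iw = features[e]   # commonness, tf-idf sim, IWHR
        return (a * math.log(cm + eps)
                + b * math.log(tfidf_sim + eps)
                + c * math.log(iw + eps))
    return max(candidates, key=score)

# Illustrative feature values for two competing senses of one mention:
features = {"Turkey_(country)": (0.70, 0.30, 0.40),
            "Turkey_(bird)":    (0.25, 0.05, 0.10)}
print(link(list(features), features))  # Turkey_(country)
```

With the recommended weights, the tf-idf similarity dominates (b = 6.0), while commonness and IWHR act as prior and important-word corrections.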
The invention uses only three features, namely entity commonness, term frequency/inverse document frequency, and important word hit rate; this breaks through the limitation of scarce training corpora and provides users with reliable entity linking recommendations, with the prior information carried by the commonness feature.
Drawings
FIG. 1 is a workflow diagram of resource extraction from the Wikipedia data dump and the Freebase data dump;
FIG. 2 is a workflow diagram of the main steps of the named entity linking method fusing prior information.
Detailed Description
The invention is further described below with reference to the figures and a detailed embodiment.
For the named entity linking task, the invention combines the three features commonness, tf-idf, and IWHR to realize a named entity linking method fusing prior information. Because few features are used, few parameters ultimately need fitting, which makes the method easier to migrate to a new knowledge base or corpus.
As shown in fig. 1 and 2, the named entity linking method fusing prior information comprises the following steps:
S1: extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump. This is implemented as follows:
S11: parse the Wikipedia data dump and extract, for each entity in Wikipedia, its article D_e, the anchor texts A_e in articles, the entity number W_id corresponding to each article, and the redirect and disambiguation pages, and from these generate the string-to-candidate-entity table str2entity;
S12: extract all person names and place names in the Freebase data dump to form a person-name list Pname and a place-name list Plocation.
S2: represent each article in the Wikipedia data dump as a term frequency-inverse document frequency (tf-idf) feature, and extract and store the commonness feature of each string relative to each candidate entity. This is implemented as follows:
S21: segment each Wikipedia article into words with the natural language processing tool Stanford CoreNLP, and remove stop words using a stop-word dictionary to obtain a word list;
S22: based on the word list, compute the inverse document frequency idf of every word in each article; the idf of a word is:
idf(word) = log( N / df(word) )
where N, the number of documents in the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the word list, compute the term frequency tf of every word in each article; the tf of a word is:
tf(word) = count(word, document) / |document|
i.e. the number of occurrences of the word in the article divided by the article's total word count;
S24: from the results of S22 and S23, compute the tf-idf value of every word in each Wikipedia article:
tfidf_word(word) = tf(word) × idf(word)
S25: based on the tfidf_word(word) values obtained in S24, keep the tf-idf values of the top 20 words, in descending order, as the article's tf-idf feature, denoted tfidf(document);
S26: compute the commonness feature of each string relative to each candidate entity:
commonness(e, m) = |A_{m,e}| / |A_m|
where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and that link to entity e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: store the computed tf-idf feature of each article and the commonness feature of each string relative to each candidate entity.
S3: query-expand the entity mention against the person-name and place-name lists obtained in S1, and generate candidate entities for the mention using the string-to-candidate-entity table obtained in S1. This is implemented as follows:
S31: look up the mention string s in Pname and Plocation; if it is in either list, go to step S32, otherwise go to step S33;
S32: check whether a string s' exists in the article containing s such that s is a substring of s'; if so, replace s with s' and then go to S33, otherwise go directly to S33;
S33: query str2entity with the string s to obtain all of its candidate entities.
S4: compute the important word hit rate IWHR of the entity mention relative to each candidate entity. This is implemented as follows:
S41: segment the article containing the entity mention into words with Stanford CoreNLP and remove stop words to obtain a word list;
S42: obtain the idf value idf(w) of every word in the mention's article using the idf formula of S22;
S43: compute the important word hit rate IWHR(e, m) by the following formula:
IWHR(e, m) = |{w ∈ W_d ∩ W_e : idf(w) > T}| / |{w ∈ W_d : idf(w) > T}|
where e is a candidate entity, m is the entity mention, W_d is the set of words of the article containing m, W_e is the set of words of the article of e, and T is a manually set idf threshold.
S5: compute the degree of association between the mention and each of its candidate entities from the tf-idf, commonness, and IWHR features extracted in S2 and S4, and take the candidate with the highest association as the linking result. This is implemented as follows:
S51: extract the article d_m containing the mention m and the article d_e of each candidate entity e;
S52: retrieve the tf-idf features of d_m and d_e from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate as the cosine similarity of the two feature vectors:
tfidfsimilarity(e, m) = ( tfidf(d_m) · tfidf(d_e) ) / ( |tfidf(d_m)| × |tfidf(d_e)| )
where |·| denotes the norm of a vector;
S54: retrieve the commonness and IWHR features of the mention m relative to the candidate entity e from the results of S2 and S4;
S55: for each candidate entity e, compute the similarity between the mention and the candidate according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
where a, b, and c are constants;
S56: compute the final linking result e_result:
e_result = argmax_e similarity(e, m)
Here a, b, and c can be set manually; the recommended values are 1.0, 6.0, and 1.0 respectively.
The method is applied to the following example in order that those skilled in the art may better understand its specific implementation.
Examples
Taking the documents of the entity discovery and linking subtask of the 2017 Text Analysis Conference (TAC) as an example, the method is applied to text named entity linking (the resource extraction process, which is more involved, is not detailed here). The specific parameters and methods of each step are as follows:
1. obtain the mentions to be linked on the original document set with a named entity recognition tool or by manual annotation; a mention is given concretely as a triple of the article, the position of the first character, and the position of the last character of the given string;
2. write a script to extract all content from the document set (removing xml tags), with each article as one file;
3. segment each article into words with the natural language processing tool Stanford CoreNLP, remove stop words, and count the total word count of each article;
4. for each article, count the occurrences of each word and compute each word's tf value by the following formula:
tf(word) = count(word, document) / |document|
5. compute the tf-idf value of each word in each article from the word list and idf values obtained from the Wikipedia statistics and the tf values computed above, by the following formula:
tfidf_word(word) = tf(word) × idf(word)
6. in each article, take the 20 words with the highest tf-idf values, in descending order, together with those values, as the article's tf-idf feature;
7. decide for each mention identified in step 1 whether query expansion applies, and if so, perform it. The judgment is as follows: if the mention appears in the person-name list or the place-name list, it is considered a person or place name. The expansion is as follows: determine whether a string s' occurs before s in the article containing s, where s is an abbreviation of s' or s is a part of s' (for example, s' is Hilary Clinton and s is Clinton); if such an s' exists, s is replaced by s';
8. for each mention, query the extracted string-candidate entity-commonness table to obtain the string's candidate entities and the corresponding commonness features;
9. for each mention, compute the tf-idf similarity between the mention's article and each corresponding candidate entity's article by the following formula:
tfidfsimilarity(e, m) = ( tfidf(d_m) · tfidf(d_e) ) / ( |tfidf(d_m)| × |tfidf(d_e)| )
10. for each mention, compute the IWHR similarity between the mention's article and each corresponding candidate entity by the following formula: let e be a candidate Wikipedia entity, m the string to be linked, W_d the word set of the article containing m, and W_e the word set of the article of e; then IWHR(e, m) is computed as in equation (2), with T a manually set threshold:
IWHR(e, m) = |{w ∈ W_d ∩ W_e : idf(w) > T}| / |{w ∈ W_d : idf(w) > T}|   (2)
11. for each mention and each of its candidate entities, compute the mention-entity relevance by the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))   (5)
with (a, b, c) set to (1.0, 6.0, 1.0);
12. take the e that maximizes the above formula as the linking result for m, i.e.:
e_result = argmax_e similarity(e, m)   (6)
The following table shows part of the final linking results for selected documents.
Mention Beg End Freebase id
Turkey 2279 2284 m.01znc_
Microsoft 2620 2628 m.04sv4
Nam Dinh 3703 3710 m.07m1dj
the Beatles 2078 2088 m.07c0j
Gaisano mall 2642 2653 m.09rxbx2
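The result rows above pair each mention triple from step 1 with a Freebase machine id. A minimal sketch of that record layout follows; the inclusive end offset is inferred from the row lengths (e.g. Turkey spans 2279-2284, six characters), and the padded stand-in document is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class LinkedMention:
    word: str      # surface string of the mention
    beg: int       # position of the mention's first character in the document
    end: int       # position of the mention's last character (inclusive)
    kb_id: str     # Freebase machine id of the linked entity

# Rows taken from the result table above:
results = [
    LinkedMention("Turkey", 2279, 2284, "m.01znc_"),
    LinkedMention("Microsoft", 2620, 2628, "m.04sv4"),
    LinkedMention("the Beatles", 2078, 2088, "m.07c0j"),
]

def surface(doc, m):
    """Recover a mention's surface string from the document by its offsets."""
    return doc[m.beg:m.end + 1]

# A padded stand-in document that places "Turkey" at offset 2279:
doc = " " * 2279 + "Turkey"
print(surface(doc, results[0]))  # Turkey
```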

Claims (3)

1. A named entity linking method fusing prior information, characterized by comprising the following steps:
S1: extracting a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump;
S2: representing each article in the Wikipedia data dump as a term frequency-inverse document frequency (tf-idf) feature, and extracting and storing the commonness feature of each string relative to each candidate entity;
S3: query-expanding the entity mention against the person-name and place-name lists obtained in S1, and generating candidate entities for the mention using the string-to-candidate-entity table obtained in S1;
S4: computing the important word hit rate IWHR of the entity mention relative to each candidate entity;
S5: computing the degree of association between the mention and each of its candidate entities from the tf-idf, commonness, and IWHR features extracted in S2 and S4, and taking the candidate with the highest association as the linking result;
S1 specifically comprises the following steps:
S11: parsing the Wikipedia data dump and extracting, for each entity in Wikipedia, its article D_e, the anchor texts A_e in articles, the entity number W_id corresponding to each article, and the redirect and disambiguation pages, and from these generating the string-to-candidate-entity table str2entity;
S12: extracting all person names and place names in the Freebase data dump to form a person-name list Pname and a place-name list Plocation;
S2 specifically comprises the following steps:
S21: segmenting each Wikipedia article into words with the natural language processing tool Stanford CoreNLP, and removing stop words using a stop-word dictionary to obtain a word list;
S22: based on the word list, computing the inverse document frequency idf of every word in each article; the idf of a word is:
idf(word) = log( N / df(word) )
where N, the number of documents in the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the word list, computing the term frequency tf of every word in each article; the tf of a word is:
tf(word) = count(word, document) / |document|
i.e. the number of occurrences of the word in the article divided by the article's total word count;
S24: from the results of S22 and S23, computing the tf-idf value of every word in each Wikipedia article:
tfidf_word(word) = tf(word) × idf(word)
S25: based on the tfidf_word(word) values obtained in S24, keeping the tf-idf values of the top 20 words, in descending order, as the article's tf-idf feature, denoted tfidf(document);
S26: computing the commonness feature of each string relative to each candidate entity:
commonness(e, m) = |A_{m,e}| / |A_m|
where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and that link to entity e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: storing the computed tf-idf feature of each article and the commonness feature of each string relative to each candidate entity;
s3 specifically includes the following steps:
s31: inquiring Pname and Plocation according to the corresponding character string S mentioned by the entity, if the character string S is in any table, then turning to step S32, otherwise, turning to step S33;
s32: checking whether a character string S ' exists in the text of S, wherein S is the character sub-string of S ', if so, replacing S with S ' and then switching to S33, and if not, directly switching to S33;
s33: querying str2 entry by using the character string s to obtain all candidate entities of the character string s;
s4 specifically includes the following steps:
s41: using a StanfoldCoreNLP tool to perform word segmentation on the article where the entity is mentioned, and removing stop words to obtain a word list;
s42: obtaining the idf value idf(w) of each word in the article containing the entity mention according to the idf formula in S22;
s43: calculating the important word hit rate IWHR(e, m) according to the following formula:
IWHR(e, m) = |{w : w ∈ W_d ∩ W_e, idf(w) > T}| / |{w : w ∈ W_d, idf(w) > T}|
where e is a candidate entity, m is an entity mention, W_d is the word set of the article where m is located, W_e is the word set of the article where e is located, and T is the configured idf threshold;
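One plausible reading of IWHR — the share of the mention article's high-idf ("important") words that also appear in the candidate entity's article — can be sketched as follows (names hypothetical):

```python
def iwhr(words_d, words_e, idf_map, threshold):
    # "important" words are those whose idf exceeds the threshold T
    important_d = {w for w in words_d if idf_map.get(w, 0.0) > threshold}
    if not important_d:
        return 0.0
    important_e = {w for w in words_e if idf_map.get(w, 0.0) > threshold}
    # fraction of the mention article's important words hit by the candidate article
    return len(important_d & important_e) / len(important_d)
```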
in S5, the specific steps of calculating the association degree between the entity mention and each of its candidate entities, and taking the candidate with the highest association degree as the entity linking result, are as follows:
s51: extracting the article d_m containing the entity mention m and the article d_e containing the candidate entity e;
s52: obtaining the tf-idf features of articles d_m and d_e from the results stored in S2;
s53: for each candidate entity e, computing the tf-idf similarity between the entity mention and the candidate entity as the cosine of their tf-idf feature vectors:
tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (|tfidf(d_m)| × |tfidf(d_e)|)
where |·| denotes the modulus of a vector;
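The cosine similarity of S53, applied to sparse tf-idf features stored as word→weight dictionaries (a sketch; the patent does not prescribe a storage format):

```python
import math

def tfidf_similarity(vec_m, vec_e):
    # cosine similarity between two sparse tf-idf vectors (dicts: word -> weight)
    dot = sum(w * vec_e.get(k, 0.0) for k, w in vec_m.items())
    norm_m = math.sqrt(sum(w * w for w in vec_m.values()))
    norm_e = math.sqrt(sum(w * w for w in vec_e.values()))
    if norm_m == 0.0 or norm_e == 0.0:
        return 0.0
    return dot / (norm_m * norm_e)
```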
s54: obtaining the commonness feature and the IWHR feature of the entity mention m relative to the candidate entity e from the results of S2 and S4;
s55: for each candidate entity e, calculating the similarity between the entity mention and the candidate entity according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
wherein a, b and c are constants;
s56: calculating the final entity linking result e_result:
e_result = argmax_e similarity(e, m).
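Steps S55–S56 combine the three features log-linearly and take the arg max over candidates. A sketch with the weight defaults taken from claim 3 (a = 1.0, b = 6.0, c = 1.0); the `eps` guard against log(0) is an added safeguard, not part of the claim:

```python
import math

def link(features, a=1.0, b=6.0, c=1.0, eps=1e-12):
    # features: dict mapping candidate entity -> (commonness, tfidfsimilarity, IWHR)
    def score(f):
        cm, ts, hw = f
        # S55: similarity = a*log(commonness) + b*log(tfidfsimilarity) + c*log(IWHR)
        return (a * math.log(cm + eps)
                + b * math.log(ts + eps)
                + c * math.log(hw + eps))
    # S56: e_result = argmax_e similarity(e, m)
    return max(features, key=lambda e: score(features[e]))
```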
2. The method as claimed in claim 1, wherein the idf threshold T is set to:
Figure FDA0002297900600000032
3. The method as claimed in claim 1, wherein a, b and c are set to 1.0, 6.0 and 1.0, respectively.
CN201810103629.7A 2018-02-01 2018-02-01 Named entity linking method fusing prior information Active CN108363688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810103629.7A CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information


Publications (2)

Publication Number Publication Date
CN108363688A CN108363688A (en) 2018-08-03
CN108363688B true CN108363688B (en) 2020-04-28

Family

ID=63004109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810103629.7A Active CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information

Country Status (1)

Country Link
CN (1) CN108363688B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866385B (en) * 2018-08-17 2024-04-05 阿里巴巴(中国)有限公司 Method and device for publishing outside of electronic book and readable storage medium
CN109325230B (en) * 2018-09-21 2021-06-15 广西师范大学 Word semantic relevance judging method based on wikipedia bidirectional link
CN110147401A (en) * 2019-05-22 2019-08-20 苏州大学 Merge the knowledge base abstracting method of priori knowledge and context-sensitive degree
CN111814477B (en) * 2020-07-06 2022-06-21 重庆邮电大学 Dispute focus discovery method and device based on dispute focus entity and terminal
CN113392220B (en) * 2020-10-23 2024-03-26 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN113157861B (en) * 2021-04-12 2022-05-24 山东浪潮科学研究院有限公司 Entity alignment method fusing Wikipedia

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2251795A2 (en) * 2009-05-12 2010-11-17 Comcast Interactive Media, LLC Disambiguation and tagging of entities
CN104462126A (en) * 2013-09-22 2015-03-25 富士通株式会社 Entity linkage method and device
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9871702B2 (en) * 2016-02-17 2018-01-16 CENX, Inc. Service information model for managing a telecommunications network



Similar Documents

Publication Publication Date Title
CN108363688B (en) Named entity linking method fusing prior information
CN109508414B (en) Synonym mining method and device
US7346487B2 (en) Method and apparatus for identifying translations
CN107870901B (en) Method, recording medium, apparatus and system for generating similar text from translation source text
CN108681537A (en) Chinese entity linking method based on neural network and word vector
JP6187877B2 (en) Synonym extraction system, method and recording medium
CN108920633B (en) Paper similarity detection method
US20140289238A1 (en) Document creation support apparatus, method and program
CN111930929A (en) Article title generation method and device and computing equipment
CN109033212B (en) Text classification method based on similarity matching
US11537795B2 (en) Document processing device, document processing method, and document processing program
WO2014002774A1 (en) Synonym extraction system, method, and recording medium
CN111276149A (en) Voice recognition method, device, equipment and readable storage medium
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
CN115033773A (en) Chinese text error correction method based on online search assistance
JP6867963B2 (en) Summary Evaluation device, method, program, and storage medium
Sagcan et al. Toponym recognition in social media for estimating the location of events
CN110008312A (en) A kind of document writing assistant implementation method, system and electronic equipment
JP6495124B2 (en) Term semantic code determination device, term semantic code determination model learning device, method, and program
JP2010097239A (en) Dictionary creation device, dictionary creation method, and dictionary creation program
Gaizauskas et al. Extracting bilingual terms from the Web
Khodak et al. Extending and improving wordnet via unsupervised word embeddings
CN109002508B (en) Text information crawling method based on web crawler
CN110083817B (en) Naming disambiguation method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant