CN108363688B - Named entity linking method fusing prior information - Google Patents


Info

Publication number
CN108363688B
Authority
CN
China
Prior art keywords
entity
article
word
candidate
idf
Prior art date
Legal status
Active
Application number
CN201810103629.7A
Other languages
Chinese (zh)
Other versions
CN108363688A (en)
Inventor
汤斯亮
杨希远
陈博
林升
吴飞
庄越挺
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810103629.7A priority Critical patent/CN108363688B/en
Publication of CN108363688A publication Critical patent/CN108363688A/en
Application granted granted Critical
Publication of CN108363688B publication Critical patent/CN108363688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/205 Natural language analysis; Parsing (G06F — Electric digital data processing)
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N5/022 Knowledge engineering; Knowledge acquisition (G06N — Computing arrangements based on specific computational models)


Abstract

The invention discloses a named entity linking method that fuses prior information. The method comprises the following steps: (1) extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump; (2) represent each article in the Wikipedia data dump as a term frequency/inverse document frequency (tf-idf) feature, and extract the commonness feature of each string relative to each candidate entity; (3) query-expand the entity mention and generate candidate entities for it using the string-to-candidate-entity table from step (1); (4) extract features of the article containing the mention, namely the inverse document frequencies of its words and the important word hit rate; (5) using the features extracted in steps (2) and (4), compute the degree of association between the mention and each of its candidate entities, and take the candidate with the highest association as the linking result. The invention breaks through the limitation of scarce training corpora and provides users with reliable entity linking recommendations, with prior information added through the entity commonness feature.

Description

Named entity linking method fusing prior information
Technical Field
The invention relates to natural language processing, and in particular to a named entity linking method that fuses prior information.
Background
Natural Language Processing (NLP) is an interdisciplinary field combining linguistics and computer science. Named Entity Linking (NEL) is a fundamental task in NLP. Its goal is to resolve the ambiguity caused by linguistic phenomena such as aliases, coreference, and polysemy by establishing a correspondence between proper nouns (entity names) appearing in a text and the entities they refer to in a knowledge base. The problem is defined as: given a piece of text and the mentions in it (a mention being a string to be linked), find the entities that the mentions refer to in a specified knowledge base.
Entity linking, the technique of establishing links between text and a knowledge base, plays a very important role in information extraction. Relation Extraction is a typical information extraction task that requires entity linking: its goal is to extract relations between different entities from text, so locating the mentions in the text and finding the entities they correspond to through entity linking is a prerequisite for further analysis.
In addition, entity linking effectively adds information to the original text, so it can also be used in other natural language processing and text mining problems, helping to understand the text more deeply and obtain better results.
Entity linking is typically implemented in several steps, the two most important of which are candidate generation and candidate disambiguation (candidate ranking). The candidate generation step uses the surface string of the current mention to find the entities it may refer to as candidates; the candidate disambiguation step then selects the best candidate as the final linking result, based on the context of the mention and characteristics of the candidate entities themselves.
A common practice for candidate generation is to construct a dictionary in advance that records which entities each name may correspond to; at linking time, candidate entities are looked up in this dictionary by the mention's surface string. The dictionary is typically built from information provided in the knowledge base.
For candidate disambiguation, common practice includes collaborative and non-collaborative approaches. Collaborative methods consider multiple mentions in the same context simultaneously and prefer linking results in which the respective target entities are as strongly associated as possible; non-collaborative methods link each mention individually. The approach used here is non-collaborative: such methods are generally faster than collaborative ones, at some cost in accuracy.
Traditional non-collaborative methods design a series of features, including: surface features, which measure the similarity between the mention string and the candidate entity name, such as the number of words the two share; context features, which measure the semantic match between the candidate entity and the mention's context, such as the tf-idf similarity between the mention's document and the candidate entity's description, or whether words from the candidate entity's Wikipedia page title appear in the mention's document; and other features, such as the number of country names co-occurring in the mention's document and the candidate entity's description, or in the mention's document and all aliases of the candidate entity.
Since such heuristic features require expert knowledge, the original feature engineering becomes ineffective once the knowledge base or corpus changes; we therefore aim to achieve good results with as few features as possible.
Disclosure of Invention
The invention aims to provide a named entity linking method that fuses prior information, in order to link entities recognized in natural text to a target knowledge base (Freebase) and thereby provide a basis for subsequent tasks such as information extraction.
To this end, the IWHR (Important Word Hit Rate) feature is introduced and combined with the commonness and tf-idf features to judge the degree of match between a mention and an entity; surface-name matching is ensured by the candidate generation process, and the commonness feature injects prior information into the method.
Because the parameters of a typical named entity linking model must be determined from training corpora, which are difficult to obtain, the proposed method combines the three features without any training and gives recommended parameter settings.
In addition, to compensate for the fact that non-collaborative methods do not consider context, a query expansion step is added before entity linking, specifically optimized for person and place names that may refer to the same entity within the same article.
tf-idf (term frequency/inverse document frequency) is a standard measure of the similarity between articles; here it is used to measure the similarity between the mention's context and the candidate entity's context.
Commonness reflects the probability of a candidate entity given a mention; introducing this feature amounts to adding prior information, which can be used to judge the mention's referent when the context is insufficient. The calculation is as follows: let A_{s,e} be the set of anchor texts whose surface string is s and that link to the page corresponding to entity e, and let A_s be the set of anchor texts whose surface string is s. Then:
commonness(e, s) = |A_{s,e}| / |A_s|   (1)
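As a minimal sketch of formula (1), assuming anchor-text counts have already been collected from the Wikipedia dump (the `anchor_counts` table and its contents below are hypothetical):

```python
from collections import Counter, defaultdict

# Hypothetical anchor statistics from a Wikipedia dump:
# anchor_counts[s][e] = number of anchor texts whose surface string is s
# and whose link target is entity e, i.e. |A_{s,e}|.
anchor_counts = defaultdict(Counter)
anchor_counts["Clinton"]["Hillary_Clinton"] = 60
anchor_counts["Clinton"]["Bill_Clinton"] = 40

def commonness(entity, surface):
    """commonness(e, s) = |A_{s,e}| / |A_s|."""
    counts = anchor_counts[surface]
    total = sum(counts.values())          # |A_s|
    return counts[entity] / total if total else 0.0

print(commonness("Hillary_Clinton", "Clinton"))  # 0.6
```

With these counts, the string "Clinton" links to Hillary_Clinton with prior probability 0.6, which is exactly what the disambiguation step falls back on when context is weak.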
IWHR remedies a deficiency of tf-idf by paying more attention to the important words appearing in the article contexts. The calculation is as follows: let e be a candidate Wikipedia entity, m the string to be linked, W_d the set of words of the article containing m, and W_e the set of words of the article of e; then IWHR(e, m) is computed according to equation (2), where T is a manually set idf threshold:
IWHR(e, m) = |{w ∈ W_d ∩ W_e : idf(w) > T}| / |{w ∈ W_d : idf(w) > T}|   (2)
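A minimal sketch of the hit-rate computation in equation (2), assuming word lists and idf values are already available; the toy idf table and articles below are illustrative only:

```python
def iwhr(mention_words, entity_words, idf, threshold):
    """Important word hit rate: of the words in the mention's article
    whose idf exceeds the threshold, the fraction that also appear in
    the candidate entity's article."""
    important = {w for w in mention_words if idf.get(w, 0.0) > threshold}
    if not important:
        return 0.0
    return len(important & set(entity_words)) / len(important)

# Toy example: "turkey" and "ankara" are important (high idf), but only
# "turkey" also occurs in the entity's article, so the hit rate is 1/2.
idf = {"turkey": 5.1, "ankara": 6.3, "visit": 1.9, "the": 0.1}
mention_article = ["the", "visit", "turkey", "ankara"]
entity_article = ["turkey", "republic", "anatolia"]
print(iwhr(mention_article, entity_article, idf, threshold=3.0))  # 0.5
```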
The invention is realized by the following technical scheme:
the named entity linking method fusing prior information comprises the following steps:
s1: extracting character strings, namely a candidate entity table, a human name list and a place name list from the Wikipedia data dump and the Freebase data dump;
s2: representing each article in the Wikipedia data dump as a word frequency-inverse document frequency tf-idf characteristic, extracting and storing a commonality characteristic of each character string relative to a candidate entity;
s3: inquiring and expanding the name list and the place name list obtained in the S1, and generating candidate entities for entity mentions by using the character string-candidate entity table obtained in the S1;
s4: calculating the important word collision rate IWHR of the entity relative to the candidate entity;
s5: and calculating the association degrees of the entity and each candidate entity thereof according to the tf-idf feature, common feature and IWHR feature extracted in the S2 and the S4, and taking the highest association degree as an entity link result.
The steps can be realized in the following way:
S1 specifically comprises the following steps:
S11: parse the Wikipedia data dump and extract, for each entity in Wikipedia, its article D_e, the anchor texts A_e in articles, the entity number W_id corresponding to each article, and the redirect and disambiguation pages, and from these generate the string-to-candidate-entity table str2entity;
S12: extract all person names and place names in the Freebase data dump to form a person-name list Pname and a place-name list Plocation.
S2 specifically comprises the following steps:
S21: segment each Wikipedia article into words with the natural language processing tool Stanford CoreNLP, and remove stop words using a stop-word dictionary to obtain a word list;
S22: based on the word list, compute the inverse document frequency idf of every word in each article; the idf of a word is:
idf(word) = log( N / df(word) )
where N, the number of documents in the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the word list, compute the term frequency tf of every word in each article; the tf of a word is:
tf(word) = count(word, document) / |document|
i.e. the number of occurrences of the word in the article divided by the article's total word count;
S24: from the results of S22 and S23, compute the tf-idf value of every word in each Wikipedia article:
tfidf_word(word) = tf(word) × idf(word)
S25: based on the tfidf_word(word) values obtained in S24, keep the tf-idf values of the top 20 words, in descending order, as the article's tf-idf feature, denoted tfidf(document);
S26: compute the commonness feature of each string relative to each candidate entity:
commonness(e, m) = |A_{m,e}| / |A_m|
where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and that link to entity e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: store the computed tf-idf feature of each article and the commonness feature of each string relative to each candidate entity.
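Steps S21-S25 can be sketched as follows, with a toy corpus standing in for the Wikipedia dump; tokenisation and stop-word removal (Stanford CoreNLP in the method) are assumed to have happened already:

```python
import math
from collections import Counter

def tfidf_features(doc, corpus, top_k=20):
    """tf(w) = count(w)/|doc|, idf(w) = log(N/df(w)); return the top_k
    (word, tf-idf) pairs in descending order, as in S25."""
    n_docs = len(corpus)
    df = Counter()
    for d in corpus:
        df.update(set(d))                 # document frequency of each word
    counts = Counter(doc)
    scores = {w: (c / len(doc)) * math.log(n_docs / df[w])
              for w, c in counts.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

corpus = [["entity", "linking", "wikipedia"],
          ["entity", "graph"],
          ["music", "album"]]
print(tfidf_features(corpus[0], corpus, top_k=2))
```

Words appearing in every article get idf 0 and drop to the bottom of the ranking, so the retained top-20 words are the article's most distinctive terms.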
S3 specifically comprises the following steps:
S31: look up the mention string s in Pname and Plocation; if it is in either list, go to step S32, otherwise go to step S33;
S32: check whether a string s' exists in the article containing s such that s is a substring of s'; if so, replace s with s' and then go to S33, otherwise go directly to S33;
S33: query str2entity with the string s to obtain all of its candidate entities.
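Steps S31-S33 can be sketched as below; the string-to-entity table and name lists are hypothetical stand-ins for the resources extracted in S1, and picking the longest matching expansion is an assumption (the method only requires that s be a substring of s'):

```python
def generate_candidates(mention, article_text, str2entity,
                        person_names, place_names):
    """S31-S33: if the mention is a known person or place name, try to
    expand it to a longer surface string s' that occurs in the same
    article and contains the mention as a substring, then look up
    candidates in the string-to-entity table."""
    s = mention
    if s in person_names or s in place_names:
        longer = [cand for cand in str2entity
                  if s in cand and cand != s and cand in article_text]
        if longer:
            s = max(longer, key=len)      # prefer the longest expansion
    return s, str2entity.get(s, [])

# Hypothetical resources:
str2entity = {"Clinton": ["Bill_Clinton", "Hillary_Clinton"],
              "Hilary Clinton": ["Hillary_Clinton"]}
article = "Hilary Clinton gave a speech. Clinton said ..."
s, cands = generate_candidates("Clinton", article, str2entity,
                               person_names={"Clinton"}, place_names=set())
print(s, cands)  # Hilary Clinton ['Hillary_Clinton']
```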
S4 specifically comprises the following steps:
S41: segment the article containing the entity mention into words with Stanford CoreNLP and remove stop words to obtain a word list;
S42: obtain the idf value idf(w) of every word in the mention's article using the idf formula of S22;
S43: compute the important word hit rate IWHR(e, m) according to equation (2):
IWHR(e, m) = |{w ∈ W_d ∩ W_e : idf(w) > T}| / |{w ∈ W_d : idf(w) > T}|
where e is a candidate entity, m is the entity mention, W_d is the set of words of the article containing m, W_e is the set of words of the article of e, and T is a manually set idf threshold.
In S5, the degree of association between the mention and each of its candidate entities is computed, and the candidate with the highest association is taken as the linking result; the specific steps are as follows:
S51: extract the article d_m containing the mention m and the article d_e of each candidate entity e;
S52: retrieve the tf-idf features of d_m and d_e from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate as the cosine similarity of the two feature vectors:
tfidfsimilarity(e, m) = ( tfidf(d_m) · tfidf(d_e) ) / ( |tfidf(d_m)| × |tfidf(d_e)| )
where |·| denotes the norm of a vector;
S54: retrieve the commonness and IWHR features of the mention m relative to the candidate entity e from the results of S2 and S4;
S55: for each candidate entity e, compute the similarity between the mention and the candidate according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
where a, b, and c are constants;
S56: compute the final linking result e_result:
e_result = argmax_e similarity(e, m)
The constants a, b, and c can be learned with a neural network or debugged and set manually; the recommended values are 1.0, 6.0, and 1.0 respectively.
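Steps S53-S56 can be sketched as a log-linear scoring function; the feature values below are invented, and the small eps guard against log(0) is an assumption, since the method does not say how zero-valued features are handled:

```python
import math

def cosine(u, v):
    """tf-idf similarity of two sparse {word: weight} vectors (S53)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link(candidates, features, a=1.0, b=6.0, c=1.0, eps=1e-12):
    """S55-S56: score each candidate and return the argmax."""
    def score(e):
        cm, tfidf_sim, iw = features[e]   # commonness, tf-idf sim, IWHR
        return (a * math.log(cm + eps)
                + b * math.log(tfidf_sim + eps)
                + c * math.log(iw + eps))
    return max(candidates, key=score)

# Illustrative feature values for two competing senses of one mention:
features = {"Turkey_(country)": (0.70, 0.30, 0.40),
            "Turkey_(bird)":    (0.25, 0.05, 0.10)}
print(link(list(features), features))  # Turkey_(country)
```

With the recommended weights, the tf-idf similarity dominates (b = 6.0), while commonness and IWHR act as prior and important-word corrections.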
The invention uses only three features, namely entity commonness, term frequency/inverse document frequency, and important word hit rate; this breaks through the limitation of scarce training corpora and provides users with reliable entity linking recommendations, with the prior information carried by the commonness feature.
Drawings
FIG. 1 is a workflow diagram of resource extraction from the Wikipedia data dump and the Freebase data dump;
FIG. 2 is a workflow diagram of the main steps of the named entity linking method fusing prior information.
Detailed Description
The invention is further described below with reference to the figures and a detailed embodiment.
For the named entity linking task, the invention combines the three features commonness, tf-idf, and IWHR to realize a named entity linking method fusing prior information. Because few features are used, few parameters ultimately need fitting, which makes the method easier to migrate to a new knowledge base or corpus.
As shown in fig. 1 and 2, the named entity linking method fusing prior information comprises the following steps:
S1: extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump. This is implemented as follows:
S11: parse the Wikipedia data dump and extract, for each entity in Wikipedia, its article D_e, the anchor texts A_e in articles, the entity number W_id corresponding to each article, and the redirect and disambiguation pages, and from these generate the string-to-candidate-entity table str2entity;
S12: extract all person names and place names in the Freebase data dump to form a person-name list Pname and a place-name list Plocation.
S2: represent each article in the Wikipedia data dump as a term frequency-inverse document frequency (tf-idf) feature, and extract and store the commonness feature of each string relative to each candidate entity. This is implemented as follows:
S21: segment each Wikipedia article into words with the natural language processing tool Stanford CoreNLP, and remove stop words using a stop-word dictionary to obtain a word list;
S22: based on the word list, compute the inverse document frequency idf of every word in each article; the idf of a word is:
idf(word) = log( N / df(word) )
where N, the number of documents in the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the word list, compute the term frequency tf of every word in each article; the tf of a word is:
tf(word) = count(word, document) / |document|
i.e. the number of occurrences of the word in the article divided by the article's total word count;
S24: from the results of S22 and S23, compute the tf-idf value of every word in each Wikipedia article:
tfidf_word(word) = tf(word) × idf(word)
S25: based on the tfidf_word(word) values obtained in S24, keep the tf-idf values of the top 20 words, in descending order, as the article's tf-idf feature, denoted tfidf(document);
S26: compute the commonness feature of each string relative to each candidate entity:
commonness(e, m) = |A_{m,e}| / |A_m|
where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and that link to entity e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: store the computed tf-idf feature of each article and the commonness feature of each string relative to each candidate entity.
S3: query-expand the entity mention against the person-name and place-name lists obtained in S1, and generate candidate entities for the mention using the string-to-candidate-entity table obtained in S1. This is implemented as follows:
S31: look up the mention string s in Pname and Plocation; if it is in either list, go to step S32, otherwise go to step S33;
S32: check whether a string s' exists in the article containing s such that s is a substring of s'; if so, replace s with s' and then go to S33, otherwise go directly to S33;
S33: query str2entity with the string s to obtain all of its candidate entities.
S4: compute the important word hit rate IWHR of the entity mention relative to each candidate entity. This is implemented as follows:
S41: segment the article containing the entity mention into words with Stanford CoreNLP and remove stop words to obtain a word list;
S42: obtain the idf value idf(w) of every word in the mention's article using the idf formula of S22;
S43: compute the important word hit rate IWHR(e, m) by the following formula:
IWHR(e, m) = |{w ∈ W_d ∩ W_e : idf(w) > T}| / |{w ∈ W_d : idf(w) > T}|
where e is a candidate entity, m is the entity mention, W_d is the set of words of the article containing m, W_e is the set of words of the article of e, and T is a manually set idf threshold.
S5: compute the degree of association between the mention and each of its candidate entities from the tf-idf, commonness, and IWHR features extracted in S2 and S4, and take the candidate with the highest association as the linking result. This is implemented as follows:
S51: extract the article d_m containing the mention m and the article d_e of each candidate entity e;
S52: retrieve the tf-idf features of d_m and d_e from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate as the cosine similarity of the two feature vectors:
tfidfsimilarity(e, m) = ( tfidf(d_m) · tfidf(d_e) ) / ( |tfidf(d_m)| × |tfidf(d_e)| )
where |·| denotes the norm of a vector;
S54: retrieve the commonness and IWHR features of the mention m relative to the candidate entity e from the results of S2 and S4;
S55: for each candidate entity e, compute the similarity between the mention and the candidate according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
where a, b, and c are constants;
S56: compute the final linking result e_result:
e_result = argmax_e similarity(e, m)
Here a, b, and c can be set manually; the recommended values are 1.0, 6.0, and 1.0 respectively.
The method is applied to the following example in order that those skilled in the art may better understand its specific implementation.
Examples
Taking the documents of the entity discovery and linking subtask of the 2017 Text Analysis Conference (TAC) as an example, the method is applied to text named entity linking (the resource extraction process, which is more involved, is not detailed here). The specific parameters and methods of each step are as follows:
1. obtain the mentions to be linked on the original document set with a named entity recognition tool or by manual annotation; a mention is given concretely as a triple of the article, the position of the first character, and the position of the last character of the given string;
2. write a script to extract all content from the document set (removing xml tags), with each article as one file;
3. segment each article into words with the natural language processing tool Stanford CoreNLP, remove stop words, and count the total word count of each article;
4. for each article, count the occurrences of each word and compute each word's tf value by the following formula:
tf(word) = count(word, document) / |document|
5. compute the tf-idf value of each word in each article from the word list and idf values obtained from the Wikipedia statistics and the tf values computed above, by the following formula:
tfidf_word(word) = tf(word) × idf(word)
6. in each article, take the 20 words with the highest tf-idf values, in descending order, together with those values, as the article's tf-idf feature;
7. decide for each mention identified in step 1 whether query expansion applies, and if so, perform it. The judgment is as follows: if the mention appears in the person-name list or the place-name list, it is considered a person or place name. The expansion is as follows: determine whether a string s' occurs before s in the article containing s, where s is an abbreviation of s' or s is a part of s' (for example, s' is Hilary Clinton and s is Clinton); if such an s' exists, s is replaced by s';
8. for each mention, query the extracted string-candidate entity-commonness table to obtain the string's candidate entities and the corresponding commonness features;
9. for each mention, compute the tf-idf similarity between the mention's article and each corresponding candidate entity's article by the following formula:
tfidfsimilarity(e, m) = ( tfidf(d_m) · tfidf(d_e) ) / ( |tfidf(d_m)| × |tfidf(d_e)| )
10. for each mention, compute the IWHR similarity between the mention's article and each corresponding candidate entity by the following formula: let e be a candidate Wikipedia entity, m the string to be linked, W_d the word set of the article containing m, and W_e the word set of the article of e; then IWHR(e, m) is computed as in equation (2), with T a manually set threshold:
IWHR(e, m) = |{w ∈ W_d ∩ W_e : idf(w) > T}| / |{w ∈ W_d : idf(w) > T}|   (2)
11. for each mention and each of its candidate entities, compute the mention-entity relevance by the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))   (5)
with (a, b, c) set to (1.0, 6.0, 1.0);
12. take the e that maximizes the above formula as the linking result for m, i.e.:
e_result = argmax_e similarity(e, m)   (6)
The following table shows part of the final linking results for selected documents.
Mention Beg End Freebase id
Turkey 2279 2284 m.01znc_
Microsoft 2620 2628 m.04sv4
Nam Dinh 3703 3710 m.07m1dj
the Beatles 2078 2088 m.07c0j
Gaisano mall 2642 2653 m.09rxbx2
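The result rows above pair each mention triple from step 1 with a Freebase machine id. A minimal sketch of that record layout follows; the inclusive end offset is inferred from the row lengths (e.g. Turkey spans 2279-2284, six characters), and the padded stand-in document is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class LinkedMention:
    word: str      # surface string of the mention
    beg: int       # position of the mention's first character in the document
    end: int       # position of the mention's last character (inclusive)
    kb_id: str     # Freebase machine id of the linked entity

# Rows taken from the result table above:
results = [
    LinkedMention("Turkey", 2279, 2284, "m.01znc_"),
    LinkedMention("Microsoft", 2620, 2628, "m.04sv4"),
    LinkedMention("the Beatles", 2078, 2088, "m.07c0j"),
]

def surface(doc, m):
    """Recover a mention's surface string from the document by its offsets."""
    return doc[m.beg:m.end + 1]

# A padded stand-in document that places "Turkey" at offset 2279:
doc = " " * 2279 + "Turkey"
print(surface(doc, results[0]))  # Turkey
```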

Claims (3)

1. A named entity linking method fusing prior information, characterized by comprising the following steps:
S1: extracting a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump;
S2: representing each article in the Wikipedia data dump as a term frequency-inverse document frequency (tf-idf) feature, and extracting and storing the commonness feature of each string relative to each candidate entity;
S3: query-expanding the entity mention against the person-name and place-name lists obtained in S1, and generating candidate entities for the mention using the string-to-candidate-entity table obtained in S1;
S4: computing the important word hit rate IWHR of the entity mention relative to each candidate entity;
S5: computing the degree of association between the mention and each of its candidate entities from the tf-idf, commonness, and IWHR features extracted in S2 and S4, and taking the candidate with the highest association as the linking result;
S1 specifically comprises the following steps:
S11: parsing the Wikipedia data dump and extracting, for each entity in Wikipedia, its article D_e, the anchor texts A_e in articles, the entity number W_id corresponding to each article, and the redirect and disambiguation pages, and from these generating the string-to-candidate-entity table str2entity;
S12: extracting all person names and place names in the Freebase data dump to form a person-name list Pname and a place-name list Plocation;
S2 specifically comprises the following steps:
S21: segmenting each Wikipedia article into words with the natural language processing tool Stanford CoreNLP, and removing stop words using a stop-word dictionary to obtain a word list;
S22: based on the word list, computing the inverse document frequency idf of every word in each article; the idf of a word is:
idf(word) = log( N / df(word) )
where N, the number of documents in the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the word list, computing the term frequency tf of every word in each article; the tf of a word is:
tf(word) = count(word, document) / |document|
i.e. the number of occurrences of the word in the article divided by the article's total word count;
S24: from the results of S22 and S23, computing the tf-idf value of every word in each Wikipedia article:
tfidf_word(word) = tf(word) × idf(word)
S25: based on the tfidf_word(word) values obtained in S24, keeping the tf-idf values of the top 20 words, in descending order, as the article's tf-idf feature, denoted tfidf(document);
S26: computing the commonness feature of each string relative to each candidate entity:
commonness(e, m) = |A_{m,e}| / |A_m|
where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and that link to entity e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: storing the computed tf-idf feature of each article and the commonness feature of each string relative to each candidate entity;
s3 specifically includes the following steps:
s31: inquiring Pname and Plocation according to the corresponding character string S mentioned by the entity, if the character string S is in any table, then turning to step S32, otherwise, turning to step S33;
s32: checking whether a character string S ' exists in the text of S, wherein S is the character sub-string of S ', if so, replacing S with S ' and then switching to S33, and if not, directly switching to S33;
s33: querying str2 entry by using the character string s to obtain all candidate entities of the character string s;
s4 specifically includes the following steps:
s41: using a StanfoldCoreNLP tool to perform word segmentation on the article where the entity is mentioned, and removing stop words to obtain a word list;
s42: obtaining the idf value idf(w) of each word in the article containing the entity mention according to the idf formula in S22;
s43: calculating the important word hit rate IWHR(e, m) according to the following formula:
IWHR(e, m) = |{w : w ∈ W_d ∩ W_e, idf(w) > T}| / |{w : w ∈ W_d, idf(w) > T}|
where e is a candidate entity, m is an entity mention, W_d is the word set of the article where m is located, W_e is the word set of the article where e is located, and T is the configured idf threshold;
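One plausible reading of IWHR — the share of the mention article's high-idf ("important") words that also appear in the candidate entity's article — can be sketched as follows (names hypothetical):

```python
def iwhr(words_d, words_e, idf_map, threshold):
    # "important" words are those whose idf exceeds the threshold T
    important_d = {w for w in words_d if idf_map.get(w, 0.0) > threshold}
    if not important_d:
        return 0.0
    important_e = {w for w in words_e if idf_map.get(w, 0.0) > threshold}
    # fraction of the mention article's important words hit by the candidate article
    return len(important_d & important_e) / len(important_d)
```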
in S5, the specific steps of calculating the association degree between the entity mention and each of its candidate entities, and taking the candidate with the highest association degree as the entity linking result, are as follows:
s51: extracting the article d_m containing the entity mention m and the article d_e containing the candidate entity e;
s52: obtaining the tf-idf features of articles d_m and d_e from the results stored in S2;
s53: for each candidate entity e, computing the tf-idf similarity between the entity mention and the candidate entity as the cosine of their tf-idf feature vectors:
tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (|tfidf(d_m)| × |tfidf(d_e)|)
where |·| denotes the modulus of a vector;
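The cosine similarity of S53, applied to sparse tf-idf features stored as word→weight dictionaries (a sketch; the patent does not prescribe a storage format):

```python
import math

def tfidf_similarity(vec_m, vec_e):
    # cosine similarity between two sparse tf-idf vectors (dicts: word -> weight)
    dot = sum(w * vec_e.get(k, 0.0) for k, w in vec_m.items())
    norm_m = math.sqrt(sum(w * w for w in vec_m.values()))
    norm_e = math.sqrt(sum(w * w for w in vec_e.values()))
    if norm_m == 0.0 or norm_e == 0.0:
        return 0.0
    return dot / (norm_m * norm_e)
```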
s54: obtaining the commonness feature and the IWHR feature of the entity mention m relative to the candidate entity e from the results of S2 and S4;
s55: for each candidate entity e, calculating the similarity between the entity mention and the candidate entity according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
wherein a, b and c are constants;
s56: calculating the final entity linking result e_result:
e_result = argmax_e similarity(e, m).
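Steps S55–S56 combine the three features log-linearly and take the arg max over candidates. A sketch with the weight defaults taken from claim 3 (a = 1.0, b = 6.0, c = 1.0); the `eps` guard against log(0) is an added safeguard, not part of the claim:

```python
import math

def link(features, a=1.0, b=6.0, c=1.0, eps=1e-12):
    # features: dict mapping candidate entity -> (commonness, tfidfsimilarity, IWHR)
    def score(f):
        cm, ts, hw = f
        # S55: similarity = a*log(commonness) + b*log(tfidfsimilarity) + c*log(IWHR)
        return (a * math.log(cm + eps)
                + b * math.log(ts + eps)
                + c * math.log(hw + eps))
    # S56: e_result = argmax_e similarity(e, m)
    return max(features, key=lambda e: score(features[e]))
```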
2. The method as claimed in claim 1, wherein the idf threshold T is set to:
Figure FDA0002297900600000032
3. The method as claimed in claim 1, wherein a, b and c are set to 1.0, 6.0 and 1.0, respectively.
CN201810103629.7A 2018-02-01 2018-02-01 Named entity linking method fusing prior information Active CN108363688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810103629.7A CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information


Publications (2)

Publication Number Publication Date
CN108363688A CN108363688A (en) 2018-08-03
CN108363688B true CN108363688B (en) 2020-04-28

Family

ID=63004109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810103629.7A Active CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information

Country Status (1)

Country Link
CN (1) CN108363688B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866385B (en) * 2018-08-17 2024-04-05 阿里巴巴(中国)有限公司 Method and device for publishing outside of electronic book and readable storage medium
CN109325230B (en) * 2018-09-21 2021-06-15 广西师范大学 Word semantic relevance judging method based on wikipedia bidirectional link
CN110147401A (en) * 2019-05-22 2019-08-20 苏州大学 Merge the knowledge base abstracting method of priori knowledge and context-sensitive degree
CN111814477B (en) * 2020-07-06 2022-06-21 重庆邮电大学 Dispute focus discovery method and device based on dispute focus entity and terminal
CN113392220B (en) * 2020-10-23 2024-03-26 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN113157861B (en) * 2021-04-12 2022-05-24 山东浪潮科学研究院有限公司 Entity alignment method fusing Wikipedia

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2251795A2 (en) * 2009-05-12 2010-11-17 Comcast Interactive Media, LLC Disambiguation and tagging of entities
CN104462126A (en) * 2013-09-22 2015-03-25 富士通株式会社 Entity linkage method and device
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9871702B2 (en) * 2016-02-17 2018-01-16 CENX, Inc. Service information model for managing a telecommunications network



Similar Documents

Publication Publication Date Title
CN108363688B (en) Named entity linking method fusing prior information
CN109508414B (en) Synonym mining method and device
US7346487B2 (en) Method and apparatus for identifying translations
CN107870901B (en) Method, recording medium, apparatus and system for generating similar text from translation source text
CN108681537A (en) Chinese entity linking method based on neural network and word vector
JP6187877B2 (en) Synonym extraction system, method and recording medium
CN108920633B (en) Paper similarity detection method
US20140289238A1 (en) Document creation support apparatus, method and program
CN111930929A (en) Article title generation method and device and computing equipment
CN109033212B (en) Text classification method based on similarity matching
US11537795B2 (en) Document processing device, document processing method, and document processing program
WO2014002774A1 (en) Synonym extraction system, method, and recording medium
CN111276149A (en) Voice recognition method, device, equipment and readable storage medium
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
CN115033773A (en) Chinese text error correction method based on online search assistance
JP6867963B2 (en) Summary Evaluation device, method, program, and storage medium
Sagcan et al. Toponym recognition in social media for estimating the location of events
CN110008312A (en) A kind of document writing assistant implementation method, system and electronic equipment
JP6495124B2 (en) Term semantic code determination device, term semantic code determination model learning device, method, and program
JP2010097239A (en) Dictionary creation device, dictionary creation method, and dictionary creation program
Gaizauskas et al. Extracting bilingual terms from the Web
Khodak et al. Extending and improving wordnet via unsupervised word embeddings
CN109002508B (en) Text information crawling method based on web crawler
CN110083817B (en) Naming disambiguation method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant