CN108363688A - Named entity linking method fusing prior information - Google Patents

Named entity linking method fusing prior information

Info

Publication number
CN108363688A
Authority
CN
China
Prior art keywords
entity
article
candidate
idf
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810103629.7A
Other languages
Chinese (zh)
Other versions
CN108363688B (en)
Inventor
汤斯亮
杨希远
陈博
林升
吴飞
庄越挺
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810103629.7A (patent CN108363688B)
Publication of CN108363688A
Application granted
Publication of CN108363688B
Legal status: Active
Anticipated expiration: not listed

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition


Abstract

The invention discloses a named entity linking method that fuses prior information. The method comprises the following steps: (1) extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps; (2) represent every article in the Wikipedia data dump by its term frequency-inverse document frequency (tf-idf) features, and extract the commonness feature of each string relative to each candidate entity; (3) perform query expansion on each entity mention, and use the string-to-candidate-entity table from (1) to generate candidate entities for the mention; (4) extract the features of the article containing the mention, namely the inverse document frequencies of its words and the important word hit rate; (5) using the features extracted in (2) and (4), compute the degree of correlation between each mention and each of its candidate entities, and take the most correlated candidate as the entity linking result. The invention overcomes the limitation of scarce training corpora and provides users with reliable entity-linking recommendations; the entity commonness feature contributes the prior information.

Description

Named entity linking method fusing prior information
Technical field
The present invention relates to natural language processing, and more particularly to a named entity linking method that fuses prior information.
Background technology
Natural language processing (NLP) is an interdisciplinary field combining linguistics and computer science. Named entity linking (NEL) is a basic task in natural language processing. It aims to eliminate the ambiguity caused by linguistic phenomena such as aliases, anaphora, and polysemy by establishing the correspondence between the proper nouns (entity names) occurring in a text and the entities they refer to in a knowledge base. The problem is defined as follows: given a piece of text and the mentions within it (a mention being a string to be linked), find in a specified knowledge base the entities that these mentions refer to.
Entity linking, a technology that establishes links between text and a knowledge base, plays an important role in information extraction. Relation extraction is one example of an information extraction technique that requires entity linking: its purpose is to extract the relations between different entities from text, and locating the entity mentions in the text and finding their corresponding entities via entity linking is a prerequisite for that further analysis.
In addition, entity linking effectively enriches the original text with extra information, so it can also be used in other natural language processing and text mining problems, helping systems understand the text more fully and achieve better results.
Entity linking is generally implemented in several steps, of which the two most important are candidate generation and candidate ranking. The candidate generation step uses the name of the current mention to find out which entities it may refer to, producing the candidates; the candidate ranking step then uses the mention's context and features of the candidate entities themselves to select the best candidate as the final linking result.
The common practice for candidate generation is to build a dictionary in advance that records which entities each name may correspond to; when entity linking is executed, the candidate entities can be looked up in the dictionary using the name of the current mention. This dictionary is generally built from information provided by the knowledge base.
For the candidate ranking step, common practice includes collective and non-collective methods. A collective method considers multiple mentions in the same context simultaneously when linking, and tries to make the correlation among their target entities in the result as large as possible. A non-collective method considers each mention individually. The method we use is non-collective: non-collective methods are generally faster than collective ones, at the cost of slightly lower accuracy.
A traditional non-collective method can use a series of hand-designed features, including: surface features, which measure the similarity between the mention's name string and the candidate entity's name, such as the number of words the two share; context features, which measure how well the candidate entity matches the mention's context semantically, such as the tf-idf similarity between the mention's document and the candidate entity's description, or whether all words of the candidate entity's Wikipedia page title occur in the mention's document; and other features, such as the number of country names appearing both in the mention's document and in the candidate entity's description, or the number of country names appearing both among all the candidate entity's aliases and in the mention's document.
Because such heuristic features demand considerable expert knowledge, and the original feature engineering becomes invalid once the knowledge base or corpus changes, we aim to obtain good results with as few features as possible.
Summary of the invention
The purpose of the invention is to link the entities identified in natural text to a target knowledge base (Freebase), providing a basis for follow-up work such as information extraction. To that end, a named entity linking method that fuses prior information is proposed.
We therefore designed the IWHR (Important Word Hit Rate) feature, and combine it with two other features, commonness and tf-idf, to judge how well an entity matches a mention. The traditional surface features are instead guaranteed by the candidate generation process, and the commonness feature is how our method adds prior information.
Furthermore, because common named entity linking models need training corpora to determine their parameters, and such corpora are extremely difficult to obtain, our method combines the three features in a training-free manner, while giving suggested parameter settings.
In addition, to compensate for the fact that a non-collective method does not consider the other mentions in the context, we add query expansion before entity linking, with a special optimisation for person and place names in the same article that may refer to the same entity.
Tf-idf (term frequency-inverse document frequency) is commonly used to measure the similarity between articles; here we introduce it to measure the similarity between the mention's context and the candidate entity's context.
Commonness reflects the probability that a candidate entity is the one referred to; introducing this feature is equivalent to adding prior information, and it can guide the decision when the mention's context is insufficient. Its calculation formula is as follows:
Let A_{s,e} be the set of anchor texts whose surface string is s and which link to the page of entity e, and A_s the set of anchor texts whose surface string is s. Then:

commonness(e, s) = |A_{s,e}| / |A_s|    (1)
IWHR compensates for a weakness of tf-idf by focusing on the important words that occur in the contexts. Its calculation formula is as follows:
Let e be a candidate Wikipedia entity, m the string to be linked, W_d the word set of the article containing m, and W_e the word set of e's page. Then IWHR(e, m) is calculated according to formula (2), where T is a manually set idf threshold:

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|    (2)
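As a sketch of formula (2) in code (the function name, variable names, and toy idf values below are our own inventions for illustration, not from the patent):

```python
def iwhr(entity_words, mention_article_words, idf, threshold):
    """Important Word Hit Rate (formula (2)): among the 'important'
    words of the mention's article (idf above threshold), the fraction
    that also occur on the candidate entity's page."""
    important = {w for w in mention_article_words if idf.get(w, 0.0) > threshold}
    if not important:
        return 0.0
    return len(important & set(entity_words)) / len(important)

# Toy example; the idf values are invented for illustration.
idf = {"turkey": 3.0, "ankara": 4.0, "the": 0.1, "bird": 2.5}
score = iwhr({"turkey", "ankara", "istanbul"},
             {"turkey", "ankara", "the", "bird"}, idf, 2.0)
# important words: {turkey, ankara, bird}; hits: {turkey, ankara} -> 2/3
```

Note that a mention whose article contains no word with idf above T gets a hit rate of zero, which the log-linear scoring below then penalises heavily.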
The present invention is realised through the following technical solution.
The named entity linking method fusing prior information includes the following steps:
S1: extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps;
S2: represent every article in the Wikipedia data dump by its term frequency-inverse document frequency (tf-idf) features, and extract and store the commonness feature of each string relative to each candidate entity;
S3: perform query expansion using the person-name and place-name lists obtained in S1, then use the string-to-candidate-entity table from S1 to generate candidate entities for the entity mention;
S4: compute the important word hit rate IWHR of the entity mention relative to each candidate entity;
S5: using the tf-idf, commonness, and IWHR features extracted in S2 and S4, compute the degree of correlation between the entity mention and each of its candidate entities, and take the most correlated candidate as the entity linking result.
Each of the above steps can be implemented as follows.
S1 specifically comprises the following steps:
S11: parse the Wikipedia data dump; extract the articles D_e describing entities, the anchor texts A_e in each article, each article's entity id W_id, the redirect pages Repages, and the disambiguation pages dispages; from these, generate the string-to-candidate-entity table str2entity;
S12: extract all person and place names from the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
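Building str2entity from anchor texts (S11) might be sketched as follows; the anchor pairs are invented toy data, and the nested-counter layout is our own assumption (the counts also serve later as the numerator of commonness):

```python
from collections import defaultdict

# Hypothetical (surface string, linked entity) pairs, as they would be
# harvested from the Wikipedia dump's article markup, together with
# redirect and disambiguation pages.
anchors = [
    ("Clinton", "Hillary_Clinton"),
    ("Clinton", "Bill_Clinton"),
    ("Clinton", "Hillary_Clinton"),
    ("Big Apple", "New_York_City"),
]

# str2entity maps each surface string to its candidate entities, with
# per-entity anchor counts.
str2entity = defaultdict(lambda: defaultdict(int))
for surface, entity in anchors:
    str2entity[surface][entity] += 1

# Candidates for "Clinton", most frequently linked first:
candidates = sorted(str2entity["Clinton"],
                    key=str2entity["Clinton"].get, reverse=True)
```

A real dump pass would also fold in redirect and disambiguation pages as additional surface forms of their target entities.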
S2 specifically comprises the following steps:
S21: segment every Wikipedia article with the natural language processing tool Stanford CoreNLP, remove stop words with a stop-word dictionary, and obtain the vocabulary;
S22: based on the vocabulary, calculate the inverse document frequency idf of every word in every article; the idf of a word is calculated as

idf(word) = log(N / df(word))

where N, the document count of the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the vocabulary, calculate the term frequency tf of every word in every article; the tf of a word in an article is calculated as

tf(word) = count(word) / (total number of words in the article);

S24: from the results of S22 and S23, calculate the term frequency-inverse document frequency tf-idf vector of every article of Wikipedia; the tf-idf value of a word is

tfidf_word(word) = tf(word) × idf(word);

S25: based on the tfidf_word(word) values obtained in S24, keep the 20 largest tf-idf values of each article, with their words, as that article's tf-idf feature, denoted tfidf(document);
S26: calculate the commonness feature of each string relative to each candidate entity according to the following formula:

commonness(e, m) = |A_{m,e}| / |A_m|

where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and whose linked entity is e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: store the tf-idf features calculated for every article and the commonness feature of each string relative to each candidate entity.
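A minimal sketch of S22-S25, assuming articles that are already tokenised and stop-word-free per S21 (the documents and all names below are hypothetical):

```python
import math
from collections import Counter

# Hypothetical mini-corpus; real input would be the full Wikipedia dump.
docs = {
    "d1": ["entity", "linking", "wikipedia", "entity"],
    "d2": ["wikipedia", "dump", "anchor"],
    "d3": ["entity", "anchor", "anchor", "linking"],
}

# S22: idf(word) = log(N / df(word)) over the whole corpus.
n_docs = len(docs)
df = Counter(w for words in docs.values() for w in set(words))
idf = {w: math.log(n_docs / df[w]) for w in df}

def tfidf_features(words, top_k=20):
    """S23-S25: tf-idf values of one article, keeping the top_k words."""
    counts = Counter(words)
    total = len(words)
    vec = {w: (c / total) * idf[w] for w, c in counts.items()}
    return dict(sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

feats = {name: tfidf_features(words) for name, words in docs.items()}
```

Truncating each article to its top 20 words (S25) keeps the stored feature vectors small while preserving the terms that dominate the cosine similarity of S53.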
S3 specifically comprises the following steps:
S31: given the entity mention s, query Pname and Plocation; if the string appears in either list, go to step S32, otherwise go to step S33;
S32: check whether a string s' occurs before s in the article such that s is a substring of s'; if it exists, replace s with s' and go to S33; if not, go directly to S33;
S33: query str2entity with the string s to obtain all candidate entities of the string.
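The query expansion of S31-S32 might be sketched as follows; the fixed window sizes and the token-span scan are our own simplification of the substring check described in S32, and all names are illustrative:

```python
def expand_query(mention, article_text, offset, person_names, place_names):
    """S31-S32: if the mention is a known person or place name, scan the
    text before its position for a longer string that contains it (e.g.
    'Clinton' -> 'Hillary Clinton') and link that string instead."""
    if mention not in person_names and mention not in place_names:
        return mention  # S31: not a person/place name, skip expansion
    tokens = article_text[:offset].split()
    for size in (2, 3):  # try two-word spans first, then three-word
        for i in range(len(tokens) - size + 1):
            span = " ".join(tokens[i:i + size])
            if mention in span and span != mention:
                return span  # S32: s is a substring of an earlier s'
    return mention

text = "Hillary Clinton spoke today. Later Clinton left early."
expanded = expand_query("Clinton", text, text.rfind("Clinton"),
                        {"Clinton"}, set())
```

A production version would restrict candidate spans to ones that are themselves names, rather than accepting any earlier token run containing the mention.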
S4 specifically comprises the following steps:
S41: segment the article containing the entity mention with the Stanford CoreNLP tool, remove stop words, and obtain the vocabulary;
S42: obtain the idf value idf(w) of each word in the article containing the mention, using the idf formula of S22;
S43: calculate the important word hit rate IWHR(e, m) according to the following formula:

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|

where e is a candidate entity, m is the entity mention, W_d is the word set of the article containing m, W_e is the word set of the article of e, and T is the chosen idf threshold.
In S5, the degree of correlation between the mention and each of its candidate entities is computed, and the most correlated candidate is taken as the entity linking result, as follows:
S51: extract the article d_m containing the entity mention m and the article d_e of each candidate entity e;
S52: obtain the tf-idf features of articles d_m and d_e from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate entity as the cosine similarity of the two feature vectors:

tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (||tfidf(d_m)|| × ||tfidf(d_e)||)

where ||·|| denotes the norm of a vector;
S54: obtain from the results of S2 and S4 the commonness and IWHR features of the mention m relative to the candidate entity e;
S55: for each candidate entity e, compute the similarity between the mention and the candidate entity according to the following formula:

similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))

where a, b, c are constants;
S56: compute the final entity linking result e_result:

e_result = argmax_e(similarity(e, m)).

The constants a, b, c can be learned by a neural network or tuned by hand; the suggested settings are 1.0, 6.0, and 1.0 respectively.
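The scoring of S55-S56 can be sketched as a log-linear combination over precomputed features; the candidate names and feature values are invented, and the eps guard against log(0) is our own addition (the patent applies the plain logarithm):

```python
import math

def link(candidate_feats, a=1.0, b=6.0, c=1.0, eps=1e-10):
    """S55-S56: score each candidate with the log-linear combination and
    return the argmax. eps guards log(0) when a feature is zero."""
    def score(f):
        return (a * math.log(f["commonness"] + eps)
                + b * math.log(f["tfidf_sim"] + eps)
                + c * math.log(f["iwhr"] + eps))
    return max(candidate_feats, key=lambda e: score(candidate_feats[e]))

candidates = {  # invented feature values for one mention of "Turkey"
    "Turkey_(country)": {"commonness": 0.6, "tfidf_sim": 0.30, "iwhr": 0.5},
    "Turkey_(bird)":    {"commonness": 0.4, "tfidf_sim": 0.05, "iwhr": 0.2},
}
best = link(candidates)
```

The large suggested weight b = 6.0 makes the tf-idf context similarity dominate, with commonness and IWHR acting as prior and keyword corrections.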
The present invention uses only three features, entity commonness, tf-idf, and important word hit rate, which breaks through the limitation of scarce corpora and provides users with reliable entity-linking recommendations; the entity commonness feature adds the prior information.
Description of the drawings
Fig. 1 is the workflow of extracting resources from the Wikipedia and Freebase data dumps;
Fig. 2 is the workflow of the main steps of the named entity linking method fusing prior information.
Detailed description of embodiments
The present invention is further elaborated below with reference to the accompanying drawings and specific embodiments.
The present invention targets the named entity linking task. By combining the three features commonness, tf-idf, and IWHR, it realises a named entity linking method that fuses prior information. The method jointly considers the mention's context, the prior popularity of entities, and the importance of the keywords of the text, achieving high accuracy while also keeping entity linking efficient. Because few features are used, few parameters need to be fitted, which also makes migrating to another knowledge base or corpus convenient.
As illustrated in Figs. 1 and 2, a named entity linking method fusing prior information includes the following steps.
S1: extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps. This step is implemented as follows:
S11: parse the Wikipedia data dump; extract the articles D_e describing entities, the anchor texts A_e in each article, each article's entity id W_id, the redirect pages Repages, and the disambiguation pages dispages; from these, generate the string-to-candidate-entity table str2entity;
S12: extract all person and place names from the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
S2: represent every article in the Wikipedia data dump by its term frequency-inverse document frequency (tf-idf) features, and extract and store the commonness feature of each string relative to each candidate entity. This step is implemented as follows:
S21: segment every Wikipedia article with the natural language processing tool Stanford CoreNLP, remove stop words with a stop-word dictionary, and obtain the vocabulary;
S22: based on the vocabulary, calculate the inverse document frequency idf of every word in every article; the idf of a word is calculated as

idf(word) = log(N / df(word))

where N, the document count of the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the vocabulary, calculate the term frequency tf of every word in every article; the tf of a word in an article is calculated as

tf(word) = count(word) / (total number of words in the article);

S24: from the results of S22 and S23, calculate the term frequency-inverse document frequency tf-idf vector of every article of Wikipedia; the tf-idf value of a word is

tfidf_word(word) = tf(word) × idf(word);

S25: based on the tfidf_word(word) values obtained in S24, keep the 20 largest tf-idf values of each article, with their words, as that article's tf-idf feature, denoted tfidf(document);
S26: calculate the commonness feature of each string relative to each candidate entity according to the following formula:

commonness(e, m) = |A_{m,e}| / |A_m|

where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and whose linked entity is e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: store the tf-idf features calculated for every article and the commonness feature of each string relative to each candidate entity.
S3: perform query expansion using the person-name and place-name lists obtained in S1, then use the string-to-candidate-entity table from S1 to generate candidate entities for the entity mention. This step is implemented as follows:
S31: given the entity mention s, query Pname and Plocation; if the string appears in either list, go to step S32, otherwise go to step S33;
S32: check whether a string s' occurs before s in the article such that s is a substring of s'; if it exists, replace s with s' and go to S33; if not, go directly to S33;
S33: query str2entity with the string s to obtain all candidate entities of the string.
S4: compute the important word hit rate IWHR of the entity mention relative to each candidate entity. This step is implemented as follows:
S41: segment the article containing the entity mention with the Stanford CoreNLP tool, remove stop words, and obtain the vocabulary;
S42: obtain the idf value idf(w) of each word in the article containing the mention, using the idf formula of S22;
S43: calculate the important word hit rate IWHR(e, m) according to the following formula:

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|

where e is a candidate entity, m is the entity mention, W_d is the word set of the article containing m, W_e is the word set of the article of e, and T is the chosen idf threshold.
S5: using the tf-idf, commonness, and IWHR features extracted in S2 and S4, compute the degree of correlation between the entity mention and each of its candidate entities, and take the most correlated candidate as the entity linking result. This step is implemented as follows:
S51: extract the article d_m containing the entity mention m and the article d_e of each candidate entity e;
S52: obtain the tf-idf features of articles d_m and d_e from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate entity as the cosine similarity of the two feature vectors:

tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (||tfidf(d_m)|| × ||tfidf(d_e)||)

where ||·|| denotes the norm of a vector;
S54: obtain from the results of S2 and S4 the commonness and IWHR features of the mention m relative to the candidate entity e;
S55: for each candidate entity e, compute the similarity between the mention and the candidate entity according to the following formula:

similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))

where a, b, c are constants;
S56: compute the final entity linking result e_result:

e_result = argmax_e(similarity(e, m)).

The constants a, b, c can be set by hand to 1.0, 6.0, and 1.0 respectively.
The method is applied below to the following embodiment, so that those skilled in the art can better understand the specific realisation of the present invention.
Embodiment
Taking the documents of the entity discovery and linking subtask of the 2017 Text Analysis Conference as an example, the above method is applied to link the named entities in the text (the resource acquisition process is relatively complex and is not described again). The specific parameters and practices in each step are as follows:
1. Obtain the mentions to be linked from the original document set with a named entity recognition tool or by manual annotation; concretely, provide for each mention a triple of the article it occurs in, its start position, and the position of its last character;
2. Write a script to extract all content of the document set (removing XML tags), one file per article;
3. Segment every article with the natural language processing tool Stanford CoreNLP, remove stop words, and count each article's total number of words;
4. Count the occurrences of each word in every article and calculate the tf value of each word in every article:

tf(word) = count(word) / (total number of words in the article);

5. From the vocabulary and word idf values counted over Wikipedia and the tf values calculated above, calculate the tf-idf value of each word in every article:

tfidf(word) = tf(word) × idf(word);

6. In every article, sort the tf-idf values in descending order and take the first 20, with their corresponding words, as the article's tf-idf features;
7. For each mention identified in step 1, perform query expansion if it is a person or place name. The judgment is: if the mention is inside the person-name list, it is taken as a person name; if inside the place-name list, as a place name. The expansion works as follows: determine whether, before s in the article containing s, there is a string s' such that s is an abbreviation or a part of s' (for example s' is Hillary Clinton and s is Clinton); if such a case exists, replace s with s';
8. For each mention, query the string-candidate entity-commonness lists to obtain the candidate entities of the string and the corresponding commonness features;
9. For each mention, calculate the tf-idf similarity between the mention's article and each of its candidate entities' articles:

tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (||tfidf(d_m)|| × ||tfidf(d_e)||);

10. For each mention, calculate the IWHR similarity between the mention's article and each of its candidate entities. Let e be a candidate Wikipedia entity, m the string to be identified, W_d the word set of the article containing m, and W_e the word set of e's page; then IWHR(e, m) is calculated according to the following equality (2), where T is a manually set idf threshold:

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|    (2)

11. For each mention and each of its candidate entities, calculate the degree of correlation:

similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))    (5)

with (a, b, c) set to (1.0, 6.0, 1.0);
12. Take the e that maximises the above formula as the linking result of m, namely the following equality:

e_result = argmax_e(similarity(e, m))    (6)
The following table shows part of the final linking results for the selected documents.
WORD Beg End KBid
Turkey 2279 2284 m.01znc_
Microsoft 2620 2628 m.04sv4
Nam Dinh 3703 3710 m.07m1dj
the Beatles 2078 2088 m.07c0j
Gaisano mall 2642 2653 m.09rxbx2

Claims (8)

1. A named entity linking method fusing prior information, characterised by comprising the following steps:
S1: extracting a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps;
S2: representing every article in the Wikipedia data dump by its term frequency-inverse document frequency (tf-idf) features, and extracting and storing the commonness feature of each string relative to each candidate entity;
S3: performing query expansion using the person-name and place-name lists obtained in S1, then using the string-to-candidate-entity table from S1 to generate candidate entities for the entity mention;
S4: computing the important word hit rate IWHR of the entity mention relative to each candidate entity;
S5: using the tf-idf, commonness, and IWHR features extracted in S2 and S4, computing the degree of correlation between the entity mention and each of its candidate entities, and taking the most correlated candidate as the entity linking result.
2. The named entity linking method fusing prior information according to claim 1, characterised in that S1 specifically comprises:
S11: parsing the Wikipedia data dump; extracting the articles D_e describing entities, the anchor texts A_e in each article, each article's entity id W_id, the redirect pages Repages, and the disambiguation pages dispages; and generating from these the string-to-candidate-entity table str2entity;
S12: extracting all person and place names from the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
3. The named entity linking method fusing prior information according to claim 1, characterised in that S2 specifically comprises:
S21: segmenting every Wikipedia article with the natural language processing tool Stanford CoreNLP, removing stop words with a stop-word dictionary, and obtaining the vocabulary;
S22: based on the vocabulary, calculating the inverse document frequency idf of every word in every article, the idf of a word being

idf(word) = log(N / df(word)),

where N, the document count of the corpus, is the total number of articles in Wikipedia, and df(word) is the number of articles containing the word;
S23: based on the vocabulary, calculating the term frequency tf of every word in every article, the tf of a word in an article being

tf(word) = count(word) / (total number of words in the article);

S24: from the results of S22 and S23, calculating the term frequency-inverse document frequency tf-idf vector of every article of Wikipedia, the tf-idf value of a word being

tfidf_word(word) = tf(word) × idf(word);

S25: based on the tfidf_word(word) values obtained in S24, keeping the 20 largest tf-idf values of each article as that article's tf-idf feature, denoted tfidf(document);
S26: calculating the commonness feature of each string relative to each candidate entity according to the formula

commonness(e, m) = |A_{m,e}| / |A_m|,

where e is a candidate entity, m is a string, A_{m,e} is the set of anchor texts whose surface string is m and whose linked entity is e, A_m is the set of anchor texts whose surface string is m, and |·| denotes the number of elements in a set;
S27: storing the tf-idf features calculated for every article and the commonness feature of each string relative to each candidate entity.
4. The named entity linking method fusing prior information according to claim 1, characterised in that S3 specifically comprises:
S31: given the entity mention s, querying Pname and Plocation; if the string appears in either list, going to step S32, otherwise going to step S33;
S32: checking whether a string s' occurs before s in the article such that s is a substring of s'; if it exists, replacing s with s' and going to S33; if not, going directly to S33;
S33: querying str2entity with the string s to obtain all candidate entities of the string.
5. The named entity linking method fusing prior information according to claim 1, characterised in that S4 specifically comprises:
S41: segmenting the article containing the entity mention with the Stanford CoreNLP tool, removing stop words, and obtaining the vocabulary;
S42: obtaining the idf value idf(w) of each word in the article containing the mention, using the idf formula of S22;
S43: calculating the important word hit rate IWHR(e, m) according to the formula

IWHR(e, m) = |{w ∈ W_d : idf(w) > T} ∩ W_e| / |{w ∈ W_d : idf(w) > T}|,

where e is a candidate entity, m is the entity mention, W_d is the word set of the article containing m, W_e is the word set of the article of e, and T is the set idf threshold.
6. The named entity linking method fusing prior information according to claim 5, wherein the idf threshold T is set as
7. The named entity linking method fusing prior information according to claim 1, wherein in S5, the correlation degree between the entity mention and each of its candidate entities is computed, and the candidate with the highest correlation degree is taken as the entity linking result, as follows:
S51:Extract the article d_m containing the entity mention m and the article d_e containing a candidate entity e;
S52:Obtain the tf-idf features of articles d_m and d_e from the results stored in S2;
S53:For each candidate entity e, compute the tf-idf similarity between the entity mention and the candidate entity as the cosine of their tf-idf feature vectors:
tfidfsimilarity(e, m) = (tfidf(d_m) · tfidf(d_e)) / (||tfidf(d_m)|| × ||tfidf(d_e)||)
Wherein || || denotes the norm of a vector;
S54:Obtain the commonness and IWHR features of the entity mention m relative to candidate entity e from the results of S2 and S4;
S55:For each candidate entity e, compute the similarity between the entity mention and the candidate entity according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
Wherein a, b, and c are constants;
S56:Compute the final entity linking result e_result:
e_result = argmax_e(similarity(e, m)).
8. The named entity linking method fusing prior information according to claim 7, wherein a, b, and c are set to 1.0, 6.0, and 1.0, respectively.
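Claims 7 and 8 combine the three features into a log-linear score and select the argmax; a minimal sketch, assuming all three feature values are strictly positive (the patent does not state how zero-valued features are smoothed before taking the logarithm):

```python
import math

def similarity(commonness, tfidf_sim, iwhr, a=1.0, b=6.0, c=1.0):
    """S55: weighted sum of log-features; the defaults are the claim-8
    weights a=1.0, b=6.0, c=1.0."""
    return a * math.log(commonness) + b * math.log(tfidf_sim) + c * math.log(iwhr)

def link(features_by_entity):
    """S56: return the candidate entity with the highest similarity.
    `features_by_entity` maps entity -> (commonness, tfidf_sim, iwhr)."""
    return max(features_by_entity, key=lambda e: similarity(*features_by_entity[e]))
```

With b = 6.0, the tf-idf cosine term dominates: a candidate whose article closely matches the mention's context can outrank one with a higher commonness prior.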
CN201810103629.7A 2018-02-01 2018-02-01 Named entity linking method fusing prior information Active CN108363688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810103629.7A CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810103629.7A CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information

Publications (2)

Publication Number Publication Date
CN108363688A true CN108363688A (en) 2018-08-03
CN108363688B CN108363688B (en) 2020-04-28

Family

ID=63004109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810103629.7A Active CN108363688B (en) 2018-02-01 2018-02-01 Named entity linking method fusing prior information

Country Status (1)

Country Link
CN (1) CN108363688B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325230A (en) * 2018-09-21 2019-02-12 广西师范大学 A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining
CN110147401A (en) * 2019-05-22 2019-08-20 苏州大学 Merge the knowledge base abstracting method of priori knowledge and context-sensitive degree
CN110866385A (en) * 2018-08-17 2020-03-06 广州阿里巴巴文学信息技术有限公司 Method and device for releasing external piece of electronic book and readable storage medium
CN111814477A (en) * 2020-07-06 2020-10-23 重庆邮电大学 Dispute focus discovery method and device based on dispute focus entity and terminal
CN113157861A (en) * 2021-04-12 2021-07-23 山东新一代信息产业技术研究院有限公司 Entity alignment method fusing Wikipedia
CN113392220A (en) * 2020-10-23 2021-09-14 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2251795A2 (en) * 2009-05-12 2010-11-17 Comcast Interactive Media, LLC Disambiguation and tagging of entities
CN104462126A (en) * 2013-09-22 2015-03-25 富士通株式会社 Entity linkage method and device
US20170237628A1 (en) * 2016-02-17 2017-08-17 CENX, Inc. Service information model for managing a telecommunications network
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2251795A2 (en) * 2009-05-12 2010-11-17 Comcast Interactive Media, LLC Disambiguation and tagging of entities
CN104462126A (en) * 2013-09-22 2015-03-25 富士通株式会社 Entity linkage method and device
US20170237628A1 (en) * 2016-02-17 2017-08-17 CENX, Inc. Service information model for managing a telecommunications network
CN107608960A (en) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 A kind of method and apparatus for naming entity link

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866385A (en) * 2018-08-17 2020-03-06 广州阿里巴巴文学信息技术有限公司 Method and device for releasing external piece of electronic book and readable storage medium
CN110866385B (en) * 2018-08-17 2024-04-05 阿里巴巴(中国)有限公司 Method and device for publishing outside of electronic book and readable storage medium
CN109325230A (en) * 2018-09-21 2019-02-12 广西师范大学 A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining
CN110147401A (en) * 2019-05-22 2019-08-20 苏州大学 Merge the knowledge base abstracting method of priori knowledge and context-sensitive degree
CN111814477A (en) * 2020-07-06 2020-10-23 重庆邮电大学 Dispute focus discovery method and device based on dispute focus entity and terminal
CN111814477B (en) * 2020-07-06 2022-06-21 重庆邮电大学 Dispute focus discovery method and device based on dispute focus entity and terminal
CN113392220A (en) * 2020-10-23 2021-09-14 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN113392220B (en) * 2020-10-23 2024-03-26 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN113157861A (en) * 2021-04-12 2021-07-23 山东新一代信息产业技术研究院有限公司 Entity alignment method fusing Wikipedia
CN113157861B (en) * 2021-04-12 2022-05-24 山东浪潮科学研究院有限公司 Entity alignment method fusing Wikipedia

Also Published As

Publication number Publication date
CN108363688B (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN108363688A (en) Named entity linking method fusing prior information
CN104376406B (en) A kind of enterprise innovation resource management and analysis method based on big data
CN108959258B (en) Specific field integrated entity linking method based on representation learning
Sunilkumar et al. A survey on semantic similarity
CN103399901A (en) Keyword extraction method
CN104679728A (en) Text similarity detection device
CN104063387A (en) Device and method abstracting keywords in text
CN101782898A (en) Method for analyzing tendentiousness of affective words
CN110162630A (en) A kind of method, device and equipment of text duplicate removal
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
Das et al. Part of speech tagging in odia using support vector machine
CN106611041A (en) New text similarity solution method
Gahbiche-Braham et al. Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier.
CN102609424A (en) Method and equipment for extracting assessment information
Shajalal et al. Semantic textual similarity in bengali text
Thattinaphanich et al. Thai named entity recognition using Bi-LSTM-CRF with word and character representation
Rahman et al. NLP-based automatic answer script evaluation
Popescu et al. HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
Rahman et al. An Automated Approach for Answer Script Evaluation Using Natural Language Processing
Pal et al. Word sense disambiguation in Bengali: An unsupervised approach
CN111259661A (en) New emotion word extraction method based on commodity comments
Bloodgood et al. Using global constraints and reranking to improve cognates detection
CN114912446A (en) Keyword extraction method and device and storage medium
Li et al. Chinese frame identification using t-crf model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant