CN108363688A - A kind of name entity link method of fusion prior information - Google Patents
A kind of name entity link method of fusion prior information
- Publication number
- CN108363688A (application CN201810103629.7A)
- Authority
- CN
- China
- Prior art keywords
- entity
- article
- candidate
- idf
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a named-entity linking method that fuses prior information. The method comprises the following steps: (1) extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia and Freebase data dumps; (2) represent every article in the Wikipedia data dump as term frequency-inverse document frequency (tf-idf) features, and extract the commonness feature of each string relative to its candidate entities; (3) perform query expansion on the entity mention, and generate candidate entities for it using the string-to-candidate-entity table from (1); (4) extract features of the article containing the mention, obtaining its inverse document frequencies and its important-word hit rate; (5) using the features extracted in (2) and (4), compute the degree of relatedness between the mention and each of its candidate entities, and take the most related candidate as the entity-linking result. The invention overcomes the limitation of scarce training corpora and provides reliable entity-linking recommendations to the user, with the entity commonness feature contributing prior information.
Description
Technical field
The present invention relates to natural language processing, and more particularly to a named-entity linking method that fuses prior information.
Background technology
Natural language processing (NLP) is an interdisciplinary field combining linguistics and computer science. Named entity linking (NEL) is a fundamental task in natural language processing. It aims to eliminate the ambiguity caused by linguistic phenomena such as aliases, coreference, and polysemy, by establishing a correspondence between the proper nouns (entity names) appearing in a text and the entities they denote in a knowledge base. The problem is defined as follows: given a piece of text and the mentions within it (a mention being a string to be linked), find in a specified knowledge base the entities those mentions refer to.
Entity linking, as a technique that connects text to a knowledge base, plays a very important role in information extraction. Relation extraction is a typical information-extraction task that depends on entity linking: its goal is to extract the association relations between different entities from text, and finding the entities corresponding to the mentions in the text via entity linking is a prerequisite for that further analysis. In addition, entity linking effectively enriches the original text with additional information, so it can also serve other natural language processing and text mining problems, helping systems understand text more fully and obtain better results.
Entity linking is generally implemented in several steps, of which the two most important are candidate generation and candidate ranking (disambiguation). The candidate generation step finds, from the name used by the current mention, the entities it might refer to, which become the candidates; the candidate ranking step then selects the best candidate as the final linking result, based on the mention's context and on features of the candidate entities themselves.
The common practice for candidate generation is to build a dictionary in advance that records which entities each name may correspond to; when entity linking is performed, the candidate entities can be looked up in the dictionary by the name of the current mention. The dictionary is usually built from information provided by the knowledge base.
For the candidate ranking step, common practice divides into collective and non-collective methods. A collective method considers multiple mentions in the same context simultaneously when linking, trying to make the target entities in the result as strongly related to one another as possible. A non-collective method considers each mention individually. The method used here is non-collective: compared with collective methods it is faster, at a slight cost in quality.
A traditional non-collective method can design a range of hand-crafted features, including: surface features, which measure the similarity between the mention's name string and the candidate entity's name, such as the number of words shared by the two; context features, which measure how well the candidate entity matches the mention's context semantically, such as the tf-idf similarity between the mention's document and the candidate entity's description, or whether all the words of the candidate entity's Wikipedia page title appear in the mention's document; and other features, such as the number of country names that co-occur in the mention's document and the candidate entity's description, or the number of country names that co-occur in the candidate entity's aliases and the mention's document. Because such heuristic features require considerable expert knowledge, the original feature engineering becomes invalid once the knowledge base or the corpus changes; we therefore aim to obtain good results with as few features as possible.
Invention content
The purpose of the invention is to link the entities identified in natural text to a target knowledge base (Freebase), thereby providing a foundation for follow-up work such as information extraction; to this end, a named-entity linking method fusing prior information is proposed.
To that end we designed the IWHR (Important Word Hit Rate) feature and combine it with two other features, commonness and tf-idf, to judge how well an entity matches a mention; the traditional surface features are instead covered by the candidate generation step, and the commonness feature is how the method adds prior information.
Furthermore, because a typical named-entity linking model determines its parameters from training corpora, which are extremely difficult to obtain, the invented method combines the three features in a training-free manner while providing suggested parameter settings.
In addition, to compensate for the fact that non-collective methods do not consider the other mentions in the context, query expansion is added before entity linking, with a dedicated optimization for person and place names in the same article that may refer to the same entity.
Tf-idf (term frequency-inverse document frequency) is commonly used to measure the degree of similarity between articles; here it is introduced to measure the similarity between the mention's context and the candidate entity's context.
Commonness reflects the probability that a candidate entity is the one referred to by the mention; introducing this feature amounts to adding prior information, and it can guide the decision when the context is insufficient. Let $A_s^e$ be the set of anchor texts whose surface string is s and that link to the page of entity e, and let $A_s$ be the set of anchor texts whose surface string is s. Then:

$$commonness(e, s) = \frac{|A_s^e|}{|A_s|} \qquad (1)$$
IWHR compensates for the shortcomings of tf-idf by focusing on the important words that occur in the contexts. Let e be a candidate Wikipedia entity, m the string to be identified, $W_d$ the word set of the article containing m, and $W_e$ the word set of e's page; with T a manually set idf threshold, IWHR(e, m) is computed as:

$$IWHR(e, m) = \frac{|\{w \in W_d \cap W_e : idf(w) > T\}|}{|\{w \in W_d : idf(w) > T\}|} \qquad (2)$$
The present invention is realized specifically through the following technical solution.
The named-entity linking method fusing prior information comprises the following steps:
S1: extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump;
S2: represent every article in the Wikipedia data dump as term frequency-inverse document frequency (tf-idf) features, and extract and store the commonness feature of each string relative to its candidate entities;
S3: perform query expansion using the person-name and place-name lists obtained in S1, then generate candidate entities for the mention using the string-to-candidate-entity table obtained in S1;
S4: compute the important-word hit rate IWHR of the mention relative to each candidate entity;
S5: from the tf-idf, commonness, and IWHR features computed in S2 and S4, compute the degree of relatedness between the mention and each of its candidate entities, and take the most related candidate as the entity-linking result.
Each of the above steps may specifically be realized as follows.
S1 specifically comprises the following steps:
S11: parse the Wikipedia data dump; extract the articles De that contain entities, the anchor texts Ae in each article, the entity id Wid corresponding to each article, the redirect pages Repages, and the disambiguation pages dispages; from these, generate the string-to-candidate-entity table str2entity;
S12: extract all person names and place names in the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
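The structure of the str2entity table built in S11 can be sketched as follows. The dump parsing itself is omitted; the `anchors` list of (surface string, target entity) pairs is hypothetical stand-in data, not the real dump format.

```python
from collections import defaultdict

def build_str2entity(anchors):
    # anchors: (surface_string, target_entity) pairs harvested from
    # Wikipedia anchor texts, redirect pages, and disambiguation pages.
    str2entity = defaultdict(set)
    for surface, entity in anchors:
        str2entity[surface].add(entity)
    # Freeze to sorted lists so lookups are deterministic.
    return {s: sorted(es) for s, es in str2entity.items()}

# Hypothetical anchor data; a real dump yields millions of such pairs.
anchors = [
    ("Clinton", "Hillary_Clinton"),
    ("Clinton", "Bill_Clinton"),
    ("Microsoft", "Microsoft"),
]
str2entity = build_str2entity(anchors)
print(str2entity["Clinton"])  # ['Bill_Clinton', 'Hillary_Clinton']
```

Ambiguous surface strings thus map to several candidates, which the ranking step (S5) later disambiguates.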
S2 specifically comprises the following steps:
S21: tokenize every Wikipedia article with the natural language processing tool Stanford CoreNLP, removing stop words with a stop-word dictionary to obtain the vocabulary;
S22: based on the vocabulary, compute the inverse document frequency idf of every word in every article, where the idf of a word is

$$idf(word) = \log\frac{N}{|\{d : word \in d\}|}$$

the document count N of the corpus being the total number of articles in Wikipedia;
S23: based on the vocabulary, compute the term frequency tf of every word in every article, where the tf of a word is its number of occurrences in the article divided by the article's total word count:

$$tf(word) = \frac{n_{word}}{n_{total}}$$

S24: from the results of S22 and S23, compute the term frequency-inverse document frequency tf-idf of every word in every Wikipedia article:

$$tfidf_{word}(word) = tf(word) \times idf(word)$$

S25: based on the tfidfword(word) obtained in S24, keep the tf-idf values of the top 20 words in descending order as the tf-idf feature of the article, denoted tfidf(document);
S26: compute the commonness feature of each string relative to each candidate entity as

$$commonness(e, m) = \frac{|A_m^e|}{|A_m|}$$

where e is a candidate entity, m is a string, $A_m^e$ is the set of anchor texts whose surface form is m and whose link target is e, $A_m$ is the set of anchor texts with surface form m, and |·| denotes the number of elements in a set;
S27: store the computed tf-idf feature of every article and the commonness feature of every string relative to each candidate entity.
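A minimal sketch of the feature extraction in S22-S26, assuming the corpus is a small in-memory list of tokenized documents (Wikipedia-scale processing would instead stream the dump); the function and variable names are illustrative, not from the patent:

```python
import math
from collections import Counter

def idf_table(docs):
    # idf(word) = log(N / number of documents containing the word)
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n / c) for w, c in df.items()}

def tfidf_top(doc, idf, k=20):
    # tf(word) = occurrences in the article / total words; keep the top k.
    tf = Counter(doc)
    total = len(doc)
    scores = {w: (c / total) * idf.get(w, 0.0) for w, c in tf.items()}
    return dict(sorted(scores.items(), key=lambda kv: -kv[1])[:k])

def commonness(anchor_counts, surface, entity):
    # commonness(e, m) = |A_m^e| / |A_m|, estimated from anchor counts:
    # anchor_counts maps surface string -> {entity: anchor occurrences}.
    by_entity = anchor_counts.get(surface, {})
    total = sum(by_entity.values())
    return by_entity.get(entity, 0) / total if total else 0.0
```

For example, `commonness({"Clinton": {"Bill": 3, "Hillary": 1}}, "Clinton", "Bill")` yields 0.75, the prior probability that the anchor "Clinton" links to that entity.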
S3 specifically comprises the following steps:
S31: given the entity mention s, query Pname and Plocation; if the string appears in either list, go to step S32; otherwise go to step S33;
S32: check whether the preceding text contains a string s' of which s is a substring; if such an s' exists, replace s with s' and go to S33; otherwise go directly to S33;
S33: query str2entity with the string s to obtain all candidate entities for the string.
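The query-expansion logic of S31-S32 can be sketched as follows. Here `earlier_strings` stands in for the candidate strings seen earlier in the article; the patent does not fix this data structure, so it is an assumption of the sketch.

```python
def expand_mention(s, earlier_strings, person_names, place_names):
    # S31: only person and place names are expanded.
    if s not in person_names and s not in place_names:
        return s
    # S32: look for a longer earlier string of which s is a substring,
    # e.g. "Clinton" expands to "Hillary Clinton".
    for s_prime in earlier_strings:
        if s != s_prime and s in s_prime:
            return s_prime
    return s

print(expand_mention("Clinton", ["Hillary Clinton", "New York"],
                     {"Clinton"}, {"New York"}))  # Hillary Clinton
```

The expanded string is then used for the str2entity lookup in S33, which narrows the candidate set considerably for abbreviated names.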
S4 specifically comprises the following steps:
S41: tokenize the article containing the mention with the Stanford CoreNLP tools, removing stop words to obtain the vocabulary;
S42: obtain the idf value idf(w) of every word in the article containing the mention, using the idf formula of S22;
S43: compute the important-word hit rate as

$$IWHR(e, m) = \frac{|\{w \in W_d \cap W_e : idf(w) > T\}|}{|\{w \in W_d : idf(w) > T\}|}$$

where e is a candidate entity, m is the mention, $W_d$ is the word set of the article containing m, $W_e$ is the word set of the article of e, and T is the manually set idf threshold, for which the method provides a recommended setting.
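The IWHR formula, as reconstructed above, amounts to the following sketch; the threshold `t` is left as a parameter, since the patent's recommended value is given separately, and the example idf table is invented for illustration:

```python
def iwhr(mention_words, entity_words, idf, t):
    # Important words of the mention's article: idf above the threshold t.
    important = {w for w in mention_words if idf.get(w, 0.0) > t}
    if not important:
        return 0.0
    # Fraction of those important words that also occur on e's page.
    return len(important & set(entity_words)) / len(important)

idf = {"the": 0.1, "turing": 8.2, "enigma": 7.5, "city": 2.0}
print(iwhr({"the", "turing", "enigma", "city"},
           {"turing", "machine"}, idf, t=3.0))  # 0.5
```

Filtering by idf first is what distinguishes IWHR from a plain word-overlap ratio: frequent words like "the" never enter the numerator or denominator.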
In S5, the degree of relatedness between the mention and each of its candidate entities is computed, and the most related candidate is taken as the linking result, as follows:
S51: retrieve the article dm containing the mention m and the article de of a candidate entity e;
S52: obtain the tf-idf features of articles dm and de from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate as the cosine of their tf-idf vectors:

$$tfidfsimilarity(e, m) = \frac{tfidf(d_m) \cdot tfidf(d_e)}{\|tfidf(d_m)\|\,\|tfidf(d_e)\|}$$

where ‖·‖ denotes the norm of a vector;
S54: obtain from the results of S2 and S4 the commonness feature and the IWHR feature of the mention m relative to the candidate entity e;
S55: for each candidate entity e, compute the similarity between the mention and the candidate:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
where a, b, c are constants;
S56: compute the final linking result eresult:
eresult = argmaxe(similarity(e, m)).
a, b, and c can be learned by a neural network or tuned manually; the suggested values are 1.0, 6.0, and 1.0 respectively.
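The log-linear combination of S55-S56 with the suggested weights can be sketched as follows. The feature keys (`commonness`, `tfidf_sim`, `iwhr`) and the epsilon guard against log(0) are assumptions of the sketch; the patent does not specify how zero-valued features are handled.

```python
import math

def link(features, a=1.0, b=6.0, c=1.0, eps=1e-12):
    # features: candidate entity -> dict with the three feature values.
    def score(f):
        return (a * math.log(f["commonness"] + eps)
                + b * math.log(f["tfidf_sim"] + eps)
                + c * math.log(f["iwhr"] + eps))
    # S56: argmax over candidate entities.
    return max(features, key=lambda e: score(features[e]))

features = {
    "Bill_Clinton":    {"commonness": 0.7, "tfidf_sim": 0.10, "iwhr": 0.2},
    "Hillary_Clinton": {"commonness": 0.3, "tfidf_sim": 0.25, "iwhr": 0.4},
}
print(link(features))  # Hillary_Clinton
```

With b = 6.0, context similarity dominates: the candidate with a weaker prior but stronger contextual match wins here, which is the intended behavior of fusing prior information with context features.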
The present invention uses only three features: entity commonness, term frequency/inverse document frequency, and the important-word hit rate. It overcomes the limitation of scarce training corpora and provides reliable entity-linking recommendations to the user, with the entity commonness feature contributing prior information.
Description of the drawings
Fig. 1 is the workflow of extracting resources from the Wikipedia data dump and the Freebase data dump;
Fig. 2 is the workflow of the main steps of the named-entity linking method fusing prior information.
Specific implementation mode
The present invention is further elaborated below with reference to the accompanying drawings and a specific embodiment.
The present invention targets the named-entity linking task by combining the three features commonness, tf-idf, and IWHR into a named-entity linking method fusing prior information. The method jointly considers the mention's context, the prior popularity of entities, and the importance of a text's keywords; it achieves high accuracy while also keeping entity linking efficient. Because few features are used, few parameters ultimately need to be fitted, which also makes migrating to another knowledge base or corpus more convenient.
As illustrated in Figs. 1 and 2, a named-entity linking method fusing prior information comprises the following steps.
S1: extract a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump. This step is realized as follows:
S11: parse the Wikipedia data dump; extract the articles De that contain entities, the anchor texts Ae in each article, the entity id Wid corresponding to each article, the redirect pages Repages, and the disambiguation pages dispages; from these, generate the string-to-candidate-entity table str2entity;
S12: extract all person names and place names in the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
S2: represent every article in the Wikipedia data dump as term frequency-inverse document frequency (tf-idf) features, and extract and store the commonness feature of each string relative to its candidate entities. This step is realized as follows:
S21: tokenize every Wikipedia article with the natural language processing tool Stanford CoreNLP, removing stop words with a stop-word dictionary to obtain the vocabulary;
S22: based on the vocabulary, compute the inverse document frequency idf of every word in every article, where the idf of a word is

$$idf(word) = \log\frac{N}{|\{d : word \in d\}|}$$

the document count N of the corpus being the total number of articles in Wikipedia;
S23: based on the vocabulary, compute the term frequency tf of every word in every article, where the tf of a word is its number of occurrences in the article divided by the article's total word count:

$$tf(word) = \frac{n_{word}}{n_{total}}$$

S24: from the results of S22 and S23, compute the tf-idf value of every word in every Wikipedia article:

$$tfidf_{word}(word) = tf(word) \times idf(word)$$

S25: based on the tfidfword(word) obtained in S24, keep the tf-idf values of the top 20 words in descending order as the tf-idf feature of the article, denoted tfidf(document);
S26: compute the commonness feature of each string relative to each candidate entity as

$$commonness(e, m) = \frac{|A_m^e|}{|A_m|}$$

where e is a candidate entity, m is a string, $A_m^e$ is the set of anchor texts whose surface form is m and whose link target is e, $A_m$ is the set of anchor texts with surface form m, and |·| denotes the number of elements in a set;
S27: store the computed tf-idf feature of every article and the commonness feature of every string relative to each candidate entity.
S3: perform query expansion using the person-name and place-name lists obtained in S1, then generate candidate entities for the mention using the string-to-candidate-entity table obtained in S1. This step is realized as follows:
S31: given the entity mention s, query Pname and Plocation; if the string appears in either list, go to step S32; otherwise go to step S33;
S32: check whether the preceding text contains a string s' of which s is a substring; if such an s' exists, replace s with s' and go to S33; otherwise go directly to S33;
S33: query str2entity with the string s to obtain all candidate entities for the string.
S4: compute the important-word hit rate IWHR of the mention relative to each candidate entity. This step is realized as follows:
S41: tokenize the article containing the mention with the Stanford CoreNLP tools, removing stop words to obtain the vocabulary;
S42: obtain the idf value idf(w) of every word in the article containing the mention, using the idf formula of S22;
S43: compute the important-word hit rate as

$$IWHR(e, m) = \frac{|\{w \in W_d \cap W_e : idf(w) > T\}|}{|\{w \in W_d : idf(w) > T\}|}$$

where e is a candidate entity, m is the mention, $W_d$ is the word set of the article containing m, $W_e$ is the word set of the article of e, and T is the manually set idf threshold, for which the method provides a recommended setting.
S5: from the tf-idf, commonness, and IWHR features computed in S2 and S4, compute the degree of relatedness between the mention and each of its candidate entities, and take the most related candidate as the entity-linking result. This step is realized as follows:
S51: retrieve the article dm containing the mention m and the article de of a candidate entity e;
S52: obtain the tf-idf features of articles dm and de from the results stored in S2;
S53: for each candidate entity e, compute the tf-idf similarity between the mention and the candidate as the cosine of their tf-idf vectors:

$$tfidfsimilarity(e, m) = \frac{tfidf(d_m) \cdot tfidf(d_e)}{\|tfidf(d_m)\|\,\|tfidf(d_e)\|}$$

where ‖·‖ denotes the norm of a vector;
S54: obtain from the results of S2 and S4 the commonness feature and the IWHR feature of the mention m relative to the candidate entity e;
S55: for each candidate entity e, compute the similarity between the mention and the candidate:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
where a, b, c are constants;
S56: compute the final linking result eresult:
eresult = argmaxe(similarity(e, m)).
a, b, and c can be set manually to 1.0, 6.0, and 1.0 respectively.
The method is applied to the following embodiment so that those skilled in the art can better understand a specific realization of the present invention.
Embodiment
Taking the documents of the entity discovery and linking subtask of the 2017 Text Analysis Conference as an example, the above method is applied to link the named entities in the text (the resource-access process is somewhat complex and is not detailed again). The specific parameters and procedures of each step are as follows:
1. Obtain the mentions to be linked from the original document set using a named-entity recognition tool or manual annotation; concretely, each mention is given as a triple of the article containing the mention string, its start position, and the position of its last letter;
2. Write a script to extract all content from the document set (removing xml tags), with every article as one file;
3. Tokenize every article with the natural language processing tool Stanford CoreNLP, remove stop words, and count the total number of words of every article;
4. For every article, count the occurrences of each word and compute the tf value of each word as in S23, tf(word) = n_word / n_total;
5. From the vocabulary and word idf values computed over Wikipedia, together with the tf values computed above, compute the tf-idf value of each word in every article as in S24, tfidf_word(word) = tf(word) × idf(word);
6. In every article, take the 20 largest tf-idf values with their corresponding words, in descending order, as the article's tf-idf feature;
7. For each mention identified in step 1, perform query expansion if it is a person or place name. The judgment is: if the mention appears in the person-name list it is taken as a person name, and if it appears in the place-name list, as a place name. The expansion is: determine whether a string s' occurs before s in the article such that s is an abbreviation or a part of s' (for example, s' is Hillary Clinton and s is Clinton); if such a case exists, replace s with s';
8. For each mention, query the string-candidate-entity-commonness lists to obtain the candidate entities of the string and the corresponding commonness features;
9. For each mention, compute the tf-idf similarity between the mention's article and the articles of its candidate entities as the cosine of their tf-idf vectors, as in S53;
10. For each mention, compute the IWHR similarity between the mention's article and each of its candidate entities. Let e be a candidate Wikipedia entity, m the string to be identified, $W_d$ the word set of the article containing m, and $W_e$ the word set of e's page; IWHR(e, m) is then computed according to formula (2), with T the manually set idf threshold;
11. For each mention and each of its candidate entities, compute the entity relatedness:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m)) (5)
with (a, b, c) set to (1.0, 6.0, 1.0);
12. Take the e that maximizes the above formula as the linking result for m, namely:
eresult = argmaxe(similarity(e, m)) (6)
The following table shows part of the final linking results for the selected documents.

WORD | Beg | End | KBid
---|---|---|---
Turkey | 2279 | 2284 | m.01znc_
Microsoft | 2620 | 2628 | m.04sv4
Nam Dinh | 3703 | 3710 | m.07m1dj
the Beatles | 2078 | 2088 | m.07c0j
Gaisano mall | 2642 | 2653 | m.09rxbx2
Claims (8)
1. A named-entity linking method fusing prior information, characterized by comprising the following steps:
S1: extracting a string-to-candidate-entity table, a person-name list, and a place-name list from the Wikipedia data dump and the Freebase data dump;
S2: representing every article in the Wikipedia data dump as term frequency-inverse document frequency (tf-idf) features, and extracting and storing the commonness feature of each string relative to its candidate entities;
S3: performing query expansion using the person-name and place-name lists obtained in S1, and generating candidate entities for the mention using the string-to-candidate-entity table obtained in S1;
S4: computing the important-word hit rate IWHR of the mention relative to each candidate entity;
S5: computing, from the tf-idf, commonness, and IWHR features extracted in S2 and S4, the degree of relatedness between the mention and each of its candidate entities, and taking the most related candidate as the entity-linking result.
2. The named-entity linking method fusing prior information according to claim 1, characterized in that S1 specifically comprises the following steps:
S11: parsing the Wikipedia data dump; extracting the articles De that contain entities, the anchor texts Ae in each article, the entity id Wid corresponding to each article, the redirect pages Repages, and the disambiguation pages dispages; and from these generating the string-to-candidate-entity table str2entity;
S12: extracting all person names and place names in the Freebase data dump to form the person-name list Pname and the place-name list Plocation.
3. The named-entity linking method fusing prior information according to claim 1, characterized in that S2 specifically comprises the following steps:
S21: tokenizing every Wikipedia article with the natural language processing tool Stanford CoreNLP, removing stop words with a stop-word dictionary to obtain the vocabulary;
S22: based on the vocabulary, computing the inverse document frequency idf of every word in every article, where the idf of a word is

$$idf(word) = \log\frac{N}{|\{d : word \in d\}|}$$

the document count N of the corpus being the total number of articles in Wikipedia;
S23: based on the vocabulary, computing the term frequency tf of every word in every article, where the tf of a word is its number of occurrences in the article divided by the article's total word count:

$$tf(word) = \frac{n_{word}}{n_{total}}$$

S24: from the results of S22 and S23, computing the tf-idf value of every word in every Wikipedia article:

$$tfidf_{word}(word) = tf(word) \times idf(word)$$

S25: based on the tfidfword(word) obtained in S24, keeping the tf-idf values of the top 20 words in descending order as the tf-idf feature of the article, denoted tfidf(document);
S26: computing the commonness feature of each string relative to each candidate entity as

$$commonness(e, m) = \frac{|A_m^e|}{|A_m|}$$

where e is a candidate entity, m is a string, $A_m^e$ is the set of anchor texts whose surface form is m and whose link target is e, $A_m$ is the set of anchor texts with surface form m, and |·| denotes the number of elements in a set;
S27: storing the computed tf-idf feature of every article and the commonness feature of every string relative to each candidate entity.
4. The named-entity linking method fusing prior information according to claim 1, characterized in that S3 specifically comprises the following steps:
S31: given the entity mention s, querying Pname and Plocation; if the string appears in either list, going to step S32, and otherwise to step S33;
S32: checking whether the preceding text contains a string s' of which s is a substring; if such an s' exists, replacing s with s' and going to S33, and otherwise going directly to S33;
S33: querying str2entity with the string s to obtain all candidate entities for the string.
5. The named-entity linking method fusing prior information according to claim 1, characterized in that S4 specifically comprises the following steps:
S41: tokenizing the article containing the mention with the Stanford CoreNLP tools, removing stop words to obtain the vocabulary;
S42: obtaining the idf value idf(w) of every word in the article containing the mention, using the idf formula of S22;
S43: computing the important-word hit rate as

$$IWHR(e, m) = \frac{|\{w \in W_d \cap W_e : idf(w) > T\}|}{|\{w \in W_d : idf(w) > T\}|}$$

where e is a candidate entity, m is the mention, $W_d$ is the word set of the article containing m, $W_e$ is the word set of the article of e, and T is the idf threshold that is set.
6. The named-entity linking method fusing prior information according to claim 5, characterized in that said idf threshold T is set as:
7. The named entity linking method fusing prior information according to claim 1, characterized in that in S5, the degree of correlation between the entity mention and each of its candidate entities is computed, and the candidate with the highest correlation is taken as the entity linking result, as follows:
S51: Extract the article d_m containing the entity mention m and the article d_e containing a candidate entity e;
S52: Obtain the tf-idf features of articles d_m and d_e from the results stored in S2;
S53: For each candidate entity e, compute the tf-idf similarity between the entity mention and the candidate entity:
where || · || denotes the norm of a vector;
S54: Obtain the commonness feature and the IWHR feature of mention m relative to candidate entity e from the results of S2 and S4;
S55: For each candidate entity e, compute the similarity between the entity mention and the candidate entity according to the following formula:
similarity(e, m) = a × log(commonness(e, m)) + b × log(tfidfsimilarity(e, m)) + c × log(IWHR(e, m))
where a, b, c are constants;
S56: Compute the final entity linking result e_result:
e_result = argmax_e(similarity(e, m)).
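The scoring in S53–S56 can be sketched as follows. Since the S53 formula itself is not reproduced, the tf-idf similarity is read as cosine similarity (inferred from the claim's mention of the vector norm); the epsilon guard against log(0) is an added safeguard, not part of the claim, and the default weights follow claim 8.

```python
import math

def tfidf_similarity(vec_m, vec_e):
    """Cosine similarity between the tf-idf vectors of the two articles
    (assumed reading of S53); vectors are dicts word -> tf-idf weight."""
    keys = set(vec_m) | set(vec_e)
    dot = sum(vec_m.get(k, 0.0) * vec_e.get(k, 0.0) for k in keys)
    norm_m = math.sqrt(sum(v * v for v in vec_m.values()))
    norm_e = math.sqrt(sum(v * v for v in vec_e.values()))
    return dot / (norm_m * norm_e) if norm_m and norm_e else 0.0

def similarity(commonness, tfidf_sim, iwhr, a=1.0, b=6.0, c=1.0):
    """S55: weighted sum of log features; a, b, c default to the values
    of claim 8. eps guards log(0) for features that can be zero."""
    eps = 1e-12
    return (a * math.log(commonness + eps)
            + b * math.log(tfidf_sim + eps)
            + c * math.log(iwhr + eps))

def link(candidates):
    """S56: pick the candidate with the highest similarity (argmax).
    candidates maps entity id -> (commonness, tfidf_sim, iwhr)."""
    return max(candidates, key=lambda e: similarity(*candidates[e]))

# Toy example: e2 has lower commonness but a much better tf-idf match,
# which dominates because b = 6.0 weights that feature most heavily.
print(link({"e1": (0.8, 0.5, 0.5), "e2": (0.2, 0.9, 0.5)}))
```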
8. The named entity linking method fusing prior information according to claim 7, characterized in that a, b and c are set to 1.0, 6.0 and 1.0 respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810103629.7A CN108363688B (en) | 2018-02-01 | 2018-02-01 | Named entity linking method fusing prior information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108363688A true CN108363688A (en) | 2018-08-03 |
CN108363688B CN108363688B (en) | 2020-04-28 |
Family
ID=63004109
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810103629.7A Active CN108363688B (en) | 2018-02-01 | 2018-02-01 | Named entity linking method fusing prior information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108363688B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2251795A2 (en) * | 2009-05-12 | 2010-11-17 | Comcast Interactive Media, LLC | Disambiguation and tagging of entities |
CN104462126A (en) * | 2013-09-22 | 2015-03-25 | 富士通株式会社 | Entity linkage method and device |
US20170237628A1 (en) * | 2016-02-17 | 2017-08-17 | CENX, Inc. | Service information model for managing a telecommunications network |
CN107608960A (en) * | 2017-09-08 | 2018-01-19 | 北京奇艺世纪科技有限公司 | A kind of method and apparatus for naming entity link |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110866385A (en) * | 2018-08-17 | 2020-03-06 | 广州阿里巴巴文学信息技术有限公司 | Method and device for releasing external piece of electronic book and readable storage medium |
CN110866385B (en) * | 2018-08-17 | 2024-04-05 | 阿里巴巴(中国)有限公司 | Method and device for publishing outside of electronic book and readable storage medium |
CN109325230A (en) * | 2018-09-21 | 2019-02-12 | 广西师范大学 | A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining |
CN110147401A (en) * | 2019-05-22 | 2019-08-20 | 苏州大学 | Merge the knowledge base abstracting method of priori knowledge and context-sensitive degree |
CN111814477A (en) * | 2020-07-06 | 2020-10-23 | 重庆邮电大学 | Dispute focus discovery method and device based on dispute focus entity and terminal |
CN111814477B (en) * | 2020-07-06 | 2022-06-21 | 重庆邮电大学 | Dispute focus discovery method and device based on dispute focus entity and terminal |
CN113392220A (en) * | 2020-10-23 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Knowledge graph generation method and device, computer equipment and storage medium |
CN113392220B (en) * | 2020-10-23 | 2024-03-26 | 腾讯科技(深圳)有限公司 | Knowledge graph generation method and device, computer equipment and storage medium |
CN113157861A (en) * | 2021-04-12 | 2021-07-23 | 山东新一代信息产业技术研究院有限公司 | Entity alignment method fusing Wikipedia |
CN113157861B (en) * | 2021-04-12 | 2022-05-24 | 山东浪潮科学研究院有限公司 | Entity alignment method fusing Wikipedia |
Also Published As
Publication number | Publication date |
---|---|
CN108363688B (en) | 2020-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108363688A (en) | A kind of name entity link method of fusion prior information | |
CN104376406B (en) | A kind of enterprise innovation resource management and analysis method based on big data | |
CN108959258B (en) | Specific field integrated entity linking method based on representation learning | |
Sunilkumar et al. | A survey on semantic similarity | |
CN103399901A (en) | Keyword extraction method | |
CN104679728A (en) | Text similarity detection device | |
CN104063387A (en) | Device and method abstracting keywords in text | |
CN101782898A (en) | Method for analyzing tendentiousness of affective words | |
CN110162630A (en) | A kind of method, device and equipment of text duplicate removal | |
CN110362678A (en) | A kind of method and apparatus automatically extracting Chinese text keyword | |
Das et al. | Part of speech tagging in odia using support vector machine | |
CN106611041A (en) | New text similarity solution method | |
Gahbiche-Braham et al. | Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier. | |
CN102609424A (en) | Method and equipment for extracting assessment information | |
Shajalal et al. | Semantic textual similarity in bengali text | |
Thattinaphanich et al. | Thai named entity recognition using Bi-LSTM-CRF with word and character representation | |
Rahman et al. | NLP-based automatic answer script evaluation | |
Popescu et al. | HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages | |
CN112559711A (en) | Synonymous text prompting method and device and electronic equipment | |
Rahman et al. | An Automated Approach for Answer Script Evaluation Using Natural Language Processing | |
Pal et al. | Word sense disambiguation in Bengali: An unsupervised approach | |
CN111259661A (en) | New emotion word extraction method based on commodity comments | |
Bloodgood et al. | Using global constraints and reranking to improve cognates detection | |
CN114912446A (en) | Keyword extraction method and device and storage medium | |
Li et al. | Chinese frame identification using t-crf model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||