CN112966117A - Entity linking method - Google Patents


Info

Publication number
CN112966117A
CN112966117A
Authority
CN
China
Prior art keywords
entity
text
knowledge base
description information
vector representation
Prior art date
Legal status
Pending
Application number
CN202011581122.6A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Chengdu Shuzhilian Technology Co Ltd
Original Assignee
Chengdu Shuzhilian Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Shuzhilian Technology Co Ltd filed Critical Chengdu Shuzhilian Technology Co Ltd
Priority to CN202011581122.6A priority Critical patent/CN112966117A/en
Publication of CN112966117A publication Critical patent/CN112966117A/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Abstract

The invention discloses an entity linking method, relating to the field of natural language processing, which comprises the following steps: generating text description information for the entities in a knowledge base according to the attributes of those entities and the structural information of the relation graph; obtaining a global sentence vector representation of the text description information of an entity in the knowledge base and taking it as the vector representation of that entity; obtaining vector representations of the entity text fragments in the unstructured text; computing similarity scores between the vector representations of the entity text fragments in the unstructured text and the vector representations of the entities in the knowledge base through vector similarity calculation, ranking by similarity score, and selecting the candidate entity whose score exceeds a threshold and ranks highest as the entity linking the unstructured text to the knowledge base. The invention avoids the inconsistency of entity feature vector representations in separate spaces caused by the difference between graph and text data structures, and facilitates accurate calculation of entity feature similarity.

Description

Entity linking method
Technical Field
The invention relates to the field of natural language processing, in particular to an entity linking method.
Background
Entity linking is a key method for knowledge fusion in knowledge graphs, mainly responsible for linking entities extracted from various unstructured documents with entities that already exist in a knowledge base. It is widely used across natural language processing, for example in knowledge base construction and knowledge-base question answering. Consider the following sentence in a document:
In the third year of the Tianbao era of the Tang dynasty, Li Bai came to the eastern capital Luoyang. Here he met Du Fu.
Entity linking entails linking the entities "Tang dynasty", "Li Bai", "Luoyang", and "Du Fu" in the sentence with entities in an existing knowledge base.
The entity linking method mainly comprises the following four steps:
(1) Detect possible entity text segments in the sentence, such as "Li Bai", "Luoyang", and "Du Fu" in the sentence above.
(2) For each detected entity text segment, search the knowledge base for several candidate entities using a retrieval algorithm; for example, "Li Bai" may have multiple candidate entities in the knowledge base, such as the Tang-dynasty poet Li Bai and a Li Bai who heads some organization.
(3) Learn vectorized distributed embedding representations of the entity text fragments in the document and the candidate entities in the knowledge base, then compute similarity scores.
(4) Rank the candidate entities in the knowledge base by similarity, and take the candidate entity whose similarity exceeds a certain threshold and whose score is highest as the finally linked entity.
The mainstream entity linking methods in industry today differ mainly in how they vectorize the entity text fragments and the candidate entities in the knowledge base. Traditional methods use distributed representation learning to compute, over a relevant corpus, the probability of each candidate entity given the entity text fragment (the mention), and select the K entities with the highest probability as candidates.
Disclosure of Invention
The invention provides an efficient end-to-end method based on GraphToText for the knowledge-graph entity linking task. GraphToText converts the graph-structure information of entity nodes in a knowledge base into text description information, which helps entity linking learn context semantic representations of the entity nodes in the graph, and can be widely applied to entity linking tasks between unstructured text and a knowledge base.
In order to achieve the above object, the present invention provides an entity linking method, including:
generating text description information of the entities in the knowledge base according to the attributes of the entities in the knowledge base and the structural information of the relational graph;
obtaining sentence global vector representation of text description information of an entity in a knowledge base, and taking the sentence global vector representation as vector representation of the entity in the knowledge base;
obtaining vector representation of entity text fragments in the unstructured text;
calculating similarity scores between the vector representations of the entity text segments in the unstructured text and the vector representations of the entities in the knowledge base through vector similarity calculation, ranking by similarity score, and selecting the candidate entity whose score exceeds a threshold and ranks highest as the entity linking the unstructured text to the knowledge base.
The principle of the invention is as follows: the graph-structure information of entity nodes in the knowledge base is converted into text description information of the entities using an entity text description template; a pre-trained language model then learns distributed vector representations of the context of entity text fragments in the unstructured text and of the entity-node text descriptions in the knowledge base; finally, a pairwise vector similarity calculation method computes the similarity score between the two vectors, and this score is used as the basis for judging whether the two entities are linked.
Preferably, in the method, a BERT model is used for obtaining sentence global vector representation of text description information of an entity in a knowledge base, and a BERT model is used for obtaining vector representation of an entity text segment in an unstructured text.
Preferably, the method includes identifying entities in the unstructured text, and specifically includes:
the words of the unstructured text D = {w1, w2, …, wl} come from the vocabulary, wi ∈ W^v, i = 1, …, l, wherein l represents the number of words in the text D, v represents the size of the vocabulary, and W represents the vocabulary space;
obtaining a vector representation sequence of the word sequence of text D using the BERT pre-trained language model: X = {x1, x2, …, xl}, xi ∈ R^d, i = 1, 2, …, l; xi is a d-dimensional vector in real space, representing the vector representation of the i-th word, and R represents the real number space;
the sequence embedded representation X is input into a pre-set encoder, resulting in a word embedded representation of the entity.
Preferably, the preset encoder in the method is a Transformer encoder:
Xe = Transformer_Encoder(X)
wherein Xe represents the entity embedding information after encoding by the Transformer encoder;
Xe is passed through a linear transformation and input into a Softmax classifier for entity classification:
Oe = Linear(Xe)
Pe = Softmax(Oe)
wherein Oe ∈ R^(n×c) represents the embedding of Xe obtained after the linear transformation, c represents the number of classes of entity BIO tags, and Pe ∈ R^(n×c) represents the probability distribution of the entity classification; the subscript index of the item with the maximum probability value in the distribution is taken as the index of the entity class label of the current character;
the detected entity text fragment set E = {e1, e2, …, en} is obtained by decoding the BIO tags of the sequence, wherein E represents the set of entity text fragments and ei (i = 1, 2, …, n) represents the text fragment [wi0, …, wik] of the i-th entity in the unstructured text, where k represents the length of the entity text fragment.
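As an illustration, the BIO decoding step above can be sketched as follows; the tag names and the space-joined fragment text are assumptions made for the example, not specified by the patent.

```python
def decode_bio(tokens, tags):
    """Decode a BIO-tagged word sequence into entity text fragments.

    tokens: list of words; tags: one label per word, e.g. "B-PER",
    "I-PER", "O". Returns a list of (start, end, fragment) spans.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):          # a new entity begins here
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "O":                  # outside any entity
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
        # "I-" tags simply extend the current fragment
    if start is not None:                 # flush a fragment ending the sequence
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans
```

For the example sentence, decode_bio(["Li", "Bai", "came", "to", "Luoyang"], ["B-PER", "I-PER", "O", "O", "B-LOC"]) yields the fragments "Li Bai" and "Luoyang".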
Preferably, the method includes generating candidate entities of the entities in the unstructured text in the knowledge base, and specifically includes:
constructing entity names for entities in the set E to obtain a final entity name set;
querying a candidate entity set G = {G1, …, Gn} in the knowledge base by fuzzy matching based on the final entity name set, wherein Gi = {gi1, …, gim} represents the candidate entity set of the i-th entity in the knowledge base, and m represents the number of knowledge-base candidate entities for each entity text segment.
Preferably, the method of constructing entity names is as follows: adding an alias dictionary of entity names, generating shorthands of the names using name shorthand rules for entity names of different ontology types, and finally generating the entity name set of the entity from the alias dictionary and the shorthand rules.
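A minimal sketch of this candidate-generation step, assuming Python's difflib for the fuzzy matching (the patent does not specify the matching algorithm) and a simple mention-to-aliases dictionary:

```python
import difflib

def candidate_entities(mention, kb_names, alias_dict, m=3, cutoff=0.6):
    """Query up to m knowledge-base candidates for a mention by fuzzy
    matching over the mention itself plus its known aliases/shorthands."""
    queries = [mention] + alias_dict.get(mention, [])
    candidates = []
    for q in queries:
        for name in difflib.get_close_matches(q, kb_names, n=m, cutoff=cutoff):
            if name not in candidates:
                candidates.append(name)
    return candidates[:m]
```

The alias dictionary lets a transliterated or shorthand mention still reach its knowledge-base name.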
Preferably, the method includes generating context description information of the entity in the unstructured text, and specifically includes:
for each entity in the unstructured text, generating corresponding description information T = {T1, T2, …, Tn} according to the context of the entity, wherein n is the number of entities in the unstructured text;
the description information is generated from the entity's context as follows: the context sentence containing the entity is taken as the description information of the entity, and preset marker characters are added before and after each entity.
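The marker insertion can be sketched as follows, using the [LINK] marker from the embodiment and character offsets for the mention (both illustrative):

```python
def mark_mention(sentence, start, end, marker="[LINK]"):
    """Wrap the mention at character span [start, end) with marker
    characters so the encoder knows which entity is being linked."""
    return sentence[:start] + marker + sentence[start:end] + marker + sentence[end:]
```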
Preferably, the method includes generating candidate entity description information in a knowledge base, and specifically includes:
for an entity ei in the unstructured text and its candidate entity set Gi in the knowledge base, a preset template for the entity's ontology type is used for each candidate entity, according to the ontology type of the entity, to generate the description information S of the entity; the description information of the candidate entities is S = {S1, S2, …, Sn}, Si = {si1, …, sim},
wherein S represents the description information of the knowledge-base candidate entity sets of the entities in the unstructured text, Si represents the description information of the candidate entities of the i-th entity in the knowledge base, sij represents the description information of the j-th knowledge-base candidate entity of the i-th entity in the unstructured text, n represents the number of entities in the unstructured text, and m represents the size of the knowledge-base candidate entity set of each entity in the unstructured text.
Preferably, the vectorized representation of the entity in the unstructured text and the candidate entity in the knowledge base in the method includes:
setting the description information T = {T1, T2, …, Tn} of the entities in the unstructured text and inputting it into the BERT model to obtain the text-sequence vector representation of each piece of description information:
XT = BERT(T), XT ∈ R^(n×t×d)
wherein t represents the maximum length of the description text of an entity in the unstructured text, and d represents the dimension of the word vectors output by BERT;
obtaining the global vector representation of each sentence from the text-sequence vector representation of the entity description information through a pooling layer:
ui = MaxPool1D(XTi), ui ∈ R^d
wherein ui represents the global vector representation of the sentence describing the i-th entity;
the knowledge-base candidate entity description information S = {S1, S2, …, Sn} is likewise input into BERT, and the sequence vector representation obtained is:
XS = BERT(S), XS = {XS1, …, XSn}, XSi = {xSi1, …, xSim}
wherein XS represents the text-sequence vector representation matrix obtained after the description information of each entity in the knowledge-base candidate entity set S passes through the pre-trained BERT model; XSi represents the text-sequence vector representation matrix formed by the candidate-entity-set description information of the i-th entity of the unstructured text in the knowledge base; xSij represents the text-sequence vector representation of the description information sij of the j-th knowledge-base candidate entity of the i-th entity in the unstructured text;
extracting the global vector representation of each sentence from the text-sequence vector representation of the knowledge-base candidate entity description information through a pooling layer:
vij = MaxPool1D(xSij), vij ∈ R^d
wherein vij represents the global vector representation of the description information of the j-th knowledge-base candidate entity of the i-th entity in the unstructured text.
Preferably, the method further comprises:
calculating the similarity between an entity in the unstructured text and the description information of each of its candidate entities in the knowledge base using a similarity calculation method:
CS(i, j) = Similarity(ui, vij)
wherein CS(i, j) represents the similarity score between the description information of the i-th entity in the unstructured text and the description information of its j-th candidate entity in the knowledge base, ui represents the global vector representation of the sentence describing the i-th entity, vij represents the global vector representation of the description information of the j-th knowledge-base candidate entity of the i-th entity in the unstructured text, and Similarity represents a pairwise similarity calculation method;
for the i-th entity in the unstructured text, the similarity scores over its knowledge-base candidate entity set are ranked, and the candidate whose similarity score exceeds a set threshold and is highest is selected as the entity finally linked in the knowledge base:
max_i = argmax_j CS(i, j), i = 1, 2, …, n, j = 1, 2, …, m
wherein max_i is the final linked entity of entity i, n represents the number of entities in the unstructured text, and m represents the size of the knowledge-base candidate entity set of each entity in the unstructured text.
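The patent leaves the pairwise Similarity function open; one common choice is cosine similarity, sketched below (the embodiment instead describes an MLP-based scorer):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors: one possible
    instantiation of the pairwise Similarity function."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```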
The invention also provides an entity linking system, which comprises:
a text description information generating unit for the entities in the knowledge base, used to generate text description information for the entities in the knowledge base according to the attributes of the entities and the structural information of the relation graph;
a vector representation obtaining unit for the entities in the knowledge base, used to obtain the global sentence vector representation of the text description information of an entity in the knowledge base and take it as the vector representation of the entity;
a vector representation obtaining unit for the entity text fragments in the unstructured text, used to obtain the vector representations of the entity text fragments in the unstructured text;
a linked-entity obtaining unit, used to calculate similarity scores between the vector representations of the entity text fragments in the unstructured text and the vector representations of the entities in the knowledge base through vector similarity calculation, rank them by similarity score, and select the candidate entity whose score exceeds a threshold and ranks highest as the entity linking the unstructured text to the knowledge base.
The invention also provides an entity linking device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the entity linking method when executing the computer program.
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the entity linking method.
According to the invention, special character marks are added before and after each entity to be linked in the unstructured text, so that the subsequently generated global sentence vector captures the feature representation of the entity currently being linked; in this way, the description information of the entities in the unstructured text can be generated.
According to the invention, the candidate-entity description information in the knowledge base is generated from templates, and the feature vector representation of each entity is obtained with the pre-trained BERT model, thereby yielding the vector representations of the candidate entities in the knowledge base.
One or more technical schemes provided by the invention at least have the following technical effects or advantages:
the method converts graph structure data such as attributes, relations and the like of the candidate entities in the knowledge base into text data, thereby fully utilizing a BERT pre-training model to learn the vector representation related to the context of the candidate entities on the text description information of the candidate entities, and avoiding the inconsistency of the entity feature vector representation under respective spaces caused by the difference between the graph data structure and the text data structure.
The invention enables the entity in the unstructured text and the entity in the knowledge base under the graph structure to be unified to the same space, and utilizes the pre-training model to carry out representation learning of the entity characteristics, thereby being beneficial to the accurate calculation of the similarity of the entity characteristics.
The invention does not need to construct the corpus of the distributed representation learning in advance, thereby having higher efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention;
FIG. 1 is a schematic diagram of a link between an entity in an unstructured document and an entity in a knowledge base;
FIG. 2 is a schematic diagram of an entity linking method architecture based on GraphToText;
FIG. 3 is a schematic diagram of the entity linking system.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflicting with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
Example one
FIG. 2 is a schematic diagram of the GraphToText-based entity linking method architecture. The embodiment of the invention provides an end-to-end entity linking method based on GraphToText. First, text description information is generated for each entity with the GraphToText method according to the entity's attributes and the structural information of the relation graph in the knowledge base. Then BERT is used to learn the global sentence vector representation of an entity's text description information as the vector representation of the entity, and at the same time BERT is used to learn the vector representations of the entity text fragments in the unstructured text. Finally, similarity scores between the vector representations of the entity text fragments in the text and the entity vector representations in the knowledge base are calculated through vector similarity calculation and ranked, and the candidate entity whose score exceeds a threshold and is highest is taken as the finally linked entity.
Step 1 entity identification in the document;
For an input document D = {w1, w2, …, wl}, its words come from the vocabulary, wi ∈ W^v, i = 1, …, l, where l indicates the number of document words, v indicates the size of the vocabulary, and W indicates the vocabulary space. A vector representation sequence of the document word sequence is then obtained through the BERT pre-trained language model: X = {x1, x2, …, xl}, xi ∈ R^d, i = 1, 2, …, l, where xi is a d-dimensional vector in real space representing the vector representation of the i-th word, and R represents the real number space.
The sequence embedding X is input into a Transformer encoder to obtain the word embedding representation of the entity:
Xe = Transformer_Encoder(X)
where Xe indicates the entity embedding information after encoding by the Transformer encoder. Xe is passed through a linear transformation and input into a Softmax classifier for entity classification:
Oe = Linear(Xe)
Pe = Softmax(Oe)
where Oe ∈ R^(n×c) represents the embedding of Xe obtained after the linear transformation, c represents the number of classes of entity BIO tags, and Pe ∈ R^(n×c) represents the probability distribution of the entity classification; the subscript index of the item with the maximum probability value is taken as the index of the entity class label of the current character. Finally, the detected entity text fragment set E = {e1, e2, …, en} is obtained by decoding the BIO tags of the sequence, where E represents the set of entity text fragments and ei (i = 1, 2, …, n) represents the text fragment [wi0, wi1, …, wik] of the i-th entity in the text, with k the length of the entity text fragment.
Step 2, generating candidate entities of the entities in the document in the knowledge base;
and (3) for the entity in the detected entity text fragment set E, carefully constructing the entity name, such as adding an alias dictionary of the entity name, generating shorthand of the name by using a name shorthand rule aiming at the entity names of different ontology types, and querying a candidate entity set G { G ═ G in a knowledge base by fuzzy matching according to the entity name set finally generated by the current entity according to the dictionary and the rule1,G2,…,GnTherein of
Figure RE-GDA0002998634270000071
Representing the candidate entity set of the ith entity in the document in the knowledge base, and m represents the number of candidate entities in the knowledge base of each entity text segment.
Step 3, generating the context description information of the entity in the document;
For each entity in the document, corresponding description information T = {T1, T2, …, Tn} is generated according to the entity's context, where n is the number of entities in the document. The context sentence containing the entity is taken as the entity's description information, and, to distinguish the descriptions of different entities in the same sentence, special entity alignment marker characters [LINK] are added before and after each entity. For example, for the sentence:
In the third year of the Tianbao era of the Tang dynasty, Li Bai came to the eastern capital Luoyang. Here he met Du Fu.
the following description information is generated for the entities "Tang dynasty", "Li Bai", "Luoyang", and "Du Fu":
(1) Description information for "Tang dynasty": In the third year of the Tianbao era of the [LINK]Tang dynasty[LINK], Li Bai came to the eastern capital Luoyang. Here he met Du Fu.
(2) Description information for "Li Bai": In the third year of the Tianbao era of the Tang dynasty, [LINK]Li Bai[LINK] came to the eastern capital Luoyang. Here he met Du Fu.
(3) Description information for "Luoyang": In the third year of the Tianbao era of the Tang dynasty, Li Bai came to the eastern capital [LINK]Luoyang[LINK]. Here he met Du Fu.
(4) Description information for "Du Fu": In the third year of the Tianbao era of the Tang dynasty, Li Bai came to the eastern capital Luoyang. Here he met [LINK]Du Fu[LINK].
Step 4, generating texts of the candidate entity description information in the knowledge base;
For an entity ei in the document and its candidate entity set Gi in the knowledge base, a preset template for the entity's ontology type is used for each candidate entity to generate the description information S = {S1, S2, …, Sn}, Si = {si1, …, sim}, where S represents the description information of the knowledge-base candidate entity sets of the entities in the document, Si represents the description information of the candidate entities of the i-th entity of the document in the knowledge base, sij represents the description information of the j-th knowledge-base candidate entity of the i-th entity in the document, n represents the number of entities in the document, and m represents the size of the knowledge-base candidate entity set of each entity in the document. For example, the preset description template for entities of the person type is:
[person name], whose occupation is [occupation attribute value], lived in the [era attribute value], and was a friend of [the names of all friend-relation tail entities].
For the person-type entity Li Bai in the knowledge base, the following description information can be generated:
Li Bai, whose occupation is poet, lived in the Tang dynasty, and was a friend of Du Fu and others.
Step 5, vectorization representation learning of the entities in the documents and the candidate entities in the knowledge base;
The description information T = {T1, T2, …, Tn} of the document entities is input into the BERT model to obtain the text-sequence vector representation of each piece of description information:
XT = BERT(T), XT ∈ R^(n×t×d)
where t represents the maximum length of the description text of a document entity and d represents the word vector dimension of the BERT output. The text-sequence vector representation of the entity description information is then passed through a MaxPool1D layer to obtain the global vector representation of the sentence:
ui = MaxPool1D(XTi), ui ∈ R^d
where ui represents the global vector representation of the sentence describing the i-th entity.
The knowledge-base candidate entity description information S = {S1, S2, …, Sn} is likewise input into BERT, and the sequence vector representation obtained is:
XS = BERT(S), XS = {XS1, …, XSn}, XSi = {xSi1, …, xSim}
where XS represents the text-sequence vector representation matrix obtained after the description information of each entity in the knowledge-base candidate entity set S passes through the pre-trained BERT model, XSi represents the text-sequence vector representation matrix formed by the candidate-entity-set description information of the i-th entity in the knowledge base, and xSij represents the text-sequence vector representation of the description information sij of the j-th knowledge-base candidate entity of the i-th entity in the text. The text-sequence vector representations of the knowledge-base candidate entity description information are then passed through a MaxPool1D pooling layer whose input and output channels are both d, whose convolution-kernel height is 1, and whose kernel width equals the length of the text sequence, to extract the global vector representation of the sentence:
vij = MaxPool1D(xSij), vij ∈ R^d
where vij represents the global vector representation of the description information of the j-th knowledge-base candidate entity of the i-th entity in the text.
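Stripped of the framework details, the MaxPool1D global-vector step is an element-wise maximum over the token axis; a plain-Python sketch:

```python
def max_pool_sentence(token_vectors):
    """Collapse a t x d sequence of word vectors into one d-dimensional
    sentence vector by taking the maximum over the token axis."""
    dim = len(token_vectors[0])
    return [max(tok[j] for tok in token_vectors) for j in range(dim)]
```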
Step 6, calculating the similarity scores between the entities in the document and the candidate entities in the knowledge base;

the similarity between an entity in the text and the description information of each of its candidate entities in the knowledge base is calculated by a similarity calculation method:

CS(i, j) = Similarity(h^Ti, h^Sij)

wherein CS(i, j) represents the similarity score between the description information of the ith entity in the text and the description information of its jth candidate entity in the knowledge base, and Similarity represents a pairwise similarity calculation method. For example, h^Ti and h^Sij are concatenated and input into a multilayer perceptron, and the output of the multilayer perceptron is passed through a Sigmoid function to obtain a probability value as the similarity score: a score below 0.5 indicates that the two are not linked, and a score above 0.5 indicates that the two may be linked. The threshold in this embodiment can be adjusted according to the actual situation and is not specifically limited by the invention.
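As a sketch of the perceptron-and-Sigmoid scorer described above, the fragment below concatenates the two global vectors, applies a single linear layer, and squashes the result with a Sigmoid. The weights are illustrative placeholders; a real scorer would be a trained multilayer perceptron, not one hand-weighted layer:

```python
import math

def mlp_sigmoid_score(h_t, h_s, weights, bias):
    """Concatenate h^Ti and h^Sij, apply one linear layer, then Sigmoid.

    A single linear layer stands in for the multilayer perceptron; the
    returned value is a probability-like similarity score in (0, 1).
    """
    features = h_t + h_s  # list concatenation plays the role of vector splicing
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative 2-dimensional vectors and hand-picked weights.
score = mlp_sigmoid_score([0.3, 0.7], [0.4, 0.2], [1.0, 1.0, 1.0, 1.0], 0.0)
```

A score above 0.5, as here, would mark the pair as possibly linked under the embodiment's default threshold.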
Step 7, ranking the similarity scores of the candidate entities in the knowledge base;

for the ith entity in the text, the similarity scores of the entity description information of its knowledge base candidate entity set are sorted, and the entity whose similarity score exceeds the set threshold and is the highest is selected as the entity finally linked in the knowledge base:

max_i = argmax_j CS(i, j), i = 1, 2, …, n, j = 1, 2, …, m

wherein max_i is the final linked entity of entity i.
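The threshold-and-argmax selection of Step 7 can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
def link_entity(scores, threshold=0.5):
    """Return the index j = argmax_j CS(i, j) if the best score exceeds
    the threshold, otherwise None (no knowledge base entity is linked)."""
    if not scores:
        return None
    best_j = max(range(len(scores)), key=lambda j: scores[j])
    return best_j if scores[best_j] > threshold else None

# Candidate scores for one text entity: index 0 wins and exceeds 0.5.
chosen = link_entity([0.90, 0.45])    # -> 0
rejected = link_entity([0.40, 0.30])  # -> None, nothing above the threshold
```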
Example two
The second embodiment of the present invention provides an entity linking method, which specifically includes:
For the sentence "In the third year of the Tianbao era of the Tang dynasty, Li Bai arrived at the Eastern Capital Luoyang. Here, he met Du Fu.":

The first step: identify the entities "Li Bai", "Luoyang" and "Du Fu" in the sentence.

The second step: searching the knowledge base for candidate entities by the entity name "Li Bai" in the document yields the Tang dynasty poet "Li Bai" and the organization member "Li Bai".
The third step: generate the context description information of each entity in the text, namely adding the special character [LINK] around the text fragment of the currently linked entity as a mark:

Tang dynasty: In the third year of the Tianbao era of the [LINK] Tang dynasty [LINK], Li Bai arrived at the Eastern Capital Luoyang. Here, he met Du Fu.
Li Bai: In the third year of the Tianbao era of the Tang dynasty, [LINK] Li Bai [LINK] arrived at the Eastern Capital Luoyang. Here, he met Du Fu.
Luoyang: In the third year of the Tianbao era of the Tang dynasty, Li Bai arrived at the Eastern Capital [LINK] Luoyang [LINK]. Here, he met Du Fu.
Du Fu: In the third year of the Tianbao era of the Tang dynasty, Li Bai arrived at the Eastern Capital Luoyang. Here, he met [LINK] Du Fu [LINK].
The fourth step: using the attribute and relationship information of the two "Li Bai" entities in the knowledge base, generate entity description information according to the template specific to each entity type:

Entity description information of the Tang dynasty poet "Li Bai": Li Bai's occupation is poet, his friend is Du Fu, and his dynasty is the Tang dynasty.

Entity description information of the organization member "Li Bai": Li Bai's occupation is organization member, his birthplace is place A, and his era is a certain era.
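The type-specific templates of the fourth step can be sketched as keyed format strings. The type names, field names, and template wording below are hypothetical stand-ins for the knowledge base's actual ontology types and attributes:

```python
# Hypothetical per-ontology-type templates; a real knowledge base would
# supply one template per entity type.
TEMPLATES = {
    "poet": "{name}'s occupation is {occupation}, his friend is {friend}, "
            "and his dynasty is the {dynasty} dynasty.",
    "organization member": "{name}'s occupation is {occupation}, "
                           "and his birthplace is {birthplace}.",
}

def describe_entity(entity):
    """Fill the template for the entity's ontology type with its
    attribute and relation values from the knowledge base."""
    return TEMPLATES[entity["type"]].format(**entity)

poet = {"type": "poet", "name": "Li Bai", "occupation": "poet",
        "friend": "Du Fu", "dynasty": "Tang"}
description = describe_entity(poet)
```

The resulting sentence then serves as the candidate entity's description information fed into the BERT encoder in the next step.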
The fifth step: learn vector representations of the entity in the document and the candidate entities in the knowledge base.

Compute the feature representation of the entity "Li Bai" in the document: the character sequence of the marked sentence ["Tang", "dynasty", …, "[LINK]", "Li", "Bai", "[LINK]", …, "."] is passed through the BERT model and a MaxPool1D layer to obtain the feature vector representation A of the sentence.

Compute the description information feature representations of the candidate entities, the Tang dynasty poet "Li Bai" and the organization member "Li Bai", in the knowledge base:

a. first, the character sequence of the description ["Li", "Bai", "occupation", "is", "poet", …, "Tang", "dynasty"] is passed through the BERT model and MaxPool1D to obtain the entity description vector representation B of the Tang dynasty poet "Li Bai" in the knowledge base;

b. then the character sequence ["Li", "Bai", "occupation", "is", "organization", "member", …, "a", "certain", "era"] is passed through the BERT model and MaxPool1D to obtain the entity description vector representation C of the organization member "Li Bai" in the knowledge base.
The sixth step: calculate the similarity scores between the entity in the document and the candidate entities in the knowledge base. Using the similarity calculation method of a multilayer perceptron and a Sigmoid classifier, calculate the similarity score between the vector representation A of the entity "Li Bai" in the document and the vector representation B of the Tang dynasty poet "Li Bai" in the knowledge base, and between A and the vector representation C of the organization member "Li Bai". The computed similarity score for the Tang dynasty poet is 0.90, and that for the organization member is 0.45.

The seventh step: rank the similarity scores of the candidate entities in the knowledge base, sort the scores obtained in the sixth step, and select the entity whose score exceeds the threshold and is the highest as the finally linked entity. Based on the output probability of the Sigmoid function, the preset similarity score threshold is 0.5; the similarity score of the Tang dynasty poet "Li Bai" is the highest and exceeds the threshold 0.5, so this entity is the entity finally linked to the entity "Li Bai" in the text. The threshold in this embodiment can be adjusted according to the actual situation and is not specifically limited by the invention.
EXAMPLE III
Fig. 3 is a schematic composition diagram of an entity linking system. A third embodiment of the present invention provides an entity linking system, where the system includes:

a text description information generating unit for entities in the knowledge base, configured to generate text description information of an entity in the knowledge base according to the attributes of the entity and the structural information of its relational graph;

a vector representation obtaining unit for entities in the knowledge base, configured to obtain a sentence global vector representation of the text description information of an entity in the knowledge base and take the sentence global vector representation as the vector representation of the entity;

a vector representation obtaining unit for entity text fragments in the unstructured text, configured to obtain the vector representation of an entity text fragment in the unstructured text;

and a linked entity obtaining unit, configured to calculate similarity scores between the vector representation of the entity text fragment in the unstructured text and the vector representations of the entities in the knowledge base through vector similarity calculation, sort the results based on the similarity scores, and select the candidate entity whose score exceeds the threshold and is the highest in the sorted results as the entity linking the unstructured text with the knowledge base.
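A minimal skeleton of the four units described above, with hypothetical callables standing in for the description generator, the encoder, and the similarity scorer (class and parameter names are illustrative, not from the invention):

```python
class EntityLinkSystem:
    """Sketch of the entity linking system: the units are modelled as
    injected callables rather than trained components."""

    def __init__(self, describe, encode, score, threshold=0.5):
        self.describe = describe  # text description information generating unit
        self.encode = encode      # vector representation obtaining unit
        self.score = score        # similarity calculation used by the linking unit
        self.threshold = threshold

    def link(self, mention_vector, kb_entities):
        """Linked entity obtaining unit: score, sort, apply the threshold."""
        scored = [(entity,
                   self.score(mention_vector, self.encode(self.describe(entity))))
                  for entity in kb_entities]
        if not scored:
            return None
        scored.sort(key=lambda pair: pair[1], reverse=True)
        best_entity, best_score = scored[0]
        return best_entity if best_score > self.threshold else None

# Toy wiring: "vectors" are description lengths and the score is exact match.
system = EntityLinkSystem(
    describe=lambda e: e,                      # entity name as its own description
    encode=len,                                # "vector" = description length
    score=lambda m, v: 1.0 if m == v else 0.0,
)
linked = system.link(4, ["aa", "bbbb"])        # -> "bbbb"
```

The dependency-injected design mirrors the unit decomposition of the third embodiment: each unit can be replaced (e.g. by a BERT encoder and an MLP scorer) without changing the linking logic.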
Example four
The fourth embodiment of the present invention provides an entity linking apparatus, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the entity linking method when executing the computer program.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory can be used for storing the computer program and/or modules, and the processor implements the various functions of the entity linking apparatus by running or executing the computer program and/or modules and the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
EXAMPLE five
An embodiment five of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the entity linking method are implemented.
The entity linking apparatus, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the embodiments of the present invention may also be implemented by a computer program stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the above method embodiments can be implemented. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, a carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the jurisdiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. An entity linking method, characterized in that the method comprises:
generating text description information of the entities in the knowledge base according to the attributes of the entities in the knowledge base and the structural information of the relational graph;
obtaining sentence global vector representation of text description information of an entity in a knowledge base, and taking the sentence global vector representation as vector representation of the entity in the knowledge base;
obtaining vector representation of entity text fragments in the unstructured text;
calculating similarity scores of vector representation of entity text segments in the unstructured text and vector representation of entities in the knowledge base through vector similarity calculation, sorting based on similarity score results, and selecting candidate entities with scores larger than a threshold value and highest scores in the sorting results as entities of the unstructured text linked with the knowledge base.
2. The entity linking method according to claim 1, wherein in the method, a BERT model is used to obtain a sentence global vector representation of textual description information of an entity in the knowledge base, and a BERT model is used to obtain a vector representation of a text fragment of the entity in the unstructured text.
3. The entity linking method according to claim 1, wherein the method includes identifying entities in unstructured text, and specifically includes:
the unstructured text D = {w1, w2, …, wl} obtains words from a vocabulary library, wi ∈ W^v, i = 1, …, l, where l represents the number of words in the text D, v represents the size of the vocabulary library, and W represents the vocabulary library space;

obtaining a vector representation sequence of the word sequence in the text D by using the BERT pre-training language model: X = {x1, x2, …, xl}, xi ∈ R^d, i = 1, 2, …, l, where xi is a d-dimensional vector in real number space representing the vector representation of the ith word, and R represents the real number space;
the sequence embedded representation X is input into a pre-set encoder, resulting in a word embedded representation of the entity.
4. The entity linking method of claim 3, wherein the preset encoder is a Transformer encoder:

X_e = Transformer_Encoder(X)

wherein X_e represents the entity embedded representation information encoded by the Transformer encoder;

X_e is passed through a linear transformation and input into a Softmax classifier for entity classification:

O_e = Linear(X_e)

P_e = Softmax(O_e)

wherein O_e ∈ R^{n×c} represents the embedded representation obtained after the linear transformation of X_e, c represents the number of classes of the entity BIO tags, and P_e ∈ R^{n×c} represents the probability distribution over the entity classes; the subscript of the maximum probability value in the probability distribution is taken as the entity class label index of the current character;

the detected entity text fragment set E = {e1, …, en} is obtained by BIO decoding of the tag sequence, where E represents the set of entity text fragments and ei (i = 1, 2, …, n) represents the text fragment [w_i0, …, w_ik] of the ith entity in the unstructured text, where k represents the length of the entity text fragment.
5. The entity linking method according to claim 4, wherein the method includes generating candidate entities of the entities in the unstructured text in the knowledge base, and specifically includes:
constructing entity names for entities in the set E to obtain a final entity name set;
querying the knowledge base by fuzzy matching based on the final entity name set to obtain a candidate entity set G = {G1, …, Gn}, wherein G_i = {G_i^1, …, G_i^m}, and m represents the number of knowledge base candidate entities of each entity text fragment.
6. The entity linking method according to claim 5, wherein the entity names are constructed as follows: adding an alias dictionary of entity names, generating shorthand forms of the names by name shorthand rules for entity names of different ontology types, and finally generating the entity name set of each entity according to the alias dictionary and the shorthand rules.
7. The entity linking method according to claim 1, wherein the method includes generating context description information of the entity in the unstructured text, and specifically includes:
for each entity in the unstructured text, generating corresponding description information T = {T1, T2, …, Tn} according to the context environment of the entity, where n is the number of entities in the unstructured text;
the method for generating the corresponding description information according to the context environment of the entity is as follows: and intercepting a context sentence where the entity is located as description information of the entity, and adding preset marker characters before and after each entity.
8. The entity linking method according to claim 4, wherein the method comprises generating candidate entity description information in a knowledge base, and specifically comprises:
for an entity e_i in the unstructured text and its candidate entity set G_i in the knowledge base, generating the description information of each candidate entity G_i^j by using a preset template P_type of the ontology type according to the ontology type of the entity, obtaining the description information S = {S1, S2, …, Sn}, S_i = {S_i^1, S_i^2, …, S_i^m}, wherein S represents the description information of the knowledge base candidate entity sets of the entities in the unstructured text, S_i represents the description information of the knowledge base candidate entities of the ith entity in the unstructured text, S_i^j represents the description information of the jth knowledge base candidate entity of the ith entity in the unstructured text, n represents the number of entities in the unstructured text, and m represents the number of knowledge base candidate entities of an entity in the unstructured text.
9. The entity linking method of claim 1, wherein the vectorized representation of the entities in the unstructured text and the candidate entities in the knowledge base comprises:
the description information T = {T1, T2, …, Tn} of the entities in the unstructured text is input into the BERT model to obtain the text sequence vector representations of the respective description information:

X^T = {X^T1, X^T2, …, X^Tn}, X^Ti ∈ R^{t×d}

wherein t represents the maximum length of the description information text of an entity in the unstructured text, and d represents the dimension of the word vectors output by BERT;

a global vector representation of the sentence is obtained from the text sequence vector representation of the entity description information through a pooling layer:

h^Ti = MaxPool1D(X^Ti)

wherein h^Ti represents the global vector representation of the sentence;

the knowledge base candidate entity description information S = {S1, S2, …, Sn} is input into BERT to obtain the sequence vector representations:

X^S = {X^S1, X^S2, …, X^Sn}, X^Si = {X^Si1, X^Si2, …, X^Sim}, X^Sij = {x_1^Sij, x_2^Sij, …, x_t^Sij}

wherein X^S represents the text sequence vector representation matrices obtained after the description information of each entity in the knowledge base candidate entity set S passes through the BERT pre-training model; X^Si represents the text sequence vector representation matrix formed by the candidate entity set description information of the ith entity in the unstructured text; X^Sij represents the text sequence vector representation of the description information of the jth knowledge base candidate entity of the ith entity in the unstructured text; x_k^Sij represents the vector representation of the kth word in that description information;

the global vector representation of the sentence is extracted from the text sequence vector representation of the knowledge base candidate entity description information through a pooling layer:

h^Sij = MaxPool1D(X^Sij)

wherein h^Sij represents the global vector representation of the description information of the jth knowledge base candidate entity of the ith entity in the unstructured text.
10. The entity linking method according to claim 1, wherein the method comprises:
calculating the similarity between an entity in the unstructured text and the description information of each of its candidate entities in the knowledge base by a similarity calculation method:

CS(i, j) = Similarity(h^Ti, h^Sij)

wherein CS(i, j) represents the similarity score between the description information of the ith entity in the unstructured text and the description information of its jth candidate entity in the knowledge base, h^Ti represents the global vector representation of the sentence, h^Sij represents the global vector representation of the description information of the jth knowledge base candidate entity of the ith entity in the unstructured text, and Similarity represents a pairwise similarity calculation method;

for the ith entity in the unstructured text, the similarity scores of the entity description information of its knowledge base candidate entity set are sorted, and the entity whose similarity score exceeds the set threshold and is the highest is selected as the entity finally linked in the knowledge base:

max_i = argmax_j CS(i, j), i = 1, 2, …, n, j = 1, 2, …, m

wherein max_i is the final linked entity of entity i, n represents the number of entities in the unstructured text, and m represents the number of knowledge base candidate entities of an entity in the unstructured text.
CN202011581122.6A 2020-12-28 2020-12-28 Entity linking method Pending CN112966117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011581122.6A CN112966117A (en) 2020-12-28 2020-12-28 Entity linking method


Publications (1)

Publication Number Publication Date
CN112966117A true CN112966117A (en) 2021-06-15

Family

ID=76271500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011581122.6A Pending CN112966117A (en) 2020-12-28 2020-12-28 Entity linking method

Country Status (1)

Country Link
CN (1) CN112966117A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940702A (en) * 2016-01-05 2017-07-11 富士通株式会社 Entity refers to the method and apparatus with entity in semantic knowledge-base in connection short text
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN108959258A (en) * 2018-07-02 2018-12-07 昆明理工大学 It is a kind of that entity link method is integrated based on the specific area for indicating to learn
CN109408743A (en) * 2018-08-21 2019-03-01 中国科学院自动化研究所 Text link embedding grammar
CN109933785A (en) * 2019-02-03 2019-06-25 北京百度网讯科技有限公司 Method, apparatus, equipment and medium for entity associated
CN111428507A (en) * 2020-06-09 2020-07-17 北京百度网讯科技有限公司 Entity chain finger method, device, equipment and storage medium


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341531A (en) * 2023-02-28 2023-06-27 人民网股份有限公司 Knowledge-driven character information extraction and inspection method and device
CN116341531B (en) * 2023-02-28 2023-10-10 人民网股份有限公司 Knowledge-driven character information extraction and inspection method and device
CN116049447A (en) * 2023-03-24 2023-05-02 中科雨辰科技有限公司 Entity linking system based on knowledge base
CN116049447B (en) * 2023-03-24 2023-06-13 中科雨辰科技有限公司 Entity linking system based on knowledge base
CN116306504A (en) * 2023-05-23 2023-06-23 匀熵智能科技(无锡)有限公司 Candidate entity generation method and device, storage medium and electronic equipment
CN116306504B (en) * 2023-05-23 2023-08-08 匀熵智能科技(无锡)有限公司 Candidate entity generation method and device, storage medium and electronic equipment
CN116756295A (en) * 2023-08-16 2023-09-15 北京盛通知行教育科技集团有限公司 Knowledge base retrieval method, device and storage medium
CN116756295B (en) * 2023-08-16 2023-11-03 北京盛通知行教育科技集团有限公司 Knowledge base retrieval method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 610042 No. 270, floor 2, No. 8, Jinxiu street, Wuhou District, Chengdu, Sichuan

Applicant after: Chengdu shuzhilian Technology Co.,Ltd.

Address before: No.2, floor 4, building 1, Jule road crossing, Section 1, West 1st ring road, Wuhou District, Chengdu City, Sichuan Province 610041

Applicant before: CHENGDU SHUZHILIAN TECHNOLOGY Co.,Ltd.