CN112966117A - Entity linking method - Google Patents
- Publication number: CN112966117A (application CN202011581122.6A)
- Authority
- CN
- China
- Prior art keywords
- entity
- text
- knowledge base
- description information
- vector representation
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
The invention discloses an entity linking method in the field of natural language processing, which comprises the following steps: generating text description information for the entities in a knowledge base according to the attributes of those entities and the structural information of the relation graph; obtaining a sentence global vector representation of each entity's text description information and using it as the vector representation of the entity in the knowledge base; obtaining vector representations of the entity text fragments in the unstructured text; calculating similarity scores between the vector representations of entity text fragments in the unstructured text and the vector representations of entities in the knowledge base through vector similarity calculation, sorting the scores, and selecting the candidate entity whose score is both above a threshold and highest in the ranking as the entity linking the unstructured text to the knowledge base. The invention avoids the inconsistency of entity feature vectors represented in separate spaces, which is caused by the difference between the graph data structure and the text data structure, and thereby supports accurate calculation of entity feature similarity.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to an entity linking method.
Background
Entity linking is a key method for knowledge fusion in knowledge graphs, and is mainly responsible for linking entities extracted from various unstructured documents to entities that already exist in a knowledge base. It is widely applied across the fields of natural language processing, for example in knowledge base construction and knowledge-base question answering. Consider the following sentence in a document:
In the third year of the Tianbao era of the Tang dynasty, Li Bai came to the eastern capital Luoyang. Here he met Du Fu.
Entity linking entails linking the entities "Tang dynasty", "Li Bai", "Luoyang", and "Du Fu" in the sentence to entities in an already existing knowledge base.
The entity linking method mainly comprises the following four steps:
(1) Detect possible entity text fragments in the sentence, such as "Li Bai", "Luoyang", and "Du Fu" in the sentence above.
(2) For each detected entity text fragment, search the knowledge base for several candidate entities using a retrieval algorithm; for "Li Bai", for example, the knowledge base may contain both the Tang dynasty poet Li Bai and a Li Bai who is a member of some organization.
(3) Learn distributed embedded representations of the entity text fragments in the document and the candidate entities in the knowledge base, then compute similarity scores between them.
(4) Sort the candidate entities in the knowledge base by similarity, and take the candidate entity whose similarity is both above a certain threshold and highest as the finally linked entity.
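The patent text contains no code; as a toy, end-to-end illustration of the four steps above, the following sketch uses invented names, a two-entry mini knowledge base, and a naive word-overlap scorer in place of the learned vector similarity described later:

```python
# Invented mini knowledge base and naive scorer -- for illustration only;
# the patent's method uses BERT vectors and a learned similarity (steps 3-4).

def detect_mentions(sentence, known_surface_forms):
    # Step (1): naive mention detection by substring lookup.
    return [m for m in known_surface_forms if m in sentence]

def generate_candidates(mention, knowledge_base):
    # Step (2): candidate entities whose name contains the mention string.
    return [eid for eid, e in knowledge_base.items() if mention in e["name"]]

def score(mention_context, entity_description):
    # Step (3): stand-in for vector similarity -- word-overlap (Jaccard) ratio.
    a, b = set(mention_context.split()), set(entity_description.split())
    return len(a & b) / max(len(a | b), 1)

def link(mention, sentence, knowledge_base, threshold=0.1):
    # Step (4): rank candidates; keep the best one above the threshold.
    scored = sorted(
        (score(sentence, knowledge_base[c]["desc"]), c)
        for c in generate_candidates(mention, knowledge_base)
    )
    if scored and scored[-1][0] > threshold:
        return scored[-1][1]
    return None

kb = {
    "poet":   {"name": "Li Bai (poet)",       "desc": "Tang dynasty poet friend of Du Fu"},
    "member": {"name": "Li Bai (org member)", "desc": "member of an organization"},
}
sentence = "Li Bai the Tang dynasty poet came to Luoyang and met Du Fu"
```

Here the poet candidate wins because its description overlaps the mention's sentence context, mirroring how the later vector-similarity ranking behaves.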
The main difference among current mainstream entity linking methods lies in how the entity text fragments and the candidate entities in the knowledge base are vectorized. Traditional methods use distributed representation learning over related corpora to estimate the probability of each candidate entity given the entity text fragment mention, and select the K entities with the highest probability as candidates.
Disclosure of Invention
For the knowledge graph entity linking task, the invention provides an efficient end-to-end method based on GraphToText. GraphToText converts the graph structure information of entity nodes in the knowledge base into text description information, which supports context semantic representation learning for the entity nodes in the graph; the method can be widely applied to entity linking tasks between unstructured text and a knowledge base.
In order to achieve the above object, the present invention provides an entity linking method, including:
generating text description information of the entities in the knowledge base according to the attributes of the entities in the knowledge base and the structural information of the relational graph;
obtaining sentence global vector representation of text description information of an entity in a knowledge base, and taking the sentence global vector representation as vector representation of the entity in the knowledge base;
obtaining vector representation of entity text fragments in the unstructured text;
calculating similarity scores between the vector representations of entity text fragments in the unstructured text and the vector representations of entities in the knowledge base through vector similarity calculation, sorting based on the similarity scores, and selecting the candidate entity whose score is above a threshold and highest in the ranking as the entity linking the unstructured text to the knowledge base.
The principle of the invention is as follows: the graph structure information of the entity nodes in the knowledge base is converted into text description information of the entities using an entity text description template; then a pre-trained language model is used to learn distributed vector representations of the context of the entity text fragments in the unstructured text and of the entity node text description information in the knowledge base; finally, the similarity score between the two vectors is calculated using a pairwise vector similarity calculation method, and the score is taken as the basis for judging whether the two entities are linked.
Preferably, in the method, a BERT model is used for obtaining sentence global vector representation of text description information of an entity in a knowledge base, and a BERT model is used for obtaining vector representation of an entity text segment in an unstructured text.
Preferably, the method includes identifying entities in the unstructured text, and specifically includes:
the words of the unstructured text D = {w_1, w_2, …, w_l} are drawn from the vocabulary, w_i ∈ W^v, i = 1, …, l, where l denotes the number of words in the text D, v denotes the size of the vocabulary, and W denotes the vocabulary space;
the vector representation sequence of the word sequence of text D is obtained using the BERT pre-trained language model: X = {x_1, x_2, …, x_l}, x_i ∈ R^d, i = 1, 2, …, l, where x_i is a d-dimensional vector in real space denoting the vector representation of the i-th word, and R denotes the real number space;
the sequence embedded representation X is input into a pre-set encoder, resulting in a word embedded representation of the entity.
Preferably, the preset encoder in the method is a Transformer encoder:
X_e = Transformer_Encoder(X)
where X_e denotes the entity embedded representation information after encoding by the Transformer encoder;
X_e is passed through a linear transformation and input into a Softmax classifier for entity classification:
O_e = Linear(X_e)
P_e = Softmax(O_e)
where O_e ∈ R^(n×c) denotes the embedded representation obtained from X_e after the linear transformation, c denotes the number of classes of the entity BIO tags, and P_e ∈ R^(n×c) denotes the probability distribution of the entity classification; the index of the item with the maximum probability in the distribution is taken as the index of the entity classification label of the current character;
the detected entity text fragment set E = {e_1, e_2, …, e_n} is obtained by decoding the BIO tags of the sequence, where E denotes the set of entity text fragments and e_i (i = 1, 2, …, n) denotes the text fragment [w_i0, …, w_ik] of the i-th entity in the unstructured text, with k the length of the entity text fragment.
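The BIO decoding that produces the set E can be sketched as follows; the character-level tokens, the tag sequence, and the tolerance for stray I- tags are illustrative assumptions, not taken from the patent:

```python
def decode_bio(tokens, tags):
    """Collect maximal B-/I- runs into (start, end_inclusive, text) spans,
    yielding the entity text fragment set E described above."""
    entities, start = [], None
    for i, tag in enumerate(tags + ["O"]):           # sentinel flushes the last span
        if (tag.startswith("B") or tag == "O") and start is not None:
            entities.append((start, i - 1, "".join(tokens[start:i])))
            start = None
        if tag.startswith("B") or (tag.startswith("I") and start is None):
            start = i
    return entities

# Illustrative character-level tokens and BIO tags for the example sentence:
tokens = list("唐朝天宝三载李白到洛阳")
tags   = ["B", "I", "O", "O", "O", "O", "B", "I", "O", "B", "I"]
```

Running `decode_bio(tokens, tags)` yields the spans for "唐朝", "李白", and "洛阳", i.e. the detected entity text fragments.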
Preferably, the method includes generating candidate entities of the entities in the unstructured text in the knowledge base, and specifically includes:
constructing entity names for entities in the set E to obtain a final entity name set;
querying the knowledge base by fuzzy matching based on the final entity name set to obtain the candidate entity sets G = {G_1, …, G_n}, where G_i denotes the candidate entity set of the i-th entity text fragment in the knowledge base and m denotes the number of knowledge base candidate entities for each entity text fragment.
Preferably, the method for constructing the entity names is as follows: add an alias dictionary of entity names, generate shorthands of the names using name shorthand rules for the entity names of different ontology types, and finally generate the entity name set of each entity from the alias dictionary and the shorthand rules.
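A minimal sketch of this name-construction step, assuming a toy alias dictionary and a simple acronym-style shorthand rule for organization-type names (both invented for illustration):

```python
# Toy alias dictionary -- an illustrative assumption, not from the patent.
ALIAS_DICT = {"李白": ["李太白", "青莲居士"]}

def shorthand(name, ontology_type):
    # For organization-type names, generate an acronym-style shorthand.
    if ontology_type == "organization":
        words = name.split()
        if len(words) > 1:
            return ["".join(w[0] for w in words)]
    return []

def build_name_set(name, ontology_type):
    # Final entity name set = original name + aliases + shorthands.
    names = {name}
    names.update(ALIAS_DICT.get(name, []))
    names.update(shorthand(name, ontology_type))
    return names
```

The resulting name set is what the fuzzy-matching query against the knowledge base operates on.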
Preferably, the method includes generating context description information of the entity in the unstructured text, and specifically includes:
for each entity in the unstructured text, corresponding description information T = {T_1, T_2, …, T_n} is generated according to the entity's context, where n is the number of entities in the unstructured text;
the method for generating the corresponding description information according to the context environment of the entity is as follows: and intercepting a context sentence where the entity is located as description information of the entity, and adding preset marker characters before and after each entity.
Preferably, the method includes generating candidate entity description information in a knowledge base, and specifically includes:
for entity e in unstructured textiEntity candidate set G in a knowledge baseiUsing a preset template of the ontology type for each candidate entity according to the ontology type of the entityGenerating description information S of the entity; candidate entities Description information S ═ S1,S2,…,Sn},Wherein S represents the description information of each entity in the unstructured text in the candidate entity set in the knowledge base, SiDescription information representing candidate entities of the ith entity in the unstructured text in the knowledge base,and the description information of the jth candidate entity of the ith entity in the knowledge base in the unstructured text is represented, n represents the number of entities in the unstructured text, and m represents the number of the knowledge base candidate entity set of the entities in the unstructured text.
Preferably, the vectorized representation of the entity in the unstructured text and the candidate entity in the knowledge base in the method includes:
the description information T = {T_1, T_2, …, T_n} of the entities in the unstructured text is input into the BERT model to obtain the text sequence vector representation of each piece of description information, X_{T_i} = BERT(T_i), X_{T_i} ∈ R^(t×d),
where t denotes the maximum length of the description text of an entity in the unstructured text and d denotes the dimension of the word vectors output by BERT;
the global vector representation of each sentence is obtained from the text sequence vector representation of the entity description information through a pooling layer: V_{T_i} = MaxPool1D(X_{T_i}), V_{T_i} ∈ R^d;
the knowledge base candidate entity description information S = {S_1, S_2, …, S_n} is input into BERT to obtain the sequence vector representation X_S = BERT(S),
where X_S denotes the text sequence vector representation matrix obtained after the description information of each entity in the knowledge base candidate entity set S passes through the BERT pre-trained model; X_{S_i} denotes the text sequence vector representation matrix formed from the candidate entity set description information of the i-th entity of the unstructured text in the knowledge base; x_{S_i}^j denotes the text sequence vector representation of the description information of the j-th knowledge base candidate entity of the i-th entity in the unstructured text, whose rows are the word vector representations of that description text;
the global vector representation of each sentence is extracted from the text sequence vector representation of the knowledge base candidate entity description information through a pooling layer: V_{S_i}^j = MaxPool1D(x_{S_i}^j),
where V_{S_i}^j denotes the global vector representation of the description information of the j-th candidate entity in the knowledge base of the i-th entity in the unstructured text.
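The pooling step above reduces a (sequence length × d) matrix of token vectors to a single d-dimensional sentence vector via an element-wise maximum over the sequence; a minimal pure-Python sketch:

```python
def max_pool_1d(token_vectors):
    """token_vectors: list of equal-length vectors (one per token);
    returns the element-wise maximum -- the sentence global vector."""
    return [max(column) for column in zip(*token_vectors)]

# Three 4-dimensional token vectors -> one 4-dimensional sentence vector.
sent_vec = max_pool_1d([[0.1, 0.9, -0.2, 0.0],
                        [0.5, 0.3,  0.4, -1.0],
                        [0.2, 0.1,  0.0,  0.7]])
```

In the actual method the inputs would be BERT token vectors; the values here are arbitrary and only show the shape of the reduction.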
Preferably, the method further comprises:
and calculating the similarity between the entity in the unstructured text and the description information of each candidate entity in the knowledge base by using a similarity calculation method:
CS(i, j) = Similarity(V_{T_i}, V_{S_i}^j),
where CS(i, j) denotes the similarity score between the description information of the i-th entity in the unstructured text and the description information of its j-th candidate entity in the knowledge base, V_{T_i} denotes the global sentence vector representation of the i-th entity's description information, V_{S_i}^j denotes the global vector representation of the description information of the j-th candidate entity in the knowledge base of the i-th entity in the unstructured text, and Similarity denotes a related pairwise similarity calculation method;
for the ith entity in the unstructured text, sorting the entity description information similarity scores of the candidate entity set in the knowledge base, and selecting the entity with the similarity score larger than a set threshold value and the highest score as the entity finally linked to the knowledge base;
max_i = argmax_j CS(i, j), i = 1, 2, …, n, j = 1, 2, …, m; where max_i indexes the final linked entity of entity i, n denotes the number of entities in the unstructured text, and m denotes the size of the knowledge base candidate entity set of an entity in the unstructured text.
The invention also provides an entity linking system, which comprises:
the text description information generating unit of the entity in the knowledge base is used for generating text description information of the entity in the knowledge base according to the attribute and the structural information of the relational graph of the entity in the knowledge base;
a vector representation obtaining unit of the entity in the knowledge base, which is used for obtaining sentence global vector representation of the text description information of the entity in the knowledge base and taking the sentence global vector representation as the vector representation of the entity in the knowledge base;
the vector representation obtaining unit of the entity text fragment in the unstructured text is used for obtaining the vector representation of the entity text fragment in the unstructured text;
and the link entity obtaining unit is used for calculating the similarity scores of the vector representation of the entity text fragment in the unstructured text and the vector representation of the entity in the knowledge base through vector similarity calculation, sorting based on the similarity score results, and selecting the candidate entity with the score larger than the threshold value and the highest score in the sorting results as the entity for linking the unstructured text and the knowledge base.
The invention also provides an entity linking device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the entity linking method when executing the computer program.
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the entity linking method.
In the invention, special marker characters are added before and after each entity to be linked in the unstructured text, so that the feature representation of the entity currently being linked is captured when the subsequent sentence-level global vector is generated; in this way the description information of the entity in the unstructured text can be produced.
According to the method, the candidate entity description information in the knowledge base is generated through the template, the characteristic vector representation of the entity is obtained by utilizing the BERT pre-training model, and finally the candidate entity vector representation in the knowledge base is obtained.
One or more technical schemes provided by the invention at least have the following technical effects or advantages:
the method converts graph structure data such as attributes, relations and the like of the candidate entities in the knowledge base into text data, thereby fully utilizing a BERT pre-training model to learn the vector representation related to the context of the candidate entities on the text description information of the candidate entities, and avoiding the inconsistency of the entity feature vector representation under respective spaces caused by the difference between the graph data structure and the text data structure.
The invention enables the entity in the unstructured text and the entity in the knowledge base under the graph structure to be unified to the same space, and utilizes the pre-training model to carry out representation learning of the entity characteristics, thereby being beneficial to the accurate calculation of the similarity of the entity characteristics.
The invention does not need to construct the corpus of the distributed representation learning in advance, thereby having higher efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention;
FIG. 1 is a schematic diagram of a link between an entity in an unstructured document and an entity in a knowledge base;
FIG. 2 is a schematic diagram of an entity linking method architecture based on GraphToText;
FIG. 3 is a schematic diagram of the entity linking system.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflicting with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
Example one
FIG. 2 is a schematic diagram of the architecture of the GraphToText-based entity linking method. The embodiment of the invention provides an end-to-end entity linking method based on GraphToText: first, the text description information of each entity is generated with the GraphToText method according to the attributes of the entity in the knowledge base and the structural information of the relation graph; then BERT is used to learn the sentence global vector representation of the entity's text description information, which serves as the vector representation of the entity, while BERT is also used to learn the vector representations of the entity text fragments in the unstructured text. Finally, similarity scores between the vector representations of the entity text fragments in the text and the entity vector representations in the knowledge base are computed through vector similarity calculation and sorted, and the candidate entity whose score is above a threshold and highest is taken as the finally linked entity.
Step 1: entity recognition in the document.
For an input document D = {w_1, w_2, …, w_l}, its words are drawn from the vocabulary, w_i ∈ W^v, i = 1, …, l, where l denotes the number of words in the document, v denotes the size of the vocabulary, and W denotes the vocabulary space. A vector representation sequence of the document word sequence is then obtained through the BERT pre-trained language model: X = {x_1, x_2, …, x_l}, x_i ∈ R^d, i = 1, 2, …, l, where x_i is a d-dimensional vector in real space denoting the vector representation of the i-th word, and R denotes the real number space.
The sequence embedded representation X is input into a Transformer encoder to obtain a word embedded representation of the entity.
X_e = Transformer_Encoder(X)
where X_e denotes the entity embedded representation information after encoding by the Transformer encoder. X_e is passed through a linear transformation and input into a Softmax classifier for entity classification:
O_e = Linear(X_e)
P_e = Softmax(O_e)
where O_e ∈ R^(n×c) denotes the embedded representation obtained from X_e after the linear transformation, c denotes the number of classes of the entity BIO tags, and P_e ∈ R^(n×c) denotes the probability distribution of the entity classification; the index of the item with the maximum probability in the distribution is taken as the index of the entity classification label of the current character. Finally, the detected entity text fragment set E = {e_1, e_2, …, e_n} is obtained by decoding the BIO tags of the sequence, where E denotes the set of entity text fragments and e_i (i = 1, 2, …, n) denotes the text fragment [w_i0, w_i1, …, w_ik] of the i-th entity in the text, with k the length of the entity text fragment.
Step 2, generating candidate entities of the entities in the document in the knowledge base;
and (3) for the entity in the detected entity text fragment set E, carefully constructing the entity name, such as adding an alias dictionary of the entity name, generating shorthand of the name by using a name shorthand rule aiming at the entity names of different ontology types, and querying a candidate entity set G { G ═ G in a knowledge base by fuzzy matching according to the entity name set finally generated by the current entity according to the dictionary and the rule1,G2,…,GnTherein ofRepresenting the candidate entity set of the ith entity in the document in the knowledge base, and m represents the number of candidate entities in the knowledge base of each entity text segment.
Step 3, generating the context description information of the entity in the document;
for each entity in the document, corresponding description information T ═ T is generated for the entity according to the context environment of the entity1,T2,…,TnWherein n is the middle of the documentThe number of entities is determined by intercepting the context sentence where the entity is located as the description information of the entity, and adding special entity alignment marker characters [ LINK ] before and after each entity in order to distinguish the description information of different entities in the same sentence]For example, for a sentence:
three loads of the Tang dynasty Tianbao,Li Baito all eastLuoyang medicine for treating kidney-yang deficiency. Here he encountersDu Fu。
The entities "tang dynasty", "li white", "luoyang", and "dufu" of the above sentences generate the following entity description information for them:
(1) entity description information of "tang dynasty": [ LINK)]Tang dynasty [ LINK]The Tianbao three loads are carried by the patient,Li Baito all eastLuoyang medicine for treating kidney-yang deficiency. Here he encountersDu Fu。
(2) "entity description information of plum white: tang dynasty Tianbao three-year old, [ LINK ]]Li Bai[LINK]To all eastLuoyang medicine for treating kidney-yang deficiency. Here he encountersDu Fu。
(3) "entity description information of luoyang: three loads of the Tang dynasty Tianbao,Li Baito east all [ LINK ]]Luoyang medicine for treating kidney-yang deficiency[LINK]. Here he encountersDu Fu。
(4) Entity description information of "dupu": three loads of the Tang dynasty Tianbao,Li Baito all eastLuoyang medicine for treating kidney-yang deficiency. Here he encounters [ LINK ]] Du Fu[LINK]。
Step 4, generating texts of the candidate entity description information in the knowledge base;
for entity e in documentiEntity candidate set G in a knowledge baseiUsing a preset template of the ontology type for each candidate entity according to the ontology type of the entityGenerating description information S ═ { S ═ of an entity1,S2,…,Sn},Wherein S represents the description information of the candidate entity set of each entity in the document in the knowledge base, SiRepresenting documentsDescription information of candidate entities of the ith entity in the knowledge base,the description information of the jth candidate entity in the knowledge base of the ith entity in the document is represented, n represents the number of entities in the document, and m represents the number of the knowledge base candidate entity sets of the entities in the document. For example, the entity description information preset template for specifying the type of a person is:
[ name of person]The career is [ career attribute value ]]The age [ age attribute value]And [ all friendship tails
Entity name]And the like are friends.
For entity Liebai of the types of human beings in the knowledge base, the following entity description information can be generated:
libai, occupation is poetry, the age in Tang dynasty, and Dufu et al are friends.
Step 5, vectorization representation learning of the entities in the documents and the candidate entities in the knowledge base;
The description information T = {T_1, T_2, …, T_n} of the document entities is input into the BERT model to obtain the text sequence vector representation of each piece of description information, X_{T_i} = BERT(T_i), X_{T_i} ∈ R^(t×d),
where t denotes the maximum length of the description text of a document entity and d denotes the word vector dimension of the BERT output. The text sequence vector representation of the entity description information is then passed through MaxPool1D to obtain the global vector representation of each sentence: V_{T_i} = MaxPool1D(X_{T_i}), V_{T_i} ∈ R^d.
The knowledge base candidate entity description information S = {S_1, S_2, …, S_n} is input into BERT to obtain the sequence vector representation X_S = BERT(S),
where X_S denotes the text sequence vector representation matrix obtained after the description information of each entity in the knowledge base candidate entity set S passes through the BERT pre-trained model, X_{S_i} denotes the text sequence vector representation matrix formed from the candidate entity set description information of the i-th entity in the knowledge base, and x_{S_i}^j denotes the text sequence vector representation of the description information of the j-th knowledge base candidate entity of the i-th entity in the text, whose rows are the word vector representations of that description text. The text sequence vector representation of the knowledge base candidate entity description information is then passed through a MaxPool1D pooling layer whose input and output channels are both d, whose convolution kernel height is 1, and whose kernel width equals the length of the text sequence, extracting the global vector representation of each sentence: V_{S_i}^j = MaxPool1D(x_{S_i}^j),
where V_{S_i}^j denotes the global vector representation of the description information of the j-th candidate entity in the knowledge base of the i-th entity in the text.
Step 6, calculating the similarity degree scores of the entities in the document and the candidate entities in the knowledge base;
and calculating the similarity between the entity in the text and the description information of each candidate entity in the knowledge base thereof by using a similarity calculation method:
where CS (i, j) represents a Similarity score between the description information of the ith entity in the text and the description information of the jth candidate entity in the knowledge base, and Similarity represents a related pairwise Similarity calculation method, such asSplicing, inputting a multilayer perceptron, inputting the output of the multilayer perceptron to a Sigmoid function to obtain a probability value as a similarity score, wherein the score is less than 0.5 to indicate that the two are not linked, and the score is more than 0.5 to indicate that the two are possibly linked, wherein the threshold value in the embodiment can be adjusted according to the actual situation, and the invention is not specifically limited.
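The concatenate-perceptron-Sigmoid scoring just described can be sketched as below, degenerating the multilayer perceptron to a single fixed linear layer since no trained weights are given in the patent (the weights here are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_score(v_text, v_kb, weights, bias=0.0):
    """Concatenate the two global vectors, apply one linear layer
    (stand-in for the multilayer perceptron), then Sigmoid."""
    concat = v_text + v_kb                       # vector concatenation
    z = sum(w * x for w, x in zip(weights, concat)) + bias
    return sigmoid(z)                            # probability-valued score

# Arbitrary untrained weights; the pair counts as linked if score > 0.5.
score = pairwise_score([1.0, 0.0], [0.0, 1.0],
                       weights=[2.0, -1.0, -1.0, 2.0])
```

In the real method the weights are learned jointly with the rest of the model; the sketch only shows the shape of the computation and the 0.5 decision threshold.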
Step 7, ranking similarity degree scores of candidate entities in a knowledge base;
and for the ith entity in the text, sorting the entity description information similarity scores of the candidate entity set in the knowledge base, and selecting the entity with the similarity score larger than a set threshold value and the highest score as the entity finally linked to the knowledge base.
max_i = argmax_j CS(i, j), i = 1, 2, …, n, j = 1, 2, …, m, where max_i indexes the final linked entity of entity i.
Example two
The second embodiment of the present invention provides an entity linking method, which specifically includes:
for the sentence "three loads of Tianbao in Tang dynasty, Li Bai arrives at Dongdu Luoyang. Here, He encountered Dufu. ":
the first step is as follows: entities "li-white", "luyang", "dupu" in sentences are identified.
The second step is that: searching candidate entities in the knowledge base by the entity name "lisk" in the document has poetry of dynasty "lisk", the first organizer "lisk".
The third step: generating the context description information of each entity in the text, namely adding a special character [ LINK ] around the text fragment of the currently linked entity for marking:
tang dynasty: [ LINK ] Tang Dynasty [ LINK ] Tianbao Sanzhan, from Li Bai to Dongdu Luoyang. Here, He encountered Dufu.
Plum white: tang dynasty Tian Bao san Gao, and [ LINK ] Li Bai [ LINK ] to Dongdu Luo Yang. Here, He encountered Dufu.
Luoyang: tang dynasty Tian Bao san gan, Li Bai comes to Dongdu (LINK) Luoyang (LINK). Here, He encountered Dufu.
Dufu: tang dynasty Tian Bao san Li white to Dongdu Luo Yang. Here, he encountered [ LINK ] Dufu [ LINK ].
The fourth step: generate entity description information for the two knowledge-base entities named "Li Bai", using their attribute and relationship information and a template specific to each entity type:
Entity description information of the Tang-dynasty poet "Li Bai": the occupation of Li Bai is poet, the friend of Li Bai is Du Fu, and the dynasty of Li Bai is Tang.
Entity description information of the organization member "Li Bai": the occupation of Li Bai is organization member, the birthplace of Li Bai is place A, and the era of Li Bai is a certain era.
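The template step can be sketched as below; the template text, attribute names, and `describe` helper are illustrative assumptions, since the patent does not fix a concrete template format.

```python
# Sketch of the fourth step: each ontology type has a description template
# that is filled with the entity's attributes and relations from the
# knowledge base.

TEMPLATES = {
    "Person": "The occupation of {name} is {occupation}; "
              "the friend of {name} is {friend}; "
              "the dynasty of {name} is {dynasty}.",
}

def describe(entity_type, attrs):
    """Fill the ontology-type template with knowledge-base attributes."""
    return TEMPLATES[entity_type].format(**attrs)

poet = {"name": "Li Bai", "occupation": "poet",
        "friend": "Du Fu", "dynasty": "Tang"}
print(describe("Person", poet))
```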
The fifth step: learn vector representations of the entities in the document and of the candidate entities in the knowledge base.
Compute the feature representation of the entity "Li Bai" in the document: feed its marked context token sequence ["In", "the", …, "[LINK]", "Li", "Bai", "[LINK]", …, "."] through the BERT model and a MaxPool1D layer to obtain the sentence feature vector representation A.
Compute the description feature representations of the knowledge-base candidates, the Tang-dynasty poet "Li Bai" and the organization member "Li Bai":
a. first, feed the poet's description sequence ["The", "occupation", "of", "Li", "Bai", "is", "poet", …] through the BERT model and MaxPool1D to obtain the description vector representation B of the Tang-dynasty poet "Li Bai";
b. then, feed the organization member's description sequence ["The", "occupation", "of", "Li", "Bai", "is", "organization", "member", …] through the BERT model and MaxPool1D to obtain the description vector representation C of the organization member "Li Bai".
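The MaxPool1D reduction used above can be sketched with a toy token-embedding matrix standing in for BERT's output; the values are invented for illustration.

```python
# Sketch of sentence-vector extraction: a token-embedding matrix
# (shape [seq_len][dim], here a toy stand-in for BERT's output) is reduced
# to one global sentence vector by max pooling over the sequence axis.

def max_pool_1d(token_vectors):
    """Element-wise max over the sequence: [seq_len][dim] -> [dim]."""
    dim = len(token_vectors[0])
    return [max(vec[d] for vec in token_vectors) for d in range(dim)]

# Toy 4-token, 3-dimensional "BERT output".
tokens = [[0.1, 0.9, 0.2],
          [0.4, 0.1, 0.8],
          [0.3, 0.5, 0.1],
          [0.2, 0.2, 0.6]]
print(max_pool_1d(tokens))  # -> [0.4, 0.9, 0.8]
```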
The sixth step: compute similarity scores between the entity in the document and the candidate entities in the knowledge base. Using a similarity calculation module built from a multilayer perceptron and a Sigmoid classifier, compute the similarity between vector representation A of the document entity "Li Bai" and vector representation B of the Tang-dynasty poet "Li Bai", and between A and vector representation C of the organization member "Li Bai". The computed similarity score for the Tang-dynasty poet is 0.90, and for the organization member it is 0.45.
The seventh step: rank the similarity scores of the candidate entities in the knowledge base. Sort the scores obtained in the sixth step and select the entity whose score exceeds the threshold and is the highest as the final linked entity. Since the Sigmoid function outputs a probability, the similarity threshold is preset to 0.5; the score of the Tang-dynasty poet "Li Bai" is the highest and exceeds 0.5, so that candidate is the entity finally linked to the mention "Li Bai" in the text. The threshold in this embodiment can be adjusted to the actual situation and is not specifically limited by the invention.
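The MLP-plus-Sigmoid scorer of the sixth step can be sketched as below; the concatenation of the two sentence vectors, the single ReLU hidden layer, and the fixed toy weights are assumptions for illustration — in the patent the weights would be learned.

```python
# Sketch of the MLP + Sigmoid similarity scorer: concatenate the two
# sentence vectors, pass them through one hidden layer, and squash the
# output into (0, 1) with a sigmoid.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_similarity(a, b, w_hidden, w_out):
    """Score in (0, 1) for sentence vectors a and b."""
    x = a + b  # concatenation of the two sentence vectors
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)))  # ReLU
              for row in w_hidden]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)))

a, b = [1.0, 0.5], [0.8, 0.2]
w_hidden = [[0.5, -0.3, 0.2, 0.1], [0.1, 0.4, -0.2, 0.3]]
w_out = [1.0, -0.5]
score = mlp_similarity(a, b, w_hidden, w_out)
print(0.0 < score < 1.0)  # sigmoid output is always a valid probability
```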
Example three
Fig. 3 is a schematic composition diagram of an entity linking system. A third embodiment of the present invention provides an entity linking system, where the system includes:
the text description information generating unit of the entity in the knowledge base is used for generating text description information of the entity in the knowledge base according to the attribute and the structural information of the relational graph of the entity in the knowledge base;
a vector representation obtaining unit of the entity in the knowledge base, which is used for obtaining sentence global vector representation of the text description information of the entity in the knowledge base and taking the sentence global vector representation as the vector representation of the entity in the knowledge base;
the vector representation obtaining unit of the entity text fragment in the unstructured text is used for obtaining the vector representation of the entity text fragment in the unstructured text;
and the link entity obtaining unit is used for calculating the similarity scores of the vector representation of the entity text fragment in the unstructured text and the vector representation of the entity in the knowledge base through vector similarity calculation, sorting the results based on the similarity scores, and selecting the candidate entity with the score larger than the threshold value and the highest score in the sorted results as the entity for linking the unstructured text and the knowledge base.
Example four
The fourth embodiment of the present invention provides an entity linking apparatus, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the entity linking method when executing the computer program.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be used for storing the computer programs and/or modules, and the processor implements the various functions of the entity linking apparatus by running or executing the programs and data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function or an image playing function). Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
Example five
An embodiment five of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the entity linking method are implemented.
The entity linking apparatus, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the embodiments of the present invention may also be implemented by a computer program stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the above method embodiments may be implemented. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, a carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the jurisdiction.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. An entity linking method, characterized in that the method comprises:
generating text description information of the entities in the knowledge base according to the attributes of the entities in the knowledge base and the structural information of the relational graph;
obtaining sentence global vector representation of text description information of an entity in a knowledge base, and taking the sentence global vector representation as vector representation of the entity in the knowledge base;
obtaining vector representation of entity text fragments in the unstructured text;
calculating similarity scores of vector representation of entity text segments in the unstructured text and vector representation of entities in the knowledge base through vector similarity calculation, sorting based on similarity score results, and selecting candidate entities with scores larger than a threshold value and highest scores in the sorting results as entities of the unstructured text linked with the knowledge base.
2. The entity linking method according to claim 1, wherein in the method, a BERT model is used to obtain a sentence global vector representation of textual description information of an entity in the knowledge base, and a BERT model is used to obtain a vector representation of a text fragment of the entity in the unstructured text.
3. The entity linking method according to claim 1, wherein the method includes identifying entities in unstructured text, and specifically includes:
the unstructured text is D = {w1, w2, …, wl}, where each word is drawn from the vocabulary, wi ∈ W^v, i = 1, …, l; l denotes the number of words in the text D, v denotes the size of the vocabulary, and W denotes the vocabulary space;
a vector representation sequence of the word sequence in the text D is obtained using a BERT pre-trained language model: X = {x1, x2, …, xl}, xi ∈ R^d, i = 1, 2, …, l; xi is a d-dimensional vector in real space representing the vector representation of the ith word, and R denotes the real number space;
the sequence embedding representation X is input into a preset encoder to obtain the word embedding representation of the entity.
4. The entity linking method of claim 3, wherein the preset encoder is a Transformer encoder:
Xe = Transformer_Encoder(X)
where Xe denotes the entity embedding representation information after encoding by the Transformer encoder;
Xe is passed through a linear transformation and input into a Softmax classifier for entity classification:
Oe = Linear(Xe)
Pe = Softmax(Oe)
where Oe ∈ R^{n×c} denotes the embedding representation obtained from Xe after the linear transformation, c denotes the number of classes of the entity BIO tags, and Pe ∈ R^{n×c} denotes the probability distribution over entity classes; the index of the maximum probability value in the distribution is taken as the entity class label index of the current character;
the detected entity text fragment set E = {e1, …, en} is obtained by BIO decoding of the tag sequence, where E denotes the set of entity text fragments and ei (i = 1, 2, …, n) denotes the text fragment [wi0, …, wik] of the ith entity in the unstructured text, with k the length of the entity text fragment.
5. The entity linking method according to claim 4, wherein the method includes generating candidate entities of the entities in the unstructured text in the knowledge base, and specifically includes:
constructing entity names for entities in the set E to obtain a final entity name set;
6. The entity linking method according to claim 5, wherein the entity names are constructed by: adding an alias dictionary for each entity name, generating name shorthands using shorthand rules for the entity names of different ontology types, and finally generating the entity name set for each entity from the alias dictionary and the shorthand rules.
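The alias-plus-shorthand expansion of claim 6 can be sketched as below; the alias entries and the initials rule are illustrative assumptions, since the patent leaves the concrete dictionary and rules open.

```python
# Sketch of claim 6: expand each detected entity name into a candidate-name
# set using an alias dictionary plus a simple shorthand rule.

ALIASES = {"International Business Machines": ["IBM Corp."]}

def shorthand(name):
    """Toy shorthand rule: initials of a multi-word name."""
    words = name.split()
    return "".join(w[0].upper() for w in words) if len(words) > 1 else name

def candidate_names(name):
    names = {name, shorthand(name)}
    names.update(ALIASES.get(name, []))
    return names

print(sorted(candidate_names("International Business Machines")))
# -> ['IBM', 'IBM Corp.', 'International Business Machines']
```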
7. The entity linking method according to claim 1, wherein the method includes generating context description information of the entity in the unstructured text, and specifically includes:
for each entity in the unstructured text, generating corresponding description information T = {T1, T2, …, Tn} according to the context environment of the entity, where n is the number of entities in the unstructured text;
the method for generating the corresponding description information according to the context environment of the entity is as follows: and intercepting a context sentence where the entity is located as description information of the entity, and adding preset marker characters before and after each entity.
8. The entity linking method according to claim 4, wherein the method comprises generating candidate entity description information in a knowledge base, and specifically comprises:
for entity ei in the unstructured text and its candidate entity set Gi in the knowledge base, generating description information for each candidate entity according to its ontology type, using a preset template for that ontology type; the candidate entity description information is S = {S1, S2, …, Sn}, Si = {si1, si2, …, sim}, where S denotes the description information of the knowledge-base candidate entity sets of all entities in the unstructured text, Si denotes the description information of the candidate entities of the ith entity in the unstructured text, sij denotes the description information of the jth knowledge-base candidate entity of the ith entity, n denotes the number of entities in the unstructured text, and m denotes the size of the knowledge-base candidate entity set of an entity in the unstructured text.
9. The entity linking method of claim 1, wherein the vectorized representation of the entities in the unstructured text and the candidate entities in the knowledge base comprises:
inputting the description information T = {T1, T2, …, Tn} of the entities in the unstructured text into a BERT model to obtain the text sequence vector representation of each description: XTi = BERT(Ti), XTi ∈ R^{t×d}, i = 1, 2, …, n;
where t denotes the maximum length of the description text of an entity in the unstructured text, and d denotes the dimension of the word vectors output by BERT;
obtaining the global vector representation of each sentence through a pooling layer from the text sequence vector representation of the entity description information: xTi = MaxPool1D(XTi), xTi ∈ R^d;
the knowledge-base candidate entity description information S = {S1, S2, …, Sn} is passed through BERT to obtain the sequence vector representations: XSi^j = BERT(sij), XSi^j ∈ R^{t×d};
where XS denotes the set of text sequence vector representation matrices obtained after the description information of each entity in the knowledge-base candidate entity set S passes through the BERT pre-trained model; XSi denotes the text sequence vector representation matrix formed by the candidate entity set description information of the ith entity in the unstructured text; XSi^j denotes the text sequence vector representation of the description information of the jth knowledge-base candidate entity of the ith entity, whose kth row is the vector representation of the kth word of that description;
extracting the global vector representation of each sentence from the text sequence vector representations of the knowledge-base candidate entity description information through a pooling layer: xSij = MaxPool1D(XSi^j), xSij ∈ R^d.
10. The entity linking method according to claim 1, wherein the method comprises:
and calculating the similarity between the entity in the unstructured text and the description information of each of its candidate entities in the knowledge base using a similarity calculation method: CS(i, j) = Similarity(xTi, xSij);
where CS(i, j) denotes the similarity score between the description information of the ith entity in the unstructured text and the description information of its jth candidate entity in the knowledge base, xTi denotes the sentence global vector representation of the ith entity's description information, xSij denotes the sentence global vector representation of the description information of the jth knowledge-base candidate entity of the ith entity, and Similarity denotes a pairwise similarity calculation method;
for the ith entity in the unstructured text, sorting the entity description information similarity scores of the candidate entity set in the knowledge base, and selecting the entity with the similarity score larger than a set threshold value and the highest score as the entity finally linked to the knowledge base;
max_i = argmax_j CS(i, j), i = 1, 2, …, n, j = 1, 2, …, m; where max_i is the index of the entity finally linked to entity i, n denotes the number of entities in the unstructured text, and m denotes the size of the knowledge-base candidate entity set of an entity in the unstructured text.
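The pairwise `Similarity` function of claim 10 is left open by the patent; cosine similarity is one concrete choice (the embodiments instead use an MLP + Sigmoid scorer). A minimal sketch:

```python
# Sketch of one possible pairwise Similarity for CS(i, j): cosine similarity
# between the two sentence global vectors.
import math

def cosine_similarity(u, v):
    """Cosine of the angle between non-zero vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 6))  # -> 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # -> 0.0
```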
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011581122.6A CN112966117A (en) | 2020-12-28 | 2020-12-28 | Entity linking method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112966117A true CN112966117A (en) | 2021-06-15 |
Family
ID=76271500
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011581122.6A Pending CN112966117A (en) | 2020-12-28 | 2020-12-28 | Entity linking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112966117A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116049447A (en) * | 2023-03-24 | 2023-05-02 | 中科雨辰科技有限公司 | Entity linking system based on knowledge base |
CN116306504A (en) * | 2023-05-23 | 2023-06-23 | 匀熵智能科技(无锡)有限公司 | Candidate entity generation method and device, storage medium and electronic equipment |
CN116341531A (en) * | 2023-02-28 | 2023-06-27 | 人民网股份有限公司 | Knowledge-driven character information extraction and inspection method and device |
CN116756295A (en) * | 2023-08-16 | 2023-09-15 | 北京盛通知行教育科技集团有限公司 | Knowledge base retrieval method, device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106940702A (en) * | 2016-01-05 | 2017-07-11 | 富士通株式会社 | Entity refers to the method and apparatus with entity in semantic knowledge-base in connection short text |
CN107102989A (en) * | 2017-05-24 | 2017-08-29 | 南京大学 | A kind of entity disambiguation method based on term vector, convolutional neural networks |
CN108959258A (en) * | 2018-07-02 | 2018-12-07 | 昆明理工大学 | It is a kind of that entity link method is integrated based on the specific area for indicating to learn |
CN109408743A (en) * | 2018-08-21 | 2019-03-01 | 中国科学院自动化研究所 | Text link embedding grammar |
CN109933785A (en) * | 2019-02-03 | 2019-06-25 | 北京百度网讯科技有限公司 | Method, apparatus, equipment and medium for entity associated |
CN111428507A (en) * | 2020-06-09 | 2020-07-17 | 北京百度网讯科技有限公司 | Entity chain finger method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN111475623B (en) | Case Information Semantic Retrieval Method and Device Based on Knowledge Graph | |
Davis et al. | End-to-end document recognition and understanding with dessurt | |
CN112966117A (en) | Entity linking method | |
CN109800437A (en) | A kind of name entity recognition method based on Fusion Features | |
CN112149421A (en) | Software programming field entity identification method based on BERT embedding | |
CN110390018A (en) | A kind of social networks comment generation method based on LSTM | |
CN112926345B (en) | Multi-feature fusion neural machine translation error detection method based on data enhancement training | |
CN109977220B (en) | Method for reversely generating abstract based on key sentence and key word | |
US20220004545A1 (en) | Method of searching patent documents | |
CN111143507B (en) | Reading and understanding method based on compound problem | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
Logeswaran et al. | Sentence ordering using recurrent neural networks | |
CN112597300A (en) | Text clustering method and device, terminal equipment and storage medium | |
CN113196277A (en) | System for retrieving natural language documents | |
CN114169312A (en) | Two-stage hybrid automatic summarization method for judicial official documents | |
CN113282729A (en) | Question-answering method and device based on knowledge graph | |
Pham | Transferring, transforming, ensembling: the novel formula of identifying fake news | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
CN106897274B (en) | Cross-language comment replying method | |
CN114064901B (en) | Book comment text classification method based on knowledge graph word meaning disambiguation | |
CN114048305A (en) | Plan recommendation method for administrative penalty documents based on graph convolution neural network | |
Günther et al. | Pre-trained web table embeddings for table discovery | |
CN117520515A (en) | Bank field question and answer method, system and device based on local knowledge base | |
CN115344668A (en) | Multi-field and multi-disciplinary science and technology policy resource retrieval method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 610042 No. 270, floor 2, No. 8, Jinxiu street, Wuhou District, Chengdu, Sichuan Applicant after: Chengdu shuzhilian Technology Co.,Ltd. Address before: No.2, floor 4, building 1, Jule road crossing, Section 1, West 1st ring road, Wuhou District, Chengdu City, Sichuan Province 610041 Applicant before: CHENGDU SHUZHILIAN TECHNOLOGY Co.,Ltd. |
|
CB02 | Change of applicant information |