CN113297386A - Method and device for entity linking in a restricted domain - Google Patents

Method and device for entity linking in a restricted domain

Info

Publication number
CN113297386A
CN113297386A
Authority
CN
China
Prior art keywords
entity
domain
entities
candidate
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010108590.5A
Other languages
Chinese (zh)
Inventor
侯磊
张馨如
史佳欣
李涓子
张鹏
唐杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010108590.5A
Publication of CN113297386A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Abstract

An embodiment of the invention provides an entity linking method and apparatus for a restricted domain. The method comprises: obtaining the entity mentions in a text to be linked, together with their candidate entity sets, through an entity-mention-to-knowledge-base-entity dictionary; inputting the global features and local features of the entity mentions and candidate entity sets into an entity disambiguation model, and obtaining the probability, output by the model, that each candidate entity in a candidate entity set is the knowledge base entity the entity mention refers to; and determining the entity links of the text to be linked according to these probabilities. The restricted-domain entity linking method of the embodiment avoids manual annotation work and achieves high linking accuracy.

Description

Method and device for entity linking in a restricted domain
Technical Field
The present invention relates to entity linking technology, and in particular to a method and an apparatus for entity linking in a restricted domain.
Background
The goal of entity linking is to link the entity mentions appearing in a text to knowledge base entities. It is a fundamental task in Natural Language Processing (NLP) and supports other tasks in the field, such as question answering and relation extraction; accordingly, entity linking techniques have developed rapidly in recent years.
The main challenge of entity linking is the ambiguity of entity mentions: one mention may refer to multiple knowledge base entities, and one knowledge base entity is often referred to in multiple ways. A further challenge is that, in practice, an entity linking system should link to the more meaningful, more specific entities.
Existing entity linking methods typically comprise four steps: 1. find all entity mentions in the text; 2. find all knowledge base entities each mention may refer to; 3. build feature representations of the mentions, the knowledge base entities, the context information, and so on, usually through representation learning that places them in the same semantic space; 4. perform entity disambiguation with a classification or learning-to-rank algorithm.
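As a concrete, deliberately toy illustration of the four steps above, the following sketch uses an invented mention dictionary and a word-overlap score in place of learned embeddings; none of these names or data come from the patent:

```python
# Hypothetical sketch of the four classic entity-linking steps.
MENTION_DICT = {  # entity mention -> possible knowledge-base entities (invented)
    "apple": ["Apple_Inc", "Apple_(fruit)"],
    "jobs": ["Steve_Jobs", "Employment"],
}

def find_mentions(text):
    # Step 1: find all entity mentions appearing in the text.
    return [w for w in text.lower().split() if w in MENTION_DICT]

def candidate_entities(mention):
    # Step 2: find all knowledge-base entities the mention may refer to.
    return MENTION_DICT[mention]

def featurize(mention, entity, text):
    # Step 3: represent mention/entity/context; a toy word-overlap score
    # stands in for embeddings in a shared semantic space.
    ctx = set(text.lower().split())
    return len(ctx & set(entity.lower().replace("_", " ").split()))

def disambiguate(mention, text):
    # Step 4: rank the candidates and keep the best-scoring entity.
    cands = candidate_entities(mention)
    return max(cands, key=lambda e: featurize(mention, e, text))

text = "apple unveiled a phone"
links = {m: disambiguate(m, text) for m in find_mentions(text)}
```

Real systems replace the overlap score with learned feature vectors, but the control flow is the same.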
Existing entity linking methods face the following challenges: 1. for open-domain entity linking, one mention may refer to hundreds of knowledge base entities, so the corresponding neural network model has so many parameters that training is extremely inefficient and accuracy drops sharply; 2. open-domain methods often link to broad, generic entities, and such linking results have little value in practical applications, especially applications serving ordinary users; 3. training an entity linking model for a restricted domain requires a large amount of domain annotation data, which is too costly to obtain.
Disclosure of Invention
Embodiments of the present invention provide a restricted-domain entity linking method, apparatus, electronic device, and readable storage medium that overcome, or at least partially solve, the above problems.
In a first aspect, an embodiment of the present invention provides a method for entity linking in a restricted domain, comprising: obtaining the entity mentions in a text to be linked, together with their candidate entity sets, through an entity-mention-to-knowledge-base-entity dictionary; inputting the global features and local features of the entity mentions and candidate entity sets into an entity disambiguation model, and obtaining the probability, output by the model, that each candidate entity in a candidate entity set is the knowledge base entity the entity mention refers to; and determining the entity links of the text according to these probabilities. The entity-mention-to-knowledge-base-entity dictionary is determined from a pre-constructed domain data set, which contains the entity mentions existing in the target encyclopedia and their corresponding knowledge base entities. The entity disambiguation model integrates different features with a multilayer perceptron and passes information between a candidate entity and its context entities with a graph convolutional network; it is trained with the global-feature and local-feature sample data of each training corpus in the domain data set as samples, and with the knowledge base entity referred to by each entity mention in that corpus as the sample label.
In some embodiments, obtaining the entity mentions and candidate entity sets in the text to be linked through the entity-mention-to-knowledge-base-entity dictionary comprises: constructing a dictionary tree (trie) for string matching from the entity-mention-to-knowledge-base-entity dictionary; and obtaining all the entity mentions appearing in the text with a trie-based string-matching algorithm, selecting, for conflicting entity mentions, the longest mention or the one that appears most often as the matching result, and obtaining the candidate entity sets at the same time.
In some embodiments, the global-feature and local-feature sample data are obtained by vector training on the corpus in the domain data set, which comprises: obtaining, for each entity and word in the training corpus, a domain vector representation and an open-domain vector representation, and concatenating the domain vector and the open-domain vector as the vector representation of the entity or word during feature extraction.
In some embodiments, pre-constructing the domain data set comprises: randomly ordering the categories of every entity of the target encyclopedia to obtain a category sequence for each entity, the category sequences of all the entities forming the training corpus; obtaining vector representations of the categories by a context-category prediction method; determining the domain category set corresponding to any domain, the set comprising a plurality of encyclopedia categories corresponding to that domain; and obtaining the domain data set from the entities of the categories in the domain category set and the domain category set itself.
In some embodiments, determining the domain category set corresponding to any domain comprises: determining the encyclopedia category c_d corresponding to the domain; traversing the classification system of the target encyclopedia layer by layer, from top to bottom, starting from c_d, up to a preset maximum traversal depth; adding the categories of the top k layers to the domain category set; and, during the traversal, computing for any category c_j the mean of the vector representations of the categories already added to the domain category set, computing the similarity between c_j and that mean, and adding the categories whose similarity ranks in the top x% (a preset value) to the domain category set C_d.
In some embodiments, the global features characterize the semantic consistency of all the entities linked from one piece of text, and the local features characterize the semantic consistency between a linked knowledge base entity and the local context.
In some embodiments, the global features comprise entity graph features and similarity features between any entity mention and its context entity mentions; the local features comprise a string similarity and a context similarity.
In a second aspect, an embodiment of the present invention provides a restricted-domain entity linking apparatus, comprising: an obtaining unit, configured to obtain the entity mentions and candidate entity sets in the text to be linked through the entity-mention-to-knowledge-base-entity dictionary; a processing unit, configured to input the global and local features of the entity mentions and candidate entity sets into the entity disambiguation model, obtaining the probability that each candidate entity in a candidate entity set is the knowledge base entity the mention refers to; and a determining unit, configured to determine the entity links of the text to be linked according to these probabilities. The entity-mention-to-knowledge-base-entity dictionary is determined from a pre-constructed domain data set containing the entity mentions existing in the target encyclopedia and their corresponding knowledge base entities. The entity disambiguation model integrates different features with a multilayer perceptron and passes information between a candidate entity and its context entities with a graph convolutional network; it is trained with the global-feature and local-feature sample data of each training corpus in the domain data set as samples, and with the knowledge base entity referred to by each entity mention as the sample label.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
According to the restricted-domain entity linking method and apparatus, electronic device, and readable storage medium of the embodiments, the restricted-domain entity linking model is trained on a domain data set that is generated from the classification system of the target encyclopedia by an unsupervised method, without manual annotation. The restricted-domain entity linking system further limits the solution space, which greatly reduces the time and space complexity of entity linking, improves linking accuracy, and makes the linked knowledge base entities more meaningful in practical applications.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart of a restricted-domain entity linking method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a restricted-domain entity linking method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a restricted-domain entity linking apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A restricted-domain entity linking method according to an embodiment of the present invention is described below with reference to fig. 1 and 2.
As shown in fig. 1, the restricted-domain entity linking method provided in the embodiment of the present invention comprises the following steps.
Step S100: obtain the entity mentions and candidate entity sets in the text to be linked through the entity-mention-to-knowledge-base-entity dictionary.
The entity-mention-to-knowledge-base-entity dictionary is determined from a pre-constructed domain data set, which contains the entity mentions existing in the target encyclopedia and their corresponding knowledge base entities.
It should be noted that the target encyclopedia may be Wikipedia or a similar online encyclopedia. Because a large amount of linked text exists in the target encyclopedia, that text can naturally serve as annotated corpus for entity linking, so the domain data set can be constructed conveniently and no manual annotation is needed.
Step S200: input the global and local features of the entity mentions and candidate entity sets into the entity disambiguation model, and obtain the probability, output by the model, that each candidate entity in a candidate entity set is the knowledge base entity the entity mention refers to.
The entity disambiguation model is trained with the global-feature and local-feature sample data of each training corpus in the domain data set as samples, and with the knowledge base entity referred to by each entity mention in that corpus as the sample label.
Because the entity disambiguation model takes both global-feature and local-feature sample data as samples, at inference time it considers the local context of each entity mention while also considering the consistency of the disambiguation results of the other mentions in the text; that is, it integrates local and global disambiguation.
Step S300: determine the entity links of the text to be linked according to the probability that each candidate entity in the candidate entity set is the knowledge base entity referred to by the entity mention.
The candidate entity with the highest probability is selected as the entity link of the text to be linked.
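A minimal illustration of this selection step follows; the mentions, candidate names, and probabilities are invented for illustration:

```python
# For each entity mention, keep the candidate entity with the
# highest model probability (step S300).
def select_links(scores):
    """scores: {mention: {candidate_entity: probability}}"""
    return {m: max(cands, key=cands.get) for m, cands in scores.items()}

scores = {"Yangtze_Bridge": {"Nanjing_Yangtze_River_Bridge": 0.91,
                             "Wuhan_Yangtze_River_Bridge": 0.09}}
links = select_links(scores)
```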
It can be understood that, in this restricted-domain entity linking method, the entity linking model is trained on the domain data set; the domain data are generated from the classification system of the target encyclopedia by an unsupervised method, without manual annotation; and the restricted-domain entity linking system further limits the solution space, which greatly reduces the time and space complexity of entity linking, improves linking accuracy, and makes the linked knowledge base entities more meaningful in practical applications.
In short, the restricted-domain entity linking method dispenses with manual annotation work and achieves high linking accuracy.
In some embodiments, the domain data set is pre-constructed through steps S010 to S040.
Step S010: randomly order the categories of every entity of the target encyclopedia to obtain a category sequence for each entity; the category sequences of all the entities form the training corpus.
In an actual implementation, the categories of each entity are randomly ordered 10 times and the orderings are concatenated into one category sequence; all the category sequences obtained from all the entities of the target encyclopedia serve as the training corpus.
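The corpus construction just described can be sketched as follows; the entity and category names are invented, and the `shuffles` and `seed` parameters are illustrative assumptions:

```python
import random

def build_category_corpus(entity_categories, shuffles=10, seed=0):
    # For each entity, randomly order its category list `shuffles` times
    # and concatenate the orderings into one category sequence; all the
    # sequences together form the training corpus.
    rng = random.Random(seed)
    corpus = []
    for entity, cats in entity_categories.items():
        seq = []
        for _ in range(shuffles):
            order = list(cats)
            rng.shuffle(order)
            seq.extend(order)
        corpus.append(seq)
    return corpus

corpus = build_category_corpus({"Beethoven": ["Composer", "Pianist", "German"]})
```

Each sequence can then be fed to a skip-gram-style model to learn category vectors.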
Step S020: obtain vector representations of the categories by a context-category prediction method applied to the category sequences.
Step S030: determine the domain category set corresponding to any domain, the set comprising a plurality of encyclopedia categories corresponding to that domain. This step comprises substeps S031 to S034.
Substep S031: determine the encyclopedia category c_d corresponding to any domain (e.g., domain d); within the classification system of the target encyclopedia, c_d belongs to the topmost layer of categories.
Substep S032: according to a preset maximum traversal depth max_depth, traverse the classification system of the target encyclopedia layer by layer, from top to bottom, starting from c_d.
Substep S033: add the categories of the top k layers to the domain category set.
In other words, the categories of the top k layers are always added to the domain category set.
Substep S034: during the traversal, for any category c_j, compute the mean of the vector representations of the categories already added to the domain category set, compute the similarity between c_j and that mean, and add the categories whose similarity ranks in the top x% (a preset value) to the domain category set C_d.
In other words, starting from the categories at layer k+1, the similarity between each category and the categories already added to the domain category set is evaluated, and only the categories with high similarity are added to the set.
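A small sketch of substeps S031 to S034, assuming the traversal has already been grouped into layers; the vectors, the per-layer top-percent selection, and all names are illustrative assumptions, not the patent's exact procedure:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def mean_vec(vecs):
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def expand_domain_categories(layers, vec, k=1, top_percent=50):
    # Layers 0..k-1 are added unconditionally; from layer k on, a category
    # is kept only if its cosine similarity to the mean vector of the
    # already-selected categories ranks in the top `top_percent` percent.
    selected = []
    for depth, layer in enumerate(layers):
        if depth < k:
            selected.extend(layer)
            continue
        center = mean_vec([vec[c] for c in selected])
        ranked = sorted(layer, key=lambda c: cosine(vec[c], center),
                        reverse=True)
        keep = max(1, round(len(ranked) * top_percent / 100))
        selected.extend(ranked[:keep])
    return selected

vec = {"law": [1.0, 0.0], "contract": [0.9, 0.1], "sports": [0.0, 1.0]}
selected = expand_domain_categories([["law"], ["contract", "sports"]], vec)
```

Here "contract" survives because it is close to the "law" centroid, while "sports" is filtered out.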
Step S040: obtain the domain data set from the entities of the categories in the domain category set and the domain category set itself. In other words, the entities belonging to the categories in the domain category set are collected together with the set, yielding the domain training corpus for entity linking.
Because a large amount of linked text exists in the target encyclopedia and can naturally serve as annotated corpus for entity linking, a domain corpus usable for the entity linking model training task is available once step S040 finishes.
In some embodiments, the global-feature and local-feature sample data are obtained by vector training on the training corpus in the domain data set, which comprises: obtaining, for each entity and word in the training corpus, a domain vector representation and an open-domain vector representation, and concatenating the domain vector and the open-domain vector as the vector representation of the entity or word during feature extraction.
It can be understood that after the vectorization representation of the words and entities is obtained through the training corpus, the global feature sample data and the local feature sample data can be well represented and used in the disambiguation model.
In an actual implementation, for the obtained domain corpus, vector representations of entities and words in the same semantic space can be obtained with the representation learning method proposed by Yamada et al. in 2016. Specifically, vector training is performed on the domain training corpora separately for each domain, and also on the open domain. Each entity and word therefore has at least two vector representations: a domain-specific one and an open-domain one. During feature extraction, the domain vector and the open-domain vector are concatenated as the vector representation of the entity or word.
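The concatenation itself is trivial; the following toy fragment (with invented vectors and dimensions) shows the shape of the combined representation:

```python
# Each token keeps a domain-specific vector and an open-domain vector;
# feature extraction works on their concatenation. All values invented.
domain_vec = {"bridge": [0.2, 0.8]}
open_vec = {"bridge": [0.5, 0.1, 0.4]}

def combined(token):
    return domain_vec[token] + open_vec[token]  # list concatenation

v = combined("bridge")
```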
For each item of the training corpus, the relevant global-feature and local-feature sample data can then be extracted for disambiguation.
The global-feature sample data characterizes the semantic consistency of all the entities linked from one piece of text; it comprises entity graph features and similarity features between each entity mention and its context entity mentions.
Entity graph features: an entity graph is constructed for each piece of training data (i.e., one annotated text); the candidate entities of all the entity mentions appearing in the text are the nodes of the graph, an edge exists between every two entities, and the edge weight is the cosine similarity of the two entity vectors. For each candidate entity, the global features are computed on the entity graph formed by the candidate entities of the m entity mentions before and after it, which is a subgraph of the entity graph of the whole document.
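The graph construction just described can be sketched as follows; the entity names and two-dimensional vectors are invented for illustration:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def build_entity_graph(candidates, vec):
    # Fully connected graph over candidate entities; each edge weight is
    # the cosine similarity of the two entity vectors.
    graph = {}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            graph[(a, b)] = cosine(vec[a], vec[b])
    return graph

vec = {"A": [1.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 1.0]}
g = build_entity_graph(["A", "B", "C"], vec)
```

Restricting `candidates` to those of the m surrounding mentions yields the per-candidate subgraph mentioned above.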
Similarity features: the cosine similarity between the candidate entity's vector and the mean vector of the context entity-mention words is computed to represent the semantic consistency between the candidate entity and the context entity mentions.
The local-feature sample data characterizes the semantic consistency between the linked knowledge base entity and the local context; it comprises a string similarity and a context similarity.
String similarity is the edit distance between the full name of the knowledge base entity and the entity mention. Context similarity is the cosine similarity between the knowledge base entity's vector and the mean of the context word vectors.
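Both local features are standard computations; a minimal, library-free sketch (all vectors invented) is:

```python
import math

def edit_distance(s, t):
    # Levenshtein distance between the knowledge-base entity's full name
    # and the entity mention (string-similarity feature).
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (cs != ct))
    return dp[-1]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def context_similarity(entity_vec, context_word_vecs):
    # Cosine similarity between the entity vector and the mean of the
    # context word vectors (context-similarity feature).
    n = len(context_word_vecs)
    mean = [sum(col) / n for col in zip(*context_word_vecs)]
    return cosine(entity_vec, mean)
```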
The input of the entity disambiguation model is the global and local features obtained by feature extraction. A multilayer perceptron integrates the different features, a graph convolutional network passes information between a candidate entity and its context entities, and after the graph convolutional network a fully connected layer maps the hidden state to the probability that the candidate entity is the knowledge base entity referred to by the entity mention.
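A highly simplified, dependency-free sketch of the two components named above (graph propagation followed by a fully connected scoring layer); the normalization scheme, weights, and dimensions are all illustrative assumptions rather than the patent's architecture:

```python
import math

def gcn_layer(h, adj):
    # One round of message passing: each node's new state is the
    # degree-normalized weighted sum of its neighbours' states (with a
    # self-loop), a simplified stand-in for a graph-convolution layer.
    n = len(h)
    out = []
    for i in range(n):
        weights = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(weights)
        out.append([sum(weights[j] * h[j][k] for j in range(n)) / total
                    for k in range(len(h[0]))])
    return out

def score(hidden, w, b):
    # Final fully connected layer mapping a hidden state to a probability.
    z = sum(x * wi for x, wi in zip(hidden, w)) + b
    return 1.0 / (1.0 + math.exp(-z))

h = [[1.0, 0.0], [0.0, 1.0]]    # candidate-entity hidden states (invented)
adj = [[0.0, 1.0], [1.0, 0.0]]  # edge weights, e.g. cosine similarities
h1 = gcn_layer(h, adj)
p = score(h1[0], [1.0, 1.0], 0.0)
```

A trained model would learn the layer weights; here they are fixed only to show the data flow.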
In some embodiments, the entity-mention-to-knowledge-base-entity dictionary needs to be constructed in advance. Specifically, a large number of links exist in the target encyclopedia; the anchor text of a link can be regarded as an entity mention, and the encyclopedia page the link points to is the corresponding knowledge base entity. After the domain corpus is obtained, the entity-mention-to-knowledge-base-entity dictionary can be obtained by counting the links in the corpus.
In some embodiments, step S100 of obtaining the entity mentions and candidate entity sets in the text to be linked through the entity-mention-to-knowledge-base-entity dictionary comprises steps S110 and S120.
Step S110: construct a dictionary tree (trie) for string matching from the entity-mention-to-knowledge-base-entity dictionary.
Step S120: obtain all the entity mentions appearing in the text with a trie-based string-matching algorithm; for conflicting entity mentions, select the longest mention, or the one that appears most often, as the matching result, and obtain the candidate entity sets at the same time.
It can be understood that a dictionary tree for efficient string matching is built from the obtained entity-mention-to-knowledge-base-entity dictionary. Given a text to be linked, all the entity mentions appearing in it are found by trie-based string matching; for conflicting mentions (for example, "Nanjing City" (南京市) and "Nanjing Yangtze River Bridge" (南京市长江大桥) overlap in the same span), the longest mention or the most frequent one is selected as the matching result, and the candidate entity set of each mention is obtained at the same time.
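A minimal trie with longest-match conflict resolution can be sketched as follows (only the longest-match rule is implemented here; the frequency-based rule is omitted, and the dictionary entries are invented):

```python
class Trie:
    def __init__(self, mention_dict):
        self.root = {}
        for mention in mention_dict:
            node = self.root
            for ch in mention:
                node = node.setdefault(ch, {})
            node["$"] = mention  # end-of-mention marker

    def longest_match(self, text, start):
        # Return the longest dictionary mention starting at `start`.
        node, best = self.root, None
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                best = node["$"]
        return best

def find_mentions(text, trie):
    found, i = [], 0
    while i < len(text):
        m = trie.longest_match(text, i)
        if m:
            found.append(m)
            i += len(m)
        else:
            i += 1
    return found

trie = Trie({"南京": ["Nanjing"], "南京市": ["Nanjing_City"],
             "长江大桥": ["Yangtze_River_Bridge"],
             "南京市长江大桥": ["Nanjing_Yangtze_River_Bridge"]})
mentions = find_mentions("南京市长江大桥建成", trie)
```

The conflicting spans "南京", "南京市", and "长江大桥" are all subsumed by the longest match "南京市长江大桥".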
In some embodiments, the global features characterize the semantic consistency of all the entities linked from one piece of text and comprise entity graph features and similarity features between any entity mention and its context entity mentions.
Entity graph features: an entity graph is constructed for each piece of training data (i.e., one annotated text); the candidate entities of all the entity mentions appearing in the text are the nodes of the graph, an edge exists between every two entities, and the edge weight is the cosine similarity of the two entity vectors. For each candidate entity, the global features are computed on the entity graph formed by the candidate entities of the m entity mentions before and after it, which is a subgraph of the entity graph of the whole document.
Similarity features: the cosine similarity between the candidate entity's vector and the mean vector of the context entity-mention words is computed to represent the semantic consistency between the candidate entity and the context entity mentions.
The local features characterize the semantic consistency between the linked knowledge base entity and the local context; they comprise a string similarity and a context similarity.
String similarity is the edit distance between the full name of the knowledge base entity and the entity mention. Context similarity is the cosine similarity between the knowledge base entity's vector and the mean of the context word vectors.
In summary, the present invention provides an unsupervised method for generating domain data, which can support natural language processing tasks oriented to different domains. In addition, the present invention realizes entity linking in a restricted domain, improving the effect of entity linking and making the entity linking system more practically valuable.
Experiments verify that the entity linking method provided by the embodiment of the present invention achieves high accuracy.
Table 1 shows a data set.
TABLE 1
(Table 1 appears in the original only as an image; it lists statistics of the generated domain data sets.)
Table 2 shows the overlap ratio of the domain categories.
TABLE 2
(Table 2 appears in the original only as an image; it lists the pairwise overlap rates of the domain category sets.)
Table 3 defines the entity linking results for the domain.
TABLE 3
(Table 3 appears in the original only as an image; it lists the entity linking results in the restricted domains.)
As shown in Table 1, 12 domains are generated by the method of the present invention. As shown in Table 2, the overlap rate between domains is mostly below 10%; the overlap rate between domains 2 and 3 is 65% because they represent law and politics respectively, whose classifications naturally overlap more, which further illustrates the effectiveness of the method.
That is, the present invention can effectively obtain domain data, and the resulting domains are semantically coherent.
As shown in Table 3, models are trained on the open-domain and restricted-domain data sets and then tested on the held-out portions of the different domain data sets; the domain models clearly outperform the open-domain model, and the entity linking effect is significantly improved.
That is, the present invention improves the effect of entity linking.
The restricted-domain entity linking apparatus provided in the embodiment of the present invention is described below; the apparatus described below and the restricted-domain entity linking method described above may be referred to correspondingly.
As shown in fig. 3, a restricted-domain entity linking apparatus provided in an embodiment of the present invention comprises: an obtaining unit 710, a processing unit 720, and a determining unit 730.
The obtaining unit 710 is configured to obtain the entity mentions and candidate entity sets in the text to be linked through the entity-mention-to-knowledge-base-entity dictionary. The processing unit 720 is configured to input the global and local features of the entity mentions and candidate entity sets into the entity disambiguation model, obtaining the probability that each candidate entity in a candidate entity set is the knowledge base entity the mention refers to. The determining unit 730 is configured to determine the entity links of the text to be linked according to these probabilities. The entity-mention-to-knowledge-base-entity dictionary is determined from a pre-constructed domain data set containing the entity mentions existing in the target encyclopedia and their corresponding knowledge base entities. The entity disambiguation model integrates different features with a multilayer perceptron and passes information between a candidate entity and its context entities with a graph convolutional network; it is trained with the global-feature and local-feature sample data of each training corpus in the domain data set as samples, and with the knowledge base entity referred to by each entity mention as the sample label.
Fig. 4 illustrates a physical structure diagram of an electronic device. As shown in fig. 4, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the entity linking method for the limited domain, which includes: obtaining entity mentions and a candidate entity set in the text to be linked through an entity mention-knowledge base entity dictionary; inputting the obtained global features and local features of the entity mentions and the candidate entity set into an entity disambiguation model, and obtaining the probability that each candidate entity in the candidate entity set, as output by the entity disambiguation model, is the knowledge base entity referred to by the entity mention; and determining the entity links of the text to be linked according to those probabilities. The entity mention-knowledge base entity dictionary is determined from a pre-constructed domain data set, where the domain data set comprises the entity mentions existing in the target encyclopedia and the corresponding knowledge base entities. The entity disambiguation model fuses different features using a multilayer perceptron and passes information between candidate entities and their context entities using a graph convolution network; it is trained by taking the global feature sample data and local feature sample data of any training corpus in the domain data set as samples, and taking the probability result of the knowledge base entity referred to by the entity mention in that training corpus as the sample label.
It should be noted that, in a specific implementation, the electronic device in this embodiment may be a server, a PC, or another device, as long as its structure includes the processor 810, the communication interface 820, the memory 830, and the communication bus 840 shown in fig. 4, where the processor 810, the communication interface 820, and the memory 830 communicate with each other through the communication bus 840, and the processor 810 may call the logic instructions in the memory 830 to execute the above method. This embodiment does not limit the specific implementation form of the electronic device.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example comprising: obtaining entity mentions and a candidate entity set in the text to be linked through an entity mention-knowledge base entity dictionary; inputting the obtained global features and local features of the entity mentions and the candidate entity set into an entity disambiguation model, and obtaining the probability that each candidate entity in the candidate entity set, as output by the entity disambiguation model, is the knowledge base entity referred to by the entity mention; and determining the entity links of the text to be linked according to those probabilities. The entity mention-knowledge base entity dictionary is determined from a pre-constructed domain data set, where the domain data set comprises the entity mentions existing in the target encyclopedia and the corresponding knowledge base entities. The entity disambiguation model passes information between candidate entities and their context entities using a graph convolution network, and is trained by taking the global feature sample data and local feature sample data of any training corpus in the domain data set as samples, and taking the probability result of the knowledge base entity referred to by the entity mention in that training corpus as the sample label.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the entity linking method provided by the foregoing embodiments, which for example comprises: obtaining entity mentions and a candidate entity set in the text to be linked through an entity mention-knowledge base entity dictionary; inputting the obtained global features and local features of the entity mentions and the candidate entity set into an entity disambiguation model, and obtaining the probability that each candidate entity in the candidate entity set, as output by the entity disambiguation model, is the knowledge base entity referred to by the entity mention; and determining the entity links of the text to be linked according to those probabilities. The entity mention-knowledge base entity dictionary is determined from a pre-constructed domain data set, where the domain data set comprises the entity mentions existing in the target encyclopedia and the corresponding knowledge base entities. The entity disambiguation model fuses different features using a multilayer perceptron and passes information between candidate entities and their context entities using a graph convolution network; it is trained by taking the global feature sample data and local feature sample data of any training corpus in the domain data set as samples, and taking the probability result of the knowledge base entity referred to by the entity mention in that training corpus as the sample label.
The above-described embodiments of the apparatus are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for linking entities in a limited domain, comprising:
acquiring entity mentions and a candidate entity set in the text to be linked through an entity mention-knowledge base entity dictionary;
inputting the obtained global features and local features of the entity mentions and the candidate entity sets into an entity disambiguation model, and obtaining the probability that the candidate entities in the candidate entity sets output by the entity disambiguation model are the knowledge base entities referred by the entity mentions;
determining entity links of texts to be linked according to the probability that the candidate entities in the candidate entity set are the knowledge base entities referred by the entity mentions;
the entity mention-knowledge base entity dictionary is determined according to a pre-constructed domain data set, wherein the domain data set comprises existing entity mentions in the target encyclopedia and corresponding knowledge base entities;
the entity disambiguation model integrates different characteristics by using a multilayer perceptron, information is transmitted between a candidate entity and a context entity thereof by using a graph convolution network, the entity disambiguation model is obtained by taking global characteristic sample data and local characteristic sample data of any training corpus in the field data set as samples and taking a probability result of a knowledge base entity referred by entity mentioning in the any training corpus as a sample label training.
2. The method for linking entities in a limited domain according to claim 1, wherein obtaining the entity mentions and the candidate entity set in the text to be linked through the entity mention-knowledge base entity dictionary comprises:
constructing a dictionary tree for character string matching through the entity mention-knowledge base entity dictionary;
and obtaining all entity mentions appearing in the text by using a string matching algorithm based on the dictionary tree, selecting, for conflicting entity mentions, the longest entity mention or the one with the most occurrences as the matching result, and obtaining the candidate entity set at the same time.
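The dictionary-tree matching of claim 2 can be sketched as follows. The conflict rule implemented here is "prefer the longest mention", one of the two options the claim names; the trie layout and all names are illustrative assumptions.

```python
# Dictionary tree (trie) over the entity mention-knowledge base entity
# dictionary, plus longest-match string matching. Illustrative sketch.

def build_trie(mention_dict):
    root = {}
    for mention in mention_dict:
        node = root
        for ch in mention:
            node = node.setdefault(ch, {})
        node["$"] = mention  # end-of-mention marker
    return root

def match_mentions(text, trie):
    """Return all non-overlapping mentions, preferring the longest match."""
    found, i = [], 0
    while i < len(text):
        node, j, longest = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$" in node:
                longest = node["$"]  # remember the longest match so far
        if longest:
            found.append(longest)
            i += len(longest)        # skip past the match, resolving conflicts
        else:
            i += 1
    return found

trie = build_trie({"New York": 1, "New York City": 1, "York": 1})
print(match_mentions("I flew to New York City today", trie))
# -> ['New York City']
```

Each matched mention would then be paired with its candidate entity set from the same dictionary.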
3. The method according to claim 1, wherein the global feature sample data and the local feature sample data are obtained by performing vector training on a corpus in the domain data set; wherein
The vector training of the corpus in the domain data set comprises: obtaining, for any entity and word in the training corpus, a domain vector representation and an open-domain vector representation, and concatenating the domain vector and the open-domain vector as the vector representation of the entity or word during feature extraction.
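The concatenation step of claim 3 is mechanically simple: the domain-trained vector and the open-domain vector of the same token are joined end to end. A sketch with invented tokens and dimensions:

```python
import numpy as np

# Sketch of the vector-concatenation step in claim 3. The token, the
# vectors, and their dimensions are illustrative assumptions.

domain_vec = {"aspirin": np.array([0.2, 0.7])}       # from the domain corpus
open_vec = {"aspirin": np.array([0.1, 0.4, 0.9])}    # from an open-domain corpus

def represent(token):
    """Joint representation used during feature extraction."""
    return np.concatenate([domain_vec[token], open_vec[token]])

v = represent("aspirin")
print(v.shape)  # (5,) -- the two representations joined end to end
```

The joint vector lets downstream features draw on both domain-specific and general usage of the token.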
4. The method for entity linking in a limited domain according to any one of claims 1 to 3, wherein the domain data set is pre-constructed by:
randomly ordering the categories of each entity of the target encyclopedia to obtain a category sequence corresponding to the entity, wherein the category sequences corresponding to all the entities form a training corpus;
obtaining the vector representation of any category by a context category prediction method;
determining a domain category set corresponding to any domain, wherein the domain category set comprises a plurality of encyclopedia categories corresponding to the domain;
and obtaining the domain data set according to the entities whose categories are in the domain category set, together with the domain category set itself.
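The first step of claim 4 turns each entity's category set into a "sentence" by random ordering; the resulting corpus can then be fed to a context-prediction model (word2vec-style skip-gram, for example) to obtain one vector per category. The sketch below covers only the corpus-construction step; the entity and category names are invented for illustration.

```python
import random

# Build the category-sequence training corpus of claim 4: one randomly
# ordered category sequence per entity. Names are illustrative assumptions.

entity_categories = {
    "Aspirin": ["Drug", "Analgesic", "Chemistry"],
    "Ibuprofen": ["Drug", "Analgesic"],
}

random.seed(42)  # fixed seed so the sketch is reproducible
corpus = []
for entity, cats in entity_categories.items():
    seq = cats[:]
    random.shuffle(seq)   # random ordering of the entity's categories
    corpus.append(seq)

print(len(corpus))  # one category "sentence" per entity
```

A context-prediction embedding model trained on `corpus` would then yield the category vectors used in the later steps of the claim.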
5. The method for entity linking in a limited domain according to claim 4, wherein determining the domain category set corresponding to any domain, the domain category set comprising a plurality of encyclopedia categories corresponding to the domain, comprises:
determining the encyclopedia category c_d corresponding to any domain;
traversing the classification system of the target encyclopedia layer by layer, from top to bottom, starting from the encyclopedia category c_d, according to a preset maximum number of traversal layers;
adding the categories of the first k preset layers into the domain category set;
during the traversal, computing, for any category c_j, the mean of the vector representations of the categories already added to the domain category set, v̄ = (1/|C_d|) Σ_{c ∈ C_d} v_c;
and computing the similarity between any category c_j and the categories added to the domain category set, and adding the categories whose similarity ranks within the top x% (a preset value) into the domain category set C_d.
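The expansion step of claim 5 can be sketched as: average the vectors of the categories already in the domain set, score every remaining category against that mean, and keep the top x%. Cosine similarity, the toy vectors, and x = 50 are illustrative assumptions; the claim does not fix the similarity measure.

```python
import numpy as np

# Sketch of the domain-category-set expansion in claim 5. The similarity
# measure (cosine), vectors, and x are illustrative assumptions.

def expand(domain_set, all_cats, vectors, x_percent):
    mean = np.mean([vectors[c] for c in domain_set], axis=0)
    rest = [c for c in all_cats if c not in domain_set]

    def cos(c):  # cosine similarity to the mean of the current set
        v = vectors[c]
        return v @ mean / (np.linalg.norm(v) * np.linalg.norm(mean))

    ranked = sorted(rest, key=cos, reverse=True)
    k = max(1, int(len(ranked) * x_percent / 100))  # top x% by similarity
    return domain_set | set(ranked[:k])

vectors = {"Drug": np.array([1.0, 0.0]), "Medicine": np.array([0.9, 0.1]),
           "Sports": np.array([0.0, 1.0]), "Pharmacy": np.array([0.8, 0.2])}
result = expand({"Drug"}, vectors, vectors, 50)
print(sorted(result))
# -> ['Drug', 'Medicine']
```

Repeating this during the layer-by-layer traversal grows the set toward categories semantically close to the seed category c_d.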
6. The method for entity linking in a limited domain according to any one of claims 1 to 3, wherein the global features are used to characterize the semantic consistency of all the entities linked in a piece of text, and the local features are used to characterize the semantic consistency of the linked knowledge base entities with the local context.
7. The method for entity linking in a limited domain according to claim 6, wherein
the global features comprise entity graph features and similarity features between any entity mention and its context entities; and
the local features comprise string similarity and context similarity.
8. An entity linking apparatus for a limited domain, comprising:
the acquisition unit is used for acquiring entity mentions and candidate entity sets in the text to be linked through the entity mentioning-knowledge base entity dictionary;
the processing unit is used for inputting the acquired global features and local features of the entity mentions and the candidate entity sets into an entity disambiguation model, and obtaining the probability that the candidate entities in the candidate entity sets output by the entity disambiguation model are the knowledge base entities referred by the entity mentions;
the determining unit is used for determining entity links of texts to be linked according to the probability that the candidate entities in the candidate entity set are the knowledge base entities referred by the entity mentions;
the entity mention-knowledge base entity dictionary is determined according to a pre-constructed domain data set, wherein the domain data set comprises existing entity mentions in the target encyclopedia and corresponding knowledge base entities;
the entity disambiguation model integrates different characteristics by using a multilayer perceptron, information is transmitted between a candidate entity and a context entity thereof by using a graph convolution network, the entity disambiguation model is obtained by taking global characteristic sample data and local characteristic sample data of any training corpus in the field data set as samples and taking a probability result of a knowledge base entity referred by entity mentioning in the any training corpus as a sample label training.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the entity linking method for a limited domain according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the entity linking method for a limited domain according to any one of claims 1 to 7.
CN202010108590.5A 2020-02-21 2020-02-21 Method and device for linking entities in limited field Pending CN113297386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010108590.5A CN113297386A (en) 2020-02-21 2020-02-21 Method and device for linking entities in limited field


Publications (1)

Publication Number Publication Date
CN113297386A true CN113297386A (en) 2021-08-24

Family

ID=77317582



Similar Documents

Publication Publication Date Title
CN110110062B (en) Machine intelligent question and answer method and device and electronic equipment
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN104598611B (en) The method and system being ranked up to search entry
US11216701B1 (en) Unsupervised representation learning for structured records
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111382255A (en) Method, apparatus, device and medium for question and answer processing
CN111866004B (en) Security assessment method, apparatus, computer system, and medium
CN111160041B (en) Semantic understanding method and device, electronic equipment and storage medium
CN113705196A (en) Chinese open information extraction method and device based on graph neural network
CN113593661A (en) Clinical term standardization method, device, electronic equipment and storage medium
CN113947084A (en) Question-answer knowledge retrieval method, device and equipment based on graph embedding
CN112580331A (en) Method and system for establishing knowledge graph of policy text
CN112085091A (en) Artificial intelligence-based short text matching method, device, equipment and storage medium
CN114722833A (en) Semantic classification method and device
CN113779190A (en) Event cause and effect relationship identification method and device, electronic equipment and storage medium
CN112395866B (en) Customs clearance sheet data matching method and device
CN112307738A (en) Method and device for processing text
CN111241843B (en) Semantic relation inference system and method based on composite neural network
CN110287396A (en) Text matching technique and device
CN113742445B (en) Text recognition sample obtaining method and device and text recognition method and device
CN113297386A (en) Method and device for linking entities in limited field
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
CN111401070A (en) Word sense similarity determining method and device, electronic equipment and storage medium
US20240086768A1 (en) Learning device, inference device, non-transitory computer-readable medium, learning method, and inference method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination