CN113111136A - Entity disambiguation method and device based on UCL knowledge space - Google Patents

Entity disambiguation method and device based on UCL knowledge space Download PDF

Info

Publication number
CN113111136A
Authority
CN
China
Prior art keywords
entity
ucl
knowledge
disambiguated
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110475291.XA
Other languages
Chinese (zh)
Other versions
CN113111136B (en)
Inventor
杨鹏
常欣辰
范路平
于晓潭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110475291.XA priority Critical patent/CN113111136B/en
Publication of CN113111136A publication Critical patent/CN113111136A/en
Application granted granted Critical
Publication of CN113111136B publication Critical patent/CN113111136B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an entity disambiguation method and device based on a UCL knowledge space. The method first constructs a basic knowledge base to complete the construction of the UCL knowledge space; it then acquires a candidate entity set related to the entity to be disambiguated from the UCL knowledge space and generates embedded representations of the candidate entities and the entity to be disambiguated with a word-vector representation method; next it extracts conceptual features of the entity to be disambiguated and its context, as well as features of the context of each candidate entity; finally, the four vector representations generated above are used as input to a self-attention matching network based on a deep structured semantic model (DSSM) to obtain a matching degree. The final disambiguation result is obtained by ranking the matching results, completing the entity link between the entity in the text and the entity in the UCL knowledge space. The invention alleviates the scarcity of entity-related information in short texts and improves the accuracy of entity disambiguation.

Description

Entity disambiguation method and device based on UCL knowledge space
Technical Field
The invention relates to an entity disambiguation method and device based on a UCL knowledge space, and belongs to the technical field of knowledge graph construction for the Internet.
Background
With the rapid development of the Internet, the volume of online news has grown rapidly and the knowledge it contains has become increasingly complex, so a suitable carrier is urgently needed to store and manage news information effectively. A knowledge graph forms a graph database by linking entities together through "entity-relation-entity" triples and "entity-attribute (value)" key-value pairs. The Uniform Content Label (UCL), defined by the national standard "Uniform Content Label Format Specification" (GB/T 35304), is such a carrier for indexing web news. By filling in the UCL description attribute sets, otherwise disorganized web news content can be organized effectively. On this basis, the UCL Knowledge Space (UCLKS) has emerged: it builds on basic knowledge bases such as Wikipedia and Baidu Encyclopedia and supplements entity knowledge with web news content. The process of associating entities in news text with existing entities in the UCL knowledge space is entity disambiguation, also called entity linking. Entity disambiguation is a necessary prerequisite for subsequent knowledge space updates and for applications built on the knowledge space.
Over the past few decades, researchers have conducted extensive research on entity disambiguation, mainly involving rule-based methods, traditional statistical methods, and deep learning methods. When the text context is rich, state-of-the-art algorithms already achieve very good disambiguation performance. However, the performance of these methods drops dramatically on short and spoken texts, because the context that such texts lack is essential for disambiguation. In recent years, many researchers have tried to improve entity disambiguation by exploring richer contextual information. These methods focus mainly on making better use of the existing text, so the problem of missing information remains: without augmenting the source text with external information, it is difficult to improve the disambiguation result substantially.
Disclosure of Invention
To solve these problems, the invention provides an entity disambiguation method and device based on a UCL knowledge space, which expand the knowledge of both the entity to be disambiguated and the candidate entities to improve the model. For the entity to be disambiguated, a concept vector representation is derived and formalized as a concept subgraph; for each candidate entity, an entity vector representation is derived from factual knowledge in the UCL knowledge space. Finally, the disambiguation result is obtained with a deep structured semantic model based on self-attention matching. The method effectively addresses the low disambiguation precision caused by insufficient entity context information in short texts and improves the effect of entity disambiguation.
To achieve this aim, the invention provides an entity disambiguation method based on a UCL knowledge space. First, a basic knowledge base is constructed from Wikipedia and Baidu Encyclopedia knowledge, and news information indexed by UCL is stored in the knowledge base as a content supplement, thereby completing the construction of the UCL knowledge space. Next, the entities to be disambiguated are extracted from the text with an open-domain triple extraction method, and a candidate entity set for disambiguation is obtained from the UCL knowledge space. Then, features of the entity to be disambiguated are extracted, and its embedded representation is enhanced with the concept knowledge corresponding to its context information, while the embedded representations of the candidate entities are enhanced with factual knowledge in the UCL knowledge space. Finally, the matching result is obtained with a deep structured semantic model based on a self-attention matching network.
Specifically, the invention provides the following technical scheme:
An entity disambiguation method based on a UCL knowledge space comprises the following steps:
(1) UCL knowledge space construction: acquiring entities, their basic attributes and the associations among them from an open offline database using information extraction techniques, and constructing a basic knowledge base; acquiring web news, indexing it with UCL, and computing the associations between entities and news items as a supplement to the knowledge base, thereby completing the construction of the UCL knowledge space;
(2) vector representation generation: acquiring a candidate entity set related to the entity to be disambiguated from the UCL knowledge space, and generating embedded representations of the candidate entities and the entity to be disambiguated with a word-vector representation method as input to the semantic matching in step (4);
(3) feature extraction: besides the content information of the text itself, the text is supplemented by introducing external knowledge; the first stage extracts conceptual features of the entity to be disambiguated and its context, and the second stage extracts features of the context of each candidate entity;
(4) semantic matching: using the concept embedding of the entity to be disambiguated and the context embedding of the candidate entity generated in step (3), together with the word-vector embeddings of the candidate entity and the entity to be disambiguated generated in step (2), as the four input vector representations, the matching degree is obtained with a self-attention matching network based on a deep structured semantic model (DSSM); the final disambiguation result is obtained by ranking the matching results, completing the entity link between the entity in the text and the entity in the UCL knowledge space and further updating the content of the UCL knowledge space.
Preferably, when the basic knowledge base is constructed in step (1), the offline database of Chinese Wikipedia is selected as the basic knowledge base, and entities, attributes and association information are acquired from Wikipedia entry pages; to address the incompleteness of the Wikipedia data, Baidu Encyclopedia data is used as a supplement, further enriching the content of the knowledge base.
Preferably, when completing the construction of the UCL knowledge space in step (1), news from mainstream Chinese news portals is crawled with a web crawler, the obtained news information is cleaned and semantically parsed, the entities in the news are indexed as UCL entities and merged into the basic entity library, and links between UCLs and between entities and UCLs are constructed, thereby completing the construction of the initial UCL knowledge space.
Preferably, in step (2), to acquire the candidate entities, link-candidate filtering is first completed: the target objects possibly linked with the entity are acquired from the knowledge base by means of an entity mention dictionary. Given an object m to be disambiguated and a candidate entity e, a word vector representation (denoted here v_m) of the entity to be disambiguated and a word vector representation (v_e) of the candidate entity are generated by averaging the hidden representations of the words or characters in the text.
Preferably, the feature extraction process for the entity to be disambiguated in the first stage of step (3) is implemented by three modules: a knowledge feature network module, a subgraph construction module and a conceptualization module.
Preferably, the knowledge feature network in step (3) is constructed from the correspondences among the words in the sentence where the entity to be disambiguated is located, the entity to be disambiguated itself, and the candidate concepts; the words and the entity to be disambiguated are obtained through lexical analysis and entity recognition, and the candidate concepts are obtained through knowledge space matching; the knowledge feature network describes three relations, namely the concept-entity relation, the concept-concept relation and the word-concept relation; a conceptualized subgraph is constructed by acquiring these three relations; the most suitable concept is obtained with a random walk with restart algorithm, and the concept corresponding to the entity to be disambiguated is converted into a conceptualized embedding v_c.
Preferably, the feature extraction process for the candidate entities in the second stage of step (3) comprises the following steps: first, positive samples are generated from the descriptions of entities in the UCL knowledge space, and the description texts are segmented into words; the word frequencies in the positive samples are counted, and negative samples are generated by stratified random sampling; training samples are generated from the entity co-occurrence data set and the S-P-O triple data in the knowledge space so as to learn the relations between candidate entities and enhance the entity representation, the S-P-O triples coming from the UCL nodes stored in the knowledge space, since the news events indexed by UCL naturally carry triple information; then, vector representations of the entities and their context information are generated with a CBOW model, after which the embedded representations of the entities are updated with a Skip-Gram model to strengthen the relations between entities, and the embedded representation v_ctx of each candidate entity is obtained by training.
Preferably, the semantic matching process in step (4) adopts a deep structured semantic model based on a self-attention matching network; the four vector representations of the entity to be disambiguated and the candidate entity obtained in steps (2) and (3), namely the word vector v_m of the entity to be disambiguated, the word vector v_e of the candidate entity, the concept vector v_c of the entity to be disambiguated, and the context vector v_ctx of the candidate entity, are fed into the network, which maps the vectors into a continuous semantic space to learn low-dimensional vector representations of the text and then models the semantic similarity between the vectors. A multi-layer fully connected network serves as the model for generating the low-dimensional semantic vectors, and cosine similarity is used to compute the similarity loss; finally, the candidate entity that best matches the entity to be disambiguated semantically is found, completing the disambiguation task.
The invention also provides an entity disambiguation device based on the UCL knowledge space, comprising a memory, a processor and a computer program stored on the memory and executable on the processor; when loaded into the processor, the computer program implements the above entity disambiguation method based on the UCL knowledge space.
Preferably, the computer program comprises a UCL knowledge space construction module, a vector representation generation module, a feature extraction module and a semantic matching module; the UCL knowledge space construction module is used for constructing a basic knowledge base from Wikipedia and Baidu Encyclopedia knowledge and storing the news information indexed by UCL in the knowledge base as a content supplement of the UCL knowledge space, thereby completing the construction of the UCL knowledge space; the vector representation generation module is used for extracting the entity to be disambiguated from the text and acquiring a candidate entity set for entity disambiguation from the UCL knowledge space; the feature extraction module is used for extracting features of the entity to be disambiguated and enhancing its embedded representation with the concept knowledge corresponding to its context information, and for enhancing the embedded representations of the candidate entities with the factual knowledge in the UCL knowledge space; the semantic matching module is used for obtaining the matching result with a deep structured semantic model based on a self-attention matching network.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The method indexes news text using the fact that the UCL content format matches the elements of a news item, and constructs the UCL knowledge space by fusing it with the basic knowledge base. Compared with a traditional encyclopedic knowledge base, the rich semantic information carried by UCL increases the semantic associations among the entities in the knowledge base.
(2) The invention supplements the related entities of each candidate entity with the knowledge in the UCL knowledge space, thereby enhancing the context representation of the candidate entity. After the entity disambiguation work is finished, positive feedback is returned to the UCL knowledge space, and the information related to the entities in the UCL knowledge space is updated. As the scale of the entity disambiguation task grows, the effect of entity disambiguation also improves.
(3) The invention introduces a knowledge feature network for the entity to be disambiguated and adds conceptualization attributes to it; the fine-grained conceptualized representation addresses the lack of context information for entity linking in short texts. The invention can alleviate the scarcity of entity-related information in short texts and improve the accuracy of entity disambiguation.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a diagram of an example of a UCL knowledge space node according to an embodiment of the present invention.
Fig. 3 is a diagram illustrating an example of entity disambiguation tasks according to an embodiment of the present invention.
Fig. 4 is a block diagram of the overall architecture of entity disambiguation based on the UCL knowledge space according to an embodiment of the present invention.
Fig. 5 is a block diagram of a conceptual feature extraction module for an entity to be disambiguated according to an embodiment of the present invention.
Fig. 6 is a diagram illustrating a candidate entity context embedding vector generation structure according to an embodiment of the present invention.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
As shown in fig. 1, an entity disambiguation method based on a UCL knowledge space disclosed in the embodiment of the present invention includes the following specific implementation steps:
step 1, constructing a UCL knowledge space. Acquiring entities, entity basic attributes and associations among the entities from an open offline database by using an information extraction related technology, and constructing a basic knowledge base; and acquiring network news, indexing the network news by using the UCL, and completing the construction of a UCL knowledge space as the supplement of a knowledge base. The steps are the premise work of the invention, and the specific steps are as follows:
and a substep 1-1, constructing a basic knowledge base. The method comprises the steps of establishing a basic database by using an offline database of Chinese Wikipedia as source data, crawling classification pages and content pages of the Wikipedia, and acquiring information such as entities, attributes, associations and the like from entry pages of the Wikipedia. The links of the classification pages can be linked to specific entity content pages and next-level sub-classification pages, and the content pages are content detail pages and comprise structured data and related entity links. And then, entity and relationship data of the encyclopedia are crawled, a plurality of different meaning items of the same entity are displayed on an encyclopedia entry interface, and the encyclopedia entry interface can be used as a candidate entity source for subsequent entity disambiguation work by extracting the content. And (4) completing the fusion of the Wikipedia knowledge and the Baidu encyclopedia knowledge by utilizing related technologies such as entity fusion, entity alignment and the like to form a basic knowledge base.
Substep 1-2: news from mainstream Chinese news portals is crawled with a crawler tool, the crawled news information is cleaned and semantically analyzed, and a natural language processing tool extracts the news elements such as title, author, time and abstract, together with the entity and relation information in the news body. The news content is then indexed with a UCL indexing tool, and the entities in the news are indexed as UCL entities to obtain UCL nodes.
Substep 1-3: calculate the frequency of occurrence of each entity in the news as shown in Equation 1, where count(e_i) is the number of occurrences of entity e_i and the denominator is the total number of occurrences of all entities. After the frequency calculation is completed, entities with low frequency are filtered out to simplify subsequent computation.

freq(e_i) = count(e_i) / Σ_j count(e_j)    (Equation 1)
Substep 1-4: distinguish the position of each entity and set a position weight location(e_i). As shown in Equation 2, P is the number of paragraphs in the news and p is the paragraph in which entity e_i currently appears. When the news has no more than two paragraphs, location(e_i) is a fixed value; when it has more than two paragraphs, entities in the first and last paragraphs receive the same score, and entities in the other paragraphs receive one quarter of that score.

location(e_i) = λ if P ≤ 2 or p ∈ {1, P}, and λ/4 otherwise    (Equation 2, written with λ denoting the fixed first/last-paragraph weight)
Substep 1-5: extract the central sentence set with the TextRank algorithm, denoted Sen = {s_1, s_2, ..., s_n}, where each s_i is a central sentence composed of entities. The more central sentences an entity appears in, the higher its relative weight. As shown in Equation 3, n is the number of central sentences and I(e_i ∈ s_t) is an indicator function denoting whether entity e_i appears in central sentence s_t.

center(e_i) = (1/n) · Σ_{t=1..n} I(e_i ∈ s_t)    (Equation 3)
Substep 1-6: after computing the three weight components, combine them into the entity semantic weight formula for UCL, shown as Equation 4:

EW(e_i) = Avg(location(e_i)) × (η·freq(e_i) + (1 − η)·center(e_i))    (Equation 4)

where η is an adjustment parameter in the range 0 to 1, and Avg(location(e_i)) is the average position weight of the entity: because the same entity may appear at different positions in an article several times, the weighted average of its position weights is computed using the frequency of occurrence at each position as the weight. After EW(e_i) has been computed for all entities, the UCL semantic weight of each entity is obtained by normalization.
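To make the weighting scheme of substeps 1-3 to 1-6 concrete, the following sketch computes the entity semantic weights for one tokenised news article. It is an illustrative reading of Equations 1-4 only: the function name, the default value of η and the base paragraph weight λ are assumptions not fixed by the patent text.

```python
from collections import Counter

def entity_semantic_weights(paragraphs, central_sentences, eta=0.5, base_loc=1.0):
    """Illustrative computation of EW(e) from Equations 1-4.

    paragraphs        : list of paragraphs, each a list of entity mentions.
    central_sentences : list of entity sets extracted by TextRank (substep 1-5).
    eta, base_loc     : illustrative values for the adjustment parameter and
                        the fixed first/last-paragraph weight.
    """
    entities = [e for para in paragraphs for e in para]
    counts = Counter(entities)
    total = sum(counts.values())
    freq = {e: c / total for e, c in counts.items()}                     # Equation 1

    P = len(paragraphs)
    loc_sums, loc_hits = Counter(), Counter()
    for p, para in enumerate(paragraphs, start=1):
        # Equation 2: first/last paragraph keep the full weight, others a quarter of it
        w = base_loc if (P <= 2 or p in (1, P)) else base_loc / 4
        for e in para:
            loc_sums[e] += w
            loc_hits[e] += 1
    avg_loc = {e: loc_sums[e] / loc_hits[e] for e in counts}             # Avg(location(e))

    n = max(len(central_sentences), 1)
    center = {e: sum(e in s for s in central_sentences) / n for e in counts}   # Equation 3

    ew = {e: avg_loc[e] * (eta * freq[e] + (1 - eta) * center[e]) for e in counts}  # Equation 4
    z = sum(ew.values()) or 1.0
    return {e: v / z for e, v in ew.items()}      # normalised UCL semantic weights
```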
Substep 1-7: establish the connections between the entities in a UCL and the entities in the basic knowledge base, and add the UCL nodes to the knowledge base to complete the construction of the UCL knowledge space. Fig. 2 illustrates the relationship between UCL nodes and entity nodes in the UCL knowledge space: square nodes represent UCLs indexing news content and circular nodes represent basic entities. The value on the edge between a UCL node and an entity node is the semantic weight of the entity in the news indexed by that UCL, and the edges between entities represent the relations between entities. Take the disambiguation of the entity "Demi-Gods and Semi-Devils" (《天龙八部》) in the sentence of Fig. 3, "'Demi-Gods and Semi-Devils' should be regarded as Jin Yong's best-selling work", as an example: after the knowledge space is constructed, "Demi-Gods and Semi-Devils" has 18 senses in the knowledge space.
Step 2: vector representation generation. A candidate entity set related to the entity to be disambiguated is acquired from the UCL knowledge space, and embedded representations of the candidate entities and the entity to be disambiguated are generated with a word-vector representation method as input to the semantic matching module in Step 4. The specific steps are as follows:
and a substep 2-1, aiming at the entity m to be disambiguated, acquiring a target object possibly linked with the entity from a knowledge base by using an entity named item dictionary mode, wherein the named item dictionary restrains alias names existing in the entity through a self-defining method, and reserving all target entities corresponding to the word with one word ambiguity.
Substep 2-2: given the object m to be disambiguated and a candidate entity e obtained in substep 2-1, a word vector representation v_m of the entity to be disambiguated is generated by averaging the hidden representations of the words or characters in the text where the entity appears, and a word vector representation v_e of the candidate entity is generated in the same way.
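A minimal sketch of substeps 2-1 and 2-2 follows. The mention-dictionary entries, the sense names and the embedding table are illustrative placeholders; the patent only specifies that candidates come from an entity mention dictionary and that the word vectors are obtained by averaging hidden representations.

```python
import numpy as np

# Hypothetical alias dictionary: mention string -> entity IDs in the knowledge space.
MENTION_DICT = {
    "天龙八部": ["天龙八部（小说）", "天龙八部（电视剧）", "天龙八部（佛教术语）"],
}

def get_candidates(mention):
    """Substep 2-1: link-candidate filtering via the entity mention dictionary."""
    return MENTION_DICT.get(mention, [])

def average_embedding(tokens, embedding_table, dim=100):
    """Substep 2-2: average the hidden representations of the words/characters
    in the text to obtain a single word-vector representation (v_m or v_e)."""
    vecs = [embedding_table[t] for t in tokens if t in embedding_table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```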
Step 3: feature extraction. Besides the content information of the text itself, the text is supplemented by introducing external knowledge: the concepts in the UCL knowledge space that may correspond to the entity to be disambiguated, and the related context information of the candidate entities in the UCL knowledge space, serve as external supplements to the text and enrich the entity vector representations.
In the first stage, feature extraction is carried out on the entity to be disambiguated: the entity is labeled with concepts, and these concepts are used as additional features. A Knowledge Feature Network (KFN) is constructed from the text entity and the corresponding concepts, a subgraph containing entity and concept nodes is generated from the KFN, and a proper concept for the entity to be disambiguated is obtained with a random walk with restart algorithm. The conceptualization is represented in vector form as a supplementary feature of the entity to be disambiguated.
and in the second stage, feature extraction is carried out on the candidate entities, the existing knowledge of a UCL knowledge space is utilized to learn the relation between the entity context text and the entities at the same time, and a CBOW model with a sigmoid function layer is used to generate vector representation of the candidate entities.
The specific implementation steps are as follows:
Substep 3-1: feature extraction for the entity to be disambiguated. As shown in Fig. 4, this stage is implemented by three modules: a Knowledge Feature Network (KFN) module, a subgraph construction module and a conceptualization module. Given a text object, i.e. "Demi-Gods and Semi-Devils" in the sentence of Fig. 3, "'Demi-Gods and Semi-Devils' should be regarded as Jin Yong's best-selling work", a knowledge feature network is first constructed from the text and factual knowledge. It contains three types of content: the entity to be disambiguated in the text, i.e. "Demi-Gods and Semi-Devils"; the other words in the text, e.g. "Jin Yong", "best-selling", "work"; and the candidate concepts.
Substep 3-2: the knowledge feature network is constructed from the correspondences among the words in the sentence containing the entity to be disambiguated, the entity to be disambiguated itself, and the candidate concepts. The words and the entity to be disambiguated are obtained through lexical analysis and entity recognition, and the candidate concepts are obtained through knowledge space matching. The KFN describes three relations: the concept-entity relation, the concept-concept relation and the word-concept relation. They are obtained as follows:
(1) The concept-entity relation is represented by the generation probability of concept c for entity e. The probability p(c|e) is obtained from page-view statistics of Wikipedia entity pages, as shown in Equation 5, where N_PV(e) is the page-view count of entity e and e' is any entity belonging to concept c:

p(c|e) = N_PV(e) / Σ_{e'∈c} N_PV(e')    (Equation 5)
(2) The concept-concept relation is represented by the transition probability between two concepts; for concepts c_i and c_j, the probability p(c_i|c_j) is calculated from the co-occurrence frequency of the entities under the two concepts, as shown in Equation 6:

p(c_i|c_j) ∝ Σ_{e_i∈c_i} Σ_{e_j∈c_j} N(e_j, e_i)    (Equation 6, reconstructed up to normalization from the co-occurrence counts of Equation 7)

As shown in Equation 7, the co-occurrence count N(e_j, e_i) is computed from anchor-link statistics in Baidu Encyclopedia, where w is the window size for counting entity-pair co-occurrences and is set to 25 in the invention:

N(e_j, e_i) = freq_w(e_j, e_i)    (Equation 7)
(3) The word-concept relation is represented by the labeling probability between a word w and a related concept c. The probability is calculated from the word frequency and the word-concept co-occurrence frequency, as shown in Equation 8, where N(w) is the frequency of occurrence of word w:

p(c|w) = N(w, c) / N(w)    (Equation 8, with N(w, c) denoting the word-concept co-occurrence frequency)
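The three KFN relation probabilities can be estimated as in the following sketch. Equations 5, 7 and 8 follow the variable definitions above; the normalization used for the concept-concept transition of Equation 6 is an assumption, since the original formula image is not reproduced in this text.

```python
def p_concept_given_entity(e, c, page_views, concept_entities):
    """Equation 5: generation probability from Wikipedia page-view counts N_PV(e),
    normalised over the entities e' that belong to concept c."""
    denom = sum(page_views.get(e2, 0) for e2 in concept_entities[c])
    return page_views.get(e, 0) / denom if denom else 0.0

def entity_cooccurrence(ej, ei, pair_counts):
    """Equation 7: window-based co-occurrence count N(ej, ei) from Baidu
    Encyclopedia anchor links (window size w = 25 in the embodiment)."""
    return pair_counts.get((ej, ei), 0)

def p_concept_transition(ci, cj, pair_counts, concept_entities, candidate_concepts):
    """Equation 6: transition probability between concepts, aggregated from the
    co-occurrence counts of their member entities.  Normalising over the candidate
    concept set is an assumption."""
    def mass(ca, cb):
        return sum(entity_cooccurrence(ej, ei, pair_counts)
                   for ei in concept_entities[ca] for ej in concept_entities[cb])
    denom = sum(mass(ck, cj) for ck in candidate_concepts)
    return mass(ci, cj) / denom if denom else 0.0

def p_concept_given_word(w, c, word_freq, word_concept_counts):
    """Equation 8: labeling probability from word frequency N(w) and the
    word-concept co-occurrence frequency N(w, c)."""
    return word_concept_counts.get((w, c), 0) / max(word_freq.get(w, 1), 1)
```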
sub-step 3-3, the invention uses a Restart Random Walk algorithm (RWR) on the knowledge characteristic network to get the proper concept of the entity to be disambiguated. Firstly, the weights of nodes and edges are initialized through the following formulas 9 and 10, wherein l represents the edge of the knowledge characteristic network, n represents the node of the knowledge characteristic network, T represents the number of entities in the knowledge characteristic network, T represents the entity node, and c represents the concept node.
Figure BDA0003046798940000081
Figure BDA0003046798940000082
Subsequently, the nodes and edges are updated iteratively by Equations 11 and 12:

N_k = (1 − α)·E′ × N_{k−1} + α·N_0    (Equation 11)
E_k ← (1 − β)·N_k + β·E_k    (Equation 12)

where α and β are hyper-parameters tuned on the development set. Finally, the edge weights are normalized and the concept with the highest weight is selected, as shown in Equation 13:

c* = argmax_c Ê(c)    (Equation 13, with Ê denoting the normalized edge weight associated with concept c)
as shown in fig. 5, a conceptualized sub-graph structure is obtained by calculation, in which relationships among entities, words, and concepts and conversion probabilities are included, and a concept having the highest final conversion probability is used as an object for conceptualization embedding to generate a vector
Figure BDA0003046798940000084
In the illustration, the concept of highest conversion probability corresponding to the "eight dragon" entities and the "best selling" is the "novel".
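The iteration of Equations 11-13 above can be written compactly as below. This is a sketch under stated assumptions: the α and β values stand in for the development-set hyper-parameters, and the way node weights are broadcast back onto the edges in Equation 12 is an interpretation of the scalar update given above, since the original formula images are not reproduced.

```python
import numpy as np

def rwr_concept_selection(adj, n0, concept_idx, alpha=0.2, beta=0.5, iters=20):
    """Random walk with restart over the knowledge feature network (Equations 11-13).

    adj         : edge-weight matrix E' over the KFN nodes (assumed column-normalised).
    n0          : initial node-weight vector from Equations 9-10.
    concept_idx : indices of the concept nodes within the KFN node ordering.
    """
    n = n0.astype(float).copy()
    e = adj.astype(float).copy()
    for _ in range(iters):
        n = (1 - alpha) * e @ n + alpha * n0          # Equation 11: node-weight update
        e = (1 - beta) * n[np.newaxis, :] + beta * e  # Equation 12: edge-weight update (broadcast)
    # Equation 13: normalise the edge weights and pick the concept with the highest weight
    score = e.sum(axis=0)
    total = score.sum()
    if total > 0:
        score = score / total
    return max(concept_idx, key=lambda i: score[i])
```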
Substep 3-4: feature extraction for the candidate entity context. As shown in Fig. 6, the invention proposes a new feature learning method that jointly learns the candidate entity context and the entity relations contained in the knowledge space, and then generates the context vector representation of each candidate entity with a CBOW model whose activation function is a sigmoid. The specific steps are as follows:
(1) Positive samples are first generated from the associated description of the entity in the UCL knowledge space, the description containing text or entities associated with the candidate entity. The entity description text is then segmented into words, the word frequencies of the positive samples are counted, and negative samples are generated by stratified random sampling.
(2) To strengthen the relations between the candidate entity and its related entities, the entity co-occurrence data and the S-P-O triple data in the UCL knowledge space are used to generate training samples. The S-P-O triples come from the news text indexed by the UCLs associated with the candidate entity in the UCL knowledge space, and the triple information contained in the news events is stored in the knowledge space. The entity semantic weights computed in substep 1-6 can serve as supplementary information for the candidate entity context.
(3) The entity sequences related to the candidate entities are taken as training samples, and an embedded representation of each entity is generated with the CBOW model. Finally, the context embedded representation of each entity is updated with the Skip-Gram model to obtain the candidate entity context vector v_ctx.
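A minimal sketch of the two-stage training in substep 3-4 is shown below, using gensim (assumed version 4.x) word2vec as a stand-in for the CBOW and Skip-Gram models described above. The toy corpus and the choice to train two separate models rather than update a single one are simplifications, not the patent's exact procedure.

```python
from gensim.models import Word2Vec  # assumes gensim >= 4.0 (vector_size parameter)

# Toy corpus standing in for the training samples described above: tokenised
# entity descriptions plus entity sequences derived from co-occurrence data and
# S-P-O triples stored in the UCL knowledge space.
entity_sequences = [
    ["天龙八部", "金庸", "小说", "武侠"],
    ["天龙八部", "电视剧", "改编", "金庸"],
    ["金庸", "作家", "香港"],
]

# Stage 1: CBOW (sg=0) produces initial embeddings for entities and context words.
cbow = Word2Vec(entity_sequences, vector_size=100, window=5, min_count=1,
                sg=0, epochs=50)

# Stage 2: a Skip-Gram pass (sg=1) over the same sequences to sharpen the
# entity-entity relations; reading the final vectors from a separately trained
# model is a simplification of the update step described in the text.
skipgram = Word2Vec(entity_sequences, vector_size=100, window=5, min_count=1,
                    sg=1, epochs=50)

v_ctx = skipgram.wv["天龙八部"]   # context embedding of the candidate entity
```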
Step 4: semantic matching. As shown in Fig. 4, the four vectors obtained from substeps 2-2, 3-3 and 3-4, namely the word vector v_m of the entity to be disambiguated, the word vector v_e of the candidate entity, the concept vector v_c of the entity to be disambiguated, and the context vector v_ctx of the candidate entity, are fed into a Deep Structured Semantic Model (DSSM) based on a self-attention matching network. The specific steps are as follows:
and a substep 4-1, capturing key information of the original sentence by the input word vector sequence through an LSTM layer, and respectively outputting a semantic vector. And then, respectively aggregating the vector representation related to the entity to be disambiguated and the vector representation related to the candidate entity into a single high-dimensional word vector by utilizing a two-layer self-attention mechanism. And then, mapping the high-dimensional semantic vectors to a continuous semantic space to learn the low-latitude vector representation of the text by utilizing a full-connection layer, and further modeling the semantic similarity between the vectors. Cosine similarity is used to calculate similarity loss, i.e. association scores between entities, through a multi-layer Fully Connected networks (FC) as a model for generating low latitude semantic vectors.
Substep 4-2: the above steps are repeated for every entity in the candidate entity set in turn, and the candidate with the highest association score is taken as the entity that best matches the entity to be disambiguated, completing the entity disambiguation. Finally, the link between the entity to be disambiguated and the corresponding entity in the knowledge space is established according to the disambiguation result.
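The matching network of Step 4 can be sketched in PyTorch (version 1.9 or later assumed) as follows. Layer sizes, the number of attention heads and the pooling choice are illustrative assumptions; the patent specifies only an LSTM layer, self-attention aggregation, fully connected projection into a shared semantic space, and cosine similarity as the association score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionDSSM(nn.Module):
    """Sketch of the self-attention matching network: LSTM, self-attention
    aggregation, fully connected projection into a shared low-dimensional
    semantic space, and cosine similarity as the matching score."""

    def __init__(self, emb_dim=100, hidden=128, sem_dim=64, heads=4):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, sem_dim), nn.Tanh(),
                                nn.Linear(sem_dim, sem_dim))

    def encode(self, seq):
        h, _ = self.lstm(seq)          # capture key information of the sequence
        a, _ = self.attn(h, h, h)      # self-attention over the LSTM states
        pooled = a.mean(dim=1)         # aggregate into a single vector
        return self.fc(pooled)         # map into the shared semantic space

    def forward(self, mention_side, candidate_side):
        # mention_side  : sequence built from [v_m, v_c] (entity to be disambiguated)
        # candidate_side: sequence built from [v_e, v_ctx] (candidate entity)
        q = self.encode(mention_side)
        d = self.encode(candidate_side)
        return F.cosine_similarity(q, d, dim=-1)   # association score

# Usage: rank each candidate by the returned score (random tensors as placeholders).
model = SelfAttentionDSSM()
mention_side = torch.randn(1, 2, 100)    # stacked v_m, v_c
candidate_side = torch.randn(1, 2, 100)  # stacked v_e, v_ctx
score = model(mention_side, candidate_side)
```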
Based on the same inventive concept, the invention further provides an entity disambiguation apparatus based on the UCL knowledge space, which comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when the computer program is loaded into the processor, it implements the entity disambiguation method based on the UCL knowledge space described above. The apparatus comprises a UCL knowledge space construction module, a vector representation generation module, a feature extraction module and a semantic matching module. The UCL knowledge space construction module constructs the basic knowledge base from Wikipedia and Baidu Encyclopedia knowledge and stores the news information indexed by UCL in the knowledge base as a content supplement, completing the construction of the UCL knowledge space, i.e. the content of Step 1. The vector representation generation module extracts the entity to be disambiguated from the text and acquires a candidate entity set for disambiguation from the UCL knowledge space, i.e. the content of Step 2. The feature extraction module extracts features of the entity to be disambiguated and enhances its embedded representation with the concept knowledge corresponding to its context information, and then enhances the embedded representations of the candidate entities with the existing factual knowledge in the UCL knowledge space, i.e. the content of Step 3. The semantic matching module obtains the matching result with the deep structured semantic model based on the self-attention matching network, i.e. the content of Step 4.
The technical means of the present invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (10)

1. An entity disambiguation method based on a UCL knowledge space, characterized by comprising the following steps:
(1) UCL knowledge space construction: acquiring entities, their basic attributes and the associations among them from an open offline database using information extraction techniques, and constructing a basic knowledge base; acquiring web news, indexing it with UCL, and computing the associations between entities and news items as a supplement to the knowledge base, thereby completing the construction of the UCL knowledge space;
(2) vector representation generation: acquiring a candidate entity set related to the entity to be disambiguated from the UCL knowledge space, and generating embedded representations of the candidate entities and the entity to be disambiguated with a word-vector representation method as input to the semantic matching in step (4);
(3) feature extraction: besides the content information of the text itself, supplementing the text by introducing external knowledge; the first stage extracting conceptual features of the entity to be disambiguated and its context, and the second stage extracting features of the context of each candidate entity;
(4) semantic matching: using the concept embedding of the entity to be disambiguated and the context embedding of the candidate entity generated in step (3), together with the word-vector embeddings of the candidate entity and the entity to be disambiguated generated in step (2), as the four input vector representations, and obtaining the matching degree with a self-attention matching network based on a deep structured semantic model (DSSM); obtaining the final disambiguation result by ranking the matching results, completing the entity link between the entity in the text and the entity in the UCL knowledge space, and further updating the content of the UCL knowledge space.
2. The entity disambiguation method based on a UCL knowledge space according to claim 1, characterized in that: when the basic knowledge base is constructed in step (1), the offline database of Chinese Wikipedia is selected as the basic knowledge base, and entities, attributes and association information are acquired from Wikipedia entry pages; to address the incompleteness of the Wikipedia data, Baidu Encyclopedia data is used as a supplement, further enriching the content of the knowledge base.
3. The entity disambiguation method based on a UCL knowledge space according to claim 1, characterized in that: when the construction of the UCL knowledge space is completed in step (1), news from mainstream Chinese news portals is crawled with a web crawler, the obtained news information is cleaned and semantically parsed, the entities in the news are indexed as UCL entities and merged into the basic entity library, and links between UCLs and between entities and UCLs are constructed, thereby completing the construction of the initial UCL knowledge space.
4. The entity disambiguation method based on a UCL knowledge space according to claim 1, characterized in that: in step (2), to acquire the candidate entities, link-candidate filtering is first completed, and the target objects possibly linked with the entity are acquired from the knowledge base by means of an entity mention dictionary; given an object m to be disambiguated and a candidate entity e, a word vector representation v_m of the entity to be disambiguated and a word vector representation v_e of the candidate entity are generated by averaging the hidden representations of the words or characters in the text.
5. The entity disambiguation method based on a UCL knowledge space according to claim 1, characterized in that: the feature extraction process for the entity to be disambiguated in the first stage of step (3) is implemented by three modules, namely a knowledge feature network module, a subgraph construction module and a conceptualization module.
6. The entity disambiguation method based on a UCL knowledge space according to claim 5, characterized in that: the knowledge feature network in step (3) is constructed from the correspondences among the words in the sentence where the entity to be disambiguated is located, the entity to be disambiguated itself, and the candidate concepts; the words and the entity to be disambiguated are obtained through lexical analysis and entity recognition, and the candidate concepts are obtained through knowledge space matching; the knowledge feature network describes three relations, namely the concept-entity relation, the concept-concept relation and the word-concept relation; a conceptualized subgraph is constructed by acquiring these three relations; the most suitable concept is obtained with a random walk with restart algorithm, and the concept corresponding to the entity to be disambiguated is converted into a conceptualized embedding v_c.
7. The entity disambiguation method based on a UCL knowledge space according to claim 1, characterized in that: the feature extraction process for the candidate entities in the second stage of step (3) comprises the following steps: first, positive samples are generated from the descriptions of entities in the UCL knowledge space, and the description texts are segmented into words; the word frequencies in the positive samples are counted, and negative samples are generated by stratified random sampling; training samples are generated from the entity co-occurrence data set and the S-P-O triple data in the knowledge space so as to learn the relations between candidate entities and enhance the entity representation, the S-P-O triples coming from the UCL nodes stored in the knowledge space, since the news events indexed by UCL naturally carry triple information; then, vector representations of the entities and their context information are generated with a CBOW model, after which the embedded representations of the entities are updated with a Skip-Gram model to strengthen the relations between entities, and the embedded representation v_ctx of each candidate entity is obtained by training.
8. The entity disambiguation method based on a UCL knowledge space according to claim 1, characterized in that: the semantic matching process in step (4) adopts a deep structured semantic model based on a self-attention matching network; the four vector representations of the entity to be disambiguated and the candidate entity obtained in steps (2) and (3), namely the word vector v_m of the entity to be disambiguated, the word vector v_e of the candidate entity, the concept vector v_c of the entity to be disambiguated, and the context vector v_ctx of the candidate entity, are fed into the network, which maps the vectors into a continuous semantic space to learn low-dimensional vector representations of the text and then models the semantic similarity between the vectors; a multi-layer fully connected network serves as the model for generating the low-dimensional semantic vectors, and cosine similarity is used to compute the similarity loss; finally, the candidate entity that best matches the entity to be disambiguated semantically is found, completing the disambiguation task.
9. An entity disambiguation apparatus based on a UCL knowledge space, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that: the computer program, when loaded into the processor, implements the entity disambiguation method based on the UCL knowledge space according to any one of claims 1-8.
10. The entity disambiguation apparatus based on a UCL knowledge space according to claim 1, characterized in that: the computer program comprises a UCL knowledge space construction module, a vector representation generation module, a feature extraction module and a semantic matching module; the UCL knowledge space construction module is used for constructing a basic knowledge base from Wikipedia and Baidu Encyclopedia knowledge and storing the news information indexed by UCL in the knowledge base as a content supplement of the UCL knowledge space, thereby completing the construction of the UCL knowledge space; the vector representation generation module is used for extracting the entity to be disambiguated from the text and acquiring a candidate entity set for entity disambiguation from the UCL knowledge space; the feature extraction module is used for extracting features of the entity to be disambiguated and enhancing its embedded representation with the concept knowledge corresponding to its context information, and for enhancing the embedded representations of the candidate entities with the factual knowledge in the UCL knowledge space; the semantic matching module is used for obtaining the matching result with a deep structured semantic model based on a self-attention matching network.
CN202110475291.XA 2021-04-29 2021-04-29 Entity disambiguation method and device based on UCL knowledge space Active CN113111136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110475291.XA CN113111136B (en) 2021-04-29 2021-04-29 Entity disambiguation method and device based on UCL knowledge space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110475291.XA CN113111136B (en) 2021-04-29 2021-04-29 Entity disambiguation method and device based on UCL knowledge space

Publications (2)

Publication Number Publication Date
CN113111136A true CN113111136A (en) 2021-07-13
CN113111136B CN113111136B (en) 2024-04-16

Family

ID=76720462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110475291.XA Active CN113111136B (en) 2021-04-29 2021-04-29 Entity disambiguation method and device based on UCL knowledge space

Country Status (1)

Country Link
CN (1) CN113111136B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656556A (en) * 2021-08-20 2021-11-16 上海大参林医疗健康科技有限公司 Text feature extraction method and knowledge graph construction method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090144609A1 (en) * 2007-10-17 2009-06-04 Jisheng Liang NLP-based entity recognition and disambiguation
CN111428047A (en) * 2020-03-19 2020-07-17 东南大学 Knowledge graph construction method and device based on UC L semantic indexing
CN111581973A (en) * 2020-04-24 2020-08-25 中国科学院空天信息创新研究院 Entity disambiguation method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090144609A1 (en) * 2007-10-17 2009-06-04 Jisheng Liang NLP-based entity recognition and disambiguation
CN111428047A (en) * 2020-03-19 2020-07-17 东南大学 Knowledge graph construction method and device based on UC L semantic indexing
CN111581973A (en) * 2020-04-24 2020-08-25 中国科学院空天信息创新研究院 Entity disambiguation method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
IKUYA YAMADA et al.: "Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation", arXiv, 10 June 2016 (2016-06-10), pages 1 - 10 *
HUAI BAOXING et al.: "A Named Entity Linking Method Based on a Probabilistic Topic Model" (一种基于概率主题模型的命名实体链接方法), Journal of Software (软件学报), vol. 25, no. 9, 31 December 2014 (2014-12-31), pages 2076 - 2087 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656556A (en) * 2021-08-20 2021-11-16 上海大参林医疗健康科技有限公司 Text feature extraction method and knowledge graph construction method
CN113656556B (en) * 2021-08-20 2023-08-15 广州天宸健康科技有限公司 Text feature extraction method and knowledge graph construction method

Also Published As

Publication number Publication date
CN113111136B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
US11631007B2 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
JP7468929B2 (en) How to acquire geographical knowledge
CN110598005B (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN114064918B (en) Multi-modal event knowledge graph construction method
CN104318340B (en) Information visualization methods and intelligent visible analysis system based on text resume information
CN106845411B (en) Video description generation method based on deep learning and probability map model
CN108874878A (en) A kind of building system and method for knowledge mapping
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN110633366A (en) Short text classification method, device and storage medium
CN110516239B (en) Segmentation pooling relation extraction method based on convolutional neural network
CN115269865A (en) Knowledge graph construction method for auxiliary diagnosis
CN116127095A (en) Question-answering method combining sequence model and knowledge graph
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN112597285A (en) Man-machine interaction method and system based on knowledge graph
CN114238653A (en) Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN113869040A (en) Voice recognition method for power grid dispatching
CN113111136B (en) Entity disambiguation method and device based on UCL knowledge space
CN113392183A (en) Characterization and calculation method of children domain map knowledge
CN117290478A (en) Knowledge graph question-answering method, device, equipment and storage medium
CN111859955A (en) Public opinion data analysis model based on deep learning
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant