CN113111136A - Entity disambiguation method and device based on UCL knowledge space - Google Patents
- Publication number: CN113111136A
- Application number: CN202110475291.XA
- Authority: CN (China)
- Prior art keywords: entity, UCL, knowledge, disambiguated, space
- Legal status: Granted (as listed by Google Patents; the status is an assumption and not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides an entity disambiguation method and device based on a UCL knowledge space. First, a basic knowledge base is built to complete construction of the UCL knowledge space. Next, a candidate entity set related to the entity to be disambiguated is retrieved from the UCL knowledge space, and embedded representations of the candidate entity and the entity to be disambiguated are generated with a word vector representation method. Conceptual features are then extracted for the entity to be disambiguated and its context, and context features are extracted for the candidate entity. Finally, the four vector representations generated above are fed to a self-attention matching network based on a deep structured semantic model (DSSM) to obtain a matching degree; the final disambiguation result is obtained by ranking the matching results, which completes entity linking between the entity in the text and the entity in the UCL knowledge space. The invention addresses the scarcity of entity-related information in short text and improves the accuracy of entity disambiguation.
Description
Technical Field
The invention relates to an entity disambiguation method and device based on a UCL knowledge space, and belongs to the technical field of knowledge graph construction in the Internet.
Background
With the rapid development of the Internet, the volume of online news has grown quickly and the knowledge contained in news has become increasingly complex, so a suitable carrier is urgently needed to store and manage news information effectively. A knowledge graph forms a graph database by linking entities through "entity-relationship-entity" triples and "entity-attribute (value)" key-value pairs. The Uniform Content Label (UCL) is defined by the national standard "Uniform Content Label Format Specification" (GB/T35304-…). Highly disordered web news content can be organized effectively by filling in the UCL's set of descriptive attributes. On this basis the UCL Knowledge Space (UCLKS) emerged: it is built on basic knowledge bases such as Wikipedia and Baidu Baike and supplements entity knowledge with online news content. The process of associating entities in news text with existing entities in the UCL knowledge space is entity disambiguation, also called entity linking. Entity disambiguation is a necessary prerequisite for later knowledge space updates and for applications built on the knowledge space.
Over the past few decades, researchers have studied entity disambiguation extensively, mainly through rule-based methods, traditional statistical methods, and deep learning methods. When the text context is rich, the most advanced algorithms already disambiguate very well. On short text and colloquial text, however, their performance drops sharply, because the context that such texts lack is essential for disambiguation. In recent years many researchers have tried to improve disambiguation by exploiting richer contextual information. These methods focus mainly on making better use of the existing text, so the underlying lack of information remains unsolved. Unless the source text is augmented with external information, the disambiguation result is difficult to improve substantially.
Disclosure of Invention
To solve these problems, the invention provides an entity disambiguation method and device based on a UCL knowledge space, which expand the knowledge available for both the entity to be disambiguated and the candidate entities in order to improve model performance. The entity to be disambiguated is given a concept vector representation, formalized as a concept subgraph; each candidate entity is given an entity vector representation, formalized as factual knowledge in the UCL knowledge space. Finally, the disambiguation result is obtained with a deep structured semantic model based on self-attention matching. The method mitigates the low disambiguation precision caused by insufficient entity context in short text and improves the quality of entity disambiguation.
To achieve this aim, the invention provides an entity disambiguation method based on a UCL knowledge space. First, a basic knowledge base is built from Wikipedia and Baidu Baike knowledge; news information is indexed with UCL and stored in the knowledge base as a content supplement, which completes construction of the UCL knowledge space. The entity to be disambiguated is extracted from the text with an open-domain triple extraction method, and a candidate entity set for disambiguation is retrieved from the UCL knowledge space. Features of the entity to be disambiguated are then extracted, and its embedded representation is enhanced with the concept knowledge corresponding to its context information; the embedding of each candidate entity is enhanced with factual knowledge from the UCL knowledge space. Finally, the matching result is obtained with a deep structured semantic model based on a self-attention matching network.
Specifically, the invention provides the following technical solution:
An entity disambiguation method based on a UCL knowledge space comprises the following steps:
(1) Constructing the UCL knowledge space: entities, basic entity attributes, and associations among entities are acquired from an open offline database using information extraction techniques, and a basic knowledge base is built; online news is acquired and indexed with UCL, and the association between entities and news is calculated as a supplement to the knowledge base, completing construction of the UCL knowledge space;
(2) Generating vector representations: a candidate entity set related to the entity to be disambiguated is acquired from the UCL knowledge space, and embedded representations of the candidate entity and the entity to be disambiguated are generated with a word vector representation method as input to the semantic matching module in step (4);
(3) Extracting features: in addition to the content of the text itself, external knowledge is introduced to supplement the text; the first stage extracts conceptual features of the entity to be disambiguated and its context, and the second stage extracts features of the context of the candidate entity;
(4) Semantic matching: four vector representations are used as input, namely the conceptual embedding of the entity to be disambiguated and the context embedding of the candidate entity generated in step (3), and the word vector embeddings of the candidate entity and the entity to be disambiguated generated in step (2); the matching degree is obtained with a self-attention matching network based on a deep structured semantic model (DSSM). The final disambiguation result is obtained by ranking the matching results, which completes entity linking between the entity in the text and the entity in the UCL knowledge space and further updates the content of the UCL knowledge space.
Preferably, when the basic knowledge base is constructed in step (1), the offline database of Chinese Wikipedia is selected as the basic knowledge base, and entities, attributes, and association information are acquired from Wikipedia entry pages; because Wikipedia data is incomplete, Baidu Baike data is used to supplement the entities and further enrich the knowledge base.
Preferably, when construction of the UCL knowledge space is completed in step (1), news is crawled from mainstream Chinese news portal sites with crawler technology; the obtained news is cleaned and semantically parsed; entities in the news are indexed as UCL entities and merged into the basic entity library; and links are built between UCLs and between entities and UCLs, completing construction of the initial UCL knowledge space.
Preferably, in step (2), candidate entities are acquired by first filtering link candidates: an entity mention dictionary is used to retrieve from the knowledge base the target objects that the entity may link to. Given an object m to be disambiguated and a candidate entity e, a word vector representation of the entity to be disambiguated is generated by averaging the hidden representations of the words or characters in the text, and a word vector representation of the candidate entity is generated in the same way.
Preferably, the first-stage feature extraction for the entity to be disambiguated in step (3) is implemented with three modules: a knowledge feature network module, a subgraph construction module, and a conceptualization module.
Preferably, the knowledge feature network in step (3) is constructed from the correspondences among the words in the sentence containing the entity to be disambiguated, the entity to be disambiguated itself, and the candidate concepts. The words and the entity to be disambiguated are obtained through lexical analysis and entity recognition, and the candidate concepts are obtained through knowledge space matching. The knowledge feature network describes three relationships: concept-entity, concept-concept, and word-concept. A conceptualized subgraph is constructed from these three relationships; the most suitable concept is obtained with the random walk with restart algorithm, and the concept corresponding to the entity to be disambiguated is converted into a conceptualized embedding.
Preferably, the second-stage feature extraction for the candidate entity in step (3) proceeds as follows. First, positive samples are generated from the description of the entity in the UCL knowledge space, and the description text is segmented into words. Word frequencies in the positive samples are counted, and negative samples are generated by stratified random sampling. Training samples are generated from an entity co-occurrence data set and the S-P-O triple data in the knowledge space, so as to learn the relationship between a candidate entity and the enhanced entity representation; the S-P-O triples come from the UCL nodes stored in the knowledge space, since a news event indexed by UCL naturally carries triple information. Vector representations of the entity and its context information are then generated with a CBOW model, and the embedded representation of the entity is updated with a Skip-Gram model to strengthen the relationships between entities; training yields the embedded representation of the candidate entity.
Preferably, the semantic matching in step (4) uses a deep structured semantic model based on a self-attention matching network. The four vectors obtained in steps (2) and (3), namely the word vector of the entity to be disambiguated, the word vector of the candidate entity, the concept vector of the entity to be disambiguated, and the context vector of the candidate entity, are fed into the network, which maps them into a continuous semantic space to learn low-dimensional vector representations of the text and then models the semantic similarity between the vectors. A multilayer fully connected network serves as the model that produces the low-dimensional semantic vectors, and the similarity loss is computed with cosine similarity. Finally, the candidate entity that best matches the entity to be disambiguated semantically is found, completing the disambiguation task.
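The final ranking stage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented network: the self-attention layers and the trained fully connected layers are omitted, the two vectors on each side are simply concatenated (an assumption about how they are combined), and the softmax over scaled cosine similarities follows the general DSSM formulation. The entity IDs, vector values, and the `gamma` smoothing factor are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(mention_vec, candidate_vecs, gamma=10.0):
    """Return (candidate_id, probability) pairs, best match first.

    Probabilities are a softmax over gamma-scaled cosine similarities.
    """
    sims = {cid: cosine(mention_vec, vec) for cid, vec in candidate_vecs.items()}
    exps = {cid: math.exp(gamma * s) for cid, s in sims.items()}
    total = sum(exps.values())
    scored = [(cid, e / total) for cid, e in exps.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors: word vector + concept vector for the mention,
# word vector + context vector for each candidate (concatenation is an assumption).
mention = [0.9, 0.1] + [0.8, 0.2]
candidates = {
    "tlbb_novel": [0.8, 0.2] + [0.9, 0.1],
    "tlbb_tv_series": [0.2, 0.8] + [0.1, 0.9],
}
ranking = rank_candidates(mention, candidates)
```

The first element of `ranking` is the candidate that would be linked to the mention; in a trained model the vectors would come from the network's semantic layers rather than being fixed by hand.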
The invention also provides an entity disambiguation device based on a UCL knowledge space, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the entity disambiguation method based on a UCL knowledge space.
Preferably, the computer program comprises a UCL knowledge space construction module, a vector representation generation module, a feature extraction module, and a semantic matching module. The UCL knowledge space construction module builds a basic knowledge base from Wikipedia and Baidu Baike knowledge, indexes news information with UCL, and stores it in the knowledge base as a content supplement, thereby completing construction of the UCL knowledge space. The vector representation generation module extracts the entity to be disambiguated from the text and acquires a candidate entity set for disambiguation from the UCL knowledge space. The feature extraction module extracts features of the entity to be disambiguated and enhances its embedded representation with the concept knowledge corresponding to its context information; it also enhances the embedding of each candidate entity with factual knowledge from the UCL knowledge space. The semantic matching module obtains the matching result with a deep structured semantic model based on a self-attention matching network.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention indexes news text by exploiting the fit between the UCL content format and the elements of news, and constructs the UCL knowledge space by fusing it with a basic knowledge base. Compared with a traditional encyclopedia knowledge base, the rich semantic information of UCL increases the semantic associations among entities in the knowledge base.
(2) The invention uses the knowledge in the UCL knowledge space to supplement the entities related to each candidate entity, thereby strengthening the candidate's context representation. After disambiguation is finished, positive feedback is returned to the UCL knowledge space and the entity's related information there is updated. As more disambiguation tasks are processed, the quality of entity disambiguation improves accordingly.
(3) The invention introduces a knowledge feature network for the entity to be disambiguated and adds conceptualized attributes to it, addressing the shortage of context information for entity linking in short text through a fine-grained conceptualized representation. The invention thus addresses the scarcity of entity-related information in short text and improves the accuracy of entity disambiguation.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a diagram of an example of a UCL knowledge space node according to an embodiment of the present invention.
Fig. 3 is a diagram illustrating an example of entity disambiguation tasks according to an embodiment of the present invention.
Fig. 4 is a block diagram of the overall architecture of entity disambiguation based on the UCL knowledge space according to an embodiment of the present invention.
Fig. 5 is a block diagram of a conceptual feature extraction module for an entity to be disambiguated according to an embodiment of the present invention.
Fig. 6 is a diagram illustrating a candidate entity context embedding vector generation structure according to an embodiment of the present invention.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
As shown in Fig. 1, the entity disambiguation method based on a UCL knowledge space disclosed in this embodiment of the invention includes the following implementation steps:
Step 1: Construct the UCL knowledge space. Entities, basic entity attributes, and associations among entities are acquired from an open offline database using information extraction techniques, and a basic knowledge base is built; online news is acquired and indexed with UCL as a supplement to the knowledge base, completing construction of the UCL knowledge space. This step is preparatory work for the invention; the specific substeps are as follows:
Substep 1-1: Construct the basic knowledge base. A basic database is established using the offline database of Chinese Wikipedia as source data; Wikipedia category pages and content pages are crawled, and entities, attributes, associations, and similar information are acquired from Wikipedia entry pages. Links on a category page lead to specific entity content pages and to next-level subcategory pages; content pages are detail pages that contain structured data and links to related entities. Entity and relationship data from Baidu Baike are then crawled. A Baidu Baike entry page lists the different senses of the same entity, and extracting this content provides a source of candidate entities for the later disambiguation work. Wikipedia knowledge and Baidu Baike knowledge are fused using techniques such as entity fusion and entity alignment to form the basic knowledge base.
Substep 1-2: Crawl news from mainstream Chinese news portal sites with a crawler tool, clean and semantically parse the crawled news, and use natural language processing tools to obtain the news elements (such as title, author, time, and abstract) as well as the entity and relationship information in the news body. The news content is then indexed with a UCL indexing tool, and each entity in the news is indexed as a UCL entity to obtain a UCL node.
Substep 1-3: Compute the frequency with which each entity appears in the news, as shown in Equation 1, where $\mathrm{count}(e_i)$ is the number of occurrences of entity $e_i$ and the denominator is the total number of occurrences of all entities:
$$\mathrm{freq}(e_i) = \frac{\mathrm{count}(e_i)}{\sum_{j} \mathrm{count}(e_j)} \quad \text{(Equation 1)}$$
After the frequencies are computed, low-frequency entities are filtered out to simplify subsequent calculation.
Substep 1-4: Distinguish entities by position, assigning a position weight $\mathrm{location}(e_i)$ according to the region in which the entity appears. As shown in Equation 2, $P$ is the number of paragraphs in the news item and $p$ is the paragraph containing the current entity $e_i$. When the news item has no more than two paragraphs, $\mathrm{location}(e_i)$ is a fixed value; when it has more than two, entities in the first and last paragraphs receive the same score, and entities in all other paragraphs receive one quarter of that score.
Substep 1-5: Extract a set of central sentences with the TextRank algorithm, denoted $\{s_1, s_2, \ldots, s_n\}$, where each $s_i$ is a central sentence made up of entities. The more distinct central sentences an entity appears in, the higher its relative weight. In Equation 3, $n$ is the number of central sentences and $I(e_i \in s_t)$ is an indicator function that states whether entity $e_i$ occurs in central sentence $s_t$.
Substep 1-6: Combine the three weight components above into the entity semantic weight in the UCL, as shown in Equation 4:
$$EW(e_i) = \mathrm{Avg}(\mathrm{location}(e_i)) \times \bigl(\eta \cdot \mathrm{freq}(e_i) + (1-\eta) \cdot \mathrm{center}(e_i)\bigr) \quad \text{(Equation 4)}$$
where $\eta$ is an adjustment parameter in the range 0 to 1, and $\mathrm{Avg}(\mathrm{location}(e_i))$ is the average position weight of the entity. Because the same entity may appear several times at different positions in an article, this average is a weighted average of the position weights, weighted by how often the entity appears at each position. After $EW(e_i)$ has been calculated for all entities, the UCL semantic weight of each entity is obtained by normalization.
Substep 1-7: Establish links between entities in the UCL and entities in the basic knowledge base, and add the UCL nodes to the knowledge base to complete construction of the UCL knowledge space. Fig. 2 illustrates the relationship between UCL nodes and entity nodes in the UCL knowledge space, where square nodes represent UCLs that index news content and circular nodes represent basic entities. The value on the line connecting a UCL node and an entity node is the semantic weight of the entity in the news indexed by that UCL, and a line between two entities represents the relationship between them. Take the disambiguation of the entity "Tian Long Ba Bu" in the Fig. 3 sentence "Tian Long Ba Bu should be regarded as Jin Yong's best-selling work" as an example: after the knowledge space is constructed, "Tian Long Ba Bu" has 18 senses in the knowledge space.
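The semantic weight calculation of substeps 1-3 through 1-6 can be sketched as follows. This is an illustrative sketch rather than the patent's implementation: the base position weight of 1.0, the normalization of the central-sentence score by the number of central sentences, and the final normalization to a sum of 1 are all assumptions, since the source text does not spell out those details. All function names are hypothetical.

```python
from collections import Counter


def entity_frequency(mentions):
    """Equation 1: each entity's share of all entity mentions."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return {entity: count / total for entity, count in counts.items()}


def location_weight(paragraph_index, total_paragraphs, base=1.0):
    """Position weight: full score in short items and in the first or last
    paragraph, one quarter of it elsewhere. `base` is an assumed value."""
    if total_paragraphs <= 2:
        return base
    if paragraph_index in (0, total_paragraphs - 1):
        return base
    return base / 4


def center_score(entity, central_sentences):
    """Fraction of central sentences containing the entity.
    Dividing by the sentence count is an assumption."""
    if not central_sentences:
        return 0.0
    hits = sum(1 for sentence in central_sentences if entity in sentence)
    return hits / len(central_sentences)


def semantic_weights(occurrences, total_paragraphs, central_sentences, eta=0.5):
    """Equation 4 followed by normalization.

    occurrences: list of (entity, zero-based paragraph index) pairs.
    central_sentences: list of sets of entities.
    """
    freq = entity_frequency([entity for entity, _ in occurrences])
    raw = {}
    for entity in freq:
        # Averaging over occurrences weights each position by how often
        # the entity appears there.
        locations = [location_weight(p, total_paragraphs)
                     for e, p in occurrences if e == entity]
        avg_location = sum(locations) / len(locations)
        center = center_score(entity, central_sentences)
        raw[entity] = avg_location * (eta * freq[entity] + (1 - eta) * center)
    total = sum(raw.values())
    return {entity: (w / total if total else 0.0) for entity, w in raw.items()}


# Toy news item with three paragraphs and two central sentences.
weights = semantic_weights(
    occurrences=[("A", 0), ("A", 1), ("B", 1), ("A", 2)],
    total_paragraphs=3,
    central_sentences=[{"A"}, {"A", "B"}],
    eta=0.5,
)
```

In this toy item, entity A appears more often, in more prominent paragraphs, and in more central sentences than B, so it receives the larger share of the normalized weight.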
Step 2: Generate vector representations. A candidate entity set related to the entity to be disambiguated is acquired from the UCL knowledge space, and embedded representations of the candidate entity and the entity to be disambiguated are generated with a word vector representation method as input to the semantic matching module in step 4. The specific substeps are as follows:
Substep 2-1: For the entity m to be disambiguated, use an entity mention dictionary to retrieve from the knowledge base the target objects that the entity may link to. The mention dictionary constrains the aliases of an entity through custom rules, and for an ambiguous word it retains all target entities that the word may refer to.
Substep 2-2: Given an object m to be disambiguated and a candidate entity e obtained in substep 2-1, generate a word vector representation of the entity to be disambiguated by averaging the hidden representations of the words or characters in the text where it occurs, and generate a word vector representation of the candidate entity in the same way.
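Step 2 can be illustrated with a small sketch. It assumes a plain in-memory mention dictionary and a static token-to-vector lookup in place of an encoder's hidden representations; the class name, function name, entity IDs, and toy vectors are all hypothetical.

```python
from collections import defaultdict


class MentionDictionary:
    """Maps surface names (including aliases) to the entity IDs they may refer to."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, entity_id, names):
        for name in names:
            self._index[name.strip().lower()].add(entity_id)

    def candidates(self, mention):
        # .get avoids creating empty entries for unknown mentions.
        return sorted(self._index.get(mention.strip().lower(), set()))


def average_embedding(tokens, embeddings):
    """Element-wise mean of the vectors of the tokens found in `embeddings`."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy data: one ambiguous name shared by two hypothetical entities.
dictionary = MentionDictionary()
dictionary.add("tlbb_novel", ["Tian Long Ba Bu", "Demi-Gods and Semi-Devils"])
dictionary.add("tlbb_tv_series", ["Tian Long Ba Bu"])
candidate_ids = dictionary.candidates("tian long ba bu")

toy_embeddings = {"jin": [1.0, 0.0], "yong": [0.0, 1.0], "work": [1.0, 1.0]}
mention_vector = average_embedding(["jin", "yong", "work", "unknown"], toy_embeddings)
```

Keeping every entity that shares a surface name in the dictionary is what lets the later matching stage, rather than the lookup, decide between senses.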
Step 3: Extract features. In addition to the content of the text itself, external knowledge is introduced to supplement the text. The concepts that the entity to be disambiguated may correspond to in the UCL knowledge space, together with the candidate entity's related context information in the UCL knowledge space, serve as external supplements to the text and enrich the vector representations of the entities.
The first stage extracts features of the entity to be disambiguated: the entity is labeled with concepts, and these concepts are used as its additional features. A Knowledge Feature Network (KFN) is constructed from the entities in the text and their corresponding concepts, a subgraph containing entity and concept nodes is generated from the KFN, and the appropriate concept for the entity to be disambiguated is obtained with the random walk with restart algorithm. The conceptualization is represented in vector form as a supplementary feature of the entity to be disambiguated.
The second stage extracts features of the candidate entities: existing knowledge in the UCL knowledge space is used to learn the relationships between entity context text and entities, and a CBOW model with a sigmoid function layer is used to generate the vector representations of the candidate entities.
The specific implementation steps are as follows:
Substep 3-1: Feature extraction for the entity to be disambiguated. As shown in Fig. 4, this stage is implemented with three modules: a Knowledge Feature Network (KFN) module, a subgraph construction module, and a conceptualization module. Given a text, namely the Fig. 3 sentence "Tian Long Ba Bu should be regarded as Jin Yong's best-selling work", a knowledge feature network is first constructed from the text and factual knowledge. It contains three types of content: the entity to be disambiguated in the text ("Tian Long Ba Bu"), the other words in the text ("Jin Yong", "best-selling", "work"), and the candidate concepts.
Substep 3-2: The knowledge feature network is constructed from the correspondences among the words in the sentence containing the entity to be disambiguated, the entity to be disambiguated itself, and the candidate concepts. The words and the entity to be disambiguated are obtained through lexical analysis and entity recognition, and the candidate concepts are obtained through knowledge space matching. The KFN describes three relationships: concept-entity, concept-concept, and word-concept. They are obtained as follows:
(1) the concept-entity relationship is represented by the generation probability of the concept c to the entity e, and the p (c | e) probability is obtained by the page click rate statistics based on the Wikipedia entity page, as shown in formula 5, NPV(e) Is the statistical number of entity e, e' is any entity belonging to concept c:
(2) The concept-concept relationship is represented by the transition probability between two concepts. For example, the probability $p(c_i \mid c_j)$ between concepts $c_i$ and $c_j$ is calculated from the co-occurrence frequency of the entities under the two concepts, as shown in equation 6:
where, as shown in formula 7, the co-occurrence frequency $N(e_j, e_i)$ is computed from anchor-link statistics of Baidu encyclopedia, and w is the window size used when counting the co-occurrence frequency of entity pairs in Baidu encyclopedia; w is set to 25 in the invention.
$N(e_j, e_i) = \mathrm{freq}_w(e_j, e_i)$ (Equation 7)
(3) The word-concept relationship is represented by the labeling probability between a word w and a related concept c. This probability is calculated from the word frequency and the word-concept co-occurrence frequency, as shown in equation 8, where n(w) is the number of occurrences of word w:
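The windowed count in Equation 7 can be illustrated with a minimal sketch. It assumes the anchor links of a page have already been extracted into an ordered list of entity names; the function name `cooccurrence_counts`, the symmetric treatment of pairs, and the toy entity names are illustrative assumptions, not details given in the patent.

```python
from collections import Counter
from typing import Dict, List, Tuple


def cooccurrence_counts(entity_seq: List[str], window: int = 25) -> Dict[Tuple[str, str], int]:
    """Count entity pairs that appear within `window` positions of each other.

    Hypothetical helper: `entity_seq` stands for the ordered anchor-link
    entities of one encyclopedia page; the patent sets the window to 25.
    """
    counts: Counter = Counter()
    for i, e_i in enumerate(entity_seq):
        # Look ahead at most `window` positions from the current entity.
        for e_j in entity_seq[i + 1 : i + 1 + window]:
            if e_j != e_i:
                counts[(e_i, e_j)] += 1
                counts[(e_j, e_i)] += 1  # assumption: co-occurrence is symmetric
    return dict(counts)


# Toy sequence with a small window so the effect is visible.
freq = cooccurrence_counts(["e1", "e2", "e3", "e1", "e4"], window=2)
print(freq[("e1", "e2")])
```

Pairs further apart than the window, such as "e2" and "e4" in the toy sequence, receive no count at all.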
Substep 3-3, the invention applies the Random Walk with Restart (RWR) algorithm to the knowledge feature network to obtain the most appropriate concept for the entity to be disambiguated. First, the weights of the nodes and edges are initialized by formulas 9 and 10 below, where l denotes an edge of the knowledge feature network, n denotes a node of the knowledge feature network, T denotes the number of entities in the knowledge feature network, t denotes an entity node, and c denotes a concept node.
Subsequently, the nodes and edges are iteratively updated by equations 11 and 12 below:
$N_k = (1-\alpha)\,E' \times N_{k-1} + \alpha N_0$ (Equation 11)
$E_k \leftarrow (1-\beta)\,N_k + \beta E_k$ (Equation 12)
where α and β are hyper-parameters tuned on the development set. Finally, the edge weights are normalized and the concept type with the highest weight is obtained, as shown in equation 13:
As shown in fig. 5, the calculation yields a conceptualized subgraph structure that contains the relationships among entities, words and concepts together with their transition probabilities. The concept with the highest final transition probability is used as the object of conceptualization embedding to generate the concept vector. In the illustration, the concept with the highest transition probability for the entity "Tian Long Ba Bu" and the word "best-selling" is "novel".
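The node update of Equation 11 can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: E' is taken to be the column-normalized edge-weight matrix, the edge update of Equation 12 is omitted, and the function names, the value of α and the three-node toy graph are hypothetical.

```python
from typing import List


def normalize_columns(matrix: List[List[float]]) -> List[List[float]]:
    """Scale each column to sum to 1 (assumed meaning of E' in Equation 11)."""
    n = len(matrix)
    sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [[matrix[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)] for r in range(n)]


def random_walk_with_restart(edges: List[List[float]], start: List[float],
                             alpha: float = 0.15, iterations: int = 50) -> List[float]:
    """Iterate N_k = (1 - alpha) * E' x N_{k-1} + alpha * N_0."""
    e_prime = normalize_columns(edges)
    n = len(start)
    current = list(start)
    for _ in range(iterations):
        current = [
            (1 - alpha) * sum(e_prime[r][c] * current[c] for c in range(n)) + alpha * start[r]
            for r in range(n)
        ]
    return current


# Toy graph: node 0 is the entity, nodes 1 and 2 are two candidate concepts.
edges = [
    [0.0, 0.7, 0.3],
    [0.7, 0.0, 0.0],
    [0.3, 0.0, 0.0],
]
scores = random_walk_with_restart(edges, start=[1.0, 0.0, 0.0])
best_concept = max((1, 2), key=lambda node: scores[node])
```

In the toy graph the walk restarts at the entity node, so the concept connected by the heavier edge accumulates the higher score.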
Substep 3-4, feature extraction for the candidate entity context. As shown in fig. 6, the invention proposes a new feature learning method that learns the candidate entity context and the entity relationships contained in the knowledge space at the same time, and then generates the context vector representation of the candidate entity through a CBOW model with a sigmoid activation function. The specific steps are as follows:
(1) A positive sample is first generated from the associated description of the entity in the UCL knowledge space; the description includes text or entities associated with the candidate entity. The entity description text is then segmented into words, the word frequencies of the positive sample are counted, and negative samples are generated by stratified random sampling.
(2) In the above steps, in order to enhance the relationship between the candidate entity and the related entity, the co-occurrence data of the entity and the S-P-O triplet data in the UCL knowledge space are used to generate training samples. The S-P-O triplets are from news text indexed by UCL associated with the candidate entity in UCL knowledge space, and the triplet information contained in the news event is stored in the knowledge space. The entity semantic weight calculated in the sub-step 1-6 can be used as supplementary information of the candidate entity context.
(3) An entity sequence related to the candidate entities is obtained as a training sample, and an embedded representation of each entity is generated with the CBOW model. Finally, the context embedded representation of the entity is updated with the Skip-Gram model, yielding the context vector of the candidate entity.
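The sample construction in step (1) can be sketched as follows. The patent does not spell out its sampling procedure, so this sketch substitutes plain frequency-weighted sampling for it; the function name `build_samples`, its parameters and the toy tokens are hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple


def build_samples(description_tokens: List[str], corpus_tokens: List[str],
                  num_negatives: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Return (positives, negatives) for one candidate entity.

    Positives are the segmented words of the entity description. Negatives
    are drawn from the remaining corpus vocabulary with probability
    proportional to word frequency (a stand-in for the sampling in the text).
    """
    positives = sorted(set(description_tokens))
    positive_set = set(positives)
    freq = Counter(corpus_tokens)
    # Only words that never occur in the description can be negatives.
    candidates = sorted(w for w in freq if w not in positive_set)
    if not candidates:
        return positives, []
    weights = [freq[w] for w in candidates]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    negatives = rng.choices(candidates, weights=weights, k=num_negatives)
    return positives, negatives


positives, negatives = build_samples(
    ["novel", "author", "wuxia"],
    ["novel", "film", "film", "song", "actor", "wuxia"],
    num_negatives=4,
)
```

The resulting positive and negative words would then be fed to the CBOW training described in step (3).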
Step 4, semantic matching. As shown in fig. 4, the word vector of the entity to be disambiguated, the word vector of the candidate entity, the concept vector of the entity to be disambiguated, and the context vector of the candidate entity, obtained by the training in substep 2-2, substep 3-3 and substep 3-4, are fed into a Deep Structured Semantic Model (DSSM) based on a self-attention matching network. The specific steps are as follows:
Substep 4-1, the input word vector sequences pass through an LSTM layer that captures the key information of the original sentence, and a semantic vector is output for each. A two-layer self-attention mechanism then aggregates the vector representations related to the entity to be disambiguated and those related to the candidate entity into a single high-dimensional vector each. A fully connected layer then maps the high-dimensional semantic vectors into a continuous semantic space to learn low-dimensional vector representations of the text, so that the semantic similarity between the vectors can be modeled. A multi-layer fully connected (FC) network serves as the model that generates the low-dimensional semantic vectors, and cosine similarity is used to compute the similarity loss, i.e. the association score between entities.
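The aggregation idea in substep 4-1 can be illustrated with a much-reduced sketch: a single attention-pooling step that weights a set of vectors by softmax scores and sums them. It is not the two-layer self-attention network of the patent; the `query` vector stands in for learned attention parameters, and all names and values are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors: List[List[float]], query: List[float]) -> List[float]:
    """Aggregate `vectors` into one vector using dot-product attention weights."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


# Toy input: a word vector and a concept vector for the entity to be disambiguated.
mention_side = [[0.2, 0.8], [0.6, 0.4]]
pooled = attention_pool(mention_side, query=[1.0, 0.0])
```

Vectors that score higher against the query contribute more to the pooled representation, which is the property the attention layers are meant to provide.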
Substep 4-2, the above steps are repeated for every entity in the candidate entity set in turn, and the highest-scoring entity according to the ranking of association scores is taken as the candidate entity that best matches the entity to be disambiguated, which completes the entity disambiguation. Finally, the entity to be disambiguated is linked to the corresponding entity in the knowledge space according to the disambiguation result.
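The ranking in substep 4-2 can be sketched as scoring each candidate by cosine similarity and sorting. The sketch assumes the low-dimensional semantic vectors have already been produced by the network; `rank_candidates` and the toy vectors are illustrative names and values only.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(mention_vec: List[float],
                    candidates: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (candidate, association score) pairs, best match first."""
    scored = [(name, cosine_similarity(mention_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_candidates(
    [1.0, 0.0],
    {"candidate_a": [0.9, 0.1], "candidate_b": [0.0, 1.0]},
)
best_entity = ranking[0][0]
```

The first element of the ranking is the candidate that would be linked to the entity to be disambiguated.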
Based on the same inventive concept, the invention further provides an entity disambiguation apparatus based on the UCL knowledge space, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when loaded into the processor, the computer program implements the entity disambiguation method based on the UCL knowledge space described above. The apparatus comprises a UCL knowledge space construction module, a vector representation generation module, a feature extraction module and a semantic matching module. The UCL knowledge space construction module builds a basic knowledge base from Wikipedia and Baidu encyclopedia knowledge, indexes news information with UCL and stores it in the knowledge base as a content supplement of the UCL knowledge space, thereby completing the construction of the UCL knowledge space; it specifically implements the content of step 1. The vector representation generation module extracts the entity to be disambiguated from the text and acquires the candidate entity set for entity disambiguation from the UCL knowledge space; it specifically implements the content of step 2. The feature extraction module extracts features of the entity to be disambiguated and enhances its embedded representation with the concept knowledge corresponding to the entity's context information, and then enhances the embedding of the candidate entities with the existing factual knowledge in the UCL knowledge space; it specifically implements the content of step 3. The semantic matching module obtains the matching result with a deep structured semantic model based on the self-attention matching network; it specifically implements the content of step 4.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.
Claims (10)
1. An entity disambiguation method based on UCL knowledge space is characterized by comprising the following steps:
(1) constructing a UCL knowledge space: acquiring entities, entity basic attributes and associations among the entities from an open offline database by using an information extraction related technology, and constructing a basic knowledge base; acquiring network news, indexing the network news by using UCL, calculating the association between an entity and the news as the supplement of a knowledge base, and completing the construction of UCL knowledge space;
(2) vector representation generation: acquiring a candidate entity set related to the entity to be disambiguated from a UCL knowledge space, and generating an embedded representation of the candidate entity and the entity to be disambiguated by using a word vector representation method as the input of the semantic matching module in the step (4);
(3) characteristic extraction: not only the content information of the text is considered, but also the text is supplemented by introducing external knowledge; the first stage extracts the conceptual features of the entity to be disambiguated and the context thereof, and the second stage extracts the features of the context of the candidate entity;
(4) semantic matching: using four vector representations of the conceptual embedding of the entity to be disambiguated generated in the step (3), the context embedding of the candidate entity and the word vector embedding of the candidate entity and the entity to be disambiguated generated in the step (2) as input, and obtaining the matching degree by adopting a self-attention matching network based on a deep structured semantic matching model (DSSM); and obtaining a final disambiguation result according to the ordering of the matching results, completing entity linkage between the entity in the text and the entity in the UCL knowledge space, and further updating the content of the UCL knowledge space.
2. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: when the basic knowledge base is constructed in the step (1), selecting a Chinese Wikipedia offline database as a basic knowledge base, and acquiring entities, attributes and associated information from entry pages of Wikipedia; aiming at the incomplete problem of the Wikipedia data, the Baidu encyclopedia data is used as the supplement of the entity, and the content of the knowledge base is further improved.
3. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: when the UCL knowledge space is constructed in the step (1), news from mainstream Chinese news portal sites is crawled using crawler technology, the obtained news information is cleaned and semantically analyzed, entities in the news are indexed into UCL entities and integrated into the basic entity library, and links between UCLs and between the entities and the UCLs are constructed, so that the construction of the initial UCL knowledge space is completed.
4. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: in the step (2), to acquire the candidate entities, link candidate filtering is completed first, and target objects possibly linked with the entity are acquired from the knowledge base by means of an entity mention dictionary; given an object m to be disambiguated and a candidate entity e, a word vector representation of the entity to be disambiguated and a word vector representation of the candidate entity are generated by averaging the hidden representations of the words or characters in the text.
5. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: the feature extraction process of the entity to be disambiguated in the first stage in the step (3) is realized based on three modules, and specifically comprises a knowledge feature network module, a subgraph construction module and a conceptualization module.
6. The method for entity disambiguation based on the UCL knowledge space of claim 5, wherein: the knowledge feature network in the step (3) is constructed from the correspondences among the words in the sentence containing the entity to be disambiguated, the entity to be disambiguated, and the candidate concepts; wherein the words and the entity to be disambiguated are obtained by lexical analysis and entity recognition, and the candidate concepts are obtained by knowledge space matching; the knowledge feature network describes three relations, namely a concept-entity relation, a concept-concept relation and a word-concept relation; a conceptualized subgraph is constructed by acquiring the three relations; the most appropriate concept is obtained by running the random walk with restart algorithm, and the concept corresponding to the entity to be disambiguated is converted into a conceptualized embedding.
7. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: the second stage of the feature extraction process for the candidate entity in the step (3) comprises the following processes: firstly, generating a positive sample according to the description of an entity in the UCL knowledge space, and segmenting the description text; counting word frequencies in the positive samples, and generating negative samples by stratified random sampling; generating training samples from an entity co-occurrence data set and S-P-O triple data in the knowledge space so as to learn the relations between candidate entities and enhance the entity representation, wherein the S-P-O triples come from UCL nodes stored in the knowledge space, and a news event indexed by UCL naturally carries triple information; then, generating vector representations of the entity and its context information through a CBOW model, updating the embedded representation of the entity through a Skip-Gram model to enhance the relations between the entities, and training to obtain the embedded representation of the candidate entity.
8. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: the semantic matching process of the entity in the step (4) adopts a deep structured semantic model based on a self-attention matching network; the four vector representations of the entity to be disambiguated and the candidate entity obtained in the step (2) and the step (3), namely the word vector of the entity to be disambiguated, the word vector of the candidate entity, the concept vector of the entity to be disambiguated and the context vector of the candidate entity, are fed into the network, which maps the vectors to a continuous semantic space to learn low-dimensional vector representations of the text and then models the semantic similarity between the vectors; a multi-layer fully connected network serves as the model for generating the low-dimensional semantic vectors, and cosine similarity is used to calculate the similarity loss; finally, the candidate entity that semantically best matches the entity to be disambiguated is found, completing the disambiguation task.
9. An apparatus for entity disambiguation based on a UCL knowledge space, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the computer program when loaded into a processor implements the method for entity disambiguation based on the UCL knowledge space of any of claims 1-8.
10. The UCL knowledge-space-based entity disambiguation apparatus of claim 1, wherein: the computer program comprises a UCL knowledge space construction module, a vector generation representation module, a feature extraction module and a semantic matching module; the UCL knowledge space construction module is used for constructing a basic knowledge base by acquiring Wikipedia and Baidu encyclopedia knowledge, indexing news information by UCL and storing the news information in the knowledge base as content supplement of the UCL knowledge space, so that the construction of the UCL knowledge space is completed; the generating vector representation module is used for extracting an entity to be disambiguated from the text and acquiring a candidate entity set for entity disambiguation from a UCL knowledge space; the characteristic extraction module is used for extracting characteristics of the entity to be disambiguated and enhancing the embedded representation of the entity to be disambiguated by utilizing concept knowledge corresponding to the context information of the entity; embedding and enhancing the candidate entity by utilizing the fact knowledge in the UCL knowledge space; the semantic matching module is used for obtaining a matching result by utilizing a deep structured semantic model based on a self-attention matching network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110475291.XA CN113111136B (en) | 2021-04-29 | 2021-04-29 | Entity disambiguation method and device based on UCL knowledge space |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110475291.XA CN113111136B (en) | 2021-04-29 | 2021-04-29 | Entity disambiguation method and device based on UCL knowledge space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113111136A (en) | 2021-07-13 |
CN113111136B (en) | 2024-04-16 |
Family ID=76720462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110475291.XA Active CN113111136B (en) | 2021-04-29 | 2021-04-29 | Entity disambiguation method and device based on UCL knowledge space |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113111136B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113656556A (en) * | 2021-08-20 | 2021-11-16 | 上海大参林医疗健康科技有限公司 | Text feature extraction method and knowledge graph construction method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090144609A1 (en) * | 2007-10-17 | 2009-06-04 | Jisheng Liang | NLP-based entity recognition and disambiguation |
CN111428047A (en) * | 2020-03-19 | 2020-07-17 | 东南大学 | Knowledge graph construction method and device based on UCL semantic indexing |
CN111581973A (en) * | 2020-04-24 | 2020-08-25 | 中国科学院空天信息创新研究院 | Entity disambiguation method and system |
Non-Patent Citations (2)
Title |
---|
IKUYA YAMADA et al.: "Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation", ARXIV, 10 June 2016 (2016-06-10), pages 1 - 10 * |
HUAI Baoxing (怀宝兴) et al.: "A Named Entity Linking Method Based on a Probabilistic Topic Model" (一种基于概率主题模型的命名实体链接方法), Journal of Software (软件学报), vol. 25, no. 9, 31 December 2014 (2014-12-31), pages 2076 - 2087 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113656556A (en) * | 2021-08-20 | 2021-11-16 | 上海大参林医疗健康科技有限公司 | Text feature extraction method and knowledge graph construction method |
CN113656556B (en) * | 2021-08-20 | 2023-08-15 | 广州天宸健康科技有限公司 | Text feature extraction method and knowledge graph construction method |
Also Published As
Publication number | Publication date |
---|---|
CN113111136B (en) | 2024-04-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||