CN113111136A

CN113111136A - Entity disambiguation method and device based on UCL knowledge space

Info

Publication number: CN113111136A
Application number: CN202110475291.XA
Authority: CN
Inventors: 杨鹏; 常欣辰; 范路平; 于晓潭
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2021-07-13
Anticipated expiration: 2041-04-29
Also published as: CN113111136B

Abstract

The invention provides an entity disambiguation method and device based on a UCL knowledge space. First, a basic knowledge base is constructed to complete the construction of the UCL knowledge space. Next, a candidate entity set related to the entity to be disambiguated is acquired from the UCL knowledge space, and embedded representations of the candidate entities and of the entity to be disambiguated are generated by a word vector representation method. Conceptual features of the entity to be disambiguated and its context are then extracted, as are features of the context of each candidate entity. Finally, the four vector representations generated above are used as input to a self-attention matching network based on a deep structured semantic model (DSSM) to obtain the matching degree; the final disambiguation result is obtained by ranking the matching results, completing entity linking between the entity in the text and the entity in the UCL knowledge space. The invention addresses the scarcity of entity-related information in short text and improves the accuracy of entity disambiguation.

Description

Entity disambiguation method and device based on UCL knowledge space

Technical Field

The invention relates to an entity disambiguation method and device based on a UCL knowledge space, and belongs to the technical field of knowledge graph construction in the Internet.

Background

With the rapid development of the Internet, the volume of online news has grown rapidly and the knowledge contained in news has become increasingly complex, so a suitable carrier is urgently needed to store and manage news information effectively. A knowledge graph forms a graph database by associating entities through "entity-relationship-entity" triples and "entity-attribute (value)" key-value pairs. The Uniform Content Label (UCL) is defined by the national standard "Uniform Content Label Format Specification" (GB/T35304-). Highly disordered web news content can be organized effectively by filling in the description attribute set of the UCL. On this basis the UCL Knowledge Space (UCLKS) was developed: it is built on basic knowledge bases such as Wikipedia and Baidu encyclopedia and supplements entity knowledge with online news content. The process of associating entities in news text with existing entities in the UCL knowledge space is entity disambiguation, also called entity linking. Entity disambiguation is a necessary prerequisite for subsequent knowledge space updates and for applications built on the knowledge space.

Over the past few decades, researchers have conducted a great deal of research into entity disambiguation, mainly involving rule-based methods, traditional statistical methods, and methods of deep learning. In the case of rich text context, the most advanced algorithms have achieved very good disambiguation. However, the performance of these methods on short text as well as spoken text drops dramatically. The missing context of these short texts is essential for disambiguation. In recent years, many scholars have endeavored to improve the effect of entity disambiguation by exploring more abundant contextual information. These methods are mainly focused on better utilization of existing text, but the problem of information loss is still not solved. It is difficult to improve the disambiguation result substantially if the source text cannot be augmented by adding external information.

Disclosure of Invention

To solve the above problems, the invention provides an entity disambiguation method and device based on a UCL knowledge space, which perform knowledge expansion on the entity to be disambiguated and on the candidate entities separately to improve model performance. The entity to be disambiguated is given a concept vector representation, formalized as a concept subgraph; each candidate entity is given an entity vector representation, formalized as factual knowledge in the UCL knowledge space. Finally, the entity disambiguation result is obtained with a deep structured semantic model based on self-attention matching. The method effectively mitigates the low disambiguation precision caused by insufficient entity context information in short text and improves entity disambiguation performance.

To achieve the above aim, the invention provides an entity disambiguation method based on a UCL knowledge space. First, a basic knowledge base is constructed from Wikipedia and Baidu encyclopedia knowledge, and news information is indexed with UCL and stored in the knowledge base as a content supplement, completing the construction of the UCL knowledge space. The entity to be disambiguated is extracted from the text with an open-domain triple extraction method, and a candidate entity set for entity disambiguation is acquired from the UCL knowledge space. Features of the entity to be disambiguated are then extracted, and its embedded representation is enhanced with the concept knowledge corresponding to its context information; the embeddings of the candidate entities are enhanced with the factual knowledge in the UCL knowledge space. Finally, a matching result is obtained with a deep structured semantic model based on the self-attention matching network.

Specifically, the invention provides the following technical scheme:

an entity disambiguation method based on UCL knowledge space comprises the following steps:

(1) constructing a UCL knowledge space: acquiring entities, entity basic attributes and associations among the entities from an open offline database by using an information extraction related technology, and constructing a basic knowledge base; acquiring network news, indexing the network news by using UCL, calculating the association between an entity and the news as the supplement of a knowledge base, and completing the construction of UCL knowledge space;

(2) vector representation generation: acquiring a candidate entity set related to the entity to be disambiguated from a UCL knowledge space, and generating an embedded representation of the candidate entity and the entity to be disambiguated by using a word vector representation method as the input of the semantic matching module in the step (4);

(3) characteristic extraction: not only the content information of the text is considered, but also the text is supplemented by introducing external knowledge; the first stage extracts the conceptual features of the entity to be disambiguated and the context thereof, and the second stage extracts the features of the context of the candidate entity;

(4) semantic matching: four vector representations are used as input, namely the conceptual embedding of the entity to be disambiguated and the context embedding of the candidate entity generated in step (3), and the word vector embeddings of the candidate entity and of the entity to be disambiguated generated in step (2); the matching degree is obtained with a self-attention matching network based on a deep structured semantic model (DSSM). The final disambiguation result is obtained by ranking the matching results, which completes entity linking between the entity in the text and the entity in the UCL knowledge space and in turn updates the content of the UCL knowledge space.

Preferably, when the basic knowledge base is constructed in the step (1), selecting a Chinese Wikipedia offline database as the basic knowledge base, and acquiring entities, attributes and associated information from entry pages of Wikipedia; aiming at the incomplete problem of the Wikipedia data, the Baidu encyclopedia data is used as the supplement of the entity, and the content of the knowledge base is further improved.

Preferably, when the construction of the UCL knowledge space is completed in step (1), news from mainstream Chinese news portal sites is crawled with crawler technology, the obtained news information is cleaned and semantically parsed, entities in the news are indexed as UCL entities and merged into the basic entity library, and links are constructed between UCLs and between entities and UCLs, thereby completing the construction of an initial UCL knowledge space.

Preferably, in step (2), candidate entities are acquired by first completing link candidate filtering: target objects that may be linked to the entity are retrieved from the knowledge base by means of an entity mention dictionary. Given an object m to be disambiguated and a candidate entity e, the word vector representation of the entity to be disambiguated and the word vector representation of the candidate entity are generated by averaging the hidden representations of the words or characters in the text.

Preferably, the feature extraction process of the entity to be disambiguated in the first stage in the step (3) is implemented based on three modules, specifically including a knowledge feature network module, a subgraph construction module and a conceptualization module.

Preferably, the knowledge feature network in step (3) is constructed from the correspondences among the words in the sentence containing the entity to be disambiguated, the entity to be disambiguated itself, and the candidate concepts. The words and the entity to be disambiguated are obtained by lexical analysis and entity recognition, and the candidate concepts are obtained by knowledge space matching. The knowledge feature network describes three relations: the concept-entity relation, the concept-concept relation and the word-concept relation. A conceptualized subgraph is constructed from these three relations, the most suitable concept is obtained with the random walk with restart algorithm, and the concept corresponding to the entity to be disambiguated is converted into a conceptualized embedding.

Preferably, the feature extraction process for the candidate entity in the second stage of step (3) is as follows. First, positive samples are generated from the description of the entity in the UCL knowledge space, and the description text is segmented into words; the word frequencies in the positive samples are counted, and negative samples are generated by stratified random sampling. Training samples are generated from an entity co-occurrence data set and the S-P-O triple data in the knowledge space, so as to learn the relation between a candidate entity and an enhanced entity representation; the S-P-O triples come from the UCL nodes stored in the knowledge space, since a news event indexed by UCL naturally carries triple information. Vector representations of the entity and its context information are then generated by a CBOW model, and the embedded representation of the entity is updated by a Skip-Gram model, which strengthens the relations between entities and yields the trained embedded representation of the candidate entity.

Preferably, the semantic matching process in step (4) adopts a deep structured semantic model based on a self-attention matching network. The four vectors of the entity to be disambiguated and the candidate entity obtained in step (2) and step (3), namely the word vector of the entity to be disambiguated, the word vector of the candidate entity, the concept vector of the entity to be disambiguated, and the context vector of the candidate entity, are fed into the network, which maps the vectors to a continuous semantic space to learn a low-dimensional vector representation of the text and then models the semantic similarity between the vectors. A multi-layer fully connected network serves as the model that generates the low-dimensional semantic vectors, and the similarity loss is calculated with cosine similarity. Finally, the candidate entity that best matches the entity to be disambiguated semantically is found, completing the disambiguation task.

The invention also provides an entity disambiguation device based on the UCL knowledge space, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the entity disambiguation method based on the UCL knowledge space when being loaded to the processor.

Preferably, the computer program comprises a UCL knowledge space construction module, a vector representation generation module, a feature extraction module and a semantic matching module. The UCL knowledge space construction module constructs a basic knowledge base from Wikipedia and Baidu encyclopedia knowledge, indexes news information with UCL and stores it in the knowledge base as a content supplement, thereby completing the construction of the UCL knowledge space. The vector representation generation module extracts the entity to be disambiguated from the text and acquires a candidate entity set for entity disambiguation from the UCL knowledge space. The feature extraction module extracts features of the entity to be disambiguated and enhances its embedded representation with the concept knowledge corresponding to its context information, and enhances the embeddings of the candidate entities with the factual knowledge in the UCL knowledge space. The semantic matching module obtains a matching result with a deep structured semantic model based on a self-attention matching network.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) the method indexes the news text by using the characteristic that the UCL content format is matched with the news element, and constructs the UCL knowledge space by fusing the basic knowledge base. Compared with the traditional encyclopedia knowledge base, the semantic association of the entities in the knowledge base is increased through the abundant semantic information of the UCL.

(2) The invention uses the knowledge in the UCL knowledge space to supplement the entities related to each candidate entity, thereby enriching the context representation of the candidate entities. After entity disambiguation is finished, positive feedback is returned to the UCL knowledge space, and the information related to the entity in the UCL knowledge space is updated. As the volume of entity disambiguation tasks grows, the disambiguation performance improves as well.

(3) The invention introduces the knowledge characteristic network of the entity to be disambiguated, adds conceptualization attributes to the entity to be disambiguated, and solves the problem of insufficient entity link context information in the short text through conceptualization representation of fine granularity. The invention can solve the problem of less entity related information in the short text and improve the accuracy of entity disambiguation.

Drawings

FIG. 1 is a flow chart of a method according to an embodiment of the present invention.

Fig. 2 is a diagram of an example of a UCL knowledge space node according to an embodiment of the present invention.

Fig. 3 is a diagram illustrating an example of entity disambiguation tasks according to an embodiment of the present invention.

Fig. 4 is a block diagram of the overall architecture of entity disambiguation based on the UCL knowledge space according to an embodiment of the present invention.

Fig. 5 is a block diagram of a conceptual feature extraction module for an entity to be disambiguated according to an embodiment of the present invention.

Fig. 6 is a diagram illustrating a candidate entity context embedding vector generation structure according to an embodiment of the present invention.

Detailed Description

The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.

As shown in fig. 1, an entity disambiguation method based on a UCL knowledge space disclosed in the embodiment of the present invention includes the following specific implementation steps:

step 1, constructing a UCL knowledge space. Acquiring entities, entity basic attributes and associations among the entities from an open offline database by using an information extraction related technology, and constructing a basic knowledge base; and acquiring network news, indexing the network news by using the UCL, and completing the construction of a UCL knowledge space as the supplement of a knowledge base. The steps are the premise work of the invention, and the specific steps are as follows:

Substep 1-1, constructing a basic knowledge base. The offline database of Chinese Wikipedia is used as source data to establish the basic database: the classification pages and content pages of Wikipedia are crawled, and information such as entities, attributes and associations is acquired from the Wikipedia entry pages. Links on a classification page lead to specific entity content pages and to next-level sub-classification pages; a content page is a detail page that contains structured data and links to related entities. Entity and relationship data are then crawled from Baidu encyclopedia, whose entry pages list the different senses of the same entity; extracting this content provides a source of candidate entities for the subsequent entity disambiguation work. Finally, the Wikipedia knowledge and the Baidu encyclopedia knowledge are fused using techniques such as entity fusion and entity alignment to form the basic knowledge base.

And a substep 1-2, crawling news of a mainstream Chinese news portal site by using a crawler tool, cleaning and semantically analyzing crawled news information, and obtaining element information of news such as a title, an author, time and an abstract and entity and relationship information in a news body through a natural language processing tool. And then indexing the news content by using a UCL indexing tool, and indexing an entity in the news as a UCL entity to obtain a UCL node.

Substep 1-3, calculating the frequency of occurrence of each entity in the news, as shown in Equation 1, where count(e_i) is the number of occurrences of entity e_i and the denominator is the total number of occurrences of all entities:

$$freq(e_i) = \frac{count(e_i)}{\sum_{j} count(e_j)} \qquad \text{(Equation 1)}$$

After the frequencies are calculated, entities with low frequency are filtered out to simplify subsequent calculation.
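The frequency computation of this substep can be sketched as follows. This is a minimal illustration, assuming entity mentions arrive as a flat list of strings; the function name `entity_frequencies` and the `min_freq` cutoff are hypothetical, since the source does not state the filtering threshold.

```python
from collections import Counter


def entity_frequencies(mentions, min_freq=0.0):
    """Relative frequency of each entity: count(e_i) / total count of all entities.

    `min_freq` is an assumed cutoff; the source only says that
    low-frequency entities are filtered out.
    """
    counts = Counter(mentions)
    total = sum(counts.values())
    if total == 0:
        return {}
    freqs = {entity: count / total for entity, count in counts.items()}
    # Drop entities whose relative frequency falls below the cutoff.
    return {entity: f for entity, f in freqs.items() if f >= min_freq}
```

Calling `entity_frequencies(["a", "a", "b", "c"], min_freq=0.3)` keeps only the entity that accounts for at least 30% of all mentions.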

Substep 1-4, distinguishing the position of each entity and setting a position weight location(e_i) according to the region of the news in which the entity appears. In Equation 2, P is the total number of news paragraphs and p is the paragraph in which the current entity e_i appears. When the total number of news paragraphs does not exceed two, location(e_i) is a fixed value; when it exceeds two, entities in the first and last paragraphs receive the same score, and the score for every other paragraph is one quarter of the first- and last-paragraph score.
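A possible reading of the position weight rule is sketched below. The numeric scores are not recoverable from the text, so `base` is a placeholder assumption, as is the choice to reuse the same value for short articles; only the one-quarter ratio for middle paragraphs comes from the description.

```python
def location_weight(p, total_paragraphs, base=1.0):
    """Position weight of an entity found in paragraph `p` (1-indexed).

    `base` is a hypothetical score for the first and last paragraphs;
    the source does not give the actual values.
    """
    if not 1 <= p <= total_paragraphs:
        raise ValueError("paragraph index out of range")
    if total_paragraphs <= 2:
        # Short article: a fixed value (assumed here to equal `base`).
        return base
    if p == 1 or p == total_paragraphs:
        return base
    # Middle paragraphs score one quarter of the first/last paragraph score.
    return base / 4
```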

Substep 1-5, extracting a set of central sentences with the TextRank algorithm, denoted {s_1, s_2, ..., s_n}, where each s_i is a central sentence composed of entities. The more often an entity appears in different central sentences, the higher its relative weight. In Equation 3, n is the number of central sentences and I(e_i ∈ s_t) is an indicator function that indicates whether entity e_i appears in the central sentence s_t.
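The central-sentence weight can be illustrated with the sketch below. The exact form of Equation 3 is not legible in the source, so the code assumes the weight is the fraction of central sentences that contain the entity; TextRank extraction itself is out of scope and the central sentences are taken as given sets of entities. The name `center_weight` is hypothetical.

```python
def center_weight(entity, central_sentences):
    """Fraction of central sentences containing `entity`.

    Assumed form: (1/n) * sum over t of I(entity in s_t), where each
    central sentence s_t is represented as a set of entity names.
    """
    n = len(central_sentences)
    if n == 0:
        return 0.0
    hits = sum(1 for sentence in central_sentences if entity in sentence)
    return hits / n
```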

Substep 1-6, combining the three weights calculated above into the entity semantic weight formula in UCL, as shown in Equation 4:

$$EW(e_i) = Avg(location(e_i)) \times \big(\eta \cdot freq(e_i) + (1-\eta) \cdot center(e_i)\big) \qquad \text{(Equation 4)}$$

where η is an adjustment parameter ranging from 0 to 1, and Avg(location(e_i)) is the average position weight of the entity. Because the same entity may appear several times at different positions in an article, the position weights are averaged with the frequency of occurrence at each position as the weight. After EW(e_i) has been calculated for all entities, the UCL semantic weight of each entity is obtained by normalization.
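Putting the three weights together, Equation 4 and the final normalization might look like the sketch below. The input layout (one position weight per occurrence, plus precomputed `freq` and `center` values) and the sum-to-one normalization are assumptions; the source only says that the weights are normalized.

```python
def ucl_semantic_weights(entities, eta=0.5):
    """Compute EW(e_i) = Avg(location) * (eta*freq + (1-eta)*center), then normalize.

    `entities` maps an entity name to a dict with:
      "locations": one position weight per occurrence (so a plain mean is
                   already weighted by how often each position occurs),
      "freq":      the entity frequency,
      "center":    the central-sentence weight.
    """
    raw = {}
    for name, feats in entities.items():
        locations = feats["locations"]
        avg_location = sum(locations) / len(locations)
        raw[name] = avg_location * (eta * feats["freq"] + (1 - eta) * feats["center"])
    total = sum(raw.values())
    # Assumed normalization: divide by the sum so that the weights add up to 1.
    return {name: (w / total if total else 0.0) for name, w in raw.items()}
```

With `eta` close to 1 the weight is dominated by raw frequency; with `eta` close to 0 it is dominated by presence in central sentences.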

Substep 1-7, establishing links between the entities in the UCL and the entities in the basic knowledge base, and adding the UCL nodes to the knowledge base to complete the construction of the UCL knowledge space. Fig. 2 illustrates the relationship between UCL nodes and entity nodes in the UCL knowledge space, where square nodes represent UCLs that index news content and circular nodes represent basic entities. The value on the line connecting a UCL node to an entity node is the semantic weight of the entity in the news indexed by that UCL, and a line between two entities represents the relationship between them. Take as an example the disambiguation of the entity "Tian Long Ba Bu" (天龙八部) in the sentence of Fig. 3, "Tian Long Ba Bu should be regarded as Jin Yong's best-selling work": after the knowledge space is constructed, "Tian Long Ba Bu" has 18 senses in the knowledge space.

Step 2, vector representation generation. A candidate entity set related to the entity to be disambiguated is acquired from the UCL knowledge space, and embedded representations of the candidate entities and of the entity to be disambiguated are generated by a word vector representation method as input to the semantic matching module of step 4. The specific steps are as follows:

Substep 2-1, for the entity m to be disambiguated, target objects that may be linked to the entity are acquired from the knowledge base by means of an entity mention dictionary. The mention dictionary constrains the aliases of each entity through custom rules and, for an ambiguous word, retains all the target entities that correspond to it.
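A mention dictionary of this kind can be sketched as a mapping from surface forms to sets of entity identifiers. The tuple layout, the function names and the example identifiers below are all hypothetical and are not taken from the UCL knowledge space.

```python
from collections import defaultdict


def build_mention_dictionary(entities):
    """Map every name and alias to the set of entity ids it may refer to.

    `entities` is an iterable of (entity_id, name, aliases) tuples;
    this layout is an assumption made for the sketch.
    """
    mention_dict = defaultdict(set)
    for entity_id, name, aliases in entities:
        for surface_form in (name, *aliases):
            mention_dict[surface_form].add(entity_id)
    return mention_dict


def candidate_entities(mention, mention_dict):
    """Return all candidate entity ids for a mention, in a stable order."""
    return sorted(mention_dict.get(mention, ()))
```

An ambiguous surface form such as "天龙八部" then returns every sense registered under it, which becomes the candidate set for the later matching step.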

Substep 2-2, given an object m to be disambiguated and a candidate entity e obtained in substep 2-1, a word vector representation of the entity to be disambiguated is generated by averaging the hidden representations of the words or characters in the text in which it is located, together with a word vector representation of the candidate entity.
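Averaging hidden representations amounts to mean pooling over token vectors. The sketch below assumes the hidden states are already available as equal-length lists of floats; how they are produced (for example by a pretrained encoder) is not specified here, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(hidden_states):
    """Average a sequence of equal-length vectors element-wise."""
    if not hidden_states:
        raise ValueError("at least one hidden state is required")
    dim = len(hidden_states[0])
    count = len(hidden_states)
    return [sum(state[i] for state in hidden_states) / count for i in range(dim)]
```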

Step 3, feature extraction. This step considers not only the content information of the text but also supplements the text with external knowledge. The concepts that may correspond to the entity to be disambiguated in the UCL knowledge space, and the context information related to the candidate entity in the UCL knowledge space, are used as external supplements to the text, enriching the vector representations of the entities.

In the first stage, features are extracted for the entity to be disambiguated: the entity is labeled with concepts, and these concepts serve as its additional features. A Knowledge Feature Network (KFN) is constructed from the text entity and its corresponding concepts, a subgraph with entity and concept nodes is generated from the KFN, and a suitable concept for the entity to be disambiguated is obtained with the random walk with restart algorithm. The concept is then represented in vector form as a supplementary feature of the entity to be disambiguated;

and in the second stage, feature extraction is carried out on the candidate entities, the existing knowledge of a UCL knowledge space is utilized to learn the relation between the entity context text and the entities at the same time, and a CBOW model with a sigmoid function layer is used to generate vector representation of the candidate entities.

The specific implementation steps are as follows:

Substep 3-1, feature extraction of the entity to be disambiguated. As shown in fig. 4, this stage is implemented with three modules: a Knowledge Feature Network (KFN) module, a subgraph construction module and a conceptualization module. Given a text object, namely the sentence of fig. 3, "Tian Long Ba Bu (天龙八部) should be regarded as Jin Yong's best-selling work", a knowledge feature network is first constructed from the text and factual knowledge. It contains three types of content: the entity to be disambiguated in the text, namely "Tian Long Ba Bu"; the other words in the text, namely "Jin Yong", "best-selling" and "work"; and the candidate concepts.

And a substep 3-2, constructing the knowledge characteristic network by depending on the corresponding relation among the words in the sentence where the entity to be disambiguated is located, the entity to be disambiguated and the candidate concepts. The words and the entities to be disambiguated are obtained through vocabulary analysis and entity recognition, and the candidate concepts are obtained through knowledge space matching. The KFN describes three concept relationships, namely a concept-entity relationship, a concept-concept relationship and a word-concept relationship. The specific acquisition steps are as follows:

(1) The concept-entity relationship is represented by the generation probability p(c|e) of concept c given entity e, which is obtained from page-view statistics of Wikipedia entity pages, as shown in Equation 5, where N_PV(e) is the page-view count of entity e and e' is any entity belonging to concept c:

(2) The concept-concept relationship is represented by the transition probability between two concepts. For concepts c_i and c_j, the probability p(c_i|c_j) is calculated from the co-occurrence frequency of the entities under the two concepts, as shown in Equation 6:

where, as shown in Equation 7, the co-occurrence count N(e_j, e_i) is computed from anchor-link statistics of Baidu encyclopedia, and w is the window size used when counting the co-occurrence frequency of entity pairs in Baidu encyclopedia, set to 25 in the invention.

$$N(e_j, e_i) = freq_w(e_j, e_i) \qquad \text{(Equation 7)}$$

(3) The word-concept relationship is represented by the labeling probability between the word w and the related concept c. The probability is calculated based on word frequency and word-concept co-occurrence frequency, as shown in equation 8, where n (w) is a frequency statistic of the occurrence of words w:
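A windowed co-occurrence count such as freq_w in Equation 7 could be computed along the following lines. This is only a sketch: the source does not define how the window is applied to anchor links, so the code assumes a page is a flat sequence of linked entity names and counts unordered pairs whose positions differ by less than `window`. The helper name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter


def cooccurrence_counts(entity_sequence, window=25):
    """Count unordered entity pairs that occur within `window` positions.

    Assumption: two entities co-occur when their positions in the
    sequence differ by less than `window`. The default 25 follows the
    window size stated in the description.
    """
    counts = Counter()
    for i, left in enumerate(entity_sequence):
        for right in entity_sequence[i + 1 : i + window]:
            if left != right:
                counts[frozenset((left, right))] += 1
    return counts
```

The resulting counts can then be aggregated over the entities under two concepts to estimate a concept-concept transition probability.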

Substep 3-3, the invention applies the Random Walk with Restart (RWR) algorithm to the knowledge feature network to obtain a suitable concept for the entity to be disambiguated. The weights of nodes and edges are first initialized by Equations 9 and 10, where l represents an edge of the knowledge feature network, n represents a node of the knowledge feature network, T represents the number of entities in the knowledge feature network, t represents an entity node, and c represents a concept node.

Subsequently, the nodes and edges are iteratively updated by the following equations 11, 12:

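$$N^{k} = (1-\alpha)\, E' \times N^{k-1} + \alpha N^{0} \qquad \text{(Equation 11)}$$

$$E^{k} \leftarrow (1-\beta)\, N^{k} + \beta E^{k} \qquad \text{(Equation 12)}$$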

Where α, β are the hyper-parameters adjusted on the development set. Finally, by normalizing the edge weights, the concept type with the highest weight is obtained, as shown in equation 13:

As shown in fig. 5, the calculation yields a conceptualized subgraph structure that contains the relationships among entities, words and concepts together with their transition probabilities; the concept with the highest final transition probability is used as the object of conceptualization embedding to generate the concept vector of the entity to be disambiguated. In the illustration, the concept with the highest transition probability for the entity "Tian Long Ba Bu" (天龙八部) and the word "best-selling" is "novel".
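The node update of Equation 11 is the standard random walk with restart iteration and can be sketched as below. The sketch covers only that node update: the initialization of Equations 9 and 10 is not legible in the source, so the caller supplies a column-normalized transition matrix and a restart vector, and the edge update of Equation 12 is omitted. The default `alpha` is an arbitrary placeholder.

```python
def random_walk_with_restart(transition, restart, alpha=0.15, iterations=100, tol=1e-10):
    """Iterate N^k = (1 - alpha) * E' x N^(k-1) + alpha * N^0.

    transition[i][j] is the probability of moving from node j to node i
    (each column is assumed to sum to 1); `restart` plays the role of N^0.
    """
    n = len(restart)
    scores = list(restart)
    for _ in range(iterations):
        updated = [
            (1 - alpha) * sum(transition[i][j] * scores[j] for j in range(n))
            + alpha * restart[i]
            for i in range(n)
        ]
        # Stop early once the scores no longer change noticeably.
        if max(abs(u - s) for u, s in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores
```

For a toy graph with one entity node linked to two concept nodes with probabilities 0.7 and 0.3, the walk restarted at the entity ranks the first concept higher, which mirrors selecting the concept with the highest final weight.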

And a substep 3-4 of feature extraction for the candidate entity context. As shown in fig. 6, the present invention proposes a new feature learning method, which learns the candidate entity context and the entity relationship contained in the knowledge space at the same time, and then generates the context vector representation corresponding to the candidate entity through the CBOW model with the activation function sigmoid. The method comprises the following specific steps:

(1) A positive sample is first generated from the description associated with the entity in the UCL knowledge space; the description includes text or entities associated with the candidate entity. The entity description text is then segmented into words, the word frequencies of the positive samples are counted, and negative samples are generated by stratified random sampling.

(2) In the above steps, in order to enhance the relationship between the candidate entity and the related entity, the co-occurrence data of the entity and the S-P-O triplet data in the UCL knowledge space are used to generate training samples. The S-P-O triplets are from news text indexed by UCL associated with the candidate entity in UCL knowledge space, and the triplet information contained in the news event is stored in the knowledge space. The entity semantic weight calculated in the sub-step 1-6 can be used as supplementary information of the candidate entity context.

(3) An entity sequence related to the candidate entities is obtained as a training sample, and an embedded representation of each entity is generated with a CBOW model. Finally, the context embedded representation of the entity is updated with the Skip-Gram model, yielding the context vector of the candidate entity.
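The way an entity sequence turns into CBOW training examples can be illustrated as follows. The window size and the helper name `cbow_pairs` are assumptions, and the sketch only builds (context, target) pairs; it does not train embeddings or perform the Skip-Gram update.

```python
def cbow_pairs(sequence, window=2):
    """Build (context, target) pairs for CBOW-style training.

    For each position, the context is the list of up to `window` items
    on each side of the target; positions without context are skipped.
    """
    pairs = []
    for i, target in enumerate(sequence):
        context = sequence[max(0, i - window) : i] + sequence[i + 1 : i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs
```

A Skip-Gram pass would use the same windows in the opposite direction, predicting each context item from the target.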

Step 4, semantic matching. As shown in FIG. 4, the four vectors obtained in substeps 2-2, 3-3 and 3-4, namely the word vector of the entity to be disambiguated, the word vector of the candidate entity, the concept vector of the entity to be disambiguated, and the context vector of the candidate entity, are fed into a Deep Structured Semantic Model (DSSM) based on a self-attention matching network. The specific steps are as follows:

Substep 4-1, the input word vector sequences pass through an LSTM layer, which captures the key information of the original sentence and outputs a semantic vector for each. A two-layer self-attention mechanism then aggregates the vector representations related to the entity to be disambiguated, and those related to the candidate entity, each into a single high-dimensional vector. A fully connected layer next maps these high-dimensional semantic vectors to a continuous semantic space to learn a low-dimensional vector representation of the text, so that the semantic similarity between the vectors can be modeled. A multi-layer fully connected (FC) network serves as the model that generates the low-dimensional semantic vectors, and cosine similarity is used to calculate the similarity loss, that is, the association score between entities.
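Two building blocks of this substep, attention-based aggregation and cosine scoring, are sketched below in plain Python. This is a deliberately simplified stand-in: it uses single-query dot-product attention rather than the two-layer self-attention of the description, leaves out the LSTM and fully connected layers, and the names `attention_pool` and `cosine_similarity` are hypothetical.

```python
import math


def attention_pool(vectors, query):
    """Aggregate vectors into one, weighted by softmax of dot products with `query`."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(vector, query)) for vector in vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(vectors[0])
    return [sum(w * vector[i] for w, vector in zip(weights, vectors)) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In this simplified view, the word vector and concept vector of the entity to be disambiguated would be pooled into one vector, the word vector and context vector of the candidate into another, and the cosine of the two pooled vectors would serve as the association score.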

Substep 4-2. The above steps are repeated in turn for every entity in the candidate entity set, and the entities are ranked by association score; the highest-scoring entity is the candidate that best matches the entity to be disambiguated, which completes the entity disambiguation. Finally, the entity to be disambiguated is linked to the corresponding entity in the knowledge space according to the disambiguation result.
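The ranking in substep 4-2 can be illustrated with the short sketch below. The function `link_entity`, the scoring callback and the sample scores are assumptions introduced for this example; in practice the scores would come from the matching network of substep 4-1.

```python
def link_entity(mention, candidates, score_fn):
    """Score every candidate, rank by association score, return the best match."""
    scored = [(score_fn(mention, c), c) for c in candidates]
    if not scored:
        return None, []
    # Highest association score first.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return ranked[0][1], ranked


# Made-up scores standing in for the output of the matching network.
toy_scores = {"Apple Inc.": 0.91, "apple (fruit)": 0.12, "Apple Records": 0.35}
best, ranking = link_entity(
    "Apple", list(toy_scores), lambda mention, candidate: toy_scores[candidate]
)
```

Returning the full ranking alongside the top entity keeps the scores available, for example for applying a threshold before the link is written back to the knowledge space.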

Based on the same inventive concept, the invention further provides an entity disambiguation apparatus based on the UCL knowledge space, which comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when loaded into the processor, the computer program implements the entity disambiguation method based on the UCL knowledge space described above. The apparatus comprises a UCL knowledge space construction module, a vector representation generation module, a feature extraction module and a semantic matching module. The UCL knowledge space construction module builds a basic knowledge base from Wikipedia and Baidu Encyclopedia knowledge, indexes news information with UCLs and stores it in the knowledge base as a content supplement, thereby completing the construction of the UCL knowledge space; it implements the content of step 1. The vector representation generation module extracts the entity to be disambiguated from the text and acquires the candidate entity set for entity disambiguation from the UCL knowledge space; it implements the content of step 2. The feature extraction module extracts features of the entity to be disambiguated and enhances its embedded representation with the concept knowledge corresponding to the entity's context information, then enhances the embedding of the candidate entity with the existing fact knowledge in the UCL knowledge space; it implements the content of step 3. The semantic matching module obtains the matching result using a deep structured semantic model based on the self-attention matching network; it implements the content of step 4.

The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims

1. An entity disambiguation method based on UCL knowledge space is characterized by comprising the following steps:

(4) semantic matching: using four vector representations of the conceptual embedding of the entity to be disambiguated generated in the step (3), the context embedding of the candidate entity and the word vector embedding of the candidate entity and the entity to be disambiguated generated in the step (2) as input, and obtaining the matching degree by adopting a self-attention matching network based on a deep structured semantic matching model (DSSM); and obtaining a final disambiguation result according to the ordering of the matching results, completing entity linkage between the entity in the text and the entity in the UCL knowledge space, and further updating the content of the UCL knowledge space.

2. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: when the basic knowledge base is constructed in the step (1), a Chinese Wikipedia offline database is selected as the basic knowledge base, and entities, attributes and associated information are acquired from the entry pages of Wikipedia; to address the incompleteness of the Wikipedia data, Baidu Encyclopedia data is used to supplement the entities, further improving the content of the knowledge base.

3. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: when the UCL knowledge space is constructed in the step (1), news from mainstream Chinese news portal sites is crawled using crawler technology, the obtained news information is cleaned and semantically analyzed, the entities in the news are indexed as UCL entities and integrated into the basic entity library, and links are constructed between UCLs and between entities and UCLs, thereby completing the construction of the initial UCL knowledge space.

4. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: in the step (2), to acquire the candidate entities, link candidate filtering is completed first, and the target objects that the entity may link to are acquired from the knowledge base by means of an entity mention dictionary; given an object m to be disambiguated and a candidate entity e, the word vector representation of the entity to be disambiguated and the word vector representation of the candidate entity are generated by averaging the hidden representations of the words or characters in the text.

5. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: the feature extraction process of the entity to be disambiguated in the first stage in the step (3) is realized based on three modules, and specifically comprises a knowledge feature network module, a subgraph construction module and a conceptualization module.

6. The method for entity disambiguation based on the UCL knowledge space of claim 5, wherein: the knowledge feature network in the step (3) is constructed from the correspondences among the words in the sentence where the entity to be disambiguated is located, the entity to be disambiguated, and the candidate concepts; the words and the entity to be disambiguated are obtained by lexical analysis and entity recognition, and the candidate concepts are obtained by knowledge space matching; the knowledge feature network describes three relations, namely a concept-entity relation, a concept-concept relation and a word-concept relation; a conceptualized subgraph is constructed from these three relations; the most appropriate concept is obtained by computing a random walk with restart, and the concept corresponding to the entity to be disambiguated is converted into a conceptualized embedding.

7. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: the second stage of the feature extraction process for the candidate entity in the step (3) comprises the following processes: firstly, generating a positive sample according to the description of an entity in a UCL knowledge space, and segmenting a description text; counting word frequencies in the positive samples, and generating negative samples by utilizing layered random sampling; generating a training sample by utilizing an entity co-occurrence data set and S-P-O triple data in a knowledge space so as to learn the relation between a candidate entity and an enhanced entity representation, wherein the S-P-O triple comes from a UCL node stored in the knowledge space, and a news event indexed by the UCL naturally carries triple information; then, generating vector representation of the entity and the context information thereof through a CBOW model, then updating the embedded representation of the entity through a Skip-Gram model, enhancing the relation between the entities, and training to obtain the embedded representation of the candidate entity

8. The UCL knowledge-space-based entity disambiguation method of claim 1, characterized in that: the semantic matching process of the entity in the step (4) adopts a deep structured semantic model based on a self-attention matching network; the four vector representations of the entity to be disambiguated and the candidate entity obtained in the step (2) and the step (3), namely the word vector of the entity to be disambiguated, the word vector of the candidate entity, the concept vector of the entity to be disambiguated, and the context vector of the candidate entity, are fed into the network, which maps the vectors to a continuous semantic space to learn a low-dimensional vector representation of the text and thereby models the semantic similarity between the vectors; a multilayer fully connected network serves as the model for generating the low-dimensional semantic vectors, and the similarity loss is calculated using cosine similarity; finally, the candidate entity that semantically best matches the entity to be disambiguated is found, completing the disambiguation task.

9. An apparatus for entity disambiguation based on a UCL knowledge space, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the computer program when loaded into a processor implements the method for entity disambiguation based on the UCL knowledge space of any of claims 1-8.

10. The UCL knowledge-space-based entity disambiguation apparatus of claim 1, wherein: the computer program comprises a UCL knowledge space construction module, a vector representation generation module, a feature extraction module and a semantic matching module; the UCL knowledge space construction module is used for constructing a basic knowledge base by acquiring Wikipedia and Baidu Encyclopedia knowledge, indexing news information with UCLs and storing the news information in the knowledge base as a content supplement of the UCL knowledge space, thereby completing the construction of the UCL knowledge space; the vector representation generation module is used for extracting an entity to be disambiguated from the text and acquiring a candidate entity set for entity disambiguation from the UCL knowledge space; the feature extraction module is used for extracting features of the entity to be disambiguated and enhancing the embedded representation of the entity to be disambiguated with the concept knowledge corresponding to the context information of the entity, and for enhancing the embedding of the candidate entity with the fact knowledge in the UCL knowledge space; the semantic matching module is used for obtaining a matching result using a deep structured semantic model based on a self-attention matching network.