CN107545033B - Knowledge base entity classification calculation method based on representation learning


Publication number
CN107545033B
Authority
CN
China
Prior art keywords
entity, word, category, representing, entities
Legal status
Active
Application number
CN201710608234.8A
Other languages
Chinese (zh)
Other versions
CN107545033A (en)
Inventor
李涓子
侯磊
金海龙
张鹏
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201710608234.8A priority Critical patent/CN107545033B/en
Publication of CN107545033A publication Critical patent/CN107545033A/en
Application granted granted Critical
Publication of CN107545033B publication Critical patent/CN107545033B/en

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a computing method for knowledge base entity classification based on representation learning, in the fields of text classification and knowledge base completion. The method comprises the following steps: constructing co-occurrence networks containing different levels of information for the entities in a knowledge base, encoding the co-occurrence information among words, entities, and categories into word-word, entity-word, category-word, and entity-category networks; learning vector representations of entities and categories over the constructed co-occurrence networks using a network-based representation learning method; learning mapping matrices for entities and categories from the learned vector representations using a learning-to-rank algorithm, so that semantically related entities and categories are close to each other in a semantic space; and automatically assigning categories to the entities in the knowledge base with a top-down search to obtain a category path. The method of the invention helps solve the problems of existing entity classification methods.

Description

Knowledge base entity classification calculation method based on representation learning
Technical Field
The invention relates to the technical field of text classification and knowledge base completion, in particular to a knowledge base entity classification calculation method based on representation learning.
Background
This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present invention, and is believed to provide the reader with useful background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that the description in this section is for purposes of illustration and is not an admission of prior art.
In recent years, knowledge bases have attracted increasing research interest. Most existing knowledge bases are incomplete, and many researchers are dedicated to knowledge base completion. Assigning categories to entities in a knowledge base is an important completion task. Entity category information plays a very important role in a knowledge base and benefits tasks such as question answering, recommendation, and relation extraction. The current main research direction is to assign fine-grained categories to entities, as fine-grained categories provide richer semantic information.
Existing research generally classifies entities in a knowledge base with multi-class machine learning algorithms, i.e., the entity classification task in a knowledge base is treated as a traditional text classification problem in natural language processing. The main steps are to define features based on the knowledge base and then predict categories with a traditional multi-class algorithm. In recent years, representation learning has developed rapidly and greatly helps the entity classification task: a common approach is to define features for entities and for categories separately, and then map both into the same semantic space, enabling inference of entity categories with good results.
However, existing entity classification algorithms face two major problems. First, it is difficult to design effective features for entities in a knowledge base: unlike entities appearing in running context, knowledge base entities carry little contextual semantic information, yet they come with rich textual and structural information, so they need to be represented in a reasonable way. Second, the hierarchical relationship among categories is not fully considered: the categories in a knowledge base form a tree whose structure carries information, and existing methods do not fully exploit the hierarchy of the classification tree.
Disclosure of Invention
The technical problem to be solved is how to provide a computing method for knowledge base entity classification based on representation learning.
Aiming at the defects of the prior art, the invention provides a computing method for knowledge base entity classification based on representation learning, which can better solve the problems of existing knowledge base entity classification methods.
In a first aspect, the present invention provides a computing method for knowledge base entity classification based on representation learning, comprising the steps of:
A: for entities in a knowledge base with given category labels, constructing four co-occurrence networks (word-word, entity-word, category-word, and entity-category) and integrating the semantic information into the 4 heterogeneous co-occurrence networks;
B: based on the 4 heterogeneous co-occurrence networks, learning a vector representation of each entity and each category with a network-based representation learning algorithm;
C: based on the vector representations of entities and categories, learning the mapping matrices of entities and categories with a learning-to-rank algorithm, and mapping entities and categories into the same semantic space;
D: computing the similarity between entities and categories from the vector representations and the mapping matrices, and assigning category paths to unlabeled entities with a top-down search.
Optionally, the step A includes:
A1: constructing the word-word co-occurrence network G_ww, which describes word-level co-occurrence information in the entity descriptions and is formally denoted G_ww = (V, E_ww); each node represents a word, and the weight ω_ij on an edge represents the number of co-occurrences of the two words in the text;
A2: constructing the entity-word co-occurrence network G_ew, a bipartite graph composed of entities and words, formally denoted G_ew = (𝓔 ∪ V, E_ew); the weight ω_ij on an edge represents the number of occurrences of word w_j in the text description of entity e_i;
A3: constructing the category-word co-occurrence network G_tw, a bipartite graph composed of categories and words, formally denoted G_tw = (𝓣 ∪ V, E_tw); the weight ω_ij on an edge represents the number of occurrences of word w_j under category t_i;
A4: constructing the entity-category co-occurrence network G_et, a bipartite graph composed of entities and categories, formally denoted G_et = (𝓔 ∪ 𝓣, E_et); there is an edge (ω_ij = 1) between entity e_i and category t_j if and only if entity e_i belongs to category t_j;
wherein ω_ij represents the weight on an edge; w_i represents a word; t_i represents a category; e_i represents an entity; e_i (bold) represents the vector representation of entity e_i; t_i (bold) represents the vector representation of category t_i; 𝓔 denotes the set of all entities and 𝓣 the set of all categories.
Optionally, the step B includes the steps of:
based on the 4 heterogeneous co-occurrence networks G_ww, G_ew, G_tw and G_et obtained, learning a vector representation of each entity e_i and category t_j using the PTE algorithm;
B1: for any bipartite graph G = (V_A ∪ V_B, E), where V_A and V_B are disjoint node sets and E is the set of edges, defining the conditional probability that v_j ∈ V_B generates v_i ∈ V_A as:

p(v_i | v_j) = exp(u_i · u_j) / Σ_{i′ ∈ V_A} exp(u_{i′} · u_j)

where u_i and u_j are the vector representations of v_i and v_j; for any v_j ∈ V_B, this defines a conditional distribution p(·|v_j) over all nodes in V_A;
B2: based on the conditional distribution defined in B1 for each node, for every v_j ∈ V_B, making the conditional distribution p(·|v_j) approach the empirical distribution p̂(·|v_j), the closeness of the two distributions being measured by the KL divergence:

O = Σ_j λ_j · KL( p̂(·|v_j) ‖ p(·|v_j) )

where λ_j = Σ_i w_ij denotes the degree of node v_j and the empirical distribution is computed as p̂(v_i | v_j) = w_ij / λ_j; the objective function simplifies to O = −Σ_{(i,j)∈E} w_ij log p(v_i | v_j);
B3: based on the objective function defined in B2, defining a corresponding objective function O_ww, O_ew, O_et and O_tw for each bipartite graph defined in A, and summing the objective functions:

O_n = O_ww + O_ew + O_et + O_tw

which is optimized jointly to obtain the vector representation of each entity and category, E_emb = {e_i} and T_emb = {t_i}.
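The simplification in B2 follows by expanding the KL divergence and dropping the term that does not depend on the model; a short sketch of that step, using λ_j · p̂(v_i|v_j) = w_ij:

```latex
O = \sum_{j} \lambda_j \, \mathrm{KL}\!\left(\hat p(\cdot\,|\,v_j) \,\middle\|\, p(\cdot\,|\,v_j)\right)
  = \sum_{j} \lambda_j \sum_{i} \hat p(v_i|v_j) \log \frac{\hat p(v_i|v_j)}{p(v_i|v_j)}
  = \underbrace{\sum_{(i,j)\in E} w_{ij} \log \hat p(v_i|v_j)}_{\text{constant in the model}}
    \;-\; \sum_{(i,j)\in E} w_{ij} \log p(v_i|v_j)
```

Minimizing O is therefore equivalent to minimizing the simplified objective in B2.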
Optionally, the step C includes the steps of:
C1: defining precedence relationships between pairs of categories;
C2: based on the precedence relationships between categories defined in C1, learning the mapping matrices of entities and categories, and mapping entities and categories into the same semantic space, in which semantically related entities and categories are close to each other:

Φ_e(e_i) = U · e_i
Φ_t(t_j) = V · t_j

wherein U denotes the mapping (projection) matrix for entity vectors; Φ_e(e_i) denotes the projection of entity vector e_i, computed with the projection matrix U; V here denotes the mapping (projection) matrix for category vectors; Φ_t(t_j) denotes the projection of category vector t_j, computed with the projection matrix V; s(e_i, t_j) denotes the similarity between entity e_i and category t_j; the network symbols (G_ww, G_ew, G_tw, G_et and their node and edge sets) are as defined in step A and in the symbol list following step D.
Optionally, in the step C2, the precedence relationship between the two categories includes a first (ancestor) order, whose defining inequality appears in the original only as an equation image; here l(t_i, t_j) denotes the distance between categories t_i and t_j in the classification tree. The objective function based on the first precedence relationship, likewise given in the original as equation images, sums a ranking loss over each category t_k on the category path p(e) of an entity e and its ancestor nodes A(t_k), where a weighting function maps the rank to a floating-point weight and s(e, t_k) denotes the inner product of Φ_e(e) and Φ_t(t_k).
Optionally, in the step C2, the precedence relationship between the two categories further includes a second (sibling) order, likewise given in the original as equation images, where S(t_k) denotes the sibling nodes of category t_k. Summing over all entities with category label information yields the overall objective function:

[overall objective function given in the original as an equation image]

The objective function is solved with the stochastic gradient descent (SGD) algorithm, learning the mapping matrices U and V of entities and categories.
Optionally, in the step D, the category paths of unlabeled entities are predicted with a top-down search strategy, based on the vector representations of entities and categories obtained in step B and the mapping matrices obtained in step C.
Optionally, in the step D, starting from the root node of the classification tree, the best-matching category for the current entity is found at each level by computing the similarity between the entity and the categories, and the search recurses until it terminates at a leaf node or the similarity falls below a threshold, the similarity between an entity and a category being computed as:

s(e_i, t_j) = Φ_e(e_i) · Φ_t(t_j)

Computing the similarity uses the vector representations of entities and categories (e_i and t_j) and the mapping matrices of entities and categories (U and V); the whole procedure is a top-down search, and the predicted result naturally forms a category path, meeting the requirements of the fine-grained entity classification task.
According to the technical scheme above, the computing method for knowledge base entity classification based on representation learning provided by the invention constructs information networks from the text descriptions of entities and then learns low-dimensional dense vector representations of entities and categories from the networks, so that no features need to be defined manually for entities, effectively solving the entity representation problem; using a learning-to-rank algorithm with precedence relationships defined between pairs of categories, entities and categories are mapped into the same semantic space, fully considering the hierarchical relationships among categories and effectively solving the hierarchical classification problem. On the one hand, the method starts from large-scale text, constructs networks containing different kinds of information, and obtains vector representations of entities and categories with a representation learning algorithm, without manually defined features, effectively addressing the difficulty of representing entities in a knowledge base. On the other hand, a learning-to-rank algorithm maps entities and categories into the same semantic space through the defined precedence relationships between categories, enabling top-down category inference; the hierarchical relationships among categories are thus effectively incorporated into the model, which suits the hierarchical classification problem.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the description of the embodiments or the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart illustrating a method for computing knowledge base entity classification based on representation learning according to an embodiment of the present invention;
FIG. 2 is a flow chart of a computing method for knowledge base entity classification based on representation learning according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1 and fig. 2, the invention provides flow diagrams of the computing method for knowledge base entity classification based on representation learning. As shown in fig. 1, the method includes:
Step A: constructing 4 heterogeneous co-occurrence networks, namely the word-word, entity-word, category-word (type-word) and entity-category (entity-type) co-occurrence networks, each of which can be regarded as a bipartite graph.
The step A specifically comprises the following steps:
A1: constructing the word-word co-occurrence network G_ww, which describes word-level co-occurrence information in the entity descriptions and is formally denoted G_ww = (V, E_ww); each node represents a word, and the weight ω_ij on an edge represents the number of co-occurrences of the two words in the text (within a given co-occurrence window).
A2: constructing the entity-word co-occurrence network G_ew, a bipartite graph composed of entities and words, formally denoted G_ew = (𝓔 ∪ V, E_ew); the weight ω_ij on an edge represents the number of occurrences of word w_j in the text description of entity e_i.
A3: constructing the category-word co-occurrence network G_tw, a bipartite graph composed of categories and words, formally denoted G_tw = (𝓣 ∪ V, E_tw); the weight ω_ij on an edge represents the number of occurrences of word w_j under category t_i. Specifically, the occurrences of w_j in the text description of each entity under category t_i are counted and summed, giving the total number of occurrences of w_j over all entities under t_i.
A4: constructing the entity-category co-occurrence network G_et, a bipartite graph composed of entities and categories, formally denoted G_et = (𝓔 ∪ 𝓣, E_et); there is an edge (ω_ij = 1) between entity e_i and category t_j if and only if entity e_i belongs to category t_j.
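To make step A concrete, the following is a minimal Python sketch of the four-network construction. The input format (each entity as a tuple (entity_id, category_path, tokens)) and all names are illustrative assumptions, not taken from the patent.

```python
from collections import defaultdict

def build_networks(entities, window=5):
    """Build the four heterogeneous co-occurrence networks as edge-weight dicts.
    `entities`: iterable of (entity_id, category_path, tokens); this input
    layout is assumed for illustration."""
    G_ww = defaultdict(int)  # (word, word) -> co-occurrence count (A1)
    G_ew = defaultdict(int)  # (entity, word) -> occurrence count (A2)
    G_tw = defaultdict(int)  # (category, word) -> occurrence count (A3)
    G_et = {}                # (entity, category) -> 1 (A4)

    for eid, categories, tokens in entities:
        for i, w in enumerate(tokens):
            # A1: words co-occurring within the given window
            for w2 in tokens[i + 1:i + 1 + window]:
                G_ww[(w, w2)] += 1
                G_ww[(w2, w)] += 1
            # A2: word occurrences in this entity's description
            G_ew[(eid, w)] += 1
        for t in categories:
            # A3: sum word counts over all entities under category t
            for w in tokens:
                G_tw[(t, w)] += 1
            # A4: edge of weight 1 iff the entity belongs to the category
            G_et[(eid, t)] = 1
    return G_ww, G_ew, G_tw, G_et
```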
Step B: based on the 4 heterogeneous co-occurrence networks G_ww, G_ew, G_tw and G_et obtained in step A, learning the vector representation of each entity e_i and category t_j with a network-based representation learning algorithm, such that semantically similar entities have similar representations and semantically similar categories have similar representations.
The step B specifically comprises the following steps:
B1: for any bipartite graph G = (V_A ∪ V_B, E), where V_A and V_B are disjoint node sets and E is the set of edges, the conditional probability that v_j ∈ V_B generates v_i ∈ V_A is defined as:

p(v_i | v_j) = exp(u_i · u_j) / Σ_{i′ ∈ V_A} exp(u_{i′} · u_j)

where u_i and u_j are the vector representations of v_i and v_j. For any v_j ∈ V_B, this defines a conditional distribution p(·|v_j) over all nodes in V_A.
B2: based on the conditional distribution defined in B1 for each node, for every v_j ∈ V_B, the conditional distribution p(·|v_j) is made to approach the empirical distribution p̂(·|v_j), the closeness of the two distributions being measured by the KL divergence:

O = Σ_j λ_j · KL( p̂(·|v_j) ‖ p(·|v_j) )

where λ_j = Σ_i w_ij denotes the degree of node v_j and the empirical distribution is computed as p̂(v_i | v_j) = w_ij / λ_j. The objective function simplifies to O = −Σ_{(i,j)∈E} w_ij log p(v_i | v_j).
B3: based on the objective function defined in B2, a corresponding objective function O_ww, O_ew, O_et and O_tw is defined for each bipartite graph defined in A, and the objective functions are summed:

O_n = O_ww + O_ew + O_et + O_tw

which is optimized jointly to obtain the vector representation of each entity and category, E_emb = {e_i} and T_emb = {t_i}.
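The joint optimization of O_n can be sketched as follows: repeatedly pick one of the four networks, sample an edge with probability proportional to its weight, and take a stochastic gradient step. The full softmax in p(v_i|v_j) is approximated here with negative sampling, a standard device for this family of objectives that the patent does not spell out; the embedding tables are assumed to be NumPy arrays indexed by integer node ids.

```python
import numpy as np

def edge_step(edges, probs, emb_src, emb_dst, k=5, lr=0.025):
    """One stochastic step on O = -sum_(i,j) w_ij * log p(v_i | v_j) for a
    single bipartite network, with the softmax over the target side replaced
    by k negative samples. `edges` is a list of (i, j) node-id pairs and
    `probs` their weights normalized to sum to 1; both layouts are assumed."""
    i, j = edges[np.random.choice(len(edges), p=probs)]
    u_i = emb_src[i].copy()                 # freeze u_i for this step
    grad_i = np.zeros_like(u_i)
    # one positive target plus k uniformly drawn negatives
    targets = [(j, 1.0)] + [(np.random.randint(len(emb_dst)), 0.0)
                            for _ in range(k)]
    for t, label in targets:
        score = 1.0 / (1.0 + np.exp(-u_i @ emb_dst[t]))  # sigmoid(u_i . u_t)
        grad_i += (label - score) * emb_dst[t]
        emb_dst[t] += lr * (label - score) * u_i
    emb_src[i] += lr * grad_i

# Joint optimization of O_n alternates such steps over G_ww, G_ew, G_tw and
# G_et, sharing the word table across the first three networks and the entity
# and category tables with G_et.
```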
Step C: based on the vector representations of entities and categories obtained in step B, learning the mapping matrices of entities and categories with a learning-to-rank algorithm, and mapping entities and categories into the same semantic space, in which semantically similar entities and categories are also close to each other.
The step C specifically comprises the following steps:
C1: defining precedence relationships between pairs of categories. First, in the category path corresponding to an entity, a more specific category is closer to the entity than a more general one; this is called the ancestor order. Second, the correct category is closer to the entity than its sibling categories in the classification tree; this is called the sibling order.
C2: based on the priority relationship between the categories defined by C1, learning the mapping matrix of the entities and the categories, and mapping the entities and the categories into the same semantic space, wherein the semantically related entities and categories are also close to each other in the semantic space:
Φe(ei)=U·ei
Φt(tj)=V·tj
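A minimal sketch of one learning-to-rank update for step C: for an entity e and a pair of categories in which t_pos should take precedence over t_neg (an ancestor under the first order, a sibling under the second), a margin violation pushes the projections apart. The exact rank-weighted objective appears in the original only as equation images, so a plain hinge loss stands in for it here.

```python
import numpy as np

def rank_step(e_vec, pos_vec, neg_vec, U, V, margin=1.0, lr=0.01):
    """One SGD update enforcing s(e, t_pos) > s(e, t_neg) + margin, where
    s(e, t) = (U @ e) . (V @ t); the hinge loss is an assumed stand-in for
    the patent's rank-weighted objective."""
    pe = U @ e_vec                          # Phi_e(e)
    s_pos = pe @ (V @ pos_vec)              # s(e, t_pos)
    s_neg = pe @ (V @ neg_vec)              # s(e, t_neg)
    if margin - s_pos + s_neg > 0:          # precedence violated
        # gradient of the hinge w.r.t. U and V (in-place updates)
        U += lr * (np.outer(V @ pos_vec, e_vec) - np.outer(V @ neg_vec, e_vec))
        V += lr * (np.outer(pe, pos_vec) - np.outer(pe, neg_vec))
```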
step D: and B, predicting the category path of the unmarked entity by adopting a top-down search strategy based on the vector representation of the entity and the category obtained in the step B and the mapping matrix obtained in the step C. Starting from a root node of the classification tree, finding the class which is the most matched with the current entity in each layer by calculating the similarity between the entity and the class, and recursively searching until the similarity is terminated at a leaf node or is lower than a certain threshold, wherein the similarity between the entity and the class is calculated in the following way:
s(ei,tj)=Φe(ei)·Φt(tj)
vector representation of entities and categories is used when calculating similarity (e)iAnd tj) And entity and category mapping matrices (U and V), the whole process is a top-down search process, and a category path is naturally formed by a predicted result.
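Step D thus reduces to a greedy descent of the classification tree; below is a sketch under assumed containers (`children`, a dict mapping a category to its child categories, and `cat_vec`, a dict mapping a category to its embedding; neither name is from the patent). Inputs are assumed to be NumPy vectors and matrices.

```python
def predict_path(e_vec, children, cat_vec, root, U, V, threshold=0.0):
    """Greedy top-down search: at each level keep the child category with the
    highest similarity s(e, t) = Phi_e(e) . Phi_t(t), stopping at a leaf or
    when the best similarity drops below the threshold."""
    pe = U @ e_vec                          # Phi_e(e), computed once
    path, node = [], root
    while children.get(node):
        best = max(children[node], key=lambda t: pe @ (V @ cat_vec[t]))
        if pe @ (V @ cat_vec[best]) < threshold:
            break
        path.append(best)
        node = best
    return path
```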
The following describes the symbols used in the formulas of the present invention:
ω_ij broadly refers to the weight on an edge (subscripts unrestricted).
w_i broadly refers to a word (subscripts unrestricted).
e_i broadly refers to an entity (subscripts unrestricted).
t_i broadly refers to a category (subscripts unrestricted).
e_i (bold) broadly refers to the vector representation of entity e_i (subscripts unrestricted).
t_i (bold) broadly refers to the vector representation of category t_i (subscripts unrestricted).
G_ww denotes the word-word co-occurrence network.
V denotes the set of all words.
E_ww denotes the set of edges in the word-word co-occurrence network.
G_ew denotes the entity-word co-occurrence network.
𝓔 denotes the set of all entities.
E_ew denotes the set of edges in the entity-word co-occurrence network.
G_tw denotes the category-word co-occurrence network.
𝓣 denotes the set of all categories.
E_tw denotes the set of edges in the category-word co-occurrence network.
G_et denotes the entity-category co-occurrence network.
E_et denotes the set of edges in the entity-category co-occurrence network.
G generally denotes a bipartite graph; V_A and V_B are the two disjoint node sets in graph G, and E is the set of edges in graph G.
p(v_i | v_j) denotes the conditional probability that node v_j in V_B generates node v_i in V_A.
u_i and u_j denote the vector representations of v_i and v_j.
exp is the exponential function.
p(·|v_j) denotes the conditional distribution, over all nodes in V_A, generated by node v_j in V_B.
p̂(·|v_j) denotes the empirical distribution corresponding to p(·|v_j).
O_ww, O_ew, O_et and O_tw denote the objective functions of the network representation learning method on the word-word network G_ww, the entity-word network G_ew, the entity-category network G_et and the category-word network G_tw, respectively.
O_n denotes the overall objective function of the network representation learning method over the four heterogeneous networks.
U denotes the mapping (projection) matrix for entity vectors.
Φ_e(e_i) denotes the projection of entity vector e_i, computed with the projection matrix U.
V denotes the mapping (projection) matrix for category vectors (the letter V is reused; context distinguishes it from the word set).
Φ_t(t_j) denotes the projection of category vector t_j, computed with the projection matrix V.
s(e_i, t_j) denotes the similarity between entity e_i and category t_j.
Experiments were carried out with the method of the invention; the specific procedure is as follows:
1. Data sets. The data sets are constructed from the DBpedia classification tree and the text descriptions in Wikipedia: each Wikipedia entry has a unique category path (a path in the corresponding DBpedia classification tree), and the Wikipedia text serves as the text description of each entity. A total of 3 data sets were constructed: (1) the full text of each Wikipedia entry as the entity's text description; (2) the abstract section of each Wikipedia entry as the entity's text description; (3) the stemmed text of each entry as the entity's text description. The statistics of the data sets are shown in Table 1.
Table 1. Statistics of the data sets

Data set     Full text     Abstract     Stemmed
Types        451           451          450
Entities     3,087,751     2,536,198    2,847,568
Words        31,752        17,451       25,430
G_et edges   7,757,347     6,340,495    7,190,233
G_ew edges   418,527,303   247,165,283  334,632,976
G_tw edges   6,743,100     3,184,492    4,730,374
G_ww edges   377,267,923   147,490,406  224,829,203
For the full-text data set, low-frequency words are filtered with a threshold of 1500; for the abstract data set, the threshold is 1000. The data are split into training and test sets at a ratio of 80/20.
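A sketch of this preprocessing, reading the threshold as a minimum word frequency (an interpretation) and reusing the illustrative entity-tuple format from step A:

```python
from collections import Counter
import random

def preprocess(entities, min_count, test_ratio=0.2, seed=42):
    """Drop words occurring fewer than `min_count` times (1500 for full text,
    1000 for abstracts) and split the entities 80/20 into train/test."""
    freq = Counter(w for _, _, tokens in entities for w in tokens)
    kept = [(eid, cats, [w for w in tokens if freq[w] >= min_count])
            for eid, cats, tokens in entities]
    random.Random(seed).shuffle(kept)
    cut = int(len(kept) * (1 - test_ratio))
    return kept[:cut], kept[cut:]
```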
2. Experimental setup. As in previous work, Strict-F1, Mi-F1 and Ma-F1 are used to evaluate the results. The comparison methods are the Tipalo, SDType, FIGMENT, CUTE and CE/HCE models, plus ablation experiments on the method itself. The first 4 are traditional entity classification algorithms, while CE/HCE is an entity classification algorithm based on representation learning. The ablation experiments test the contribution of the word-word network.
3. Results and analysis of the experiments
With the data sets and experimental setup above, the method of the present disclosure (denoted EFHET) was tested on each data set and compared with the mainstream methods above. Table 2 shows the entity classification evaluation results: on every data set, EFHET is clearly superior to the comparison methods under all 3 evaluation metrics, demonstrating the accuracy and stability of the disclosed method.
Table 2. Entity classification results on the knowledge base
[Table 2 appears in the original only as an image; it reports Strict-F1, Mi-F1 and Ma-F1 for each method on each data set.]
Analysis of the experimental results. First, the EFHET method performs better than several popular entity classification algorithms, mainly because it exploits more structural information: in the network-based representation learning, semantically related entities obtain similar representations and semantically close categories obtain similar representations; in learning the mapping matrices of entities and categories, the defined precedence relationships between pairs of categories build a bridge between entities and categories, giving strong discriminative power during classification, hence the better results.
In addition, EFHET has significant advantages over CE/HCE, which is also a representation-learning-based method. The main reason is that CE/HCE depends on entity pairs appearing in context; unlike the co-occurrence relationships between words, such entity co-occurrence relationships are very sparse and noisy, and the small data volume hurts the experimental results. EFHET instead starts from large-scale text and uses only plain textual information at the word level, so its data scale is larger and its results naturally better.
Finally, the network ablation shows that the word-word network helps the final results. For example, two synonymous words (in the original Chinese, two distinct words that both translate as "computer") are similar and obtain similar representations during representation learning. The two words may each often appear with different entities, but those entities are likely to obtain similar representations because of the relationship between the two words. The word-word network thus alleviates the synonym problem to a certain extent and improves the final results.
In summary, the computing method for knowledge base entity classification based on representation learning provided by the invention constructs information networks from the text descriptions of entities and learns low-dimensional dense vector representations of entities and categories from the networks, without manually defining features for entities, effectively solving the entity representation problem; using a learning-to-rank algorithm with precedence relationships defined between pairs of categories, entities and categories are mapped into the same semantic space, fully considering the hierarchical relationships among categories and effectively solving the hierarchical classification problem. The method starts from large-scale text, constructs networks containing different kinds of information, and obtains vector representations of entities and categories with a representation learning algorithm, without manually defined features, effectively addressing the difficulty of representing entities in a knowledge base. It further adopts a learning-to-rank algorithm that maps entities and categories into the same semantic space through the defined precedence relationships, enabling top-down category inference; the hierarchical relationships among categories are thus effectively incorporated into the model, which suits the hierarchical classification problem.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element. The terms "upper", "lower", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the referred devices or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Unless expressly stated or limited otherwise, the terms "mounted," "connected," and "connected" are intended to be inclusive and mean, for example, that they may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description. Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention is not limited to any single aspect, nor is it limited to any single embodiment, nor is it limited to any combination and/or permutation of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of the present invention may be utilized alone or in combination with one or more other aspects and/or embodiments thereof.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (8)

1. A method for computing knowledge base entity classification based on representation learning, comprising:
A: for entities in a knowledge base with given category labels, constructing four co-occurrence networks (word-word, entity-word, category-word, and entity-category) and integrating the semantic information into the 4 heterogeneous co-occurrence networks;
B: based on the 4 heterogeneous co-occurrence networks, learning a vector representation of each entity and each category with a network-based representation learning algorithm;
C: based on the vector representations of entities and categories, learning the mapping matrices of entities and categories with a learning-to-rank algorithm, and mapping entities and categories into the same semantic space;
D: computing the similarity between entities and categories from the vector representations and the mapping matrices, and assigning category paths to unlabeled entities with a top-down search.
2. The method of claim 1, wherein step A comprises:
A1: constructing the word-word co-occurrence network G_ww, which describes word-level co-occurrence information in the entity descriptions and is formally denoted G_ww = (V, E_ww), where each node represents a word and the weight ω_ij on an edge represents the number of co-occurrences of the two words in the text;
A2: constructing the entity-word co-occurrence network G_ew, a bipartite graph composed of entities and words, formally denoted G_ew = (𝓔 ∪ V, E_ew), where the weight ω_ij on an edge represents the number of occurrences of word w_j in the text description of entity e_i;
A3: constructing the category-word co-occurrence network G_tw, a bipartite graph composed of categories and words, formally denoted G_tw = (𝓣 ∪ V, E_tw), where the weight ω_ij on an edge represents the number of occurrences of word w_j under category t_i;
A4: constructing the entity-category co-occurrence network G_et, a bipartite graph composed of entities and categories, formally denoted G_et = (𝓔 ∪ 𝓣, E_et), where there is an edge (ω_ij = 1) between entity e_i and category t_j if and only if entity e_i belongs to category t_j;
wherein ω_ij represents the weight on an edge; w_i represents a word; t_i represents a category; e_i represents an entity; G_ww denotes the word-word co-occurrence network; V denotes the set of all words; E_ww denotes the set of edges in the word-word co-occurrence network; G_ew denotes the entity-word co-occurrence network; 𝓔 denotes the set of all entities; E_ew denotes the set of edges in the entity-word co-occurrence network; G_tw denotes the category-word co-occurrence network; 𝓣 denotes the set of all categories; E_tw denotes the set of edges in the category-word co-occurrence network; G_et denotes the entity-category co-occurrence network; and E_et denotes the set of edges in the entity-category co-occurrence network.
3. The method of claim 1, wherein the step B comprises the steps of:
based on the 4 heterogeneous co-occurrence networks G_ww, G_ew, G_tw and G_et obtained, learning a vector representation of each entity e_i and category t_j using the PTE algorithm;
B1: for any bipartite graph G = (V_A ∪ V_B, E), where V_A and V_B are disjoint node sets and E is the set of edges, defining the conditional probability that v_j ∈ V_B generates v_i ∈ V_A as:

p(v_i | v_j) = exp(u_i · u_j) / Σ_{i′ ∈ V_A} exp(u_{i′} · u_j)

where u_i and u_j are the vector representations of v_i and v_j; for any v_j ∈ V_B, this defines a conditional distribution p(·|v_j) over all nodes in V_A;
B2: based on the conditional distribution defined in B1 for each node, for every v_j ∈ V_B, making the conditional distribution p(·|v_j) approach the empirical distribution p̂(·|v_j), the closeness of the two distributions being measured by the KL divergence:

O = Σ_j λ_j · KL( p̂(·|v_j) ‖ p(·|v_j) )

where λ_j = Σ_i w_ij denotes the degree of node v_j and the empirical distribution is computed as p̂(v_i | v_j) = w_ij / λ_j; the objective function simplifies to O = −Σ_{(i,j)∈E} w_ij log(p(v_i | v_j));
B3: based on the objective function defined in B2, defining a corresponding objective function O_ww, O_ew, O_et and O_tw for each bipartite graph defined in step A, and summing the objective functions:

O_n = O_ww + O_ew + O_et + O_tw

which is optimized jointly to obtain the vector representation of each entity and category, E_emb = {e_i} and T_emb = {t_i};
wherein O_ww, O_ew, O_et and O_tw respectively denote the objective functions of the network representation learning method on the word-word network G_ww, the entity-word network G_ew, the entity-category network G_et and the category-word network G_tw; ω_ij denotes the weight on an edge; j denotes the index of a node distinct from i; i′ ranges over the indices of the nodes in the set V_A (or V_B); and V_A and V_B denote the two disjoint node sets in graph G, A denoting the first node set and B the second.
4. The method of claim 1, wherein the step C comprises the steps of:
C1: defining precedence relationships between pairs of categories;
C2: based on the precedence relationships between categories defined in C1, learning the mapping matrices of entities and categories, and mapping entities and categories into the same semantic space, in which semantically related entities and categories are also close to each other:

Φ_e(e_i) = U · e_i
Φ_t(t_j) = C · t_j

wherein G_ww denotes the word-word co-occurrence network; V denotes the set of all words; E_ww denotes the set of edges in the word-word co-occurrence network; G_ew denotes the entity-word co-occurrence network; 𝓔 denotes the set of all entities; E_ew denotes the set of edges in the entity-word co-occurrence network; G_tw denotes the category-word co-occurrence network; 𝓣 denotes the set of all categories; E_tw denotes the set of edges in the category-word co-occurrence network; G_et denotes the entity-category co-occurrence network; E_et denotes the set of edges in the entity-category co-occurrence network; G denotes a bipartite graph, V_A and V_B being the two disjoint node sets in graph G and E the set of edges in graph G; p(v_i | v_j) denotes the conditional probability that node v_j in V_B generates node v_i in V_A; u_i and u_j denote the vector representations of v_i and v_j; exp is the exponential function; p(·|v_j) denotes the conditional distribution, over all nodes in V_A, generated by node v_j in V_B; p̂(·|v_j) denotes the empirical distribution corresponding to p(·|v_j); O_ww, O_ew, O_et and O_tw respectively denote the objective functions of the network representation learning method on the word-word network G_ww, the entity-word network G_ew, the entity-category network G_et and the category-word network G_tw, and O_n denotes the overall objective function over the four heterogeneous networks; U denotes the mapping (projection) matrix for entity vectors; Φ_e(e_i) denotes the projection of entity vector e_i, computed with the projection matrix U; C denotes the mapping (projection) matrix for category vectors; Φ_t(t_j) denotes the projection of category vector t_j, computed with C; and s(e_i, t_j) denotes the similarity between entity e_i and category t_j.
5. The method according to claim 4, wherein in the step C2, the precedence relationship between two categories includes a first (ancestor) order, whose defining inequality is given in the original as an equation image, wherein l(t_i, t_j) denotes the distance between categories t_i and t_j in the classification tree; the objective function based on the first precedence relationship, given in the original as equation images, sums a ranking loss over each category t_k on the category path p(e) of an entity and its ancestor nodes A(t_k), wherein a weighting function maps the rank to a floating-point weight, s(e, t_k) denotes the inner product of Φ_e(e) and Φ_t(t_k), and root denotes the root node of the classification tree.
6. The method according to claim 4, wherein in the step C2, the precedence relationship between two categories further includes a second (sibling) order, given in the original as equation images, wherein S(t_k) denotes the sibling nodes of category t_k, p(e) denotes the category path of entity e, s(e, t_k) denotes the inner product of Φ_e(e) and Φ_t(t_k), and t_k′ denotes an ancestor or sibling category of any category of entity e; summing over all entities with category label information yields the overall objective function (given in the original as an equation image), which is solved with the stochastic gradient descent (SGD) algorithm to learn the mapping matrices U and C of entities and categories.
7. The method according to claim 1, wherein in step D, the category paths of unlabeled entities are predicted with a top-down search strategy, based on the vector representations of entities and categories obtained in step B and the mapping matrices obtained in step C.
8. The method according to claim 1, wherein in step D, starting from the root node of the classification tree, the best-matching category for the current entity is found at each level by computing the similarity between the entity and the categories, the search recursing until it terminates at a leaf node or the similarity falls below a threshold, wherein the similarity between an entity and a category is computed as:

s(e_i, t_j) = Φ_e(e_i) · Φ_t(t_j)

wherein Φ_e(e_i) denotes the mapping function of entity vector e_i and Φ_t(t_j) denotes the mapping function of category vector t_j;
computing the similarity uses the vector representations e_i and t_j of the entity and the category and the mapping matrices U and C of entities and categories; the whole procedure is a top-down search, and the predicted result naturally forms a category path, meeting the requirement of the fine-grained entity classification task.
CN201710608234.8A 2017-07-24 2017-07-24 Knowledge base entity classification calculation method based on representation learning Active CN107545033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710608234.8A CN107545033B (en) 2017-07-24 2017-07-24 Knowledge base entity classification calculation method based on representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710608234.8A CN107545033B (en) 2017-07-24 2017-07-24 Knowledge base entity classification calculation method based on representation learning

Publications (2)

Publication Number Publication Date
CN107545033A CN107545033A (en) 2018-01-05
CN107545033B true CN107545033B (en) 2020-12-01

Family

ID=60970776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710608234.8A Active CN107545033B (en) 2017-07-24 2017-07-24 Knowledge base entity classification calculation method based on representation learning

Country Status (1)

Country Link
CN (1) CN107545033B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228877B (en) * 2018-01-22 2020-08-04 北京师范大学 Knowledge base completion method and device based on learning sorting algorithm
CN112487195B (en) * 2019-09-12 2023-06-27 医渡云(北京)技术有限公司 Entity ordering method, entity ordering device, entity ordering medium and electronic equipment
CN111259215B (en) * 2020-02-14 2023-06-27 北京百度网讯科技有限公司 Multi-mode-based topic classification method, device, equipment and storage medium
CN111522959B (en) * 2020-07-03 2021-05-28 科大讯飞(苏州)科技有限公司 Entity classification method, system and computer readable storage medium
CN112699676B (en) * 2020-12-31 2024-04-12 中国农业银行股份有限公司 Address similarity relation generation method and device
CN114781471B (en) * 2021-06-02 2022-12-27 清华大学 Entity record matching method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591988A (en) * 2012-01-16 2012-07-18 宋胜利 Short text classification method based on semantic graphs
US8990200B1 (en) * 2009-10-02 2015-03-24 Flipboard, Inc. Topical search system
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN104615687A (en) * 2015-01-22 2015-05-13 中国科学院计算技术研究所 Entity fine granularity classifying method and system for knowledge base updating
CN104915397A (en) * 2015-05-28 2015-09-16 国家计算机网络与信息安全管理中心 Method and device for predicting microblog propagation tendencies
CN105824802A (en) * 2016-03-31 2016-08-03 清华大学 Method and device for acquiring knowledge graph vectoring expression
CN106909622A (en) * 2017-01-20 2017-06-30 中国科学院计算技术研究所 Knowledge mapping vector representation method, knowledge mapping relation inference method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339551B (en) * 2007-07-05 2013-01-30 日电(中国)有限公司 Natural language query demand extension equipment and its method
CN102750316B (en) * 2012-04-25 2015-10-28 北京航空航天大学 Based on the conceptual relation label abstracting method of semantic co-occurrence patterns
US9292797B2 (en) * 2012-12-14 2016-03-22 International Business Machines Corporation Semi-supervised data integration model for named entity classification
CN103699663B (en) * 2013-12-27 2017-02-08 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
CN106919689B (en) * 2017-03-03 2018-05-11 中国科学技术信息研究所 Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990200B1 (en) * 2009-10-02 2015-03-24 Flipboard, Inc. Topical search system
CN102591988A (en) * 2012-01-16 2012-07-18 宋胜利 Short text classification method based on semantic graphs
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN104615687A (en) * 2015-01-22 2015-05-13 中国科学院计算技术研究所 Entity fine granularity classifying method and system for knowledge base updating
CN104915397A (en) * 2015-05-28 2015-09-16 国家计算机网络与信息安全管理中心 Method and device for predicting microblog propagation tendencies
CN105824802A (en) * 2016-03-31 2016-08-03 清华大学 Method and device for acquiring knowledge graph vectoring expression
CN106909622A (en) * 2017-01-20 2017-06-30 中国科学院计算技术研究所 Knowledge mapping vector representation method, knowledge mapping relation inference method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Measuring the Influence from User-Generated Content to News via Cross-dependence Topic Modeling; Lei Hou et al.; International Conference on Database Systems for Advanced Applications; 2015-04-09; pp. 125-141 *
Research on aggregation of semantically annotated knowledge resources based on research hotspots (基于研究热点的语义标注知识资源聚合研究); 崔娜娜 et al.; 《情报探索》; 2016-05-15; No. 5; pp. 127-134 *
A semantic representation method for Chinese text oriented to text classification (面向文本分类的中文文本语义表示方法); 宋胜利 et al.; 《西安电子科技大学学报(自然科学版)》 (Journal of Xidian University, Natural Science Edition); 2012-11-16; Vol. 40, No. 2; pp. 89-97, 129 *

Also Published As

Publication number Publication date
CN107545033A (en) 2018-01-05

Similar Documents

Publication Publication Date Title
CN107545033B (en) Knowledge base entity classification calculation method based on representation learning
CN111488734B (en) Emotional feature representation learning system and method based on global interaction and syntactic dependency
WO2023000574A1 (en) Model training method, apparatus and device, and readable storage medium
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN107220311B (en) Text representation method for modeling by utilizing local embedded topics
Huang et al. Large-scale heterogeneous feature embedding
CN111191466A (en) Homonymous author disambiguation method based on network characterization and semantic characterization
CN108470025A (en) Partial-Topic probability generates regularization own coding text and is embedded in representation method
Zarei et al. Detecting community structure in complex networks using genetic algorithm based on object migrating automata
Gao et al. Clustering algorithms for detecting functional modules in protein interaction networks
Lan et al. Benchmarking of computational methods for predicting circRNA-disease associations
Dong et al. Predicting protein complexes using a supervised learning method combined with local structural information
CN113539372A (en) Efficient prediction method for LncRNA and disease association relation
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
Sun et al. Graph embedding with rich information through heterogeneous network
CN113392334B (en) False comment detection method in cold start environment
Xiao et al. Non-local attention learning on large heterogeneous information networks
CN116991986B (en) Language model light weight method, device, computer equipment and storage medium
CN116825234B (en) Multi-mode information fusion medicine molecule activity prediction method and electronic equipment
Azondekon Modeling the Complexity and Dynamics of the Malaria Research Collaboration Network in Benin, West Africa: papers indexed in the Web Of Science (1996—2016)
Li et al. Learning diffusion on global graph: A PDE-directed approach for feature detection on geometric shapes
CN113850811B (en) Three-dimensional point cloud instance segmentation method based on multi-scale clustering and mask scoring
LIU et al. Community detection in networks based on information bottleneck clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant