Disclosure of Invention
In order to solve the above problem, according to an embodiment of the present invention, a method for classifying entities based on knowledge base update is provided, including:
step 1), identifying an entity from a text;
step 2), constructing a dependency graph by taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes, wherein the weight value of an edge in the dependency graph represents the correlation degree between two nodes connected by the edge;
and 3) obtaining the classification of the identified entity by executing the restart random walk on the dependency graph.
In the above method, step 2) includes:
step 21), obtaining related entities of the identified entities in the knowledge base according to the semantic compatibility, and obtaining the classification of the related entities in the knowledge base; the semantic compatibility degree represents the similarity between the context information of the identified entity and the description text of the related entity;
step 22), the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base are used as nodes;
step 23), adding edges between the nodes representing the identified entities and the nodes representing the related entities, wherein the weight of the edges is the semantic compatibility between the identified entities and the related entities;
adding an edge between the node representing the related entity and the node representing the classification, wherein the weight of the edge indicates whether the related entity belongs to the classification;
adding edges between nodes representing related entities, wherein the weight of the edges is the semantic relevance between the related entities;
edges are added among nodes representing the classification, and the weight of the edges is the correlation degree among the classifications.
In the above method, the semantic compatibility is calculated according to the following formula:
SC(em, e) = sim(X, T) = (V_X · V_T) / (|V_X| · |V_T|)
wherein SC(em, e) represents the semantic compatibility of the identified entity em and the related entity e in the knowledge base; X represents the context information of em; T represents the description text of e; V_X and V_T represent the TF-IDF vectors composed of all the biterms contained in X and T, respectively; |·| represents the norm of a vector; and a biterm is a co-occurring word pair in a text. The context information of the identified entity consists of the words that occur before and after the entity in the text.
In the above method, step 21) includes:
taking the entities in the knowledge base whose semantic compatibility with the identified entity is greater than 0 as the related entities.
In the above method, the semantic relatedness between related entities is calculated according to the following formula:
SR(e1, e2) = 1 − (log max(|I1|, |I2|) − log |I1 ∩ I2|) / (log |Z| − log min(|I1|, |I2|))
wherein SR(e1, e2) represents the semantic relatedness of the related entities e1 and e2 in the knowledge base; I1 and I2 respectively represent the sets of texts describing entities in the knowledge base in which e1 and e2 occur; Z represents the set of all entities contained in the knowledge base; and |·| represents the size of a set.
In the above method, the relatedness between classifications is calculated according to the following formula:
CR(c1, c2) = |E_c1 ∩ E_c2| / |E_c1 ∪ E_c2|
wherein CR(c1, c2) represents the degree of relatedness between the classifications c1 and c2; E_c1 and E_c2 respectively represent the sets of entities belonging to the classifications c1 and c2 in the knowledge base; and |·| represents the size of a set.
In the above method, step 3) includes:
step 31), initializing the distribution state of the nodes in the dependency graph according to the following formula:
r_i^(0) = (r_i(1), r_i(2), …, r_i(n))^T
wherein n represents the total number of nodes and r_i^(0) represents the initial distribution state of node i; if k = i, then r_i(k) = 1, otherwise r_i(k) = 0, k being a natural number with 1 ≤ k ≤ n;
step 32), calculating the state transition probability matrix A = (a_ij):
a_ij = w_ij / Σ_k w_ik
wherein a_ij represents the probability of transferring from node i to node j during the restart random walk, i and j being natural numbers satisfying 1 ≤ i, j ≤ n; w_ij is the weight of the edge between node i and node j; and Σ_k w_ik represents the sum of the weights of all edges connecting node i;
step 33), for each node, iteratively carrying out state transitions to its neighbor nodes until the distribution state of every node in the dependency graph no longer changes as the number of iterations increases; wherein the distribution state r_i^(t) of node i after the t-th iteration is expressed as follows:
r_i^(t) = (1 − μ) · A^T · r_i^(t−1) + μ · v_i
wherein r_i^(t) represents the distribution state of node i after the t-th iteration, t being a natural number and i being a natural number with 1 ≤ i ≤ n; r_i^(t−1) represents the distribution state of node i after the (t−1)-th iteration; μ represents the probability of returning to the starting node i after the t-th iteration, is called the restart factor, and is a real number with 0 < μ < 1; v_i represents the restart vector of node i, v_i = (v_i(1), …, v_i(n))^T; if k = i, then v_i(k) = 1, otherwise v_i(k) = 0, k being a natural number with 1 ≤ k ≤ n;
and step 34) obtaining corresponding classification according to the distribution state of the nodes.
In the above method, step 34) includes:
in the distribution state of the nodes representing the identified entities, sorting the nodes representing the classification according to the values of the components corresponding to the nodes;
and obtaining the classification corresponding to the identified entity according to the sorting result.
According to an embodiment of the present invention, there is also provided a system for fine-grained classification of entities oriented to knowledge base update, including:
the entity identification device is used for identifying the entity from the text;
the dependency graph constructing device is used for constructing a dependency graph by taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes, wherein the weight value of an edge in the dependency graph represents the correlation degree between two nodes connected by the edge; and an iteration device for obtaining the classification to which the identified entity belongs by performing a restart random walk on the dependency graph.
The method can overcome the defect in the prior art that fine-grained classification of an entity is difficult to achieve when entity context is lacking. By modeling the semantic relatedness among entities appearing in the same text, together with the relationship between text entities and knowledge base entities and their classifications, the method provides strong evidential support for fine-grained classification of the entities in the same text, and improves the accuracy of fine-grained entity classification through the restart random walk algorithm.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
According to one embodiment of the invention, a knowledge base update-oriented entity fine-grained classification method is provided.
In general terms, the method comprises: identifying entities from a text; constructing a dependency graph by taking the identified entities, the entities in the knowledge base related to them, and the classifications of those related entities in the knowledge base as nodes, wherein the weight of an edge in the dependency graph represents the degree of relatedness between the two nodes it connects; and obtaining the classification to which an identified entity belongs by performing a restart random walk on the dependency graph. The method is based on the distributional hypothesis: the greater the semantic relatedness of the contexts in which two entities occur, the greater the probability that they belong to the same class.
Referring now to FIG. 1, the steps of the method of the present invention are described.
Step 101: inputting a text document to be processed and a target knowledge base
The text document D and the target knowledge base KB to be processed are selected, and the system input is initialized.
As described above, a knowledge base (KB) is composed of the entities, classifications, relationships, attributes, etc. that describe knowledge; thus the target knowledge base KB can be modeled as follows:
KB=<C,E,P,R>
wherein C represents a classification set contained in the target knowledge base; e and P represent the set of entities and their attributes belonging to the class, respectively, and R is a function defining the relationship between the class, instance, and attribute. In set E, each entity E may be represented in the form:
e=<name,aliases,T>
wherein name represents the name of entity e; aliases represents the set of aliases of entity e; and T represents the description text of entity e. The attribute set P_e of entity e and the classification set C_e to which entity e belongs can be obtained through the function R of the knowledge base KB, and satisfy P_e ⊆ P and C_e ⊆ C.
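The modeling above can be sketched in code. The following is a minimal illustration rather than part of the claimed method: all class and field names are chosen for this sketch, the attribute set P is omitted, and the relationship function R is reduced to a plain entity-to-classification mapping.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """An entity e = <name, aliases, T> as modeled above."""
    name: str
    aliases: set
    description: str  # the description text T of the entity

@dataclass
class KnowledgeBase:
    """A knowledge base KB = <C, E, P, R>, reduced to what this sketch uses."""
    categories: set          # the classification set C
    entities: dict           # E: entity name -> Entity
    entity_categories: dict  # part of R: entity name -> classification set C_e

kb = KnowledgeBase(
    categories={"athlete", "sports league"},
    entities={"Jordan": Entity("Jordan", {"Michael Jordan"},
                               "American basketball player")},
    entity_categories={"Jordan": {"athlete"}},
)
print(kb.entity_categories["Jordan"])  # → {'athlete'}
```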
The target knowledge base in the above form can be modeled using various existing encyclopedic database resources; for example, in this step a knowledge base created on the basis of Wikipedia is used as the input target knowledge base.
Step 102: extracting entities contained in text documents
Using the named entity recognition tool, a set of all entities contained in the text document D is extracted.
The set of all entities contained in the text document D can be written as:
EM = {em_i | i is an integer, 0 ≤ i ≤ |D|}
Wherein | D | is the length of the text document; each element em in the set is represented in the form:
em=<name,D,X>
wherein name represents the name of em; D represents the source text document of em; and X represents the context describing em. In one embodiment, X is represented by a window of words appearing around em in the text document D, the window size being k (k being an integer with 0 < k ≤ |D|); i.e., the length of the context X is 2k (X is made up of the k words appearing before em and the k words appearing after em in the text document D), preferably k = min(50, |D|).
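The context window X described above can be sketched as follows; the whitespace tokenization and the mention position `idx` are assumptions of this illustration, not details fixed by the method:

```python
def context_window(tokens, idx, k=50):
    """Return up to k tokens before and k tokens after the mention at
    position idx, i.e. a context X of length at most 2k."""
    return tokens[max(0, idx - k):idx] + tokens[idx + 1:idx + 1 + k]

doc = ["the", "hall", "of", "fame", "honors", "great", "players"]
print(context_window(doc, 4, k=2))  # → ['of', 'fame', 'great', 'players']
```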
Those skilled in the art will appreciate that various named entity recognition tools are available to extract entities in text. In one embodiment, stanford NER is utilized as a named entity recognition tool.
Step 103: creating dependency graphs
And creating a dependency graph according to the entity set EM and the target knowledge base KB extracted from the text document D, so that semantic correlation among different entities in the text document D and dependency between the entities in the text document D and the entities in the knowledge base KB and the belonged classes thereof are uniformly modeled.
Referring to FIG. 2, in one embodiment, creating a dependency graph includes the following sub-steps:
step 1031: the set of entities EM and the target knowledge base KB identified from the text document D are input.
Step 1032: a candidate entity is selected.
According to the Semantic Compatibility (SC) between an entity's context and the texts describing knowledge base entities, for each entity em ∈ EM a set of candidate entities that are semantically compatible with it is selected in the knowledge base KB, denoted as:
ES_em = {e ∈ E | SC(em, e) > 0}
where SC(em, e) represents the semantic compatibility between em and knowledge base entity e. In one embodiment, the semantic compatibility is calculated as a biterm-based cosine similarity:
SC(em, e) = sim(X, T) = (V_X · V_T) / (|V_X| · |V_T|)
wherein SC(em, e) is a real number with 0 ≤ SC(em, e) ≤ 1.0; X is the context information describing em; T is the description text of e; sim(X, T) is the similarity of X and T; V_X and V_T are the TF-IDF vectors composed of all the biterms contained in X and T, respectively; |·| is the norm of a vector; and a biterm is a co-occurring word pair in a text. For example, given the text "apple app store", which yields the three words "apple", "app" and "store" by word segmentation, the set of biterms contained in the text is {apple app, apple store, app store}.
According to the above formula, if SC(em, e) > 0, e is selected as a candidate entity of em, thereby obtaining the set ES_em of candidate entities semantically compatible with em.
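The candidate selection of this step can be sketched as follows. Raw biterm counts stand in for the TF-IDF weights (computing true IDF would require the whole corpus), so this is a simplification of the similarity described above:

```python
import math
from collections import Counter

def biterms(text):
    """All co-occurring word pairs (biterms) in a text, by whitespace tokenization."""
    words = text.split()
    return [f"{words[i]} {words[j]}" for i in range(len(words))
            for j in range(i + 1, len(words))]

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as Counters."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_compatibility(context, description):
    """SC(em, e) as a biterm-based cosine similarity (counts instead of TF-IDF)."""
    return cosine(Counter(biterms(context)), Counter(biterms(description)))

def candidates(mention_context, kb_descriptions):
    """ES_em = all knowledge base entities e with SC(em, e) > 0."""
    return {e for e, t in kb_descriptions.items()
            if semantic_compatibility(mention_context, t) > 0}

print(biterms("apple app store"))  # → ['apple app', 'apple store', 'app store']
print(candidates("apple app store",
                 {"Apple Inc": "apple app maker", "Banana": "yellow fruit"}))
```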
Step 1033: a candidate classification is selected.
Based on the relationship definition function R in the knowledge base KB, the set C_e of classifications in the knowledge base KB to which each candidate entity e selected in step 1032 belongs is obtained and taken as the candidate classification set.
Step 1034: and establishing nodes and side information in the dependency graph.
The node set in the dependency graph includes a set of all entities (text entities for short) extracted from the text document D, a set of candidate entities (knowledge base entities for short) semantically compatible with the extracted entities, and a set of classifications (knowledge base classifications for short) to which the candidate entities belong.
After the nodes in the graph are established, edges and weights are distributed among the nodes, which specifically includes:
1. and adding a connecting edge between the node representing the text entity em and the node e representing the knowledge base entity compatible with the semantics of the text entity em, wherein the weight value of the edge is the semantic compatibility SC (em, e) between the nodes.
2. Adding a connecting edge between a node representing the knowledge base entity e and a node representing the class c to which the node belongs, wherein the weight value of the edge is the Affiliation Relationship (AR) between the nodes, if the entity belongs to the class, the weight value is 1.0, and if the entity does not belong to the class, the weight value is 0.0.
3. A connecting edge is added between two nodes e1 and e2 representing knowledge base entities, and the weight of the edge is the Semantic Relatedness (SR) between them. It is worth noting that the semantic relatedness between entities in the same text is here measured indirectly by the semantic relatedness between the corresponding knowledge base entities.
In one embodiment, the semantic relatedness SR(e1, e2) between entities e1 and e2 is calculated based on the normalized Google distance:
SR(e1, e2) = 1 − (log max(|I1|, |I2|) − log |I1 ∩ I2|) / (log |Z| − log min(|I1|, |I2|))
wherein SR(e1, e2) is a real number with 0 ≤ SR(e1, e2) ≤ 1.0; I1 and I2 respectively represent the sets of texts describing entities in the knowledge base KB in which e1 and e2 occur; Z represents the set of all entities contained in the knowledge base KB; and |·| represents the size of a set.
4. A connecting edge is added between two nodes c1 and c2 representing knowledge base classifications, and the weight of the edge is the relatedness (CR) between them. In one embodiment, the relatedness CR(c1, c2) between classifications c1 and c2 is calculated using the Jaccard coefficient:
CR(c1, c2) = |E_c1 ∩ E_c2| / |E_c1 ∪ E_c2|
wherein CR(c1, c2) is a real number with 0 ≤ CR(c1, c2) ≤ 1.0; E_c1 and E_c2 respectively represent the sets of entities belonging to the classifications c1 and c2 in the knowledge base KB; and |·| represents the size of a set.
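The two relatedness measures of items 3 and 4 above can be sketched together. The normalized-Google-distance form used here is the standard similarity derived from that distance, stated as an assumption since the patent text only names the technique:

```python
import math

def semantic_relatedness(I1, I2, Z):
    """SR(e1, e2) via normalized Google distance, from the sets I1, I2 of
    describing texts in which the two entities occur and the set Z of all
    entities in the knowledge base."""
    inter = len(I1 & I2)
    if inter == 0:
        return 0.0  # entities never co-occur in any describing text
    ngd = ((math.log(max(len(I1), len(I2))) - math.log(inter))
           / (math.log(len(Z)) - math.log(min(len(I1), len(I2)))))
    return max(0.0, 1.0 - ngd)

def category_relatedness(E1, E2):
    """CR(c1, c2): Jaccard coefficient of the entity sets of two classifications."""
    union = E1 | E2
    return len(E1 & E2) / len(union) if union else 0.0

Z = set(range(100))  # a toy knowledge base of 100 entities
print(semantic_relatedness({1, 2, 3}, {2, 3, 4}, Z))
print(category_relatedness({"a", "b", "c"}, {"b", "c", "d"}))  # → 0.5
```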
By establishing the nodes and connecting edges, a dependency graph is constructed for all the entities EM in the text document D, denoted G = (V, E, W). G is an undirected graph, where V is the set of vertices of the graph, including all entities in the given text, all entities in the knowledge base that are semantically compatible with those entities, and the set of classifications to which those entities belong; E is the set of edges among the nodes; and W: E → R (R being the real numbers) assigns the weight on each edge.
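The four kinds of edges above can be assembled into the weight map W of G = (V, E, W) as follows; the edge-weight functions sc, sr and cr are passed in, and all names are illustrative rather than part of the method:

```python
def build_dependency_graph(mentions, candidates, categories, sc, sr, cr):
    """Return the weight map W of the undirected dependency graph.

    W maps frozenset({u, v}) to the edge weight, keeping only positive weights.
    mentions: text entities; candidates: mention -> set of KB entities;
    categories: KB entity -> set of classifications."""
    W = {}
    def add(u, v, w):
        if w > 0:
            W[frozenset((u, v))] = w
    ents = sorted({e for es in candidates.values() for e in es})
    cats = sorted({c for e in ents for c in categories[e]})
    for m in mentions:                 # 1. text entity -- KB entity edges (SC)
        for e in candidates[m]:
            add(m, e, sc(m, e))
    for e in ents:                     # 2. KB entity -- classification edges (AR = 1.0)
        for c in categories[e]:
            add(e, c, 1.0)
    for i, e1 in enumerate(ents):      # 3. KB entity -- KB entity edges (SR)
        for e2 in ents[i + 1:]:
            add(e1, e2, sr(e1, e2))
    for i, c1 in enumerate(cats):      # 4. classification -- classification edges (CR)
        for c2 in cats[i + 1:]:
            add(c1, c2, cr(c1, c2))
    return W
```

Used with the SC, SR and CR functions sketched earlier, this yields the graph on which the restart random walk of step 104 operates.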
Given the text "The Hall of Fame is a great monument for players, and is also an affirmation of a player's career, being the highest recognition besides a championship ring. But because players must wait 5 years after retirement to enter the Hall of Fame, the 'Flying Man' did not gain this honor until 2009. However, this does not prevent Jordan's name from shining through the history of the NBA and even of world basketball", 3 different entities are identified using the named entity recognition tool: "Hall of Fame", "Jordan" and "NBA". Using the method provided by the present invention, a dependency graph model is created for these 3 entities. As shown in fig. 3, the graph contains 12 nodes in total: 3 text entities, 6 knowledge base entities and 3 knowledge base classifications, and contains 12 edges.
Step 104: jointly inferring classification information for entities from a created dependency graph
On the dependency graph created in the previous step, a random walk algorithm, such as the restart random walk algorithm, is executed: the random walk is iterated continuously on the dependency graph until the distribution states of the nodes in the graph no longer change as the number of iterations increases, i.e., until a steady state is reached. At that point, the corresponding classification labels are obtained according to the distribution states of the nodes representing the text entities, thereby inferring the fine-grained classification information of the text entities.
This step will be described in detail below with reference to fig. 4, in conjunction with one embodiment of the present invention:
step 1041: algorithm inputs are initialized.
The created dependency graph G = (V, E, W) is input.
Step 1042: the distribution state of the nodes in the dependency graph is initialized.
Let the number of nodes in the graph G be n = |V| and the number of edges be m = |E|, and number the nodes in the graph G as 1, …, i, …, n (i being a natural number with 1 ≤ i ≤ n).
At the initial time of the algorithm, the distribution state of node i in the dependency graph is set to r_i^(0). The distribution state is an n × 1 column vector over all the nodes contained in the graph G, where n is the number of nodes in the graph G. r_i^(0) is recorded as:
r_i^(0) = (r_i(1), r_i(2), …, r_i(n))^T
wherein each component r_i(k) of the vector takes the following value: if k = i, then r_i(k) = 1, otherwise r_i(k) = 0, k being a natural number with 1 ≤ k ≤ n.
Step 1043: According to the adjacency matrix U = (u_ij) of the dependency graph G = (V, E, W), the state transition probability matrix A = (a_ij) of the random walk process is calculated, i and j being natural numbers satisfying 1 ≤ i, j ≤ n. For the adjacency matrix U, u_ij takes the following value:
u_ij = w_ij if a connecting edge exists between nodes i and j, and u_ij = 0 otherwise,
wherein w_ij is the weight on the connecting edge between node i and node j, determined by W: E → R (R being the real numbers) in G = (V, E, W).
For the state transition probability matrix A, a_ij represents the probability of transitioning from node i to node j during the restart random walk. The adjacency vector of node i with respect to all other nodes in the graph G = (V, E, W) is u_i = (u_i1, …, u_ik, …, u_in), i.e., the vector formed by the i-th row of the adjacency matrix U, k being a natural number with 1 ≤ k ≤ n. a_ij is calculated from the adjacency vector of node i as follows:
a_ij = u_ij / Σ_k u_ik
As can be seen from the above equation, if i = j or no connecting edge exists between nodes i and j, then a_ij = 0; if a connecting edge exists between nodes i and j, then a_ij is proportional to the weight on the edge between node i and node j, i.e., it equals the ratio of the weight on the connecting edge between node i and node j to the sum of the weights on all connecting edges connecting node i.
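The row-normalization of step 1043 can be sketched directly from the weight map W of the graph; names are illustrative:

```python
def transition_matrix(W, nodes):
    """Build A = (a_ij) with a_ij = w_ij / sum_k w_ik, as described above.

    W maps frozenset({u, v}) to an edge weight; nodes fixes the node order."""
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for i, u in enumerate(nodes):
        deg = sum(w for edge, w in W.items() if u in edge)  # weights at node u
        if deg == 0:
            continue  # isolated node: row stays all zero
        for j, v in enumerate(nodes):
            # frozenset((u, u)) collapses to {u}, never a key, so a_ii = 0
            A[i][j] = W.get(frozenset((u, v)), 0.0) / deg
    return A

W = {frozenset(("a", "b")): 1.0, frozenset(("a", "c")): 3.0}
print(transition_matrix(W, ["a", "b", "c"]))
```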
Step 1044: On the dependency graph, starting from the starting node i, state transitions to the neighboring nodes are carried out iteratively. After the t-th iteration, the distribution state r_i^(t) of node i in the graph is expressed as follows:
r_i^(t) = (1 − μ) · A^T · r_i^(t−1) + μ · v_i
wherein t is a natural number; r_i^(t−1) represents the distribution state of node i after the (t−1)-th iteration; r_i^(t) represents the distribution state of node i after the t-th iteration; μ represents the probability of returning to the starting node i after the t-th iteration (called the restart factor; μ is a real number with 0 < μ < 1, preferably 0.15); and v_i is the restart vector of node i, an n × 1 column vector over all the nodes contained in the graph G (n being the number of nodes in the graph G), recorded as v_i = (v_i(1), …, v_i(n))^T, wherein each component v_i(k) of the vector takes the following value: if k = i, then v_i(k) = 1, otherwise v_i(k) = 0, k being a natural number with 1 ≤ k ≤ n.
Step 1044 is repeated until the distribution state r_i^(t) of every node i (i being a natural number with 1 ≤ i ≤ n) in the dependency graph reaches stability, at which point the algorithm terminates. That is, the distribution state r_i^(t) of node i in the dependency graph no longer changes as the number of iterations t increases (the distribution of the nodes reaches a steady state). At this time, the corresponding classification labels are obtained according to the distribution states of the nodes representing the text entities, thereby inferring the specific classification information of the text entities.
Specifically, as discussed above, r_i^(0) is an n × 1 column vector over all the nodes contained in the graph G. The steady-state distribution r_i of node i is likewise an n × 1 column vector over all the nodes contained in the graph G, so the classification nodes in the graph G are also contained in this vector. In the vector r_i, the value of the component corresponding to a classification node represents the probability, after the restart random walk, that the entity represented by node i belongs to that classification. By sorting these probabilities, the classification label corresponding to the entity represented by node i is obtained (i.e., the classification corresponding to the maximum probability is selected).
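The iteration of steps 1042 to 1044 can be sketched end-to-end as follows; the stopping tolerance and the iteration cap are practical additions of this sketch, not values specified above:

```python
def restart_random_walk(A, start, mu=0.15, tol=1e-10, max_iter=1000):
    """Iterate r^(t) = (1 - mu) * A^T r^(t-1) + mu * v from the start node
    until the distribution no longer changes (a steady state)."""
    n = len(A)
    v = [1.0 if k == start else 0.0 for k in range(n)]  # restart vector v_i
    r = v[:]                                            # initial state r_i^(0)
    for _ in range(max_iter):
        nxt = [(1 - mu) * sum(A[i][k] * r[i] for i in range(n)) + mu * v[k]
               for k in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

# Toy two-node graph: the steady state keeps more mass on the start node.
A = [[0.0, 1.0], [1.0, 0.0]]
r = restart_random_walk(A, 0)
print(r[0] > r[1])  # → True
```

The classification of a text entity is then read off by taking, among the components of its steady-state vector that correspond to classification nodes, the one with the maximum probability, as described above.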
By utilizing the classification information of the knowledge base, the classification of a text entity is jointly inferred and the text entity is labeled with the knowledge base classification to which it belongs; through the mutual reinforcement between the classification inference of one entity and that of another entity in the same text, the inference for all entities in the same text is carried out simultaneously.
According to an embodiment of the invention, the system for classifying the entity fine granularity facing the knowledge base update comprises an entity identification device, a dependency graph construction device and an iteration device.
Wherein the entity recognition device is adapted to recognize entities from text, e.g. a named entity recognition tool as described above. The dependency graph construction equipment is used for constructing the dependency graph by taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes. The iteration device is used for obtaining the classification of the identified entity by executing the restart random walk on the dependency graph.
In order to verify the effectiveness of the entity fine-grained classification method and system for knowledge base update provided by the present invention, the inventor conducted experiments on a real YAGO data set using both an existing state-of-the-art entity classification technique (APOLLO) and the method provided by the present invention, with the following experimental parameters:
the entities used in the experiment consist of randomly selected data under 15 sub-categories of the person classification in YAGO, wherein at most 200 entities are randomly selected from each category, giving 2650 entities in total as the final data set DSec. The ratio of the data used for training to the total data in DSec is set to ρ = 0.8, the number of iterations to t = 10, the restart factor to μ = 0.15, and the window size to k = 50.
The following results were obtained through experiments: the classification accuracy rate of the existing APOLLO technology is 0.7254, and the accuracy rate of the classification result obtained by the method and the system provided by the invention is 0.7708. Compared with the existing APOLLO technology, the method and the system for classifying the entity fine granularity, which are provided by the invention, have the advantage that the accuracy is improved by about 4.5%.
In conclusion, the invention provides an entity fine-grained classification method and system facing knowledge base updating, the method models semantic correlation between entities appearing in the same text based on a dependency graph, provides powerful evidence support for classification of entity fine-grained in the same text by utilizing the correlation, and improves accuracy of entity fine-grained classification by a combined inference method based on a restart random walk algorithm.
It should be understood that although the present description is set forth in terms of various embodiments, not every embodiment comprises only a single technical solution; this manner of description is adopted merely for clarity, and those skilled in the art will recognize that the embodiments described herein may be combined as appropriate to form other embodiments.
The above description is only an exemplary embodiment of the present invention, and is not intended to limit the scope of the present invention. Any equivalent changes, modifications and combinations that may be made by those skilled in the art without departing from the spirit and principles of the invention shall fall within the scope of the invention.