CN104615687B - A kind of entity fine grit classification method and system towards knowledge base update - Google Patents


Info

Publication number
CN104615687B
CN104615687B (application CN201510033050.4A)
Authority
CN
China
Prior art keywords
entities
entity
knowledge base
classification
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510033050.4A
Other languages
Chinese (zh)
Other versions
CN104615687A (en
Inventor
程学旗
王元卓
林海伦
贾岩涛
靳小龙
熊锦华
李曼玲
常雨骁
许洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201510033050.4A priority Critical patent/CN104615687B/en
Publication of CN104615687A publication Critical patent/CN104615687A/en
Application granted granted Critical
Publication of CN104615687B publication Critical patent/CN104615687B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology

Abstract

The present invention provides an entity fine-grained classification method and system for knowledge base updating. The method includes: identifying entities in a text; constructing a dependency graph whose nodes are the identified entities, the related entities in the knowledge base, and the categories of those related entities, where the weight of an edge in the dependency graph represents the degree of correlation between the two nodes it connects; and obtaining the category to which each identified entity belongs by performing a random walk with restart on the dependency graph. The present invention overcomes the prior-art difficulty of fine-grained classification of an entity whose context is lacking, and improves the accuracy of entity fine-grained classification.

Description

Entity fine-grained classification method and system for knowledge base updating
Technical Field
The invention relates to the technical field of information processing, and in particular to an entity fine-grained classification method and system for knowledge base updating.
Background
A knowledge base is an interconnected set of knowledge organized and managed in some form of knowledge representation. In knowledge engineering, the elements of knowledge description generally include categories, entities, relationships, and attributes, where categories provide the semantic grouping or semantic labeling of knowledge items in the knowledge base. Knowledge bases play a vital role in many fields. In information retrieval, for example, a knowledge base helps a search engine understand user queries, sense query intent, and perform query expansion and question answering; knowledge bases are also widely applied in data analysis, public-opinion monitoring, deep-Web resource discovery, and other fields. Although numerous knowledge bases exist today, they are still limited in the coverage and freshness of their knowledge. The underlying reason is that, with the advent of the big-data age, data grows at an explosive rate, and new knowledge is created on the Web every day. To construct a high-quality knowledge base, newly generated knowledge must therefore be updated into the existing knowledge base dynamically, automatically, and in real time, which makes the expansion capability, coverage, and freshness of the knowledge base important concerns.
Entities are an important component of knowledge description, so a knowledge base must be able to expand its entity set automatically. To add a newly emerging entity to the knowledge base, the position of the entity in the knowledge base, i.e., the category to which the entity belongs, must first be determined. Once its category is determined, the new entity is added to the knowledge base under that category, enriching the set of entities the knowledge base contains. Currently, there are two main types of entity classification methods: coarse-grained and fine-grained entity classification.
Coarse-grained entity classification divides entities into coarse categories such as person names, place names, and organization names. The classification model is usually trained in a supervised manner and needs a large amount of manually labeled training data. This approach cannot be applied directly to knowledge-base-oriented entity classification: a knowledge base divides entities into hundreds or thousands of categories, which demands training data at a much larger scale, and creating data at that scale requires a large amount of manual labor.
Fine-grained entity classification divides entities into more detailed categories, mainly using heuristic rules or weakly supervised methods. Rule-based methods label entities directly with predefined syntactic patterns; they are simple to operate but require manually maintaining and defining a large number of rules. Weakly supervised methods extract the context of an entity and infer its category from lexical and syntactic features of that context; however, their accuracy is low, and they can hardly infer the category of an entity when its context is lacking.
In summary, existing coarse-grained entity classification methods are not suitable for knowledge base updating, and existing fine-grained entity classification methods have low accuracy.
Disclosure of Invention
In order to solve the above problems, according to an embodiment of the present invention, an entity fine-grained classification method for knowledge base updating is provided, including:
step 1), identifying an entity from a text;
step 2), constructing a dependency graph by taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes, wherein the weight value of an edge in the dependency graph represents the correlation degree between two nodes connected by the edge;
and 3) obtaining the classification of the identified entity by executing the restart random walk on the dependency graph.
In the above method, step 2) includes:
step 21), obtaining related entities of the identified entities in the knowledge base according to the semantic compatibility, and obtaining the classification of the related entities in the knowledge base; the semantic compatibility degree represents the similarity between the context information of the identified entity and the description text of the related entity;
step 22), the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base are used as nodes;
step 23), adding edges between the nodes representing the identified entities and the nodes representing the related entities, wherein the weight of the edges is the semantic compatibility between the identified entities and the related entities;
adding an edge between the node representing the related entity and the node representing the classification, wherein the weight of the edge indicates whether the related entity belongs to the classification;
adding edges between nodes representing related entities, wherein the weight of the edges is the semantic relevance between the related entities;
edges are added among nodes representing the classification, and the weight of the edges is the correlation degree among the classifications.
In the above method, the semantic compatibility is calculated according to the following formula:
SC(em, e) = sim(X, T) = (V(X) · V(T)) / (‖V(X)‖ · ‖V(T)‖)
where SC(em, e) denotes the semantic compatibility between the identified entity em and the related entity e in the knowledge base; X denotes the context information of em; T denotes the description text of e; V(·) denotes the TF-IDF vector composed of all biterms contained in a text; ‖·‖ denotes the norm of a vector; and a biterm is a co-occurring word pair in a text. The context information of an identified entity consists of the words occurring before and after the entity in the text.
In the above method, step 21) includes:
and taking the entity with the semantic compatibility larger than 0 with the identified entity in the knowledge base as the related entity.
In the above method, the semantic relatedness between related entities is calculated according to the following formula:
SR(e1, e2) = 1 − (log max(|I1|, |I2|) − log |I1 ∩ I2|) / (log |Z| − log min(|I1|, |I2|))
where SR(e1, e2) denotes the semantic relatedness of the knowledge base entities e1 and e2; I1 and I2 denote the sets of entities in whose description texts in the knowledge base e1 and e2 occur, respectively; Z denotes the set of all entities contained in the knowledge base; and |·| denotes the size of a set.
In the above method, the correlation between categories is calculated according to the following formula:
CR(c1, c2) = |E(c1) ∩ E(c2)| / |E(c1) ∪ E(c2)|
where CR(c1, c2) denotes the degree of correlation between categories c1 and c2, E(c1) and E(c2) denote the sets of entities belonging to c1 and c2 in the knowledge base, and |·| denotes the size of a set.
In the above method, step 3) includes:
step 31), initializing the distribution state of the nodes in the dependency graph according to the following formula:
r_i^(0) = (r_i(1), …, r_i(k), …, r_i(n))^T
where n denotes the total number of nodes and r_i^(0) denotes the initial distribution state of node i; r_i(k) = 1 if k = i and r_i(k) = 0 otherwise, k being a natural number with 1 ≤ k ≤ n;
step 32), calculating the state transition probability matrix A = (a_ij):
a_ij = w_ij / Σ_k w_ik
where a_ij denotes the probability of moving from node i to node j during the random walk with restart, i and j being natural numbers with 1 ≤ i, j ≤ n; w_ij is the weight of the edge between node i and node j; and Σ_k w_ik is the sum of the weights of all edges connected to node i;
step 33), for each node, iteratively performing state transitions to its neighbor nodes until the distribution state of every node in the dependency graph no longer changes as the number of iterations grows; the distribution state r_i^(t) of node i after the t-th iteration is given by:
r_i^(t) = (1 − μ) · A^T · r_i^(t−1) + μ · v_i
where r_i^(t) denotes the distribution state of node i after the t-th iteration, t being a natural number and i a natural number with 1 ≤ i ≤ n; r_i^(t−1) denotes the distribution state of node i after the (t−1)-th iteration; μ denotes the probability of returning to the starting node i after the t-th iteration, is called the restart factor, and is a real number with 0 < μ < 1; and v_i denotes the restart vector of node i, with v_i(k) = 1 if k = i and v_i(k) = 0 otherwise, k being a natural number with 1 ≤ k ≤ n;
and step 34) obtaining corresponding classification according to the distribution state of the nodes.
In the above method, step 34) includes:
in the distribution state of the nodes representing the identified entities, sorting the nodes representing the classification according to the values of the components corresponding to the nodes;
and obtaining the classification corresponding to the identified entity according to the sorting result.
According to an embodiment of the present invention, there is also provided an entity fine-grained classification system for knowledge base updating, including:
the entity identification device is used for identifying the entity from the text;
the dependency graph constructing device is used for constructing a dependency graph by taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes, wherein the weight value of an edge in the dependency graph represents the correlation degree between two nodes connected by the edge; and an iteration device for obtaining the classification to which the identified entity belongs by performing a restart random walk on the dependency graph.
The method overcomes the prior art's difficulty in fine-grained classification of entities lacking context. By modeling the semantic correlation among entities appearing in the same text, together with the relations between text entities and knowledge base entities and their categories, it provides strong evidential support for fine-grained classification of entities in the same text, and it improves the accuracy of entity fine-grained classification through a random walk with restart algorithm.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a method for knowledge base update oriented entity fine-grained classification according to one embodiment of the invention;
FIG. 2 is a flow diagram of a method of creating a dependency graph model according to one embodiment of the present invention;
FIG. 3 is an example of a dependency graph according to one embodiment of the present invention;
FIG. 4 is a flow diagram of a method of federated inference entity classifications in accordance with one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
According to one embodiment of the invention, a knowledge base update-oriented entity fine-grained classification method is provided.
In general terms, the method comprises: identifying entities in a text; constructing a dependency graph whose nodes are the identified entities, the related entities in the knowledge base, and the categories of those related entities, where the weight of an edge in the dependency graph represents the degree of correlation between the two nodes it connects; and obtaining the category to which each identified entity belongs by performing a random walk with restart on the dependency graph. The method is based on the distributional hypothesis: the greater the semantic relatedness of the contexts in which two entities occur, the greater the probability that they belong to the same category.
Referring now to FIG. 1, the steps of the method of the present invention are described.
Step 101: inputting a text document to be processed and a target knowledge base
The text document D and the target knowledge base KB to be processed are selected, and the system input is initialized.
As described above, the Knowledge Base (KB) is composed of entities, classifications, relationships, attributes, etc. that describe Knowledge, and thus the target Knowledge Base KB can be modeled as follows:
KB=<C,E,P,R>
wherein C represents a classification set contained in the target knowledge base; e and P represent the set of entities and their attributes belonging to the class, respectively, and R is a function defining the relationship between the class, instance, and attribute. In set E, each entity E may be represented in the form:
e=<name,aliases,T>
where name denotes the name of entity e; aliases denotes the set of aliases of entity e; and T denotes the description text of entity e. The attribute set P_e of entity e and the category set C_e to which e belongs can be obtained through the function R of the knowledge base KB, and satisfy P_e ⊆ P and C_e ⊆ C.
A target knowledge base of the above form can be modeled from various existing encyclopedic resources; in this step, for example, a knowledge base created from Wikipedia is used as the input target knowledge base.
Step 102: extracting entities contained in text documents
Using the named entity recognition tool, a set of all entities contained in the text document D is extracted.
The set of all entities contained in the text document D can be written as:
EM = {em_i | i is an integer, 0 ≤ i ≤ |D|}
Wherein | D | is the length of the text document; each element em in the set is represented in the form:
em=<name,D,X>
where name denotes the name of em; D denotes the source text document of em; and X denotes the context describing em. In one embodiment, X is a window of words around the occurrence of em in document D, with window size k (k an integer, 0 < k ≤ |D|); i.e., the length of context X is 2k (X consists of the k words before and the k words after em in document D), preferably with k = min(50, |D|).
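The windowed context X can be sketched in a few lines. The helper below is illustrative only (the function name and token-index interface are assumptions, not from the patent): it collects the k words before and the k words after the mention, so |X| ≤ 2k.

```python
def context_window(tokens, mention_idx, k=50):
    """Context X for the mention at position mention_idx:
    up to k tokens before it and up to k tokens after it."""
    k = min(k, len(tokens))                          # k = min(50, |D|) cap
    before = tokens[max(0, mention_idx - k):mention_idx]
    after = tokens[mention_idx + 1:mention_idx + 1 + k]
    return before + after

tokens = "the hall of fame is a monument for players".split()
ctx = context_window(tokens, tokens.index("fame"), k=3)  # 3 words each side
```

Mentions near the document boundary simply get a shorter window on that side, which matches the 2k upper bound rather than a fixed length.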
Those skilled in the art will appreciate that various named entity recognition tools are available to extract entities in text. In one embodiment, stanford NER is utilized as a named entity recognition tool.
Step 103: creating dependency graphs
A dependency graph is created from the entity set EM extracted from text document D and the target knowledge base KB, so that the semantic correlations among the different entities in D, and the dependencies between the entities in D and the knowledge base entities and their categories, are modeled uniformly.
Referring to FIG. 2, in one embodiment, creating a dependency graph includes the following sub-steps:
step 1031: the set of entities EM and the target knowledge base KB identified from the text document D are input.
Step 1032: a candidate entity is selected.
Based on the Semantic Compatibility (SC) of the texts describing the entities, for each entity em ∈ EM a set of candidate entities that are semantically compatible with it is selected from the knowledge base KB, denoted:
ES_em = {e ∈ E | SC(em, e) > 0}
where SC(em, e) denotes the semantic compatibility between em and the knowledge base entity e. In one embodiment, the semantic compatibility is calculated as a Biterm-based cosine similarity:
SC(em, e) = sim(X, T) = (V(X) · V(T)) / (‖V(X)‖ · ‖V(T)‖)
where SC(em, e) is a real number with 0 ≤ SC(em, e) ≤ 1.0; X is the context information describing em; T is the description text of e; sim(X, T) is the similarity of X and T; V(·) is the TF-IDF vector composed of all biterms contained in a text; ‖·‖ is the norm of a vector; and a biterm is a co-occurring word pair in a text. For example, given the text "apple app store", which yields the three words "apple", "app", and "store" after word segmentation, the set of biterms contained in the text is {apple app, apple store, app store}.
According to the above formula, if SC(em, e) > 0, e is selected as a candidate entity of em, yielding the set ES_em of candidate entities semantically compatible with em.
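As a sketch of the Biterm-based similarity, the snippet below enumerates biterms and computes a cosine similarity over biterm count vectors. Two simplifications relative to the patent are mine: raw counts stand in for TF-IDF weights, and each pair is canonicalized by sorting its two words.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def biterms(text):
    """All unordered co-occurring word pairs (biterms) in a text;
    each pair is canonicalized by sorting its two words."""
    return [" ".join(sorted(p)) for p in combinations(text.split(), 2)]

def semantic_compatibility(context_x, description_t):
    """Cosine similarity of biterm count vectors (the patent uses
    TF-IDF weights instead of raw counts; simplified here)."""
    vx, vt = Counter(biterms(context_x)), Counter(biterms(description_t))
    dot = sum(vx[b] * vt[b] for b in vx)
    nx = sqrt(sum(c * c for c in vx.values()))
    nt = sqrt(sum(c * c for c in vt.values()))
    return dot / (nx * nt) if nx and nt else 0.0
```

With this canonicalization, biterms("apple app store") produces the three pairs of the example above (up to word order within a pair), identical texts score 1.0, and texts sharing no biterm score 0, which is the SC > 0 candidate filter.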
Step 1033: a candidate classification is selected.
Based on the relationship definition function R of the knowledge base KB, the set of categories C_e in KB to which each candidate entity e selected in step 1032 belongs is obtained and taken as the candidate category set.
Step 1034: and establishing nodes and side information in the dependency graph.
The node set in the dependency graph includes a set of all entities (text entities for short) extracted from the text document D, a set of candidate entities (knowledge base entities for short) semantically compatible with the extracted entities, and a set of classifications (knowledge base classifications for short) to which the candidate entities belong.
After the nodes in the graph are established, edges and weights are distributed among the nodes, which specifically includes:
1. An edge is added between the node representing a text entity em and the node representing a knowledge base entity e semantically compatible with it; the weight of the edge is the semantic compatibility SC(em, e) between them.
2. An edge is added between the node representing a knowledge base entity e and the node representing a category c to which it belongs; the weight of the edge is the Affiliation Relationship (AR) between them: 1.0 if the entity belongs to the category, and 0.0 if it does not.
3. An edge is added between two nodes representing knowledge base entities e1 and e2; the weight of the edge is the Semantic Relatedness (SR) between them. Note that the semantic relatedness between entities in the same text is here measured indirectly through the semantic relatedness between knowledge base entities.
In one embodiment, the semantic relatedness SR(e1, e2) between entities e1 and e2 is computed based on the normalized Google distance:
SR(e1, e2) = 1 − (log max(|I1|, |I2|) − log |I1 ∩ I2|) / (log |Z| − log min(|I1|, |I2|))
where SR(e1, e2) is a real number with 0 ≤ SR(e1, e2) ≤ 1.0; I1 and I2 denote the sets of entities in whose description texts in the knowledge base KB the entities e1 and e2 occur, respectively; Z denotes the set of all entities contained in the knowledge base KB; and |·| denotes the size of a set.
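A minimal sketch of the normalized-Google-distance-style relatedness, assuming the sets I1 and I2 are given as Python sets and Z only by its size; the function name and the clamping of the result to [0, 1] are my assumptions.

```python
from math import log

def semantic_relatedness(i1, i2, z_size):
    """NGD-based relatedness of two KB entities.
    i1, i2: sets of KB entities in whose description texts e1 / e2 occur;
    z_size: |Z|, the total number of entities in the KB."""
    inter = len(i1 & i2)
    if inter == 0:
        return 0.0          # never co-mentioned: treated as unrelated
    ngd = (log(max(len(i1), len(i2))) - log(inter)) / \
          (log(z_size) - log(min(len(i1), len(i2))))
    return max(0.0, 1.0 - ngd)
```

Identical co-mention sets give distance 0 and hence relatedness 1.0; the relatedness shrinks as the shared mentions become a smaller fraction of either set.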
4. An edge is added between two nodes c1 and c2 representing knowledge base categories; the weight of the edge is the Correlation (CR) between them. In one embodiment, the correlation CR(c1, c2) between categories c1 and c2 is calculated using the Jaccard coefficient:
CR(c1, c2) = |E(c1) ∩ E(c2)| / |E(c1) ∪ E(c2)|
where CR(c1, c2) is a real number with 0 ≤ CR(c1, c2) ≤ 1.0; E(c1) and E(c2) denote the sets of entities belonging to categories c1 and c2 in the knowledge base KB, respectively; and |·| denotes the size of a set.
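The Jaccard-coefficient correlation is a one-liner over the two categories' entity sets; the helper name is assumed, not from the patent.

```python
def category_relatedness(entities_c1, entities_c2):
    """Jaccard coefficient of the two categories' entity sets:
    |intersection| / |union|, with 0.0 for two empty sets."""
    union = entities_c1 | entities_c2
    return len(entities_c1 & entities_c2) / len(union) if union else 0.0
```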
By establishing the nodes and edges, a dependency graph is constructed for all entities EM in the text document D, denoted G = (V, E, W). G is an undirected graph, where V is the set of vertices, comprising all entities in the given text, all knowledge base entities semantically compatible with those entities, and the categories to which those knowledge base entities belong; E is the set of edges between the nodes; and W: E → R (R the real numbers) assigns a weight to each edge.
Given the text: "The Hall of Fame is a great monument for players and an affirmation of a player's whole career, the best recognition beyond a championship ring. But because players must wait five years after retirement to enter the Hall of Fame, 'His Airness' did not receive this honor until 2009. This, however, did not prevent Jordan's name from shining throughout the history of the NBA and even of world basketball." Using the named entity recognition tool, three different entities are identified: "Hall of Fame", "Jordan", and "NBA". With the method provided by the invention, a dependency graph model is created for these three entities. As shown in FIG. 3, the graph contains 12 nodes in total (3 text entities, 6 knowledge base entities, and 3 knowledge base categories) and 12 edges.
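A dependency graph of this shape can be held in a plain weighted adjacency structure. The node names and weights below are illustrative only (they are not FIG. 3's actual nodes or values); the prefixes mark the three node kinds.

```python
# t: = text entity, e: = knowledge base entity, c: = knowledge base category.
# Weights follow the four edge types: SC, AR, SR, CR (values made up).
edges = {
    ("t:Jordan", "e:Michael_Jordan"): 0.9,              # SC
    ("t:Jordan", "e:Jordan_(country)"): 0.2,            # SC
    ("t:NBA", "e:NBA"): 0.95,                           # SC
    ("e:Michael_Jordan", "c:Basketball_player"): 1.0,   # AR
    ("e:Michael_Jordan", "e:NBA"): 0.7,                 # SR
    ("c:Basketball_player", "c:Athlete"): 0.4,          # CR
}

def adjacency(edge_weights):
    """Symmetric adjacency dict for an undirected weighted graph."""
    adj = {}
    for (u, v), w in edge_weights.items():
        adj.setdefault(u, {})[v] = w    # store the edge in both
        adj.setdefault(v, {})[u] = w    # directions: G is undirected
    return adj
```

Storing each undirected edge in both directions makes the later row-normalization into transition probabilities a simple per-node operation.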
Step 104: jointly inferring classification information for entities from a created dependency graph
On the dependency graph created in the previous step, a random walk algorithm, e.g., random walk with restart, is executed. Random walks are made iteratively on the dependency graph until the distribution state of the nodes no longer changes as the number of iterations grows, i.e., until a steady state is reached. At that point, the corresponding category labels are obtained from the distribution states of the nodes representing text entities, and the fine-grained category information of the text entities is thereby inferred.
This step will be described in detail below with reference to fig. 4, in conjunction with one embodiment of the present invention:
step 1041: algorithm inputs are initialized.
The created dependency graph G = (V, E, W) is input.
Step 1042: the distribution state of the nodes in the dependency graph is initialized.
Let the number of nodes in graph G be n = |V| and the number of edges be m = |E|; the nodes of G are numbered 1, …, i, …, n (i a natural number, 1 ≤ i ≤ n).
The distribution state r_i^(0) of node i in the dependency graph is set at the initial time of the algorithm. The distribution state is an n × 1 column vector over all nodes contained in graph G, where n is the number of nodes in G, written as:
r_i^(0) = (r_i(1), …, r_i(k), …, r_i(n))^T
where each component r_i(k) of the vector takes the value r_i(k) = 1 if k = i and r_i(k) = 0 otherwise, k being a natural number with 1 ≤ k ≤ n.
Step 1043: adjacency matrix U = (U) according to dependency graph G = (V, E, W) ij ) Calculating a state transition probability matrix A = (a) in the random walk process ij ) I and j are natural numbers and satisfy 1 ≦ i, and j ≦ n. For the adjacency matrix U, U ij The values are as follows:
wherein, w ij The weight on the connecting side between node i and node j is determined by W: E → R (R is a real number) in G = (V, E, W).
For the state transition probability matrix A, a ij Representing the probability of transitioning from node i to node j during the restart of the random walk. The adjacency vector composed of the node i and all other nodes in the graph G = (V, E, W) isThe adjacent vector is the vector formed by the ith row elements in the adjacent matrix U, k is a natural number and is more than or equal to 1 and less than or equal to n. A is calculated according to the adjacent vector of the node i in the following way ij
As can be seen from the above equation, if i = j or no connecting edge exists between nodes i and j, a ij =0; if there is a connecting edge between nodes i and j, then a ij The value of (d) is proportional to the weight on the edge between node i and node j, i.e., the ratio of the weight on the connecting edge between node i and node j to the sum of the weights on all connecting edges connecting node i.
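The row normalization a_ij = w_ij / Σ_k w_ik can be sketched over a weighted adjacency dict; this representation is an assumption of the sketch, not prescribed by the patent.

```python
def transition_probs(adj):
    """Row-normalize edge weights into transition probabilities:
    a_ij = w_ij / sum_k w_ik, so each node's outgoing row sums to 1."""
    return {u: {v: w / sum(nbrs.values()) for v, w in nbrs.items()}
            for u, nbrs in adj.items()}
```

Absent entries of the dict play the role of u_ij = 0, so a_ij for a missing edge is implicitly zero, matching the formula above.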
Step 1044: on the dependency graph, starting from the starting node i, the state transition is continuously and iteratively carried out to the neighboring nodes around the starting node i. After the t-th iteration, the distribution state of the node i in the graphIs represented as follows:
wherein t is a natural number;representing the distribution state of the node i after the t-1 th iteration;representing the distribution state of the node i after the t-th iteration; μ represents the probability of returning to the departure node i after the t-th iteration (called restart factor, μ is real and 0)<μ&1, preferably 0.15);is the restart vector of node i, which is a gateA n x 1-dimensional column vector of all nodes contained in the graph G, n being the number of nodes in the graph G,is marked asWherein each component v of the vector i(k) The values of (A) are as follows: if k = i, v i(k) =1, otherwise v i(k) K is a natural number and 1. Ltoreq. K.ltoreq.n.
Repeating the step 1044 until the distribution state of each node i (i is a natural number and is more than or equal to 1 and less than or equal to i and less than or equal to n) in the dependency graphThe algorithm terminates when stability is reached. That is, the distribution state of node i in the dependency graphNo longer changes as the number of iterations t increases (the distribution of nodes reaches a steady state). At this time, according to the distribution state of the nodes representing the text entity, the corresponding classification labels are obtained, so as to deduce the specific classification information of the text entity.
Specifically, as discussed above, r_i^(0) is an n × 1 column vector over all nodes contained in graph G. The steady-state distribution of node i is likewise an n × 1 column vector over all nodes of G, so the category nodes of G are also contained in this vector. In this vector, the value of the component corresponding to a category node represents the probability, after the random walk with restart, that the entity represented by node i belongs to that category; by sorting these probabilities, the category label corresponding to the entity represented by node i is obtained (i.e., the category with the maximum probability is selected).
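Steps 1042 through 1044 can be sketched end to end as follows. The graph representation (a symmetric weighted adjacency dict), the convergence tolerance, and the node names are my assumptions; the update is the one given above, r^(t) = (1 − μ)·A^T·r^(t−1) + μ·v.

```python
def restart_random_walk(adj, start, mu=0.15, tol=1e-10, max_iter=500):
    """Random walk with restart from `start` until the distribution
    stabilizes; returns {node: steady-state probability}."""
    # Step 1043: transition probabilities a_uv = w_uv / sum of weights at u
    probs = {u: {v: w / sum(nbrs.values()) for v, w in nbrs.items()}
             for u, nbrs in adj.items()}
    r = {v: float(v == start) for v in adj}          # step 1042: r^(0)
    for _ in range(max_iter):                        # step 1044
        nxt = {v: mu * (v == start) for v in adj}    # restart term mu * v
        for u, ru in r.items():                      # (1 - mu) * A^T * r term
            for v, a in probs[u].items():
                nxt[v] += (1 - mu) * ru * a
        if max(abs(nxt[v] - r[v]) for v in adj) < tol:
            return nxt                               # steady state reached
        r = nxt
    return r

# Tiny illustrative graph: text entity t, KB entities e1/e2, categories c1/c2.
adj = {"t": {"e1": 0.9, "e2": 0.1},
       "e1": {"t": 0.9, "c1": 1.0},
       "e2": {"t": 0.1, "c2": 1.0},
       "c1": {"e1": 1.0},
       "c2": {"e2": 1.0}}
r = restart_random_walk(adj, "t")
best = max(("c1", "c2"), key=lambda c: r[c])   # highest-probability category
```

Because t is far more compatible with e1 than with e2, the category c1 accumulates more steady-state probability than c2, so the walk labels t with c1; the total probability mass stays 1 throughout, since each transition row sums to 1 and the restart re-injects exactly μ.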
By using the category information of the knowledge base to jointly infer text-entity categories, each text entity is labeled with the knowledge base category to which it belongs; and because the inference of one entity's category in a text reinforces the inference of another's, the categories of all entities in the same text are inferred simultaneously.
According to an embodiment of the invention, the system for classifying the entity fine granularity facing the knowledge base update comprises an entity identification device, a dependency graph construction device and an iteration device.
Wherein the entity recognition device is adapted to recognize entities from text, e.g. a named entity recognition tool as described above. The dependency graph construction equipment is used for constructing the dependency graph by taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes. The iteration device is used for obtaining the classification of the identified entity by executing the restart random walk on the dependency graph.
In order to verify the effectiveness of the entity fine-grained classification method and system for knowledge base updating provided by the invention, the inventors conducted experiments on a real YAGO data set, using both an existing state-of-the-art entity classification technique (APOLLO) and the method provided by the invention, with the following experimental parameters:
the entities used in the experiment are composed of randomly selected data under 15 sub-directories classified by person in YAGO, wherein a maximum of 200 entities are randomly selected from each directory, and 2650 entities are selected as a final data set DSec in total. The ratio ρ =0.8 of the data used for training in DSec to the total data, the number of iterations t =10, the restart factor μ =0.15, and the window size k =50 are set.
The following results were obtained through experiments: the classification accuracy rate of the existing APOLLO technology is 0.7254, and the accuracy rate of the classification result obtained by the method and the system provided by the invention is 0.7708. Compared with the existing APOLLO technology, the method and the system for classifying the entity fine granularity, which are provided by the invention, have the advantage that the accuracy is improved by about 4.5%.
In conclusion, the invention provides an entity fine-grained classification method and system oriented to knowledge base update. The method models the semantic correlation between entities appearing in the same text with a dependency graph, uses this correlation as strong evidence for the fine-grained classification of those entities, and improves the accuracy of entity fine-grained classification through a joint inference method based on the restart random walk algorithm.
It should be understood that, although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is presented in this way merely for clarity; those skilled in the art may combine the embodiments as appropriate to form further embodiments.
The above description is only an exemplary embodiment of the present invention, and is not intended to limit the scope of the present invention. Any equivalent changes, modifications and combinations that may be made by those skilled in the art without departing from the spirit and principles of the invention shall fall within the scope of the invention.

Claims (9)

1. A knowledge base update-oriented entity fine-grained classification method comprises the following steps:
step 1), identifying an entity from a text;
step 2), constructing a dependency graph by taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes, wherein the weight value of an edge in the dependency graph represents the correlation degree between two nodes connected by the edge;
the step 2) further comprises the following steps:
step 21), obtaining related entities of the identified entities in the knowledge base according to the semantic compatibility, and obtaining the classification of the related entities in the knowledge base; the semantic compatibility degree represents the similarity between the context information of the identified entity and the description text of the related entity;
step 22), the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base are used as nodes;
step 23), adding edges between the nodes representing the identified entities and the nodes representing the related entities, wherein the weight of the edges is the semantic compatibility between the identified entities and the related entities;
adding an edge between the node representing the related entity and the node representing the classification, wherein the weight of the edge indicates whether the related entity belongs to the classification;
adding edges between nodes representing related entities, wherein the weight of the edges is the semantic relevance between the related entities;
adding edges among the nodes representing the classifications, wherein the weight of the edges is the correlation between the classifications;
and 3) obtaining the classification of the identified entity by executing the restart random walk on the dependency graph.
2. The method of claim 1, wherein the semantic compatibility is calculated according to the following formula:
SC(em, e) = (V(X) · V(T)) / (|V(X)| · |V(T)|)
wherein SC(em, e) represents the semantic compatibility between the identified entity em and the related entity e in the knowledge base, X represents the context information of em, T represents the description text of e, V(·) represents the TF-IDF vector composed of all biterms contained in a text, |·| represents the modulus of a vector, and a biterm is a pair of co-occurring words in the text.
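The semantic compatibility of claim 2 can be sketched as a cosine similarity over biterm vectors. The sketch below assumes plain term-frequency weights in place of TF-IDF (computing IDF would require the full document collection), and the function names are hypothetical.

```python
# Minimal sketch of the semantic-compatibility measure: cosine similarity
# between biterm vectors of the mention context X and the entity description T.
# Plain term frequencies are an assumed stand-in for the TF-IDF weights.
from collections import Counter
from itertools import combinations
from math import sqrt

def biterms(text):
    """Unordered pairs of co-occurring words (biterms) in a text."""
    words = text.lower().split()
    return Counter(frozenset(p) for p in combinations(words, 2)
                   if len(set(p)) == 2)

def semantic_compatibility(context, description):
    """Cosine similarity of the two biterm vectors; 0.0 if either is empty."""
    vx, vt = biterms(context), biterms(description)
    dot = sum(vx[b] * vt[b] for b in vx)
    nx_ = sqrt(sum(v * v for v in vx.values()))
    nt = sqrt(sum(v * v for v in vt.values()))
    return dot / (nx_ * nt) if nx_ and nt else 0.0
```

For example, "a b c" and "a b d" share one biterm {a, b} out of three on each side, giving a compatibility of 1/3.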
3. The method of claim 2, wherein the context information of an identified entity is composed of the words appearing before and after the entity in the text.
4. The method according to claim 2 or 3, wherein step 21) comprises:
and taking the entity with the semantic compatibility larger than 0 with the identified entity in the knowledge base as the related entity.
5. The method of claim 1, wherein the semantic relatedness between related entities is calculated according to the following formula:
SR(e1, e2) = 1 - (log(max(|I1|, |I2|)) - log(|I1 ∩ I2|)) / (log(|Z|) - log(min(|I1|, |I2|)))
wherein SR(e1, e2) represents the semantic relatedness of the related entities e1 and e2 in the knowledge base, I1 and I2 respectively represent the sets of knowledge-base entity description texts in which e1 and e2 appear, Z represents the set of all entities contained in the knowledge base, and |·| represents the size of a set.
6. The method of claim 1, wherein the correlation between classifications is calculated according to the following formula:
CR(c1, c2) = |Ec1 ∩ Ec2| / |Ec1 ∪ Ec2|
wherein CR(c1, c2) represents the degree of correlation between the classifications c1 and c2, Ec1 and Ec2 respectively represent the sets of entities belonging to the classifications c1 and c2 in the knowledge base, and |·| represents the size of a set.
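The two set-based measures of claims 5 and 6 can be illustrated as follows. Because the patent's formula images are not reproduced here, the exact forms are assumptions: a normalized-overlap measure in the style of the Wikipedia link measure for SR, and Jaccard overlap for CR; the function names are hypothetical.

```python
# Illustrative (assumed) implementations of the two set-based measures.
from math import log

def semantic_relatedness(I1, I2, n_total):
    """SR(e1, e2): normalized overlap of the description-text sets mentioning
    e1 and e2, assuming n_total = |Z|, the number of entities in the base."""
    inter = len(I1 & I2)
    if inter == 0:
        return 0.0
    a, b = len(I1), len(I2)
    return 1 - (log(max(a, b)) - log(inter)) / (log(n_total) - log(min(a, b)))

def class_correlation(E1, E2):
    """CR(c1, c2): Jaccard overlap of the entity sets of two classifications."""
    union = len(E1 | E2)
    return len(E1 & E2) / union if union else 0.0
```

Both measures lie in [0, 1]: identical sets give 1, disjoint sets give 0, which matches their use as edge weights in the dependency graph.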
7. A method according to any one of claims 1-3, wherein step 3) comprises:
step 31), initializing the distribution state of each node in the dependency graph according to the following formula:
r_i^(0) = (r_i(1), r_i(2), ..., r_i(n))^T
wherein n represents the total number of nodes and r_i^(0) represents the initial distribution state of node i; r_i(k) = 1 if k = i, otherwise r_i(k) = 0, k being a natural number with 1 ≤ k ≤ n;
step 32), calculating the state transition probability matrix A = (a_ij):
a_ij = w_ij / Σ_k w_ik
wherein a_ij represents the probability of transferring from node i to node j during the restart random walk, i and j being natural numbers with 1 ≤ i, j ≤ n; w_ij is the weight of the edge between node i and node j; and Σ_k w_ik represents the sum of the weights of all edges connected to node i;
step 33), for each node, iteratively performing state transitions to its neighbor nodes until the distribution state of each node in the dependency graph no longer changes as the number of iterations increases; wherein the distribution state r_i^(t) of node i after the t-th iteration is expressed as follows:
r_i^(t) = (1 - μ) · A^T · r_i^(t-1) + μ · v_i
wherein r_i^(t) represents the distribution state of node i after the t-th iteration, t being a natural number and i being a natural number with 1 ≤ i ≤ n; r_i^(t-1) represents the distribution state of node i after the (t-1)-th iteration; μ represents the probability of returning to the starting node i after the t-th iteration, μ being a real number with 0 < μ < 1; and v_i represents the restart vector of node i, where v_i(k) = 1 if k = i, otherwise v_i(k) = 0, k being a natural number with 1 ≤ k ≤ n;
and step 34) obtaining corresponding classification according to the distribution state of the nodes.
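Steps 31) to 34) can be sketched as a small restart-random-walk routine. The update r^(t) = (1 - μ)·A^T·r^(t-1) + μ·v is the standard restart-random-walk iteration assumed here; the dictionary-based graph representation, the fixed iteration count, and the function names are illustrative.

```python
# Runnable sketch of steps 31)-34): restart random walk on the dependency
# graph with transition probabilities a_ij = w_ij / sum_k w_ik, followed by
# picking the classification node of maximum steady-state probability.

def restart_random_walk(graph, start, mu=0.15, iters=50):
    """graph: {node: {neighbor: weight}}; returns {node: probability}."""
    nodes = sorted(graph)
    # step 32): row-normalize edge weights into transition probabilities
    trans = {u: {v: w / sum(graph[u].values()) for v, w in graph[u].items()}
             for u in nodes}
    # step 31): all probability mass starts on the walk's start node
    r = {u: 1.0 if u == start else 0.0 for u in nodes}
    # step 33): iterate the RWR update (fixed count stands in for convergence)
    for _ in range(iters):
        nxt = {u: mu * (1.0 if u == start else 0.0) for u in nodes}
        for u in nodes:
            for v, p in trans[u].items():
                nxt[v] += (1 - mu) * r[u] * p
        r = nxt
    return r

def classify(graph, mention, class_nodes, mu=0.15):
    """Step 34): select the classification node of largest probability."""
    r = restart_random_walk(graph, mention, mu)
    return max(class_nodes, key=lambda c: r.get(c, 0.0))
```

Because each row of the transition matrix sums to 1, the total probability mass stays at 1 across iterations, so the component values can be compared directly as classification probabilities.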
8. The method of claim 7, wherein step 34) comprises:
in the distribution state of the nodes representing the identified entities, sorting the nodes representing the classification according to the values of the components corresponding to the nodes;
and obtaining the classification corresponding to the identified entity according to the sorting result.
9. A knowledge base update-oriented entity fine-grained classification system comprises:
an entity identification device for identifying an entity from the text;
the dependency graph constructing device is used for constructing a dependency graph by taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes, wherein the weight value of an edge in the dependency graph represents the correlation degree between two nodes connected by the edge;
wherein the dependency graph construction device further obtains the related entities of the identified entities in the knowledge base according to the semantic compatibility, and obtains the classifications of the related entities in the knowledge base; the semantic compatibility represents the similarity between the context information of an identified entity and the description text of a related entity;
taking the identified entities, the entities related to the entities in the knowledge base and the classification of the related entities in the knowledge base as nodes;
adding edges between the nodes representing the identified entities and the nodes representing the related entities, wherein the weight of the edges is the semantic compatibility between the identified entities and the related entities;
adding an edge between the node representing the related entity and the node representing the classification, wherein the weight of the edge indicates whether the related entity belongs to the classification;
adding edges between nodes representing related entities, wherein the weight of the edges is the semantic relevance between the related entities;
adding edges among the nodes representing the classifications, wherein the weight of the edges is the correlation between the classifications; and
and the iteration device, for obtaining the classification to which an identified entity belongs by performing the restart random walk on the dependency graph.
CN201510033050.4A 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update Active CN104615687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510033050.4A CN104615687B (en) 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update

Publications (2)

Publication Number Publication Date
CN104615687A CN104615687A (en) 2015-05-13
CN104615687B true CN104615687B (en) 2018-05-22

Family

ID=53150129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510033050.4A Active CN104615687B (en) 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update

Country Status (1)

Country Link
CN (1) CN104615687B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339401A (en) * 2015-07-16 2017-01-18 富士通株式会社 Method and equipment for confirming relationship between entities
CN107092605B (en) * 2016-02-18 2019-12-31 北大方正集团有限公司 Entity linking method and device
CN105677913B (en) * 2016-02-29 2019-04-26 哈尔滨工业大学 A kind of construction method of the Chinese semantic knowledge-base based on machine translation
CN105787105B (en) * 2016-03-21 2019-04-19 浙江大学 A kind of Chinese encyclopaedic knowledge map classification system construction method based on iterative model
CN108009184B (en) * 2016-10-27 2021-08-27 北大方正集团有限公司 Method and device for confusion detection of synonym instances of knowledge base
CN108170689A (en) * 2016-12-07 2018-06-15 富士通株式会社 The information processing unit and information processing method of semantization are carried out to entity
CN107545033B (en) * 2017-07-24 2020-12-01 清华大学 Knowledge base entity classification calculation method based on representation learning
CN107704892B (en) * 2017-11-07 2019-05-17 宁波爱信诺航天信息有限公司 A kind of commodity code classification method and system based on Bayesian model
CN108052625B (en) * 2017-12-18 2020-05-19 清华大学 Entity fine classification method
CN108460011B (en) * 2018-02-01 2022-03-25 北京百度网讯科技有限公司 Entity concept labeling method and system
CN108804599B (en) * 2018-05-29 2022-01-04 浙江大学 Rapid searching method for similar transaction modes
CN110019840B (en) * 2018-07-20 2021-06-15 腾讯科技(深圳)有限公司 Method, device and server for updating entities in knowledge graph
CN110427606A (en) * 2019-06-06 2019-11-08 福建奇点时空数字科技有限公司 A kind of professional entity similarity calculating method based on semantic model
CN110377744B (en) * 2019-07-26 2022-08-09 北京香侬慧语科技有限责任公司 Public opinion classification method and device, storage medium and electronic equipment
CN111428506B (en) * 2020-03-31 2023-02-21 联想(北京)有限公司 Entity classification method, entity classification device and electronic equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103034693A (en) * 2012-12-03 2013-04-10 哈尔滨工业大学 Open-type entity and type identification method thereof
US8538916B1 (en) * 2010-04-09 2013-09-17 Google Inc. Extracting instance attributes from text
CN103678316A (en) * 2012-08-31 2014-03-26 富士通株式会社 Entity relationship classifying device and entity relationship classifying method
CN104182463A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic-based text classification method

Also Published As

Publication number Publication date
CN104615687A (en) 2015-05-13

Similar Documents

Publication Publication Date Title
CN104615687B (en) A kind of entity fine grit classification method and system towards knowledge base update
CN110704743B (en) Semantic search method and device based on knowledge graph
CN109739994B (en) API knowledge graph construction method based on reference document
WO2019041521A1 (en) Apparatus and method for extracting user keyword, and computer-readable storage medium
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN104035975B (en) It is a kind of to realize the method that remote supervisory character relation is extracted using Chinese online resource
CN111382276B (en) Event development context graph generation method
Xu et al. Scribble-supervised semantic segmentation inference
CN108229578B (en) Image data target identification method based on three layers of data, information and knowledge map framework
CN108038106B (en) Fine-grained domain term self-learning method based on context semantics
Gao et al. CNL: collective network linkage across heterogeneous social platforms
CN115357904B (en) Multi-class vulnerability detection method based on program slicing and graph neural network
CN109492027B (en) Cross-community potential character relation analysis method based on weak credible data
CN111666350A (en) Method for extracting medical text relation based on BERT model
CN103679034A (en) Computer virus analyzing system based on body and virus feature extraction method
Zhou et al. Rank2vec: learning node embeddings with local structure and global ranking
Yu et al. Hgprompt: Bridging homogeneous and heterogeneous graphs for few-shot prompt learning
Elfida et al. Enhancing to method for extracting Social network by the relation existence
CN107133274B (en) Distributed information retrieval set selection method based on graph knowledge base
CN110019653B (en) Social content representation method and system fusing text and tag network
CN110765276A (en) Entity alignment method and device in knowledge graph
CN115982390A (en) Industrial chain construction and iterative expansion development method
Chen et al. Scaling up Markov logic probabilistic inference for social graphs
Acosta-Mendoza et al. A new algorithm for approximate pattern mining in multi-graph collections
CN112463974A (en) Method and device for establishing knowledge graph

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Cheng Xueqi

Inventor after: Wang Yuanzhuo

Inventor after: Lin Hailun

Inventor after: Jia Yantao

Inventor after: Jin Xiaolong

Inventor after: Xiong Jinhua

Inventor after: Li Manling

Inventor after: Chang Yuxiao

Inventor after: Xu Hongbo

Inventor before: Cheng Xueqi

Inventor before: Wang Yuanzhuo

Inventor before: Lin Hailun

Inventor before: Jia Yantao

Inventor before: Xiong Jinhua

Inventor before: Li Manling

Inventor before: Chang Yuxiao

Inventor before: Xu Hongbo

GR01 Patent grant