CN104615687A - Entity fine granularity classifying method and system for knowledge base updating - Google Patents


Info

Publication number
CN104615687A
Authority
CN
China
Prior art keywords: entity, node, knowledge base, classification, represent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510033050.4A
Other languages
Chinese (zh)
Other versions
CN104615687B (en)
Inventor
程学旗
王元卓
林海伦
贾岩涛
熊锦华
李曼玲
常雨骁
许洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201510033050.4A priority Critical patent/CN104615687B/en
Publication of CN104615687A publication Critical patent/CN104615687A/en
Application granted granted Critical
Publication of CN104615687B publication Critical patent/CN104615687B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 - Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an entity fine-grained classification method and system for knowledge base updating. The method comprises: recognizing an entity in a text; constructing a dependency graph whose nodes are the recognized entity, the entities in the knowledge base related to it, and the categories of those related entities in the knowledge base, where the weight of each edge in the dependency graph represents the degree of correlation between the two nodes it connects; and obtaining the category of the recognized entity by performing a random walk with restart on the dependency graph. The method and system overcome the difficulty, in the prior art, of classifying an entity at fine granularity when its context is insufficient, and improve the accuracy of fine-grained entity classification.

Description

Entity fine-grained classification method and system for knowledge base updating
Technical field
The present invention relates to the technical field of information processing, and in particular to an entity fine-grained classification method and system for knowledge base updating.
Background technology
A knowledge base is an interlinked collection of knowledge organized and managed under a certain knowledge representation scheme. In knowledge engineering, knowledge is typically described in terms of categories, entities, relations, and attributes, where categories are used to semantically group or annotate the knowledge items in the knowledge base. Knowledge bases play a vital role in many fields. In information retrieval, for example, a knowledge base can help a search engine understand user queries, perceive query intent, and perform query expansion and question answering; knowledge bases are also widely used in data analysis, public opinion monitoring, deep-web resource discovery, and other fields. Although numerous knowledge bases exist today, they still suffer from many limitations, fundamentally in the coverage and timeliness of their knowledge. With the arrival of the big-data era, data grows explosively and new knowledge is produced on the Web every day. To build a high-quality knowledge base, it is therefore essential to update newly produced knowledge into the existing knowledge base dynamically, in real time, and automatically, thereby ensuring the knowledge base's extensibility, coverage, and timeliness.
Entities are a key element of knowledge description, so a knowledge base must be able to expand its entity set automatically. To add a newly emerged entity to the knowledge base, its position in the knowledge base must first be determined, i.e., the category in the knowledge base to which the entity belongs. Once the category is determined, the new entity is added under that category, enriching the set of entities in the knowledge base. Current entity classification methods fall into two classes: coarse-grained entity classification and fine-grained entity classification.
Coarse-grained entity classification divides entities into coarse categories such as person names, place names, and organization names. It mainly trains entity classification models in a supervised manner and requires a large amount of manually labeled training data. This approach cannot be applied directly to knowledge-base-oriented entity classification, because a knowledge base divides entities into hundreds or thousands of categories, requiring even more training data, and creating training data at such a scale demands enormous manual effort.
Fine-grained entity classification divides entities into finer categories, mainly using heuristic rules or weakly supervised methods. Rule-based methods label entity categories directly through predefined syntactic patterns; they are simple to operate but require manually maintaining and defining a large number of rules. Weakly supervised methods extract the context of an entity and compute its category from lexical and syntactic features of that context; however, their accuracy is low, and they struggle to infer an entity's category when context is lacking.
In summary, existing coarse-grained entity classification methods are unsuitable for knowledge base updating, and existing fine-grained entity classification methods have low accuracy.
Summary of the invention
To solve the above problems, according to one embodiment of the present invention, an entity fine-grained classification method for knowledge base updating is provided, comprising:
Step 1) recognizing an entity in a text;
Step 2) constructing a dependency graph whose nodes are the recognized entity, the entities in the knowledge base related to it, and the categories of those related entities in the knowledge base, where the weight of each edge in the dependency graph represents the degree of correlation between the two nodes it connects;
Step 3) performing a random walk with restart on the dependency graph to obtain the category of the recognized entity.
In the above method, step 2) comprises:
Step 21) obtaining, according to semantic compatibility, the entities in the knowledge base related to the recognized entity, and obtaining the categories of these related entities in the knowledge base, where semantic compatibility represents the similarity between the contextual information of the recognized entity and the description text of a related entity;
Step 22) taking the recognized entity, the related entities in the knowledge base, and the categories of the related entities in the knowledge base as nodes;
Step 23) adding an edge between the node representing the recognized entity and each node representing a related entity, weighted by the semantic compatibility between them;
adding an edge between each node representing a related entity and each node representing a category, weighted by whether the related entity belongs to that category;
adding edges between nodes representing related entities, weighted by the semantic relatedness between them;
and adding edges between nodes representing categories, weighted by the correlation between them.
In the above method, semantic compatibility is computed according to the following formula:

    SC(em, e) = sim(X, T) = \frac{\vec{V}(X) \cdot \vec{V}(T)}{|\vec{V}(X)| \, |\vec{V}(T)|}

where SC(em, e) denotes the semantic compatibility between the recognized entity em and the related entity e in the knowledge base, X denotes the contextual information of em, T denotes the description text of e, \vec{V}(\cdot) denotes the TF-IDF vector over all Biterms contained in a text, |\cdot| denotes the norm of a vector, and a Biterm is a pair of words co-occurring in a text. The contextual information of the recognized entity consists of the words appearing before and after it in the text.
In the above method, step 21) comprises: taking as related entities those entities in the knowledge base whose semantic compatibility with the recognized entity is greater than 0.
In the above method, the semantic relatedness between related entities is computed according to the following formula:

    SR(e_1, e_2) = 1 - \frac{\log(\max(|I_1|, |I_2|)) - \log(|I_1 \cap I_2|)}{\log(|Z|) - \log(\min(|I_1|, |I_2|))}

where SR(e_1, e_2) denotes the semantic relatedness between related entities e_1 and e_2 in the knowledge base, I_1 and I_2 denote the sets of knowledge base entities whose description texts mention e_1 and e_2, respectively, Z denotes the set of all entities in the knowledge base, and |\cdot| denotes the size of a set.
In the above method, the correlation between categories is computed according to the following formula:

    CR(c_1, c_2) = \frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}

where CR(c_1, c_2) denotes the correlation between categories c_1 and c_2, E_{c_1} and E_{c_2} denote the sets of knowledge base entities belonging to categories c_1 and c_2, respectively, and |\cdot| denotes the size of a set.
In the above method, step 3) comprises:
Step 31) initializing the distribution of each node in the dependency graph according to the following formula:

    \vec{r}_i^{(0)} = (r_i^{(1)}, \ldots, r_i^{(k)}, \ldots, r_i^{(n)})

where n denotes the total number of nodes and \vec{r}_i^{(0)} denotes the initial distribution of node i; r_i^{(k)} = 1 if k = i and r_i^{(k)} = 0 otherwise, with k a natural number and 1 ≤ k ≤ n;
Step 32) computing the state transition probability matrix A = (a_{ij}):

    a_{ij} = \frac{w_{ij}}{\sum_{k=1}^{n} w_{ik}}

where a_{ij} denotes the probability of moving from node i to node j during the random walk with restart, i and j are natural numbers with 1 ≤ i, j ≤ n, w_{ij} is the weight of the edge between nodes i and j, and \sum_{k} w_{ik} is the sum of the weights of all edges incident to node i;
Step 33) for each node, iteratively transferring state to its neighbor nodes until the distribution of every node in the dependency graph no longer changes as the number of iterations increases, where the distribution of node i after the t-th iteration is:

    \vec{r}_i^{(t)} = (1 - \mu) A \vec{r}_i^{(t-1)} + \mu \vec{v}_i

where \vec{r}_i^{(t)} denotes the distribution of node i after the t-th iteration, t is a natural number, i is a natural number with 1 ≤ i ≤ n; \vec{r}_i^{(t-1)} denotes the distribution of node i after the (t-1)-th iteration; \mu denotes the probability of returning to the starting node i after the t-th iteration, called the restart factor, \mu being a real number with 0 < \mu < 1; and \vec{v}_i denotes the restart vector of node i, with v_i^{(k)} = 1 if k = i and v_i^{(k)} = 0 otherwise, k a natural number and 1 ≤ k ≤ n;
Step 34) obtaining, according to the distribution of a node, its corresponding category.
In the above method, step 34) comprises:
in the distribution of the node representing the recognized entity, sorting the category nodes by the values of their corresponding components;
and obtaining the category of the recognized entity according to the ranking.
According to one embodiment of the present invention, an entity fine-grained classification system for knowledge base updating is also provided, comprising:
an entity recognition device for recognizing an entity in a text;
a dependency graph construction device for constructing a dependency graph whose nodes are the recognized entity, the entities in the knowledge base related to it, and the categories of those related entities in the knowledge base, where the weight of each edge in the dependency graph represents the degree of correlation between the two nodes it connects; and an iteration device for performing a random walk with restart on the dependency graph to obtain the category of the recognized entity.
The present invention overcomes the prior art's difficulty in fine-grained classification of an entity whose context is lacking. By modeling the semantic correlations among the entities occurring in a text, together with the relations between text entities and knowledge base entities and their categories, these correlations provide strong evidence for fine-grained classification of the entities in the text, and the random walk with restart algorithm improves the accuracy of fine-grained entity classification.
Brief description of the drawings
Embodiments of the present invention are further illustrated with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of the entity fine-grained classification method for knowledge base updating according to an embodiment of the present invention;
Fig. 2 is a flowchart of the method for creating the dependency graph model according to an embodiment of the present invention;
Fig. 3 is an example of a dependency graph according to an embodiment of the present invention;
Fig. 4 is a flowchart of the method for jointly inferring entity categories according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.
According to one embodiment of the present invention, an entity fine-grained classification method for knowledge base updating is provided.
In general, the method comprises: recognizing entities in a text; constructing a dependency graph whose nodes are the recognized entities, the entities in the knowledge base related to them, and the categories of those related entities in the knowledge base, where the weight of each edge in the dependency graph represents the degree of correlation between the two nodes it connects; and performing a random walk with restart on the dependency graph to obtain the categories of the recognized entities. The method is based on the distributional hypothesis: the greater the semantic correlation between the contexts of two entities, the more likely the entities belong to the same category.
Each step of the method is now described with reference to Fig. 1.
Step 101: input the text document to be processed and the target knowledge base
Select the text document D to be processed and the target knowledge base KB, and initialize the system input.
As described above, a knowledge base (KB) consists of the elements describing knowledge: entities, categories, relations, and attributes. The target knowledge base KB can therefore be modeled in the following form:

    KB = <C, E, P, R>

where C denotes the set of categories contained in the target knowledge base; E and P denote the set of entities belonging to the categories and the set of their attributes, respectively; and R is the function defining the relations among categories, instances, and attributes. Each entity e in the set E can be represented in the following form:

    e = <name, aliases, T>

where name denotes the name of entity e, aliases denotes the set of alternative names of e, and T denotes the description text of e. The attribute set P_e of entity e and the set C_e of categories to which e belongs are obtained via the function R of the knowledge base KB, satisfying P_e ⊆ P and C_e ⊆ C.
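The KB = <C, E, P, R> and e = <name, aliases, T> tuples above can be sketched as simple data structures. This is a loose illustration only: representing the relation function R as a lookup table, and all field defaults and example values, are assumptions, not part of the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                                   # entity name
    aliases: set = field(default_factory=set)   # alternative names
    T: str = ""                                 # description text

@dataclass
class KnowledgeBase:
    C: set = field(default_factory=set)     # category set
    E: dict = field(default_factory=dict)   # name -> Entity
    P: dict = field(default_factory=dict)   # entity name -> attribute set
    R: dict = field(default_factory=dict)   # entity name -> categories it belongs to

kb = KnowledgeBase(C={"person", "athlete"})
kb.E["Jordan"] = Entity("Jordan", aliases={"Michael Jordan"},
                        T="American basketball player")
kb.R["Jordan"] = {"person", "athlete"}
```

In this sketch, C_e for an entity is simply the table entry kb.R[name], which by construction is a subset of kb.C.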
Various existing encyclopedia resources can be used to model a target knowledge base of the above form; for example, a knowledge base created from Wikipedia may serve as the target knowledge base input in this step.
Step 102: extract the entities contained in the text document
Using a named entity recognition tool, extract the set of all entities contained in the text document D, denoted:

    EM = {em_i | i an integer, 0 ≤ i ≤ |D|}

where |D| is the length of the text document. Each element em of the set is represented in the following form:

    em = <name, D, X>

where name denotes the name of em, D denotes the source text document of em, and X denotes the context describing em. In one embodiment, X is the window of words around em in the text document D, with window size k (k an integer and 0 < k ≤ |D|); the length of the context X is then 2k (X consists of the k words before em and the k words after em in D). Preferably, k = min(50, |D|).
Those skilled in the art will understand that various existing named entity recognition tools can be used to extract the entities in a text. In one embodiment, Stanford NER is used as the named entity recognition tool.
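The 2k-word context window described above can be sketched as follows. This is a minimal illustration; the helper name `context_window` and its signature are assumptions, not part of the patent (which fixes k = min(50, |D|)).

```python
def context_window(tokens, mention_index, k=50):
    """Context X of a mention: up to k tokens before and k tokens
    after the mention position in the tokenized document."""
    start = max(0, mention_index - k)
    before = tokens[start:mention_index]
    after = tokens[mention_index + 1:mention_index + 1 + k]
    return before + after

tokens = "the Hall of Fame honors retired players".split()
ctx = context_window(tokens, tokens.index("honors"), k=2)
# ctx == ['of', 'Fame', 'retired', 'players']
```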
Step 103: create the dependency graph
From the entity set EM extracted from the text document D and the target knowledge base KB, create a dependency graph that uniformly models the semantic correlations among the different entities in D, as well as the dependencies between the entities in D and the entities in KB and their categories.
With reference to Fig. 2, in one embodiment, creating the dependency graph comprises the following sub-steps:
Step 1031: input the entity set EM recognized from the text document D and the target knowledge base KB.
Step 1032: select candidate entities.
According to the semantic compatibility (SC) of entity description texts, select for each entity em ∈ EM the set of semantically compatible candidate entities in the knowledge base KB, denoted:

    ES_em = {e ∈ E | SC(em, e) > 0}

where SC(em, e) denotes the semantic compatibility between em and the knowledge base entity e. In one embodiment, this semantic compatibility is computed as a Biterm-based cosine similarity:

    SC(em, e) = sim(X, T) = \frac{\vec{V}(X) \cdot \vec{V}(T)}{|\vec{V}(X)| \, |\vec{V}(T)|}

where SC(em, e) is a real number with 0 ≤ SC(em, e) ≤ 1.0; X is the contextual information describing em; T is the description text of e; sim(X, T) is the similarity between X and T; \vec{V}(\cdot) is the TF-IDF vector over all Biterms contained in a text; |\cdot| is the norm of a vector; and a Biterm is a pair of words co-occurring in a text. For example, the text "apple application shop" segments into the three words "apple", "application", "shop", so the Biterm set of this text is {(apple, application), (apple, shop), (application, shop)}.
According to the above formula, if SC(em, e) > 0, then e is selected as a candidate entity of em, yielding the set ES_em of candidate entities semantically compatible with em.
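The Biterm-based cosine similarity above can be sketched as follows. This is a simplified illustration, not the patent's exact computation: it weights Biterms by raw counts rather than TF-IDF (which would require corpus-wide document frequencies), and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def biterms(words):
    """All unordered pairs of distinct words co-occurring in a text."""
    return [tuple(sorted(p)) for p in combinations(sorted(set(words)), 2)]

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_compatibility(context_words, description_words):
    """SC(em, e) as cosine similarity of Biterm vectors; raw counts
    stand in for the TF-IDF weights used in the patent."""
    return cosine(Counter(biterms(context_words)),
                  Counter(biterms(description_words)))

# the patent's example: "apple application shop" yields three Biterms
print(biterms(["apple", "application", "shop"]))
# [('apple', 'application'), ('apple', 'shop'), ('application', 'shop')]
```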
Step 1033: select candidate categories.
According to the relation-defining function R in the knowledge base KB, obtain the set of categories to which each candidate entity e selected in step 1032 belongs in KB, and use it as the candidate category set.
Step 1034: establish the nodes and edges of the dependency graph.
The node set of the dependency graph comprises the set of all entities extracted from the text document D (text entities for short), the set of candidate entities semantically compatible with the extracted entities (knowledge base entities for short), and the set of categories to which the candidate entities belong (knowledge base categories for short).
After the nodes are established, edges and weights are assigned between them as follows:
1. Between a node representing a text entity em and a node representing a semantically compatible knowledge base entity e, add an edge weighted by the semantic compatibility SC(em, e) between them.
2. Between a node representing a knowledge base entity e and a node representing a category c, add an edge weighted by the attachment relatedness (AR) between them: 1.0 if the entity belongs to the category, 0.0 otherwise.
3. Between two nodes e_1 and e_2 representing knowledge base entities, add an edge weighted by the semantic relatedness (SR) between them. Note that the semantic correlation between entities within the text is here measured indirectly through the semantic relatedness between knowledge base entities.
In one embodiment, the semantic relatedness SR(e_1, e_2) between entities e_1 and e_2 is computed based on the normalized Google distance:

    SR(e_1, e_2) = 1 - \frac{\log(\max(|I_1|, |I_2|)) - \log(|I_1 \cap I_2|)}{\log(|Z|) - \log(\min(|I_1|, |I_2|))}

where SR(e_1, e_2) is a real number with 0 ≤ SR(e_1, e_2) ≤ 1.0; I_1 and I_2 denote the sets of entities in KB whose description texts mention e_1 and e_2, respectively; Z denotes the set of all entities contained in KB; and |\cdot| denotes the size of a set.
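The normalized-Google-distance relatedness can be transcribed directly from the formula above. One caveat: the patent does not say how an empty intersection is handled, so the zero-overlap fallback below is an assumption.

```python
from math import log

def semantic_relatedness(i1, i2, z):
    """SR(e1, e2) via normalized Google distance.
    i1, i2: sets of KB entities whose description text mentions e1 / e2;
    z: total number of entities in the KB.
    Returns 0.0 on empty overlap (assumed fallback; log(0) is undefined)."""
    overlap = len(i1 & i2)
    if overlap == 0:
        return 0.0
    num = log(max(len(i1), len(i2))) - log(overlap)
    den = log(z) - log(min(len(i1), len(i2)))
    return 1.0 - num / den

sr = semantic_relatedness({"a", "b", "c"}, {"b", "c", "d"}, z=100)
```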
4. Between two nodes c_1 and c_2 representing knowledge base categories, add an edge weighted by the correlation (CR) between them. In one embodiment, the correlation CR(c_1, c_2) between categories c_1 and c_2 is computed with the Jaccard coefficient:

    CR(c_1, c_2) = \frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}

where CR(c_1, c_2) is a real number with 0 ≤ CR(c_1, c_2) ≤ 1.0; E_{c_1} and E_{c_2} denote the sets of entities in KB belonging to categories c_1 and c_2, respectively; and |\cdot| denotes the size of a set.
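The Jaccard coefficient for category correlation is straightforward to express (the function name and the empty-set fallback are assumptions):

```python
def category_correlation(ec1, ec2):
    """CR(c1, c2): Jaccard coefficient of two categories' entity sets."""
    union = ec1 | ec2
    return len(ec1 & ec2) / len(union) if union else 0.0

# two sub-categories sharing two of four distinct entities
print(category_correlation({"jordan", "pippen", "bird"},
                           {"jordan", "pippen", "magic"}))  # 0.5
```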
By establishing the nodes and edges, the dependency graph over all entities EM in the text document D is constructed, denoted G = (V, E, W). G is an undirected graph, where V is the vertex set of the graph, comprising all entities in the given text, all knowledge base entities semantically compatible with them, and the set of categories these entities belong to; E is the set of edges between these nodes; and W: E → R (R the real numbers) gives the edge weights.
Consider the following text: "For an athlete, the Hall of Fame is a great monument, an affirmation of the athlete's career, and the best recognition apart from a championship ring. But because an athlete must wait until five years after retirement to enter the Hall of Fame, the 'flying man' had to wait until 2009 for this special honor. This, however, does not prevent Jordan's name from shining in the history of the NBA and of world basketball." A named entity recognition tool identifies 3 distinct entities: "Hall of Fame", "Jordan", "NBA". Using the method provided by the invention, a dependency graph model is created for these 3 entities. As shown in Fig. 3, the graph contains 12 nodes in total (3 text entities, 6 knowledge base entities, and 3 knowledge base categories) and 12 edges.
Step 104: jointly infer the entities' category information from the created dependency graph
On the dependency graph created in the previous step, execute a random walk algorithm such as random walk with restart. The random walk is repeated iteratively on the dependency graph until the distributions of the nodes in the graph no longer change as the number of iterations increases, i.e., until a steady state is reached. Then, from the distribution of each node representing a text entity, its corresponding category label is obtained, thereby inferring the fine-grained category information of the text entity.
This step is now described in detail with reference to Fig. 4, in conjunction with one embodiment of the present invention:
Step 1041: initialize the algorithm input.
Input the created dependency graph G = (V, E, W).
Step 1042: initialize the distributions of the nodes in the dependency graph.
Let the number of nodes in graph G be n = |V| and the number of edges be m = |E|, with the nodes of G numbered 1, ..., i, ..., n (i a natural number and 1 ≤ i ≤ n).
Set the initial distribution \vec{r}_i^{(0)} of node i in the dependency graph. This distribution is an n × 1 column vector over all nodes contained in graph G, where n is the number of nodes in G, denoted:

    \vec{r}_i^{(0)} = (r_i^{(1)}, \ldots, r_i^{(k)}, \ldots, r_i^{(n)})

where each component r_i^{(k)} of the vector takes the value r_i^{(k)} = 1 if k = i and r_i^{(k)} = 0 otherwise, with k a natural number and 1 ≤ k ≤ n.
Step 1043: from the adjacency matrix U = (u_{ij}) of the dependency graph G = (V, E, W), compute the state transition probability matrix A = (a_{ij}) of the random walk process, with i, j natural numbers and 1 ≤ i, j ≤ n. For the adjacency matrix U, u_{ij} = w_{ij} if there is an edge between nodes i and j, and u_{ij} = 0 otherwise, where w_{ij} is the weight of the edge between nodes i and j, determined by W: E → R (R the real numbers) in G = (V, E, W).
For the state transition probability matrix A, a_{ij} denotes the probability of moving from node i to node j during the random walk with restart. Let the adjacency vector of node i over all other nodes in G = (V, E, W) be the vector formed by the i-th row of the adjacency matrix U. From the adjacency vector of node i, a_{ij} is computed as:

    a_{ij} = \frac{u_{ij}}{\sum_{k=1}^{n} u_{ik}}

That is, if i = j or there is no edge between nodes i and j, then a_{ij} = 0; if there is an edge between nodes i and j, then a_{ij} is proportional to the weight of that edge, namely the ratio of the weight of the edge between nodes i and j to the sum of the weights of all edges incident to node i.
Step 1044: on the dependency graph, starting from node i, iteratively transfer state to its neighbor nodes. After the t-th iteration, the distribution of node i in the graph is expressed as:

    \vec{r}_i^{(t)} = (1 - \mu) A \vec{r}_i^{(t-1)} + \mu \vec{v}_i

where t is a natural number; \vec{r}_i^{(t-1)} denotes the distribution of node i after the (t-1)-th iteration; \vec{r}_i^{(t)} denotes the distribution of node i after the t-th iteration; \mu denotes the probability of returning to the starting node i after the t-th iteration (called the restart factor; \mu is a real number with 0 < \mu < 1, preferably 0.15); and \vec{v}_i is the restart vector of node i, an n × 1 column vector over all nodes contained in graph G, whose components take the value v_i^{(k)} = 1 if k = i and v_i^{(k)} = 0 otherwise, with k a natural number and 1 ≤ k ≤ n.
Step 1044 is repeated until the distribution \vec{r}_i of every node i (i a natural number, 1 ≤ i ≤ n) in the dependency graph reaches a steady state, at which point the algorithm terminates. That is, the distribution of each node i no longer changes as the number of iterations t increases. From the distribution of each node representing a text entity, its corresponding category label is then obtained, inferring the specific category of the text entity.
Specifically, as discussed above, \vec{r}_i is an n × 1 column vector over all nodes contained in graph G. The steady-state distribution \vec{r}_i of node i is likewise an n × 1 column vector over all nodes in G, so the category nodes of G are also covered by this vector. In \vec{r}_i, the value of the component corresponding to a category node serves as the probability that the entity represented by node i belongs to that category under the random walk with restart, and the category label of the entity represented by node i is obtained by sorting these probabilities (i.e., selecting the category with the highest probability).
Jointly inferring the categories of text entities means labeling the text entities with their knowledge base categories using the category information of the knowledge base; through the mutually reinforcing effect whereby the category inference for one entity in a text aids the category inference for another, the categories of all entities in the text are inferred simultaneously.
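The iterative procedure of steps 1042 through 1044 can be sketched as a plain random walk with restart. This is a simplified illustration under stated assumptions: dense list-of-lists matrices, assumed function and parameter names, and a toy graph in place of the patent's SC/AR/SR/CR-weighted dependency graph.

```python
def random_walk_with_restart(W, i, mu=0.15, tol=1e-10, max_iter=1000):
    """Iterate r = (1 - mu) * A r + mu * v_i to a steady state.
    W: symmetric edge-weight matrix (list of lists); i: start node;
    mu: restart factor (the patent prefers 0.15)."""
    n = len(W)
    deg = [sum(row) for row in W]
    # A[j][k]: probability of arriving at j when leaving k, i.e. the
    # edge weight W[k][j] normalized by the total weight at node k
    A = [[W[k][j] / deg[k] if deg[k] else 0.0 for k in range(n)]
         for j in range(n)]
    v = [1.0 if k == i else 0.0 for k in range(n)]  # restart vector v_i
    r = v[:]                                        # initial distribution r_i^(0)
    for _ in range(max_iter):
        nxt = [(1 - mu) * sum(A[j][k] * r[k] for k in range(n)) + mu * v[j]
               for j in range(n)]
        if max(abs(nxt[j] - r[j]) for j in range(n)) < tol:
            return nxt
        r = nxt
    return r

# chain graph 0 - 1 - 2, walk restarted at node 0
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
r = random_walk_with_restart(W, 0)
```

In the patent's setting, the components of the steady-state r at the category nodes would then be sorted to pick the most probable category for the text entity at the start node.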
According to one embodiment of the present invention, an entity fine-grained classification system for knowledge base updating is also provided, comprising an entity recognition device, a dependency graph construction device, and an iteration device.
The entity recognition device recognizes entities in a text, for example using a named entity recognition tool as described above. The dependency graph construction device constructs a dependency graph whose nodes are the recognized entities, the entities in the knowledge base related to them, and the categories of those related entities in the knowledge base. The iteration device performs a random walk with restart on the dependency graph to obtain the categories of the recognized entities.
For verifying the validity of the entity fine grit classification method and system towards knowledge base update provided by the invention, inventor adopts existing up-to-date entity classification technology (APOLLO) and method provided by the invention respectively, true YAGO data set is tested, and experiment parameter is as follows:
The test entities were formed by randomly selecting data from 15 subdirectories of the person category in YAGO, with at most 200 entities randomly selected from each directory, for a total of 2650 entities as the final data set DSec. On DSec, the proportion of data used for training was set to ρ = 0.8, the number of iterations to t = 10, the restart factor to μ = 0.15, and the window size to k = 50.
The experiments yielded the following results: the classification accuracy of the existing APOLLO technique was 0.7254, while the classification accuracy of the method and system provided by the invention was 0.7708. Compared with the existing APOLLO technique, the entity fine-grained classification method and system provided by the invention thus improve accuracy by about 4.5 percentage points.
In summary, the invention provides an entity fine-grained classification method and system for knowledge base updating. Based on a dependency graph, the method models the semantic dependencies between the entities occurring in a text, uses these dependencies as strong evidence for the fine-grained classification of the entities in the text, and, through a joint inference method based on the restarted random walk algorithm, improves the accuracy of entity fine-grained classification.
It should be understood that although this specification is described in terms of individual embodiments, not every embodiment contains only one independent technical solution; this manner of presentation is adopted for clarity only. Those skilled in the art should take the specification as a whole; the technical solutions in the individual embodiments may also be combined appropriately to form further embodiments understandable to those skilled in the art.
The foregoing is merely illustrative of embodiments of the present invention and is not intended to limit its scope. Any equivalent variations, modifications, and combinations made by those skilled in the art without departing from the concept and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. An entity fine-grained classification method for knowledge base updating, comprising:
step 1), identifying entities in a text;
step 2), building a dependency graph whose nodes are the identified entities, their related entities in the knowledge base, and the categories of those related entities in the knowledge base, wherein the weight of an edge in the dependency graph represents the degree of correlation between the two nodes connected by that edge;
step 3), obtaining the category of each identified entity by performing a restarted random walk on the dependency graph.
2. The method according to claim 1, wherein step 2) comprises:
step 21), obtaining, according to semantic compatibility, the related entities in the knowledge base of each identified entity, and obtaining the categories of those related entities in the knowledge base, wherein the semantic compatibility represents the similarity between the context of the identified entity and the description text of the related entity;
step 22), taking the identified entities, their related entities in the knowledge base, and the categories of the related entities in the knowledge base as nodes;
step 23), adding an edge between the node representing an identified entity and the node representing a related entity, the weight of the edge being the semantic compatibility between that identified entity and that related entity;
adding an edge between the node representing a related entity and the node representing a category, the weight of the edge indicating whether the related entity belongs to the category;
adding edges between the nodes representing related entities, the weight of each edge being the semantic relatedness between the corresponding related entities;
adding edges between the nodes representing categories, the weight of each edge being the degree of correlation between the corresponding categories.
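A minimal sketch of the graph construction in steps 21)-23), with the semantic compatibility, semantic relatedness, and category correlation measures passed in as weight functions; the function names and toy data are illustrative assumptions, and the 0/1 membership weight is realized as 1.0 for edges that are present:

```python
def build_dependency_graph(mentions, related, memberships, sc, sr, cr):
    """Assemble the weighted dependency graph as a symmetric edge dict.
    mentions: entities identified in the text; related: {mention: KB entities};
    memberships: {KB entity: its KB categories}; sc/sr/cr: weight functions."""
    edges = {}
    def add_edge(u, v, w):
        if w > 0:  # only positively correlated pairs are connected
            edges[(u, v)] = edges[(v, u)] = w
    for m in mentions:
        for e in related.get(m, []):
            add_edge(("mention", m), ("entity", e), sc(m, e))   # mention-entity edge
            for c in memberships.get(e, []):
                add_edge(("entity", e), ("category", c), 1.0)   # membership edge
    ents = sorted({e for es in related.values() for e in es})
    for a in range(len(ents)):
        for b in range(a + 1, len(ents)):
            add_edge(("entity", ents[a]), ("entity", ents[b]), sr(ents[a], ents[b]))
    cats = sorted({c for cs in memberships.values() for c in cs})
    for a in range(len(cats)):
        for b in range(a + 1, len(cats)):
            add_edge(("category", cats[a]), ("category", cats[b]), cr(cats[a], cats[b]))
    return edges

# Toy example: one mention with two candidate KB entities.
edges = build_dependency_graph(
    ["Jordan"],
    {"Jordan": ["Jordan (country)", "Michael Jordan"]},
    {"Michael Jordan": ["athlete"], "Jordan (country)": ["state"]},
    sc=lambda m, e: 0.8 if e == "Michael Jordan" else 0.4,
    sr=lambda a, b: 0.5,
    cr=lambda a, b: 0.3,
)
```

The edge dict stores each undirected edge under both orientations, matching the symmetric weights assumed by the random walk of claim 8.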
3. The method according to claim 2, wherein the semantic compatibility is computed according to the following formula:
SC(em, e) = sim(X, T) = \frac{\vec{V}(X) \cdot \vec{V}(T)}{|\vec{V}(X)| \cdot |\vec{V}(T)|}
wherein SC(em, e) represents the semantic compatibility between the identified entity em and the related entity e in the knowledge base, X represents the context of em, T represents the description text of e, \vec{V}(\cdot) represents the TF-IDF vector of all Biterms contained in a text, |\cdot| represents the modulus of a vector, and a Biterm is a pair of words co-occurring in a text.
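A minimal sketch of this measure, assuming a small background corpus for the IDF statistics and a smoothed IDF variant (the smoothing is an implementation choice to avoid division by zero, not specified by the claim):

```python
import math
from collections import Counter
from itertools import combinations

def biterms(text):
    """All unordered pairs of co-occurring words (Biterms) in a text."""
    words = text.lower().split()
    return [tuple(sorted(p)) for p in combinations(words, 2)]

def semantic_compatibility(context, description, corpus):
    """Cosine similarity of the TF-IDF Biterm vectors of the mention
    context X and the candidate entity's description text T."""
    docs = [set(biterms(d)) for d in corpus]
    n = len(docs)
    def tfidf(text):
        # Smoothed IDF so biterms unseen in the corpus do not divide by zero.
        return {b: tf * (math.log((1 + n) / (1 + sum(b in d for d in docs))) + 1)
                for b, tf in Counter(biterms(text)).items()}
    vx, vt = tfidf(context), tfidf(description)
    dot = sum(w * vt.get(b, 0.0) for b, w in vx.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    nx, nt = norm(vx), norm(vt)
    return dot / (nx * nt) if nx and nt else 0.0
```

Identical texts score 1, texts with no shared word pair score 0, matching the cosine in the claim.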
4. The method according to claim 3, wherein the context of an identified entity consists of the words appearing before and after it in said text.
5. The method according to claim 3 or 4, wherein step 21) comprises:
taking, as related entities, those entities in the knowledge base whose semantic compatibility with the identified entity is greater than 0.
6. The method according to claim 2, wherein the semantic relatedness between related entities is computed according to the following formula:
SR(e_1, e_2) = 1 - \frac{\log(\max(|I_1|, |I_2|)) - \log(|I_1 \cap I_2|)}{\log(|Z|) - \log(\min(|I_1|, |I_2|))}
wherein SR(e_1, e_2) represents the semantic relatedness between the related entities e_1 and e_2 in the knowledge base, I_1 and I_2 represent, respectively, the sets of knowledge base entities whose description texts mention e_1 and e_2, Z represents the set of all entities contained in the knowledge base, and |\cdot| represents the size of a set.
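A direct transcription of the formula; the zero-overlap short-circuit is an added guard (log of zero is undefined), not part of the claim:

```python
import math

def semantic_relatedness(I1, I2, Z):
    """SR(e1, e2): normalized overlap of the sets of KB entities whose
    description texts mention e1 and e2; Z is the full KB entity set."""
    inter = len(I1 & I2)
    if inter == 0:
        return 0.0  # no shared in-linking entities
    num = math.log(max(len(I1), len(I2))) - math.log(inter)
    den = math.log(len(Z)) - math.log(min(len(I1), len(I2)))
    return 1.0 - num / den
```

Entities mentioned by exactly the same set of description texts score 1, and relatedness falls toward 0 as the overlap shrinks relative to the knowledge base size.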
7. The method according to claim 2, wherein the degree of correlation between categories is computed according to the following formula:
CR(c_1, c_2) = \frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}
wherein CR(c_1, c_2) represents the degree of correlation between categories c_1 and c_2, E_{c_1} and E_{c_2} represent, respectively, the sets of knowledge base entities belonging to categories c_1 and c_2, and |\cdot| represents the size of a set.
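This is the Jaccard coefficient of the two categories' entity sets; a one-line sketch (the empty-union guard is an added assumption):

```python
def category_relatedness(E_c1, E_c2):
    """CR(c1, c2): Jaccard overlap of the entity sets that belong to
    categories c1 and c2 in the knowledge base."""
    union = E_c1 | E_c2
    return len(E_c1 & E_c2) / len(union) if union else 0.0
```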
8. The method according to any one of claims 1-4, wherein step 3) comprises:
step 31), initializing the distribution of each node in the dependency graph according to the following formula:
\vec{r}_i^{(0)} = (r_i^{(1)}, \ldots, r_i^{(k)}, \ldots, r_i^{(n)})
wherein n represents the total number of nodes and \vec{r}_i^{(0)} represents the initial distribution of node i; if k = i then r_i^{(k)} = 1, otherwise r_i^{(k)} = 0, k being a natural number with 1 ≤ k ≤ n;
step 32), computing the state transition probability matrix A = (a_{ij}), where a_{ij} = w_{ij} / \sum_k w_{ik};
wherein a_{ij} represents the probability of transferring from node i to node j during the restarted random walk, i and j being natural numbers with 1 ≤ i, j ≤ n, w_{ij} is the weight of the edge between node i and node j, and \sum_k w_{ik} is the sum of the weights of all edges incident to node i;
step 33), for each node, iteratively transferring state to its neighbor nodes until the distribution of every node in the dependency graph no longer changes as the number of iterations increases; wherein the distribution of node i after the t-th iteration is expressed as follows:
\vec{r}_i^{(t)} = (1 - \mu) A \vec{r}_i^{(t-1)} + \mu \vec{v}_i
wherein \vec{r}_i^{(t)} represents the distribution of node i after the t-th iteration, t being a natural number and i being a natural number with 1 ≤ i ≤ n; \vec{r}_i^{(t-1)} represents the distribution of node i after the (t-1)-th iteration; \mu represents the probability of returning to the start node i after the t-th iteration, \mu being a real number with 0 < \mu < 1; \vec{v}_i represents the restart vector of node i, in which, if k = i, then v_i^{(k)} = 1, otherwise v_i^{(k)} = 0, k being a natural number with 1 ≤ k ≤ n;
step 34), obtaining the corresponding category according to the distribution of the node.
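Steps 31)-33) can be sketched as follows. The column normalization of the weight matrix (so that each update conserves probability mass) and the convergence tolerance are implementation assumptions not fixed by the claim:

```python
import numpy as np

def restart_random_walk(W, i, mu=0.15, tol=1e-10, max_iter=1000):
    """Iterate r^(t) = (1 - mu) * A @ r^(t-1) + mu * v_i until the
    distribution stops changing. W is the symmetric weight matrix of the
    dependency graph; v_i restarts the walk at node i."""
    n = W.shape[0]
    A = W / W.sum(axis=0, keepdims=True)    # step 32: normalized transition matrix
    v = np.zeros(n)
    v[i] = 1.0                              # restart vector, also the initial state (step 31)
    r = v.copy()
    for _ in range(max_iter):
        r_next = (1 - mu) * A @ r + mu * v  # step 33: one state-transfer iteration
        if np.abs(r_next - r).max() < tol:  # distribution no longer changes
            return r_next
        r = r_next
    return r

# Toy triangle graph with unit edge weights, walk restarted at node 0.
r = restart_random_walk(np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]]), i=0)
```

With mu = 0.15 this matches the restart factor used in the experiments described above; the restart term keeps the steady-state distribution biased toward the neighborhood of the start node.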
9. The method according to claim 8, wherein step 34) comprises:
in the distribution of a node representing an identified entity, ranking the category nodes by the values of the components corresponding to them;
obtaining the category of the identified entity according to the ranking result.
10. An entity fine-grained classification system for knowledge base updating, comprising:
an entity recognition device for identifying entities in a text;
a dependency graph construction device for building a dependency graph whose nodes are the identified entities, their related entities in the knowledge base, and the categories of those related entities in the knowledge base, wherein the weight of an edge in the dependency graph represents the degree of correlation between the two nodes connected by that edge; and
an iteration device for obtaining the category of each identified entity by performing a restarted random walk on the dependency graph.
CN201510033050.4A 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update Active CN104615687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510033050.4A CN104615687B (en) 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update

Publications (2)

Publication Number Publication Date
CN104615687A true CN104615687A (en) 2015-05-13
CN104615687B CN104615687B (en) 2018-05-22

Family

ID=53150129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510033050.4A Active CN104615687B (en) 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update

Country Status (1)

Country Link
CN (1) CN104615687B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034693A (en) * 2012-12-03 2013-04-10 哈尔滨工业大学 Open-type entity and type identification method thereof
US8538916B1 (en) * 2010-04-09 2013-09-17 Google Inc. Extracting instance attributes from text
CN103678316A (en) * 2012-08-31 2014-03-26 富士通株式会社 Entity relationship classifying device and entity relationship classifying method
CN104182463A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic-based text classification method

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339401A (en) * 2015-07-16 2017-01-18 富士通株式会社 Method and equipment for confirming relationship between entities
CN107092605A (en) * 2016-02-18 2017-08-25 北大方正集团有限公司 A kind of entity link method and device
CN105677913A (en) * 2016-02-29 2016-06-15 哈尔滨工业大学 Machine translation-based construction method for Chinese semantic knowledge base
CN105677913B (en) * 2016-02-29 2019-04-26 哈尔滨工业大学 A kind of construction method of the Chinese semantic knowledge-base based on machine translation
CN105787105A (en) * 2016-03-21 2016-07-20 浙江大学 Iterative-model-based establishment method of Chinese encyclopedic knowledge graph classification system
CN105787105B (en) * 2016-03-21 2019-04-19 浙江大学 A kind of Chinese encyclopaedic knowledge map classification system construction method based on iterative model
CN108009184A (en) * 2016-10-27 2018-05-08 北大方正集团有限公司 Knowledge base example of the same name obscures the method and device of detection
CN108009184B (en) * 2016-10-27 2021-08-27 北大方正集团有限公司 Method and device for confusion detection of synonym instances of knowledge base
CN108170689A (en) * 2016-12-07 2018-06-15 富士通株式会社 The information processing unit and information processing method of semantization are carried out to entity
CN107545033A (en) * 2017-07-24 2018-01-05 清华大学 A kind of computational methods based on the knowledge base entity classification for representing study
CN107545033B (en) * 2017-07-24 2020-12-01 清华大学 Knowledge base entity classification calculation method based on representation learning
CN107704892A (en) * 2017-11-07 2018-02-16 宁波爱信诺航天信息有限公司 A kind of commodity code sorting technique and system based on Bayesian model
CN107704892B (en) * 2017-11-07 2019-05-17 宁波爱信诺航天信息有限公司 A kind of commodity code classification method and system based on Bayesian model
CN108052625A (en) * 2017-12-18 2018-05-18 清华大学 A kind of entity sophisticated category method
CN108052625B (en) * 2017-12-18 2020-05-19 清华大学 Entity fine classification method
CN108460011A (en) * 2018-02-01 2018-08-28 北京百度网讯科技有限公司 A kind of entitative concept mask method and system
CN108460011B (en) * 2018-02-01 2022-03-25 北京百度网讯科技有限公司 Entity concept labeling method and system
CN108804599A (en) * 2018-05-29 2018-11-13 浙江大学 A kind of fast searching method of similar subgraph
CN108804599B (en) * 2018-05-29 2022-01-04 浙江大学 Rapid searching method for similar transaction modes
CN110019840A (en) * 2018-07-20 2019-07-16 腾讯科技(深圳)有限公司 The method, apparatus and server that entity updates in a kind of knowledge mapping
CN110427606A (en) * 2019-06-06 2019-11-08 福建奇点时空数字科技有限公司 A kind of professional entity similarity calculating method based on semantic model
CN110377744A (en) * 2019-07-26 2019-10-25 北京香侬慧语科技有限责任公司 A kind of method, apparatus, storage medium and the electronic equipment of public sentiment classification
CN111428506A (en) * 2020-03-31 2020-07-17 联想(北京)有限公司 Entity classification method, entity classification device and electronic equipment
CN111428506B (en) * 2020-03-31 2023-02-21 联想(北京)有限公司 Entity classification method, entity classification device and electronic equipment

Also Published As

Publication number Publication date
CN104615687B (en) 2018-05-22

Similar Documents

Publication Publication Date Title
CN104615687A (en) Entity fine granularity classifying method and system for knowledge base updating
CN106777274B (en) A kind of Chinese tour field knowledge mapping construction method and system
Xu et al. Topic based context-aware travel recommendation method exploiting geotagged photos
Jiang et al. Author topic model-based collaborative filtering for personalized POI recommendations
CN105183869B (en) Building knowledge mapping database and its construction method
CN104008203B A kind of Users' Interests Mining method for incorporating body situation
CN103761254B (en) Method for matching and recommending service themes in various fields
CN109299090B (en) Foundation centrality calculating method, system, computer equipment and storage medium
CN102902821B (en) The image high-level semantics mark of much-talked-about topic Network Based, search method and device
CN103116657B (en) A kind of individuation search method of network teaching resource
CN103064924A (en) Travel destination situation recommendation method based on geotagged photo excavation
CN106156286A (en) Type extraction system and method towards technical literature knowledge entity
CN107391542A (en) A kind of open source software community expert recommendation method based on document knowledge collection of illustrative plates
CN104572797A (en) Individual service recommendation system and method based on topic model
CN103678431A (en) Recommendation method based on standard labels and item grades
CN104239513A (en) Semantic retrieval method oriented to field data
CN110457404A (en) Social media account-classification method based on complex heterogeneous network
CN107944898A (en) The automatic discovery of advertisement putting building information and sort method
CN106960044A (en) A kind of Time Perception personalization POI based on tensor resolution and Weighted H ITS recommends method
Bagci et al. Random walk based context-aware activity recommendation for location based social networks
CN105678590A (en) topN recommendation method for social network based on cloud model
CN105654144A (en) Social network body constructing method based on machine learning
CN106233288A (en) Again rated position refinement and multifarious Search Results
CN109284443A (en) A kind of tourism recommended method and system based on crawler technology
CN105786897B (en) For providing the context aware body constructing method for paying close attention to information of the user based on context aware

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Cheng Xueqi

Inventor after: Wang Yuanzhuo

Inventor after: Lin Hailun

Inventor after: Jia Yantao

Inventor after: Jin Xiaolong

Inventor after: Xiong Jinhua

Inventor after: Li Manling

Inventor after: Chang Yuxiao

Inventor after: Xu Hongbo

Inventor before: Cheng Xueqi

Inventor before: Wang Yuanzhuo

Inventor before: Lin Hailun

Inventor before: Jia Yantao

Inventor before: Xiong Jinhua

Inventor before: Li Manling

Inventor before: Chang Yuxiao

Inventor before: Xu Hongbo

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant