CN104615687A - Entity fine granularity classifying method and system for knowledge base updating - Google Patents


Info

Publication number
CN104615687A
Authority
CN
China
Prior art keywords: entity, node, knowledge base, classification, represent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510033050.4A
Other languages
Chinese (zh)
Other versions
CN104615687B (en)
Inventor
程学旗
王元卓
林海伦
贾岩涛
熊锦华
李曼玲
常雨骁
许洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201510033050.4A priority Critical patent/CN104615687B/en
Publication of CN104615687A publication Critical patent/CN104615687A/en
Application granted granted Critical
Publication of CN104615687B publication Critical patent/CN104615687B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 - Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an entity fine-grained classification method and system for knowledge base updating. The method comprises: recognizing an entity in a text; constructing a dependency graph whose nodes are the recognized entity, the entities in the knowledge base related to it, and the categories of those related entities in the knowledge base, where the weight of each edge in the dependency graph represents the degree of correlation between the two nodes it connects; and obtaining the category of the recognized entity by performing a random walk with restart on the dependency graph. The method and system overcome the difficulty, in the prior art, of classifying an entity at fine granularity when its context is insufficient, and improve the accuracy of fine-grained entity classification.

Description

Entity fine-grained classification method and system for knowledge base updating
Technical field
The present invention relates to the technical field of information processing, and in particular to an entity fine-grained classification method and system for knowledge base updating.
Background technology
A knowledge base is an interlinked collection of knowledge organized and managed under a certain knowledge representation scheme. In knowledge engineering, knowledge is typically described in terms of categories, entities, relations, and attributes, where categories are used to semantically group or annotate the knowledge items in the knowledge base. Knowledge bases play a vital role in many fields. In information retrieval, for example, a knowledge base can help a search engine understand user queries, perceive query intent, and perform query expansion and question answering; knowledge bases are also widely used in data analysis, public opinion monitoring, deep-web resource discovery, and other fields. Although numerous knowledge bases exist today, they still suffer from many limitations, fundamentally in the coverage and timeliness of their knowledge. With the arrival of the big-data era, data grows explosively and new knowledge is produced on the Web every day. To build a high-quality knowledge base, it is therefore essential to update newly produced knowledge into the existing knowledge base dynamically, in real time, and automatically, thereby ensuring the knowledge base's extensibility, coverage, and timeliness.
Entities are a key element of knowledge description, so a knowledge base must be able to expand its entity set automatically. To add a newly emerged entity to the knowledge base, its position in the knowledge base must first be determined, i.e., the category in the knowledge base to which the entity belongs. Once the category is determined, the new entity is added under that category, enriching the set of entities in the knowledge base. Current entity classification methods fall into two classes: coarse-grained entity classification and fine-grained entity classification.
Coarse-grained entity classification divides entities into coarse categories such as person names, place names, and organization names. It mainly trains entity classification models in a supervised manner and requires a large amount of manually labeled training data. This approach cannot be applied directly to knowledge-base-oriented entity classification, because a knowledge base divides entities into hundreds or thousands of categories, requiring even more training data, and creating training data at such a scale demands enormous manual effort.
Fine-grained entity classification divides entities into finer categories, mainly using heuristic rules or weakly supervised methods. Rule-based methods label entity categories directly through predefined syntactic patterns; they are simple to operate but require manually maintaining and defining a large number of rules. Weakly supervised methods extract the context of an entity and compute its category from lexical and syntactic features of that context; however, their accuracy is low, and they struggle to infer an entity's category when context is lacking.
In summary, existing coarse-grained entity classification methods are unsuitable for knowledge base updating, and existing fine-grained entity classification methods have low accuracy.
Summary of the invention
To solve the above problems, according to one embodiment of the present invention, an entity fine-grained classification method for knowledge base updating is provided, comprising:
Step 1) recognizing an entity in a text;
Step 2) constructing a dependency graph whose nodes are the recognized entity, the entities in the knowledge base related to it, and the categories of those related entities in the knowledge base, where the weight of each edge in the dependency graph represents the degree of correlation between the two nodes it connects;
Step 3) performing a random walk with restart on the dependency graph to obtain the category of the recognized entity.
In the above method, step 2) comprises:
Step 21) obtaining, according to semantic compatibility, the entities in the knowledge base related to the recognized entity, and obtaining the categories of these related entities in the knowledge base, where semantic compatibility represents the similarity between the contextual information of the recognized entity and the description text of a related entity;
Step 22) taking the recognized entity, the related entities in the knowledge base, and the categories of the related entities in the knowledge base as nodes;
Step 23) adding an edge between the node representing the recognized entity and each node representing a related entity, weighted by the semantic compatibility between them;
adding an edge between each node representing a related entity and each node representing a category, weighted by whether the related entity belongs to that category;
adding edges between nodes representing related entities, weighted by the semantic relatedness between them;
and adding edges between nodes representing categories, weighted by the correlation between them.
In the above method, semantic compatibility is computed according to the following formula:

    SC(em, e) = sim(X, T) = \frac{\vec{V}(X) \cdot \vec{V}(T)}{|\vec{V}(X)| \, |\vec{V}(T)|}

where SC(em, e) denotes the semantic compatibility between the recognized entity em and the related entity e in the knowledge base, X denotes the contextual information of em, T denotes the description text of e, \vec{V}(\cdot) denotes the TF-IDF vector over all Biterms contained in a text, |\cdot| denotes the norm of a vector, and a Biterm is a pair of words co-occurring in a text. The contextual information of the recognized entity consists of the words appearing before and after it in the text.
In the above method, step 21) comprises: taking as related entities those entities in the knowledge base whose semantic compatibility with the recognized entity is greater than 0.
In the above method, the semantic relatedness between related entities is computed according to the following formula:

    SR(e_1, e_2) = 1 - \frac{\log(\max(|I_1|, |I_2|)) - \log(|I_1 \cap I_2|)}{\log(|Z|) - \log(\min(|I_1|, |I_2|))}

where SR(e_1, e_2) denotes the semantic relatedness between related entities e_1 and e_2 in the knowledge base, I_1 and I_2 denote the sets of knowledge base entities whose description texts mention e_1 and e_2, respectively, Z denotes the set of all entities in the knowledge base, and |\cdot| denotes the size of a set.
In the above method, the correlation between categories is computed according to the following formula:

    CR(c_1, c_2) = \frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}

where CR(c_1, c_2) denotes the correlation between categories c_1 and c_2, E_{c_1} and E_{c_2} denote the sets of knowledge base entities belonging to categories c_1 and c_2, respectively, and |\cdot| denotes the size of a set.
In the above method, step 3) comprises:
Step 31) initializing the distribution of each node in the dependency graph according to the following formula:

    \vec{r}_i^{(0)} = (r_i^{(1)}, \ldots, r_i^{(k)}, \ldots, r_i^{(n)})

where n denotes the total number of nodes and \vec{r}_i^{(0)} denotes the initial distribution of node i; r_i^{(k)} = 1 if k = i and r_i^{(k)} = 0 otherwise, with k a natural number and 1 ≤ k ≤ n;
Step 32) computing the state transition probability matrix A = (a_{ij}):

    a_{ij} = \frac{w_{ij}}{\sum_{k=1}^{n} w_{ik}}

where a_{ij} denotes the probability of moving from node i to node j during the random walk with restart, i and j are natural numbers with 1 ≤ i, j ≤ n, w_{ij} is the weight of the edge between nodes i and j, and \sum_{k} w_{ik} is the sum of the weights of all edges incident to node i;
Step 33) for each node, iteratively transferring state to its neighbor nodes until the distribution of every node in the dependency graph no longer changes as the number of iterations increases, where the distribution of node i after the t-th iteration is:

    \vec{r}_i^{(t)} = (1 - \mu) A \vec{r}_i^{(t-1)} + \mu \vec{v}_i

where \vec{r}_i^{(t)} denotes the distribution of node i after the t-th iteration, t is a natural number, i is a natural number with 1 ≤ i ≤ n; \vec{r}_i^{(t-1)} denotes the distribution of node i after the (t-1)-th iteration; \mu denotes the probability of returning to the starting node i after the t-th iteration, called the restart factor, \mu being a real number with 0 < \mu < 1; and \vec{v}_i denotes the restart vector of node i, with v_i^{(k)} = 1 if k = i and v_i^{(k)} = 0 otherwise, k a natural number and 1 ≤ k ≤ n;
Step 34) obtaining, according to the distribution of a node, its corresponding category.
In the above method, step 34) comprises:
in the distribution of the node representing the recognized entity, sorting the category nodes by the values of their corresponding components;
and obtaining the category of the recognized entity according to the ranking.
According to one embodiment of the present invention, an entity fine-grained classification system for knowledge base updating is also provided, comprising:
an entity recognition device for recognizing an entity in a text;
a dependency graph construction device for constructing a dependency graph whose nodes are the recognized entity, the entities in the knowledge base related to it, and the categories of those related entities in the knowledge base, where the weight of each edge in the dependency graph represents the degree of correlation between the two nodes it connects; and an iteration device for performing a random walk with restart on the dependency graph to obtain the category of the recognized entity.
The present invention overcomes the prior art's difficulty in fine-grained classification of an entity whose context is lacking. By modeling the semantic correlations among the entities occurring in a text, together with the relations between text entities and knowledge base entities and their categories, these correlations provide strong evidence for fine-grained classification of the entities in the text, and the random walk with restart algorithm improves the accuracy of fine-grained entity classification.
Brief description of the drawings
Embodiments of the present invention are further illustrated with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of the entity fine-grained classification method for knowledge base updating according to an embodiment of the present invention;
Fig. 2 is a flowchart of the method for creating the dependency graph model according to an embodiment of the present invention;
Fig. 3 is an example of a dependency graph according to an embodiment of the present invention;
Fig. 4 is a flowchart of the method for jointly inferring entity categories according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.
According to one embodiment of the present invention, an entity fine-grained classification method for knowledge base updating is provided.
In general, the method comprises: recognizing entities in a text; constructing a dependency graph whose nodes are the recognized entities, the entities in the knowledge base related to them, and the categories of those related entities in the knowledge base, where the weight of each edge in the dependency graph represents the degree of correlation between the two nodes it connects; and performing a random walk with restart on the dependency graph to obtain the categories of the recognized entities. The method is based on the distributional hypothesis: the greater the semantic correlation between the contexts of two entities, the more likely the entities belong to the same category.
Each step of the method is now described with reference to Fig. 1.
Step 101: input the text document to be processed and the target knowledge base
Select the text document D to be processed and the target knowledge base KB, and initialize the system input.
As described above, a knowledge base (KB) consists of the elements describing knowledge: entities, categories, relations, and attributes. The target knowledge base KB can therefore be modeled in the following form:

    KB = <C, E, P, R>

where C denotes the set of categories contained in the target knowledge base; E and P denote the set of entities belonging to the categories and the set of their attributes, respectively; and R is the function defining the relations among categories, instances, and attributes. Each entity e in the set E can be represented in the following form:

    e = <name, aliases, T>

where name denotes the name of entity e, aliases denotes the set of alternative names of e, and T denotes the description text of e. The attribute set P_e of entity e and the set C_e of categories to which e belongs are obtained via the function R of the knowledge base KB, satisfying P_e ⊆ P and C_e ⊆ C.
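The KB = <C, E, P, R> and e = <name, aliases, T> tuples above can be sketched as simple data structures. This is a loose illustration only: representing the relation function R as a lookup table, and all field defaults and example values, are assumptions, not part of the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                                   # entity name
    aliases: set = field(default_factory=set)   # alternative names
    T: str = ""                                 # description text

@dataclass
class KnowledgeBase:
    C: set = field(default_factory=set)     # category set
    E: dict = field(default_factory=dict)   # name -> Entity
    P: dict = field(default_factory=dict)   # entity name -> attribute set
    R: dict = field(default_factory=dict)   # entity name -> categories it belongs to

kb = KnowledgeBase(C={"person", "athlete"})
kb.E["Jordan"] = Entity("Jordan", aliases={"Michael Jordan"},
                        T="American basketball player")
kb.R["Jordan"] = {"person", "athlete"}
```

In this sketch, C_e for an entity is simply the table entry kb.R[name], which by construction is a subset of kb.C.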
Various existing encyclopedia resources can be used to model a target knowledge base of the above form; for example, a knowledge base created from Wikipedia may serve as the target knowledge base input in this step.
Step 102: extract the entities contained in the text document
Using a named entity recognition tool, extract the set of all entities contained in the text document D, denoted:

    EM = {em_i | i an integer, 0 ≤ i ≤ |D|}

where |D| is the length of the text document. Each element em of the set is represented in the following form:

    em = <name, D, X>

where name denotes the name of em, D denotes the source text document of em, and X denotes the context describing em. In one embodiment, X is the window of words around em in the text document D, with window size k (k an integer and 0 < k ≤ |D|); the length of the context X is then 2k (X consists of the k words before em and the k words after em in D). Preferably, k = min(50, |D|).
Those skilled in the art will understand that various existing named entity recognition tools can be used to extract the entities in a text. In one embodiment, Stanford NER is used as the named entity recognition tool.
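The 2k-word context window described above can be sketched as follows. This is a minimal illustration; the helper name `context_window` and its signature are assumptions, not part of the patent (which fixes k = min(50, |D|)).

```python
def context_window(tokens, mention_index, k=50):
    """Context X of a mention: up to k tokens before and k tokens
    after the mention position in the tokenized document."""
    start = max(0, mention_index - k)
    before = tokens[start:mention_index]
    after = tokens[mention_index + 1:mention_index + 1 + k]
    return before + after

tokens = "the Hall of Fame honors retired players".split()
ctx = context_window(tokens, tokens.index("honors"), k=2)
# ctx == ['of', 'Fame', 'retired', 'players']
```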
Step 103: create the dependency graph
From the entity set EM extracted from the text document D and the target knowledge base KB, create a dependency graph that uniformly models the semantic correlations among the different entities in D, as well as the dependencies between the entities in D and the entities in KB and their categories.
With reference to Fig. 2, in one embodiment, creating the dependency graph comprises the following sub-steps:
Step 1031: input the entity set EM recognized from the text document D and the target knowledge base KB.
Step 1032: select candidate entities.
According to the semantic compatibility (SC) of entity description texts, select for each entity em ∈ EM the set of semantically compatible candidate entities in the knowledge base KB, denoted:

    ES_em = {e ∈ E | SC(em, e) > 0}

where SC(em, e) denotes the semantic compatibility between em and the knowledge base entity e. In one embodiment, this semantic compatibility is computed as a Biterm-based cosine similarity:

    SC(em, e) = sim(X, T) = \frac{\vec{V}(X) \cdot \vec{V}(T)}{|\vec{V}(X)| \, |\vec{V}(T)|}

where SC(em, e) is a real number with 0 ≤ SC(em, e) ≤ 1.0; X is the contextual information describing em; T is the description text of e; sim(X, T) is the similarity between X and T; \vec{V}(\cdot) is the TF-IDF vector over all Biterms contained in a text; |\cdot| is the norm of a vector; and a Biterm is a pair of words co-occurring in a text. For example, the text "apple application shop" segments into the three words "apple", "application", "shop", so the Biterm set of this text is {(apple, application), (apple, shop), (application, shop)}.
According to the above formula, if SC(em, e) > 0, then e is selected as a candidate entity of em, yielding the set ES_em of candidate entities semantically compatible with em.
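The Biterm-based cosine similarity above can be sketched as follows. This is a simplified illustration, not the patent's exact computation: it weights Biterms by raw counts rather than TF-IDF (which would require corpus-wide document frequencies), and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def biterms(words):
    """All unordered pairs of distinct words co-occurring in a text."""
    return [tuple(sorted(p)) for p in combinations(sorted(set(words)), 2)]

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_compatibility(context_words, description_words):
    """SC(em, e) as cosine similarity of Biterm vectors; raw counts
    stand in for the TF-IDF weights used in the patent."""
    return cosine(Counter(biterms(context_words)),
                  Counter(biterms(description_words)))

# the patent's example: "apple application shop" yields three Biterms
print(biterms(["apple", "application", "shop"]))
# [('apple', 'application'), ('apple', 'shop'), ('application', 'shop')]
```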
Step 1033: select candidate categories.
According to the relation-defining function R in the knowledge base KB, obtain the set of categories to which each candidate entity e selected in step 1032 belongs in KB, and use it as the candidate category set.
Step 1034: establish the nodes and edges of the dependency graph.
The node set of the dependency graph comprises the set of all entities extracted from the text document D (text entities for short), the set of candidate entities semantically compatible with the extracted entities (knowledge base entities for short), and the set of categories to which the candidate entities belong (knowledge base categories for short).
After the nodes are established, edges and weights are assigned between them as follows:
1. Between a node representing a text entity em and a node representing a semantically compatible knowledge base entity e, add an edge weighted by the semantic compatibility SC(em, e) between them.
2. Between a node representing a knowledge base entity e and a node representing a category c, add an edge weighted by the attachment relatedness (AR) between them: 1.0 if the entity belongs to the category, 0.0 otherwise.
3. Between two nodes e_1 and e_2 representing knowledge base entities, add an edge weighted by the semantic relatedness (SR) between them. Note that the semantic correlation between entities within the text is here measured indirectly through the semantic relatedness between knowledge base entities.
In one embodiment, the semantic relatedness SR(e_1, e_2) between entities e_1 and e_2 is computed based on the normalized Google distance:

    SR(e_1, e_2) = 1 - \frac{\log(\max(|I_1|, |I_2|)) - \log(|I_1 \cap I_2|)}{\log(|Z|) - \log(\min(|I_1|, |I_2|))}

where SR(e_1, e_2) is a real number with 0 ≤ SR(e_1, e_2) ≤ 1.0; I_1 and I_2 denote the sets of entities in KB whose description texts mention e_1 and e_2, respectively; Z denotes the set of all entities contained in KB; and |\cdot| denotes the size of a set.
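The normalized-Google-distance relatedness can be transcribed directly from the formula above. One caveat: the patent does not say how an empty intersection is handled, so the zero-overlap fallback below is an assumption.

```python
from math import log

def semantic_relatedness(i1, i2, z):
    """SR(e1, e2) via normalized Google distance.
    i1, i2: sets of KB entities whose description text mentions e1 / e2;
    z: total number of entities in the KB.
    Returns 0.0 on empty overlap (assumed fallback; log(0) is undefined)."""
    overlap = len(i1 & i2)
    if overlap == 0:
        return 0.0
    num = log(max(len(i1), len(i2))) - log(overlap)
    den = log(z) - log(min(len(i1), len(i2)))
    return 1.0 - num / den

sr = semantic_relatedness({"a", "b", "c"}, {"b", "c", "d"}, z=100)
```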
4. Between two nodes c_1 and c_2 representing knowledge base categories, add an edge weighted by the correlation (CR) between them. In one embodiment, the correlation CR(c_1, c_2) between categories c_1 and c_2 is computed with the Jaccard coefficient:

    CR(c_1, c_2) = \frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}

where CR(c_1, c_2) is a real number with 0 ≤ CR(c_1, c_2) ≤ 1.0; E_{c_1} and E_{c_2} denote the sets of entities in KB belonging to categories c_1 and c_2, respectively; and |\cdot| denotes the size of a set.
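The Jaccard coefficient for category correlation is straightforward to express (the function name and the empty-set fallback are assumptions):

```python
def category_correlation(ec1, ec2):
    """CR(c1, c2): Jaccard coefficient of two categories' entity sets."""
    union = ec1 | ec2
    return len(ec1 & ec2) / len(union) if union else 0.0

# two sub-categories sharing two of four distinct entities
print(category_correlation({"jordan", "pippen", "bird"},
                           {"jordan", "pippen", "magic"}))  # 0.5
```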
By establishing the nodes and edges, the dependency graph over all entities EM in the text document D is constructed, denoted G = (V, E, W). G is an undirected graph, where V is the vertex set of the graph, comprising all entities in the given text, all knowledge base entities semantically compatible with them, and the set of categories these entities belong to; E is the set of edges between these nodes; and W: E → R (R the real numbers) gives the edge weights.
Consider the following text: "For an athlete, the Hall of Fame is a great monument, an affirmation of the athlete's career, and the best recognition apart from a championship ring. But because an athlete must wait until five years after retirement to enter the Hall of Fame, the 'flying man' had to wait until 2009 for this special honor. This, however, does not prevent Jordan's name from shining in the history of the NBA and of world basketball." A named entity recognition tool identifies 3 distinct entities: "Hall of Fame", "Jordan", "NBA". Using the method provided by the invention, a dependency graph model is created for these 3 entities. As shown in Fig. 3, the graph contains 12 nodes in total (3 text entities, 6 knowledge base entities, and 3 knowledge base categories) and 12 edges.
Step 104: jointly infer the entities' category information from the created dependency graph
On the dependency graph created in the previous step, execute a random walk algorithm such as random walk with restart. The random walk is repeated iteratively on the dependency graph until the distributions of the nodes in the graph no longer change as the number of iterations increases, i.e., until a steady state is reached. Then, from the distribution of each node representing a text entity, its corresponding category label is obtained, thereby inferring the fine-grained category information of the text entity.
This step is now described in detail with reference to Fig. 4, in conjunction with one embodiment of the present invention:
Step 1041: initialize the algorithm input.
Input the created dependency graph G = (V, E, W).
Step 1042: initialize the distributions of the nodes in the dependency graph.
Let the number of nodes in graph G be n = |V| and the number of edges be m = |E|, with the nodes of G numbered 1, ..., i, ..., n (i a natural number and 1 ≤ i ≤ n).
Set the initial distribution \vec{r}_i^{(0)} of node i in the dependency graph. This distribution is an n × 1 column vector over all nodes contained in graph G, where n is the number of nodes in G, denoted:

    \vec{r}_i^{(0)} = (r_i^{(1)}, \ldots, r_i^{(k)}, \ldots, r_i^{(n)})

where each component r_i^{(k)} of the vector takes the value r_i^{(k)} = 1 if k = i and r_i^{(k)} = 0 otherwise, with k a natural number and 1 ≤ k ≤ n.
Step 1043: from the adjacency matrix U = (u_{ij}) of the dependency graph G = (V, E, W), compute the state transition probability matrix A = (a_{ij}) of the random walk process, with i, j natural numbers and 1 ≤ i, j ≤ n. For the adjacency matrix U, u_{ij} = w_{ij} if there is an edge between nodes i and j, and u_{ij} = 0 otherwise, where w_{ij} is the weight of the edge between nodes i and j, determined by W: E → R (R the real numbers) in G = (V, E, W).
For the state transition probability matrix A, a_{ij} denotes the probability of moving from node i to node j during the random walk with restart. Let the adjacency vector of node i over all other nodes in G = (V, E, W) be the vector formed by the i-th row of the adjacency matrix U. From the adjacency vector of node i, a_{ij} is computed as:

    a_{ij} = \frac{u_{ij}}{\sum_{k=1}^{n} u_{ik}}

That is, if i = j or there is no edge between nodes i and j, then a_{ij} = 0; if there is an edge between nodes i and j, then a_{ij} is proportional to the weight of that edge, namely the ratio of the weight of the edge between nodes i and j to the sum of the weights of all edges incident to node i.
Step 1044: on the dependency graph, starting from node i, iteratively transfer state to its neighbor nodes. After the t-th iteration, the distribution of node i in the graph is expressed as:

    \vec{r}_i^{(t)} = (1 - \mu) A \vec{r}_i^{(t-1)} + \mu \vec{v}_i

where t is a natural number; \vec{r}_i^{(t-1)} denotes the distribution of node i after the (t-1)-th iteration; \vec{r}_i^{(t)} denotes the distribution of node i after the t-th iteration; \mu denotes the probability of returning to the starting node i after the t-th iteration (called the restart factor; \mu is a real number with 0 < \mu < 1, preferably 0.15); and \vec{v}_i is the restart vector of node i, an n × 1 column vector over all nodes contained in graph G, whose components take the value v_i^{(k)} = 1 if k = i and v_i^{(k)} = 0 otherwise, with k a natural number and 1 ≤ k ≤ n.
Step 1044 is repeated until the distribution \vec{r}_i of every node i (i a natural number, 1 ≤ i ≤ n) in the dependency graph reaches a steady state, at which point the algorithm terminates. That is, the distribution of each node i no longer changes as the number of iterations t increases. From the distribution of each node representing a text entity, its corresponding category label is then obtained, inferring the specific category of the text entity.
Specifically, as discussed above, \vec{r}_i is an n × 1 column vector over all nodes contained in graph G. The steady-state distribution \vec{r}_i of node i is likewise an n × 1 column vector over all nodes in G, so the category nodes of G are also covered by this vector. In \vec{r}_i, the value of the component corresponding to a category node serves as the probability that the entity represented by node i belongs to that category under the random walk with restart, and the category label of the entity represented by node i is obtained by sorting these probabilities (i.e., selecting the category with the highest probability).
Jointly inferring the categories of text entities means labeling the text entities with their knowledge base categories using the category information of the knowledge base; through the mutually reinforcing effect whereby the category inference for one entity in a text aids the category inference for another, the categories of all entities in the text are inferred simultaneously.
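The iterative procedure of steps 1042 through 1044 can be sketched as a plain random walk with restart. This is a simplified illustration under stated assumptions: dense list-of-lists matrices, assumed function and parameter names, and a toy graph in place of the patent's SC/AR/SR/CR-weighted dependency graph.

```python
def random_walk_with_restart(W, i, mu=0.15, tol=1e-10, max_iter=1000):
    """Iterate r = (1 - mu) * A r + mu * v_i to a steady state.
    W: symmetric edge-weight matrix (list of lists); i: start node;
    mu: restart factor (the patent prefers 0.15)."""
    n = len(W)
    deg = [sum(row) for row in W]
    # A[j][k]: probability of arriving at j when leaving k, i.e. the
    # edge weight W[k][j] normalized by the total weight at node k
    A = [[W[k][j] / deg[k] if deg[k] else 0.0 for k in range(n)]
         for j in range(n)]
    v = [1.0 if k == i else 0.0 for k in range(n)]  # restart vector v_i
    r = v[:]                                        # initial distribution r_i^(0)
    for _ in range(max_iter):
        nxt = [(1 - mu) * sum(A[j][k] * r[k] for k in range(n)) + mu * v[j]
               for j in range(n)]
        if max(abs(nxt[j] - r[j]) for j in range(n)) < tol:
            return nxt
        r = nxt
    return r

# chain graph 0 - 1 - 2, walk restarted at node 0
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
r = random_walk_with_restart(W, 0)
```

In the patent's setting, the components of the steady-state r at the category nodes would then be sorted to pick the most probable category for the text entity at the start node.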
According to one embodiment of the present invention, an entity fine-grained classification system for knowledge base updating is also provided, comprising an entity recognition device, a dependency graph construction device, and an iteration device.
The entity recognition device recognizes entities in a text, for example using a named entity recognition tool as described above. The dependency graph construction device constructs a dependency graph whose nodes are the recognized entities, the entities in the knowledge base related to them, and the categories of those related entities in the knowledge base. The iteration device performs a random walk with restart on the dependency graph to obtain the categories of the recognized entities.
For verifying the validity of the entity fine grit classification method and system towards knowledge base update provided by the invention, inventor adopts existing up-to-date entity classification technology (APOLLO) and method provided by the invention respectively, true YAGO data set is tested, and experiment parameter is as follows:
The test entities were formed by randomly selecting data from 15 subdirectories of the person category in YAGO, with at most 200 entities randomly selected from each directory, for a total of 2650 entities as the final data set DSec. On DSec, the proportion of data used for training was set to ρ = 0.8, the number of iterations to t = 10, the restart factor to μ = 0.15, and the window size to k = 50.
The experiments yielded the following results: the classification accuracy of the existing APOLLO technique was 0.7254, while the classification accuracy of the method and system provided by the invention was 0.7708. Compared with the existing APOLLO technique, the entity fine-grained classification method and system provided by the invention thus improve accuracy by about 4.5 percentage points.
In summary, the invention provides an entity fine-grained classification method and system for knowledge base updating. Based on a dependency graph, the method models the semantic dependencies between the entities occurring in a text, uses these dependencies as strong evidence for the fine-grained classification of the entities in the text, and, through a joint inference method based on the restarted random walk algorithm, improves the accuracy of entity fine-grained classification.
It should be understood that although this specification is described in terms of individual embodiments, not every embodiment contains only one independent technical solution; this manner of presentation is adopted for clarity only. Those skilled in the art should take the specification as a whole; the technical solutions in the individual embodiments may also be combined appropriately to form further embodiments understandable to those skilled in the art.
The foregoing is merely illustrative of embodiments of the present invention and is not intended to limit its scope. Any equivalent variations, modifications, and combinations made by those skilled in the art without departing from the concept and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. An entity fine-grained classification method for knowledge base updating, comprising:
step 1), identifying entities in a text;
step 2), building a dependency graph whose nodes are the identified entities, their related entities in the knowledge base, and the categories of those related entities in the knowledge base, wherein the weight of an edge in the dependency graph represents the degree of correlation between the two nodes connected by that edge;
step 3), obtaining the category of each identified entity by performing a restarted random walk on the dependency graph.
2. The method according to claim 1, wherein step 2) comprises:
step 21), obtaining, according to semantic compatibility, the related entities in the knowledge base of each identified entity, and obtaining the categories of those related entities in the knowledge base, wherein the semantic compatibility represents the similarity between the context of the identified entity and the description text of the related entity;
step 22), taking the identified entities, their related entities in the knowledge base, and the categories of the related entities in the knowledge base as nodes;
step 23), adding an edge between the node representing an identified entity and the node representing a related entity, the weight of the edge being the semantic compatibility between that identified entity and that related entity;
adding an edge between the node representing a related entity and the node representing a category, the weight of the edge indicating whether the related entity belongs to the category;
adding edges between the nodes representing related entities, the weight of each edge being the semantic relatedness between the corresponding related entities;
adding edges between the nodes representing categories, the weight of each edge being the degree of correlation between the corresponding categories.
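A minimal sketch of the graph construction in steps 21)-23), with the semantic compatibility, semantic relatedness, and category correlation measures passed in as weight functions; the function names and toy data are illustrative assumptions, and the 0/1 membership weight is realized as 1.0 for edges that are present:

```python
def build_dependency_graph(mentions, related, memberships, sc, sr, cr):
    """Assemble the weighted dependency graph as a symmetric edge dict.
    mentions: entities identified in the text; related: {mention: KB entities};
    memberships: {KB entity: its KB categories}; sc/sr/cr: weight functions."""
    edges = {}
    def add_edge(u, v, w):
        if w > 0:  # only positively correlated pairs are connected
            edges[(u, v)] = edges[(v, u)] = w
    for m in mentions:
        for e in related.get(m, []):
            add_edge(("mention", m), ("entity", e), sc(m, e))   # mention-entity edge
            for c in memberships.get(e, []):
                add_edge(("entity", e), ("category", c), 1.0)   # membership edge
    ents = sorted({e for es in related.values() for e in es})
    for a in range(len(ents)):
        for b in range(a + 1, len(ents)):
            add_edge(("entity", ents[a]), ("entity", ents[b]), sr(ents[a], ents[b]))
    cats = sorted({c for cs in memberships.values() for c in cs})
    for a in range(len(cats)):
        for b in range(a + 1, len(cats)):
            add_edge(("category", cats[a]), ("category", cats[b]), cr(cats[a], cats[b]))
    return edges

# Toy example: one mention with two candidate KB entities.
edges = build_dependency_graph(
    ["Jordan"],
    {"Jordan": ["Jordan (country)", "Michael Jordan"]},
    {"Michael Jordan": ["athlete"], "Jordan (country)": ["state"]},
    sc=lambda m, e: 0.8 if e == "Michael Jordan" else 0.4,
    sr=lambda a, b: 0.5,
    cr=lambda a, b: 0.3,
)
```

The edge dict stores each undirected edge under both orientations, matching the symmetric weights assumed by the random walk of claim 8.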
3. The method according to claim 2, wherein the semantic compatibility is computed according to the following formula:
SC(em, e) = sim(X, T) = \frac{\vec{V}(X) \cdot \vec{V}(T)}{|\vec{V}(X)| \cdot |\vec{V}(T)|}
wherein SC(em, e) represents the semantic compatibility between the identified entity em and the related entity e in the knowledge base, X represents the context of em, T represents the description text of e, \vec{V}(\cdot) represents the TF-IDF vector of all Biterms contained in a text, |\cdot| represents the modulus of a vector, and a Biterm is a pair of words co-occurring in a text.
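A minimal sketch of this measure, assuming a small background corpus for the IDF statistics and a smoothed IDF variant (the smoothing is an implementation choice to avoid division by zero, not specified by the claim):

```python
import math
from collections import Counter
from itertools import combinations

def biterms(text):
    """All unordered pairs of co-occurring words (Biterms) in a text."""
    words = text.lower().split()
    return [tuple(sorted(p)) for p in combinations(words, 2)]

def semantic_compatibility(context, description, corpus):
    """Cosine similarity of the TF-IDF Biterm vectors of the mention
    context X and the candidate entity's description text T."""
    docs = [set(biterms(d)) for d in corpus]
    n = len(docs)
    def tfidf(text):
        # Smoothed IDF so biterms unseen in the corpus do not divide by zero.
        return {b: tf * (math.log((1 + n) / (1 + sum(b in d for d in docs))) + 1)
                for b, tf in Counter(biterms(text)).items()}
    vx, vt = tfidf(context), tfidf(description)
    dot = sum(w * vt.get(b, 0.0) for b, w in vx.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    nx, nt = norm(vx), norm(vt)
    return dot / (nx * nt) if nx and nt else 0.0
```

Identical texts score 1, texts with no shared word pair score 0, matching the cosine in the claim.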
4. The method according to claim 3, wherein the context of an identified entity consists of the words appearing before and after it in said text.
5. The method according to claim 3 or 4, wherein step 21) comprises:
taking, as related entities, those entities in the knowledge base whose semantic compatibility with the identified entity is greater than 0.
6. The method according to claim 2, wherein the semantic relatedness between related entities is computed according to the following formula:
SR(e_1, e_2) = 1 - \frac{\log(\max(|I_1|, |I_2|)) - \log(|I_1 \cap I_2|)}{\log(|Z|) - \log(\min(|I_1|, |I_2|))}
wherein SR(e_1, e_2) represents the semantic relatedness between the related entities e_1 and e_2 in the knowledge base, I_1 and I_2 represent, respectively, the sets of knowledge base entities whose description texts mention e_1 and e_2, Z represents the set of all entities contained in the knowledge base, and |\cdot| represents the size of a set.
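A direct transcription of the formula; the zero-overlap short-circuit is an added guard (log of zero is undefined), not part of the claim:

```python
import math

def semantic_relatedness(I1, I2, Z):
    """SR(e1, e2): normalized overlap of the sets of KB entities whose
    description texts mention e1 and e2; Z is the full KB entity set."""
    inter = len(I1 & I2)
    if inter == 0:
        return 0.0  # no shared in-linking entities
    num = math.log(max(len(I1), len(I2))) - math.log(inter)
    den = math.log(len(Z)) - math.log(min(len(I1), len(I2)))
    return 1.0 - num / den
```

Entities mentioned by exactly the same set of description texts score 1, and relatedness falls toward 0 as the overlap shrinks relative to the knowledge base size.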
7. The method according to claim 2, wherein the degree of correlation between categories is computed according to the following formula:
CR(c_1, c_2) = \frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}
wherein CR(c_1, c_2) represents the degree of correlation between categories c_1 and c_2, E_{c_1} and E_{c_2} represent, respectively, the sets of knowledge base entities belonging to categories c_1 and c_2, and |\cdot| represents the size of a set.
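This is the Jaccard coefficient of the two categories' entity sets; a one-line sketch (the empty-union guard is an added assumption):

```python
def category_relatedness(E_c1, E_c2):
    """CR(c1, c2): Jaccard overlap of the entity sets that belong to
    categories c1 and c2 in the knowledge base."""
    union = E_c1 | E_c2
    return len(E_c1 & E_c2) / len(union) if union else 0.0
```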
8. The method according to any one of claims 1-4, wherein step 3) comprises:
step 31), initializing the distribution of each node in the dependency graph according to the following formula:
\vec{r}_i^{(0)} = (r_i^{(1)}, \ldots, r_i^{(k)}, \ldots, r_i^{(n)})
wherein n represents the total number of nodes and \vec{r}_i^{(0)} represents the initial distribution of node i; if k = i then r_i^{(k)} = 1, otherwise r_i^{(k)} = 0, k being a natural number with 1 ≤ k ≤ n;
step 32), computing the state transition probability matrix A = (a_{ij}), where a_{ij} = w_{ij} / \sum_k w_{ik};
wherein a_{ij} represents the probability of transferring from node i to node j during the restarted random walk, i and j being natural numbers with 1 ≤ i, j ≤ n, w_{ij} is the weight of the edge between node i and node j, and \sum_k w_{ik} is the sum of the weights of all edges incident to node i;
step 33), for each node, iteratively transferring state to its neighbor nodes until the distribution of every node in the dependency graph no longer changes as the number of iterations increases; wherein the distribution of node i after the t-th iteration is expressed as follows:
\vec{r}_i^{(t)} = (1 - \mu) A \vec{r}_i^{(t-1)} + \mu \vec{v}_i
wherein \vec{r}_i^{(t)} represents the distribution of node i after the t-th iteration, t being a natural number and i being a natural number with 1 ≤ i ≤ n; \vec{r}_i^{(t-1)} represents the distribution of node i after the (t-1)-th iteration; \mu represents the probability of returning to the start node i after the t-th iteration, \mu being a real number with 0 < \mu < 1; \vec{v}_i represents the restart vector of node i, in which, if k = i, then v_i^{(k)} = 1, otherwise v_i^{(k)} = 0, k being a natural number with 1 ≤ k ≤ n;
step 34), obtaining the corresponding category according to the distribution of the node.
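Steps 31)-33) can be sketched as follows. The column normalization of the weight matrix (so that each update conserves probability mass) and the convergence tolerance are implementation assumptions not fixed by the claim:

```python
import numpy as np

def restart_random_walk(W, i, mu=0.15, tol=1e-10, max_iter=1000):
    """Iterate r^(t) = (1 - mu) * A @ r^(t-1) + mu * v_i until the
    distribution stops changing. W is the symmetric weight matrix of the
    dependency graph; v_i restarts the walk at node i."""
    n = W.shape[0]
    A = W / W.sum(axis=0, keepdims=True)    # step 32: normalized transition matrix
    v = np.zeros(n)
    v[i] = 1.0                              # restart vector, also the initial state (step 31)
    r = v.copy()
    for _ in range(max_iter):
        r_next = (1 - mu) * A @ r + mu * v  # step 33: one state-transfer iteration
        if np.abs(r_next - r).max() < tol:  # distribution no longer changes
            return r_next
        r = r_next
    return r

# Toy triangle graph with unit edge weights, walk restarted at node 0.
r = restart_random_walk(np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]]), i=0)
```

With mu = 0.15 this matches the restart factor used in the experiments described above; the restart term keeps the steady-state distribution biased toward the neighborhood of the start node.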
9. The method according to claim 8, wherein step 34) comprises:
in the distribution of a node representing an identified entity, ranking the category nodes by the values of the components corresponding to them;
obtaining the category of the identified entity according to the ranking result.
10. An entity fine-grained classification system for knowledge base updating, comprising:
an entity recognition device for identifying entities in a text;
a dependency graph construction device for building a dependency graph whose nodes are the identified entities, their related entities in the knowledge base, and the categories of those related entities in the knowledge base, wherein the weight of an edge in the dependency graph represents the degree of correlation between the two nodes connected by that edge; and
an iteration device for obtaining the category of each identified entity by performing a restarted random walk on the dependency graph.
CN201510033050.4A 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update Active CN104615687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510033050.4A CN104615687B (en) 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update

Publications (2)

Publication Number Publication Date
CN104615687A true CN104615687A (en) 2015-05-13
CN104615687B CN104615687B (en) 2018-05-22

Family

ID=53150129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510033050.4A Active CN104615687B (en) 2015-01-22 2015-01-22 A kind of entity fine grit classification method and system towards knowledge base update

Country Status (1)

Country Link
CN (1) CN104615687B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034693A (en) * 2012-12-03 2013-04-10 哈尔滨工业大学 Open-type entity and type identification method thereof
US8538916B1 (en) * 2010-04-09 2013-09-17 Google Inc. Extracting instance attributes from text
CN103678316A (en) * 2012-08-31 2014-03-26 富士通株式会社 Entity relationship classifying device and entity relationship classifying method
CN104182463A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantic-based text classification method

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339401A (en) * 2015-07-16 2017-01-18 富士通株式会社 Method and equipment for confirming relationship between entities
CN107092605A (en) * 2016-02-18 2017-08-25 北大方正集团有限公司 A kind of entity link method and device
CN105677913A (en) * 2016-02-29 2016-06-15 哈尔滨工业大学 Machine translation-based construction method for Chinese semantic knowledge base
CN105677913B (en) * 2016-02-29 2019-04-26 哈尔滨工业大学 A kind of construction method of the Chinese semantic knowledge-base based on machine translation
CN105787105A (en) * 2016-03-21 2016-07-20 浙江大学 Iterative-model-based establishment method of Chinese encyclopedic knowledge graph classification system
CN105787105B (en) * 2016-03-21 2019-04-19 浙江大学 A kind of Chinese encyclopaedic knowledge map classification system construction method based on iterative model
CN108009184A (en) * 2016-10-27 2018-05-08 北大方正集团有限公司 Knowledge base example of the same name obscures the method and device of detection
CN108009184B (en) * 2016-10-27 2021-08-27 北大方正集团有限公司 Method and device for confusion detection of synonym instances of knowledge base
CN108170689A (en) * 2016-12-07 2018-06-15 富士通株式会社 The information processing unit and information processing method of semantization are carried out to entity
CN107545033A (en) * 2017-07-24 2018-01-05 清华大学 A kind of computational methods based on the knowledge base entity classification for representing study
CN107545033B (en) * 2017-07-24 2020-12-01 清华大学 Knowledge base entity classification calculation method based on representation learning
CN107704892A (en) * 2017-11-07 2018-02-16 宁波爱信诺航天信息有限公司 A kind of commodity code sorting technique and system based on Bayesian model
CN107704892B (en) * 2017-11-07 2019-05-17 宁波爱信诺航天信息有限公司 A kind of commodity code classification method and system based on Bayesian model
CN108052625A (en) * 2017-12-18 2018-05-18 清华大学 A kind of entity sophisticated category method
CN108052625B (en) * 2017-12-18 2020-05-19 清华大学 Entity fine classification method
CN108460011A (en) * 2018-02-01 2018-08-28 北京百度网讯科技有限公司 A kind of entitative concept mask method and system
CN108460011B (en) * 2018-02-01 2022-03-25 北京百度网讯科技有限公司 Entity concept labeling method and system
CN108804599A (en) * 2018-05-29 2018-11-13 浙江大学 A kind of fast searching method of similar subgraph
CN108804599B (en) * 2018-05-29 2022-01-04 浙江大学 Rapid searching method for similar transaction modes
CN110019840A (en) * 2018-07-20 2019-07-16 腾讯科技(深圳)有限公司 The method, apparatus and server that entity updates in a kind of knowledge mapping
CN110427606A (en) * 2019-06-06 2019-11-08 福建奇点时空数字科技有限公司 A kind of professional entity similarity calculating method based on semantic model
CN110377744A (en) * 2019-07-26 2019-10-25 北京香侬慧语科技有限责任公司 A kind of method, apparatus, storage medium and the electronic equipment of public sentiment classification
CN111428506A (en) * 2020-03-31 2020-07-17 联想(北京)有限公司 Entity classification method, entity classification device and electronic equipment
CN111428506B (en) * 2020-03-31 2023-02-21 联想(北京)有限公司 Entity classification method, entity classification device and electronic equipment

Also Published As

Publication number Publication date
CN104615687B (en) 2018-05-22

Similar Documents

Publication Publication Date Title
CN104615687A (en) Entity fine granularity classifying method and system for knowledge base updating
CN106777274B (en) A kind of Chinese tour field knowledge mapping construction method and system
Xu et al. Topic based context-aware travel recommendation method exploiting geotagged photos
Jiang et al. Author topic model-based collaborative filtering for personalized POI recommendations
CN105183869B (en) Building knowledge mapping database and its construction method
CN104008203B A kind of Users' Interests Mining method for incorporating body situation
CN103761254B (en) Method for matching and recommending service themes in various fields
CN109299090B (en) Foundation centrality calculating method, system, computer equipment and storage medium
CN102902821B (en) The image high-level semantics mark of much-talked-about topic Network Based, search method and device
CN103116657B (en) A kind of individuation search method of network teaching resource
CN103064924A (en) Travel destination situation recommendation method based on geotagged photo excavation
CN106156286A (en) Type extraction system and method towards technical literature knowledge entity
CN107391542A (en) A kind of open source software community expert recommendation method based on document knowledge collection of illustrative plates
CN104572797A (en) Individual service recommendation system and method based on topic model
CN103678431A (en) Recommendation method based on standard labels and item grades
CN104239513A (en) Semantic retrieval method oriented to field data
CN110457404A (en) Social media account-classification method based on complex heterogeneous network
CN107944898A (en) The automatic discovery of advertisement putting building information and sort method
CN106960044A (en) A kind of Time Perception personalization POI based on tensor resolution and Weighted H ITS recommends method
Bagci et al. Random walk based context-aware activity recommendation for location based social networks
CN105678590A (en) topN recommendation method for social network based on cloud model
CN105654144A (en) Social network body constructing method based on machine learning
CN106233288A (en) Again rated position refinement and multifarious Search Results
CN109284443A (en) A kind of tourism recommended method and system based on crawler technology
CN105786897B (en) For providing the context aware body constructing method for paying close attention to information of the user based on context aware

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Cheng Xueqi

Inventor after: Wang Yuanzhuo

Inventor after: Lin Hailun

Inventor after: Jia Yantao

Inventor after: Jin Xiaolong

Inventor after: Xiong Jinhua

Inventor after: Li Manling

Inventor after: Chang Yuxiao

Inventor after: Xu Hongbo

Inventor before: Cheng Xueqi

Inventor before: Wang Yuanzhuo

Inventor before: Lin Hailun

Inventor before: Jia Yantao

Inventor before: Xiong Jinhua

Inventor before: Li Manling

Inventor before: Chang Yuxiao

Inventor before: Xu Hongbo

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant