CN102799577A

CN102799577A - Extraction method of semantic relation between Chinese entities

Info

Publication number: CN102799577A
Application number: CN2012102944371A
Authority: CN
Inventors: 钱龙华; 刘丹丹; 周国栋
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2012-08-17
Filing date: 2012-08-17
Publication date: 2012-11-28
Anticipated expiration: 2032-08-17
Also published as: CN102799577B

Abstract

The invention discloses a method for extracting semantic relations between Chinese entities. The method comprises: performing syntactic analysis on a natural-language sentence to determine its complete syntax tree; extracting the shortest path-enclosed tree between the two Chinese entities from the complete syntax tree; extracting from the shortest path-enclosed tree the path verb nearest to the second Chinese entity; obtaining the semantic information of the two Chinese entities and of the path verb; adding the three pieces of semantic information under the root node of the shortest path-enclosed tree according to a preset rule, the expanded tree being the relation tree of the sentence; and classifying the relation of the relation tree with a pre-stored classification model. Because the relation tree contains rich structural information and lexical semantic information, it generalizes better and improves overall relation extraction performance, it eases the dependence on large-scale corpora, and it keeps the amount of computation low.

Description

A method for extracting semantic relations between Chinese entities

Technical field

The invention belongs to the technical field of text processing, and in particular relates to a method for extracting semantic relations between Chinese entities.

Background technology

Semantic relation extraction between named entities (abbreviated as entity relation extraction, or simply relation extraction) is one research topic within information extraction. Its task is to extract from natural-language text the semantic relation that holds between two named entities, for example the physical location relation (PHYS.Located) between the two entities "Clinton" (PER, person) and "Pyongyang" (GPE, geo-political entity) in the phrase "US President Clinton's Pyongyang trip". As applied basic research, semantic relation extraction between named entities is of great significance to natural language processing applications such as content understanding, question answering, automatic summarization and information filtering.

Relation extraction between entities usually adopts supervised machine learning, which, by the way relation instances are represented, can be divided into feature-vector-based methods and kernel-based methods. In feature-vector-based methods, a relation instance is converted into a feature vector, acceptable to the classifier, that contains lexical, syntactic or semantic features. Although such methods are fast and quite effective, their relation extraction performance is relatively low because semantic relations between entities are expressed in complex and variable ways. In kernel-based methods, structure trees are processed directly: the similarity between structure trees is computed, and a classifier that supports kernel functions then performs relation extraction. Because such methods can make full use of structural features and can in theory explore an implicit high-dimensional feature space, researchers still hope to improve relation extraction performance through further study and application of kernel functions, even though training and prediction are slower.

The applicant has found through research that lexical semantic information plays an important role in relation extraction. Two words that differ in form but are close in meaning can express the same semantic relation in two different sentences. For example, the Chinese relation instances "his wife" and "her husband" both belong to the family relation (PER-SOC.Family): the entities "he" and "she", and likewise "wife" and "husband", are different words but have similar semantics.

Therefore, how to use lexical semantic information to improve the performance of semantic relation extraction between Chinese entities, while simplifying the computation and reducing the amount of computation as far as possible so as to reduce system overhead, is a problem that those skilled in the art urgently need to solve.

Summary of the invention

In view of this, the object of the present invention is to provide a method for extracting semantic relations between Chinese entities that uses lexical semantic information to improve relation extraction performance and reduces the amount of computation as far as possible, thereby reducing system overhead.

To achieve the above object, the present invention provides the following technical solution:

A method for extracting semantic relations between Chinese entities, used to extract the semantic relation between entities from a natural-language sentence containing two Chinese entities, the method comprising:

Performing syntactic analysis on the sentence to determine its complete syntax tree;

Extracting, from the complete syntax tree, the shortest path-enclosed tree between the two Chinese entities;

Extracting, from the shortest path-enclosed tree, the path verb nearest to the second Chinese entity, the second Chinese entity being the one of the two Chinese entities that appears later in the sentence;

Obtaining the semantic information of the two Chinese entities and of the path verb respectively;

Adding the three pieces of semantic information obtained under the root node of the shortest path-enclosed tree according to a preset rule, and taking the expanded shortest path-enclosed tree as the relation tree of the sentence;

Classifying the relation of the relation tree with a pre-stored classification model.

Preferably, in the method, extracting from the shortest path-enclosed tree the path verb nearest to the second Chinese entity specifically comprises:

Determining, in the shortest path-enclosed tree, the node at which the second Chinese entity is located;

Starting from that node, searching its upper-level (ancestor) nodes for a node labeled "VP";

Examining the child nodes of the node labeled "VP";

When a child node of the node labeled "VP" is labeled "VV" or "VE", taking the word at that child node as the path verb.

Preferably, in the method, obtaining the semantic information of the two Chinese entities and of the path verb respectively specifically comprises:

Using a pre-stored mapping table of words to basic sememes, looking up the basic sememes corresponding to the two Chinese entities and to the path verb respectively;

When the basic sememe corresponding to a Chinese entity or to the path verb is found, taking that basic sememe as its semantic information;

When a word has several basic sememes, the mapping table contains only the mapping between the word and its first basic sememe.

Preferably, in the method, when no basic sememe corresponding to a Chinese entity is found in the mapping table, the method further comprises:

Performing word segmentation on the Chinese entity for which no basic sememe was found, to obtain several new Chinese sub-entities;

Using the mapping table, looking up the basic sememe corresponding to the sub-entity that appears last among the new Chinese sub-entities;

Taking the basic sememe found for that sub-entity as the semantic information of the Chinese entity for which no basic sememe was found.

Preferably, in the method, when no basic sememe corresponding to the last sub-entity is found, the semantic information of the Chinese entity for which no basic sememe was found is assigned the value "NULL".

Preferably, in the method, obtaining the semantic information of the two Chinese entities and of the path verb respectively specifically comprises:

Using a pre-stored mapping table of words to semantic codes, looking up the semantic codes corresponding to the two Chinese entities and to the path verb respectively;

When the semantic code corresponding to a Chinese entity or to the path verb is found, taking a preset number of characters from the high-order (leftmost) end of the semantic code as its semantic information;

When a word has several semantic codes, the mapping table contains only the mapping between the word and its first semantic code.

Preferably, in the method, when no semantic code corresponding to a Chinese entity is found in the mapping table, the method further comprises:

Performing word segmentation on the Chinese entity for which no semantic code was found, to obtain several new Chinese sub-entities;

Using the mapping table, looking up the semantic code corresponding to the sub-entity that appears last among the new Chinese sub-entities;

Taking a preset number of characters from the high-order (leftmost) end of the semantic code found for that sub-entity as the semantic information of the Chinese entity for which no semantic code was found.

Preferably, in the method, when no semantic code corresponding to the last sub-entity is found, the semantic information of the Chinese entity for which no semantic code was found is assigned the value "NULL".

Preferably, in the method, adding the three pieces of semantic information obtained under the root node of the shortest path-enclosed tree according to a preset rule specifically comprises:

Adding three marker nodes under the root node of the shortest path-enclosed tree, the three marker nodes respectively indicating that the word at their child node is the semantic information of the first Chinese entity, the semantic information of the second Chinese entity, and the semantic information of the path verb, the first Chinese entity being the one of the two Chinese entities that appears earlier in the sentence;

Writing the three pieces of semantic information at the child nodes of their corresponding marker nodes.
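A minimal sketch of this expansion step, assuming the tree is held as nested tuples of the form (label, child, ...). The marker labels E1_SEM, E2_SEM and VERB_SEM, the placement of the markers before the existing children, the toy tree and the placeholder semantic values are all illustrative assumptions; the text above only requires three marker nodes under the root.

```python
def expand_spt(spt, sem_e1, sem_e2, sem_verb):
    """Return a relation tree: the SPT with three marker nodes under its root.

    spt is a nested tuple (label, child, ...); leaves are plain strings.
    Marker labels and their position are assumptions made for this sketch.
    """
    root_label, children = spt[0], spt[1:]
    markers = (
        ("E1_SEM", sem_e1),      # semantic information of the first entity
        ("E2_SEM", sem_e2),      # semantic information of the second entity
        ("VERB_SEM", sem_verb),  # semantic information of the path verb
    )
    return (root_label, *markers, *children)


# Hypothetical shortest path-enclosed tree used only for illustration.
spt = ("IP",
       ("NP", ("NN", "reporter")),
       ("VP", ("VV", "trapped"),
              ("PP", ("P", "at"), ("NP", ("NN", "airport")))))

# Placeholder semantic values, not real dictionary entries.
relation_tree = expand_spt(spt, "SEM_E1", "SEM_E2", "SEM_V")
```

The original children of the root are left untouched, so the structural information of the tree is preserved and only three small subtrees are added.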

It can thus be seen that the beneficial effects of the present invention are as follows. The disclosed method extracts the semantic information of the Chinese entities and of the path verb and adds it to the shortest path-enclosed tree containing the Chinese entity pair to form a relation tree. This relation tree contains rich structural information and lexical semantic information and generalizes better; therefore, compared with a tree that contains only the structural information of the syntax tree, both the precision and the recall of relation extraction improve, and overall performance is better. At the same time, because the semantic information of a word (such as its semantic code or basic sememe) generalizes the word to some extent, the relation tree can recognize relation instances that do not occur in the training corpus but share the same semantic information; this reduces the amount of training data that must be annotated and eases the dependence of machine-learning-based relation extraction on large-scale corpora. Finally, compared with other kernel methods that use lexical semantic similarity, the present invention only needs to add the semantic information of the Chinese entities and the path verb under the root node of the syntax tree, without computing the pairwise semantic similarity between words, and so avoids the heavy computation that such similarity calculation entails.

Description of drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the method for extracting semantic relations between Chinese entities disclosed by the invention;

Fig. 2 is a flowchart of extracting the shortest path-enclosed tree between two Chinese entities from the complete syntax tree;

Fig. 3 is a flowchart of extracting the path verb;

Fig. 4 is one flowchart of obtaining semantic information;

Fig. 5 is another flowchart of obtaining semantic information;

Fig. 6 is a flowchart of expanding the shortest path-enclosed tree with semantic information;

Fig. 7 is the complete syntax tree of a natural-language sentence;

Fig. 8 is the shortest path-enclosed tree between two Chinese entities extracted from the complete syntax tree shown in Fig. 7;

Fig. 9 is one schematic diagram of a relation tree expanded with semantic information;

Fig. 10 is another schematic diagram of a relation tree expanded with semantic information.

Embodiment

For clarity, the English abbreviations and terms that appear below are explained first.

Syntax tree: Syntactic Parse Tree, the hierarchical structure relations among the constituents of a natural-language sentence (such as words, parts of speech, phrases and clauses);

Relation tree: Relation Tree, the part of the syntax tree that expresses the structural information of an entity relation instance;

Shortest path-enclosed tree: Shortest Path-enclosed Tree, SPT, the shortest path connecting two entities in the syntax tree together with the portion it encloses, also called the SPT tree;

Precision: the percentage of correct relation instances among the entity relation instances identified by the system;

Recall: the percentage of all relation instances that the system correctly identifies;

F1 measure: F1-measure, the harmonic mean of precision and recall, computed as F1 = 2*P*R/(P+R);

PCFG: Probabilistic Context-Free Grammar;

MLE: Maximum Likelihood Estimation.
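The three evaluation measures defined above can be computed from raw counts as in the following sketch. The function name and its parameters are illustrative, and returning 0.0 for an empty denominator is a convention chosen here, not something the text specifies.

```python
def precision_recall_f1(num_correct, num_predicted, num_gold):
    """Return (P, R, F1) from raw counts.

    num_correct:   relation instances the system identified correctly
    num_predicted: all relation instances the system identified
    num_gold:      all relation instances in the reference annotation
    """
    p = num_correct / num_predicted if num_predicted else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Made-up counts for illustration: 8 correct out of 10 predicted, 16 in the reference.
p, r, f1 = precision_recall_f1(8, 10, 16)
```

With these counts P is 0.8 and R is 0.5, and F1 lies below their arithmetic mean, which is the usual behaviour of a harmonic mean.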

To make the purpose, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

The invention discloses a method for extracting semantic relations between Chinese entities. Using this method to extract the semantic relation between entities from a natural-language sentence containing two Chinese entities improves relation extraction performance and, compared with other kernel methods, reduces the amount of computation and thus the system overhead.

Its principle is as follows: two Chinese entities and the path verb between them are extracted from the sentence; the semantic information of the two Chinese entities and of the path verb is then obtained and added to the shortest path-enclosed tree extracted from the complete syntax tree, which yields a relation tree with stronger generalization ability; a machine learning method based on a tree kernel function is then used to extract the semantic relation between the two Chinese entities.

The method disclosed by the invention is described below with reference to specific embodiments.

Referring to Fig. 1, a flowchart of the method for extracting semantic relations between Chinese entities disclosed by the invention. The method comprises:

Step S1: perform syntactic analysis on the natural-language sentence to determine its complete syntax tree.

A natural-language sentence containing two Chinese entities is taken from a Chinese corpus. The two entities occupy different positions in the sentence; for ease of expression, the entity that appears earlier is called the first Chinese entity and the entity that appears later is called the second Chinese entity. A syntactic analysis method is applied to the sentence to obtain its complete syntax tree.

The present invention uses a syntactic analysis method based on Probabilistic Context-Free Grammar (PCFG) to obtain the complete syntax tree of the sentence. Its basic idea is that the probability of a syntax tree is determined by the probabilities of the productions it contains; the probability of a production is independent of the context in which it occurs and can be estimated from a training corpus by Maximum Likelihood Estimation (MLE). A production is a rule in the syntax tree by which a parent node derives its child nodes. In Fig. 7, for example, the production IP → NP VP states that the node IP (sentence) can be derived as NP (noun phrase) followed by VP (verb phrase).

A sentence can have several different syntax trees, each with its own probability, so the syntax tree with the largest posterior probability is chosen as the correct result, that is:

T(s) = \underset{\pi}{\arg\max} \, \frac{p(\pi, s)}{p(s)} = \underset{\pi}{\arg\max} \, p(\pi, s)

Here s is a natural-language sentence made up of words, π is one possible syntax tree of the sentence, p(s) is the probability of the sentence s, and p(π, s) is the joint probability of the sentence s and the syntax tree π. Because p(s) does not depend on π, it can be dropped from the maximization. The joint probability p(π, s) is the product of the probabilities of all productions r used in the syntax tree π, that is:

p(\pi, s) = \prod_{r \in \pi} p(r)

Here r is a production of the syntax tree π, and all possible productions are given by the grammar G of the PCFG.

Given a grammar G and a natural-language sentence s containing two Chinese entities, a search strategy (such as top-down or bottom-up) is used to generate all possible syntax trees of the sentence; the posterior probability of each syntax tree is then computed, and the syntax tree with the largest posterior probability is selected as the complete syntax tree of the sentence.

Fig. 7 shows the complete syntax tree of the sentence "The reporter is trapped at the airport".
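The scoring and selection described above can be sketched as follows, assuming the candidate trees have already been generated and are stored as nested tuples. The grammar probabilities in RULE_PROB are invented toy values, lexical rules such as NN → word are ignored (treated as probability 1), and the names productions, joint_log_prob and best_parse are illustrative.

```python
import math

# Hypothetical toy PCFG: (parent, children) -> probability. Not from the patent.
RULE_PROB = {
    ("IP", ("NP", "VP")): 1.0,
    ("NP", ("NN",)): 0.6,
    ("NP", ("NR",)): 0.4,
    ("VP", ("VV", "PP")): 0.7,
    ("VP", ("VV", "NP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
}


def productions(tree):
    """Yield (parent_label, child_labels) for each node whose children are subtrees."""
    label, children = tree[0], tree[1:]
    if all(isinstance(child, tuple) for child in children):
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from productions(child)
    # Nodes such as ("NN", "reporter") are lexical rules and are skipped here.


def joint_log_prob(tree):
    """log p(pi, s): sum of the log-probabilities of the productions used."""
    return sum(math.log(RULE_PROB[rule]) for rule in productions(tree))


def best_parse(candidates):
    """Return the candidate tree with the highest joint probability."""
    return max(candidates, key=joint_log_prob)


# Two candidate analyses of the same word sequence (differing in noun tags).
t1 = ("IP", ("NP", ("NN", "reporter")),
      ("VP", ("VV", "trapped"), ("PP", ("P", "at"), ("NP", ("NN", "airport")))))
t2 = ("IP", ("NP", ("NR", "reporter")),
      ("VP", ("VV", "trapped"), ("PP", ("P", "at"), ("NP", ("NR", "airport")))))
chosen = best_parse([t2, t1])
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sentences. Practical parsers do not enumerate every tree; they typically use dynamic programming such as the CKY algorithm to find the same maximum.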

Step S2: extract, from the complete syntax tree, the shortest path-enclosed tree between the two Chinese entities.

The invention uses a level-by-level matching method to extract the shortest path-enclosed tree between the two Chinese entities from the complete syntax tree. The flow is shown in Fig. 2 and comprises:

Step S21: determine the nodes at which the two Chinese entities are located in the complete syntax tree.

In practice, these nodes can be found by traversing all nodes of the complete syntax tree. The leaf nodes of the complete syntax tree are normally the words of the sentence, so the name of each Chinese entity can be matched against the leaf nodes to determine the nodes at which the two Chinese entities are located.

Step S22: find the lowest common node of the nodes at which the two Chinese entities are located.

Because each child node in the complete syntax tree has exactly one parent node, the upper-level nodes of each of the two entity nodes can be listed separately and the two lists matched level by level starting from the lowest level; the first match is the lowest common node of the two entity nodes.

Step S23: determine, in the complete syntax tree, the paths between the two entity nodes and the lowest common node.

For simplicity, the path that runs between the two entity nodes through the lowest common node is called the shortest path.

Step S24: keep, in the complete syntax tree, the shortest path and the portion it encloses, and delete everything outside the shortest path.

Fig. 8 shows the shortest path-enclosed tree between the two Chinese entities extracted from the complete syntax tree shown in Fig. 7.
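Steps S21 to S24 can be sketched as below. The Node class, the nested-tuple input format and all function names are assumptions made for illustration; the sketch also assumes that each entity is a single leaf word and that the first entity precedes the second. The example tree adds a hypothetical adverb in front of the sentence so that the pruning in S24 has something to remove.

```python
class Node:
    """Parse-tree node with a parent link; a node without children is a word."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def build(spec):
    """Build a Node tree from nested tuples such as ("NP", ("NN", "reporter"))."""
    if isinstance(spec, str):
        return Node(spec)
    return Node(spec[0], [build(child) for child in spec[1:]])


def leaves(node):
    """Return the leaf nodes under node in left-to-right order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def lowest_common_node(a, b):
    """Match the two ancestor chains bottom-up; the first hit is the lowest."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    raise ValueError("nodes are not in the same tree")


def path_enclosed_tree(root, entity1, entity2):
    """Return the shortest path-enclosed tree as nested tuples."""
    words = leaves(root)
    first = next(w for w in words if w.label == entity1)     # S21
    second = next(w for w in words if w.label == entity2)
    top = lowest_common_node(first, second)                  # S22
    span = leaves(top)
    start = next(i for i, w in enumerate(span) if w is first)
    end = next(i for i, w in enumerate(span) if w is second)
    kept = {id(w) for w in span[start:end + 1]}              # S23: words on or inside the path

    def prune(node):                                         # S24: drop everything outside
        if not node.children:
            return node.label if id(node) in kept else None
        kids = [k for k in map(prune, node.children) if k is not None]
        return (node.label, *kids) if kids else None

    return prune(top)


full = build(("IP", ("ADVP", ("AD", "yesterday")),
              ("NP", ("NN", "reporter")),
              ("VP", ("VV", "trapped"),
                     ("PP", ("P", "at"), ("NP", ("NN", "airport"))))))
spt = path_enclosed_tree(full, "reporter", "airport")
```

Keeping exactly the words from the first entity through the second entity, together with the nodes that dominate them below the lowest common node, is one straightforward way to realize "the shortest path and the portion it encloses".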

Step S3: extract, from the shortest path-enclosed tree, the path verb nearest to the second Chinese entity.

In practice, the shortest path-enclosed tree can be searched for leaf nodes whose parent node is labeled "VP". If only one such leaf node is found, it is the path verb nearest to the second Chinese entity. If several are found, their distances to the node of the second Chinese entity are compared, and the leaf node with the smallest distance is taken as the path verb.

The invention also discloses a flow for extracting the path verb; see Fig. 3. It comprises:

Step S31: determine, in the shortest path-enclosed tree, the node at which the second Chinese entity is located.

Step S32: starting from that node, search its upper-level (ancestor) nodes for a node labeled "VP".

Step S33: examine the child nodes of the node labeled "VP".

Step S34: when a child node of the node labeled "VP" is labeled "VV" or "VE", take the word at that child node as the path verb.

Take the shortest path-enclosed tree shown in Fig. 8 as an example. First the node of the second Chinese entity "airport" is located; then its upper-level nodes are searched for a node labeled "VP". This "VP" node has two child nodes, "VV" and "PP", and the word "trapped" at the child node labeled "VV" is taken as the path verb.

Extracting the path verb with the method shown in Fig. 3 does not require traversing every node of the shortest path-enclosed tree, which reduces the amount of computation and thus the system power consumption.
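The upward search of steps S31 to S34 can be sketched as follows. The Node class and the function name find_path_verb are illustrative. The text does not say what happens when the nearest "VP" has no "VV" or "VE" child, so this sketch assumes the search simply continues to the next ancestor and returns None if nothing is found.

```python
class Node:
    """Parse-tree node with a parent link; a node without children is a word."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def find_path_verb(entity_leaf):
    """Walk up from the second entity's leaf and return the nearest path verb."""
    node = entity_leaf.parent                      # S32: start with the upper-level nodes
    while node is not None:
        if node.label == "VP":
            for child in node.children:            # S33: inspect the VP's children
                if child.label in ("VV", "VE") and child.children:
                    return child.children[0].label  # S34: word under VV / VE
        node = node.parent                         # assumption: keep climbing otherwise
    return None


# Tree corresponding to the example sentence; built by hand for this sketch.
airport = Node("airport")
tree = Node("IP", [
    Node("NP", [Node("NN", [Node("reporter")])]),
    Node("VP", [
        Node("VV", [Node("trapped")]),
        Node("PP", [Node("P", [Node("at")]),
                    Node("NP", [Node("NN", [airport])])]),
    ]),
])
verb = find_path_verb(airport)
```

Only the ancestors of the second entity and the direct children of each "VP" ancestor are visited, which is why this search touches far fewer nodes than a full traversal.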

Step S4: obtain the semantic information of the two Chinese entities and of the path verb respectively.

In the present invention, the semantic information of the two Chinese entities and of the path verb can be obtained from Tongyici Cilin (literally "synonym word forest") or from HowNet (literally "knowledge net"). Tongyici Cilin is a Chinese thesaurus in which every entry carries a code representing its semantic class; it is one kind of Chinese lexical semantic resource. In HowNet, the semantic description of each word is composed of several sememes, a sememe being the basic unit for describing a concept; it is likewise a Chinese lexical semantic resource. The two cases are described below with reference to Fig. 4 and Fig. 5 respectively.

Referring to Fig. 4, one flowchart of obtaining semantic information disclosed by the invention. It comprises:

Step S401: using a pre-stored mapping table of words to basic sememes, look up the basic sememes corresponding to the two Chinese entities and to the path verb respectively.

A <word, first basic sememe> mapping table is built in advance for all words in HowNet; when a word has several basic sememes, the table contains only the mapping between the word and its first basic sememe. During lookup, the two Chinese entities and the path verb are each used as a key to search the mapping table and obtain the corresponding basic sememe.

Step S402: when the basic sememe corresponding to a Chinese entity or to the path verb is found, take that basic sememe as its semantic information.

When the two Chinese entities and the path verb are common words, their basic sememes can be found directly in the pre-stored mapping table and are then used as their semantic information.

When a Chinese entity is an uncommon proper noun, its basic sememe may not be found directly in the pre-stored mapping table. In that case the Chinese entity can be segmented into several new Chinese sub-entities.

Step S403: when no basic sememe corresponding to a Chinese entity is found in the mapping table, perform word segmentation on that Chinese entity to obtain several new Chinese sub-entities.

Step S404: using the mapping table, look up the basic sememe corresponding to the sub-entity that appears last among the new Chinese sub-entities.

Step S405: take the basic sememe found for that sub-entity as the semantic information of the Chinese entity for which no basic sememe was found.

For example, the phrase "Taipei Da'an Forest Park" contains two Chinese entities: the first Chinese entity "Taipei" and the second Chinese entity "Da'an Forest Park". The second Chinese entity is an uncommon proper noun with no entry in the semantic dictionary, so the pre-stored mapping table does not contain its basic sememe. The second Chinese entity "Da'an Forest Park" is therefore segmented into three new Chinese sub-entities, "Da'an", "forest" and "park". The mapping table is then searched for the basic sememe of the sub-entity that appears last, "park". Once the basic sememe of "park" is found, it is used as the semantic information of the second Chinese entity "Da'an Forest Park".

Step S406: when no basic sememe corresponding to the last sub-entity is found, assign the value "NULL" to the semantic information of the Chinese entity for which no basic sememe was found.

In the extreme case where even the sub-entity that appears last among the new Chinese sub-entities has no basic sememe in the pre-stored mapping table, the semantic information of the Chinese entity is assigned the value "NULL".

In the flow shown in Fig. 4, a mapping table of words to basic sememes built from HowNet is used to obtain the basic sememes of the Chinese entities and the path verb, and these basic sememes serve as their semantic information.
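The lookup with its two fallbacks (steps S401 to S406) can be sketched as follows. The table contents are invented placeholders rather than real HowNet entries, English glosses stand in for the Chinese words, and toy_segment is a stand-in for a real Chinese word segmenter; all names are illustrative.

```python
def semantic_info(word, table, segment):
    """Look up word; fall back to its last segment; otherwise return "NULL"."""
    if word in table:                       # S401 / S402: direct hit
        return table[word]
    parts = segment(word)                   # S403: split the unknown entity
    if parts and parts[-1] in table:        # S404 / S405: use the last sub-entity
        return table[parts[-1]]
    return "NULL"                           # S406: nothing found


# Toy <word, first basic sememe> table; the sememe names are invented.
TABLE = {"Taipei": "place", "park": "facility"}


def toy_segment(text):
    """Stand-in for a real word segmenter: split on spaces."""
    return text.split()


first = semantic_info("Taipei", TABLE, toy_segment)
second = semantic_info("Da'an forest park", TABLE, toy_segment)
missing = semantic_info("Da'an", TABLE, toy_segment)
```

Using only the last sub-entity reflects the example in the text, where the final word of the entity name ("park") carries its category. Each lookup is a constant-time dictionary access, so no pairwise similarity between words is ever computed.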

Referring to Fig. 5, another flowchart of obtaining semantic information disclosed by the invention. It comprises:

Step S411: using a pre-stored mapping table of words to semantic codes, look up the semantic codes corresponding to the two Chinese entities and to the path verb respectively.

A <word, semantic code> mapping table is built in advance for all words in Tongyici Cilin; when a word is polysemous and therefore has several semantic codes, the table contains only the mapping between the word and its first semantic code. During lookup, the two Chinese entities and the path verb are each used as a key to search the mapping table and obtain the corresponding semantic code.

Step S412: when the semantic code corresponding to a Chinese entity or to the path verb is found, take a preset number of characters from the high-order (leftmost) end of the semantic code as its semantic information.

When the two Chinese entities and the path verb are common words, their semantic codes can be found directly in the pre-stored mapping table. Each semantic code found is then truncated: a preset number of characters is taken from its high-order end and used as the semantic information of the corresponding Chinese entity or path verb. The preset number may be 1, 2, 4, 5 or 7. In the present invention it is preferably 5; that is, once the semantic code of a Chinese entity or of the path verb is found, its first 5 characters are taken as the semantic information.

In Tongyici Cilin, the semantic code of each word has 8 characters and is divided into 5 levels: major class, middle class, small class, word group and atomic word group. The major class is a capital letter from A to L; the middle class appends a lowercase letter to the capital letter; the small class appends two decimal digits to the lowercase letter; the word group appends a capital letter to the two digits; and the atomic word group appends two more decimal digits to that capital letter. The final character is a marker, one of "=", "#" and "@": "=" means "equal", that is, synonymous; "#" means "unequal", that is, of the same kind, marking related words; "@" means "self-contained" or "independent", that is, the word has neither synonyms nor related words in the dictionary.

If the semantic information is too fine-grained, it lacks generalization ability and the extraction performance drops, so the semantic code obtained must be truncated. Truncation follows the 5 levels of the "synonym speech woods" code: the first 1, 2, 4, 5 or 7 characters of the obtained semantic code C are intercepted, and the new code is called semantic code C+. For example, the complete semantic code of the entity "park" in "synonym speech woods" is Bn20A01; intercepting its first 1, 2, 4, 5 or 7 characters gives B, Bn, Bn20, Bn20A and Bn20A01 respectively. Extraction performance is best when the truncation length is 5 (i.e. the semantic code of "park" is Bn20A), because the semantic code then best balances discrimination ability and generalization ability. For example, "China" and "U.S." are semantically the same at the level of being countries, and their codes in "synonym speech woods" are "Di02A03=" and "Di02A23#" respectively. Although the two codes differ, their first 5 characters are both "Di02A".
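The truncation described above can be illustrated with a minimal sketch. The function name `truncate_code` and the constant `LEVEL_LENGTHS` are hypothetical names chosen for illustration; only the prefix lengths and the example codes come from the description.

```python
# Prefix lengths that correspond to the five levels of a
# "synonym speech woods" code (1, 2, 4, 5 or 7 characters).
LEVEL_LENGTHS = (1, 2, 4, 5, 7)


def truncate_code(code: str, length: int = 5) -> str:
    """Return the high-order `length` characters of a semantic code."""
    if length not in LEVEL_LENGTHS:
        raise ValueError(f"length must be one of {LEVEL_LENGTHS}")
    return code[:length]


# The "park" example: one truncated code per level.
print([truncate_code("Bn20A01", n) for n in LEVEL_LENGTHS])

# "China" and "U.S." share the same 5-character prefix.
print(truncate_code("Di02A03="), truncate_code("Di02A23#"))
```

With the preferred length of 5, the two country codes map to the same semantic information, which is the generalization effect the description relies on.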

When a Chinese entity is an uncommon proper term, its semantic code may not be found directly in the prestored mapping table. In this case, word segmentation can be performed on the Chinese entity, splitting the entity for which no semantic code was found into several new Chinese sub-entities.

Step S413: when no semantic code corresponding to a Chinese entity is found in the mapping table, perform word segmentation on that Chinese entity to obtain several new Chinese sub-entities.

Step S414: using the mapping table, look up the semantic code corresponding to the Chinese sub-entity that appears last among the new Chinese sub-entities.

Step S415: intercept a preset number of characters from the high-order end of the semantic code found for that Chinese sub-entity, and take the intercepted characters as the semantic information of the Chinese entity for which no semantic code was found.

The phrase "Taipei Da'an Forest Park" is again taken as an example. The phrase contains two Chinese entities: the first Chinese entity "Taipei" and the second Chinese entity "Da'an Forest Park". The second Chinese entity is an uncommon proper term for which no entry can be found in the semantic dictionary, so the prestored mapping table contains no entry for it either. In this case, word segmentation is performed on the second Chinese entity "Da'an Forest Park", giving three new Chinese sub-entities: "Da'an", "forest" and "park". The mapping table is then used to look up the semantic code corresponding to "park", the sub-entity that appears last among the three. After the semantic code corresponding to the Chinese sub-entity "park" is found, a preset number of characters is intercepted from its high-order end, and the intercepted characters are taken as the semantic information of the second Chinese entity "Da'an Forest Park", for which no semantic code was found.

Step S416: when no semantic code corresponding to the last-appearing Chinese sub-entity is found, assign "NULL" as the semantic information of the Chinese entity for which no semantic code was found.

In the extreme case where the prestored mapping table still contains no semantic code for the Chinese sub-entity that appears last among the new Chinese sub-entities, the semantic information of the Chinese entity for which no semantic code was found is assigned "NULL".
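Steps S412 to S416 can be sketched as a single lookup with a fallback. This is a minimal illustration under stated assumptions: `semantic_info`, the toy mapping table and the whitespace-based `segment` function are hypothetical stand-ins (a real system would use a Chinese word segmenter and the full mapping table); only the code for "park" is taken from the description.

```python
from typing import Callable, Dict, List


def semantic_info(
    entity: str,
    table: Dict[str, str],
    segment: Callable[[str], List[str]],
    length: int = 5,
) -> str:
    """Look up an entity's semantic code and truncate it (S412).

    If the entity is missing, segment it (S413), look up the sub-entity
    that appears last (S414) and truncate that code (S415). If that also
    fails, return "NULL" (S416).
    """
    code = table.get(entity)
    if code is None:
        parts = segment(entity)
        code = table.get(parts[-1]) if parts else None
    return code[:length] if code is not None else "NULL"


# Toy data: an English stand-in for the mapping table and the segmenter.
toy_table = {"park": "Bn20A01"}
toy_segment = lambda text: text.split()

print(semantic_info("park", toy_table, toy_segment))
print(semantic_info("da'an forest park", toy_table, toy_segment))
print(semantic_info("unknown term", toy_table, toy_segment))
```

Taking the last sub-entity reflects the description's choice of the sub-entity that appears last, which in the example is the head word "park".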

In the flow shown in Fig. 5, a mapping table built from "synonym speech woods", containing the mapping relations between words and their semantic codes, is used to obtain the semantic codes of the Chinese entities and the path verb. A preset number of characters is then intercepted starting from the high-order end of each semantic code, and the intercepted characters are taken as the semantic information of the Chinese entities and the path verb.

Step S5: add the three pieces of acquired semantic information under the root node of the shortest path containing tree according to a preset rule, and determine the expanded shortest path containing tree to be the relation tree of the natural statement.

The acquired semantic information of the two Chinese entities and the path verb can be added under the root node of the shortest path containing tree in several ways. For example, the semantic information corresponding to the first Chinese entity, the second Chinese entity and the path verb may be added under the root node in a preset order, such as from left to right or from right to left. The present invention discloses one method of expanding the shortest path containing tree with semantic information; see Fig. 6.

Fig. 6 is a flow chart of expanding the shortest path containing tree with semantic information according to the present invention. It comprises:

Step S51: add three identifier nodes under the root node of the shortest path containing tree.

The three identifier nodes respectively indicate that the content of their child nodes is the semantic information of the first Chinese entity, the semantic information of the second Chinese entity, and the semantic information of the path verb.

Step S52: write the three pieces of semantic information into the child nodes of their corresponding identifier nodes.

In implementation, the semantic information of the two Chinese entities and the path verb may be basic sememes from "knowledge net" or semantic codes from "synonym speech woods". Although the system could tell from the form of the semantic information whether it is a basic sememe or a semantic code, the present invention preferably distinguishes the type of semantic information by using different identifier nodes, in order to further simplify the relation extraction process.

Specifically, when the semantic information of the two Chinese entities and the path verb consists of basic sememes from "knowledge net", a first identifier node, a second identifier node and a third identifier node are added under the root node of the shortest path containing tree. The first identifier node indicates that the content of its child node is the semantic information of the first Chinese entity, the second identifier node indicates that the content of its child node is the semantic information of the second Chinese entity, and the third identifier node indicates that the content of its child node is the semantic information of the path verb. When the semantic information of the two Chinese entities and the path verb consists of semantic codes from "synonym speech woods", a fourth identifier node, a fifth identifier node and a sixth identifier node are added under the root node of the shortest path containing tree. The fourth identifier node indicates that the content of its child node is the semantic information of the first Chinese entity, the fifth identifier node indicates that the content of its child node is the semantic information of the second Chinese entity, and the sixth identifier node indicates that the content of its child node is the semantic information of the path verb.

The natural statement "reporter is trapped in the airport" is again taken as an example below to describe the process of expanding the shortest path containing tree with semantic information.

In the natural statement "reporter is trapped in the airport", the first Chinese entity is "reporter", the second Chinese entity is "airport", and the path verb is "delay". The shortest path containing tree of this natural statement is shown in Fig. 8.

When the semantic information uses basic sememes from "knowledge net", "reporter", "airport" and "delay" are each used as a keyword or search condition to search the mapping table containing the mapping relations between words and their basic sememes. This yields the basic sememe "people" of the first Chinese entity "reporter", the basic sememe "facility" of the second Chinese entity "airport", and the basic sememe "suffering" of the path verb "delay". Three identifier nodes SHN1, SHN2 and SHNV are then added under the root node IP of the shortest path containing tree, where SHN1 indicates that its child node holds the semantic information of the first Chinese entity, SHN2 indicates that its child node holds the semantic information of the second Chinese entity, and SHNV indicates that its child node holds the semantic information of the path verb. Finally, the basic sememe "people" of the first Chinese entity "reporter" is written into the child node of SHN1, the basic sememe "facility" of the second Chinese entity "airport" is written into the child node of SHN2, and the basic sememe "suffering" of the path verb "delay" is written into the child node of SHNV, giving the relation tree shown in Fig. 9.

When the semantic information uses semantic codes from "synonym speech woods", "reporter", "airport" and "delay" are each used as a keyword or search condition to look up, in the mapping table containing the mapping relations between words and their semantic codes, the semantic codes of the first Chinese entity "reporter", the second Chinese entity "airport" and the path verb "delay". The first 5 characters of the three semantic codes found are intercepted, which determines the semantic information of the first Chinese entity "reporter" to be "Ae16C", that of the second Chinese entity "airport" to be "Cb08B", and that of the path verb "delay" to be "Hj02A". Three identifier nodes SC1, SC2 and SCV are then added under the root node IP of the shortest path containing tree, where SC1 indicates that its child node holds the semantic information of the first Chinese entity, SC2 indicates that its child node holds the semantic information of the second Chinese entity, and SCV indicates that its child node holds the semantic information of the path verb. Finally, the semantic information "Ae16C" of the first Chinese entity "reporter" is written into the child node of SC1, the semantic information "Cb08B" of the second Chinese entity "airport" is written into the child node of SC2, and the semantic information "Hj02A" of the path verb "delay" is written into the child node of SCV, giving the relation tree shown in Fig. 10.
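Steps S51 and S52 can be sketched as follows. This is only an illustration under assumptions: the nested-list tree encoding, the function names `expand_tree` and `to_brackets`, the dictionary keys `"sememe"` and `"code"`, and the toy tree standing in for Fig. 8 are all hypothetical; the identifier node labels and the semantic information values are those given in the example above.

```python
# Identifier node labels per semantic resource, as used in the example.
MARKERS = {
    "sememe": ("SHN1", "SHN2", "SHNV"),  # basic sememes from "knowledge net"
    "code": ("SC1", "SC2", "SCV"),       # codes from "synonym speech woods"
}


def expand_tree(tree, sem_e1, sem_e2, sem_verb, source="code"):
    """Return a new tree with three identifier nodes added, left to right,
    under the root; each identifier node has its semantic information
    as its only child."""
    m1, m2, mv = MARKERS[source]
    return tree + [[m1, sem_e1], [m2, sem_e2], [mv, sem_verb]]


def to_brackets(node):
    """Render a nested-list tree [label, child, ...] as a bracketed string."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [to_brackets(c) for c in node[1:]]) + ")"


# Hypothetical shortest path containing tree (not the actual tree of Fig. 8).
toy_tree = ["IP", ["NP", ["NN", "reporter"]],
            ["VP", ["VV", "delay"], ["NP", ["NN", "airport"]]]]

relation_tree = expand_tree(toy_tree, "Ae16C", "Cb08B", "Hj02A", source="code")
print(to_brackets(relation_tree))
```

Because the identifier nodes are ordinary tree nodes, the expanded tree can be passed to a tree kernel without any change to the kernel itself.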

Step S6: classify the relation tree using a prestored classification model.

The present invention uses an SVM classifier that supports the convolution tree kernel function. An SVM classifier is in essence a binary classifier, but in semantic relation extraction, apart from relation detection, the identification of relation major types and relation subtypes is a multi-class classification problem. The present invention therefore uses the one-versus-rest method to convert the binary classifier into a multi-class classifier: for k classes, k classifiers are constructed, where classifier i (i = 1, 2, ..., k) separates class i from all other classes. The k trained classifiers are then used to predict on the test set, and the most probable prediction (for an SVM classifier, the one with the largest margin) is selected as the final class.
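The one-versus-rest selection described above can be sketched as follows. The function name `predict_one_vs_rest`, the class labels and the constant-valued toy classifiers are hypothetical; in practice each entry would be the decision function of a trained binary SVM.

```python
from typing import Callable, Dict, Tuple


def predict_one_vs_rest(
    classifiers: Dict[str, Callable[[object], float]], x: object
) -> Tuple[str, Dict[str, float]]:
    """Score x with each of the k binary classifiers and return the class
    whose classifier reports the largest margin, plus all scores."""
    scores = {label: decide(x) for label, decide in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy decision functions returning fixed margins for any input.
toy_classifiers = {
    "class_a": lambda x: -0.7,
    "class_b": lambda x: 1.3,
    "class_c": lambda x: 0.2,
}

label, scores = predict_one_vs_rest(toy_classifiers, x="some relation tree")
print(label, scores)
```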

The Chinese inter-entity semantic relation extraction method disclosed above further includes a step of obtaining the classification model. The process of obtaining the classification model is as follows:

Prepare a number of training instances $\{x_i, y_i\}$ ($i = 1, \ldots, n$) to form the training corpus, where $x_i$ is the relation tree of the training instance and $y_i$ is its label. The label is 1 or -1, where 1 indicates that a relation exists and -1 indicates that no relation exists. For the process of determining the relation tree of a training instance, see the description of steps S1 to S5.

Use an SVM to learn from the training instances. The SVM is a general learning method developed on the basis of statistical learning theory. It is an approximate implementation of the structural risk minimization principle, because it minimizes the empirical risk and the bound on the VC dimension at the same time. Learning with an SVM means finding a hyperplane in a high-dimensional space that separates the data of the two classes as widely as possible. This problem can be converted into a quadratic programming optimization problem, namely finding the $\alpha$ that maximizes the following expression:

$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to the following conditions:

$$C \ge \alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Here, $\{x_i, y_i\}$ is a pair consisting of a relation instance in the training corpus (i.e. the relation tree of a training instance) and its class label, $\alpha_i$ is the weight of each training instance, and $C$ is the penalty parameter associated with the slack variables, which bounds each weight from above. In general the weights of most instances are 0; the training instances whose weights are not 0 are called support vectors.

During learning, the similarity between two relation trees must be computed, i.e. the kernel function $K(x_i, x_j)$.

The present invention uses the convolution tree kernel function to compute the similarity between two relation trees. The convolution tree kernel (CTK) measures the structural similarity between two trees by counting the identical subtrees they share. It is computed as:

$$K_{CTK}(T_1, T_2) = \sum_{n_1 \in N_1,\, n_2 \in N_2} \Delta(n_1, n_2)$$

Here, $N_1$ and $N_2$ are the node sets of the two relation trees $T_1$ and $T_2$ respectively, and $\Delta(n_1, n_2)$ computes the similarity between the two subtrees rooted at $n_1$ and $n_2$. It can be obtained by the following recursion:

1) If the productions (under the context-free grammar) at $n_1$ and $n_2$ differ, then $\Delta(n_1, n_2) = 0$; otherwise go to 2).

2) If $n_1$ and $n_2$ are part-of-speech (POS) tags, then $\Delta(n_1, n_2) = \lambda$; otherwise go to 3).

3) Compute recursively:

$$\Delta(n_1, n_2) = \lambda \prod_{k=1}^{\#ch(n_1)} \bigl(1 + \Delta(ch(n_1, k), ch(n_2, k))\bigr)$$

Here, $\#ch(n)$ is the number of child nodes of node $n$, $ch(n, k)$ is the $k$-th child node of node $n$, and $\lambda$ ($0 < \lambda < 1$) is a decay factor that prevents the subtree similarity from depending too heavily on the size of the subtrees.
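The recursion above can be sketched in a few lines. This is a minimal, unoptimized illustration rather than the implementation used by the invention: trees are assumed to be well-formed parse trees encoded as nested tuples `(label, child, ...)` with words as plain strings, and the function names and the default value of `lam` are hypothetical choices.

```python
from itertools import product
from math import prod


def production(node):
    """Production at a node: its label plus its children's labels or words.
    Child labels are wrapped in 1-tuples so they never collide with words."""
    return (node[0],) + tuple(
        c if isinstance(c, str) else (c[0],) for c in node[1:]
    )


def is_preterminal(node):
    """True when every child is a word, e.g. a POS tag or identifier node."""
    return all(isinstance(c, str) for c in node[1:])


def internal_nodes(tree):
    """Yield every non-leaf node of the tree."""
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from internal_nodes(child)


def delta(n1, n2, lam):
    if production(n1) != production(n2):   # rule 1
        return 0.0
    if is_preterminal(n1):                 # rule 2
        return lam
    return lam * prod(                     # rule 3
        1.0 + delta(c1, c2, lam) for c1, c2 in zip(n1[1:], n2[1:])
    )


def tree_kernel(t1, t2, lam=0.4):
    """Sum delta over all pairs of internal nodes of the two trees."""
    return sum(
        delta(a, b, lam)
        for a, b in product(internal_nodes(t1), internal_nodes(t2))
    )


t_dog = ("NP", ("DT", "the"), ("NN", "dog"))
t_cat = ("NP", ("DT", "the"), ("NN", "cat"))
print(tree_kernel(t_dog, t_dog, lam=1.0), tree_kernel(t_dog, t_cat, lam=1.0))
```

With `lam=1.0` the kernel simply counts common subtree fragments, which makes small cases easy to check by hand; values below 1 down-weight larger fragments as described above.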

In other words, obtaining the classification model is the process of obtaining the support vectors and their weights from the training instances. The classification model can be expressed as $\{x_i, y_i, \alpha_i\}$, $i = 1, \ldots, S$, where $S$ is the number of support vectors obtained by learning and $\alpha_i$ is the weight of the corresponding support vector.
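A model of this form can be applied to a new instance with the standard SVM decision function, as in the sketch below. The names `decision_value` and `classify`, the bias term (which the description does not mention and which is assumed to be 0 here) and the numeric toy kernel standing in for the convolution tree kernel are all assumptions made for illustration.

```python
def decision_value(model, kernel, x, bias=0.0):
    """Weighted sum of kernel values between x and each support vector.

    `model` is a list of (support_vector, label, weight) triples, i.e. the
    {x_i, y_i, alpha_i} of the classification model.
    """
    return sum(alpha * y * kernel(sv, x) for sv, y, alpha in model) + bias


def classify(model, kernel, x, bias=0.0):
    """Return 1 (relation exists) or -1 (no relation)."""
    return 1 if decision_value(model, kernel, x, bias) >= 0 else -1


# Toy example: numbers stand in for relation trees, and a product stands
# in for the tree kernel. The weights satisfy sum(alpha_i * y_i) = 0.
toy_model = [(1.0, 1, 0.5), (-1.0, -1, 0.5)]
toy_kernel = lambda a, b: a * b

print(decision_value(toy_model, toy_kernel, 2.0))
print(classify(toy_model, toy_kernel, 2.0), classify(toy_model, toy_kernel, -3.0))
```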

The Chinese inter-entity semantic relation extraction method disclosed by the present invention incorporates lexical semantic information. For the two Chinese entities and the path verb between them, the corresponding semantic information is extracted from "synonym speech woods" or "knowledge net" and added to the shortest path containing tree, giving the relation tree that represents the relation instance; a machine learning method based on the tree kernel function is then used to extract the semantic relation between the Chinese entities. Because the relation tree of the present invention contains both rich structured information and lexical semantic information, it generalizes better; compared with using only the structured information of the syntax tree, both the precision and the recall of relation extraction improve, and overall performance is better. Moreover, because lexical semantic information (such as a semantic code or a basic sememe) generalizes vocabulary to some extent, the resulting relation tree makes it possible to recognize relation instances that do not occur in the training corpus but share the same semantic information. This reduces the amount of training corpus that must be annotated and lessens the dependence of machine-learning-based relation extraction methods on large-scale corpora. Finally, compared with other kernel function methods that use lexical semantic similarity, the present invention only needs to add the semantic information of the Chinese entities and the path verb under the root node of the syntax tree and does not need to compute the semantic similarity between every pair of words, thereby avoiding the heavy computation that such similarity calculation entails.

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for extracting semantic relations between Chinese entities, characterized in that the method is used to extract the semantic relation between entities from a natural statement containing two Chinese entities, the method comprising:

2. The method according to claim 1, characterized in that extracting, from the shortest path containing tree, the path verb nearest to the second Chinese entity specifically comprises:

searching the child nodes of the node whose label is "VP";

3. The method according to claim 1 or 2, characterized in that acquiring the semantic information of the two Chinese entities and the path verb respectively specifically comprises:

4. The method according to claim 3, characterized in that, when no basic sememe corresponding to a Chinese entity is found in the mapping table, the method further comprises:

5. The method according to claim 4, characterized in that, when no basic sememe corresponding to the last-appearing Chinese sub-entity is found, the semantic information of the Chinese entity for which no basic sememe was found is assigned "NULL".

6. The method according to claim 1 or 2, characterized in that acquiring the semantic information of the two Chinese entities and the path verb respectively specifically comprises:

7. The method according to claim 6, characterized in that, when no semantic code corresponding to a Chinese entity is found in the mapping table, the method further comprises:

8. The method according to claim 7, characterized in that, when no semantic code corresponding to the last-appearing Chinese sub-entity is found, the semantic information of the Chinese entity for which no semantic code was found is assigned "NULL".

9. The method according to claim 1, characterized in that adding the three pieces of acquired semantic information under the root node of the shortest path containing tree according to the preset rule specifically comprises: