CN101799802B

CN101799802B - Method and system for extracting entity relationship by using structural information

Info

Publication number: CN101799802B
Application number: CN200910000499.5A
Authority: CN
Inventors: 许洪志; 胡长建; 沈国阳
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2009-02-05
Filing date: 2009-02-05
Publication date: 2014-04-23
Anticipated expiration: 2029-02-05
Also published as: CN101799802A

Abstract

The invention provides a method and a system for extracting entity relationships by using structural information. The method comprises the following steps: acquiring a corpus comprising a plurality of sentences annotated with relations; acquiring a set of dependency tree patterns related to the sentence structures in the corpus; extracting features of each sentence in the corpus with reference to the dependency tree patterns, wherein the features comprise structural features of the sentence; collecting the extracted features to train a relation labeling model; and applying the relation labeling model to unannotated sentences to extract relation instances. The invention further discloses a process for automatically extracting the dependency tree patterns. Compared with the prior art, the relation extraction system and method of the invention can achieve better performance.

Description

Method and system for entity relationship extraction using structural information

Technical field

The present invention relates generally to natural language processing and, more specifically, to a method and system for entity relationship extraction using structural information.

Background art

With the sustained growth of digital information and its ever-increasing availability, users demand more and more intelligent information analysis, and traditional information retrieval techniques can no longer meet these demands. Users want computer systems to play a more important role in understanding what a text expresses. For example, users need systems that automatically extract the relations between entities in text.

Relation extraction (RE) can be used in many fields. For example, examining open-domain text and extracting causal relations from it can help in developing question-answering (Q-A) systems. As another example, relations between genes and diseases can be mined from the biomedical literature for disease risk marking, diagnosis and prognosis, and social networks can be extracted from online community websites so that users can later be given better information recommendations.

The performance of applications based on relational knowledge depends heavily on the quality of the algorithm or method chosen for relation extraction. End users benefit greatly from high-quality relation instances. Therefore, to build high-performance applications, improving the accuracy of relation extraction has become a common problem.

Meanwhile, relation extraction cannot be solved merely by applying syntactic analysis to text (for example, sentences), because solving it also depends on finding some semantic information. However, the performance of existing semantic analysis is not good enough, so making the most of imperfect semantic techniques is also a challenging problem.

Many methods have been developed in the prior art to solve the relation extraction problem, but their performance in practical applications is still unsatisfactory. A basic approach is to learn flat text patterns (for example, regular expressions) from an annotated training corpus and then extract relations using the learned patterns. Regular expressions can be learned from sentences in which the relation arguments have been annotated. For example, the article "Snowball: Extracting Relations from Large Plain-Text Collections" by Eugene Agichtein and Luis Gravano (see Proc. of the 5th ACM Conference on Digital Libraries, 2000) proposes an algorithm for extracting "organization-location" pairs. The algorithm generates patterns by generalizing the contexts of the relation arguments. The extracted candidate patterns are then evaluated automatically, and only those with high confidence are retained for finding new relation instances. Newly found relation instances are in turn used to extract more candidate patterns. Through iteration, the algorithm obtains a large number of relation instances with reasonable accuracy. The content of this article is incorporated herein by reference in its entirety for all purposes.

Because relation extraction can be regarded as a sequence labeling problem, existing sequence labeling methods (for example, hidden Markov models (HMM), maximum entropy (ME) and conditional random fields (CRF)) can be used to solve it. Widely used features currently include context words, part-of-speech (POS) tags of context words, window features indicating whether a pair of entities (correspondingly labeled as a pair of roles in the dependency tree, also called arguments) lie within the same window, and features extracted from a dependency tree or syntactic parse tree. For example, the article "Combining lexical, syntactic and semantic features with maximum entropy models for extracting relations" by K. Nanda (see Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), 2004) uses features extracted from a dependency tree or syntactic parse tree. The features adopted include the path from the first argument of the relation to the second argument on the parse tree or dependency tree, and the parent nodes, context words and POS tags of the first and second arguments on the dependency tree. The method trains a model using maximum entropy (ME) on a training corpus and uses the model to extract new relation instances. The content of this article is incorporated herein by reference in its entirety for all purposes.

In addition, relation extraction can also be regarded as a classification problem, so another class of relation extraction techniques is based on the kernel method. The kernel method is a non-parametric density estimation technique that computes a kernel function between data instances, where the kernel function can be understood as a similarity measure. Relevant kernel functions can be defined over word strings (the bag of words of a sentence) or dependency trees (the structural information of a sentence). By using such a kernel in a support vector machine (SVM), relation instances can be detected and extracted. The article "Dependency Tree Kernels for Relation Extraction" by Aron Culotta and Jeffrey Sorensen (see Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), 2004) proposes a kernel function for dependency trees. The corresponding features include the POS tag of a tree node, the dependency type, the entity type (for example, "person" or "organization") and the role (for example, "argument ARG-A" and "argument ARG-B"). The function first checks whether the roots of the two dependency trees are the same. If the roots differ, the similarity score of the two dependency trees is 0; otherwise, the function computes the similarity between the child nodes. Finally, the kernel function is used in an SVM to train a classifier for relation extraction. The content of this article is incorporated herein by reference in its entirety for all purposes.
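The root check and child comparison described above can be illustrated with a deliberately simplified sketch. It is not the kernel from the cited article, which uses richer node features and weights sequences of children; the node layout (`pos`, `role`, `children`), the matching rule and the additive scoring are all assumptions made only for illustration.

```python
def nodes_match(a, b):
    """Assumed matching rule: same POS tag and same role."""
    return a["pos"] == b["pos"] and a.get("role") == b.get("role")


def tree_similarity(t1, t2):
    """Return 0 if the roots differ, else 1 plus the similarity of all child pairs."""
    if not nodes_match(t1, t2):
        return 0
    score = 1
    for c1 in t1.get("children", []):
        for c2 in t2.get("children", []):
            score += tree_similarity(c1, c2)
    return score


def node(pos, role=None, children=()):
    """Build a toy tree node (hypothetical layout)."""
    return {"pos": pos, "role": role, "children": list(children)}


# Two identical toy trees: a verb governing a noun argument and a preposition.
tree_a = node("VBZ", children=[node("NNP", "ARG-A"),
                               node("IN", children=[node("NNP", "ARG-B")])])
tree_b = node("VBZ", children=[node("NNP", "ARG-A"),
                               node("IN", children=[node("NNP", "ARG-B")])])
# A tree with a different root: the similarity collapses to 0.
tree_c = node("NN", children=[node("NNP", "ARG-A")])

print(tree_similarity(tree_a, tree_b))
print(tree_similarity(tree_a, tree_c))
```

A function of this shape could be plugged into an SVM as a precomputed similarity matrix, although whether the simplified score is a valid kernel is not examined here.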

However, all of the above existing methods ignore semantic information during relation extraction and focus only on shallow syntactic information, such as the POS tag of the word on which the current word depends, or the path on the dependency tree from relation argument "ARG-A" to "ARG-B". Such "parent node" or "path" information does not contain enough useful semantic information to discriminate between relations, so existing relation extraction schemes perform poorly.

In fact, a relation can be determined by a certain substructure of the dependency tree that carries complete semantic meaning. This means that checking a few branches of a sentence's dependency tree is enough to detect a relation. However, the prior art has not yet proposed an effective method for finding these key substructures.

Summary of the invention

In view of the above problems, the present invention aims to provide a more accurate and efficient entity relation extraction method and system. Specifically, the technique of the invention first extracts key substructures, referred to as "dependency tree patterns", from dependency trees that contain actual relation instances. The extracted dependency tree patterns can then be used to improve the accuracy of relation extraction.

According to a first aspect of the invention, a method for relation extraction is provided, comprising: obtaining a corpus, the corpus comprising a plurality of sentences annotated with relations; obtaining a set of dependency tree patterns related to the sentence structures in the corpus; extracting features of each sentence in the corpus with reference to the dependency tree patterns, the features comprising structural features of the sentence; collecting the extracted features to train a relation labeling model; and applying the relation labeling model to unannotated sentences to extract relation instances.

According to a second aspect of the invention, a system for relation extraction is provided, comprising: a corpus acquisition device for obtaining a corpus, the corpus comprising a plurality of sentences annotated with relations; a dependency tree pattern acquisition device for obtaining a set of dependency tree patterns related to the sentence structures in the corpus; a feature extraction device for extracting features of each sentence in the corpus with reference to the obtained dependency tree patterns, the features comprising structural features of the sentence; a relation labeling model training device for collecting the features extracted by the feature extraction device to train a relation labeling model; and a model application device for applying the relation labeling model to unannotated sentences to extract relation instances.

As can be seen, the operation of the system of the invention can be divided into two stages: a model training stage and a model application stage.

During the model training stage, a highly accurate relation labeling model can be obtained through the following operations:

1. First, model training requires a relation-annotated corpus C_r; a set of dependency tree patterns, denoted TPs, also needs to be prepared in advance from the corpus C_r.

2. The prepared dependency tree patterns can then be used to extract the required features for each sentence in the corpus C_r, including structural features and traditional features (for example, contextual features).

3. The extracted features are then collected and used to train the relation labeling model. Traditional machine learning techniques can be used to train the relation labeling model.

4. The generated relation labeling model is stored for future use.

In the model application stage, the system of the invention can extract relation instances effectively through the following operations:

5. The user inputs unannotated text from which relation instances are to be extracted; the text is processed sentence by sentence.

6. Each input sentence is parsed to obtain its dependency tree.

7. The set of dependency tree patterns prepared in the model training stage can now be used to extract the features corresponding to the input sentence.

8. The relation arguments of the input sentence are labeled according to the extracted features.

9. Finally, the generated relation labeling model is applied to the sentence whose relation arguments have been labeled, to extract relation instances.

The set of dependency tree patterns can be created in advance by the user or extracted automatically from the corpus C_r. For the case of automatic extraction from C_r, the invention proposes the following dependency tree pattern extraction method:

1. Parse each annotated sentence in the corpus into its dependency tree. The dependency trees can be created automatically by the system; ideally, they can also be created manually by the user.

2. Cluster all dependency trees into different groups so that the dependency trees in the same group are structurally highly similar. For example, in one embodiment of the invention, the similarity function of dependency trees can be defined on the basis of the minimum embedded subtree pattern (LEST). For two dependency trees t₁ and t₂, the similarity function Sim(t₁, t₂) takes two values: Sim(t₁, t₂) = 1 when t₁ and t₂ have the same LEST, and Sim(t₁, t₂) = 0 when t₁ and t₂ have different LESTs.

3. Use a subtree mining algorithm to extract one or more closed dependency tree patterns. For example, extraction can proceed iteratively as follows:

3.1 Use the LEST of each group as the initial set S_p of seed patterns;

3.2 Add one additional node to each seed pattern in S_p to generate a set of new candidate seed patterns;

3.3 Check the support of each candidate seed pattern in order to delete useless candidate seed patterns; the deletion rule can, for example, be defined as follows:

3.3.1 If the supports of all candidate seed patterns generated from a seed pattern are less than the support of that seed pattern, output the seed pattern as a closed dependency tree pattern, and for each candidate seed pattern generated from that seed pattern:

If the support of the candidate seed pattern is less than a specified threshold, delete the candidate seed pattern;

If the support of the candidate seed pattern is greater than or equal to the specified threshold, retain the candidate seed pattern; otherwise

3.3.2 If the support of a candidate seed pattern equals the support of the seed pattern that generated it, retain that candidate seed pattern, and for each other candidate seed pattern of that seed pattern:

If the support of the candidate seed pattern is less than the specified threshold, delete the candidate seed pattern;

If the support of the candidate seed pattern is greater than or equal to the specified threshold, retain the candidate seed pattern;

3.4 Use the retained candidate seed patterns as the new seed patterns S_p and repeat steps 3.2 and 3.3 until the set of seed patterns is empty.
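The clustering and iterative mining steps above can be sketched as follows. This is a minimal illustration under strong assumptions, not the patented implementation: each tree and pattern is modelled as a `frozenset` of `(parent, child)` label pairs, node labels are assumed unique within a tree, "contains" is reduced to edge-set inclusion, and "adding one node" is reduced to adding one edge that touches the pattern. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def supp(pattern, trees):
    """Support: number of trees whose edge set contains the pattern."""
    return sum(1 for tree in trees if pattern <= tree)


def cluster_by_lest(trees, lests):
    """Step 2: Sim(t1, t2) = 1 exactly when two trees share the same LEST."""
    groups = defaultdict(list)
    for tree, lest in zip(trees, lests):
        groups[lest].append(tree)
    return groups


def extend(seed, trees):
    """Step 3.2: grow the seed by one node, i.e. one edge touching the seed."""
    nodes = {label for edge in seed for label in edge}
    candidates = set()
    for tree in trees:
        if seed <= tree:
            for edge in tree - seed:
                if edge[0] in nodes or edge[1] in nodes:
                    candidates.add(seed | {edge})
    return candidates


def mine_closed_patterns(trees, seeds, min_supp):
    closed = set()
    frontier = set(seeds)                      # step 3.1
    while frontier:                            # step 3.4
        retained = set()
        for seed in frontier:
            seed_supp = supp(seed, trees)
            supports = {c: supp(c, trees) for c in extend(seed, trees)}
            # Rule 3.3.1: every candidate is rarer than the seed, so output the seed.
            if all(s < seed_supp for s in supports.values()):
                closed.add(seed)
            # Rules 3.3.1 and 3.3.2: keep equal-support candidates and those
            # that reach the threshold; all other candidates are deleted.
            for cand, s in supports.items():
                if s == seed_supp or s >= min_supp:
                    retained.add(cand)
        frontier = retained
    return closed


# Toy corpus: three trees sharing a four-edge core (labels are made up).
core = frozenset({("V", "N1"), ("V", "N2"), ("V", "P"), ("P", "N3")})
trees = [core, core | {("V", "ADV")}, core | {("N1", "DET")}]
lests = [frozenset({("V", "N1"), ("V", "N2")})] * 3   # assumed LEST of each tree
groups = cluster_by_lest(trees, lests)
patterns = mine_closed_patterns(trees, groups.keys(), min_supp=2)
print(patterns == {core})
```

Because every retained candidate is one edge larger than its seed, the loop stops once no candidate survives the support check. Following the steps literally, a seed that has no candidates at all is also output as closed.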

Utilize system and method for the present invention can excavate useful sentence structure information and use it for relation extraction.And, compared with the conventional method, be of the present inventionly related to that extraction system and method can realize better performance.

Particularly, sentence structure information is the fine indication to actual relationship.In some sentence, the word of indexical relation is usually located on the fixed position on relevant dependency tree.That is, in one group of dependency tree, conventionally comprise some potential public subtree pattern.These subtree patterns can be indicated actual entity relationship well.

In addition, the extracted sentence structure information is also very useful for filtering out spurious relation instances. Structural information (that is, dependency tree patterns) can be extracted from the grammatical relations between the words of a sentence. These dependency tree patterns make it easy to distinguish the structure of a correct relation from that of a false one. For example, in the sentence "Tom, the brother of Kate, works in Microsoft now.", the false relation <person-organization, Kate, Microsoft> might be formed. With traditional methods (for example, regular expressions), this false relation is likely to be extracted as well. With the system of the invention, however, it can be filtered out effectively, because "the brother of Kate" is parsed as a subtree of the node "Tom"; structurally, a relation instance can hardly be formed between "Kate" and "Microsoft". On the other hand, a traditional regular expression that is not carefully constructed may miss a correct relation such as <person-organization, Tom, Microsoft>. With the system of the invention, such a correct relation can easily be detected from the extracted dependency tree patterns.

Furthermore, the invention adopts a more effective way of integrating sentence structure features with traditional features. Because sentence structures can be very complex and errors may occur during parsing, some dependency tree patterns may contain noise. The extracted dependency tree patterns therefore cannot be used directly and on their own to extract relations. The proposed method builds binary features that reflect whether the dependency tree of a sentence satisfies a given dependency tree pattern. By applying a feature-based machine learning algorithm (such as CRF or SVM), these features can be used together with other traditional features to train the relation labeling model.

Brief description of the drawings

The invention will be better understood from the following detailed description of embodiments of the invention, taken in conjunction with the accompanying drawings, in which like reference numerals indicate like parts:

Figure 1A and Figure 1B are schematic diagrams that help describe the key concepts used in the invention;

Fig. 2 is a block diagram showing the internal structure of a relation extraction system 200 according to an embodiment of the invention;

Fig. 3 is a flowchart showing an operation example of system 200 shown in Fig. 2;

Fig. 4 is a block diagram showing the detailed structure of the feature extraction device included in system 200;

Fig. 5 is a schematic diagram showing examples of sentence dependency trees obtained by parsing;

Fig. 6 is a schematic diagram describing the process of feature extraction with reference to dependency tree patterns;

Fig. 7 is a block diagram showing the detailed structure of the dependency tree pattern extraction device included in system 200;

Fig. 8 is a block diagram showing the detailed structure of an example of the dependency tree pattern extraction unit included in Fig. 7;

Fig. 9 is a flowchart showing a dependency tree pattern extraction process according to an embodiment of the invention; and

Figure 10 is a schematic diagram illustrating an example of the candidate seed pattern pruning operation carried out in the dependency tree pattern extraction process.

Detailed description of the embodiments

To better describe the dependency tree pattern extraction process proposed by the invention, the basic concepts used in the description are first briefly introduced below.

Relation extraction: relation extraction is a technique for finding the relation between two entities. For example, for the English sentence "Tom works for Microsoft in Seattle.", relation extraction can detect the following two relations: (1) relation 1: <person-organization, Tom, Microsoft>; and (2) relation 2: <organization-location, Microsoft, Seattle>.

Dependency tree: a dependency tree is a representation of the grammatical relations between the elements of a sentence. Taking the above sentence "Tom works for Microsoft in Seattle." as an example, the structure of its dependency tree can be as shown in Figure 1A, in which the part of speech (POS) of each sentence element and the embedded relation are also marked.
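As a concrete illustration of these two concepts, the sketch below stores a dependency tree for the example sentence as a list of nodes with head indices. Figure 1A is not reproduced here, so the attachment of each word, the POS tags and the role names are assumptions based on a conventional parse, and the `Node` class is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    word: str
    pos: str                      # part-of-speech tag (assumed tag set)
    head: Optional[int]           # index of the governing word; None for the root
    role: Optional[str] = None    # role in the relation, if any


# Assumed parse of "Tom works for Microsoft in Seattle."
sentence = [
    Node("Tom", "NNP", 1, "ARG-1"),
    Node("works", "VBZ", None),
    Node("for", "IN", 1),
    Node("Microsoft", "NNP", 2, "ARG-2"),
    Node("in", "IN", 1),
    Node("Seattle", "NNP", 4),
]

# A relation instance written as <type, first argument, second argument>.
relation = ("person-organization", "Tom", "Microsoft")


def children(index, nodes):
    """Indices of the words governed directly by nodes[index]."""
    return [i for i, n in enumerate(nodes) if n.head == index]


def render(index, nodes, depth=0):
    """Return the subtree rooted at nodes[index] as indented lines."""
    lines = ["  " * depth + f"{nodes[index].word}/{nodes[index].pos}"]
    for child in children(index, nodes):
        lines.extend(render(child, nodes, depth + 1))
    return lines


root = next(i for i, n in enumerate(sentence) if n.head is None)
print("\n".join(render(root, sentence)))
```

Storing only the head index per word keeps the structure compact; child lists are derived on demand.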

Crossover node: the crossover node n of two nodes n1 and n2 on a dependency tree t is denoted crs(n1, n2, t) = n and is defined as the first common node of the path n1 → root(t) and the path n2 → root(t). For example, as shown in Figure 1B, on dependency tree T the crossover node of node E and node P is node P, that is, crs(E, P, T) = P, and the crossover node of node P and node A is node B, that is, crs(P, A, T) = B.
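The definition can be sketched directly: walk from each node to the root and return the first node the two paths share. The tree below is a hypothetical one chosen to agree with the two values quoted from Figure 1B; the actual figure may contain more nodes or a different layout.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def crs(n1, n2, parent):
    """First node common to the paths n1 -> root and n2 -> root."""
    on_second_path = set(path_to_root(n2, parent))
    for node in path_to_root(n1, parent):
        if node in on_second_path:
            return node
    raise ValueError("nodes do not belong to the same tree")


# Hypothetical tree T given as child -> parent (the root maps to None).
T = {"B": None, "P": "B", "D": "B", "E": "P", "A": "D", "G": "D"}

print(crs("E", "P", T))
print(crs("P", "A", T))
```

Note that a node counts as lying on its own path to the root, which is why the crossover node of a node and one of its ancestors is the ancestor itself.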

Dependency tree pattern: according to the invention, a dependency tree pattern is defined as a closed subtree of a dependency tree that retains all crossover nodes and indicates the relation between a pair of entities.

Support of a dependency tree pattern: the support of a dependency tree pattern p is denoted supp(p) and can be defined as the total number of dependency trees that contain the pattern p. If a dependency tree t contains the dependency tree pattern p, t is said to satisfy p.

Frequent dependency tree pattern: if the support of a dependency tree pattern is greater than a predetermined threshold "min_supp", the pattern is said to be frequent and is called a frequent dependency tree pattern.

Maximal dependency tree pattern: if a dependency tree pattern p is frequent and there is no other frequent pattern p' such that p' contains p, then p is called a maximal dependency tree pattern.

Closed dependency tree pattern: if a dependency tree pattern p is frequent and there is no other pattern p' with the same support as p such that p' contains p, then p is called a closed dependency tree pattern.
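The four definitions above can be checked on a toy example. In this sketch trees and patterns are modelled as sets of `(parent, child)` edges and "contains" is simplified to edge-set inclusion; real subtree matching on labelled dependency trees is more involved. The function names, the candidate list and the data are illustrative assumptions.

```python
def supp(pattern, trees):
    """Number of trees that contain the pattern."""
    return sum(1 for tree in trees if pattern <= tree)


def is_frequent(pattern, trees, min_supp):
    """Frequent: support strictly greater than min_supp, as defined above."""
    return supp(pattern, trees) > min_supp


def is_maximal(pattern, candidates, trees, min_supp):
    """Frequent, and no frequent candidate strictly contains it."""
    if not is_frequent(pattern, trees, min_supp):
        return False
    return not any(pattern < other and is_frequent(other, trees, min_supp)
                   for other in candidates)


def is_closed(pattern, candidates, trees, min_supp):
    """Frequent, and no strictly larger candidate has the same support."""
    if not is_frequent(pattern, trees, min_supp):
        return False
    own = supp(pattern, trees)
    return not any(pattern < other and supp(other, trees) == own
                   for other in candidates)


ab, ac, ad = ("a", "b"), ("a", "c"), ("a", "d")
trees = [frozenset({ab, ac}), frozenset({ab, ac, ad}), frozenset({ab})]

p_ab = frozenset({ab})
p_ac = frozenset({ac})
p_ab_ac = frozenset({ab, ac})
p_all = frozenset({ab, ac, ad})
candidates = [p_ab, p_ac, p_ab_ac, p_all]

for p in candidates:
    print(sorted(p), supp(p, trees),
          is_maximal(p, candidates, trees, 1),
          is_closed(p, candidates, trees, 1))
```

In this toy data the pattern containing only the edge a-c is frequent but not closed, because the larger pattern with edges a-b and a-c occurs in exactly the same trees.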

Minimum embedded subtree pattern (LEST): the LEST of a dependency tree t is a dependency tree pattern p of minimum size under the containment relation, in which all crossover nodes are retained. That is, for every pair of nodes n1 and n2 in pattern p, crs(n1, n2, p) = crs(n1, n2, t) must hold. Referring to Figure 1B, for the dependency tree T on the left, suppose node "P" represents a person and node "A" represents an organization, and a relation exists between nodes "P" and "A". The LEST of T can then be as shown in (1) of Figure 1B. However, (2) in Figure 1B cannot serve as the LEST of T, because the crossover node between nodes "G" and "A" is "D" rather than "B".
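One possible reading of this definition is sketched below: keep the nodes that take part in the relation together with their pairwise crossover nodes, and attach each kept node to its nearest kept ancestor. This interpretation, the helper names and the example tree are assumptions; the tree is a hypothetical one consistent with the crossover values quoted in the text, not a copy of Figure 1B.

```python
from itertools import combinations


def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def crs(n1, n2, parent):
    """First node common to the paths n1 -> root and n2 -> root."""
    on_second_path = set(path_to_root(n2, parent))
    return next(n for n in path_to_root(n1, parent) if n in on_second_path)


def lest(key_nodes, parent):
    """Smallest embedded pattern over key_nodes that preserves crossover nodes."""
    keep = set(key_nodes)
    for n1, n2 in combinations(key_nodes, 2):
        keep.add(crs(n1, n2, parent))
    pattern = {}
    for node in keep:
        ancestors = path_to_root(node, parent)[1:]
        # Attach to the nearest ancestor that is also kept (None for the pattern root).
        pattern[node] = next((a for a in ancestors if a in keep), None)
    return pattern


# Hypothetical tree T given as child -> parent (the root maps to None).
T = {"B": None, "P": "B", "D": "B", "E": "P", "A": "D", "G": "D"}

pattern = lest(["P", "A"], T)
print(pattern)
print(crs("P", "A", pattern) == crs("P", "A", T))
```

The final check confirms the condition crs(n1, n2, p) = crs(n1, n2, t) for the pair of relation arguments in this example.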

Fig. 2 is a block diagram showing the internal structure of relation extraction system 200 according to an embodiment of the invention. As shown in Fig. 2, system 200 mainly comprises corpus acquisition device 201, dependency tree pattern acquisition device 202, feature extraction device 203, relation labeling model training device 204 and model application device 205. Optionally, system 200 also comprises dependency tree pattern extraction device 206 for automatically extracting the required dependency tree patterns. As mentioned above, instead of extracting dependency tree patterns automatically, the user can also prepare them manually in advance and store the prepared patterns in dependency tree pattern memory 208. The method proposed by the invention for automatically extracting dependency tree patterns is described in more detail below.

As mentioned above, the operation of relation extraction system 200 of the invention mainly comprises two stages, namely the model training stage and the model application stage. The model training stage is mainly carried out by corpus acquisition device 201, dependency tree pattern acquisition device 202, feature extraction device 203 and relation labeling model training device 204, and the model application stage is carried out by model application device 205.

The flowchart of Fig. 3 shows an operation example of system 200 shown in Fig. 2. The process starts at step 301, in which corpus acquisition device 201 obtains a corpus from corpus memory 207; the corpus contains, for example, a plurality of sentences annotated with relations. In step 303, dependency tree pattern acquisition device 202 obtains the previously prepared dependency tree patterns from dependency tree pattern memory 208. Step 303 may be preceded by an optional step 302 (shown with a dashed box), which automatically extracts the required dependency tree patterns from the obtained corpus. The specific dependency tree pattern extraction process is described below. Then, in step 304, feature extraction device 203 extracts the features of each sentence in the obtained corpus with reference to the obtained dependency tree patterns; the features can include structural features and traditional features of the sentence. As an example, the structural features can be dependency tree features and the traditional features can be contextual features. In step 305, the features of each sentence extracted by feature extraction device 203 are collected and provided to relation labeling model training device 204, which can train the relation labeling model using standard machine learning techniques. The generated relation labeling model can be stored in relation labeling model memory 209. Subsequently, when an unannotated sentence is input, in step 306 model application device 205 obtains the relation labeling model stored in relation labeling model memory 209 and applies it to the unannotated sentence to extract the required relation instances. Process 300 then ends.

Fig. 2 also shows the internal structure of model application device 205 in detail. Model application device 205 can comprise, for example, sentence input unit 2051, parsing unit 2052, dependency tree pattern acquisition unit 2053, feature extraction unit 2054, relation labeling unit 2055 and relation instance extraction unit 2056. The detailed process of the model application stage has been described above. Specifically, the user first inputs, through sentence input unit 2051, an unannotated sentence from which relation instances are to be extracted. Parsing unit 2052 then parses the input sentence to obtain its dependency tree. Dependency tree pattern acquisition unit 2053 obtains the set of dependency tree patterns prepared in the model training stage and provides it to feature extraction unit 2054, which then extracts the features corresponding to the input sentence with reference to the dependency tree patterns. Relation labeling unit 2055 labels the relation arguments of the input sentence according to the features extracted by feature extraction unit 2054. The relation labeling model generated in the model training stage and stored in relation labeling model memory 209 is then provided to relation instance extraction unit 2056, which applies the obtained model to the sentence whose relation arguments have been labeled in order to extract relation instances. Because the model application process is not where the innovation of the invention lies, it is not described further.

The feature extraction process according to the invention is described first. Fig. 4 is a block diagram showing the detailed structure of feature extraction device 203 included in system 200 shown in Fig. 2.

As mentioned above, for each sentence the invention extracts not only traditional contextual features but also dependency tree features related to the dependency tree patterns. As shown in Fig. 4, feature extraction device 203 mainly comprises contextual feature extraction unit 401 for extracting contextual features, dependency tree feature extraction unit 402 for extracting dependency tree features, and feature storage unit 403 for storing the features. The extraction methods for contextual features and dependency tree features are explained in more detail below. It should be made clear that, although specific extraction methods for contextual features and dependency tree features are given below, the invention is not limited to the described embodiments. Various other feature extraction methods that are known to those skilled in the art, or that can be conceived from the description of the invention, are all intended to fall within the scope of the invention.

As shown in Fig. 4, in this example contextual feature extraction unit 401 comprises, for example, part-of-speech tagging unit 4011 and contextual feature extractor 4012; memory 4013 is mainly used to store the intermediate results produced by part-of-speech tagging unit 4011, that is, the sentences tagged with parts of speech (POS). Contextual feature extractor 4012 can extract traditional contextual features by analyzing the POS-tagged sentences. This part belongs to well-known techniques and is therefore not described further here.

Dependency tree feature extraction unit 402 can comprise parsing unit 4021, dependency tree feature extractor 4022, and memory 4023 for storing the results of parsing unit 4021. Parsing unit 4021 first parses the sentences in the obtained corpus to generate their dependency trees, which are then stored in memory 4023. In the invention, besides generating the dependency tree of each sentence, parsing unit 4021 can also derive the part-of-speech (POS) tag of each node and add it to all nodes of the dependency tree. The invention uses the parts of speech of the words in a sentence rather than the words themselves because words are too specific for common patterns between dependency trees to be found from them. Optionally, the user can also add to the nodes of the dependency tree other attributes (for example, the dependency type to the parent node) and the role a node plays in indicating a relation instance (for example, "argument ARG-1", "argument ARG-2" or "keyword").

For example, Fig. 5 illustrates the example of two dependency trees relevant to sentence that obtain by parsing.In this example, suppose that collected works comprise two sentences through marking: sentence (1) " Tom works forMicrosoft in Seattle. " and sentence (2) " Kate; once a leader of ACB; now worksin her sister ' s company BCA. ", wherein sentence (1) has the < of relation people-tissue, Tom, Microsoft>, sentence (2) has the < of relation people-tissue, Kate, BCA>.By resolving sentence (1) and (2), resolution unit 4021 can obtain the dependency tree relevant with (2) to sentence (1), as shown in Figure 5.In the example of Fig. 5, also to all nodes on dependency tree marked part of speech and hint relationship example aspect institute's role, wherein " per " and " aff " is respectively writing a Chinese character in simplified form of " people " and " tissue ".And in Fig. 5, by grey box, indicate two parameters of " people-tissue " relation.

The dependency trees processed as described above can be stored in the memory 4023. Subsequently, the dependency tree feature extractor 4022 extracts the dependency tree features of each sentence from the dependency tree of that sentence, with reference to the dependency tree patterns obtained by the dependency tree pattern obtaining device 202.

Fig. 6 shows an example of feature extraction with reference to a dependency tree pattern according to the present invention. Suppose the dependency tree pattern obtained by the dependency tree feature extractor 4022 is the one shown on the left side of Fig. 6. The right side of Fig. 6 shows the feature extraction results for sentence (1) and sentence (2) above, where columns 1-4 correspond to traditional contextual features and the m columns in the dashed box correspond to dependency tree features, m being the number of dependency tree patterns obtained. Because only one dependency tree pattern is given in the example of Fig. 6, the m columns in the dashed box show only the first column, which corresponds to that pattern.

The features in the example of Fig. 6 are defined as follows:

(1) Column 1: the part-of-speech label of the current word;

(2) Column 2: does this word denote a person? (1 if yes, 0 if no);

(3) Column 3: does this word denote an organization? (1 if yes, 0 if no);

(4) Column 4: is there a person within 4 words before or after the current word? (1 if yes, 0 if no);

(5) Dashed box (dependency tree feature): can this word be mapped to a node in the dependency tree pattern? (1 if yes, 0 if no).

The above features are given only as examples, and the user may define different features according to actual requirements.
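
The feature layout described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the `Token` class, the `"per"`/`"aff"` entity encoding, and the `matched_nodes` argument (one set of word indices per dependency tree pattern, assumed to be supplied by a separate pattern matcher) are hypothetical names introduced only for illustration. The sketch also assumes that the 4-word window of column 4 excludes the current word, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Token:
    pos: str                # part-of-speech label (column 1)
    entity: Optional[str]   # "per", "aff", or None (hypothetical encoding)


def person_nearby(tokens: List[Token], i: int, window: int = 4) -> bool:
    """Column 4: is a person within `window` words of word i (word i excluded)?"""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return any(tokens[j].entity == "per" for j in range(lo, hi) if j != i)


def feature_rows(tokens: List[Token], matched_nodes: List[Set[int]]) -> list:
    """One row per word: 4 contextual columns followed by m pattern columns.

    matched_nodes[k] holds the indices of the words that map to nodes of
    pattern k (empty if the sentence does not satisfy pattern k), so the
    last m columns form the n x m dependency tree feature matrix.
    """
    rows = []
    for i, tok in enumerate(tokens):
        row = [
            tok.pos,
            int(tok.entity == "per"),
            int(tok.entity == "aff"),
            int(person_nearby(tokens, i)),
        ]
        row += [int(i in nodes) for nodes in matched_nodes]
        rows.append(row)
    return rows


# Sentence (1): "Tom works for Microsoft in Seattle."
sentence1 = [
    Token("NNP", "per"), Token("VBZ", None), Token("IN", None),
    Token("NNP", "aff"), Token("IN", None), Token("NNP", None),
]
# Assumed matcher output for a single pattern covering the first four words.
rows1 = feature_rows(sentence1, [{0, 1, 2, 3}])
```

Each row can then be passed to a sequence labeler as the observation for one word; the POS tags and the matched indices above are illustrative values, not those of Fig. 6.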

Returning to Fig. 2, after the features have been extracted, the relation labeling model trainer 204 collects the extracted features and uses them to train a relation labeling model with any machine learning technique. Here, CRF (conditional random fields) is taken as an example to briefly explain how the collected features are used.

For the CRF training process, the key component is the selection of features. In practical applications of relation labeling, such as search engine systems, precision is often more important than recall. The system does not need to return all relevant information; it only needs to provide the most important information to the user. The user can therefore select dependency tree patterns with high precision for extracting new relations. If the user wishes to obtain high recall or a high F-measure, the dependency tree patterns can be used as binary features to construct the CRF model. Specifically, if a sentence s satisfies a pattern p, the binary feature f(p, s) = 1; otherwise it is 0. This feature can thus be described as "does the dependency tree of this sentence satisfy pattern p?". The example shown in Fig. 6 is a concrete instance of this idea. The CRF then learns a model from the processed data, and the model is used to extract new relations.
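
The binary feature f(p, s) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class and the function names are hypothetical, nodes carry only POS labels, and "satisfies" is taken to mean that the pattern occurs as an unordered parent-child subtree somewhere in the dependency tree. The text does not fix the exact matching semantics, so a real system may use a different notion of subtree embedding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                                   # POS label of the node
    children: List["Node"] = field(default_factory=list)


def matches_at(pattern: Node, node: Node) -> bool:
    """True if `pattern` can be laid over the tree with its root on `node`."""
    if pattern.label != node.label:
        return False
    return _assign(pattern.children, node.children)


def _assign(p_children: List[Node], t_children: List[Node]) -> bool:
    """Map every pattern child to a distinct tree child by backtracking."""
    if not p_children:
        return True
    first, rest = p_children[0], p_children[1:]
    for i, cand in enumerate(t_children):
        if matches_at(first, cand) and _assign(rest, t_children[:i] + t_children[i + 1:]):
            return True
    return False


def iter_nodes(node: Node):
    """Yield every node of the tree rooted at `node`."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def f(pattern: Node, tree: Node) -> int:
    """Binary feature: 1 if the sentence's dependency tree satisfies the pattern."""
    return int(any(matches_at(pattern, n) for n in iter_nodes(tree)))


# Illustrative tree for "Tom works for Microsoft in Seattle." (POS labels only).
tree1 = Node("VBZ", [Node("NNP"), Node("IN", [Node("NNP")]), Node("IN", [Node("NNP")])])
# Hypothetical pattern: a verb with a proper-noun child and a preposition child.
pattern = Node("VBZ", [Node("NNP"), Node("IN", [Node("NNP")])])
```

With these values, `f(pattern, tree1)` evaluates to 1, while a tree lacking the prepositional branch yields 0.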

Another important aspect of the present invention, the automatic extraction of dependency tree patterns, is described in detail below with reference to Figs. 7-10. It should be noted that the embodiment given below is only one example of the dependency tree pattern extraction process, and the scope of the present invention should not be limited to it. Dependency tree patterns may be prepared manually in advance by the user, or obtained in advance in other ways known to those skilled in the art. However, when dependency tree patterns are created manually, the user has to inspect a large number of dependency trees and reduce them to a few dependency tree patterns, which is very time-consuming. In contrast, the present invention eliminates this problem because the dependency tree patterns can be extracted automatically.

Fig. 7 is a block diagram showing the detailed structure of the dependency tree pattern extraction device 206 included in the system 200 shown in Fig. 2. As shown in Fig. 7, the dependency tree pattern extraction device 206 may comprise a parsing unit 701, a clustering unit 702, a dependency tree pattern extraction unit 703, a dependency tree memory 704, and a cluster memory 705. First, the parsing unit 701 parses each relation-annotated sentence in the corpus from the corpus memory 207 to generate the corresponding dependency tree. The clustering unit 702 then clusters the dependency trees generated by the parsing unit 701 into different groups, such that the dependency trees in the same group are structurally similar. The clustering result can subsequently be stored in the cluster memory 705. Then, the dependency tree pattern extraction unit 703 applies a subtree mining algorithm to mine the subtrees in each group of dependency trees and selects the subtrees that meet the requirements for dependency tree patterns as output.

Traditional subtree mining algorithms attempt to extract all possible subtrees. However, owing to combinatorial explosion, the number of subtrees grows exponentially with the size of the subtree pattern. Consequently, if the minimum support "min_supp" is set to a small value, a very large number of patterns will be produced, which may cause the mining process to fail. To solve this problem, the present invention first clusters the dependency trees so that structurally similar dependency trees form a group, and then extracts patterns from each group.

As an example, the present invention may define the similarity function of dependency trees on the basis of their LESTs. For two dependency trees t_1 and t_2, the similarity function Sim(t_1, t_2) takes one of two values: Sim(t_1, t_2) = 1 when t_1 and t_2 have the same LEST, and Sim(t_1, t_2) = 0 when they have different LESTs. Dependency trees with the same LEST are clustered into the same group. The advantage of this definition is that the user does not need to predefine the number of groups or a similarity threshold for the clustering. Moreover, with this definition the time complexity of the clustering algorithm is O(N), and the algorithm needs to scan the dependency tree database only once. When a new dependency tree t (with LEST(t)) arrives, LEST(t) only has to be compared with the LEST of each current group. If a group whose LEST equals LEST(t) is found, t is added to that group; otherwise, a new group is created for t. To find the group with the same LEST as t more efficiently, the LEST can be represented by the combination of the string sequences produced by its postorder and preorder traversals; in general, the postorder and preorder sequences together determine a tree. A hash table can then be used to index the string sequence corresponding to each LEST, which further improves efficiency. After the dependency trees have been clustered into different groups, a subtree mining algorithm can be run on each group to extract dependency tree patterns.
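
The single-pass clustering described above can be sketched in Python as follows. The sketch assumes that the LEST of each dependency tree has already been computed and is given as a nested tuple `(label, (child, child, ...))`; this encoding and the function names are hypothetical choices made for illustration. A Python `dict` plays the role of the hash table, keyed by the pair of preorder and postorder label sequences.

```python
from collections import defaultdict


def preorder(tree):
    """Labels in preorder: the node first, then its children left to right."""
    label, children = tree
    out = [label]
    for child in children:
        out.extend(preorder(child))
    return out


def postorder(tree):
    """Labels in postorder: children left to right, then the node."""
    label, children = tree
    out = []
    for child in children:
        out.extend(postorder(child))
    out.append(label)
    return out


def lest_key(lest):
    """Hashable key built from both traversal sequences of a LEST."""
    return (tuple(preorder(lest)), tuple(postorder(lest)))


def cluster_by_lest(trees_with_lest):
    """Group tree ids by LEST in one pass over (tree_id, lest) pairs."""
    groups = defaultdict(list)
    for tree_id, lest in trees_with_lest:
        groups[lest_key(lest)].append(tree_id)   # Sim = 1 iff the keys are equal
    return dict(groups)


# Two trees share a LEST; the third has a different one (illustrative data).
lest_a = ("VBZ", (("NNP", ()), ("IN", (("NNP", ()),))))
lest_c = ("VBZ", (("NNP", ()), ("NNP", ())))
groups = cluster_by_lest([("t1", lest_a), ("t2", lest_a), ("t3", lest_c)])
```

Because each tree requires one key computation and one expected constant-time dictionary lookup, the pass over N trees is linear in N, in line with the O(N) complexity stated above.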

In an embodiment of the present invention, the dependency tree pattern extraction unit 703 may extract one or more closed dependency tree patterns as output, according to the structural similarity of the dependency trees contained in each group. The definition of a closed dependency tree pattern was introduced above: a dependency tree pattern p is called a closed dependency tree pattern if there is no other dependency tree pattern p' that contains p and has the same support as p. All closed dependency tree patterns extracted by the dependency tree pattern extraction unit 703 can be stored in the dependency tree pattern memory 208 as the final required dependency tree patterns, for use in the subsequent feature extraction and relation labeling model training.
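
The definition of a closed pattern can be stated as a brute-force check in Python. This is a sketch only: the `contains` predicate is supplied by the caller, and the example substitutes item sets for dependency tree patterns (with proper-subset inclusion standing in for subtree containment) purely to keep the illustration short.

```python
def closed_patterns(supports, contains):
    """Return the patterns that have no containing pattern of equal support.

    supports: dict mapping each pattern to its support.
    contains(q, p): True if pattern q properly contains pattern p.
    """
    return [
        p for p, s in supports.items()
        if not any(q != p and contains(q, p) and supports[q] == s for q in supports)
    ]


# Stand-in example: patterns are frozensets of node labels.
supports = {frozenset("a"): 3, frozenset("ab"): 3, frozenset("abc"): 2}
closed = closed_patterns(supports, lambda q, p: p < q)
```

Here `{a}` is not closed because `{a, b}` contains it with the same support 3, whereas `{a, b}` and `{a, b, c}` are closed.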

For the extraction of closed dependency tree patterns, the present invention proposes an iterative method. Fig. 8 shows an example of the internal structure of the dependency tree pattern extraction unit 703 of Fig. 7 when it operates iteratively. In this case, the dependency tree pattern extraction unit 703 comprises a seed pattern collector 801, a candidate seed pattern generator 802, and a seed pattern pruner 803. The seed pattern collector 801 first collects the LEST of each group as the initial set of seed patterns. Then, in each iteration, the candidate seed pattern generator 802 adds one additional node to each seed pattern to generate a new set of candidate seed patterns. The seed pattern pruner 803 adjusts the set of candidate seed patterns according to predetermined criteria, deleting useless candidate seed patterns from it. The remaining candidate seed patterns are then supplied to the candidate seed pattern generator 802 again as new seed patterns for the next iteration. This process is repeated until the set of seed patterns is empty.

As mentioned above, in an embodiment of the present invention the dependency tree pattern extraction unit 703 attempts to extract closed dependency tree patterns as the reference for the final feature extraction. Fig. 9 shows an example of the dependency tree pattern extraction process, in which the set of seed patterns is likewise adjusted round by round in an iterative manner.

As shown in Fig. 9, the process starts at step 901, in which the parsing unit 701 parses each sentence in the obtained corpus to generate the associated dependency tree. In step 902, the clustering unit 702 clusters the dependency trees, for example according to their LESTs, to generate different groups. In step 903, the seed pattern collector 801 collects the LEST of each group as the initial set of seed patterns. Then, in step 904, the candidate seed pattern generator 802 adds one additional node to each seed pattern p to generate a new set of candidate seed patterns {p_1, p_2, ..., p_n}. Subsequently, in steps 905-915, the seed pattern pruner 803 adjusts the set of seed patterns. Specifically, step 905 determines whether the supports of all candidate seed patterns {p_1, p_2, ..., p_n} generated from the seed pattern p are less than the support of p. If so, the seed pattern p is output as a closed dependency tree pattern in step 906. Then, for each candidate seed pattern p_i (i = 1, 2, ..., n) generated from p, step 907 determines whether the candidate is frequent, that is, whether its support S(p_i) is less than a predetermined threshold Th. If S(p_i) is less than Th, the candidate seed pattern is not frequent and is deleted from the set of candidate seed patterns (step 909). Otherwise, the candidate seed pattern p_i is retained (step 908). If step 905 determines that not all candidate seed patterns {p_1, p_2, ..., p_n} generated from p have supports less than the support of p, the process proceeds to step 910. Step 910 determines, for a candidate seed pattern p_m, whether its support equals the support of the seed pattern p that generated it. If so, the candidate seed pattern p_m is retained (step 911). If not, step 912 determines whether p_m is frequent, that is, whether its support is less than the predetermined threshold Th. If so, p_m is deleted from the set of candidate seed patterns in step 913. If not, p_m is retained (step 914). Then, in step 915, all candidate seed patterns retained in the current iteration are collected and used as the new set of seed patterns for the next iteration. Step 916 determines whether the set of seed patterns after this iteration is empty. If not, the process returns to step 904 and repeats steps 904-915. If the set of seed patterns is empty, the process ends.
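
The control flow of steps 904-916 can be sketched in Python. To keep the sketch short and runnable, it deliberately replaces dependency trees with item sets: a "pattern" is a frozenset of labels, "adding a node" means adding one label, and support is the number of transactions containing the pattern. This substitution, together with the names `support`, `mine_closed` and `th`, is an assumption made for illustration and is not the tree-based procedure of Fig. 9 itself.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def mine_closed(transactions, seeds, th):
    """Iteratively grow seed patterns and collect the closed ones."""
    items = sorted(set().union(*transactions))
    closed = []
    seed_set = {frozenset(s) for s in seeds}           # step 903 (seeds given)
    while seed_set:                                    # step 916
        retained = set()
        for p in seed_set:
            supp_p = support(p, transactions)
            # Step 904: extend the seed by one additional item.
            supps = {p | {x}: support(p | {x}, transactions)
                     for x in items if x not in p}
            # Steps 905-906: no extension keeps the support, so p is closed.
            if all(s < supp_p for s in supps.values()):
                closed.append(p)
            for cand, s in supps.items():
                # Steps 910-911: equal support is always retained.
                # Steps 907-909 and 912-914: otherwise keep only frequent ones.
                if s == supp_p or s >= th:
                    retained.add(cand)
        seed_set = retained                            # step 915
    return closed


transactions = [frozenset("abc"), frozenset("abc"), frozenset("abd")]
result = mine_closed(transactions, [{"a"}], th=2)
```

For this toy input the seed `{a}` (support 3) is not output because its extension `{a, b}` has the same support; `{a, b}` and `{a, b, c}` are output as closed patterns, and the loop ends once no candidate survives.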

To further improve the efficiency of extracting closed dependency tree patterns, the pruning process described above may also include the following processing: in each iteration, besides comparing each candidate seed pattern p_i with the seed pattern p that generated it, the candidate seed pattern p_i may also be compared with the other seed patterns, that is, those other than the seed pattern p that generated it. If one of these other seed patterns, k, is contained in the candidate seed pattern p_i and the two have the same support, the seed pattern k and all candidate seed patterns {k_1, k_2, ..., k_n} generated from it are deleted. Fig. 10 shows an example of this pruning process.

In Fig. 10, suppose that in a certain iteration the set of seed patterns contains two seed patterns, (1) and (2). After a node has been added, seed pattern (1) yields two candidate seed patterns, (11) and (12), and seed pattern (2) yields candidate seed pattern (21). All nodes of seed pattern (1) are contained in the candidate seed pattern (21) generated from seed pattern (2); therefore, according to the above algorithm, seed pattern (1) and all of its candidate seed patterns (11) and (12) are deleted from the set of seed patterns. Doing so improves the efficiency of extracting closed dependency tree patterns without losing any closed dependency tree pattern, as is proved below.

Suppose that in the N-th iteration there are m seed patterns of size N, and that each seed pattern is subsequently expanded to size N+1. Suppose further that the seed pattern p(i, N) can generate the new candidate seed patterns p(i, j, N+1). All pairs p(i, j, N+1) and p(k, N) with i ≠ k are then checked. If p(k, N) is contained in p(i, j, N+1) and supp(p(k, N)) = supp(p(i, j, N+1)), then p(k, N) and all candidate seed patterns p(k, l, N+1) generated from it are deleted. It must now be shown that doing so does not lose any closed pattern. To show this, it suffices to prove that "if there is a closed pattern p that is generated directly or indirectly from p(k, N), then this closed pattern p must be contained in another pattern p', and this pattern p' can be generated from p(i, j, N+1)".

Proof: First, let ext(p, p') denote the expansion that obtains p' from p. Since p(k, N) is contained in p(i, j, N+1), the expansion ext(p(k, N), p(i, j, N+1)) must exist. Two cases need to be considered: (1) if p has already undergone the expansion ext(p(k, N), p(i, j, N+1)), then p must contain p(i, j, N+1) and can certainly be generated from p(i, j, N+1); (2) otherwise, if p does not contain p(i, j, N+1), the expansion ext(p(k, N), p(i, j, N+1)) can be applied to p to obtain p', which must satisfy supp(p) = supp(p'), so that p' necessarily contains both p(k, N) and p(i, j, N+1).

The above proof shows that the pruning process illustrated in Fig. 10 does not miss any closed dependency tree pattern.

The entity relation extraction system and method according to the present invention, and the dependency tree pattern extraction process used therein, have been described in detail above with reference to the accompanying drawings. As stated earlier, the relation extraction system and method of the present invention can achieve better performance than conventional methods.

In addition, the extracted sentence structure information is also very useful for filtering out false relation instances. The structural information (that is, the dependency tree patterns) can be extracted by using the grammatical relations between the words in a sentence. These dependency tree patterns make it easy to distinguish the structures of correct relations from those of false relations.

Although specific embodiments of the present invention have been described above, the present invention is not limited to the particular configurations and processes shown in the figures. In addition, detailed descriptions of known methods and techniques are omitted here for brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method of the present invention is not limited to the specific steps described and shown; those skilled in the art, after understanding the spirit of the present invention, may make various changes, modifications, and additions, or change the order of the steps.

The elements of the present invention may be implemented as hardware, software, firmware, or a combination thereof, and may be used in systems, subsystems, components, or subcomponents thereof. When implemented in software, the elements of the present invention are programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transmit information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical discs, hard disks, fiber-optic media, radio frequency (RF) links, and so on. The code segments may be downloaded via a computer network such as the Internet or an intranet.

The present invention may be embodied in other specific forms without departing from its spirit and essential characteristics. For example, the algorithms described in the specific embodiments may be modified while the system architecture does not depart from the essential spirit of the present invention. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive; the scope of the present invention is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are therefore included within the scope of the present invention.

Claims

1. A method for relation extraction, comprising:

obtaining a corpus, the corpus comprising a plurality of sentences annotated with relations;

obtaining a set of dependency tree patterns related to the sentence structures in the corpus;

extracting, with reference to the dependency tree patterns, features of each sentence in the corpus, the features comprising contextual features and dependency tree features of the sentence, wherein the step of extracting the dependency tree features comprises:

parsing each sentence in the corpus to obtain the associated dependency tree;

constructing the dependency tree features of the sentence by comparing the associated dependency tree with the extracted dependency tree patterns, wherein the dependency tree features of each sentence consist of an n × m matrix, where m is the number of the extracted dependency tree patterns and n is the number of nodes contained in the dependency tree of the sentence, and, for each dependency tree pattern, if the dependency tree of the sentence satisfies the dependency tree pattern, then in the column of the n × m matrix corresponding to that dependency tree pattern the matrix elements corresponding to the nodes of the dependency tree pattern are set to 1 and the other elements are set to 0;

collecting the extracted features to train a relation labeling model; and

applying the relation labeling model to unannotated sentences to extract relation instances.

2. The method of claim 1, further comprising:

automatically extracting the dependency tree patterns from the corpus.

3. The method of claim 2, wherein the step of extracting the dependency tree patterns comprises:

parsing each relation-annotated sentence in the corpus to generate the corresponding dependency tree;

clustering the generated dependency trees into different groups, wherein the dependency trees in the same group are structurally similar;

extracting one or more closed dependency tree patterns, wherein a dependency tree pattern p is called a closed dependency tree pattern if there is no other dependency tree pattern p' that contains the dependency tree pattern p and has the same support as the dependency tree pattern p; and

collecting and storing the extracted closed dependency tree patterns.

4. The method of claim 3, wherein the dependency trees in the same group have structurally identical minimum embedded subtree patterns (LEST), the minimum embedded subtree pattern comprising a pair of relation nodes and all crossover nodes of the pair of relation nodes, a known relation existing between the pair of relation nodes.

5. The method of claim 4, wherein the step of extracting one or more closed dependency tree patterns comprises:

(a) collecting the LEST of each group as the initial set of seed patterns;

(b) adding one additional node to each of the seed patterns to generate a new set of candidate seed patterns;

(c) adjusting the set of candidate seed patterns as follows:

if the supports of all candidate seed patterns generated from a seed pattern are less than the support of that seed pattern, outputting the seed pattern as a closed dependency tree pattern, and, for each candidate seed pattern generated from that seed pattern:

if the support of a candidate seed pattern equals the support of the seed pattern that generated it, retaining the candidate seed pattern, and, for each other candidate seed pattern generated from that seed pattern:

if the support of the candidate seed pattern is greater than or equal to said specified threshold, retaining the candidate seed pattern; and

(d) taking the retained candidate seed patterns as new seed patterns and repeating steps (b) and (c) until the set of seed patterns is empty.

6. The method of claim 5, further comprising:

comparing each candidate seed pattern with the other seed patterns, other than the seed pattern that generated the candidate seed pattern, and, if one of the other seed patterns is contained in the candidate seed pattern and the two have the same support, deleting that other seed pattern and all candidate seed patterns generated from it.

7. The method of claim 1, wherein the step of extracting the contextual features comprises:

tagging each sentence in the corpus with part-of-speech labels; and

extracting the contextual features of the sentence by analyzing the part-of-speech-tagged sentence.

8. A system for relation extraction, comprising:

a corpus obtaining device for obtaining a corpus, the corpus comprising a plurality of sentences annotated with relations;

a dependency tree pattern obtaining device for obtaining a set of dependency tree patterns related to the sentence structures of the corpus;

a feature extraction device for extracting features of each sentence in the corpus with reference to the extracted dependency tree patterns, the feature extraction device comprising a contextual feature extraction unit for extracting the contextual features of each sentence and a dependency tree feature extraction unit for extracting the dependency tree features of each sentence, wherein the dependency tree feature extraction unit comprises:

a parsing unit for parsing each sentence in the corpus to obtain the associated dependency tree; and

a dependency tree feature extractor for constructing the dependency tree features of the sentence by comparing the associated dependency tree with the extracted dependency tree patterns, wherein the dependency tree features of each sentence consist of an n × m matrix, where m is the number of the extracted dependency tree patterns and n is the number of nodes contained in the dependency tree of the sentence, and, for each dependency tree pattern, if the dependency tree of the sentence satisfies the dependency tree pattern, then in the column of the n × m matrix corresponding to that dependency tree pattern the matrix elements corresponding to the nodes of the dependency tree pattern are set to 1 and the other elements are set to 0;

a relation labeling model trainer for collecting the features extracted by the feature extraction device to train a relation labeling model; and

a model application device for applying the relation labeling model to unannotated sentences to extract relation instances.

9. The system of claim 8, further comprising:

a dependency tree pattern extraction device for automatically extracting the dependency tree patterns from the corpus.

10. The system of claim 9, wherein the dependency tree pattern extraction device comprises:

a parsing unit for parsing each relation-annotated sentence in the corpus to generate the corresponding dependency tree;

a clustering unit for clustering the generated dependency trees into different groups, wherein the dependency trees in the same group are structurally similar; and

a dependency tree pattern extraction unit for extracting one or more closed dependency tree patterns, wherein a dependency tree pattern p is called a closed dependency tree pattern if there is no other dependency tree pattern p' that contains the dependency tree pattern p and has the same support as the dependency tree pattern p.

11. The system of claim 10, wherein the dependency trees in the same group have structurally identical minimum embedded subtree patterns (LEST), the minimum embedded subtree pattern comprising a pair of relation nodes and all crossover nodes of the pair of relation nodes, a known relation existing between the pair of relation nodes.

12. The system of claim 11, wherein the dependency tree pattern extraction unit comprises:

a seed pattern collector for collecting the LEST of each group as the initial set of seed patterns;

a candidate seed pattern generator for adding one additional node to each of the seed patterns to generate a new set of candidate seed patterns; and

a seed pattern pruner for adjusting the set of candidate seed patterns as follows:

if the supports of all candidate seed patterns generated from a seed pattern are less than the support of that seed pattern, outputting the seed pattern to a dependency tree pattern memory as a closed dependency tree pattern, and, for each candidate seed pattern generated from that seed pattern:

if the support of the candidate seed pattern is less than a specified threshold, deleting the candidate seed pattern,

if the support of the candidate seed pattern is greater than or equal to the specified threshold, retaining the candidate seed pattern, and

wherein the candidate seed pattern generator and the seed pattern pruner operate iteratively, the candidate seed patterns retained in each iteration being used as the new set of seed patterns for the next iteration, until the set of seed patterns is empty.

13. The system of claim 12, wherein the seed pattern pruner is further configured to:

14. The system of claim 8, wherein the contextual feature extraction unit comprises:

a part-of-speech tagging unit for tagging each sentence in the corpus with part-of-speech labels; and

a contextual feature extractor for analyzing the part-of-speech-tagged sentence to extract the contextual features of the sentence.

15. The system of claim 8, wherein the model application device comprises:

a sentence input unit for inputting an unannotated sentence;

a parsing unit for parsing the unannotated sentence to obtain the associated dependency tree;

a dependency tree pattern obtaining unit for obtaining the set of dependency tree patterns extracted from the corpus;

a feature extraction unit for extracting the features of the unannotated sentence with reference to the obtained dependency tree patterns;

a relation labeling unit for labeling relation arguments in the sentence whose features have been extracted; and

a relation instance extraction unit for applying the relation labeling model to the sentence labeled with relation arguments to extract relation instances.