
CN101799802A - Method and system for extracting entity relationship by using structural information

Info

Publication number: CN101799802A
Application number: CN200910000499A
Authority: CN
Inventors: 许洪志; 胡长建; 沈国阳
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2009-02-05
Filing date: 2009-02-05
Publication date: 2010-08-11
Anticipated expiration: 2029-02-05
Also published as: CN101799802B

Abstract

The invention provides a method and a system for extracting entity relationships using structural information. The method comprises: acquiring a corpus comprising a plurality of sentences annotated with relations; acquiring a set of dependency tree patterns related to the sentence structures in the corpus; extracting features of each sentence in the corpus with reference to the dependency tree patterns, the features including structural features of the sentence; collecting the extracted features to train a relation labeling model; and applying the relation labeling model to unannotated sentences to extract relation instances. The invention also discloses a process for automatically extracting the dependency tree patterns. Compared with the prior art, the relation extraction system and method of the invention can achieve high performance.

Description

Method and system for extracting entity relationships using structural information

Technical field

The present invention relates generally to natural language processing and, more specifically, to a method and system for extracting entity relationships using structural information.

Background art

With the continued growth of digital information and its increasing availability, users demand ever more intelligent information analysis, and traditional information retrieval techniques can no longer meet these demands. Users want computer systems to play a more important role in understanding plain text. For example, users need systems that automatically extract the relations between entities in text.

Relation extraction (RE) can be applied in many fields. For example, detecting open-domain text and extracting causal relations from it can aid the development of question-answering (Q-A) systems. As another example, relations between genes and diseases can be discovered in biomedical literature for use in disease risk labeling, diagnosis, and prognosis, and social networks can be extracted from online community websites to provide users with better information recommendations in the future.

The performance of applications based on relational knowledge depends heavily on the quality of the algorithm or method selected for relation extraction. End users benefit greatly from high-quality relation instances. Therefore, to build high-performance applications, improving the accuracy of relation extraction has become a common problem.

Meanwhile, applying syntactic analysis to text (for example, sentences) is not by itself enough to solve the relation extraction problem, because the solution also depends on discovering some semantic information. However, the performance of existing semantic analysis is not good enough, so making the best use of imperfect semantic techniques is itself a challenging problem.

Many methods have been developed in the prior art to solve the relation extraction problem, but their performance in practical applications is unsatisfactory. The basic scheme is to learn flat text patterns (for example, regular expressions) from an annotated training corpus and to extract relations with the learned patterns. Regular expressions can be learned from sentences in which the relation arguments have been annotated. For example, the article by Eugene Agichtein and Luis Gravano, "Snowball: Extracting Relations from Large Plain-Text Collections" (see Proc. of the 5th ACM Conference on Digital Libraries, 2000), proposes an algorithm for extracting "organization-location" pairs. The algorithm generates patterns by generalizing the contexts of the relation arguments. The extracted candidate patterns are then evaluated automatically, and only patterns with high confidence are retained for finding new relation instances. Newly found relation instances are in turn used to extract more candidate patterns. Through iteration, the algorithm can obtain a large number of relation instances with reasonable accuracy. The content of this article is incorporated herein by reference in its entirety for all purposes.

Because relation extraction can be regarded as a sequence labeling problem, existing sequence labeling methods (for example, hidden Markov models (HMM), maximum entropy (ME), and conditional random fields (CRF)) can be used to solve it. Widely used features currently include context words, the part-of-speech (POS) tags of the context words, a window feature indicating whether a pair of entities (labeled as a corresponding pair of roles in the dependency tree, also called arguments) falls within the same window, features extracted from the dependency tree or the syntactic parse tree, and so on. For example, the article by K. Nanda, "Combining lexical, syntactic and semantic features with maximum entropy models for extracting relations" (see Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), 2004), uses features extracted from the dependency tree or the syntactic parse tree. The features adopted include the path from the first argument of the relation to the second argument on the parse tree or dependency tree, the parent nodes of the first and second arguments on the dependency tree, and the context words and their POS tags. This method trains a model with maximum entropy (ME) on a training corpus and uses the model to extract new relation instances. The content of this article is incorporated herein by reference in its entirety for all purposes.

In addition, relation extraction can be regarded as a classification problem, so another class of relation extraction techniques is based on kernel methods. A kernel method is a non-parametric density estimation technique that computes a kernel function between data instances, where the kernel function can be understood as a similarity measure. Relevant kernel functions can be defined over word strings (the bag of words of a sentence) or over dependency trees (the structural information of a sentence). Using such a kernel in a support vector machine (SVM), relation instances can be detected and extracted. The article by Aron Culotta and Jeffrey Sorensen, "Dependency Tree Kernels for Relation Extraction" (see Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), 2004), proposes a kernel function over dependency trees. The corresponding features include the POS tag of a tree node, the dependency type, the entity type (for example "person" or "organization"), and the role (for example "argument ARG-A" and "argument ARG-B"). The function first checks whether the roots of the two dependency trees are the same. If the two roots differ, the similarity score of the two dependency trees is 0. Otherwise, the function computes the similarity between the child nodes. Finally, the kernel function is used in an SVM to train a classifier for relation extraction. The content of this article is incorporated herein by reference in its entirety for all purposes.

However, all of the existing methods above ignore semantic information during relation extraction and attend only to shallow syntactic information, such as the POS tag of the word on which the current word depends, or the path on the dependency tree from relation argument "ARG-A" to "ARG-B". Such "parent node" or "path" information cannot carry enough useful semantic information to distinguish relations, so existing relation extraction schemes perform poorly.

In fact, a relation can be determined from a substructure of the dependency tree that carries complete semantic meaning. This means that inspecting certain branches of a sentence's dependency tree is sufficient to detect a relation. However, the prior art has not yet proposed an effective method for finding these key substructures.

Summary of the invention

In view of the above problems, the present invention aims to provide an accurate and efficient entity relationship extraction method and system. Specifically, the technique of the invention first extracts key substructures, referred to as "dependency tree patterns", from dependency trees that contain actual relation instances. The extracted dependency tree patterns can then be used to improve the accuracy of relation extraction.

According to a first aspect of the invention, a method for relation extraction is provided, comprising: acquiring a corpus comprising a plurality of sentences annotated with relations; acquiring a set of dependency tree patterns related to the sentence structures in the corpus; extracting features of each sentence in the corpus with reference to the dependency tree patterns, the features including structural features of the sentence; collecting the extracted features to train a relation labeling model; and applying the relation labeling model to unannotated sentences to extract relation instances.

According to a second aspect of the invention, a system for relation extraction is provided, comprising: a corpus acquisition device for acquiring a corpus comprising a plurality of sentences annotated with relations; a dependency tree pattern acquisition device for acquiring a set of dependency tree patterns related to the sentence structures in the corpus; a feature extraction device for extracting features of each sentence in the corpus with reference to the acquired dependency tree patterns, the features including structural features of the sentence; a relation labeling model training device for collecting the features extracted by the feature extraction device to train a relation labeling model; and a model application device for applying the relation labeling model to unannotated sentences to extract relation instances.

It can be seen that the system of the invention operates in two stages: a model training stage and a model application stage.

During the model training stage, a highly accurate relation labeling model can be obtained by the following operations:

1. First, a corpus C_r annotated with relations must be provided for model training. A set of dependency tree patterns, denoted TPs, must also be prepared in advance from this corpus C_r.

2. Then, the prepared dependency tree patterns are used to extract the required features for each sentence in the corpus C_r, including structural features and traditional features (for example, context features).

3. The extracted features are then collected and used to train a relation labeling model. Traditional machine learning techniques can be used as the training method.

4. The generated relation labeling model is stored for future use.

In the model application stage, the system of the invention can extract relation instances effectively by the following operations:

5. The user inputs unannotated text from which relation instances are to be extracted; the text is processed sentence by sentence.

6. The input sentence is parsed to obtain its dependency tree.

7. The set of dependency tree patterns prepared in the model training stage is now used to extract the features of the input sentence.

8. The relation arguments in the input sentence are labeled according to the extracted features.

9. Finally, the previously generated relation labeling model is applied to the sentence whose relation arguments have been labeled, to extract relation instances.

The set of dependency tree patterns may be created in advance by the user, or may be extracted automatically from the corpus C_r. For automatic extraction from the corpus C_r, the invention proposes the following dependency tree pattern extraction method:

1. Each annotated sentence in the corpus is parsed into a corresponding dependency tree. The dependency trees can be created automatically by the system; ideally, they can also be created manually by the user.

2. All dependency trees are clustered into groups so that the dependency trees within a group are structurally highly similar. For example, in an embodiment of the invention, the similarity function of dependency trees can be defined based on the minimum embedded subtree pattern (LEST). For two dependency trees t₁ and t₂, the similarity function Sim(t₁, t₂) takes two values: Sim(t₁, t₂) = 1 when t₁ and t₂ have the same LEST, and Sim(t₁, t₂) = 0 when they have different LESTs.
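
The grouping in step 2 can be sketched as follows. This is only a minimal illustration, assuming each dependency tree has already been reduced to a hashable LEST signature; the names `sim` and `cluster_by_lest` and the signature strings are hypothetical and do not come from the source.

```python
from collections import defaultdict

def sim(lest1, lest2):
    """Two-valued similarity: 1 if both trees share the same LEST, else 0."""
    return 1 if lest1 == lest2 else 0

def cluster_by_lest(lest_of_tree):
    """Group tree ids so that any two trees in a group have Sim = 1.

    lest_of_tree maps a tree id to a hashable LEST signature. The signature
    is assumed to be computed elsewhere; its format here is illustrative.
    """
    groups = defaultdict(list)
    for tree_id, lest in lest_of_tree.items():
        groups[lest].append(tree_id)
    return dict(groups)

# Hypothetical signatures for three parsed sentences.
lests = {"t1": "VB(per,aff)", "t2": "VB(per,aff)", "t3": "NN(per,aff)"}
clusters = cluster_by_lest(lests)
```

Because Sim only takes the values 0 and 1, clustering reduces to grouping by equal signature, so no distance threshold is needed in this sketch.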

3. A subtree mining algorithm is used to extract one or more closed dependency tree patterns. For example, the extraction can proceed iteratively as follows:

3.1 Use the LEST of each group as the initial set S_p of seed patterns;

3.2 Add one additional node to each seed pattern in S_p to generate a new set of candidate seed patterns;

3.3 Check the support of each candidate seed pattern and delete the useless candidates. The deletion rule can be defined, for example, as follows:

3.3.1 If the supports of all candidate seed patterns generated from a seed pattern are less than the support of that seed pattern, the seed pattern is output as a closed dependency tree pattern, and for each candidate seed pattern generated from that seed pattern:

If the support of the candidate seed pattern is less than a specified threshold, the candidate seed pattern is deleted;

If the support of the candidate seed pattern is greater than or equal to the specified threshold, the candidate seed pattern is retained. Otherwise:

3.3.2 If the support of a candidate seed pattern equals the support of the seed pattern that generated it, that candidate seed pattern is retained, and for each other candidate seed pattern of the seed pattern:

If the support of the candidate seed pattern is less than the specified threshold, the candidate seed pattern is deleted;

If the support of the candidate seed pattern is greater than or equal to the specified threshold, the candidate seed pattern is retained;

3.4 Use the retained candidate seed patterns as the new seed patterns S_p and repeat steps 3.2 and 3.3 until the set of seed patterns is empty.
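
The iteration in steps 3.1 to 3.4 can be sketched as below. This is a hedged illustration rather than the patented implementation: the function name `mine_closed_patterns` and its `extend` and `support` callbacks are hypothetical, and the toy data models each tree as a plain set of node labels (so "adding a node" means adding a label) instead of a real dependency subtree.

```python
def mine_closed_patterns(seeds, extend, support, min_supp):
    """Iterate steps 3.2-3.4 and return the patterns output as closed."""
    closed = []
    seen = set(seeds)
    frontier = list(seeds)
    while frontier:  # step 3.4: stop when no seed patterns remain
        next_frontier = []
        for seed in frontier:
            seed_supp = support(seed)
            # Step 3.2: grow the seed by one node.
            cand_supp = {c: support(c) for c in extend(seed)}
            # Rule 3.3.1: every extension loses support, so the seed is closed.
            if all(s < seed_supp for s in cand_supp.values()):
                closed.append(seed)
            for cand, s in cand_supp.items():
                # Rule 3.3.2 keeps an equal-support candidate; any other
                # candidate survives only if it reaches the threshold.
                keep = s == seed_supp or s >= min_supp
                if keep and cand not in seen:
                    seen.add(cand)
                    next_frontier.append(cand)
        frontier = next_frontier
    return closed

# Toy stand-in: trees as label sets, patterns as frozensets (an assumption).
trees = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
universe = set().union(*trees)

def extend(pattern):
    return [pattern | {x} for x in sorted(universe - pattern)]

def support(pattern):
    return sum(1 for t in trees if pattern <= t)

closed = mine_closed_patterns([frozenset({"a"})], extend, support, min_supp=2)
```

In this toy run the pattern {a, c} is not output, because its extension {a, b, c} has the same support, which is exactly the situation rule 3.3.2 describes.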

The system and method of the invention can mine useful sentence structure information and use it for relation extraction. Compared with existing methods, the relation extraction system and method of the invention can achieve better performance.

Specifically, sentence structure information is a good indicator of an actual relation. In certain sentences, the words that indicate a relation usually occupy fixed positions on the corresponding dependency tree. That is, a group of dependency trees usually contains some latent common subtree patterns, and these subtree patterns indicate actual entity relations well.

In addition, the extracted sentence structure information is very useful for filtering out false relation instances. The structural information (that is, the dependency tree patterns) is extracted using the grammatical relations between the words of a sentence, and these patterns readily distinguish the structure of a correct relation from that of a false one. For example, the sentence "Tom, the brother of Kate, works in Microsoft now." may give rise to the false relation <person-organization, Kate, Microsoft>. With a traditional method (for example, regular expressions), this false relation is likely to be extracted as well. With the system of the invention, however, such a false relation can be filtered out effectively, because "the brother of Kate" is parsed as a subtree of the node "Tom", so that structurally it is difficult to generate a relation instance between "Kate" and "Microsoft". On the other hand, a traditional regular expression that is not carefully constructed may miss a correct relation such as <person-organization, Tom, Microsoft>. With the system of the invention, such a correct relation is easily detected from the extracted dependency tree patterns.

Furthermore, the invention adopts a more effective way to integrate sentence structure features with traditional features. Because sentence structure can be very complex and errors can occur while parsing a sentence, some dependency tree patterns may contain noise. The extracted dependency tree patterns therefore cannot be used directly and on their own to extract relations. The proposed method instead builds binary features that reflect whether the dependency tree of a sentence satisfies a given dependency tree pattern. Using a feature-based machine learning algorithm (for example CRF or SVM), these features can be used together with other traditional features to train the relation labeling model.

Brief description of the drawings

The invention will be better understood from the following detailed description of its embodiments, taken in conjunction with the accompanying drawings, in which like reference numerals denote like parts, and in which:

Figure 1A and Figure 1B are schematic diagrams that help describe the key concepts used in the invention;

Fig. 2 is a block diagram showing the internal structure of a relation extraction system 200 according to an embodiment of the invention;

Fig. 3 is a flowchart showing an operation example of the system 200 shown in Fig. 2;

Fig. 4 is a block diagram showing the detailed structure of the feature extraction device included in the system 200;

Fig. 5 is a schematic diagram showing examples of dependency trees obtained by parsing sentences;

Fig. 6 is a schematic diagram describing the process of feature extraction with reference to dependency tree patterns;

Fig. 7 is a block diagram showing the detailed structure of the dependency tree pattern extractor included in the system 200;

Fig. 8 is a block diagram showing the detailed structure of an example of the dependency tree pattern extraction unit included in Fig. 7;

Fig. 9 is a flowchart showing a dependency tree pattern extraction process according to an embodiment of the invention; and

Figure 10 is a schematic diagram illustrating an example of the candidate seed pattern pruning operation performed in the dependency tree pattern extraction process.

Detailed description of the embodiments

To better describe the dependency tree pattern extraction process proposed by the invention, the basic concepts used in the description below are first briefly explained.

Relation extraction: relation extraction is a technique for finding the relation between two entities. For example, for the English sentence "Tom works for Microsoft in Seattle.", relation extraction can detect the following two relations: (1) relation 1: <person-organization, Tom, Microsoft>; or (2) relation 2: <organization-location, Microsoft, Seattle>.

Dependency tree: a dependency tree is a representation of the grammatical relations between the elements of a sentence. For example, taking the sentence above, "Tom works for Microsoft in Seattle.", the structure of its dependency tree can be as shown in Figure 1A, in which the part of speech (POS) of each sentence element and the embedded relation are also marked.

Crossover node: the crossover node n of two nodes n1 and n2 on a dependency tree t is denoted crs(n1, n2, t) = n and is defined as the first common node of the path n1 → root(t) and the path n2 → root(t). For example, as shown in Figure 1B, on dependency tree T the crossover node of node E and node P is node P, that is, crs(E, P, T) = P, and the crossover node of node P and node A is node B, that is, crs(P, A, T) = B.
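
The definition of the crossover node can be made concrete with a short sketch. It assumes the tree is stored as a child-to-parent map; the tree below is hypothetical (it is not the tree of Figure 1B, which is not reproduced here) and is only chosen to agree with the two crs values stated above.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def crs(n1, n2, parent):
    """First node on the path n1 -> root that also lies on n2 -> root."""
    on_second = set(path_to_root(n2, parent))
    for node in path_to_root(n1, parent):
        if node in on_second:
            return node
    raise ValueError("nodes are not in the same tree")

# Hypothetical tree: B is the root, P and A are its children, E hangs under P.
T = {"P": "B", "A": "B", "E": "P"}
```

With this map, `crs("E", "P", T)` evaluates to `"P"` and `crs("P", "A", T)` to `"B"`, matching the two examples in the text.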

Dependency tree pattern: according to the invention, a dependency tree pattern is defined as a closed subtree of a dependency tree that preserves all crossover nodes and implies a relation between a pair of entities.

Support of a dependency tree pattern: the support of a dependency tree pattern p is denoted supp(p) and can be defined as the total number of dependency trees that contain p. If a dependency tree t contains the dependency tree pattern p, t is said to satisfy p.

Frequent dependency tree pattern: a dependency tree pattern is said to be frequent, and is called a frequent dependency tree pattern, if its support is greater than a predetermined threshold "min_supp".

Maximal dependency tree pattern: a dependency tree pattern p is called a maximal dependency tree pattern if p is frequent and there is no other frequent pattern p' such that p' contains p.

Closed dependency tree pattern: a dependency tree pattern p is called a closed dependency tree pattern if p is frequent and there is no other pattern p' with the same support as p such that p' contains p.
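
The difference between frequent, maximal, and closed patterns can be checked on a small example. The sketch below is an illustration under a simplifying assumption: patterns are modelled as frozensets of labels so that "p' contains p" becomes proper set inclusion, whereas the source means subtree containment. The support values and function names are hypothetical.

```python
def is_frequent(p, supp, min_supp):
    """Frequent: support strictly greater than the threshold min_supp."""
    return supp[p] > min_supp

def is_maximal(p, supp, min_supp):
    """Maximal: frequent, and no other frequent pattern contains p."""
    return is_frequent(p, supp, min_supp) and not any(
        q > p and is_frequent(q, supp, min_supp) for q in supp
    )

def is_closed(p, supp, min_supp):
    """Closed: frequent, and no containing pattern has the same support."""
    return is_frequent(p, supp, min_supp) and not any(
        q > p and supp[q] == supp[p] for q in supp
    )

# Hypothetical supports; `q > p` on frozensets tests proper inclusion.
supp = {
    frozenset("a"): 4,
    frozenset("ab"): 3,
    frozenset("ac"): 2,
    frozenset("abc"): 2,
}
maximal = {p for p in supp if is_maximal(p, supp, min_supp=1)}
closed = {p for p in supp if is_closed(p, supp, min_supp=1)}
```

Here only {a, b, c} is maximal, while {a}, {a, b}, and {a, b, c} are closed, which shows that closed patterns retain support information that maximal patterns discard.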

Minimum embedded subtree pattern (LEST): the LEST of a dependency tree t is the dependency tree pattern p of minimum size that contains the relation, and all crossover nodes of p must be preserved in the LEST. For example, every pair of nodes n1 and n2 in pattern p should satisfy crs(n1, n2, p) = crs(n1, n2, t). Referring to Figure 1B, for the dependency tree T on the left, suppose node "P" represents a person and node "A" represents an organization, so that a relation exists between nodes "P" and "A". The LEST of T can then be as shown in (1) of Figure 1B. However, (2) of Figure 1B cannot serve as the LEST of T, because the crossover node of node "G" and node "A" is "D" rather than "B".
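
One possible reading of the LEST definition is sketched below: start from the nodes that carry the relation, add pairwise crossover nodes until nothing new appears, and link each kept node to its nearest kept ancestor. This interpretation, the function names, and the example tree are all assumptions; the tree is hypothetical and is not claimed to be the one in Figure 1B.

```python
def path_to_root(node, parent):
    """Nodes from `node` up to the root; the root maps to None."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path

def crs(n1, n2, parent):
    """First common node of the two paths to the root."""
    on_second = set(path_to_root(n2, parent))
    return next(n for n in path_to_root(n1, parent) if n in on_second)

def lest(relation_nodes, parent):
    """Return the pattern as a child -> parent map (pattern root maps to None)."""
    kept = set(relation_nodes)
    changed = True
    while changed:  # close the node set under pairwise crossover nodes
        changed = False
        for a in list(kept):
            for b in list(kept):
                c = crs(a, b, parent)
                if c not in kept:
                    kept.add(c)
                    changed = True
    pattern = {}
    for n in kept:
        # Embedded parent: the nearest proper ancestor that was also kept.
        ancestors = path_to_root(n, parent)[1:]
        pattern[n] = next((a for a in ancestors if a in kept), None)
    return pattern

# Hypothetical tree: B is the root; P and D under B; A and G under D; E under P.
T = {"B": None, "P": "B", "D": "B", "A": "D", "G": "D", "E": "P"}
```

For this tree, `lest({"P", "A"}, T)` keeps B, P, and A and skips the intermediate node D, while `lest({"G", "A"}, T)` is rooted at D, because D rather than B is the crossover node of G and A.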

Fig. 2 is a block diagram showing the internal structure of a relation extraction system 200 according to an embodiment of the invention. As shown in Fig. 2, the system 200 mainly comprises a corpus acquisition device 201, a dependency tree pattern acquisition device 202, a feature extraction device 203, a relation labeling model training device 204, and a model application device 205. Optionally, the system 200 also comprises a dependency tree pattern extractor 206 for automatically extracting the required dependency tree patterns. As mentioned above, instead of extracting the dependency tree patterns automatically, the user can also prepare them manually in advance and store them in the dependency tree pattern storage 208. The method proposed by the invention for automatically extracting dependency tree patterns is described in more detail below.

As mentioned above, the relation extraction system 200 of the invention operates in two main stages, the model training stage and the model application stage. The model training stage is carried out mainly by the corpus acquisition device 201, the dependency tree pattern acquisition device 202, the feature extraction device 203, and the relation labeling model training device 204, while the model application stage is carried out by the model application device 205.

The flowchart of Fig. 3 illustrates an operation example of the system 200 shown in Fig. 2. The process starts at step 301, in which the corpus acquisition device 201 acquires a corpus from the corpus storage 207; the corpus contains, for example, a plurality of sentences annotated with relations. In step 303, the dependency tree pattern acquisition device 202 acquires the dependency tree patterns prepared in advance from the dependency tree pattern storage 208. Step 303 may be preceded by an optional step 302 (shown with a dashed box), which automatically extracts the required dependency tree patterns from the acquired corpus. The dependency tree pattern extraction process is described in detail below. Then, in step 304, the feature extraction device 203 extracts the features of each sentence in the acquired corpus with reference to the acquired dependency tree patterns; these features can include structural features and traditional features of the sentence. As an example, the structural features can be dependency tree features and the traditional features can be context features. In step 305, the features of each sentence extracted by the feature extraction device 203 are collected and provided to the relation labeling model training device 204, which can use standard machine learning techniques to train the relation labeling model. The generated relation labeling model can be stored in the relation labeling model storage 209. Subsequently, when an unannotated sentence is input, in step 306 the model application device 205 obtains the relation labeling model stored in the relation labeling model storage 209 and applies it to the unannotated sentence to extract the required relation instances. Process 300 then ends.

Fig. 2 also shows the internal structure of the model application device 205 in detail. The model application device 205 can comprise, for example, a sentence input unit 2051, a parsing unit 2052, a dependency tree pattern acquisition unit 2053, a feature extraction unit 2054, a relation labeling unit 2055, and a relation instance extraction unit 2056. The model application stage has already been outlined above. Specifically, the user first inputs, through the sentence input unit 2051, an unannotated sentence from which relation instances are to be extracted. The parsing unit 2052 then parses the input sentence to obtain its dependency tree. The dependency tree pattern acquisition unit 2053 obtains the set of dependency tree patterns prepared in the model training stage and provides it to the feature extraction unit 2054, which then extracts the features of the input sentence with reference to the dependency tree patterns. The relation labeling unit 2055 labels the relation arguments in the input sentence according to the features extracted by the feature extraction unit 2054. Subsequently, the relation labeling model generated in the model training stage and stored in the relation labeling model storage 209 is provided to the relation instance extraction unit 2056, which applies it to the sentence whose relation arguments have been labeled in order to extract relation instances. Because the model application process is not where the innovation of the invention lies, it is not described further.

The feature extraction process according to the invention is described first. Fig. 4 is a block diagram showing the detailed structure of the feature extraction device 203 included in the system 200 shown in Fig. 2.

As mentioned above, for each sentence the invention extracts not only traditional context features but also dependency tree features related to the dependency tree patterns. As shown in Fig. 4, the feature extraction device 203 mainly comprises a context feature extraction unit 401 for extracting context features, a dependency tree feature extraction unit 402 for extracting dependency tree features, and a feature storage unit 403 for storing the features. The methods for extracting context features and dependency tree features are explained in more detail below. It should be made clear that, although specific extraction methods for context features and dependency tree features are given below, the invention is not limited to the described embodiments. Various other feature extraction methods known to those skilled in the art, or conceivable from the description of the invention, are all intended to fall within the scope of the invention.

As shown in Fig. 4, in this example the context feature extraction unit 401 comprises, for example, a POS tagging unit 4011 and a context feature extractor 4012, together with a memory 4013 that mainly stores the intermediate results produced by the POS tagging unit 4011, namely the POS-tagged sentences. The context feature extractor 4012 extracts traditional context features by analyzing the POS-tagged sentences. This part is well known in the art and is therefore not described further here.

The dependency tree feature extraction unit 402 can comprise a parsing unit 4021, a dependency tree feature extractor 4022, and a memory 4023 for storing the results of the parsing unit 4021. The parsing unit 4021 first parses the sentences in the acquired corpus to generate their dependency trees, which are then stored in the memory 4023. In the invention, besides generating the dependency tree of each sentence, the parsing unit 4021 can also derive the part-of-speech (POS) tag of each node and add it to all nodes of the dependency tree. The invention uses the part of speech of each word in the sentence rather than the word itself because words are too specific for common patterns to be found among dependency trees. Optionally, the user can also add other attributes to the nodes of the dependency tree, such as the type of the dependency on the parent node and the role the node plays in implying a relation instance (for example "argument ARG-1", "argument ARG-2", or "keyword").

For example, Fig. 5 illustrates by resolving the example of two dependency trees relevant with sentence that obtain.In this example, suppose that collected works comprise two sentences through mark: sentence (1) " Tom works forMicrosoft in Seattle. " and sentence (2) " Kate; once a leader of ACB; now worksin her sister ' s company BCA. ", wherein sentence (1) has relation＜people-tissue, Tom, Microsoft 〉, sentence (2) has relation＜people-tissue, Kate, BCA 〉.By resolving sentence (1) and (2), resolution unit 4021 can obtain the dependency tree relevant with (2) with sentence (1), as shown in Figure 5.In the example of Fig. 5, also to all nodes on the dependency tree marked part of speech and the hint relationship example aspect institute's role, wherein " per " and " aff " is respectively writing a Chinese character in simplified form of " people " and " tissue ".And two parameters that in Fig. 5, concern with grey box indication " people-tissue ".

The dependency trees processed in this way can be stored in the memory 4023. The dependency tree feature extractor 4022 then refers to the dependency tree patterns obtained by the dependency tree pattern obtaining device 202 and extracts the dependency tree features of each sentence from that sentence's dependency tree.

Fig. 6 shows an example of feature extraction with reference to dependency tree patterns according to the present invention. Suppose the dependency tree pattern obtained by the dependency tree feature extractor 4022 is the one shown on the left side of Fig. 6. The right side of Fig. 6 shows the feature extraction results for sentence (1) and sentence (2) above. Columns 1-4 correspond to conventional context features, and the m columns in the dashed box correspond to dependency tree features, where m is the number of dependency tree patterns obtained. Because the example of Fig. 6 provides only one dependency tree pattern, the dashed box shows only the first of the m columns, which corresponds to that pattern.

The features in the example of Fig. 6 are defined as follows:

(1) Column 1: the POS tag of the current word;

(2) Column 2: does the word denote a person? (1 if yes, 0 if no);

(3) Column 3: does the word denote an organization? (1 if yes, 0 if no);

(4) Column 4: is there a person within the 4 words before or after the current word? (1 if yes, 0 if no);

(5) Dashed box (dependency tree feature): can the word be mapped to a node of the dependency tree pattern? (1 if yes, 0 if no).

These features are only examples; the user may define different features according to actual requirements.
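The five feature columns listed above can be sketched as follows. This is only an illustration under stated assumptions: the token records, the Penn-Treebank-style POS tags, the helper name `feature_rows`, the set of word positions matched by the pattern, and the choice to exclude the current word itself from the 4-word window are all hypothetical and are not taken from Fig. 6.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical token record: (word, POS tag, entity type or None).
Token = Tuple[str, str, Optional[str]]

def feature_rows(tokens: List[Token], pattern_hits: List[Set[int]],
                 window: int = 4) -> List[list]:
    """Build one row per word: 4 context columns plus one column per pattern."""
    rows = []
    for i, (_word, pos, ent) in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Column 4: a person within `window` words (current word excluded: an assumption).
        person_nearby = any(tokens[j][2] == "per" for j in range(lo, hi) if j != i)
        row = [pos, int(ent == "per"), int(ent == "aff"), int(person_nearby)]
        # Dependency tree columns: 1 if this word maps to a node of the pattern.
        row += [int(i in hits) for hits in pattern_hits]
        rows.append(row)
    return rows

tokens = [("Tom", "NNP", "per"), ("works", "VBZ", None), ("for", "IN", None),
          ("Microsoft", "NNP", "aff"), ("in", "IN", None), ("Seattle", "NNP", None)]
# Assumed match: the single pattern covers "Tom", "works", "for", "Microsoft".
rows = feature_rows(tokens, pattern_hits=[{0, 1, 2, 3}])
```

With m patterns, `pattern_hits` would simply hold m position sets, giving m extra columns per word.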

Returning to Fig. 2, after the features have been extracted, the relation labeling model trainer 204 collects them and trains a relation labeling model on them using any machine learning technique. Here, CRF is taken as an example to explain briefly how the collected features are used.

The key part of the CRF training process is feature selection. In practical applications of relation labeling, such as search engine systems, precision is often more important than recall. The system does not need to return all relevant information; it only needs to provide the most important information to the user. The user can therefore select dependency tree patterns with high precision for extracting new relations. If the user wishes to obtain a high recall or F-measure, the dependency tree patterns can instead be used as binary features to construct the CRF model. Specifically, if a sentence s satisfies pattern p, the binary feature f(p, s) = 1; otherwise f(p, s) = 0. This feature can thus be described as "does the dependency tree of this sentence satisfy pattern p?". The example shown in Fig. 6 is a concrete instance of this idea. With the data processed in this way, CRF can learn a model, and the model is then used to extract new relations.
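The binary feature f(p, s) can be illustrated with a short sketch. The matching test used here is a deliberate simplification and an assumption: a dependency tree and a pattern are both reduced to sets of (head POS, dependent POS) edges, and a sentence is taken to satisfy a pattern when every pattern edge occurs in the tree. The text does not specify the actual tree matching procedure, and the names `satisfies` and `binary_features` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]  # (head POS, dependent POS); an assumed representation

def satisfies(tree_edges: FrozenSet[Edge], pattern_edges: FrozenSet[Edge]) -> bool:
    """Simplified test: every edge of the pattern occurs in the tree."""
    return pattern_edges.issubset(tree_edges)

def binary_features(tree_edges: FrozenSet[Edge],
                    patterns: List[FrozenSet[Edge]]) -> List[int]:
    """f(p, s) = 1 if sentence s satisfies pattern p, otherwise 0."""
    return [1 if satisfies(tree_edges, p) else 0 for p in patterns]

# Assumed tree for "Tom works for Microsoft in Seattle."
sentence1 = frozenset({("VBZ", "NNP"), ("VBZ", "IN"), ("IN", "NNP")})
patterns = [frozenset({("VBZ", "NNP"), ("IN", "NNP")}),
            frozenset({("VBD", "NNP")})]
features = binary_features(sentence1, patterns)
```

The resulting 0/1 vector would be passed to the CRF trainer alongside the context features.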

Another important aspect of the present invention, the automatic extraction of dependency tree patterns, is described in detail below with reference to Figs. 7-10. It should be noted that the embodiment given below is only one example of a dependency tree pattern extraction process, and the scope of the present invention is not limited to it. The dependency tree patterns may be prepared manually in advance by the user, or obtained in advance in other ways known to those skilled in the art. When dependency tree patterns are created manually, however, the user has to inspect a large number of dependency trees and reduce them to a few dependency tree patterns, which is very time-consuming. The present invention removes this problem, because the dependency tree patterns can be extracted automatically.

Fig. 7 is a block diagram showing the structure of the dependency tree pattern extractor 206 included in the system 200 of Fig. 2. As shown in Fig. 7, the dependency tree pattern extractor 206 may comprise a parsing unit 701, a clustering unit 702, a dependency tree pattern extraction unit 703, a dependency tree memory 704 and a cluster memory 705. First, the parsing unit 701 parses each relation-labeled sentence in the corpus from the corpus memory 207 to generate the corresponding dependency tree. The clustering unit 702 clusters the dependency trees generated by the parsing unit 701 into different groups, such that the dependency trees within one group are structurally similar. The clustering result can then be stored in the cluster memory 705. Next, the dependency tree pattern extraction unit 703 applies a subtree mining algorithm to mine subtrees in each group of dependency trees, and selects the subtrees that satisfy the requirements for dependency tree patterns as its output.

Conventional subtree mining algorithms attempt to extract all possible subtrees. Because of combinatorial explosion, however, the number of subtrees grows exponentially with the size of the subtree pattern. If the minimum support "min_supp" is set to a small value, a very large number of patterns will therefore exist, which may cause the mining process to fail. To solve this problem, the present invention first clusters the dependency trees so that structurally similar dependency trees form a group, and then extracts patterns from each group.

As an example, the present invention can define the similarity function of dependency trees on the basis of their LEST. For two dependency trees t₁ and t₂, the similarity function Sim(t₁, t₂) takes two values: Sim(t₁, t₂) = 1 when t₁ and t₂ have the same LEST, and Sim(t₁, t₂) = 0 when they have different LESTs. Dependency trees with the same LEST are clustered into the same group. The advantage of this definition is that the user does not need to predefine the number of groups or a similarity threshold for clustering the dependency trees. Moreover, with this definition the time complexity of the clustering algorithm is O(N): the algorithm only needs to scan the database of dependency trees once. When a new dependency tree t (with LEST(t)) arrives, LEST(t) only needs to be compared with the LEST of each current group. If a group whose LEST equals LEST(t) is found, t is added to that group; otherwise a new group is created for t. To find the group with the same LEST as t even more efficiently, the LEST can be represented by the combination of the string sequences produced by its postorder traversal and its preorder traversal; normally, the pair of postorder and preorder sequences together determines a tree. A hash table can then be used to index the string sequence of each LEST. After the dependency trees have been clustered into different groups, the subtree mining algorithm can be run on each group to extract dependency tree patterns.
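The single-pass clustering described above can be sketched as follows. The sketch assumes that the LEST of each dependency tree has already been computed and is given as a nested `(label, children)` tuple; that representation, the node labels and the helper names `lest_key` and `cluster_by_lest` are hypothetical. A Python dictionary plays the role of the hash table, so each tree is handled with one key computation and one lookup.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Node = Tuple[str, list]  # (label, list of child nodes); an assumed representation

def preorder(node: Node) -> List[str]:
    label, children = node
    out = [label]
    for child in children:
        out.extend(preorder(child))
    return out

def postorder(node: Node) -> List[str]:
    label, children = node
    out = []
    for child in children:
        out.extend(postorder(child))
    out.append(label)
    return out

def lest_key(lest: Node) -> str:
    """String key combining the preorder and postorder label sequences."""
    return " ".join(preorder(lest)) + " | " + " ".join(postorder(lest))

def cluster_by_lest(items: List[Tuple[str, Node]]) -> Dict[str, List[str]]:
    """One scan: trees whose LESTs share a key fall into the same group."""
    groups = defaultdict(list)
    for tree_id, lest in items:
        groups[lest_key(lest)].append(tree_id)
    return dict(groups)

lest_a = ("VBZ", [("NNP/ARG-1", []), ("IN", [("NNP/ARG-2", [])])])
lest_b = ("VBZ", [("NNP/ARG-1", []), ("IN", [("NNP/ARG-2", [])])])
lest_c = ("VBZ", [("NNP/ARG-1", []), ("NNP/ARG-2", [])])
groups = cluster_by_lest([("s1", lest_a), ("s2", lest_b), ("s3", lest_c)])
```

Here `s1` and `s2` share a key and form one group, while `s3` starts a new group, mirroring Sim = 1 and Sim = 0 respectively.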

In an embodiment of the present invention, the dependency tree pattern extraction unit 703 extracts one or more closed dependency tree patterns as its output, based on the structural similarity of the dependency trees in each group. Closed dependency tree patterns were defined earlier: a dependency tree pattern p is called a closed dependency tree pattern if there is no other dependency tree pattern p' that contains p and has the same support as p. All closed dependency tree patterns extracted by the dependency tree pattern extraction unit 703 can be stored in the dependency tree pattern memory 208 as the final dependency tree patterns, for use in the subsequent feature extraction and training of the relation labeling model.
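The definition of a closed pattern can be checked mechanically once supports are known. The following sketch makes a simplifying assumption: a pattern is modeled as a set of node labels and "contains" as strict set inclusion, which stands in for subtree containment. The function name and the toy supports are hypothetical.

```python
from typing import Dict, FrozenSet, List

def closed_patterns(support: Dict[FrozenSet[str], int]) -> List[FrozenSet[str]]:
    """Keep p unless some strictly larger pattern q has the same support."""
    closed = []
    for p, sp in support.items():
        has_equal_super = any(
            q != p and p.issubset(q) and sq == sp for q, sq in support.items()
        )
        if not has_equal_super:
            closed.append(p)
    return closed

# Toy supports: {a} and {a, b} occur in the same 3 trees, {a, b, c} in 2.
support = {
    frozenset({"a"}): 3,
    frozenset({"a", "b"}): 3,
    frozenset({"a", "b", "c"}): 2,
}
closed = closed_patterns(support)
```

In this toy input `{a}` is not closed, because `{a, b}` contains it with the same support of 3.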

For the extraction of closed dependency tree patterns, the present invention proposes an iterative method. Fig. 8 shows an example of the internal structure of the dependency tree pattern extraction unit 703 of Fig. 7 when it works iteratively. In this case, the dependency tree pattern extraction unit 703 comprises a seed pattern collector 801, a candidate seed pattern generator 802 and a seed pattern pruner 803. The seed pattern collector 801 first collects the LEST of each group as the initial set of seed patterns. Then, in each iteration, the candidate seed pattern generator 802 adds one additional node to each seed pattern to generate a set of new candidate seed patterns. The seed pattern pruner 803 adjusts the set of candidate seed patterns according to predetermined criteria, deleting the candidate seed patterns that are of no use. The remaining candidate seed patterns are then supplied to the candidate seed pattern generator 802 again as the new seed patterns for the next iteration. This process is repeated until the set of seed patterns is empty.

As mentioned above, in an embodiment of the present invention the dependency tree pattern extraction unit 703 attempts to extract closed dependency tree patterns as the reference for the final feature extraction. Fig. 9 shows an example of the dependency tree pattern extraction process, in which the set of seed patterns is likewise adjusted round by round in an iterative manner.

As shown in Fig. 9, the process starts at step 901, in which the parsing unit 701 parses each sentence in the obtained corpus to generate the corresponding dependency tree. In step 902, the clustering unit 702 clusters the dependency trees, for example according to their LEST, to generate different groups. In step 903, the seed pattern collector 801 collects the LEST of each group as the initial set of seed patterns. Then, in step 904, the candidate seed pattern generator 802 adds one additional node to each seed pattern p to generate a set of new candidate seed patterns {p₁, p₂, ..., p_n}. The seed pattern pruner 803 then adjusts the set of seed patterns in steps 905-915.

Specifically, step 905 determines whether the supports of all candidate seed patterns {p₁, p₂, ..., p_n} generated from seed pattern p are less than the support of p. If so, p is output as a closed dependency tree pattern in step 906. For each candidate seed pattern p₁, p₂, ..., p_n generated from p, step 907 then determines whether the candidate is frequent, that is, whether its support S(p_i) (i = 1, 2, ..., n) is less than a predetermined threshold Th. If the support of a candidate seed pattern p_i is less than the threshold Th, the candidate is not frequent and is deleted from the set of candidate seed patterns (step 909). Otherwise, the candidate seed pattern p_i is kept (step 908).

If step 905 determines that not all candidate seed patterns {p₁, p₂, ..., p_n} generated from seed pattern p have a support less than that of p, the process proceeds to step 910. Step 910 determines whether there is a candidate seed pattern p_m whose support equals the support of the seed pattern p that generated it. If so, this candidate seed pattern p_m is kept (step 911). If not, step 912 determines whether the candidate seed pattern p_m is frequent, that is, whether its support is less than the predetermined threshold Th. If it is, p_m is deleted from the set of candidate seed patterns in step 913; if it is not, p_m is kept (step 914).

Then, in step 915, all candidate seed patterns kept in the current iteration are collected and used as the new set of seed patterns for the next iteration. Step 916 determines whether the set of seed patterns is empty after this iteration. If not, the process returns to step 904 and repeats steps 904-915. If the set of seed patterns is empty, the process ends.
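The loop of steps 904-916 can be sketched in a few lines. The sketch is not the patented tree algorithm itself but a simplified model, and the following points are assumptions: each dependency tree of a group is reduced to a set of node labels, a pattern is a set of labels, "adding one additional node" means adding one label that occurs in the group, and support is the number of trees containing the pattern. The names `support` and `mine_closed` are hypothetical.

```python
from typing import FrozenSet, List, Set

Pattern = FrozenSet[str]

def support(pattern: Pattern, trees: List[Set[str]]) -> int:
    """Number of trees that contain every node of the pattern."""
    return sum(1 for t in trees if pattern.issubset(t))

def mine_closed(seeds: List[Pattern], trees: List[Set[str]],
                threshold: int) -> Set[Pattern]:
    closed: Set[Pattern] = set()
    items = set().union(*trees)
    current = set(seeds)
    while current:                                        # step 916: stop when empty
        kept = set()
        for p in current:
            sp = support(p, trees)
            candidates = [p | {x} for x in items - p]     # step 904: add one node
            sups = [support(c, trees) for c in candidates]
            if all(sp > s for s in sups):                 # step 905
                closed.add(p)                             # step 906
            for c, s in zip(candidates, sups):
                # Steps 907-914: keep candidates with the seed's support
                # or with support at or above the threshold Th.
                if s == sp or s >= threshold:
                    kept.add(c)
        current = kept                                    # step 915
    return closed

trees = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}]
closed = mine_closed([frozenset({"a"})], trees, threshold=2)
```

In this toy run the seed `{a}` is not output, because its extension `{a, b}` has the same support of 3; `{a, b}` and `{a, b, c}` are output as closed patterns.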

To further improve the efficiency of extracting closed dependency tree patterns, the pruning process described above may also include the following processing: in each iteration, besides comparing each candidate seed pattern p_i with the seed pattern p that generated it, the candidate seed pattern p_i may also be compared with the other seed patterns. If one of these other seed patterns, k, is contained in the candidate seed pattern p_i and both have the same support, then the seed pattern k and all candidate seed patterns {k₁, k₂, ..., k_n} generated from it are deleted. Fig. 10 shows an example of this pruning process.

In Fig. 10, the set of seed patterns in a certain iteration is assumed to contain two seed patterns, (1) and (2). After a node has been added, seed pattern (1) yields two candidate seed patterns, (11) and (12), and seed pattern (2) yields candidate seed pattern (21). All nodes of seed pattern (1) are contained in candidate seed pattern (21), which was generated from seed pattern (2). According to the algorithm above, seed pattern (1) and all of its candidate seed patterns (11) and (12) are therefore deleted from the set of seed patterns. Doing so improves the efficiency of extracting closed dependency tree patterns without losing any closed dependency tree pattern, as is proved below.

Suppose that in the N-th iteration there are m seed patterns of size N, each of which is subsequently extended to size N+1. Suppose further that seed pattern p(i, N) generates the new candidate seed patterns p(i, j, N+1). Then all pairs p(i, j, N+1) and p(k, N) with i ≠ k are checked. If p(k, N) is contained in p(i, j, N+1) and supp(p(k, N)) = supp(p(i, j, N+1)), then p(k, N) and all candidate seed patterns p(k, l, N+1) generated from it are deleted. It must now be shown that doing so does not lose any closed pattern. To show this, it suffices to prove the following: "if there is a closed pattern p that is generated directly or indirectly from p(k, N), then this closed pattern p must be contained in another pattern p', and this pattern p' can be generated from p(i, j, N+1)".

Proof: First, let ext(p, p') denote the extension that obtains p' from p. Since p(k, N) is contained in p(i, j, N+1), the extension ext(p(k, N), p(i, j, N+1)) must exist. Two cases need to be considered: (1) if p has undergone the extension ext(p(k, N), p(i, j, N+1)), then p must contain p(i, j, N+1), and p can therefore be generated from p(i, j, N+1); (2) otherwise, if p does not contain p(i, j, N+1), the extension ext(p(k, N), p(i, j, N+1)) can be applied to p to obtain p', which must satisfy supp(p) = supp(p'), so that p' contains both p(k, N) and p(i, j, N+1).

The above proof shows that the pruning process illustrated in Fig. 10 does not miss any closed dependency tree pattern.
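The pruning rule illustrated in Fig. 10 can be sketched under the same simplified set model, in which a pattern is a set of node labels and "contained in" means set inclusion. Two details are assumptions of this sketch and are not stated in the text: seed patterns are visited in a fixed sorted order, and a seed pattern that has already been deleted no longer causes other seed patterns to be deleted. The function name `prune_seeds` and the toy data are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List

Pattern = FrozenSet[str]

def prune_seeds(candidates_of: Dict[Pattern, List[Pattern]],
                support: Callable[[Pattern], int]) -> Dict[Pattern, List[Pattern]]:
    """Drop seed k (and its candidates) when a candidate of another seed
    contains k and has the same support as k."""
    removed = set()
    for p in sorted(candidates_of, key=sorted):
        if p in removed:
            continue
        for c in candidates_of[p]:
            for k in candidates_of:
                if (k != p and k not in removed
                        and k.issubset(c) and support(k) == support(c)):
                    removed.add(k)
    return {p: cs for p, cs in candidates_of.items() if p not in removed}

trees = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"a", "c"}]

def supp(pattern: Pattern) -> int:
    return sum(1 for t in trees if pattern.issubset(t))

seed1, seed2 = frozenset("ab"), frozenset("ac")          # seed patterns (1) and (2)
candidates_of = {
    seed1: [frozenset("abd"), frozenset("abe")],         # candidates (11) and (12)
    seed2: [frozenset("abc")],                           # candidate (21)
}
pruned = prune_seeds(candidates_of, supp)
```

Candidate (21) contains seed pattern (1) and both have support 2, so seed pattern (1) is dropped together with candidates (11) and (12), as in the Fig. 10 example.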

The entity relation extraction system and method according to the present invention, and the dependency tree pattern extraction process used in them, have been described in detail above with reference to the drawings. As stated earlier, the relation extraction system and method of the present invention can achieve better performance than existing methods.

In addition, the extracted sentence structure information is also very useful for filtering out spurious relation instances. Structural information (that is, dependency tree patterns) can be extracted from the grammatical relations between the words of a sentence. These dependency tree patterns make it easy to distinguish the structure of a correct relation from the structure of a false one.

Although specific embodiments of the present invention have been described above, the present invention is not limited to the particular configurations and processing shown in the drawings. For brevity, detailed descriptions of known methods and techniques are omitted here. In the embodiments above, several specific steps have been described and shown as examples. The method of the present invention is not limited to the specific steps described and shown, however; those skilled in the art, once they understand the spirit of the present invention, can make various changes, modifications and additions, or change the order of the steps.

The elements of the present invention may be implemented as hardware, software, firmware or a combination thereof, and may be used in systems, subsystems, components or subcomponents thereof. When implemented in software, the elements of the present invention are the programs or code segments that perform the required tasks. The programs or code segments can be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transmit information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber-optic media, radio frequency (RF) links, and so on. Code segments can be downloaded via computer networks such as the Internet or an intranet.

The present invention may be embodied in other specific forms without departing from its spirit and essential characteristics. For example, the algorithms described in the specific embodiments may be modified, provided that the system architecture does not depart from the essential spirit of the present invention. The present embodiments are therefore to be regarded in all respects as illustrative and not restrictive; the scope of the present invention is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are included within the scope of the present invention.

Claims

1. A method for relation extraction, comprising:

obtaining a corpus, the corpus comprising a plurality of sentences labeled with relations;

obtaining a set of dependency tree patterns related to the structure of the sentences in the corpus;

extracting features of each sentence in the corpus with reference to the dependency tree patterns, the features including structural features of the sentence;

collecting the extracted features to train a relation labeling model; and

applying the relation labeling model to unlabeled sentences to extract relation instances.

2. The method of claim 1, further comprising:

automatically extracting the dependency tree patterns from the corpus.

3. The method of claim 2, wherein the step of extracting the dependency tree patterns comprises:

parsing each relation-labeled sentence in the corpus to generate a corresponding dependency tree;

clustering the generated dependency trees into different groups, wherein the dependency trees in the same group are structurally similar;

extracting one or more closed dependency tree patterns, wherein a dependency tree pattern p is called a closed dependency tree pattern if there is no other dependency tree pattern p' that contains the dependency tree pattern p and has the same support as the dependency tree pattern p; and

collecting and storing the extracted closed dependency tree patterns.

4. The method of claim 3, wherein the dependency trees in the same group have structurally identical minimum embedded subtree patterns (LEST), the minimum embedded subtree pattern comprising a pair of relation nodes and all crossover nodes of the pair of relation nodes, a known relation existing between the pair of relation nodes.

5. The method of claim 4, wherein the step of extracting one or more closed dependency tree patterns comprises:

(a) collecting the LEST of each group as an initial set of seed patterns;

(b) adding one additional node to each of the seed patterns to generate a set of new candidate seed patterns;

(c) adjusting the set of candidate seed patterns as follows:

if the supports of all candidate seed patterns generated from a seed pattern are less than the support of that seed pattern, outputting the seed pattern as a closed dependency tree pattern, and for each candidate seed pattern generated from that seed pattern:

if the support of a candidate seed pattern equals the support of the seed pattern that generated the candidate seed pattern, keeping the candidate seed pattern, and for each other candidate seed pattern generated from that seed pattern:

if the support of the candidate seed pattern is greater than or equal to the specified threshold, keeping the candidate seed pattern; and

(d) using the kept candidate seed patterns as new seed patterns and repeating steps (b) and (c) until the set of seed patterns is empty.

6. The method of claim 5, further comprising:

comparing each candidate seed pattern with the other seed patterns apart from the seed pattern that generated the candidate seed pattern, and, if one of the other seed patterns is contained in the candidate seed pattern and both have the same support, deleting that other seed pattern and all candidate seed patterns generated from it.

7. The method of claim 2, wherein the step of extracting features comprises:

extracting context features of each sentence;

extracting dependency tree features of each sentence; and

storing the context features and the dependency tree features.

8. The method of claim 7, wherein the step of extracting the context features comprises:

labeling each sentence in the corpus with part-of-speech tags; and

extracting the context features of the sentence by analyzing the sentence labeled with parts of speech.

9. The method of claim 7, wherein the step of extracting the dependency tree features comprises:

parsing each sentence in the corpus to obtain the corresponding dependency tree;

obtaining all of the extracted dependency tree patterns; and

constructing the dependency tree features for the sentence by comparing the corresponding dependency tree with the dependency tree patterns.

10. The method of claim 9, wherein the dependency tree features for each sentence consist of an n × m matrix, where m is the number of extracted dependency tree patterns and n is the number of nodes in the dependency tree of the sentence, and wherein, for each dependency tree pattern, if the dependency tree of the sentence satisfies the dependency tree pattern, then in the column of the n × m matrix that corresponds to the dependency tree pattern, the matrix elements corresponding to the nodes of the dependency tree pattern are set to 1 and the other elements are set to 0.

11. A system for relation extraction, comprising:

a corpus obtaining device for obtaining a corpus, the corpus comprising a plurality of sentences labeled with relations;

a dependency tree pattern obtaining device for obtaining a set of dependency tree patterns related to the structure of the sentences in the corpus;

a feature extraction device for extracting features of each sentence in the corpus with reference to the extracted dependency tree patterns, the features including structural features of the sentence;

a relation labeling model trainer for collecting the features extracted by the feature extraction device to train a relation labeling model; and

a model application device for applying the relation labeling model to unlabeled sentences to extract relation instances.

12. The system of claim 11, further comprising:

a dependency tree pattern extractor for automatically extracting the dependency tree patterns from the corpus.

13. The system of claim 12, wherein the dependency tree pattern extractor comprises:

a parsing unit for parsing each relation-labeled sentence in the corpus to generate a corresponding dependency tree;

a clustering unit for clustering the generated dependency trees into different groups, wherein the dependency trees in the same group are structurally similar; and

a dependency tree pattern extraction unit for extracting one or more closed dependency tree patterns, wherein a dependency tree pattern p is called a closed dependency tree pattern if there is no other dependency tree pattern p' that contains the dependency tree pattern p and has the same support as the dependency tree pattern p.

14. The system of claim 13, wherein the dependency trees in the same group have structurally identical minimum embedded subtree patterns (LEST), the minimum embedded subtree pattern comprising a pair of relation nodes and all crossover nodes of the pair of relation nodes, a known relation existing between the pair of relation nodes.

15. The system of claim 14, wherein the dependency tree pattern extraction unit comprises:

a seed pattern collector for collecting the LEST of each group as an initial set of seed patterns;

a candidate seed pattern generator for adding one additional node to each of the seed patterns to generate a set of new candidate seed patterns; and

a seed pattern pruner for adjusting the set of candidate seed patterns as follows:

if the supports of all candidate seed patterns generated from a seed pattern are less than the support of that seed pattern, outputting the seed pattern to a dependency tree pattern memory as a closed dependency tree pattern, and for each candidate seed pattern generated from that seed pattern:

if the support of the candidate seed pattern is less than a specified threshold, deleting the candidate seed pattern,

if the support of the candidate seed pattern is greater than or equal to the specified threshold, keeping the candidate seed pattern, and

wherein the candidate seed pattern generator and the seed pattern pruner work iteratively, and the candidate seed patterns kept in each iteration are used as the new set of seed patterns for the next iteration, until the set of seed patterns is empty.

16. The system of claim 15, wherein the seed pattern pruner is further configured to:

17. The system of claim 12, wherein the feature extraction device comprises:

a context feature extraction unit for extracting context features of each sentence;

a dependency tree feature extraction unit for extracting dependency tree features of each sentence; and

a feature storage unit for storing the context features and the dependency tree features.

18. The system of claim 17, wherein the context feature extraction unit comprises:

a part-of-speech tagging unit for labeling each sentence in the corpus with part-of-speech tags; and

a context feature extractor for analyzing the sentence labeled with parts of speech to extract the context features of the sentence.

19. The system of claim 17, wherein the dependency tree feature extraction unit comprises:

a parsing unit for parsing each sentence in the corpus to obtain the corresponding dependency tree; and

a dependency tree feature extractor for constructing the dependency tree features for the sentence by comparing the corresponding dependency tree with the extracted dependency tree patterns.

20. The system of claim 11, wherein the model application device comprises:

a sentence input unit for inputting an unlabeled sentence;

a parsing unit for parsing the unlabeled sentence to obtain its dependency tree;

a dependency tree pattern obtaining unit for obtaining the set of dependency tree patterns extracted from the corpus;

a feature extraction unit for extracting features of the unlabeled sentence with reference to the obtained dependency tree patterns;

a relation labeling unit for labeling relation arguments in the sentence whose features have been extracted; and

a relation instance extraction unit for applying the relation labeling model to the sentence labeled with relation arguments, to extract relation instances.