WO2009017464A1 - Relation extraction system - Google Patents

Relation extraction system Download PDF

Info

Publication number
WO2009017464A1
WO2009017464A1 (PCT/SG2008/000281)
Authority
WO
WIPO (PCT)
Prior art keywords
relation
supervised
entity
semi
raw text
Prior art date
Application number
PCT/SG2008/000281
Other languages
French (fr)
Other versions
WO2009017464A9 (en)
Inventor
Stanely Wai Keong Yong
Jian Su
Xiao Feng Yang
Original Assignee
Agency For Science, Technology And Research
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency For Science, Technology And Research filed Critical Agency For Science, Technology And Research
Publication of WO2009017464A1 publication Critical patent/WO2009017464A1/en
Publication of WO2009017464A9 publication Critical patent/WO2009017464A9/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods


Abstract

A classification system comprising a first supervised classifier module configured to access an annotated corpus during a training mode, a second semi-supervised module configured to access a raw text and index at least one pseudo document extracted from the raw text according to the location of a skip bigram within the at least one pseudo document during the training mode, a third classifier module configured to receive the output of the first supervised classifier module and a plurality of skip bigram similarity features derived from the skip bigram from the second semi-supervised module during a validation mode, and to receive a raw text document for relation extraction during a normal operation mode. Also a method.

Description

Relation Extraction System

Field of the Invention
The invention relates to a system and method of relation extraction, particularly though not solely to a system and method of relation extraction drawing upon resources from the internet, for use in a document classifier.

Background of the Invention
The age of computers, and the internet, has meant a huge growth in the amount of digital documents and files that must be managed. This may take the form of organising private documents or search / analysis of documents on the internet.
One way of saving time for such tasks is to use automated methods. For example the algorithm used by the search engine Google provides an automated guess at documents available on the web relevant to the input query. Deciding what is relevant may require that the algorithm understand the meaning of words or the context in which they are used. Machine learning or automated classification is one way in which a computer can learn the meaning of words in documents and is therefore becoming of increasing importance.
One of the characteristics that can be exploited by machine learning is that the relationships between words can give an indication of the particular meaning of a given word. For example, in the sentence "... the car travelled over the bridge ...", the relationship between "car" and "bridge" means that this sentence is relevant to road bridges, but not dental bridges. That is to say the relationships between name entities that affect the particular meaning of each of the name entities are learnt. This is otherwise known as relation extraction.
There are a number of different categories of machine learning. For example, supervised learning uses a database of annotated documents to learn how to predict a class label for an input document. Unsupervised learning is where the documents used as the training set are unannotated or raw text. Semi-supervised learning may use a combination of annotated and unannotated documents.
Annotated documents can best be thought of as documents in which certain sequences of characters have been annotated with attribute labels from a predefined vocabulary. The attribute labels in an annotated document are usually prepared manually. Plain documents are often said to be "unannotated", even when they follow strict "structural" conventions in their original contexts.
Supervised or inductive machine learning can be done using a number of different techniques. Each technique performs differently in terms of speed of learning, speed of classifying and accuracy of classification. For example, Support Vector Machines (SVMs) are becoming popular due to their good speed and accuracy.
An SVM works on the principle that each data point (a feature vector representing parts of a sentence in a document in the training data) belongs to one of a number of classes. The goal of the SVM is to determine a test criterion for deciding which class a new data point belongs to. In general, if a data point is a p-dimensional vector, the SVM will determine a (p-1)-dimensional hyperplane which achieves maximum separation (margin) between the classes. A simple 2D example in Figure 1 shows a first class 100 of data points separated from a second class 102 of data points. Of the 1D hyperplanes that could be used to separate the first class 100 from the second class 102, the first hyperplane 106 clearly doesn't separate them at all, and the second hyperplane 108 clearly gives much higher separation than the third hyperplane 110. Therefore the second hyperplane 108 would be used as the test criterion to determine which class a new data point should belong to.
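As a concrete illustration of this maximum-margin criterion, the following is a minimal sketch using scikit-learn (a library choice assumed here for illustration; the patent does not prescribe one), with toy 2D points standing in for the sentence feature vectors:

```python
# Minimal sketch of the max-margin test criterion, assuming scikit-learn.
# The 2D points below are toy placeholders echoing Figure 1, not the
# patent's actual sentence feature vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.4], [0.1, 0.3],   # first class (100)
              [2.0, 2.2], [2.5, 2.0], [2.2, 2.6]])  # second class (102)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")  # fits the maximum-margin separating hyperplane
clf.fit(X, y)

# The learned hyperplane w.x + b = 0 is the test criterion for new points.
print(clf.coef_, clf.intercept_)
print(clf.predict([[0.3, 0.1], [2.1, 2.3]]))  # -> [0 1]
```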
An example of a method 200 that might be used for supervised training of an SVM is shown in Figure 2. At 202 a corpus of text documents with relation annotations is compiled. At 204 the documents are input to a Feature Extraction Engine (FEE) to tag the parts-of-speech (POS), divide each sentence into chunks, encode information about the relationships between chunks, and recognise named entities (NER). At 206 the feature vectors from the FEE are used to train the SVM, which determines a hyperplane to separate the various classes.
When training a classifier for relation extraction, sometimes there are problems, which may be split into two categories: data sparseness and domain dependence.
Extracting a real instance of a given type of relation from unstructured texts may not always be possible. Even with domain experts, gathering accurate statistics for a given type of relation may require very large datasets ("data sparseness"). Large datasets may be expensive to acquire and/or may require a longer training time.
Writers may change their vocabulary and tone to suit the topic or audience they wish to communicate with. Therefore the meaning of a word may depend on the domain in which it is used ("domain dependence"). A classifier may therefore not be very accurate unless it is trained across a range of domains, and annotated datasets for a range of domains may be expensive to acquire.
III. Summary of the Invention
In general terms the invention proposes a hybrid method of training a document classifier for relation extraction. The results of a supervised training approach using an annotated or structured text corpus may be combined with the results of a semi-supervised learning approach using an unannotated or raw text corpus. The supervised training approach may use a multi-class SVM learner. The semi-supervised learning approach may use hyponym expansion and/or thematic clustering. Combining the results may be done by training a final combination meta classifier using an estimate of the relation type of an entity pair instance in a validation document from both the semi-supervised approach and the supervised approach. The relation type estimate from the semi-supervised learning approach may include generating a validation pseudo document from the entity pair in the validation document and comparing the validation pseudo document to previously generated pseudo documents grouped by relation type. The previously generated pseudo documents grouped by relation type may be generated by hyponym expansion of an entity pair of a relation type and thematic clustering of extracts from the raw text corpus. If instances of any relation type are lacking in the training data, then a set of instances for that relation type can be gathered from a conceptual database, such as Wikipedia, which holds concepts, definitions and instances.
One advantage may be that using the hybrid method may reduce the problems of data sparseness and domain dependence and improve accuracy. A further advantage is that the weighting of the semi-supervised learning and the supervised learning can be optimised by the final combination meta classifier.
Disparate sources of information like Wikipedia and the Web may be integrated into an existing state-of-the-art relation extraction system.
An information retrieval system may be used to map relation definitions to concept nodes (such as Wikipedia documents), instead of retrieving documents from a database. Instead of simply stopping with concept nodes, relation instances may be extracted from Wikipedia by exploiting the graph structure of the online encyclopaedia. The relation instances gathered are used in a semi-supervised framework to boost performance when training data is sparse. Wikipedia's categories may be used as root nodes in exploring the link graph for exemplary pairs. The relation set may be mapped to category nodes directly via KL divergence.
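As a rough sketch of how such a KL-divergence mapping could look, assuming relation definitions and category-node texts are compared as smoothed unigram word distributions (the tokenisation and the candidate texts below are illustrative placeholders, not the patented procedure):

```python
# Hedged sketch: score candidate category nodes against a relation definition
# by KL divergence between smoothed unigram distributions. The whitespace
# tokenizer and the candidate texts are illustrative assumptions.
from collections import Counter
import math

def unigram_dist(text, vocab, alpha=1.0):
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}  # add-alpha smoothing

def kl(p, q):
    # KL(p || q); both distributions share the same smoothed vocabulary.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def best_category(definition, categories):
    vocab = set(definition.lower().split())
    for text in categories.values():
        vocab |= set(text.lower().split())
    p = unigram_dist(definition, vocab)
    # Smallest divergence = best matching category node.
    return min(categories, key=lambda c: kl(p, unigram_dist(categories[c], vocab)))

categories = {
    "Subsidiary": "a company that is owned or controlled by another parent company",
    "Employment": "a relation between an employee and the organisation employing them",
}
print(best_category("organisation owned by another organisation", categories))
```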
Web-based information sources may be combined with traditional lexico-semantic features from the raw text to produce demonstrably better results. This may be done by capturing contextual information from the web using a novel application of skip-bigram similarity. Skip-bigrams may be used as a means of condensing sentence excerpts into learnable features for statistical models.
In a first particular expression of the invention there is provided a classifier system according to claim 1.
In a second particular expression of the invention there is provided a method according to claim 7.
IV. Brief description of the Figures
One or more example embodiments of the invention will now be described, with reference to the following figures, in which:
Figure 1 is a graph illustrating SVM;
Figure 2 is a flow diagram of a supervised method of training a classifier;
Figure 3 is a flow diagram of a method of training a classifier according to an exemplary embodiment;
Figure 4 is a flow diagram of the semi-supervised method of training in Figure 3;
Figure 5 shows a system architecture diagram of the proposed invention; and
Figure 6 shows a diagram of part of the parser's output in the form of a tree representation.

Detailed description
Figure 3 shows a method of training a classifier for relation extraction 300 according to an exemplary embodiment. At 302 supervised training of a classifier is carried out using an annotated text corpus. At 304 semi-supervised training is carried out using raw text. At 306 validation documents are used to combine the results of the supervised training 302 and the semi-supervised training 304 and train a combination meta classifier. At 308 the combination meta classifier is tested.
With reference to Figure 3, the workflow may be divided into three main phases or modes: the learning or training phase 310, the validation phase 312, and the usage or testing phase 314. The annotated text corpus may be provided by the user and may be split into three parts: 80% is used for the learning phase 310, 10% for the validation phase 312 during construction of the combined model, and the remaining 10% for the testing phase 314. The operation during the test phase or mode is the same as normal operation to classify raw text.
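A minimal sketch of that 80/10/10 split, assuming the annotated corpus is simply a list of documents (the shuffling and the seed are illustrative choices, not specified by the patent):

```python
# Sketch of the 80/10/10 corpus split described above; `documents` is a
# placeholder list of annotated documents.
import random

def split_corpus(documents, seed=0):
    docs = documents[:]
    random.Random(seed).shuffle(docs)        # deterministic shuffle for illustration
    n = len(docs)
    train = docs[: int(0.8 * n)]             # learning phase 310
    validation = docs[int(0.8 * n): int(0.9 * n)]  # validation phase 312
    test = docs[int(0.9 * n):]               # testing phase 314
    return train, validation, test
```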
The user provides the annotated text corpus. The supervised training 302 may be implemented with a multiclass SVM learner on the annotated text corpus according to the method 200 of Figure 2, and/or as described in Zhou G., Su J., Zhang J. and Zhang M., Exploring Various Knowledge in Relation Extraction, Proceedings of the 43rd Annual Meeting of the ACL, pages 427-434 ("Zhou"), or Cortes C. and Vapnik V., 1995, Support-Vector Networks, Machine Learning 20(3):273-297. The user provides definitions of relation types. The semi-supervised learning 304 may comprise a method 400 according to Figure 4. In summary, the method 400 generates pseudo documents grouped according to relation type. Each pseudo document may be a compilation of sentences which include a given relation sub type.
An entity pair may be supplied from the annotated corpus at 402, or the user may decide to generate entity pairs for each of the relation types provided at 404. In the latter case, the most relevant concept node in a conceptual database is identified for each relation sub type at 406 when there is not enough training data for certain relation sub types. Example entity pairs for each relation type are then gathered from the indexed conceptual database corresponding to the most relevant concept node at 408.
Each entity pair provided at 410 is expanded into a set of entity pairs at 412. The set of entity pairs may be provided by determining hyponyms or synonyms of each of the entities. A set of sentences or excerpts is gathered from raw text or the web using web search at 414, which include any of the set of entity pairs. The set of excerpts is filtered at 416, using thematic clustering. For example the clustering may be principal components analysis (PCA), and/or K-medoids clustering. The filtered set of excerpts forms a single pseudo document at 418 which is grouped for that relation sub type.
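Putting steps 410-418 together, the loop below sketches how one pseudo document could be assembled per relation sub type; the helper functions named here (expand_pair, web_excerpts, thematic_filter) are hypothetical stand-ins for the expansion, search and clustering steps, not APIs defined by the patent:

```python
# Sketch of the method-400 loop. The three helpers are hypothetical
# stand-ins for hyponym/synonym expansion, web search, and clustering.
def build_pseudo_document(entity_pair, relation_subtype,
                          expand_pair, web_excerpts, thematic_filter):
    pairs = expand_pair(entity_pair)                    # step 412: hyponyms/synonyms
    excerpts = []
    for e1, e2 in pairs:
        excerpts.extend(web_excerpts(e1, e2))           # step 414: wildcard web search
    kept = thematic_filter(excerpts, relation_subtype)  # step 416: thematic clustering
    return {"relation": relation_subtype, "sentences": kept}  # step 418
```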
The method 300 may be implemented on the system 500 shown in Figure 5. A server or processor 502 is programmed to execute a number of function or software modules and store data. An annotated corpus database 504 is accessed by a Feature Extraction Engine (FEE) 506. A multi-class SVM learner 508 accesses the feature vectors from the FEE 506. A relation definition database 510 is accessed by a WikilRR module 512, and the WikilRR module 512 also receives the feature vectors from the FEE 506. A meta classifier SVMC 514 is trained using a validation document database 516, and the output of the WikilRR module 512 and the multi-class SVM 508. A user provides input and receives results via terminal 518.
The method 300 is now explained in more detail using an example. Firstly during the training phase 310 the user provides definitions (404 in Figure 4) for each of the relation types, such as given below:
[Table of example relation type definitions (figure imgf000005_0001); content not reproduced in the source text.]
The user will also supply the annotated corpus (504 in Figure 5), which in this example is the ACE 2004 corpus. An example sentence for the corpus is: As president of Sotheby's, she often conducted the biggest, the highest profile auctions.
The sentence enters the FEE (506 in Figure 5). The FEE 506 includes a part-of-speech (POS) tagger which applies a grammatical classification to each word according to its function in the sentence. For instance, one familiar catalogue of parts-of-speech tags is the set {verbs, nouns, pronouns, adjectives, adverbs, prepositions, conjunctions, punctuation, personal pronoun}, which we might abbreviate as {VB, NN, PN, JJ, IN, CC, PUNC, PRP}. For our example, we show the potential output using our tags:
[Table of the example sentence with POS tags (figure imgf000006_0001); content not reproduced in the source text.]
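For illustration, POS tagging along these lines can be reproduced with NLTK (an assumed tool; its Penn Treebank tag set is close to, but not identical to, the abbreviations above):

```python
# Illustrative POS tagging with NLTK (assumed available; may require
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')).
import nltk

sentence = ("As president of Sotheby's, she often conducted the biggest, "
            "the highest profile auctions.")
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('As', 'IN'), ('president', 'NN'), ('of', 'IN'), ("Sotheby's", 'NNP'), ...]
```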
The sentence is then processed by a chunker into chunks or short phrases that link continuous spans of words in a sentence which would otherwise not make sense on their own. For instance, the three words "give", "up" and "on" in isolation are not very informative, but the chunk "give up on" implies "surrendered". Proper noun phrases like "National University of Singapore" are chunks as well. Breaking up a sentence into chunks avoids the need to analyse the internal details of each chunk, and helps us find patterns more easily.
The chunks are then processed by a parser which determines a structure between the chunks. The parser may construct syntactic trees for sentences. These syntax trees may be hierarchical and may be thought of as encoding information about the relationships between chunks. Examples of syntactic parsers that could be used are Charniak E., A Maximum-Entropy-Inspired Parser, Proceedings of NAACL-2000, and Collins M., 1999, Head-Driven Statistical Models for Natural Language Parsing, PhD Dissertation, University of Pennsylvania.
A name entity recognizer may process the chunks and label them.
Figure 6 shows a visualization of the output 600 from the FEE 506. The bottommost row of boxes 602 contains the words and their POS tags separated by a slash. The rest of the boxes 604 above are parts of the syntax tree 606. The prepositional phrase "As president of Sotheby's," is broken down into the noun phrase "president of Sotheby's" and "As". The three name entities are president, Sotheby's, and she.
After the above preprocessing, the FEE will extract useful features for relation extraction, e.g. as used in Zhou. These features are then passed on to the WikilRR module (512 in Figure 5; 402 in Figure 4) and the multi-class SVM learner (508 in Figure 5).
The name entities in the sentence are passed to the WikilRR module 512 in the form:
That is, the relation type is EMP-ORG, this instance is sub-classified with the "employee-executive" subtype, and the instance pair is {president, Sotheby's}.
The WikilRR module 512 takes the generic term president and first attempts to look for hyponyms using WordNet (412 in Figure 4). The most direct hyponyms for president are corporate executive and business executive. Instead of the single pair of entities, three pairs of entities are then available in the same relation type for training, since the president of Sotheby's is also a corporate or business executive of Sotheby's.
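A sketch of this expansion using NLTK's WordNet interface (assumed available; sense selection is simplified to all noun synsets, whereas a real system would pick the sense matching the context):

```python
# Hedged sketch of hyponym/synonym expansion (412 in Figure 4) via WordNet.
from nltk.corpus import wordnet as wn

def expand_entity(term):
    variants = {term}
    for synset in wn.synsets(term, pos=wn.NOUN):
        # synonyms from the same synset
        variants.update(l.name().replace("_", " ") for l in synset.lemmas())
        for hyp in synset.hyponyms():  # more specific terms
            variants.update(l.name().replace("_", " ") for l in hyp.lemmas())
    return variants

pairs = [(v, "Sotheby's") for v in expand_entity("president")]
queries = ['"%s * %s"' % (e1, e2) for e1, e2 in pairs]  # wildcard queries (414)
```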
A Boolean query "president * Sotheby's" is issued to keyword-based search engines (414 in Figure 4). The search engines will return documents containing the terms in the query in the same order, with the wildcard "*" character standing in for any number of other words between them. Queries are issued for the two hyponym pairs as well. Edited examples of some results are:
■ ...president travels incognito to Sotheby's...
■ ...president of Rego Sotheby's...
■ ...President Controller at Sotheby's...
■ ...president of Sotheby's...
■ ...chief executive of Sotheby's...

Hyponym expansion is particularly useful when at least one of the original name entities is very specific; use of the original term alone may result in few or no results. The search engine results are normalized into plain text sentence fragments by a specialized Web parser. In the process, all extraneous HTML tags and non-character sequences are removed. The normalized text moves into the next sub-module. This expansion could also be extended to include, or alternatively use, synonym expansion, and co-reference resolution could be used to generate more instances, especially when processing a large raw corpus.
Thematic clustering (416 in Figure 4) acts as a filter, primarily to counteract the problem of noise in the Web extracts. Conceptually, even though the same name entities are present in the Web extracts, they might be associated in a way that is different from the actual relation type we wish to discover. In the example, the first entity is an executive in the second entity. Clearly, this relation type does not hold in the following excerpt:
The 40 year old former president travels incognito to Sotheby's.
Such excerpts should be removed as they are not the relation type we wish to train the classifier for. Starting from the premise that the relation type in a sentence is correlated with its thematic content, a cluster based algorithm is used to group sentences by theme.
Two clustering algorithms are used, one based on the idea of principal components analysis (PCA), and the other on distances in a vector space (K-medoids). The text is converted into a matrix format for both algorithms, and the frequency of words appearing in each sentence extract is recorded as a numerical value in a table. Thus each cell contains the number of times a certain word appears in the given sentence.
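A sketch of this step, assuming scikit-learn's CountVectorizer for the word-frequency matrix and a small hand-rolled K-medoids pass (the patent does not name a particular implementation, so both choices are illustrative):

```python
# Hedged sketch of the clustering filter: a term-frequency matrix as
# described above, then a basic K-medoids (PAM-style) pass in plain numpy.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def k_medoids(dist, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(dist[:, medoids], axis=1)   # assign to nearest medoid
        new = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # new medoid = member minimising total distance to its cluster
                new[c] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids

excerpts = ["president of Sotheby's conducted auctions",      # placeholder
            "former president travels incognito to Sotheby's"]
X = CountVectorizer().fit_transform(excerpts).toarray().astype(float)
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
labels, medoids = k_medoids(dist, k=min(5, len(excerpts)))    # 5 clusters as below
```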
In preliminary experiments, K-medoids based clustering with 5 clusters produced the most consistent results. From a list of 100 excerpts, the 4 irrelevant clusters are culled and the four excerpts which best capture the executive-of relationship are selected, e.g.:
Charting the tension between auction houses and galleries Le Monde's Roxana Azimi takes time out to interview William Ruprecht, president of Sotheby's
The earnings are the highest since YEAR said Sotheby's president Diana Sotheby's sold the two most expensive paintings of the spring auction season
As president of Sotheby's , she often conducted the biggest , the highest profile auctions , like the Kennedy family treasures , herself .
Pollock reports that Sotheby's CEO William Ruprecht has sold stock worth about while the president of Sotheby's Financial Services
The best excerpts are stored, together with the meta-data derived from their provenance information, into what we call pseudo-documents. We group the pseudo-documents in our database by relation type. The grouped pseudo-documents are indexed using an inverted hash index including the relation type and the skip-bigram position in the document.
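One plausible layout for that inverted index is sketched below; the concrete data structure is an assumption, since the patent only specifies that the relation type and skip-bigram position are recorded:

```python
# Sketch of the inverted hash index: each skip-bigram maps to the relation
# type, pseudo-document id, and sentence position where it occurs.
from collections import defaultdict
from itertools import combinations

def skip_bigrams(sentence):
    # all in-order word pairs from one sentence, joined with '_'
    words = sentence.lower().split()
    return {"%s_%s" % (a, b) for a, b in combinations(words, 2)}

def build_index(pseudo_documents):
    index = defaultdict(list)
    for doc_id, doc in enumerate(pseudo_documents):
        for pos, sentence in enumerate(doc["sentences"]):
            for sb in skip_bigrams(sentence):
                index[sb].append((doc["relation"], doc_id, pos))
    return index
```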
The WikilRR module 512 could be used to incorporate sentential context covering the two entities from a large raw text collection as well.
In Figure 4, if the user decided to generate entity pairs, the WikilRR module 512 may resort to a snapshot of Wikipedia stored locally (an example of a concept node or conceptual database) to gather named entity pairs.
Wikipedia is a large encyclopaedia written by volunteers who collaborate with online editing tools. Consequently, pages in Wikipedia can be very heterogeneous. They range from dictionary definitions for concepts, Who's Who biographies, to the gossip pages of a tabloid. Wikipedia has a few characteristics that make it a useful resource for relation extraction.
Many proper names and obscure references are cross-linked to the corresponding Wikipedia pages. These cross-linkages form a graph structure that exposes semantic information. For instance, the topic page "Subsidiary" is linked to pages that contain references to the concept. These pages are predominantly articles about actual corporations. By exploiting the clues in Wikipedia's syntax, we can gather the names of companies and their subsidiaries.
A certain amount of preprocessing is done prior to the training phase 410. First of all, a snapshot of Wikipedia must be stored locally. Generic information retrieval (IR) tools like Lemur or Lucene are used to index the encyclopaedia.
The WikilRR module 512 begins with the definition of relation types, which the user provides, and tries to find matching topic pages in the Wikipedia. The actual steps taken to match definitions to topics are largely dependent on the type of IR tool chosen.
From the collection of topic pages, we gather the name entity instances for the relation type subsidiary as shown below:
"' '<- ' " ' Relation »- *" IDi \ Entity! * s - .5 " 5" Entty 2= * ~>~ -,i " t~ subsidiary Uβtstar Always QANTAS subsidiary * jjststar Asia Airways QANTAS subsidiary lACCBaπk Rabobank subsidiary lResona Holdings Resoπa Bank subsidiary lKinta Osaka Bank Resoπa Holdings
As shown in Figure 4, each entity pair from Wikipedia is then used to generate a set of entity pairs at 412.
The validation phase 312 involves the 10% of the annotated corpus we left aside earlier being sent through the FEE 506 with the relation type annotations removed. The relation type of each entity pair in the validation document is estimated using the multi class SVM learner at 310 and using the skip bigram similarity features estimated at 312.
Pseudo validation documents are generated (402/404-418 in Figure 4) using the WikilRR module 512 from entity pairs in each of the validation documents. Each pseudo validation document is compared to the pseudo document sets derived in the training mode to derive skip bigram similarity features (312 in Figure 3). A bigram is a pair of words, and skip-bigrams are just bigrams in the same sentence that allow for any number of "skips" between the two words. This is very similar to the idea of the wildcard Boolean query used for the search engines. Indeed, skip-bigrams can be represented in the same way as Boolean queries, except using an underscore ('_') character instead of the wildcard to distinguish them. Thus the rows of the pseudo document labelled EMP-ORG_employee_executive each contain one instance of the skip-bigram "president_Sotheby's".
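A minimal sketch of skip-bigram extraction and a simple overlap-count similarity between two sentences; the overlap count is one plausible scoring choice, as the patent does not pin down the exact formula:

```python
# Hedged sketch of skip-bigram similarity: score = size of the overlap of
# the two sentences' skip-bigram sets.
from itertools import combinations

def skip_bigrams(sentence):
    words = sentence.lower().split()
    return {"%s_%s" % (a, b) for a, b in combinations(words, 2)}

def sb_similarity(s1, s2):
    return len(skip_bigrams(s1) & skip_bigrams(s2))

print(sb_similarity("As president of Sotheby's she conducted auctions",
                    "the president of Sotheby's Financial Services"))
# counts shared pairs such as "president_sotheby's"
```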
However, due to the way pseudo-documents are constructed, the skip-bigram measure is a feasible metric for assessing the similarity of two pseudo-documents and, by extension, for assigning relation labels to unlabeled documents. Each pseudo-document is really an extractive summary of online articles about the same theme. The degree of skip-bigram overlap between two pseudo-documents may be a good measure of their thematic overlap.
Accordingly, the following is done for each validation instance, V.
For each validation instance / entity pair V, a validation pseudo document is generated by the WikilRR module 512. The skip-bigram similarity of the validation pseudo document is computed against every single sentence in the database of pseudo-documents from our training set, retrieved using the inverted hash index. A skip-bigram similarity score is thus obtained for every single sentence in the database with respect to V. The scores are collated according to the relation classes.
For each relation type three numbers are generated: TopS, Matching docs, and Total. TopS is the score of the sentence with the highest skip-bigram similarity. The number of documents in which the number of matching skip-bigrams is non-zero is recorded as Matching docs. Finally, the total number of matching skip-bigrams is summed for the entire class and recorded as Total.
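A sketch of collating the per-sentence scores into that per-class triplet; the input layout (each relation class mapped to a list of (document id, skip-bigram match count) pairs) is an assumption for illustration:

```python
# Sketch of building the (TopS, Matching docs, Total) features per relation
# class; `scores` maps relation class -> [(doc_id, match_count), ...].
def collate(scores):
    features = {}
    for relation, hits in scores.items():
        counts = [c for _, c in hits]
        features[relation] = {
            "TopS": max(counts, default=0),  # best single-sentence score
            "MatchingDocs": len({d for d, c in hits if c > 0}),
            "Total": sum(counts),            # all matches for the class
        }
    return features
```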
The six values for an example with two relational classes are shown below:
[Table of the six values, TopS, Matching docs and Total for each of the two relation classes (figure imgf000009_0001); content not reproduced in the source text.]
For each extracted entity pair in V, the multiclass SVM learner 508 will estimate the relation type V_ST (310 in Figure 3). V_ST and the skip-bigram triplets are provided to train the final SVMC meta classifier 514.
The testing phase 314 involves the last 10% of the annotated corpus.
A series of experiments were conducted using five-fold cross-validation with a meta classifier according to the exemplary embodiment. In cross-validation, non-overlapping partitions of the entire dataset are used. Thus for five-fold cross-validation, there are five approximately equal subsets of the data. Four of those parts are used for training, and the resulting system is tested on the fifth. The testing set is rotated five times until all partitions have been tested.
Due to randomness, some subset of the data may be easier to work with than others. Cross-validation is useful for deriving estimates of the amount of variance that might occur in a system's reported performance in actual use.
Four measures of effectiveness for the 6 main relation types were computed. Simple accuracy is the ratio of correct labels to the number of tested pairs. Recall and precision are computed for each relation type: recall is the ratio of pairs correctly assigned to the relation against the correct number, and precision is the ratio of pairs correctly assigned to the relation against the number predicted in total. Thus recall indicates how well the system does at finding relations, while precision indicates whether it is overeager to do so. Last of all, the harmonic mean of recall and precision is reported as the F-score.
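The F-scores in the tables below follow from the harmonic mean F = 2PR / (P + R); for instance, a quick check against one of the meta-classifier rows:

```python
# Quick check of the harmonic-mean F-score against a row of the
# meta-classifier table below (recall 0.78, precision 0.75 -> F 0.76).
def f_score(recall, precision):
    return 2 * recall * precision / (recall + precision)

print(round(f_score(0.78, 0.75), 2))  # -> 0.76
```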
For simple accuracy, the meta classifier gave a range of between 0.6 and 0.8. The accuracy of the meta classifier is compared to the baseline system by Zhou.
The meta classifier performs consistently better than Zhou's baseline, which shows that the WikilRR enrichment module does accumulate useful contextual information beyond the features that conventional systems use.
Moreover, drilling down to the level of individual relation classes, the meta-classifier performs better than the baseline on all but one of the relations, as shown below; this might be due to the inherent ambiguity of the OTHER-AFF class. The meta-classifier system has slightly lower precision on the two largest relation classes, EMP-ORG and PHYS, but higher recall, resulting in better F-scores on both types. On the three intermediate sized classes, ART, PER-SOC, and GPE-AFF, recall and precision were both higher. This suggests that the meta-classifier system does improve recall significantly, but affects precision where there is already a substantial amount of training data.
Recall  Precision  F-score
0.48    0.73       0.59
0.78    0.75       0.76
0.49    0.59       0.53
0.14    0.18       0.16
0.63    0.80       0.70
0.75    0.59       0.66
a. Meta-classifier

Recall  Precision  F-score
0.29    0.43       0.34
0.67    0.83       0.73
0.36    0.56       0.45
0.18    0.59       0.28
0.32    0.50       0.39
0.40    0.64       0.49
b. Baseline
Whilst exemplary embodiments of the invention have been described in detail, many variations are possible within the scope of the invention as will be clear to a skilled reader. For example while SVM classifiers are described, other types of classifiers may be applicable.

Claims

1. A classification system comprising a first supervised classifier module configured to access an annotated corpus during a training mode, a second semi-supervised module configured to access raw text and index at least one pseudo document extracted from the raw text according to the location of a skip bigram within the at least one pseudo document during the training mode, a third classifier module configured to receive the output of the first supervised classifier module and a plurality of skip bigram similarity features derived from the skip bigram from the second semi-supervised module during a validation mode, and to receive a raw text document for relation extraction during a normal operation mode.
2. The classification system of claim 1 wherein the second semi-supervised module is further configured to, during the training mode: hyponym and/or synonym expand at least one entity pair, and/or thematically cluster a plurality of excerpts containing at least one entity pair from the raw text corpus.
3. The classification system of claim 2 wherein hyponym and/or synonym expansion comprises receiving an entity pair from an annotated corpus, and generating a plurality of entity pairs using hyponyms and/or synonyms for at least one of the entities.
4. The classification system of claims 2 or 3 wherein thematic clustering comprises extracting the plurality of excerpts from the raw text, and clustering the plurality of excerpts using principal components analysis (PCA), and/or K- medoids clustering, and selecting one or more of the clusters relevant to a selected relation type.
5. The classification system of claim 4 wherein the second semi-supervised module is further configured to receive a relation definition for the at least one relation type, select a concept node relevant to the at least one relation type and/or a relation type definition from an indexed conceptual database, and provide a set of entity pairs relating to the selected concept node.
6. The classification system of claim 5 wherein hyponym expansion comprises generating a plurality of entity pairs using hyponyms and/or synonyms for at least one of the entities for each of the set of entity pairs.
7. A method comprising getting a relation tag for an entity pair from a validation document using a supervised trained classifier, getting a set of features for the same entity pair using a semi-supervised learning approach from raw text, training a meta classifier for relation extraction by providing the relation tag and the set of features to the meta classifier, and extracting relations from raw text using the trained meta classifier.
8. The method of claim 7 wherein the semi-supervised learning approach comprises at least one of: hyponym and/or synonym expansion of at least one entity pair, thematic clustering of a plurality of excerpts containing a selected entity pair from the raw text, and inverted hash indexing of at least one pseudo document according to the location of a skip bigram.
9. The method of claim 8 wherein hyponym and/or synonym expansion comprises receiving an entity pair from an annotated corpus, and generating a plurality of entity pairs using hyponyms and/or synonyms for at least one of the entities.
10. The method of claims 7 or 8 wherein thematic clustering comprises extracting the plurality of excerpts from the raw text, and clustering the plurality of excerpts using principal components analysis (PCA), and/or K- medoids clustering, and selecting one or more of the clusters relevant to a selected relation type.
11. The method of claim 8 wherein the semi-supervised learning approach further comprises receiving a relation definition for the at least one relation type, selecting a concept node relevant to the at least one relation type and/or a relation type definition from an indexed conceptual database, and providing a set of entity pairs relating to the selected concept node.
12. The method of claim 11 wherein hyponym expansion comprises generating a plurality of entity pairs using hyponyms and/or synonyms for at least one of the entities for each of the set of entity pairs.
13. The method of any one of claims 7 to 12 wherein getting the set of features comprises: generating a pseudo validation document, using the semi-supervised learning approach on the unannotated entity pair, wherein the second set of features are skip bigram similarity features between relation type groups of pseudo documents previously generated from the raw text using entity pairs from the annotated corpus and the pseudo validation document.
PCT/SG2008/000281 2007-07-31 2008-07-31 Relation extraction system WO2009017464A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US93520907P 2007-07-31 2007-07-31
US60/935,209 2007-07-31

Publications (2)

Publication Number Publication Date
WO2009017464A1 true WO2009017464A1 (en) 2009-02-05
WO2009017464A9 WO2009017464A9 (en) 2009-03-19

Family

ID=40304569

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2008/000281 WO2009017464A1 (en) 2007-07-31 2008-07-31 Relation extraction system

Country Status (1)

Country Link
WO (1) WO2009017464A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011134110A1 (en) * 2010-04-30 2011-11-03 Thomson Licensing Method and apparatus for measuring video quality using at least one semi -supervised learning regressor for mean observer score prediction
US20140177948A1 (en) * 2012-12-21 2014-06-26 Hewlett-Packard Development Company, L.P. Generating Training Documents
CN104965822A (en) * 2015-07-29 2015-10-07 中南大学 Emotion analysis method for Chinese texts based on computer information processing technology
CN106354710A (en) * 2016-08-18 2017-01-25 清华大学 Neural network relation extracting method
CN107976992A (en) * 2017-11-29 2018-05-01 东北大学 Industrial process big data fault monitoring method based on figure semisupervised support vector machines
US9984064B2 (en) 2015-11-11 2018-05-29 International Business Machines Corporation Reduction of memory usage in feature generation
CN110674642A (en) * 2019-08-29 2020-01-10 中国人民解放军国防科技大学 Semantic relation extraction method for noisy sparse text
CN111914555A (en) * 2019-05-09 2020-11-10 中国人民大学 Automatic relation extraction system based on Transformer structure
CN111913563A (en) * 2019-05-07 2020-11-10 广东小天才科技有限公司 Man-machine interaction method and device based on semi-supervised learning
CN112417220A (en) * 2020-11-20 2021-02-26 国家电网有限公司大数据中心 Heterogeneous data integration method
CN114861600A (en) * 2022-07-07 2022-08-05 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
US11972214B2 (en) 2022-07-07 2024-04-30 Zhejiang Lab Method and apparatus of NER-oriented chinese clinical text data augmentation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070282872A1 (en) * 2006-06-05 2007-12-06 Accenture Extraction of attributes and values from natural language documents

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070282872A1 (en) * 2006-06-05 2007-12-06 Accenture Extraction of attributes and values from natural language documents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
VOLK: "Combining Unsupervised and Supervised Methods for PP attachment Disambiguation", INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS, 2002, Retrieved from the Internet <URL:http://www.wotan.liu.edu/docis/dbl/coling2002_CUASMF.htm> *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8824783B2 (en) 2010-04-30 2014-09-02 Thomson Licensing Method and apparatus for measuring video quality using at least one semi-supervised learning regressor for mean observer score prediction
WO2011134110A1 (en) * 2010-04-30 2011-11-03 Thomson Licensing Method and apparatus for measuring video quality using at least one semi -supervised learning regressor for mean observer score prediction
US20140177948A1 (en) * 2012-12-21 2014-06-26 Hewlett-Packard Development Company, L.P. Generating Training Documents
US9002102B2 (en) * 2012-12-21 2015-04-07 Hewlett-Packard Development Company, L.P. Generating training documents
CN104965822A (en) * 2015-07-29 2015-10-07 中南大学 Emotion analysis method for Chinese texts based on computer information processing technology
US9984064B2 (en) 2015-11-11 2018-05-29 International Business Machines Corporation Reduction of memory usage in feature generation
CN106354710A (en) * 2016-08-18 2017-01-25 清华大学 Neural network relation extracting method
CN107976992A (en) * 2017-11-29 2018-05-01 东北大学 Industrial process big data fault monitoring method based on figure semisupervised support vector machines
CN107976992B (en) * 2017-11-29 2020-01-21 东北大学 Industrial process big data fault monitoring method based on graph semi-supervised support vector machine
CN111913563A (en) * 2019-05-07 2020-11-10 广东小天才科技有限公司 Man-machine interaction method and device based on semi-supervised learning
CN111914555B (en) * 2019-05-09 2022-08-23 中国人民大学 Automatic relation extraction system based on Transformer structure
CN111914555A (en) * 2019-05-09 2020-11-10 中国人民大学 Automatic relation extraction system based on Transformer structure
CN110674642A (en) * 2019-08-29 2020-01-10 中国人民解放军国防科技大学 Semantic relation extraction method for noisy sparse text
CN110674642B (en) * 2019-08-29 2023-04-18 中国人民解放军国防科技大学 Semantic relation extraction method for noisy sparse text
CN112417220A (en) * 2020-11-20 2021-02-26 国家电网有限公司大数据中心 Heterogeneous data integration method
CN114861600A (en) * 2022-07-07 2022-08-05 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
CN114861600B (en) * 2022-07-07 2022-12-13 之江实验室 NER-oriented Chinese clinical text data enhancement method and device
US11972214B2 (en) 2022-07-07 2024-04-30 Zhejiang Lab Method and apparatus of NER-oriented chinese clinical text data augmentation

Also Published As

Publication number Publication date
WO2009017464A9 (en) 2009-03-19

Similar Documents

Publication Publication Date Title
CN110892399B (en) System and method for automatically generating summary of subject matter
WO2009017464A1 (en) Relation extraction system
US8321201B1 (en) Identifying a synonym with N-gram agreement for a query phrase
US8661012B1 (en) Ensuring that a synonym for a query phrase does not drop information present in the query phrase
US8392441B1 (en) Synonym generation using online decompounding and transitivity
CN107992633A (en) Electronic document automatic classification method and system based on keyword feature
Martinez-Romo et al. Web spam identification through language model analysis
WO2005020091A1 (en) System and method for processing text utilizing a suite of disambiguation techniques
Rothfels et al. Unsupervised sentiment classification of English movie reviews using automatic selection of positive and negative sentiment items
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
US20150006563A1 (en) Transitive Synonym Creation
Hussein Arabic document similarity analysis using n-grams and singular value decomposition
Yeasmin et al. Study of abstractive text summarization techniques
Al-Lahham Index term selection heuristics for Arabic text retrieval
Ermakova et al. IRIT at INEX: question answering task
Ramachandran et al. Document Clustering Using Keyword Extraction
Altaf et al. Efficient natural language classification algorithm for detecting duplicate unsupervised features
Li et al. Question classification by ensemble learning
Ling et al. Mining generalized query patterns from web logs
Rosner et al. Multisum: query-based multi-document summarization
Chacko A comprehensive review on question answering systems
Khennouf et al. Performances Evaluation of Automatic Authorship Attribution on Ancient Arabic Documents
Wang Novel Approaches to Pre-processing Documentbase in Text Classification
Sun et al. Discovering Patterns of Definitions and Methods from Scientific Documents
Tokunaga et al. Paraphrasing Japanese noun phrases using character-based indexing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08779504

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08779504

Country of ref document: EP

Kind code of ref document: A1